What is Kafka?
Apache Kafka is an open-source, distributed, fault-tolerant event streaming platform used to collect, store, and process real-time data streams at scale. It supports use cases such as distributed logging, stream processing, and publish-subscribe messaging. Kafka is very widely adopted: according to Kafka's official website, 10 out of 10 of the largest manufacturing companies, 7 out of 10 of the largest banks, 10 out of 10 of the largest insurance companies, 8 out of 10 of the largest telecom companies, and many more use Kafka as their event streaming system.
What is an Event?
An event is something that has happened: an Internet of Things (IoT) sensor reading, a change in the status of a business process, a user interaction, or the output of a microservice. In technical terms, we can define an event as:
Event = Notification + State.
What is Event Streaming?
Event streaming is the process of capturing data in real time, in the form of streams of events, from event sources such as databases, sensors, mobile devices, cloud services, and software applications. These events can be stored for later retrieval, manipulated, processed, and reacted to in real time, and routed to different destinations. In other words, event streaming ensures that the right data is at the right place at the right time.
Apache Kafka is an Event Streaming platform
- Publish (write) and subscribe to (read) streams of events
- Store those events durably for later use
- Process those events as they occur or retrospectively
Differences from Traditional Messaging Systems
| Traditional Messaging System | Apache Kafka |
| --- | --- |
| Limited scalability | Horizontally scalable |
| Transient, in-memory persistence | Messages also stored durably in replicated logs |
| Lower throughput | Higher throughput |
How does Apache Kafka work?
Apache Kafka uses a server-client architecture: it consists of servers and clients that communicate over a high-performance TCP network protocol.
Different Terminologies
Apache Kafka Topic
Apache Kafka is a messaging system in which producers send messages and consumers consume them. Producers send messages to Kafka topics, and consumers read messages from those topics. Each topic has a unique name, which producers and consumers use when sending and consuming messages. Topics are the abstraction level where data lives within Kafka: each topic is backed by logs that are partitioned and distributed. In that sense, a topic is similar to a table in a database.
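As an illustration, here is a minimal Java producer sketch that publishes one message to a topic. The broker address localhost:9092 and the topic name demo-topic are assumptions for a local test setup, not values from this article:

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class SimpleProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        // Broker address and serializers; localhost:9092 is an assumed local broker
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Send a key/value message to the (hypothetical) topic "demo-topic"
            producer.send(new ProducerRecord<>("demo-topic", "order-1", "Order created"));
        } // close() flushes any buffered messages before exiting
    }
}
```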
Apache Kafka Broker
The physical or virtual machines (servers) where topics reside are called brokers. A broker runs on a machine and has access to the file systems where the logs are maintained. A single topic can have multiple partitions running on different brokers, and we can add more brokers to scale the application.
Advantages of Adding More Kafka Brokers
We can add more Kafka brokers to scale up the application, which brings the following advantages:
- Clustering – all the Kafka brokers work together as a single unit
- Distributed – data is distributed among all the brokers
- Fault Tolerant – Kafka maintains replicated copies of data, so if any broker goes down, it does not affect the working of the cluster, since data is distributed and replicated. This is achieved by setting the replication factor to a value greater than 1 (see the sketch after this list)
- Application Scaling – increase throughput by scaling horizontally
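As a sketch of the replication-factor setting mentioned above, the following Java snippet creates a topic with 3 partitions and a replication factor of 3 using Kafka's AdminClient. The topic name demo-topic and the broker address are the same hypothetical values as in the producer example:

```java
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;

public class CreateReplicatedTopic {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

        try (AdminClient admin = AdminClient.create(props)) {
            // 3 partitions spread across brokers; each partition is kept on
            // 3 brokers (replication factor 3), so one broker failing is tolerated
            NewTopic topic = new NewTopic("demo-topic", 3, (short) 3);
            admin.createTopics(List.of(topic)).all().get(); // wait for completion
        }
    }
}
```

Note that a replication factor of 3 requires at least 3 brokers in the cluster.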
Apache ZooKeeper
Apache ZooKeeper is a management tool that manages the Kafka cluster. We can think of it as a central repository where applications publish data and consume data from. The main function of Apache ZooKeeper is to keep the system working as a single unit by handling synchronization, serialization, coordination, and leader election.
Kafka Message Structure
In Kafka, data resides in topics. A topic is similar to a database table, but with one difference: a database stores data permanently, whereas Kafka retains messages in its distributed logs only for a limited, configurable period. The default retention period is 7 days, which is 168 hours.
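The retention period can be changed per topic through the retention.ms configuration. Below is a minimal sketch using Kafka's AdminClient, again assuming the hypothetical demo-topic from the earlier examples; it raises retention from the 7-day default to 14 days:

```java
import java.util.List;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.AlterConfigOp;
import org.apache.kafka.clients.admin.ConfigEntry;
import org.apache.kafka.common.config.ConfigResource;

public class SetRetention {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

        try (AdminClient admin = AdminClient.create(props)) {
            ConfigResource topic = new ConfigResource(ConfigResource.Type.TOPIC, "demo-topic");
            // 14 days in milliseconds, overriding the 7-day (168-hour) default
            AlterConfigOp op = new AlterConfigOp(
                    new ConfigEntry("retention.ms", String.valueOf(14L * 24 * 60 * 60 * 1000)),
                    AlterConfigOp.OpType.SET);
            admin.incrementalAlterConfigs(Map.of(topic, List.of(op))).all().get();
        }
    }
}
```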
Each message has 3 parts:
- A timestamp
- A unique message identifier
- The message payload, in binary
The position of the last read message is called the offset. Consumers keep track of the messages they have successfully processed. Note that a Kafka topic must have at least 1 partition; adding more partitions is what enables higher throughput and scales fault tolerance.
The picture above shows producers publishing immutable messages to a Kafka topic in an ordered manner, with the oldest message at position 0. In total, the diagram contains 8 messages. Consumer 1 stops reading after the 3rd message, so its current offset is 3. Consumer 2 stops reading after the 5th message, so its current offset is 5. Consumer 3 reads the last message in the topic, so its current offset is 8, the highest offset. All of this offset tracking is managed by the consumers.
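To make offsets concrete, here is a minimal Java consumer sketch that polls the hypothetical demo-topic, prints each record's offset together with its message parts (timestamp, key, payload), and then commits its position. The group id demo-group is an assumption:

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class SimpleConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "demo-group");        // consumers in a group share partitions
        props.put("auto.offset.reset", "earliest"); // start from offset 0 if no committed offset
        props.put("key.deserializer", StringDeserializer.class.getName());
        props.put("value.deserializer", StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("demo-topic"));
            ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(5));
            for (ConsumerRecord<String, String> record : records) {
                // Each record carries its position (offset) plus the message parts
                System.out.printf("offset=%d timestamp=%d key=%s value=%s%n",
                        record.offset(), record.timestamp(), record.key(), record.value());
            }
            consumer.commitSync(); // record the last processed position for this group
        }
    }
}
```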
Advantages of using Apache Kafka
There are 4 main advantages of using Apache Kafka:
- High Throughput – Kafka processes messages at very high speed, which can exceed 100k messages per second, and data is processed in a partitioned and ordered manner.
- Durability – messages are stored in Kafka, by default for 168 hours (7 days), and this period is configurable. This makes re-processing possible if required.
- Scalability – Kafka can be scaled at every level: multiple producers can publish to the same topic, topics can be partitioned, and consumers can be grouped so that each consumes individual partitions.
- Fault Tolerance – Kafka has a distributed architecture with several nodes running together to serve the cluster, and topics are replicated. The failure of a node does not impact the system, because ZooKeeper provides accurate cluster information to both producers and consumers. Each partition has its own leader, and failure of the leader triggers the election of a new leader.