What is Kafka?
Apache Kafka is an open-source, distributed, fault-tolerant event streaming platform used to collect, store, and process real-time data streams at scale. It supports use cases such as distributed logging, stream processing, and publish-subscribe messaging. Kafka is very widely adopted: according to Kafka's official website, 10 out of 10 of the largest manufacturing companies, 7 out of 10 of the largest banks, 10 out of 10 of the largest insurance companies, 8 out of 10 of the largest telecom companies, and many more use Kafka as their event streaming system.
What is an Event?
An event is something that has happened: an Internet of Things (IoT) sensor reading, a change in the status of a business process, a user interaction, or the output of a microservice. In technical terms, we can define an event as:
Event = Notification + State.
What is Event Streaming?
Event streaming is the process of capturing data in real time, in the form of streams of events, from event sources such as databases, sensors, mobile devices, cloud services, and software applications. These events can be stored for later retrieval, manipulated, processed, and reacted to in real time, and routed to different destinations. In other words, event streaming ensures that the right data is at the right place at the right time.
Apache Kafka is an Event Streaming platform
- Publish (write) and subscribe to (read) streams of events
- Store those events durably for later use
- Process those events as they occur or retrospectively
Differences from Traditional Messaging Systems
| Traditional Messaging System | Apache Kafka |
| --- | --- |
| Limited scalability | Horizontally scalable |
| Transient, in-memory persistence | Messages also stored durably in replicated logs |
| Lower throughput | Higher throughput |
How does Apache Kafka work?
Apache Kafka uses a server-client architecture: it consists of servers and clients that communicate over a high-performance TCP network protocol.
Different Terminologies
Apache Kafka Topic
Apache Kafka is a messaging system in which producers send messages and consumers consume them. Producers send messages to Kafka topics, and consumers read messages from those topics. Each topic has a unique name, which producers and consumers use when sending and consuming messages. Topics are the abstraction level where data lives within Kafka: each topic is backed by logs that are partitioned and distributed. In that sense, a topic is similar to a table in a database.
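As an illustration, here is a minimal Java producer sketch that publishes one message to a topic. The broker address localhost:9092 and the topic name demo-topic are assumptions for a local test setup, not values from this article:

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class SimpleProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        // Broker address and serializers; localhost:9092 is an assumed local broker
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Send a key/value message to the (hypothetical) topic "demo-topic"
            producer.send(new ProducerRecord<>("demo-topic", "order-1", "Order created"));
        } // close() flushes any buffered messages before exiting
    }
}
```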
Apache Kafka Broker
The physical or virtual machines (servers) where topics reside are called brokers. A broker runs on a machine and has access to the file systems where the logs are maintained. A single topic can have multiple partitions running on different brokers, and we can add more brokers to scale the application.
Advantages of Adding More Kafka Brokers
We can add more Kafka brokers to scale up the application, which brings the following advantages:
- Clustering – all the Kafka brokers work together as a single unit
- Distributed – data is distributed among all the brokers
- Fault Tolerant – Kafka maintains replicated copies of data, so if any broker goes down, it does not affect the working of the cluster, since data is distributed and replicated. This is achieved by setting the replication factor to a value greater than 1 (see the sketch after this list)
- Application Scaling – increase throughput by scaling horizontally
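As a sketch of the replication-factor setting mentioned above, the following Java snippet creates a topic with 3 partitions and a replication factor of 3 using Kafka's AdminClient. The topic name demo-topic and the broker address are the same hypothetical values as in the producer example:

```java
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;

public class CreateReplicatedTopic {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

        try (AdminClient admin = AdminClient.create(props)) {
            // 3 partitions spread across brokers; each partition is kept on
            // 3 brokers (replication factor 3), so one broker failing is tolerated
            NewTopic topic = new NewTopic("demo-topic", 3, (short) 3);
            admin.createTopics(List.of(topic)).all().get(); // wait for completion
        }
    }
}
```

Note that a replication factor of 3 requires at least 3 brokers in the cluster.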
Apache ZooKeeper
Apache ZooKeeper is a management tool that manages the Kafka cluster. We can think of it as a central repository where applications publish data and consume data from. The main function of Apache ZooKeeper is to keep the system working as a single unit by handling synchronization, serialization, coordination, and leader election.
Kafka Message Structure
In Kafka, data resides in topics. A topic is similar to a database table, but with one difference: a database stores data permanently, whereas Kafka retains messages in its distributed logs only for a limited, configurable period. The default retention period is 7 days, which is 168 hours.
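The retention period can be changed per topic through the retention.ms configuration. Below is a minimal sketch using Kafka's AdminClient, again assuming the hypothetical demo-topic from the earlier examples; it raises retention from the 7-day default to 14 days:

```java
import java.util.List;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.AlterConfigOp;
import org.apache.kafka.clients.admin.ConfigEntry;
import org.apache.kafka.common.config.ConfigResource;

public class SetRetention {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

        try (AdminClient admin = AdminClient.create(props)) {
            ConfigResource topic = new ConfigResource(ConfigResource.Type.TOPIC, "demo-topic");
            // 14 days in milliseconds, overriding the 7-day (168-hour) default
            AlterConfigOp op = new AlterConfigOp(
                    new ConfigEntry("retention.ms", String.valueOf(14L * 24 * 60 * 60 * 1000)),
                    AlterConfigOp.OpType.SET);
            admin.incrementalAlterConfigs(Map.of(topic, List.of(op))).all().get();
        }
    }
}
```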
Each message has 3 parts:
- A timestamp
- A unique message identifier
- The message payload, in binary
The position of the last read message is called the offset. Consumers keep track of the messages they have successfully processed. Note that a Kafka topic must have at least 1 partition; adding more partitions is what enables higher throughput and scales fault tolerance.
The picture above shows producers publishing immutable messages to a Kafka topic in an ordered manner, with the oldest message at position 0. In total, the diagram contains 8 messages. Consumer 1 stops reading after the 3rd message, so its current offset is 3. Consumer 2 stops reading after the 5th message, so its current offset is 5. Consumer 3 reads the last message in the topic, so its current offset is 8, the highest offset. All of this offset tracking is managed by the consumers.
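To make offsets concrete, here is a minimal Java consumer sketch that polls the hypothetical demo-topic, prints each record's offset together with its message parts (timestamp, key, payload), and then commits its position. The group id demo-group is an assumption:

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class SimpleConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "demo-group");        // consumers in a group share partitions
        props.put("auto.offset.reset", "earliest"); // start from offset 0 if no committed offset
        props.put("key.deserializer", StringDeserializer.class.getName());
        props.put("value.deserializer", StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("demo-topic"));
            ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(5));
            for (ConsumerRecord<String, String> record : records) {
                // Each record carries its position (offset) plus the message parts
                System.out.printf("offset=%d timestamp=%d key=%s value=%s%n",
                        record.offset(), record.timestamp(), record.key(), record.value());
            }
            consumer.commitSync(); // record the last processed position for this group
        }
    }
}
```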
Advantages of using Apache Kafka
There are 4 main advantages of using Apache Kafka:
- High Throughput – Kafka processes messages at very high speed, which can exceed 100k messages per second, and data is processed in a partitioned and ordered manner.
- Durability – messages are stored in Kafka, by default for 168 hours (7 days), and this period is configurable. This makes re-processing possible if required.
- Scalability – Kafka can be scaled at every level: multiple producers can publish to the same topic, topics can be partitioned, and consumers can be grouped so that each consumes individual partitions.
- Fault Tolerance – Kafka has a distributed architecture with several nodes running together to serve the cluster, and topics are replicated. The failure of a node does not impact the system, because ZooKeeper provides accurate cluster information to both producers and consumers. Each partition has its own leader, and failure of the leader triggers the election of a new leader.