Apache Kafka is a powerful distributed event streaming platform widely used for building real-time data pipelines and streaming applications. Originally developed by LinkedIn and later open-sourced through the Apache Software Foundation, Kafka has become an essential tool for companies handling massive data streams across various industries, from finance and telecommunications to e-commerce and IoT.
At its core, Kafka provides a reliable, scalable way to handle real-time event processing with low latency. In this guide, we’ll dive into how Kafka operates and demonstrate how to set it up with Node.js to build robust, event-driven applications. Whether you’re streaming logs, tracking user activity, or syncing data across services, this blog will equip you with the foundational knowledge and practical steps to start using Kafka effectively in your Node.js applications. We will be using KafkaJS to connect Node.js to Kafka.
Prerequisites:
Before proceeding with this tutorial, you should have a basic understanding of Node.js and JavaScript, along with Docker and basic command-line usage.
Advantages of Kafka:
Here are some of the key advantages of using Apache Kafka:
High Throughput: Kafka is designed to handle large volumes of data with minimal latency, enabling the processing of millions of messages per second. Its efficient I/O operations make it highly scalable for high-throughput use cases.
Fault Tolerance: Kafka replicates data across multiple brokers to withstand hardware failures without data loss. This replication provides data redundancy and enhances resilience, making it reliable for critical applications.
Scalability: Kafka’s distributed architecture allows it to easily scale horizontally by adding more brokers to the cluster. This flexibility is ideal for organizations with growing data requirements.
Real-time Data Processing: Kafka provides low-latency, real-time data streaming capabilities, which makes it suitable for applications like monitoring, data pipelines, and analytics where real-time data processing is essential.
Durability: Messages in Kafka are written to disk and can be retained for as long as needed, allowing consumers to reprocess events and handle historical data. This makes Kafka great for event sourcing and audit logging.
Decoupling of Producers and Consumers: Kafka allows you to decouple data sources (producers) from the data consumers, making it easy to integrate various systems without dependencies or bottlenecks. Producers and consumers can operate independently at their own rates.
Kafka concepts and terminology:
We need to learn a few concepts and terminology to understand how Kafka works. Kafka is a distributed system with servers and clients communicating via a high-performance TCP network protocol.
Servers: Kafka is run as a cluster of one or more servers. Servers store and manage events. Servers in Kafka have two possible roles: broker and controller. A server can have a single role or both roles. We will see the differences between these roles later.
Events: An event records the fact that “something happened” in the world or your business. We read and write data in Kafka through events.
Producers: Producers are client applications that publish (write) events to Kafka.
Consumers: Consumers are client applications that subscribe to (read and process) particular events.
Topics: Events are organized and durably stored in topics. Simply put, a topic is similar to a folder in a filesystem; the events are the files in that folder.
Partitions: Topics are partitioned, meaning a topic is spread over a number of “buckets” located on different Kafka servers. This distributed placement of data is very important for scalability because it allows client applications to both read and write the data from/to many servers at the same time. It also makes Kafka fault-tolerant.
In the above image, two different producers are publishing events to the topic’s partitions. The topic has four partitions. The producers are independent of each other, and both can write to the same partition. Events with the same key are always written to the same partition. Source: the official documentation.
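To make the key-to-partition relationship concrete, here is a minimal KafkaJS sketch that publishes keyed events. The topic name and keys are made up for illustration, and it assumes a broker reachable at localhost:9092 (we will set one up later in this article):

const { Kafka } = require("kafkajs");

const kafka = new Kafka({ clientId: "keyed-producer", brokers: ["localhost:9092"] });

(async function () {
  const producer = kafka.producer();
  await producer.connect();

  // Messages that share a key (here, a hypothetical customer ID) always land
  // on the same partition, so their relative order is preserved for consumers.
  await producer.send({
    topic: "orders", // hypothetical topic name
    messages: [
      { key: "customer-42", value: "order created" },
      { key: "customer-42", value: "order paid" },
      { key: "customer-7", value: "order created" },
    ],
  });

  await producer.disconnect();
})();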
The broker server stores the events. The client applications connect to broker nodes to send and receive events.
The controller is responsible for managing the states of partitions and replicas and for performing administrative tasks like reassigning partitions. At any given time, there is only one active controller in your cluster.
Older versions of Kafka depended on ZooKeeper for maintaining configuration information, naming, providing distributed synchronization, and providing group services. Over the years, there has been a sustained effort to remove Kafka’s dependency on ZooKeeper. The current version of Kafka at the time of writing this article is 3.8, which still supports ZooKeeper.
Version 3.8 supports Apache Kafka Raft (KRaft), the consensus protocol introduced in KIP-500 to remove Kafka’s dependency on ZooKeeper for metadata management. Kafka version 4.0 aims to completely remove the dependency on ZooKeeper.
Setting up Kafka:
The quickest way to set up Kafka is with Docker. However, you can follow the documentation to set up Kafka without Docker. In this article, we will create a single Kafka server that acts as both the broker and the controller. We will run two Node.js applications, a producer and a consumer, to show how to use Kafka. We won’t be using ZooKeeper in our Kafka integration.
Step 1: Create a docker-compose file
Open your favorite editor and create a file called docker-compose.yml. Paste the following content inside it.
version: "3.9"
services:
kafka:
image: apache/kafka:3.8.0
hostname: broker
container_name: broker
ports:
- '9092:9092'
environment:
KAFKA_NODE_ID: 1
KAFKA_LISTENER_SECURITY_PROTOCOL_MAP: 'CONTROLLER:PLAINTEXT,PLAINTEXT:PLAINTEXT,PLAINTEXT_HOST:PLAINTEXT'
KAFKA_ADVERTISED_LISTENERS: 'PLAINTEXT_HOST://localhost:9092,PLAINTEXT://broker:19092'
KAFKA_PROCESS_ROLES: 'broker,controller'
KAFKA_CONTROLLER_QUORUM_VOTERS: '1@broker:29093'
KAFKA_LISTENERS: 'CONTROLLER://:29093,PLAINTEXT_HOST://:9092,PLAINTEXT://:19092'
KAFKA_INTER_BROKER_LISTENER_NAME: 'PLAINTEXT'
KAFKA_CONTROLLER_LISTENER_NAMES: 'CONTROLLER'
CLUSTER_ID: '4L6g3nShT-eMCtK--X86sw'
KAFKA_OFFSETS_TOPIC_REPLICATION_FACTOR: 1
KAFKA_GROUP_INITIAL_REBALANCE_DELAY_MS: 0
KAFKA_NODE_ID is a unique ID for a Kafka server. It can be any valid integer.
KAFKA_LISTENER_SECURITY_PROTOCOL_MAP defines key/value pairs that map each listener name to the security protocol it uses. Here PLAINTEXT means unencrypted, unauthenticated traffic (Kafka’s binary protocol over plain TCP), as opposed to SSL or SASL.
KAFKA_ADVERTISED_LISTENERS are the addresses that clients use to connect.
KAFKA_PROCESS_ROLES is the role of this Kafka server. Since we are using a single node, the server acts as both the broker and the controller.
KAFKA_CONTROLLER_QUORUM_VOTERS: to understand this variable, you need to know that when a Kafka cluster with multiple servers is initialized, an election determines which server becomes the active controller. This variable lists the servers with the controller role that can take part in that election, in the form id@host:port (see the example after this list).
KAFKA_LISTENERS is a comma-separated list of listeners, that is, the host/IP and port combinations that Kafka binds to for listening. This applies to both brokers and controllers.
KAFKA_INTER_BROKER_LISTENER_NAME: Kafka brokers communicate with each other, usually over an internal network. This is the name of the listener that other brokers use to talk to this server. It may seem unnecessary for a single node, but it is a mandatory setting.
KAFKA_CONTROLLER_LISTENER_NAMES This is a comma-separated list of the names of the listeners used by the controller.
CLUSTER_ID This is the unique ID of a cluster.
KAFKA_OFFSETS_TOPIC_REPLICATION_FACTOR This configuration allows us to set the number of replicas of the offsets topic. We are setting this to one since we are using a single server.
KAFKA_GROUP_INITIAL_REBALANCE_DELAY_MS The delay used by the GroupCoordinator to start the initial rebalance when the first member joins an empty group.
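For reference, here is what the controller quorum setting might look like in a hypothetical three-node cluster where every node has the controller role; the hostnames and ports below are made up purely for illustration and are not part of this tutorial’s setup:

# hypothetical multi-node example (our tutorial uses a single node)
KAFKA_CONTROLLER_QUORUM_VOTERS: '1@kafka-1:29093,2@kafka-2:29093,3@kafka-3:29093'

Each entry has the form node-id@host:controller-port, and every voter must also expose a matching CONTROLLER listener in its own KAFKA_LISTENERS.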
To start the Kafka container, run the following command in a terminal from the same directory where the docker-compose.yml file is present:
docker compose up
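If you prefer to keep the terminal free, you can run the container in the background and check on it with standard Docker Compose commands; this step is optional:

# start in detached mode
docker compose up -d

# check that the broker container is running
docker compose ps

# tail the broker logs (the service is named "kafka" in our compose file)
docker compose logs -f kafka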
Step 2: Initializing the Kafka Topic
Once the Kafka container is ready, we need to create the topic our applications will use. This needs to be done separately; there are articles and Stack Overflow posts that explain how to automate it, but we won’t be covering that in this article. You can run the following command to create the Kafka topic.
docker exec -i broker bash /opt/kafka/bin/kafka-topics.sh --create --topic test-topic --bootstrap-server localhost:9092
In the above command, we are executing a shell script inside the Docker container to create a topic. You can refer to the documentation for more details regarding the different shell scripts that are available and their uses. Don’t start the producer or the consumer before running the above command; otherwise, they will throw an error because the topic does not exist yet.
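To confirm the topic was created, you can list or describe topics with the same script; this check is optional:

# list all topics on the broker
docker exec -i broker bash /opt/kafka/bin/kafka-topics.sh --list --bootstrap-server localhost:9092

# show partition and replication details for our topic
docker exec -i broker bash /opt/kafka/bin/kafka-topics.sh --describe --topic test-topic --bootstrap-server localhost:9092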
Step 3: Setting up the Producer
Create a new file called producer.js and paste the following code (you will need the kafkajs package installed, for example via npm install kafkajs):
const { Kafka, Partitioners } = require("kafkajs");

const kafka = new Kafka({
  clientId: "my-app",
  brokers: ["localhost:9092"],
});

(async function () {
  // Create a producer that uses the default partitioner
  const producer = kafka.producer({
    createPartitioner: Partitioners.DefaultPartitioner,
  });

  await producer.connect();
  await producer.send({
    topic: "test-topic",
    messages: [{ value: "Hello KafkaJS!" }],
  });
  await producer.disconnect();
})();
const kafka = new Kafka({
  clientId: "my-app",
  brokers: ["localhost:9092"],
});
The clientId can be any string that identifies this application. We also pass the list of brokers the application can connect to.
const producer = kafka.producer({
  createPartitioner: Partitioners.DefaultPartitioner,
});
This initializes a Kafka producer. We also instruct KafkaJS to use the default partitioner, which decides which partition each message is written to (messages with the same key go to the same partition).
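If you ever need full control over partition assignment, KafkaJS also lets you supply a custom partitioner instead of the default one. Below is a minimal sketch; the routing rule is made up purely for illustration, and it assumes the kafka instance created above in producer.js:

// A custom partitioner is a factory that returns a function mapping
// each message to a partition number.
const MyPartitioner = () => {
  return ({ topic, partitionMetadata, message }) => {
    // Hypothetical rule: send messages flagged as "urgent" to partition 0,
    // everything else to the last available partition.
    if (message.headers && message.headers.urgent) {
      return 0;
    }
    return partitionMetadata.length - 1;
  };
};

// Uses the `kafka` instance defined earlier in producer.js
const customProducer = kafka.producer({ createPartitioner: MyPartitioner });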
await producer.send({
  topic: "test-topic",
  messages: [{ value: "Hello KafkaJS!" }],
});
Here we publish a Kafka event to the topic test-topic; notice that this is the same topic name we created earlier. The message being sent is “Hello KafkaJS!”.
The producer script publishes the event and then exits immediately. You can run it using:
node producer.js
Step 4: Setting up the Consumer
Before setting up the consumer, we need to understand what a consumer group is. Consumer groups allow Kafka consumers to work together and process events from a topic in parallel. Each consumer is assigned a subset of the partitions of a topic (or set of topics), so the group can parallelize processing. Within a group, Kafka guarantees that a single event is consumed by only one member; consumers in different groups each receive their own copy of the events. You can refer to this link for a detailed explanation of consumer groups.
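To see partition assignment in action, here is a small sketch that starts two consumers in the same group from one process. With our single-partition test-topic, only one of them will receive messages, but if the topic had more partitions they would split the work between them. The group and client IDs below are made up for illustration:

const { Kafka } = require("kafkajs");

const kafka = new Kafka({ clientId: "group-demo", brokers: ["localhost:9092"] });

async function startConsumer(name) {
  const consumer = kafka.consumer({ groupId: "demo-group" });
  await consumer.connect();
  await consumer.subscribe({ topic: "test-topic", fromBeginning: true });
  await consumer.run({
    eachMessage: async ({ partition, message }) => {
      // Each partition is owned by exactly one consumer in the group,
      // so the same message is never logged by both consumers.
      console.log(`${name} got "${message.value.toString()}" from partition ${partition}`);
    },
  });
}

(async function () {
  await startConsumer("consumer-A");
  await startConsumer("consumer-B");
})();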
Create a file called consumer.js, and paste the following code:
const { Kafka } = require("kafkajs");

const kafka = new Kafka({
  clientId: "my-app",
  brokers: ["localhost:9092"],
});

(async function () {
  const consumer = kafka.consumer({ groupId: "test-group" });

  await consumer.connect();
  await consumer.subscribe({ topic: "test-topic", fromBeginning: true });

  await consumer.run({
    eachMessage: async ({ topic, partition, message }) => {
      console.log({
        value: message.value.toString(),
      });
    },
  });
})();
const consumer = kafka.consumer({ groupId: "test-group" });
We have created a Kafka consumer and specified the consumer group this consumer belongs to.
await consumer.subscribe({ topic: "test-topic", fromBeginning: true });
Here we subscribe to test-topic. By setting fromBeginning to true, we instruct the consumer to start reading from the beginning of the topic when the group has no committed offset, so events published before the consumer started are also delivered. If fromBeginning is set to false, the consumer only receives events published after it starts. You can test this by running the producer first and then the consumer.
await consumer.run({
  eachMessage: async ({ topic, partition, message }) => {
    console.log({
      value: message.value.toString(),
    });
  },
});
Here we set up the message handler: eachMessage is called for every event the consumer receives, and we log the message value.
Run the following command to start the consumer.
node consumer.js
Unlike the producer script, the consumer script keeps running and listening for new events.
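In a real application, you would typically also disconnect the consumer cleanly when the process is asked to stop. Here is a minimal sketch of that pattern, assuming the consumer instance from consumer.js:

// Gracefully leave the consumer group on Ctrl+C or a termination signal,
// so Kafka can rebalance partitions without waiting for a session timeout.
const shutdown = async () => {
  try {
    await consumer.disconnect();
  } finally {
    process.exit(0);
  }
};

process.on("SIGINT", shutdown);
process.on("SIGTERM", shutdown);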
Conclusion:
Kafka and Node.js together, powered by KafkaJS, offer a practical and powerful foundation for modern, scalable, event-driven applications. As your application grows, you can explore advanced Kafka features such as partitioning strategies, consumer groups, and stream processing to enhance both functionality and reliability.
You can refer to this GitHub repo for the entire source code.