problem of high availability caused byKafka downtime


with a Kafka outage.

I work for a fintech company, but instead of using RabbitMQ, which is more popular in the field of financial payments, the company has adopted Kafka, which was designed for log processing from the beginning, so I have always been curious about the high availability implementation and guarantee of Kafka. Since its deployment, Kafka, which is used internally in the system, has been running stably without being unavailable.

However, recently, system testers often report that Kafka consumers occasionally do not receive messages, and log in to the management interface to find that one of the three nodes is down and hangs up. But according to the concept of high availability, how can the availability of three nodes and two nodes cause consumers in the entire cluster to not receive messages?

To solve this problem, we must start with the highly available implementation of Kafka.

Kafka’s multi-copy redundancy design

is whether it is a traditional relational database-based system or distributed such as zookeeper, redis, Kafka, HDFS The way to achieve high availability is usually to adopt a redundant design, through redundancy to solve the problem of node downtime and unavailability. Let’s start with a brief understanding of a few concepts of Kafka:

physical models

logistic model

    >Broker (node): Kafka service node, simply put, a broker is a Kafka server, a physical node.


  • : In Kafka, messages are categorized by topic, each topic has a Topic Name, producers send messages to a specific topic according to the Topic Name, and consumers also consume from the corresponding topic according to the Topic Name.
  • Partition: A topic is a unit of message classification, but each topic can be subdivided into one or more partitions  (Partitions), a partition can belong to only one topic. For example, message 1 and message 2 are sent to topic 1, which may enter the same partition or enter different partitions (so different partitions under the same topic contain different messages), and then they will be sent to the corresponding broker node of the partition.
  • Offset (offset): The partition can be regarded as a queue that only enters and does not exit (Kafka only guarantees that the messages in a partition are ordered), the message will be appended to the tail of the queue, and each message will have an offset after entering the partition to identify the position of the message in the partition, and the consumer wants to consume the message to be identified by the offset.

In fact, based on the above concepts, did you also somewhat guess that Kafka’s multi-copy redundancy design is implemented? Don’t worry, let’s keep looking down.

Prior to Kafka 0.8, there was no multi-replica redundancy mechanism, and once a node was down, all partitions on that node could no longer be consumed. This is equivalent to the loss of some data sent to the topic.

The introduction of Replica Reporter after version 0.8 solves the problem of data loss after downtime very well. Replicas are units of data for each partition in a topic, and the data of each partition is synchronized to other physical nodes to form multiple replicas.

Each copy of a partition consists of a Leader copy, which is elected by all replicas, and multiple Follower copies, all of which are Follower copies. When the producer writes or the consumer reads, it will only deal with the leader, and the follower will pull the data for data synchronization after writing the data.

It’s that simple? Yes, based on the multi-replica architecture diagram above, Kafka’s high availability is achieved. When a broker hangs up, worry that the partition on that broker still has replicas on other broker nodes . What if it’s Leader that hangs up? Then in the Follower to elect a Leader, producers and consumers can play happily with the new Leader, this is high availability.

You may also be wondering, how many copies are enough? What if there is no full sync between Follower and Leader? What are the election rules for a leader after a node goes down?

Direct Conclusion:

How many copies are enough?

The more replicas there are, the more to ensure the high availability of Kafka, but more replicas means more consumption of network and disk resources, and performance will decrease.

What if there is no full sync between Follower and Lead?

The Follower and Leader are not completely synchronous, but they are not completely asynchronous, but rather use an ISR mechanism (In-Sync Replica). )。 Each Leader dynamically maintains an ISR list that stores the Followers that are basically synchronized with the Leader. If a Follower does not initiate a pull data request to the Leader due to network, GC, etc., and the Follower is out of sync with the Leader, it will be kicked out of the ISR list. So, the Followers in the ISR list are copies that keep up with the Leader.

What are the election rules for a leader after a node goes down?

There are many distributed election rules, such as Zookeeper’s Zab, Raft, Viewstamped Replication, Microsoft’s PacificA, etc. Kafka’s Leader election idea is very simple, based on the ISR list we mentioned above, when the downtime will be looked up from all replicas in order, if the found replica is in the ISR list, it is elected as Leader. In addition, it is necessary to ensure that the former leader is already abdicated, otherwise there will be a split brain situation (there are two leaders). How to guarantee? Kafka guarantees that there is only one leader by setting up a controller.

The Ack parameter determines the degree of reliabilityIn addition, here is a necessary knowledge point for the

interview test Kafka high availability: request.required.asks parameter. The parameter Asks is an important configuration of the producer client, and this parameter can be set when sending messages. This parameter has three configurable values: 0, 1, All.

The first is set to 0

means that after the producer sends the message,

we don’t care whether the message is dead or alive, there is a little bit of the meaning of forgetting after sending it, and it is not responsible if it is spoken. If you are not responsible for nature, you may lose this message, and then you will lose usability.

The second is set to 1


which means that after the producer sends the message, as long as the message is successfully conveyed to the leader, it does not matter whether other followers are synchronized or not. There is a situation where the Leader just received the message, and the Follower went down before the Synchronize Broker, but the producer already thought that the message was sent successfully, then the message was lost. Note that

setting to 1 is the default configuration of Kafka, which shows that

the default configuration of Kafka is

not so highly available, but a trade-off between high availability and high throughput.

The third is set to All (or -1),

which means that after the producer sends the message, not only the Leader must receive it, but also the Follower in the ISR list must be synchronized, and the producer will send the task message successfully.

Think further, won’t Asks=All lose messages? The answer is no. When only Leader is left in the ISR list, Asks=All is equivalent to Asks=1, in this case, if the node is down, can you still ensure that the data is not lost? Therefore, data is not lost only if there are two copies of the ISR in Asks=All.

After going around in a big circle to solve the problem and understanding the high availability mechanism of Kafka, finally returning to our initial

problem itself, why is one of Kafka’s nodes unavailable after it goes down?

The number of broker nodes I configured in my dev test environment is 3, the number of Topic is 3 replicas, the number of partitions is 6, and the Asks parameter is 1.

When one of the three nodes goes down, what does the cluster do first? That’s right, as we said above, the cluster finds that the leader with the partition has failed, and the leader must be re-elected from the ISR list. If the ISR list is empty, is it not available? No, choose one of the surviving replicas of the partition as the leader, but this has the potential risk of data loss.

Therefore, as long as the number of

subject replicas is set to the same as the number of brokers, Kafka’s multi-copy redundancy design can ensure high availability and will not be unavailable after a downtime (but it should be noted that Kafka has a protection policy, Kafka will stop when more than half of the nodes are unavailable). So if you think about it, is there a topic with a copy number of 1 on Kafka?


problem is with __consumer_offset, __consumer_offset is a topic automatically created by Kafka to store the offset information of consumer consumption, the default partition The number is 50. This is the topic, and its default number of copies is 1. If all partitions exist on the same machine, that’s a clear single point of failure! When you broker the partition where __consumer_offset is stored to Kill, you will find that all consumers have stopped spending. How to solve this problem?

The first point is that you need to delete the __consumer_offset, note that this topic is Kafka’s built-in topic, you can’t delete it with a command, I delete it by deleting logs.

Second, you need to change the number of copies of __consumer_offset to 3 by setting offsets.topic.replication.factor to 3. By making the __consumer_offset redundant replica, the consumer consumption problem after a node is also redundant.

Finally, I am confused about why the __consumer_offset partitions appear only stored on one broker and not distributed across each broker, if there are friends who understand please advise ~


public number (zhisheng). reply to Face, ClickHouse, ES, Flink, Spring, Java, Kafka, Monitoring < keywords such as span class="js_darkmode__148"> to view more articles corresponding to keywords.