Today let’s talk about the excellent design decisions inside Kafka, in the hope of improving your design and coding skills!
The role of the message system

Most readers should already know what a message system does, but let’s take the oil-packing example in the figure above: the message system is the warehouse in the middle. It acts as a buffer for the intermediate steps and decouples the two sides from each other.
To introduce a scenario: the log processing of China Mobile, China Unicom, and China Telecom is outsourced for big data analysis. Suppose their logs are now handed to a system you build for user-profile analysis.
From the role just described, we know that the message system actually simulates a cache: it only plays the role of a cache rather than being a real one, and the data is still stored on disk, not in memory.
1. Topic
Kafka borrows from database design and introduces the topic, which is similar to a table in a relational database. When I need China Mobile’s data, I can simply subscribe to TopicA.
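As a quick illustration (my sketch, not from the original article), a topic like TopicA can be created with Kafka’s AdminClient; the broker address, partition count, and replication factor below are placeholder values:

import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.NewTopic;

public class CreateTopicSketch {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.setProperty("bootstrap.servers", "localhost:9092"); // placeholder address
        try (AdminClient admin = AdminClient.create(props)) {
            // TopicA with 3 partitions and 3 replicas, matching the article's example
            NewTopic topicA = new NewTopic("TopicA", 3, (short) 3);
            admin.createTopics(Collections.singleton(topicA)).all().get();
        }
    }
}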
2.Partition
partition kafka There is also a concept called Partition (partition), the partition is specifically manifested on the server at the beginning is a directory, there are multiple partitions under a topic, these partitions will be stored on different servers, or in other words, In fact, different directories are created on different hosts.
The main data of a partition is stored in its .log files. This is similar to partitioning in a database to improve performance. Why does it improve performance? Simple: multiple partitions mean multiple threads, and multiple threads processing in parallel will certainly beat a single thread.
Topic and partition are like the concepts of table and region in HBase: the table is only a logical concept, the data is really stored in regions, and those regions are distributed across the servers. It is the same in Kafka: the topic is a logical concept, and the partition is the distributed storage unit. This design is the foundation for handling massive amounts of data. For comparison: if HDFS had no block design, a 100 TB file could only sit on a single server and would occupy it entirely; with blocks, a large file can be scattered across different servers.
Note:
1. A partition on its own is a single point of failure, so we set a replica count for each partition.
2. Partition numbering starts at 0.

3. Producer

Whoever sends data into the message system is a producer.
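A minimal producer sketch (my illustration, not from the original article; the broker address and record contents are placeholders):

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class ProducerSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.setProperty("bootstrap.servers", "localhost:9092"); // placeholder address
        props.setProperty("key.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        props.setProperty("value.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // The record goes straight to the leader partition of TopicA
            producer.send(new ProducerRecord<>("TopicA", "key", "one log line"));
        }
    }
}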
4. Consumer

Whoever reads data out of Kafka is a consumer.
5. Message

The data Kafka processes is called a message.
II. Kafka’s Cluster Architecture
Create a topic, TopicA, whose 3 partitions are stored on different servers, that is, under different brokers. The topic is a logical concept, so the topic itself cannot be drawn as a unit in the figure; only its partitions can.
Kafka had no replica mechanism before version 0.8, so data was lost whenever a server went down; try to avoid Kafka versions older than that.

Replica

To keep data safe, each partition in Kafka can be given multiple replicas.
Here we set 3 replicas for partitions 0, 1, and 2 (in practice, two replicas is usually more appropriate). Each replica set also has roles: one replica is elected as the leader and the rest become followers. When our producer sends data, it writes directly to the leader partition; the follower partitions then fetch from the leader to synchronize themselves. When consumers consume data, they likewise read from the leader.
Consumer Group

When consuming data we specify a group.id in the code; this id is the name of the consumer group. Even if you do not set group.id, the system assigns a default one.

conf.setProperty("group.id", "tellYourDream");
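A minimal consumer sketch showing where group.id fits (my illustration, not from the original article; the broker address and topic are placeholders):

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class ConsumerSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.setProperty("bootstrap.servers", "localhost:9092"); // placeholder address
        props.setProperty("group.id", "tellYourDream"); // the consumer group name
        props.setProperty("key.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
        props.setProperty("value.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singleton("TopicA"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    System.out.printf("partition=%d offset=%d value=%s%n",
                            record.partition(), record.offset(), record.value());
                }
            }
        }
    }
}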
Some message systems we know are designed so that once any consumer has consumed a piece of data, no other consumer can consume it again. Kafka is not like that. For example, suppose consumerA has consumed the data in TopicA:
consumerA: group.id = a
consumerB: group.id = a
consumerC: group.id = b
consumerD: group.id = b
If we then let consumerB consume TopicA’s data, it cannot, because it shares a group with consumerA. But consumerC, for which we specified a different group.id, can consume TopicA’s data, while consumerD in that same group again cannot. So in Kafka, each distinct group independently consumes the same topic’s data exactly once.
Consumer groups therefore exist to let multiple consumers consume messages in parallel without consuming the same message twice. Below, consumerA, consumerB, and consumerC form one consumer group and do not interfere with each other:

group.id = a: consumerA, consumerB, consumerC
As shown in the figure, since (as mentioned earlier) consumers connect directly to leaders, the three consumers each consume one of the three leader partitions. A partition is never consumed by more than one consumer within the same group, but when the group is not saturated, one consumer can consume data from multiple partitions.
Here is a familiar rule: among big data distributed systems, 95% of architectures are master-slave, and a few are peer-to-peer, such as Elasticsearch. Kafka is also a master-slave architecture: the master node is called the controller, the rest are slave nodes, and the controller cooperates with ZooKeeper to manage the entire Kafka cluster.
How Kafka and ZooKeeper work together

Kafka relies heavily on the ZooKeeper cluster (so the earlier ZooKeeper article turns out to be useful). All brokers register with ZooKeeper when they start, in order to elect a controller. The election is simple and crude: whoever registers first wins; there is no sophisticated algorithm involved.
What does a broker do after becoming the controller? It watches several directories in ZooKeeper. For example, there is a directory /brokers/, and the slave nodes register themselves under it (that is, they create their own child nodes there), conventionally named after their broker id, such as /brokers/0, 1, 2. When registering, each node exposes its host name, port number, and so on. The controller reads the data of the registered slave nodes (through the watch mechanism), generates the cluster’s metadata, and distributes it to the other servers, so that every server can sense the other members of the cluster.
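For intuition, here is a small sketch that lists the registered brokers with the ZooKeeper client API (my illustration; note that in real Kafka deployments the broker ids live under /brokers/ids, and the stored JSON varies by version):

import org.apache.zookeeper.ZooKeeper;

public class BrokerListSketch {
    public static void main(String[] args) throws Exception {
        // Placeholder ZooKeeper address; brokers register ephemeral nodes here
        ZooKeeper zk = new ZooKeeper("localhost:2181", 10_000, event -> { });
        for (String id : zk.getChildren("/brokers/ids", false)) {
            byte[] data = zk.getData("/brokers/ids/" + id, false, null);
            System.out.println("broker " + id + " -> " + new String(data)); // host, port, ...
        }
        zk.close();
    }
}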
Now let’s simulate a scenario: we create a topic (which in essence creates a directory on ZooKeeper, /brokers/topics/topicA). Kafka generates the partition scheme in this directory; the controller hears this change, synchronizes the directory’s metadata, and passes it on to its slave nodes. Through this mechanism the whole cluster learns the partition scheme, and each node then creates its directories and waits for the partition replicas to be created. This is the management mechanism of the entire cluster.
Meal Time

1. What makes Kafka’s performance so good?
(1) Sequential writes

Every time the operating system reads or writes data on disk, it must first seek, that is, find the data’s physical location on the disk, before it can read or write; on a mechanical hard disk, seeking dominates the time spent.

In Kafka’s design, data is in fact stored on disk, whereas in general you would put data in memory to get good performance. But Kafka writes sequentially, appending each record to the end of the file, and the performance of sequential disk writes is extremely high: with a given number of disks at a given rotation speed, it is basically on par with memory. A random write, by contrast, modifies data at some position inside a file, and its performance is much lower.
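A toy sketch of the append-only pattern (my illustration, not Kafka source code): a FileChannel opened in append mode, so every write lands at the end of the log file and the disk head moves sequentially:

import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

public class AppendOnlySketch {
    public static void main(String[] args) throws Exception {
        try (FileChannel log = FileChannel.open(Paths.get("00000000000000000000.log"),
                StandardOpenOption.CREATE, StandardOpenOption.WRITE,
                StandardOpenOption.APPEND)) {
            // Appends only: no seeking back into the middle of the file
            log.write(ByteBuffer.wrap("message-1\n".getBytes(StandardCharsets.UTF_8)));
            log.write(ByteBuffer.wrap("message-2\n".getBytes(StandardCharsets.UTF_8)));
        }
    }
}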
(2) Zero copy

Let’s first look at the non-zero-copy case shown in the figure above. You can see that the data is copied from the page cache into the Kafka service process, and then again into the socket cache; the whole process costs a lot of time. Kafka instead uses Linux’s sendfile technology (exposed through Java NIO), eliminating the process switches and one of the data copies, which makes performance much better.
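In Java NIO, the call that maps to sendfile is FileChannel.transferTo. A minimal sketch (my illustration; the socket address is a placeholder):

import java.net.InetSocketAddress;
import java.nio.channels.FileChannel;
import java.nio.channels.SocketChannel;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

public class ZeroCopySketch {
    public static void main(String[] args) throws Exception {
        try (FileChannel file = FileChannel.open(Paths.get("00000000000000000000.log"),
                     StandardOpenOption.READ);
             SocketChannel socket = SocketChannel.open(
                     new InetSocketAddress("localhost", 9999))) { // placeholder address
            // transferTo hands the copy to the kernel (sendfile on Linux):
            // bytes flow file -> socket without a round trip through user space
            long position = 0;
            long size = file.size();
            while (position < size) {
                position += file.transferTo(position, size - position, socket);
            }
        }
    }
}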
2. Log segment storage

Kafka caps each .log file in a partition at 1 GB; the purpose of this limit is to make it convenient to load a .log file into memory for operations:
00000000000000000000.index
00000000000000000000.log
00000000000000000000.timeindex

00000000000005367851.index
00000000000005367851.log
00000000000005367851.timeindex

00000000000009936472.index
00000000000009936472.log
00000000000009936472.timeindex

A number such as 9936472 is the starting offset contained in that log segment file, which also tells us that nearly 10 million messages have already been written to this partition.
The Kafka broker has a parameter, log.segment.bytes, that limits the size of each log segment file to at most 1 GB. When a log segment file is full, a new one is automatically opened for writing, so that no single file grows large enough to hurt read/write performance. This process is called log rolling, and the segment currently being written is called the active log segment.
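In the broker’s server.properties this is just one line, for example (1 GB is also the default):

log.segment.bytes=1073741824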
If you read the previous two articles on HDFS, you will find that the NameNode’s edit log has a similar size limit; these frameworks all take such issues into account.
3. Network design

Kafka’s network design is tied to Kafka tuning, and it is why Kafka supports high concurrency.
First of all, the client sends every request to an Acceptor. The broker has 3 threads alongside it (3 by default), called processors. The Acceptor does not do any processing on the client’s request; it wraps it directly into a socketChannel and hands it to one of the processors, which together form a queue. Distribution is round-robin: the first request goes to the first processor, the next to the second, then the third, and back around to the first. When threads consume these socketChannels, they pull out one Request after another, and those requests carry the data.
By default the thread pool has 8 threads, which are used to handle the requests: they parse each request and, if it is a write request, write to disk; if it is a read, they return the result. The processor then reads the response data from the response queue and returns it to the client. This is Kafka’s three-layer network architecture.
So if we need to strengthen Kafka, adding processors and adding handler threads in the thread pool achieves the effect, as shown below. The request and response queues in the middle actually act as a buffer, for the case where the processors generate requests faster than the thread pool can process them in time. So this is an enhanced version of the Reactor network threading model.
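These knobs map to two broker config entries, whose defaults match the numbers above:

num.network.threads=3
num.io.threads=8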
Author: Say what you wish
Original: https://juejin.cn/post/6844903999066341384