Hello, I’m Goku~ 

This article introduces the reader to two excellent distributed message flow platforms: Kafka and Pulsar.

Send a book at the end of the article!

Apache Kafka (Kafka) is a distributed message flow platform developed by LinkedIn and opened source in 2011. Kafka is written in Scala and Java and has become one of the most popular distributed message flow platforms. Kafka is based on a publish/subscribe model and features high throughput, persistence, horizontal scalability, and support for streaming data processing.

Apache Pulsar (Pulsar for short) is the “Next Generation Cloud-Native Distributed Message Flow Platform” developed by Yahoo, which was open sourced in 2016 and is now growing rapidly. Pulsar integrates messaging, storage, and lightweight functional computing, and adopts the architecture design of separation of computing and storage, which supports multi-tenant, persistent storage, and multi-room cross-region data replication, and has the characteristics of streaming data storage such as strong consistency, high throughput, low latency, and high scalability.

Both Kafka and Pulsar are excellent distributed message flow platforms, and both provide the following basic capabilities:

(1) Message system: Kafka and Pulsar can implement a message system based on the publish/subscribe pattern, the message system can realize the program driven by the message – the producer is responsible for generating and sending messages to the message system, the message system delivers the message to the consumer, and the consumer receives the message and executes its own logic. 

This message-driven mechanism has the following advantages:

System decoupling: producers and consumers are logically decoupled and do not interfere with each other. If you need to add new processing logic to the message, you only need to add a new consumer, which is very convenient.

Traffic shaving as a message buffer, the message system caches the traffic peak of the upstream service (producer) at low cost, and the downstream service (consumer) reads data from the message queue and processes it according to its own processing capacity, so as to avoid the downstream service crashing due to a large amount of request traffic.

Data redundancy: The message system caches the data until the data is processed, avoiding data loss caused by downstream services being unable to process data in a timely manner due to crashes going offline and network blockages.

(2) Storage system: Kafka and Pulsar can store a large amount of data, and the client controls where it reads the data, so they can also be used as storage systems to store a large amount of historical data.

(3) Real-time streaming data pipeline: Kafka and Pulsar can build real-time streaming data pipelines, streaming data pipelines load data from MySQL, MongoDB and other data sources into Kafka and Pulsar, and other systems or applications can stably obtain data from Kafka and Pulsar without the need to dock with MySQL and other data sources. To do this, Kafka provides the Kafka Connect module and Pulsar provides the Pulsar IO module, both of which can build real-time streaming data pipelines.

(4) Stream computing applications: Stream computing applications continuously acquire stream data from Kafka and Pulsar, process the data, and finally output the processing results to Kafka and Pulsar (or other systems). Stream computing applications often need to perform complex data transformations on streaming data according to business requirements, such as streaming data aggregation or joining. To that end, Kafka provides the Kafka Streams module and Pulsar provides the Pulsar Functions module, both of which enable streaming computing applications. In addition, Kafka and Pulsar can also be combined with popular distributed computing engines such as Spark and Flink to build real-time streaming applications that process large-scale data in real time.

High throughput, low latency: They both have the ability to handle large-scale message flows with high throughput and the ability to process messages with low latency. This is also the goal pursued by most message flow platforms.

Persistence, consistency: Both Kafka and Pulsar support message persistence storage and provide data backup (copy) functions to ensure data security and data consistency, both of which are excellent distributed storage systems.

High scalability (scalability): Both Kafka and Pulsar are distributed systems that store data shards in clusters of machines and support the scaling of clusters to support large-scale data.

Failover (fault tolerance): Kafka and Pulsar support failover, that is, after a node in the cluster goes offline due to a failure, it does not affect the normal operation of the cluster, which is also a necessary function of an excellent distributed system.

Although Kafka and Pulsar provide similar basic functions, but their design, architecture, and implementation are not the same, this book will delve into how Kafka and Pulsar can achieve a distributed, highly scalable, high-throughput, low-latency message flow platform. In addition, this book will also introduce the application practice of Kafka and Pulsar connectors, stream computing engines, and other functions.

Both Kafka and Pulsar are considered to be a simple messaging system, and the message flow flow is shown in the following figure.

The diagram shows four basic concepts in the messaging system. They are present in both Kafka and Pulsar and have the same meaning.

Message: Data entities in Kafka and Pulsar.

Producer: The app that publishes the message.

Consumer Consumer: An app that subscribes to messages.

Topic Topic: Kafka and Pulsar divide a certain type of message into a topic, the topic is a logical grouping of messages, and the messages of different topics do not interfere with each other.

The following is an example to illustrate the above concepts. Suppose there is a user service that creates a topic “userTopic”, and whenever a new user registers, the user service sends a message to the topic with the message content “New User Registration”. There are currently two services that subscribe to messages for this topic: the Benefit Service and the Permissions Service. After the benefit service receives the message, it is responsible for creating benefits for new users. After the permissions service receives the message, it is responsible for assigning permissions to the new user. The message in this example is the data entity sent by the user service, and the producer is the user service. Consumers are rights services and rights services. The basic concept of ka

Here are some of the basic concepts of Kafka.

Kafka Consumer Group: Kafka divides multiple consumers into a logical grouping, which is a consumer group. This concept is more important, combined with the above example to illustrate, in Kafka, all consumers of the rights service can join a rights consumer group, rightsGroup, and all consumers of the permission service can join a rights consumption group guthority Group. Different consumers do not interfere with each other in consumer news.

Broker: Kafka service node, can be understood as a Kafka service node or service process (hereinafter collectively referred to as Broker node), multiple broker nodes can form a broker cluster.

Partition: Kafka defines the concept of partitioning, a topic consists of one or more partitions, Kafka divides the messages of a topic into different partitions and stores different partitions into different brokers, thus achieving distributed storage (typical data sharding idea), each partition has a corresponding subscript, and the subscript starts from 0.

Replica: Each partition in Kafka has one or more copies, where there are 1 leader copy, 0 or more follow copies, each of which holds the entire contents of that partition. Kafka saves different copies of a partition to different broker nodes to keep the data safe. The Kafka replica synchronization mechanism will be analyzed in detail later in this book.

AR(Assigned Replicas): A list of partitioned replicas, that is, a list of brokers where all replicas of a partition reside.

ISR: All replicas in a partition that are in some sync with the leader replica (i.e., not too far behind) make up the ISR (In-Sync Replicas) collection. The leader replica is included in the ISR collection, which can be thought of as a synchronized replica (not necessarily fully synchronized, but not too far behind).

ACK mechanism: ACK (message acknowledgement) mechanism is a very important mechanism in the message system, and the ACK mechanism of the message system is very similar to the ACK mechanism of HTTP. The message system ACK mechanism can be divided into two parts:

After mBroker receives the messages sent by the producer and successfully stores them, it returns a successful response (which can be understood as an ACK) to the producer, at which point the producer can assume that the message has been sent successfully, otherwise the producer may need to do some compensation operations, such as resending the message.

After the consumer receives the message delivered by the broker and successfully processes it, it returns the successful response to the broker, and after the broker receives the successful response of these consumptions, it can be considered that the consumer has successfully consumed the message, otherwise the broker may need to do some compensation operations, such as re-delivering the message. In this scenario, the consumer usually needs to send the successful consumption message location (or message ID, etc.) to the broker, and the broker needs to store these successful consumption locations so that the subsequent consumer can continue to consume from that location after the subsequent consumer restarts. This scenario is also our focus.

In Kafka, there is an offset offset for each message, and if a Kafka topic is understood as a simple array of messages, then the message offset can be understood as the index of that message in that array. The consumer sends the next offset of the latest successful consumption message to the Broker (meaning that the messages preceding the offset have been successfully consumed), and the Broker stores these offsets to record the consumer’s latest consumption location. For ease of description, the offset in the consumer-submitted ACK information is referred to later in this book as the ACK offset.

In addition, both Kafka and Pulsar use ZooKeeper to store metadata and complete operations such as distributed collaboration, a distributed collaboration service that focuses on collaborating activities between multiple distributed processes, helping developers focus on the core logic of the application without worrying about the distributed nature of the application. Later in this book, we will analyze in detail what ZooKeeper has provided to Kafka and Pulsar. Kafka 2.8 began to provide KRaft module, allowing Kafka to operate independently of ZooKeeper deployment, and the design and implementation of this module will be analyzed in detail later in this book.

The following diagram shows the infrastructure of a Kafka cluster.

The basic concepts of Pulsar are described below

Pulsar subscription groups: Pulsar can bind multiple consumers into a single subscription group, similar to Kafka’s consumer group. Also using the previous example of “user service” to illustrate, in Pulsar, all consumers of the benefit service can bind a rights subscription group rightsSubscription, while all consumers of the rights service can bind a permission subscription group guthoritySubscription, and the consumption messages between different subscription groups do not interfere with each other.

Non-partitioned topics, partitioned topics: Each partition in Kafka is bound to a broker, while each topic in Pulsar is bound to a broker, and messages for a topic are fixed to the corresponding broker node. There is also the concept of “partitioned theme” in Pulsar, which consists of a set of non-partitioned internal topics (hereinafter referred to as the internal topics that make up the partitioned topics in Pulsar), each internal topic is bound to a broker, so that a partitioned topic can send messages to multiple brokers, avoiding the performance of a single Pulsar topic being limited by a single broker node.

Broker: A service node in a Pulsar cluster. It should be noted that Pulsar adopts the architecture of computing and storage separation, so the Pulsar Broker node is only responsible for computing, not storage, Pulsar Broker node will complete data verification, load balancing and other work, and forward the message to the Bookie node.

Bookie: Pulsar implements storage capabilities using the BookKeeper service, and the nodes in BookKeeper are called Bookie nodes. The BookKeeper framework is a distributed log storage service framework, which will be analyzed in detail later in this book. The Bookie node in Pulsar is responsible for completing the message storage work.

Ledger: BookKeeper’s data collection, where producers write data to Ledger and consumers read data from Ledger. For data security, BookKeeper stores a ledger data in multiple Bookie nodes for data backup.

Entry: A unit of data in Ledger, where every piece of data in Ledger is an Entry. Ledger can be understood as a ledger, and Entry as an entry in the ledger.

Tenant, namespace: Pulsar defines the concept of tenant and namespace, Pulsar is a multi-tenant system, which assigns different resources to different tenants, and ensures that the data between different tenants is isolated from each other and does not interfere with each other, so that it can support multiple teams and multiple users to use a Pulsar service at the same time. Each tenant can also create multiple namespaces, logical groupings of namespaces as topics. Pulsar can be understood as a large house, each tenant is a room in the house, and the space of this room is divided into different areas (namespaces), and different areas store different objects. For example, a user service can create a tenant “user” that stores messages for the user service. The tenant can create multiple namespaces and hold different topics according to its own business scenarios, as shown in the following figure.

Cluster cluster: Pulsar defines a cluster concept for clusters, each Pulsar Broker node runs under a cluster cluster, and different cluster clusters can copy data to each other, thus achieving cross-region replication.

ACK mechanism: Similar to Kafka, Pulsar also needs to complete “the broker stores the message and returns a successful response to the producer” and “the consumer successfully processes the message and sends the ACK to the broker”. Each message in Pulsar has a message ID, and the Pulsar consumer sends the Message Id of successful consumption to the broker as the content of the ACK request.

The following diagram shows the infrastructure of a Pulsar cluster.

This article introduces the origins and development and system characteristics of Kafka and Pulsar, as well as the most basic core concepts in Kafka and Pulsar. If you want to learn more, “Understanding Kafka and Pulsar: The Practice and Analysis of Message Flow Platforms” will detail the specific meaning and role of these concepts, and will gradually supplement other key concepts in Kafka and Pulsar, and if the reader does not understand a concept, he can also read this book with questions.

Want to know more about Kafka and Pulsar?

Check out this book!

▊ “Understanding Kafka and Pulsar in Depth: Practice and Analysis of Message Flow Platforms”

Written by Liang Guobin

Details of how Kafka and Pulsar are used

An in-depth analysis of Kafka and Pulsar’s implementation principles

This book details how Kafka and Pulsar are used and provides an in-depth analysis of how they are implemented. By reading this book, readers can quickly get started and use Kafka and Pulsar, and gain a deeper understanding of how they are implemented.

This book introduces the use of Kafka and Pulsar through a large number of practical examples, including basic applications such as management scripts and client (producer, consumer), key configuration items, ACK submission methods, and advanced applications such as security mechanisms, cross-region replication mechanisms, connector/stream computing engines, and common monitoring and management platforms. These contents can help readers gain an in-depth understanding of how Kafka and Pulsar are used and complete their day-to-day management work. In addition, this book provides an in-depth analysis of the implementation principles of Kafka and Pulsar, including the design and implementation of clients (producers, consumers), the Broker network model, the topic (partition) distribution and load balancing mechanism, as well as disk storage and performance optimization schemes, data synchronization mechanisms, scaling and failover mechanisms. Finally, the book introduces the transaction mechanism of Kafka and Pulsar, and provides an in-depth analysis of the implementation of Kafka transactions and the KRaft module, a distributed collaboration component of Kafka. This section helps the reader easily understand how Kafka and Pulsar were architected and implemented.

Fan benefits

Scan the code or click to read the original article

Place an order at half price

Scan the code to snap it up

Activities:

1. Leave a message at the end of the article

2. Invite friends to like your message

3. Like the first three to get one each!

4, add my friends, send me the receipt information ~ my WeChat: passjava

Campaign ends on 2022-09-19 20:00