The author is David Kjerrumgaard, a contributor to the Apache Pulsar and Apache NiFi projects who currently works at Splunk. Translated by Sijia@StreamNative. Original article: https://searchdatamanagement.techtarget.com/post/Apache-Pulsar-vs-Kafka-and-other-data-processing-technologies (translated with permission).
Apache Pulsar is a top-level project of the Apache Software Foundation and a next-generation, cloud-native distributed messaging and streaming platform that integrates messaging, storage, and lightweight functional computation. It adopts an architecture that separates compute from storage; supports multi-tenancy, persistent storage, and cross-region data replication across multiple data centers; and provides strong consistency, high throughput, low latency, and high scalability for streaming data storage.
GitHub address: http://github.com/apache/pulsar/
How does the distributed messaging platform Apache Pulsar store data compared with data processing middleware such as Kafka? Starting from their architectures, this article compares the strengths and weaknesses of Apache Pulsar and traditional data processing middleware such as Apache Kafka.
Scaling storage
Apache Pulsar’s multi-tier architecture completely decouples the message serving layer from the storage layer, allowing each layer to scale independently. Traditional distributed data processing middleware (e.g., Hadoop, Spark) processes and stores data on the same cluster nodes/instances. That design reduces the amount of data transferred over the network and keeps the architecture simpler, which can improve performance, but at the cost of scalability, elasticity, and ease of operations.
Pulsar’s tiered architecture is unique among cloud-native solutions. Today, vastly increased network bandwidth provides a solid foundation for this architecture and makes the separation of compute and storage practical. Pulsar decouples the serving layer from the storage layer: stateless broker nodes serve data, while bookie nodes store it (Figure 1).
Figure 1. Decoupling the service layer from the storage layer
An architecture that decouples the service layer from the storage layer has many advantages. First, each layer can scale elastically without affecting the other; with the elasticity of environments such as clouds and containers, the layers can automatically scale up and down to adapt to traffic spikes. Second, system availability and manageability improve because the complexity of cluster scaling and upgrades is significantly reduced. Third, the design is container-friendly, making Pulsar an ideal choice for hosting cloud-native streaming systems. Apache Pulsar uses the highly scalable Apache BookKeeper as its storage layer, which provides strong durability guarantees, distributed data storage and replication, and native support for cross-region replication.
The multi-tier design also makes it easy to tier storage, offloading less frequently accessed data to low-cost persistent storage such as AWS S3 or Azure cloud storage. Pulsar can be configured with a predefined storage size or time period after which data stored on local disks is automatically offloaded to a cloud storage platform, freeing up local disk space while keeping a durable backup of the event data.
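For example, the offload threshold can be set per namespace through the admin API. Below is a minimal sketch using the Java PulsarAdmin client; it assumes a broker admin endpoint at http://localhost:8080 and an offload driver (e.g., aws-s3) already configured in broker.conf, and the namespace name and threshold are purely illustrative.

```java
import org.apache.pulsar.client.admin.PulsarAdmin;

public class OffloadThresholdExample {
    public static void main(String[] args) throws Exception {
        // Assumed local admin endpoint; adjust to your cluster.
        PulsarAdmin admin = PulsarAdmin.builder()
                .serviceHttpUrl("http://localhost:8080")
                .build();

        // Once topics in this namespace hold more than ~10 GiB on the bookies,
        // older ledgers are offloaded to the configured long-term store (e.g., S3).
        admin.namespaces().setOffloadThreshold("public/default", 10L * 1024 * 1024 * 1024);

        admin.close();
    }
}
```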
Pulsar vs. Kafka
Apache Pulsar and Apache Kafka share similar messaging concepts. Clients interact with both through topics, which are logically divided into partitions. In general, the unbounded stream of data written to a topic is split across a fixed number of equal-sized partitions so that the data is evenly distributed across the system and can be consumed by multiple clients at the same time.
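To make the topic/partition model concrete, here is a minimal sketch of a Pulsar producer and consumer using the Java client. It assumes a broker at pulsar://localhost:6650 and a partitioned topic named persistent://public/default/events created in advance (e.g., with pulsar-admin); all names are illustrative.

```java
import org.apache.pulsar.client.api.Consumer;
import org.apache.pulsar.client.api.Message;
import org.apache.pulsar.client.api.Producer;
import org.apache.pulsar.client.api.PulsarClient;
import org.apache.pulsar.client.api.Schema;

public class TopicExample {
    public static void main(String[] args) throws Exception {
        // Assumed local broker; messages sent to a partitioned topic are spread
        // across its partitions by the client's routing policy.
        PulsarClient client = PulsarClient.builder()
                .serviceUrl("pulsar://localhost:6650")
                .build();

        // Subscribe first so the message published below is received.
        Consumer<String> consumer = client.newConsumer(Schema.STRING)
                .topic("persistent://public/default/events")
                .subscriptionName("billing") // illustrative subscription name
                .subscribe();

        Producer<String> producer = client.newProducer(Schema.STRING)
                .topic("persistent://public/default/events")
                .create();
        producer.send("order-created");

        Message<String> msg = consumer.receive();
        consumer.acknowledge(msg);

        producer.close();
        consumer.close();
        client.close();
    }
}
```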
The essential difference between Apache Pulsar and Apache Kafka lies in the underlying storage architecture. Apache Kafka is a partition-centric publish/subscribe system designed to operate as a monolithic architecture, with the serving and storage layers on the same nodes.
Figure 2. Kafka’s partition-based storage
Kafka storage: partition-based
In Kafka’s partition-based storage, a partition’s data is stored as a single contiguous unit on the leader node and then replicated to a preconfigured number of replica nodes to maintain multiple copies of the data. This design limits the capacity of a partition, and therefore the ability of a topic to scale, in two ways. First, because a partition can only be stored on local disks, its size is bounded by the largest single disk on the host (roughly 4 TB for a new installation). Second, because the data must be replicated, the partition can be no larger than the smallest amount of disk space available on any of its replica nodes.
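As a point of reference, the partition count and replication factor are fixed per topic at creation time, and every partition must fit in full on its leader and on each replica broker. A minimal sketch using Kafka’s Java AdminClient, assuming a broker at localhost:9092; the topic name, partition count, and replication factor are illustrative.

```java
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;

import java.util.Collections;
import java.util.Properties;

public class CreateKafkaTopic {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed broker

        try (AdminClient admin = AdminClient.create(props)) {
            // 3 partitions, replication factor 2: each partition lives entirely
            // on its leader's disk and on one replica broker's disk.
            NewTopic topic = new NewTopic("events", 3, (short) 2);
            admin.createTopics(Collections.singletonList(topic)).all().get();
        }
    }
}
```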
Figure 3. Kafka failure and expansion
Suppose a partition’s leader is stored on a new node with a 4 TB disk, while both of its replica nodes have only 1 TB of storage capacity. After 1 TB of data has been published to the topic, Kafka detects that the replica nodes can no longer receive data, and messages cannot be published to the topic until a replica frees up space (Figure 3). If producers are unable to buffer messages during such an outage, data loss may result.
Faced with this problem, there are two options: delete data from the disks of the existing replica nodes, which risks data loss because data from other topics may not yet have been consumed; or add new nodes to the Kafka cluster and “rebalance” the partitions so that the new nodes act as replicas. The latter requires re-replicating the entire 1 TB partition, which is time-consuming, error-prone, and expensive in terms of network bandwidth and disk I/O. Moreover, taking partitions through such a re-replication is undesirable for applications with strict SLAs.
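To illustrate what “rebalancing” involves, the sketch below moves one partition onto a newly added broker using the Java AdminClient’s alterPartitionReassignments call (available in Kafka 2.4+). The broker address, broker IDs, and topic name are illustrative, and Kafka still has to copy the full partition contents to the new broker before the reassignment completes.

```java
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewPartitionReassignment;
import org.apache.kafka.common.TopicPartition;

import java.util.Arrays;
import java.util.Map;
import java.util.Optional;
import java.util.Properties;

public class ReassignPartition {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed broker

        try (AdminClient admin = AdminClient.create(props)) {
            // Move partition 0 of "events" onto brokers 1 and 4, where 4 is the
            // newly added node; the entire partition is re-replicated to broker 4.
            Map<TopicPartition, Optional<NewPartitionReassignment>> reassignment = Map.of(
                    new TopicPartition("events", 0),
                    Optional.of(new NewPartitionReassignment(Arrays.asList(1, 4))));
            admin.alterPartitionReassignments(reassignment).all().get();
        }
    }
}
```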
With Kafka, partition data must be re-replicated not only when scaling the cluster but also after failures such as replica failures, disk failures, or machine failures. If no failures occur in production, this drawback of Kafka is easy to overlook.
Figure 4. Pulsar’s shard-based storage
Pulsar storage: shard-based
In a shard-based storage architecture such as the one used by Apache Pulsar, partitions are further divided into shards, which are rolled over based on a preconfigured time or size limit. The shards are then distributed evenly across the bookies of the storage layer, providing both replication and scaling of the data.
When a bookie runs out of disk space and can no longer be written to, Kafka would have to re-replicate data; how does Pulsar handle this scenario? Because partitions are divided into shards, there is no need to copy the contents of an entire bookie onto a new one. Until a new bookie is added, Pulsar simply continues to accept new data shards and writes them to bookies whose storage is not yet full. When a new bookie is added, traffic ramps up on the new node immediately for new shards, with no need to re-replicate old data.
Figure 5. Pulsar failure and scaling
As shown in Figure 5, when bookie 4 stops accepting new message shards, shards 4-7 are routed to the other active bookies. Once a new bookie is added, new shards are automatically routed to it. Throughout the process, Pulsar keeps running and can continue to serve producers and consumers. In this respect, Pulsar’s storage system is more flexible and more scalable.