Author of this issue

Lu Zhijun

Senior data development engineer, data warehouse platform

01 From BitMap to RoaringBitMap

In the context of massive data, values need to be estimated, computed, and stored quickly at intermediate stages, and a family of data structures designed specifically for big data has emerged. HyperLogLog and BloomFilter, for example, can estimate a specified quantity quickly using very little storage. These are probability-based algorithms: they run fast, but they do not return an exact count.

BitMap solves exactly this kind of problem. It is a data structure that appeared very early in the data and search-engine fields. In Java, for example, BitSet can replace HashSet for exact deduplication of integers, and Redis offers SETBIT and GETBIT to operate on a BitMap directly, with the underlying implementation mapping each value to a 0|1 bit. However, a plain bitmap has two obvious problems:
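As a minimal illustration of the BitSet approach (standard JDK only; the ids here are made up for the example):

```java
import java.util.BitSet;

public class BitSetDedup {
    public static void main(String[] args) {
        int[] ids = {3, 7, 3, 42, 7, 100};   // raw stream with duplicates
        BitSet seen = new BitSet();
        for (int id : ids) {
            seen.set(id);                    // each bit position marks one id
        }
        // cardinality() counts set bits, i.e. distinct ids
        System.out.println(seen.cardinality()); // prints 4
    }
}
```

Unlike a HashSet of boxed Integers, each element costs a single bit rather than an object header plus a reference.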

1. The long[] array inside a BitSet is vector-like: it expands dynamically with the largest number held in the set. The array length is calculated as ((maxValue − 1) >> 6) + 1 words, so storing a single large value can immediately push memory usage into the hundreds of MB, and specifying too large a range (BitSet indices are ints, so a range approaching Integer.MAX_VALUE) can lead to OOM.

2. Take storing 4 billion values in a bitmap as an example. Covering the full 32-bit unsigned int range requires about 2^32 bit = 2^29 B = 2^9 MB = 512 MB of memory; when the data is sparse, that entire space must still be allocated, so the bitmap's storage efficiency is wasted.
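Both sizes above can be checked with a little arithmetic; the sketch below mirrors BitSet's internal long[] sizing (the one-billion value is an arbitrary example):

```java
public class BitmapSizing {
    public static void main(String[] args) {
        // BitSet backs its bits with a long[]; holding bit index i needs (i >> 6) + 1 words.
        int maxValue = 1_000_000_000;            // a single large value stored alone
        long words = (maxValue >> 6) + 1;        // long words that must be allocated
        System.out.println(words * 8 / (1024 * 1024) + " MB"); // ~119 MB for one set bit

        // Full 32-bit unsigned range: 2^32 bits = 2^29 bytes = 512 MB.
        long bits = 1L << 32;
        System.out.println(bits / 8 / (1024 * 1024) + " MB");  // 512 MB
    }
}
```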

To solve the problem that bitmaps are ill-suited to sparse data, researchers have proposed various algorithms that compress sparse bitmaps to reduce memory consumption and improve efficiency. Among the more representative are WAH[1], EWAH[2], Concise[3], and RoaringBitmap[4]. The first three are all based on run-length encoding (RLE); RoaringBitmap is an improvement on them and performs better, so this article focuses on it.

RBM currently has two versions, for storing 32-bit and 64-bit integers. Taking 32 bits as an example, RBM splits each value into 16 + 16 bits: the high 16 bits serve as the Container's position number, and the low 16 bits are stored inside that Container. For example, the number 31 goes into the Container at position 31 / 2^16 = 0. It is worth noting that a Container is allocated only when it is first needed, rather than all positions being initialized up front.
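The 16 + 16 split is just shifting and masking; a minimal sketch with plain Java:

```java
public class RoaringSplit {
    public static void main(String[] args) {
        int value = 31;
        int high = value >>> 16;        // container position number (high 16 bits)
        int low  = value & 0xFFFF;      // value stored inside that container
        System.out.println(high + " / " + low);                   // 0 / 31

        int big = 131_090;              // 2 * 65536 + 18
        System.out.println((big >>> 16) + " / " + (big & 0xFFFF)); // 2 / 18
    }
}
```

So 31 lands in container 0, while 131090 lands in container 2 as the low value 18.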

There are several Container types. BitmapContainer and ArrayContainer appeared when RBM was first proposed[4]:

- BitmapContainer, as the name suggests, stores a 2^16-bit BitMap holding up to 65536 values, with a fixed space overhead of 8 KB and O(1) access time.
- ArrayContainer exists for sparse storage: a sorted array of double-byte (unsigned short) values holding at most 2^16 / (2 × 8) = 4096 values, with a space overhead of (2 + 2 × c) B, where c is the cardinality, and O(log n) access time.
- RunContainer, proposed later[5], applies RLE compression to consecutive sequences: the sequence 11, 12, 13, 14, 15 is compressed to (11, 4), where 4 means that 4 consecutive numbers follow 11. In the extreme case of a fully continuous sequence, a single 4 B run suffices; but for an odd- or even-only sequence, RLE not only fails to compress but roughly doubles the size. Like ArrayContainer, RunContainer stores its RLE-compressed data in a variable-length unsigned short array; its space overhead depends on the number of runs r it stores, at (2 + 4 × r) B, with O(log n) access time.
- SharedContainer: when copying between RBMs it is often unnecessary to duplicate a Container; a SharedContainer can point to the actual Container, letting multiple RBMs share the same Container storage.
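The space formulas above can be compared directly; this sketch only evaluates the formulas as quoted (the two example sequences are made up):

```java
public class ContainerCost {
    // Space formulas quoted above, in bytes.
    static int arrayBytes(int c) { return 2 + 2 * c; }   // c = cardinality
    static int runBytes(int r)   { return 2 + 4 * r; }   // r = number of runs
    static final int BITMAP_BYTES = 8 * 1024;            // fixed 8 KB

    public static void main(String[] args) {
        // 5 consecutive values 11..15: one run beats an array of 5 shorts.
        System.out.println(arrayBytes(5) + " vs " + runBytes(1));       // 12 vs 6
        // 4096 even-only values: 4096 one-element runs, RLE doubles the size.
        System.out.println(arrayBytes(4096) + " vs " + runBytes(4096)); // 8194 vs 16386
    }
}
```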

When an element is inserted into a new Container at a given position number, an ArrayContainer is created by default. If a continuous run of elements is inserted, for example via the API's addRange method, a RunContainer is generated. For irregular sequences, RBM's runOptimize chooses based on the space the ArrayContainer and RunContainer would occupy after insertion. When an ArrayContainer grows past 4096 entries (beyond 8 KB), RBM automatically converts it to a BitmapContainer; conversely, when deletions bring a BitmapContainer below 4096 elements, RBM switches to an ArrayContainer or RunContainer based on runOptimize.
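The selection rules just described can be condensed into a toy decision function. This is a deliberately simplified sketch of the idea, not the library's actual runOptimize, which compares real serialized sizes:

```java
public class ContainerChooser {
    enum Kind { ARRAY, BITMAP, RUN }

    // Simplified decision following the rules above: arrays up to 4096 values,
    // otherwise whichever of the bitmap / RLE representations is smaller.
    static Kind choose(int cardinality, int runs) {
        int arrayBytes  = 2 + 2 * cardinality;
        int runBytes    = 2 + 4 * runs;
        int bitmapBytes = 8 * 1024;
        if (cardinality <= 4096 && arrayBytes <= runBytes) return Kind.ARRAY;
        return runBytes < bitmapBytes ? Kind.RUN : Kind.BITMAP;
    }

    public static void main(String[] args) {
        System.out.println(choose(5, 5));         // ARRAY: sparse, no useful runs
        System.out.println(choose(60000, 100));   // RUN: dense but highly continuous
        System.out.println(choose(60000, 30000)); // BITMAP: dense and fragmented
    }
}
```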

Thanks to this excellent storage and computing performance, RBM has earned a place in many big data open source projects, as shown in the following figure: Kylin and Doris use RBM to store cross-cycle global dictionaries, Iceberg uses RBM for exact field indexing, ClickHouse has designed dedicated RBM data types, and so on. The following sections describe RBM's data applications at Station B.

02 Application 1: User Access Tag Model Building

User/device access indicators are used very frequently, yet calculations of historical new users, retention, access frequency, MAU, and so on are complex and ever-changing. From a general construction standpoint the resource cost is huge, and access indicators for different time spans are scattered, which is unfriendly to users. A unified user access tagging scheme is therefore needed, one that can quickly produce arbitrary access statistics.

Taking retention computing as an example, there are several common scenarios in the industry.

2.1 Common access tagging scenarios

Solution 1: The most common approach is to join the access tables directly. As shown in the following figure, day, week, and month retention calculations each need to maintain their own set of repetitive logic.

Advantages: flexible and simple

Disadvantages: high development cost; for example, calculating next-month retention requires separately deduplicating this month's and the previous month's access users.

Solution 2: Implement retention calculations with an external engine, such as ClickHouse's built-in retention function.

Advantages: suitable for application-oriented scenarios that need retention indicators produced quickly

Disadvantages: data must first be synchronized from Hive to ClickHouse, after which it cannot be joined with other commonly used Hive tables, so this does not suit a general base model.

Solution 3: Convert access records into generic 0/1 bitmap access tokens, physically stored in an Array. Each daily log partition records a snapshot of the user's access over the past 90 days using two array elements: the first element stores access from yesterday back 30 days, and the second stores access from 31 to 90 days ago. In addition, a snapshot partition is kept every 60 days, alongside a daily date partition for each of the last 60 days. Queries over the past 90 days can read the most recent log partition directly; to query the full history, one only needs to stitch together the second array element from each 60-day historical snapshot partition, which fully preserves all historical access records. Taking 3599 days (roughly ten years) of storage as an example, only 59 + 59 date partitions are needed (59 × 60 + 59 days), while exactly 3600 days needs only 60 date partitions.

On top of this storage, a series of customized UDFs can be built to calculate the corresponding access indicators, as shown in the following figure:

Advantages: query speed is extremely fast. For example, checking for any access in the past 30 days only requires testing whether the first array element is greater than 0, and checking access on a specific day within 90 days only requires a bit operation on the array element covering that day.

Disadvantages: the generic bitmap still leaves considerable room for compression, querying access further back than 90 days is more complicated, and stitching together snapshots from different partitions is cumbersome.
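A minimal sketch of the array layout and the bit tests mentioned above, using a long for each window (the layout details here are an assumption chosen for illustration: bit i of window1 means access i+1 days ago, bit j of window2 means access 31+j days ago):

```java
public class AccessSnapshot {
    public static void main(String[] args) {
        long window1 = 0L;            // last 30 days (bits 0..29)
        long window2 = 0L;            // days 31..90 ago (bits 0..59)
        window1 |= 1L << (3 - 1);     // accessed 3 days ago
        window2 |= 1L << (45 - 31);   // accessed 45 days ago

        System.out.println(window1 != 0);                 // any access in last 30 days? true
        System.out.println((window1 & (1L << 2)) != 0);   // accessed exactly 3 days ago? true
        System.out.println((window2 & (1L << 14)) != 0);  // accessed 45 days ago? true
    }
}
```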

2.2 Access tagging scheme based on RBM structure

When designing a retention table, a conventional scheme must precompute the corresponding retention indicators in advance, and adding or changing an indicator requires extra work. With RBM, all historical access records can be packed directly into RBM structures and stored, then queried out of the box with pre-built UDFs. Compared with Solution 3 above, because the full access history is stored there is no stitching problem, so one can compute not only retention for any historical day, but also new users, DAU, MAU, access frequency, and more for any day in history.
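The core of any of these retention calculations is a bitmap intersection. A minimal sketch with java.util.BitSet standing in for an RBM (an RBM's and/andCardinality plays the same role; the ids are made up):

```java
import java.util.BitSet;

public class RetentionCalc {
    public static void main(String[] args) {
        // User-id bitmaps for two days.
        BitSet day1 = new BitSet();
        BitSet day2 = new BitSet();
        for (int id : new int[]{1, 2, 3, 4, 5}) day1.set(id);
        for (int id : new int[]{2, 4, 5, 9})    day2.set(id);

        BitSet retained = (BitSet) day1.clone();
        retained.and(day2);                       // users active on both days

        double rate = (double) retained.cardinality() / day1.cardinality();
        System.out.println(rate);                 // 3 retained out of 5 -> 0.6
    }
}
```

Any-span retention falls out of the same operation: pick the two day-bitmaps, intersect, divide.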

A common access tag model can be defined as follows:

This model can be implemented in several ways. At the Hive level it requires a full join between the T-2 partition data and the T-1 incremental data; with an engine such as Iceberg or Hudi, incremental updates can be applied directly, which is faster.
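Logically, the daily merge is a single OR: yesterday's full tag bitmap combined with the day's incremental visitors. A sketch with BitSet standing in for an RBM (the ids are made up):

```java
import java.util.BitSet;

public class TagMerge {
    public static void main(String[] args) {
        BitSet fullTag = new BitSet();
        for (int id : new int[]{10, 20, 30}) fullTag.set(id);   // T-2 full snapshot
        BitSet increment = new BitSet();
        for (int id : new int[]{30, 40})     increment.set(id); // T-1 incremental visitors

        fullTag.or(increment);                     // an RBM "or" plays the same role
        System.out.println(fullTag.cardinality()); // 4 distinct users total
    }
}
```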

The following is a series of UDFs developed for operating on RBMs, making such calculations available out of the box.

Building a unified access tag model on RBM quickly converges the repeated read/write logic around access tables onto one dataset, ensures a consistent calculation caliber, shortens the overall data link, and greatly improves data-retrieval efficiency. In total, 7 downstream core tables were converged and more than 30 related derivative models were taken offline, saving unnecessary storage and computing overhead.

03 Application 2: Fast crowd package synchronization

Tipan is the platform Station B uses to create and manage user label profiles and to generate crowds for analysis and application. RBM has a number of uses in it, such as storing enumerated labels, accelerating intersection and difference calculations between labels, and storing the crowd packages produced by business-side audience selection for subsequent analysis. The following figure shows the application of RBM crowd packages in event analysis:

The generation of an RBM crowd starts from the original Hive table, with the normal link shown in the figure below. First the Hive tag table (user ID + tag ID granularity) is synchronized to the ClickHouse cluster; for enumeration-type tags, a corresponding crowd array is created for each enumeration value; then the corresponding RBM table is created through a ClickHouse materialized view (the RBM value is binary in ClickHouse and is converted to base64 in the figure for display). Since the raw S0 data volume is often very large, with a single label value covering over a billion users, this not only wastes substantial resources on data synchronization but also demands very high ClickHouse cluster resources.

If RBM crowd packages could be created directly offline, the link could be greatly shortened and intermediate computing resources saved. The new solution requires generating the Hive version of the RBM crowd directly in the offline engine, then going straight from S0 –> S3 through a synchronization task. UDFs on the offline engines are mainly implemented in Java, while ClickHouse is implemented entirely in C++, so the RBMs created offline must be compatible with ClickHouse.

Taking 64 bits as an example, ClickHouse after version 21.1 uses the roaring64map in CRoaring[6] to implement RBM. Its core serialized storage is formatted as: type flag (Byte), data length (VarInt), high 32-bit cardinality (Long), and the ByteArray of actual data (RoaringBitmap). The first part uses 1 byte to distinguish whether the underlying storage is a SmallSet or an RBM (a SmallSet holds at most 32 values): when the cardinality is below 32, the flag is 0 and a SmallSet is used; otherwise the flag is 1 and an RBM is used. The second part stores the length of the ByteArray; the third part records the high 32-bit cardinality; the fourth part, the ByteArray, holds the actual data.
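The four-part layout can be sketched byte by byte. This is only an illustration of the field order described above: the payload bytes are placeholders rather than a real serialized bitmap, and details such as endianness and the exact varint encoding are simplifying assumptions here (DataOutputStream writes big-endian, for instance):

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;

public class CkRbmLayout {
    // Minimal LEB128-style varint for the length field (an assumption for this sketch).
    static void writeVarInt(DataOutputStream out, long v) throws IOException {
        while ((v & ~0x7FL) != 0) { out.writeByte((int) ((v & 0x7F) | 0x80)); v >>>= 7; }
        out.writeByte((int) v);
    }

    public static void main(String[] args) throws IOException {
        byte[] payload = {1, 2, 3};              // placeholder RoaringBitmap bytes
        long highCardinality = 1;                // number of high-32-bit buckets

        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(buf);
        out.writeByte(1);                        // part 1: flag, 1 = RBM (0 = SmallSet)
        writeVarInt(out, payload.length);        // part 2: ByteArray length
        out.writeLong(highCardinality);          // part 3: high 32-bit cardinality
        out.write(payload);                      // part 4: actual bitmap bytes

        System.out.println(buf.size());          // 1 + 1 + 8 + 3 = 13 bytes
    }
}
```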

A common Java implementation library for RBM is RoaringBitmap[8], used in systems such as Spark, Kylin, and Druid. Its Roaring64NavigableMap class is the main way to operate on 64-bit RBMs; it uses a TreeMap keyed on the high 32 bits for multi-level storage and fast positioning, structured as follows:

Analysis of the code shows that the Roaring64NavigableMap serialization consists of a signed-long flag (Byte) + the high 32-bit cardinality (Integer) + the actual data ByteArray (RoaringBitmap), with the actual-data portion serialized identically to ClickHouse's. Once the similarities and differences between the two are known, the structural transformation is easy to achieve by adding and removing fields; combined with the Spark serialization tool Kryo 5, a UDAF that converts Hive detail crowds into ClickHouse-compatible RBMs can be implemented quickly, allowing RBM crowd packages to be created directly in the offline data warehouse.
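The "adding and removing fields" can be sketched as a pure byte-level re-wrap: strip the Java-side header, keep the payload, and prepend the ClickHouse-side header. Everything here is a simplified illustration under assumptions (placeholder payload bytes, big-endian fields, a LEB128-style varint); a production UDAF would work on real serialized bitmaps:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;

public class JavaToCkRbm {
    static void writeVarInt(DataOutputStream out, long v) throws IOException {
        while ((v & ~0x7FL) != 0) { out.writeByte((int) ((v & 0x7F) | 0x80)); v >>>= 7; }
        out.writeByte((int) v);
    }

    // Re-wrap a Java-style serialization (signed flag + int cardinality + bytes)
    // into the ClickHouse-style layout (flag + varint length + long cardinality + bytes).
    static byte[] toCkFormat(byte[] javaSerialized) throws IOException {
        DataInputStream in = new DataInputStream(new ByteArrayInputStream(javaSerialized));
        in.readBoolean();                          // signed-long flag, dropped
        long highCard = in.readInt() & 0xFFFFFFFFL; // high 32-bit cardinality, widened
        byte[] payload = in.readAllBytes();        // RoaringBitmap bytes, kept as-is

        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(buf);
        out.writeByte(1);                          // always RBM storage in this sketch
        writeVarInt(out, payload.length);
        out.writeLong(highCard);
        out.write(payload);
        return buf.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        // Fake Java-side bytes: flag 0 + cardinality 1 + 3 placeholder payload bytes.
        byte[] javaBytes = {0, 0, 0, 0, 1, 9, 9, 9};
        System.out.println(toCkFormat(javaBytes).length);  // 1 + 1 + 8 + 3 = 13
    }
}
```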

As a special reminder, the high 32-bit cardinality in Roaring64NavigableMap is exposed only through a private method that cannot be called directly from outside; it can be obtained via Java reflection, for example:
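The reflection pattern looks like the following. The Holder class and its highCardinality method here are stand-ins so the sketch is self-contained; in practice, substitute Roaring64NavigableMap and its actual private accessor:

```java
import java.lang.reflect.Method;

public class ReflectPrivate {
    // Stand-in for a library class with a private accessor.
    static class Holder {
        private long highCardinality() { return 42L; }
    }

    public static void main(String[] args) throws Exception {
        Holder h = new Holder();
        Method m = Holder.class.getDeclaredMethod("highCardinality");
        m.setAccessible(true);                 // lift the private restriction
        long card = (long) m.invoke(h);
        System.out.println(card);              // 42
    }
}
```

Note that setAccessible may require an --add-opens flag on newer JDKs when the target class lives in another module.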

ClickHouse local crowd table creation code example:

idx_bitmap uses ClickHouse's materialized-column feature and is generated automatically from the serialized storage field synchronized from Hive. Taking a crowd package of one billion users as an example, the new scheme shortens the end-to-end link time by 85% (50 min -> 8 min). Most of the remaining overhead lies in the UDAF converting detail label records into ClickHouse-compatible RBMs; the data synchronization itself now takes less than half a minute, freeing up a large amount of resources.

04 Summary

This article has explained the basic principles and performance characteristics of RBM, compared it with the original BitMap, and shown how several big data application scenarios at Station B were reworked around it to reduce cost, improve efficiency, and make downstream use more convenient. There are still many directions worth pursuing in big data, such as optimizing Redis bitmaps with RBM, improving the performance of Flink's deduplication state storage and computation with RBM, and using RBM for existence checks on real-time feature populations. I hope this article offers a new idea for solving similar problems; discussion is welcome at any time.

That's all for today's sharing. If you have any thoughts or questions, please tell us in the comments, and if you enjoyed this issue, please give us a thumbs up!

Reference Articles:

[1] Wu K, Stockinger K, Shoshani A. Breaking the curse of cardinality on bitmap indexes. SSDBM’08, Springer: Berlin, Heidelberg, 2008; 348–365.

[2] Lemire D, Kaser O, Aouiche K. Sorting improves word-aligned bitmap indexes. Data & Knowledge Engineering 2010; 69(1):3–28, doi:10.1016/j.datak.2009.08.006. 

[3] Colantonio A, Di Pietro R. Concise: Compressed ’n’ Composable Integer Set. Information Processing Letters 2010; 110(16):644–650, doi:10.1016/j.ipl.2010.05.018.

[4] Chambi S, Lemire D, Kaser O, et al. Better bitmap performance with Roaring bitmaps. Software: Practice and Experience 2016; 46(5):709–719.

[5] Lemire D, Ssi-Yan-Kai G, Kaser O. Consistently faster and smaller compressed bitmaps with Roaring. Software: Practice and Experience 2016; 46(11):1547–1569.