Several recent projects have used Elasticsearch (ES) to store, search, and analyze data, and I learned a few things about ES along the way. This article is compiled from my own technical sharing.

This article will not focus on ES's distributed technologies and related APIs, but on the topic of "how ES retrieves data so quickly". This is also the part of ES that interested me most before I studied it.

This article broadly covers the following:

About search

A closer look at the inverted index:

  • What an inverted index looks like (postings list → term dictionary → term index)
  • Some tricks about the postings list (FOR, Roaring Bitmaps)
  • How to quickly do a union query?

Let's imagine a search scenario: suppose we want to find ancient poems whose verses contain the word "before".

What would be the difference between a traditional relational database and an ES implementation? If we used an RDBMS like MySQL to store the poems, we would query with SQL like this:
select name from poems where content like "%before%";

This approach, which we call sequential scanning, must traverse all records to find matches. Not only is it inefficient, it also falls short of what we expect from search.

For example, when we search for a keyword like "ABCD", we usually also want to see results for "A", "AB", "CD", and "ABC". Hence professional search engines, such as today's protagonist, ES.

Search engine principles

The search principle of a search engine can be briefly summarized in the following steps:

  • Content crawling and stop-word filtering, i.e. removing function words that carry little meaning, such as "of" and "the"
  • Content segmentation: tokenizing the content and extracting keywords
  • Building an inverted index from the keywords
  • Users entering keywords to search

Here we introduce a concept that is the focus of today's analysis and also a core piece of ES knowledge.
If you understand ES, you know that ES can be seen as a wrapper around Lucene: its inverted index is implemented through the API provided by the Lucene jar. So the following content about inverted indexes is really about Lucene.

Inverted index

First of all, don't forget the search requirement mentioned earlier; let's see what that query becomes once an inverted index is built.
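A minimal sketch of such an inverted index for the poem example (poem titles stand in here for document ids):

term      →  postings list
before    →  [Quiet Night Thoughts, Viewing the Waterfall at Mount Lu]
moonlight →  [Quiet Night Thoughts]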

This way, as soon as we enter "before", we can use the inverted index to locate the matching poems directly.
Of course, this is only a plain-language description of how an inverted index works. What this inverted index actually is in ES, how it is stored, and so on: those details are the essence of the inverted index.

(1) Several concepts

Before going further, a few concepts need to be introduced.

Term: "keyword" is my own informal word for it; in ES, a keyword is called a term.

Postings list: using the example above, {Quiet Night Thoughts, Viewing the Waterfall at Mount Lu} is the list corresponding to the term "before". In ES, a postings list is described as the collection of the IDs of all documents that contain a particular term.

Because integers can be compressed very efficiently, they are the most suitable unique identifiers for documents in a postings list, so ES converts each stored document into a unique integer id.

About the range of this id: when storing data, within each shard ES writes the data into segments, a unit smaller than the shard, and these segments are merged periodically.

Each segment holds at most 2^31 documents, and each document in it is assigned a unique id from 0 to (2^31)-1.
The terms above follow the descriptions in the official ES documentation; the sources can be found in the reference material at the end.
(2) The internal structure of the index

The inverted index described above is only a very rough model; to actually use it in production, it is of course far from enough.
In real production scenarios, such as log analysis, ES's most common use case, how many terms come out once the log content is tokenized?
And how do you quickly find the right term among that huge number of terms? Scanning through them one by one is obviously unrealistic.

Term dictionary: hence the term dictionary. To find terms quickly, ES keeps all terms sorted and locates them with binary search, as sketched below.
Doesn't that feel a little familiar? It's just like MySQL's indexing approach, which builds a B+ tree over the data as a dictionary pointing to the indexed rows.
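A minimal sketch of the sorted-dictionary idea (an illustration only, not Lucene's actual API):

import java.util.Arrays;

public class TermDictionary {
    private final String[] sortedTerms;          // all terms, kept in sorted order

    public TermDictionary(String[] terms) {
        sortedTerms = terms.clone();
        Arrays.sort(sortedTerms);
    }

    // Binary search: O(log n) comparisons instead of a full scan.
    public boolean contains(String term) {
        return Arrays.binarySearch(sortedTerms, term) >= 0;
    }
}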

Term index: but then a question arises: where should the term dictionary be kept? In memory, surely, since disk IO is far too slow (MySQL likewise tries to keep its indexes in memory).

But what happens if you put the entire term dictionary in memory? The memory blows up.
Don't forget that ES indexes all text fields by default, which would inevitably consume huge amounts of memory, so ES has deeply optimized its indexes:
it cuts the memory footprint as much as possible while preserving execution efficiency. Hence the term index.
Term index: as a data structure, it is a trie, what we often call a dictionary tree.
This is a structure that specializes in string matching: it solves the problem of quickly finding a string within a set of strings.
The tree does not contain every term; it contains only prefixes of terms (exactly the trie's use case: shared prefixes).
The term index quickly locates an offset into the term dictionary, and the lookup then proceeds sequentially from there.
It is like using an English dictionary: instead of reading from the first page, we jump to the first word starting with "S", or with "Sh", and scan onward from there.
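A toy prefix trie to illustrate the lookup idea (hypothetical; Lucene's real term index is an FST, which is far more compact):

import java.util.HashMap;
import java.util.Map;

public class TermIndex {
    private static class Node {
        Map<Character, Node> children = new HashMap<>();
        int dictOffset = -1;   // offset into the sorted term dictionary, -1 if none
    }

    private final Node root = new Node();

    // Register a prefix together with the dictionary offset where it starts.
    public void addPrefix(String prefix, int dictOffset) {
        Node n = root;
        for (char c : prefix.toCharArray()) {
            n = n.children.computeIfAbsent(c, k -> new Node());
        }
        n.dictOffset = dictOffset;
    }

    // Longest-prefix match: returns the offset to start scanning the dictionary from.
    public int locate(String term) {
        Node n = root;
        int best = 0;   // fall back to the start of the dictionary
        for (char c : term.toCharArray()) {
            n = n.children.get(c);
            if (n == null) break;
            if (n.dictOffset >= 0) best = n.dictOffset;
        }
        return best;
    }
}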
Lucene also makes two optimizations here. First, the term dictionary is stored on disk in blocks, and each block is compressed with a shared prefix; words beginning with "ab", for example, can omit the "ab".
Second, the term index is held in memory as an FST (finite state transducer).
An FST has two advantages:

  • Small footprint: storage space is compressed by reusing the prefixes and suffixes of the words in the dictionary.
  • Fast queries: O(len(str)) lookup time complexity.

The theory behind FSTs is fairly involved and this article will not go into the details; for extended reading:

https://www.shenyanchao.cn/blog/2018/12/04/lucene-fst/
OK, now we have a rough idea of what a Lucene inverted index looks like.

Some tricks about the postings list

In actual use, the postings list still needs to solve several pain points:

  • If not compressed, a postings list can take up a lot of disk space.
  • Under combined queries, how do we quickly compute intersections and unions?

Some may feel compression is unnecessary: "doesn't the postings list already store nothing but document IDs?" But if a postings list holds millions of doc IDs, say when querying ancient poems by dynasty, compression is necessary. As for why intersections and unions matter: ES is built specifically for search, so combined query needs (AND, OR) are everywhere. Following the outline above, let's look at compression first.

(1) Compression

Frame of Reference: in Lucene, a postings list is required to be a sorted integer array.

This has the benefit of enabling compression by delta encoding.

For example, take the ID list [73, 300, 302, 332, 343, 372]. Converting each id into its increment over the previous id (the first id's predecessor defaults to 0, so its increment is itself) gives the list [73, 227, 2, 30, 11, 29].
In this new list all values are less than 255, so each ID needs only one byte of storage.
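A minimal delta-encoding sketch (assuming, as Lucene requires, that the ids are already sorted):

class DeltaEncoding {
    // Delta-encode a sorted id list: store each id as its gap from the previous one.
    static int[] deltaEncode(int[] sortedIds) {
        int[] deltas = new int[sortedIds.length];
        int prev = 0;                            // the first id's predecessor is 0
        for (int i = 0; i < sortedIds.length; i++) {
            deltas[i] = sortedIds[i] - prev;
            prev = sortedIds[i];
        }
        return deltas;
    }
}
// deltaEncode(new int[]{73, 300, 302, 332, 343, 372}) -> [73, 227, 2, 30, 11, 29]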

Actually, ES handles this even more finely:


It splits all the documents into blocks, each containing exactly 256 ids, and then delta-encodes each block separately.

It computes the maximum number of bits needed to hold any id in the block, and places that bit count in front of the block as a header. This technique is called Frame of Reference (FOR).
The official ES blog illustrates it with an example that uses blocks of only 3 ids instead of 256.

The steps of FOR can thus be summarized as: delta-encode the sorted ids, split them into fixed-size blocks, and bit-pack each block behind a header recording its bit width, as in the sketch below.
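A simplified sketch of the per-block header computation (real Lucene blocks hold 256 ids and use heavily optimized bit-packing):

class FrameOfReference {
    // Bit width of the widest delta in one block; written before the block as its header.
    static int bitsRequired(int[] blockDeltas) {
        int max = 1;                             // use at least 1 bit per value
        for (int d : blockDeltas) max = Math.max(max, d);
        return 32 - Integer.numberOfLeadingZeros(max);
    }
}
// Block [73, 227, 2] -> header 8 (227 needs 8 bits); block [30, 11, 29] -> header 5.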

After this final bit packing, the integer array's storage widths expand from 4 fixed sizes (8, 16, 32, 64 bits) to 64 sizes (1 to 64 bits).
In this way the space consumed by postings lists is greatly reduced and query performance improves. ES does still more, however, to speed up filter queries: caching.

Roaring Bitmaps (for the filter cache): in ES you can use filters to optimize a query. A filter query only decides whether a document matches; no scoring is involved, so the query results can be cached.

For filter queries, ES provides a dedicated cache, the filter cache, which stores the result sets produced by filters.
A cached filter does not need much memory: it holds just one piece of information, namely which documents match the filter. And it can be reused by other queries, which greatly improves query performance.
The Frame of Reference compression described above works well for on-disk postings lists, but not so well for a filter cache that must live in memory.

The filter cache stores frequently used data, and its entire purpose is to speed up processing, so it demands a compression scheme whose operations stay fast.
For these postings lists ES therefore uses a different compression method. Let's build it up step by step. We know a postings list is a space-compressed integer array.
Given such an array, what is the first compression idea that comes to mind? One bit per document: what we usually call a bitmap. For example, the id list [1, 3, 4, 7, 10] becomes the bitmap [0, 1, 0, 1, 1, 0, 0, 1, 0, 0, 1].

Bitmaps are often used as indexes in databases, query engines, and search engines, because bit operations such as AND (intersection) and OR (union) can be parallelized and run very efficiently.
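A small sketch with the JDK's BitSet (a plain, uncompressed bitmap):

import java.util.BitSet;

public class BitmapDemo {
    public static void main(String[] args) {
        BitSet a = new BitSet(), b = new BitSet();
        for (int id : new int[]{1, 3, 4, 7, 10}) a.set(id);  // one bit per doc id
        for (int id : new int[]{3, 7, 8}) b.set(id);

        BitSet and = (BitSet) a.clone();
        and.and(b);                                  // intersection: {3, 7}
        BitSet or = (BitSet) a.clone();
        or.or(b);                                    // union: {1, 3, 4, 7, 8, 10}
        System.out.println(and + " " + or);
    }
}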

However, bitmaps have a significant drawback: whatever the actual number of elements, the memory they occupy is constant, determined only by the id range.
That is, they are unsuitable for sparse data. The industry has many mature compression schemes for sparse bitmaps, and Lucene uses Roaring Bitmaps.

Let me describe the compression process in a simple way:

Split each doc id into its high 16 bits and its low 16 bits. Group by the high bits (the high bits are the key; the value is the array of all low-bit values sharing those high bits), and store each group in a different container (data structure) depending on how much low-bit data it holds:

  • len < 4096: ArrayContainer, which stores the values directly
  • len >= 4096: BitmapContainer, which stores them as a bitmap

Where the dividing line comes from: the low 16 bits can take at most 2^16 = 65536 values. Bitmap storage therefore always needs 65536 bits = 8 KB, while direct storage costs 2 bytes per value, so 4096 values cost 2 bytes × 4096 = 8 KB.

Below 4096 values, then, storing them directly is the more space-efficient choice.

The space savings show up mainly in:

  • High-bit aggregation (assuming 1,000,000 ids in the data share the same high 16 bits, they originally cost 1,000,000 × 2 bytes; now a single 2-byte key suffices).
  • Low-bit compression (choosing the cheaper container, as above).

The drawback is some impact on the speed of bitwise operations compared with a native bitmap. That's the trade-off: the art of balance. A sketch of the container choice follows.
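A hypothetical sketch of the container choice (real implementations such as the RoaringBitmap library are far more elaborate):

import java.util.*;

class RoaringSketch {
    static final int THRESHOLD = 4096;       // 4096 * 2 bytes == 65536 bits == 8 KB

    // Group ids by their high 16 bits, then pick a container per group.
    static Map<Integer, Object> build(int[] docIds) {
        Map<Integer, List<Integer>> groups = new TreeMap<>();
        for (int id : docIds) {
            groups.computeIfAbsent(id >>> 16, k -> new ArrayList<>()).add(id & 0xFFFF);
        }

        Map<Integer, Object> containers = new TreeMap<>();
        for (Map.Entry<Integer, List<Integer>> e : groups.entrySet()) {
            List<Integer> lows = e.getValue();
            if (lows.size() < THRESHOLD) {
                containers.put(e.getKey(), lows);            // "ArrayContainer"
            } else {
                BitSet bits = new BitSet(1 << 16);           // "BitmapContainer"
                for (int low : lows) bits.set(low);
                containers.put(e.getKey(), bits);
            }
        }
        return containers;
    }
}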

(2) Combined queries

With compression covered, let's talk about combined queries. First, if the query's filter is cached, the filter cache is used for the computation directly; that is, the AND or OR is done on bitmaps.

If the query's filter is not cached, skip lists are used to traverse the postings lists on disk.

Suppose we have three postings lists and need to merge them with an AND relation to get their intersection.
First pick the shortest postings list, then look up each of its elements in the other two lists one by one; whatever appears in all of them is the intersection.

The traversal can skip elements along the way: for example, once we have advanced to 13 in one list, any elements smaller than 13 in the other lists can be skipped outright, because they cannot possibly match.
Skip lists bring one more benefit. Remember from earlier that the postings list is stored on disk FOR-encoded:
all documents are split into blocks of exactly 256, each block delta-encoded and bit-packed behind a header that records the block's bit width.
Because this FOR encoding has a decompression cost, skip lists save more than traversal: they also skip the decompression of those blocks, saving CPU. A sketch of the idea:
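A simplified sketch of skip-style intersection of two sorted postings lists (Lucene's real skip lists additionally jump over whole compressed blocks):

import java.util.ArrayList;
import java.util.List;

class PostingsIntersection {
    // Walk the shortest list; in the other list, skip past everything smaller
    // than the current target instead of comparing element by element.
    static List<Integer> intersect(int[] shortest, int[] other) {
        List<Integer> result = new ArrayList<>();
        int j = 0;
        for (int target : shortest) {
            while (j < other.length && other[j] < target) j++;   // the "skip"
            if (j == other.length) break;
            if (other[j] == target) result.add(target);
        }
        return result;
    }
}
// intersect(new int[]{3, 13, 101}, new int[]{2, 3, 5, 13, 97, 101}) -> [3, 13, 101]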

Let's make a technical summary:

(1) To locate target documents quickly, ES uses the inverted index. Its space consumption is relatively large, but search performance improves enormously.
(2) To locate a single term quickly among a huge number of terms, while saving memory and reducing disk IO reads, Lucene layers the inverted index as "term index → term dictionary → postings list", and the FST-compressed term index fits in memory, further improving search efficiency.
(3) To reduce the disk footprint of postings lists, Lucene compresses them with FOR (Frame of Reference), with clear gains.
(4) ES's filter queries cache their results with Roaring Bitmaps, keeping high-frequency filter queries fast while limiting storage consumption.
(5) In combined queries, when the filter is cached, the native properties of bitmaps are used to compute intersections and unions quickly; otherwise skip lists intersect and union the postings lists, skipping both the traversal cost and the CPU cost of decompressing some of the data.

Elasticsearch's indexing philosophy is to move the contents of the disk into memory as far as possible, to reduce random disk reads (while also exploiting the disk's sequential-read strengths), and to combine this with a variety of compression algorithms, treating memory with an extremely exacting attitude.
Therefore, it is important to note that:

  • Fields that do not need indexing must be declared explicitly, because everything is indexed automatically by default.
  • Likewise, String-type fields that do not need analysis must be declared explicitly, because by default they are analyzed too (see the mapping sketch below).
  • Choose regular IDs; IDs that are too random (such as Java's UUID) are bad for querying.
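For example, an ES index mapping along these lines (index and field names are hypothetical) makes those choices explicit: content is analyzed full text, dynasty is a keyword matched exactly without analysis, and raw_blob is stored but never indexed:

PUT /poems
{
  "mappings": {
    "properties": {
      "content":  { "type": "text" },
      "dynasty":  { "type": "keyword" },
      "raw_blob": { "type": "text", "index": false }
    }
  }
}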

Finally, technology selection always goes hand in hand with the business scenario. Every database has its own problems to solve (or areas of expertise) and its corresponding data structures, and different usage scenarios and data structures call for different indexes to get the most out of query speed.
This article covered how Lucene implements the inverted index: counting every byte of memory and disk, and accelerating processing with clever bit manipulation.
But step back and compare with MySQL, and you will find that although both are indexes, the implementations are completely different. Broadly speaking, a B+ tree index is a structure optimized for writes.
When fast updates are not required, pre-sorting and similar techniques can be traded for smaller storage and faster retrieval, at the cost of slow updates, which is exactly the bargain ES makes.

Author: Richard_Yi

Editor: Tao Jialong

Source: juejin.cn/post/6889020742366920712
