Source: https://ricstudio.top/archives/es-lucene-reverted-index
“All problems in computer science can be solved by another level of indirection.”
– David J. Wheeler
“The computer world is the art of trade-off.”

| Foreword
Several recent projects of mine have used Elasticsearch (ES) to store, search, and analyze data, and I have picked up some knowledge of ES along the way. This article is compiled from my own technical sharing.
It will not focus on ES's distributed features or its APIs, but on the topic of “how ES retrieves data so quickly”. This was also the part of ES I was most curious about before studying it.
This article broadly covers the following:
- A detailed look at inverted indexes
- What an inverted index actually looks like (postings list -> term dictionary -> term index)
- Some tricks about postings lists (FOR, Roaring Bitmaps)
- How to do union queries quickly
Second, about search
First, imagine a search scenario: suppose we want to find an ancient poem whose verse contains the word “before”.

What is the difference between a traditional relational database and an ES implementation?
If we use an RDBMS such as MySQL to store the poems, the query would look like this:
select name from poems where content like "%before%";
This approach, called a sequential (full) scan, has to traverse every record to find a match.
Not only is it inefficient, it also fails to meet our expectations when searching: for example, when we search for a keyword like “ABCD”, we usually expect to also see results for “A”, “AB”, “CD”, “ABC”.
So there are professional search engines, such as our protagonist today – ES.
Search engine principle
Simply summarized, the search process of a search engine can be divided into these steps:

- Content crawling and stop-word filtering (removing useless function words and connectives such as “of”, “the”, etc.)
- Content segmentation (tokenization) to extract keywords
- Building an inverted index from the keywords
- Users entering keywords to search
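As a toy illustration of the steps above, here is a minimal sketch in Python (the stop-word set, tokenizer, and sample documents are all invented for illustration; real engines use proper analyzers):

```python
# Invented stop words for illustration; real analyzers have full lists.
STOP_WORDS = {"the", "of", "a", "at", "my"}

def tokenize(text):
    # Naive whitespace segmentation plus stop-word filtering.
    return [w for w in text.lower().split() if w not in STOP_WORDS]

def build_inverted_index(docs):
    # Map each term to the sorted list of integer doc ids containing it.
    index = {}
    for doc_id, text in enumerate(docs):
        for term in set(tokenize(text)):
            index.setdefault(term, []).append(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}

docs = ["the moonlight before my bed",
        "gazing at the waterfall of Mount Lu"]
index = build_inverted_index(docs)
print(index["before"])  # -> [0]: only the first poem contains "before"
```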
Here we introduce a concept that is also the focus of today's dissection: the inverted index. It is the core knowledge point of ES.
If you know ES, you know that it is essentially a wrapper around Lucene; its inverted index is implemented through the API provided by the Lucene jar package. So the following content about inverted indexes is really about Lucene.
Third, the inverted index
First, don't forget the search requirement mentioned earlier. Let's see what the query above becomes after an inverted index is built.
This way, as soon as we enter “before”, the inverted index lets us directly locate the poems that match the query.
Of course, this is just a plain-language description of how an inverted index works. What exactly this inverted index is in ES and how it is stored are the real essence of the inverted index.
1. A few concepts
Before proceeding, a few terms need to be described.
Term
“Keyword” is my own way of saying it; in ES, a keyword is called a term.
Postings list
Continuing the example above, {Quiet Night Thoughts, Viewing the Waterfall at Mount Lu} is the list corresponding to the term “before”. In ES this is described as the collection of IDs of all documents that contain a particular term. Because integers can be compressed very efficiently, integers are the most suitable unique document identifiers to place in the postings list, so ES converts each stored document into a unique integer id.
About this id range: when storing data, within each shard ES stores the data in segments, which are smaller units than shards and are merged periodically. Each segment holds up to 2^31 documents, and each document is assigned a unique id from 0 to (2^31)-1.
These terms are all described in the official ES documentation; the sources can be found in the references at the end.
2. Term dictionary
The inverted index described above is just a very rough model; to use it in real production, much more is needed.
In actual production scenarios, such as log analysis (the most common use of ES), how many terms result from tokenizing the log content?
So how do you quickly find a given term among a huge number of terms? Scanning through them all is obviously unrealistic.
To find terms quickly, ES sorts all the terms so a binary search can be used. Feel familiar? Isn't this just like MySQL's indexing method, which uses a B+ tree to build a dictionary pointing at the indexed data? This sorted list of terms is the term dictionary.
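As a sketch of binary search over a sorted term dictionary (the sample terms are invented; Lucene's real term dictionary sits in prefix-compressed blocks on disk):

```python
import bisect

# Invented sample terms; a real term dictionary holds vastly more.
term_dict = sorted(["apple", "banana", "cherry", "search", "shard", "term"])

def find_term(terms, target):
    # Binary search: O(log n) comparisons instead of a full scan.
    i = bisect.bisect_left(terms, target)
    return i if i < len(terms) and terms[i] == target else -1

print(find_term(term_dict, "shard"))  # -> 4
```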
Term index
But the question arises again: where should the term dictionary live? In memory, surely? Disk IO is so slow; this is also why MySQL keeps hot index pages cached in memory.
But what happens if you put the entire term dictionary in memory?
Don't forget that ES indexes all text fields by default, which would inevitably consume a huge amount of memory, so ES has deeply optimized its indexing: while preserving execution efficiency, it reduces memory usage as much as possible.
So there is the term index.
As a data structure, the term index is a “trie”, what we often call a dictionary tree: a structure specialized for string matching that solves the problem of quickly finding a string within a collection of strings.
This tree does not contain all terms; it contains only some prefixes of terms (exactly the use case of tries: shared prefixes). The term index lets you quickly locate an offset into the term dictionary, from which you then search sequentially, just as the figure on the right shows. It is like looking up a word in an English dictionary: you jump to the first word starting with “s”, or the first starting with “sh”, and scan onward from there.
Lucene makes two further optimizations here. First, the term dictionary is stored on disk in blocks, and each block is compressed by its common prefix; for example, words starting with “ab” can omit the “ab”. Second, the term index is kept in memory as an FST (finite state transducer).
An FST has two advantages:
- Small footprint: by reusing word prefixes and suffixes in the dictionary, it compresses the storage space.
- Fast queries: O(len(str)) query time complexity.
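Here is a toy sketch of the term-index idea, with a hand-built prefix map standing in for the real FST (all names and data are invented): a small in-memory index maps prefixes to offsets in the sorted term dictionary, and lookup scans forward from that offset.

```python
# Sorted term dictionary (on disk in real Lucene); invented sample data.
term_dict = ["search", "segment", "shard", "sharp", "term", "trie"]

# Term index: prefix -> offset of the first term with that prefix.
# Built offline; it covers only some prefixes, not every term.
term_index = {"se": 0, "sh": 2, "t": 4}

def lookup(term):
    # Find the longest indexed prefix of the term.
    for plen in range(len(term), 0, -1):
        start = term_index.get(term[:plen])
        if start is not None:
            break
    else:
        return -1  # no indexed prefix: term cannot be present
    # Scan forward from the offset; the dictionary is sorted,
    # so we can stop as soon as we pass the target.
    for i in range(start, len(term_dict)):
        if term_dict[i] == term:
            return i
        if term_dict[i] > term:
            return -1
    return -1

print(lookup("shard"))  # -> 2
```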
OK, now we can get a rough idea of what a Lucene inverted index looks like.
Fourth, some tricks about postings lists
In actual use, the postings list still has several pain points to solve:
- If not compressed, a postings list takes up a lot of disk space
- How to compute intersections and unions quickly for combined queries
Some people may think compression unnecessary: “Doesn't the postings list already store only document IDs? Does it still need compressing?” But if a postings list holds millions of doc IDs (for example, looking up ancient poems by dynasty), compression matters. As for intersections and unions: ES is built specifically for search, and there will certainly be many combined query requirements (AND, OR).
Following that order, let's first look at how to compress.
1. Compression
Frame of Reference
In Lucene, postings lists are required to be ordered arrays of integers. This brings the benefit of compression by delta encoding.
For example, take the list of IDs [73, 300, 302, 332, 343, 372]. Converting each id into its increment over the previous id (the id before the first defaults to 0, so its increment is itself) gives [73, 227, 2, 30, 11, 29].
In this new list, all values are less than 256, so each one needs only one byte of storage.
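The delta encoding can be sketched like this (a minimal illustration, not Lucene's actual codec):

```python
def delta_encode(ids):
    # Each id becomes its increment over the previous id;
    # the first id's "previous" defaults to 0.
    return [cur - prev for prev, cur in zip([0] + ids, ids)]

def delta_decode(deltas):
    out, acc = [], 0
    for d in deltas:
        acc += d
        out.append(acc)
    return out

ids = [73, 300, 302, 332, 343, 372]
deltas = delta_encode(ids)
print(deltas)  # -> [73, 227, 2, 30, 11, 29]
assert delta_decode(deltas) == ids   # lossless round trip
assert all(d < 256 for d in deltas)  # each delta fits in one byte
```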
In practice ES is more fine-grained: it divides all documents into many blocks, each containing exactly 256 documents, delta-encodes each block separately, computes the maximum number of bits needed to hold any id in the block, and puts that bit count in front of the block as a header. This technique is called Frame of Reference (FOR).
This example is also from the official ES blog (it assumes each block holds only 3 documents instead of 256).
The steps of FOR can be summarized as: delta-encode, split into blocks, and bit-pack each block.
After this final bit packing, the integer widths available expand from 4 fixed sizes (8, 16, 32, 64 bits) to all 64 sizes from 1 to 64 bits.
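A minimal sketch of the per-block bit-packing idea (helper names invented; real Lucene uses blocks of 256 documents and heavily optimized packing routines):

```python
def for_encode_block(deltas):
    # The header: bits needed for the largest delta in this block.
    bits = max(d.bit_length() for d in deltas) or 1
    packed = 0
    for d in reversed(deltas):
        packed = (packed << bits) | d  # pack fixed-width values
    return bits, packed

def for_decode_block(bits, packed, n):
    mask = (1 << bits) - 1
    return [(packed >> (i * bits)) & mask for i in range(n)]

bits, packed = for_encode_block([227, 2, 30])
print(bits)  # -> 8: 227 is the widest value, so every slot gets 8 bits
assert for_decode_block(bits, packed, 3) == [227, 2, 30]
```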
In the above way, the space consumption of the posting list can be greatly saved and query performance can be improved. However, ES has done more work to improve the performance of filter filter queries, that is, caching.
Roaring Bitmaps (for filter cache)
In ES, filters can be used to optimize queries: a filter query only decides whether documents match, involves no scoring, and its results can be cached.
For filter queries ES provides a dedicated cache, the filter cache, which stores the result sets produced by filters. Cached filters don't need much memory: each holds only one piece of information, namely which documents match the filter. At the same time, it can be reused by other queries, which greatly improves query performance.
The Frame of Reference compression described above works well for on-disk postings lists, but not so well for a filter cache that must live in memory.
The filter cache stores frequently used data; its purpose is to speed up processing, so it demands a compression scheme that stays fast.
For such postings lists ES uses a different compression method. Let's build up to it step by step.
First, we know a postings list is an ordered integer array whose space we want to compress.
Given such an array, what is the first compression idea that comes to mind? Using bits: each document corresponds to one bit, which is what we usually call a bitmap.
Bitmaps are often used as indexes in databases, query engines, and search engines, since bit operations such as AND (intersection) and OR (union) can be parallelized and run very efficiently.
However, bitmaps have a significant drawback: regardless of how many elements are actually present, the memory they occupy is constant. In other words, they are not suited to sparse data. The industry has many mature compression schemes for sparse bitmaps, and Lucene uses Roaring Bitmaps.
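To illustrate, here a Python integer stands in for a bitmap: bit i set means document i matches, and bitwise AND/OR give intersection/union in bulk (purely illustrative):

```python
def to_bitmap(doc_ids):
    bm = 0
    for d in doc_ids:
        bm |= 1 << d  # set bit d
    return bm

def from_bitmap(bm):
    return {i for i in range(bm.bit_length()) if bm >> i & 1}

a = to_bitmap({1, 3, 5, 7})
b = to_bitmap({3, 4, 5, 9})
print(sorted(from_bitmap(a & b)))  # -> [3, 5]: intersection in one AND

# The sparsity drawback: one lone doc id forces bits up to that id.
print(to_bitmap({1_000_000}).bit_length())  # -> 1000001
```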
Let me describe the compression process simply.
Split each doc id into a high 16 bits and a low 16 bits. Group by the high bits (the high 16 bits are the key; the value is the array of all low 16-bit parts sharing that key), and store each group in a different container (data structure) depending on how much low-bit data it holds:
- len < 4096: an ArrayContainer stores the values directly
- len >= 4096: a BitmapContainer stores them as a bitmap
Where does the dividing line come from? The total number of possible low values is 2^16 = 65536. A bitmap always needs 65536 bits = 8KB, while direct storage needs 2 bytes per value, so 4096 values cost 2 bytes * 4096 = 8KB.
Therefore, when a container holds fewer than 4096 values, storing them directly is more space-efficient.
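The container split can be sketched as follows (a toy model: a Python set stands in for the real 65536-bit BitmapContainer, and the threshold mirrors the 4096 dividing line above):

```python
def roaring_containers(doc_ids, threshold=4096):
    # Group doc ids by their high 16 bits, keeping only the low 16 bits.
    groups = {}
    for d in sorted(doc_ids):
        groups.setdefault(d >> 16, []).append(d & 0xFFFF)
    # Small groups stay as sorted arrays; dense groups become bitmaps.
    return {high: ("array", lows) if len(lows) < threshold
                  else ("bitmap", set(lows))
            for high, lows in groups.items()}

containers = roaring_containers([3, 65536 + 7, 65536 + 8, 2 * 65536 + 1])
print(sorted(containers))  # -> [0, 1, 2]: three distinct high keys
print(containers[1])       # -> ('array', [7, 8])
```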
The space savings mainly come from the high-bit grouping: if 1,000,000 values in the data share the same high 16 bits, those high bits originally cost 1,000,000 * 2 bytes but now cost only 1 * 2 bytes.
The disadvantage is that bit-manipulation speed suffers compared with a native bitmap.
That's the trade-off: the art of balance.
2. Joint queries
With compression covered, let's talk about joint queries.
First, if the query's filter is cached, the filter cache is used directly for the computation, i.e. the bitmaps perform the AND or OR.
If the query's filter is not cached, a skip list is used to traverse the postings lists on disk.
Suppose we have three postings lists and need to merge them with an AND relation to get their intersection. First pick the shortest postings list, then check each of its ids for existence in the other two lists; the surviving ids form the intersection. The traversal can skip elements: for example, once we have advanced to 13 in one list, we can skip over 3 in another, because 3 is smaller than 13.
Using skip lists brings another benefit. Remember that postings lists are stored on disk FOR-encoded: all documents are divided into blocks of 256, each block is delta-encoded separately, and the number of bits needed to hold every id in the block is written in front of it as a header.
Because FOR encoding has a decompression cost, skip lists not only skip the cost of traversal, they also skip decompressing some of these compressed blocks, saving CPU.
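The “shortest list first, leap ahead in the others” AND-merge can be sketched with binary search standing in for real skip pointers (illustrative only):

```python
from bisect import bisect_left

def contains_at_or_after(postings, doc):
    # Leap ahead to the first id >= doc, skipping all smaller ids.
    i = bisect_left(postings, doc)
    return i < len(postings) and postings[i] == doc

def intersect(posting_lists):
    # Drive the merge from the shortest list; probe the others.
    posting_lists = sorted(posting_lists, key=len)
    shortest, rest = posting_lists[0], posting_lists[1:]
    return [doc for doc in shortest
            if all(contains_at_or_after(pl, doc) for pl in rest)]

a = [2, 3, 13, 27]
b = [1, 3, 6, 13, 17, 27, 31]
c = [3, 5, 13, 21, 27, 35, 41]
print(intersect([a, b, c]))  # -> [3, 13, 27]
```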
Fifth, summary
Let's make a technical summary (it feels a bit like Mr. Wang Gang's style 😂):
- To locate target documents quickly, ES uses inverted indexing to optimize search speed; the space cost is relatively high, but search performance improves greatly.
- To locate a term quickly among a huge number of terms while saving memory and reducing disk IO, Lucene uses the “term index -> term dictionary -> postings list” structure, with the term index compressed into memory via an FST, further improving search efficiency.
- To reduce the disk footprint of postings lists, Lucene compresses them with FOR (Frame of Reference), which yields a very noticeable compression effect.
- ES's filter queries use Roaring Bitmaps to cache results, keeping high-frequency filter queries fast while reducing storage consumption.
- For joint queries, when a filter cache exists, the native properties of bitmaps are used to intersect and union quickly; otherwise skip lists intersect and union multiple postings lists, skipping both the traversal cost and the CPU cost of decompressing some blocks.
Elasticsearch's indexing philosophy is to move as much disk content into memory as possible, reduce random disk reads (while also exploiting the disk's sequential-read strengths), and combine this with various compression algorithms, treating memory with an extremely demanding attitude.
Therefore, it is important to turn off indexing for fields that do not need it.
Finally, technology selection always goes hand in hand with business scenarios. Each database has its own problems to solve (or areas of expertise) and its corresponding data structures; different usage scenarios and data structures call for different indexes to maximize query speed.
Although this article is about how Lucene implements inverted indexes, carefully accounting for every bit of memory and disk space and speeding things up with clever bit tricks, step back and compare with MySQL and you will find that, although both are indexes, the implementations are completely different. In a nutshell, a B-tree index is an index structure optimized for writes.
When fast updates are not required, we can trade slow updates for smaller storage and faster retrieval through pre-sorting and similar techniques, just as ES does.
I hope this article can bring you some gains~
Reference document:
- https://www.elastic.co/cn/blog/frame-of-reference-and-roaring-bitmaps
- https://www.elastic.co/cn/blog/found-elasticsearch-from-the-bottom-up
- http://blog.mikemccandless.com/2014/05/choosing-fast-unique-identifier-uuid.html
- https://www.infoq.cn/article/database-timestamp-02
- https://zhuanlan.zhihu.com/p/137574234

