A top-down, then bottom-up introduction to how Elasticsearch works, trying to answer the following questions:

- Why does adding more documents compress the index?
- Why does Elasticsearch take up a lot of memory?
- Why can't my search *foo-bar* match foo-bar?

Elasticsearch version: 2.2.0
Illustrating Elasticsearch

The cluster in the cloud

Boxes in the cluster

Each white square box in the cloud represents a node.
Between nodes

Sitting on one or more nodes, multiple small green squares combine to form an Elasticsearch index.
Small squares in the index

Under an index, the small green squares distributed across multiple nodes are called shards.
Shard = Lucene index

An Elasticsearch shard is essentially a Lucene index.
Lucene is a full-text search library (there are many other kinds of search libraries), and Elasticsearch is built on top of Lucene. Much of what follows is really about how Elasticsearch works on top of Lucene.
Illustrating Lucene

Mini index: the segment

Inside Lucene there are many small segments; we can think of each one as a mini-index within Lucene.
Inside a segment

A segment contains several data structures:

- Inverted index
- Stored fields
- Document values
- Cache
The most important: the inverted index

The inverted index consists of two main parts:

- An ordered term dictionary (containing each term and its document frequency).
- For each term, a postings list (the documents in which the term appears).
When we search, the query text is first tokenized, and the resulting terms are looked up in the dictionary to locate the documents related to the search.

Example: querying "the fury"
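The lookup just described can be sketched in a few lines of Python. The documents, IDs, and whitespace tokenizer below are made up for illustration; this is the idea of an inverted index, not how Lucene actually encodes postings.

```python
# A minimal inverted index: a term dictionary mapping each term to
# its postings (the IDs of the documents that contain it).
from collections import defaultdict

docs = {
    1: "the old man and the sea",
    2: "the sound and the fury",
    3: "fury road",
}

index = defaultdict(set)
for doc_id, text in docs.items():
    for term in text.split():          # naive whitespace tokenizer
        index[term].add(doc_id)

def search(query):
    """AND-combine the postings of each query term."""
    postings = [index.get(term, set()) for term in query.split()]
    return set.intersection(*postings) if postings else set()

print(sorted(search("the fury")))      # only doc 2 contains both terms
```

Intersecting the sorted postings lists term by term is what makes multi-word queries cheap.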
Autocompletion (prefix)

If you want to find terms that start with the letter "c", a simple binary search over the sorted term dictionary finds words such as "choice" and "coming".
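Because the dictionary is kept sorted, a prefix lookup is a binary search for the first term at or after the prefix, then a short walk forward. A sketch with an illustrative term list:

```python
# Prefix lookup over a sorted term dictionary via binary search.
import bisect

terms = sorted(["choice", "coming", "cat", "banana", "dog"])

def prefix_terms(prefix):
    i = bisect.bisect_left(terms, prefix)   # first term >= prefix
    out = []
    while i < len(terms) and terms[i].startswith(prefix):
        out.append(terms[i])
        i += 1
    return out

print(prefix_terms("c"))   # ['cat', 'choice', 'coming']
```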
Expensive lookups

If you want to find all the terms that contain "our", the system has to scan the entire inverted index, which is very expensive.

In this case, if we want to optimize, the problem we face becomes: how do we generate suitable terms?
Transforming the problem

For problems like the one above, there are several possible solutions:
- * suffix -> xiffus *: if we want to search by suffix, we can index the reversed term, turning a suffix query into a prefix query.
- (60.6384, 6.5017) -> u4u8gyykk: for geo location information, convert the coordinates to a geohash.
- 123 -> {1-hundreds, 12-tens, 123}: for simple numbers, generate multiple forms of the term.
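The first trick can be sketched directly: index every term reversed, and a suffix query becomes a cheap prefix query over the reversed terms. The term list is illustrative.

```python
# "Index the reversed term" trick for suffix search.
terms = ["hour", "sour", "yours", "house"]

# Store each term reversed alongside the originals.
reversed_terms = sorted(t[::-1] for t in terms)

def suffix_search(suffix):
    target = suffix[::-1]                  # "our" -> "ruo"
    # A prefix scan over reversed terms answers the suffix query.
    return sorted(t[::-1] for t in reversed_terms if t.startswith(target))

print(suffix_search("our"))  # ['hour', 'sour']
```

The cost is extra index space, which is the common trade-off behind all three transformations above: spend storage at index time to make the query a cheap dictionary lookup.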
Solving misspellings

A Python library solves misspellings by generating, for each word, a tree-shaped state machine that encodes misspelling information.
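The library's state machine is not reproduced here; as a stand-in, Python's standard library offers simple fuzzy matching over a term list, which illustrates the goal of tolerating misspellings. The terms are illustrative.

```python
# Fuzzy matching as a stand-in for a misspelling-aware state machine.
import difflib

terms = ["choice", "coming", "search", "elastic"]

# get_close_matches ranks terms by similarity ratio (default cutoff 0.6).
print(difflib.get_close_matches("serch", terms, n=1))  # ['search']
```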
Stored fields

When we want to retrieve the actual contents of a document (for example, a particular title), the inverted index does not solve this well, so Lucene provides another data structure: stored fields. Essentially, a stored field is a simple key-value pair. By default, Elasticsearch stores the entire document's JSON source as a stored field.
Document values: for sorting and aggregations

Even so, the structures above still cannot handle operations such as sorting, aggregations, and facets well, because we would have to read a lot of information we do not need.

Another data structure solves this problem: document values. It is essentially a columnar store, highly optimized for holding values of the same type together.

To improve efficiency, Elasticsearch can read all the document values under an index into memory, which greatly improves access speed but also consumes a lot of memory.
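The row-versus-column difference can be shown in a tiny sketch; the field names and values are made up. With a per-field column (the document-values idea), an aggregation scans one flat, same-typed array instead of decoding every document.

```python
# Row store vs. column store, in miniature.
row_store = [
    {"author": "alice", "views": 10},
    {"author": "bob",   "views": 25},
    {"author": "alice", "views": 5},
]

# Columnar layout: one flat array per field, aligned by doc ID.
views_column = [doc["views"] for doc in row_store]

# Aggregating a column is a tight scan over same-typed values.
print(sum(views_column))  # 40
```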
In summary, these data structures (the inverted index, stored fields, document values, and their caches) all live inside the segment.
When a search occurs, Lucene searches all segments, returns each segment's results, and finally combines them to present to the client.

Lucene has some properties that make this process important:
- Segments are immutable.
- Deletes: when a deletion occurs, Lucene simply flags the document as deleted; the data stays in its original place, unchanged.
- Updates: an update is essentially a delete followed by a re-index.
Compression everywhere

Lucene is very good at compressing data; basically every compression method in the textbooks can be found in Lucene.
Caching everything

Lucene also caches all kinds of information, which greatly improves its query efficiency.
The story of the cache

When Elasticsearch indexes a document, it builds the corresponding caches and refreshes the data periodically (every second by default), after which the document becomes searchable.
As time passes we accumulate many segments, so Elasticsearch merges them, and in the process the merged-away segments are deleted.

This is why adding more documents can make the index's footprint smaller: adding data triggers a merge, which can lead to more compression.
For example, here two segments will merge: both are eventually deleted and replaced by a new, combined segment.

At this point the new segment is cold in the cache, while most other segments remain unchanged and stay warm.
Scenarios like the above happen constantly inside a Lucene index.
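A merge is also the moment when documents that were only flagged as deleted finally disappear, which is the mechanism behind the shrinking index. A sketch with made-up segment structures:

```python
# Segment merge in miniature: copy live docs into one new segment,
# drop the flagged-deleted ones for good.
segments = [
    {"docs": ["A", "B", "C"], "live": [True, False, True]},
    {"docs": ["D", "E"],      "live": [False, True]},
]

merged = [d
          for seg in segments
          for d, alive in zip(seg["docs"], seg["live"])
          if alive]

print(merged)        # ['A', 'C', 'E'] -- 'B' and 'D' are gone
print(len(merged))   # 3 docs stored instead of 5
```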
Searching within a shard

Elasticsearch searches across shards similarly to how Lucene searches across segments.

Unlike Lucene segments, though, shards may be distributed across different nodes, so when searching and returning results, all the information has to travel over the network.
It should be noted that one search that hits two shards equals two separate shard-level searches.
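That scatter-gather pattern can be sketched as follows; the shard contents and scores are made up for illustration. Each shard is searched independently, and the coordinating node merges the per-shard top hits.

```python
# Scatter-gather: one logical search = N shard-level searches + a merge.
shards = [
    [("doc1", 3.2), ("doc2", 1.1)],   # shard 0: (doc, score) hits
    [("doc3", 2.7), ("doc4", 0.4)],   # shard 1
]

def search_all(shards, top_n=2):
    # One independent search per shard...
    hits = [hit for shard in shards for hit in shard]
    # ...then merge by score on the coordinating node.
    return sorted(hits, key=lambda h: -h[1])[:top_n]

print(search_all(shards))  # [('doc1', 3.2), ('doc3', 2.7)]
```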
Handling log files

When we want to search for logs generated on a specific date, chunking the logs into time-based indices by timestamp greatly improves search efficiency.

Deleting old data also becomes very convenient: just drop the old index.
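A common shape for this is one index per day; the naming scheme and retention window below are illustrative, not an Elasticsearch default.

```python
# Time-based indices for logs: one index per day, retention by
# deleting whole old indices.
from datetime import date, timedelta

def daily_index(d):
    return f"logs-{d:%Y.%m.%d}"

today = date(2016, 2, 22)
retention = 7  # keep a week of logs

keep = [daily_index(today - timedelta(days=i)) for i in range(retention)]
print(keep[0])    # 'logs-2016.02.22'
print(keep[-1])   # 'logs-2016.02.16'
# Anything older than keep[-1] is dropped as a whole index.
```

A date-bounded search then only needs to touch the handful of indices whose names fall in the range.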
In the case above, each index has two shards.
How to scale?

A shard does not split further, but shards can be moved to different nodes.
Therefore, if the pressure on the cluster grows to a certain point, we may consider adding new nodes; but because the number of shards is fixed when an index is created, growing beyond that would require re-indexing all the data, which we do not want. So when planning, we need to balance having enough shards for future growth against the overhead of too many.
Node allocation and shard optimization

- Based on the index information, determine which node the request will be routed to.
- Determine which replica is available.
- And so on.

When should filters be used?

- Filters can be used in a query at any time.
- Use a scored query only when a relevance score is actually needed.
Routing

Each node maintains a copy of the routing table, so when a request reaches any node, Elasticsearch can forward it to the shards on the desired nodes for further processing.
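Shard selection itself is deterministic: Elasticsearch hashes the routing value (the document ID by default) modulo the number of primary shards. A sketch, using CRC32 as a stand-in for Elasticsearch's real murmur3 hash:

```python
# How any node can agree on which shard owns a document.
import zlib

def pick_shard(routing_value, num_primary_shards):
    h = zlib.crc32(routing_value.encode())   # stand-in for murmur3
    return h % num_primary_shards

# Every node computes the same answer, so any node can forward the
# request to the right shard without asking anyone.
print(pick_shard("doc-42", 2) == pick_shard("doc-42", 2))  # True
```

This is also why the number of primary shards cannot change after index creation: a different modulus would send existing documents to the wrong shard.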
A real request

Query

The query is of type filtered and contains a multi_match query.

Aggregation

Aggregate by author, to get the top 10 authors along with the top 10 hits.
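The request body described above might look like the following (Elasticsearch 2.x syntax, where the filtered query type still existed); the search text and field names are made up for illustration:

```python
# Sketch of the request body: filtered query wrapping a multi_match,
# plus a terms aggregation on the author field.
body = {
    "query": {
        "filtered": {
            "query": {
                "multi_match": {
                    "query": "elasticsearch",
                    "fields": ["title", "body"],
                }
            }
        }
    },
    "aggs": {
        "top_authors": {"terms": {"field": "author", "size": 10}}
    },
    "size": 10,   # top 10 hits
}
```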
Request distribution

This request may be distributed to any node in the cluster.
The coordinating node

That node then becomes the coordinator of the current request. Based on the index information, it decides which nodes to route the request to, which replicas are available, and so on.
Before the real search

Elasticsearch converts the query into a Lucene query, which is then executed in all the segments.
There is also a cache for the filter conditions themselves, but full queries are not cached, so if the same query is executed repeatedly, the application itself needs to do the caching.
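The reason filters are so cache-friendly is that a filter's result is a score-free yes/no bitset over documents, reusable across many different queries. A sketch with made-up documents, using memoization as a stand-in for the filter cache:

```python
# Filter results as cacheable bitsets.
from functools import lru_cache

docs = [{"id": i, "year": 2010 + i % 5} for i in range(10)]

@lru_cache(maxsize=None)
def year_filter(year):
    # One bool per doc: no scores, so the result is reusable as-is.
    return tuple(d["year"] == year for d in docs)

hits = [d["id"] for d, keep in zip(docs, year_filter(2012)) if keep]
print(hits)  # [2, 7]
```

A scored query, by contrast, produces per-query relevance numbers, which is why Elasticsearch leaves caching those to the application.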
The results are then returned and merged layer by layer along the return path.
References

Reference source:

- SlideShare: Elasticsearch From the Bottom Up
- YouTube: Elasticsearch from the bottom up
- Wiki: Document-term matrix
- Wiki: Search engine indexing
- Wiki: Skip list
- Stanford Edu: Faster postings list intersection via skip pointers
- StackOverflow: how a search index works when querying many words?
- StackOverflow: how does lucene calculate intersection of documents so fast?
- Lucene and its magical indexes
- misspellings 2.0c: a tool to detect misspellings