
I recently gave an internal talk on Elasticsearch, and this article summarizes it. I hope it gives readers a general understanding of what Elasticsearch does, what it is used for, and its basic principles.

Data in life

A search engine retrieves data, so let's start with the data in our lives, which generally falls into two types:

Structured data: also known as row data, it is logically expressed and implemented as a two-dimensional table structure, strictly follows format and length specifications, and is stored and managed mainly in relational databases. It refers to data with a fixed format or limited length, such as database records and metadata.

Unstructured data: also known as full-text data, it has no fixed length or format and does not fit the two-dimensional tables of a database. It includes office documents of all formats, XML, HTML, Word documents, emails, reports of all kinds, images, audio, video, and so on.

Note: for a more fine-grained distinction, XML and HTML can be classified as semi-structured data. Because they have their own specific tag formats, they can be processed as structured data when needed, or their plain text can be extracted and treated as unstructured data.

Corresponding to these two kinds of data, search is likewise divided into two types:

For structured data, which has a specific structure, we can generally store and search it in the two-dimensional tables (Tables) of a relational database (MySQL, Oracle, etc.), and we can also build indexes on it.

For unstructured data, that is, full-text data, there are two main search methods: sequential scanning and full-text search.

Sequential scanning: as the name suggests, this queries for specific keywords by scanning through the text in order.

For example, suppose you are given a newspaper and asked to find where the word “peace” appears in it. You would have to scan the newspaper from beginning to end and mark in which sections and at which positions the keyword appears.

This method is undoubtedly the most time-consuming and least efficient: if the print is small and there are many sections, or even several newspapers, your eyes will be about worn out by the time you finish scanning.

Full-text search: sequential scanning of unstructured data is slow; can we optimize it? Can't we give our unstructured data some structure?

We can extract part of the information from the unstructured data and reorganize it so that it has a certain structure, and then search this data with its partial structure, thereby achieving relatively fast search.

This approach constitutes the basic idea of full-text search. The part of the information that is extracted from unstructured data and then reorganized is called the index.

The main workload of this method lies in creating the index up front, but the later searches are fast and efficient.

Let's start with Lucene

After this brief look at the kinds of data in our lives, we know that SQL retrieval in a relational database cannot handle unstructured data.

Processing unstructured data relies on full-text search, and the best open-source full-text search engine toolkit on the market today is Apache Lucene.

But Lucene is only a toolkit, not a complete full-text search engine. Lucene's purpose is to provide software developers with a simple, easy-to-use toolkit to conveniently implement full-text search in a target system, or to serve as the basis for building a complete full-text search engine.

The main open-source full-text search engines based on Lucene are Solr and Elasticsearch.

Solr and Elasticsearch are both relatively mature full-text search engines, with essentially the same functionality and performance.

However, ES is distributed and easy to install and use out of the box, whereas Solr's distribution needs to be implemented with third-party support, for example using ZooKeeper for distributed coordination and management.

Both Solr and Elasticsearch rely on Lucene under the hood, and Lucene achieves full-text search mainly because it implements the inverted index as its query structure.

How should we understand an inverted index? Suppose there are three documents whose contents are:

  • Java is the best programming language.

  • PHP is the best programming language.

  • Javascript is the best programming language.

To create an inverted index, we use a tokenizer to split the content field of each document into separate words (which we call terms), create a sorted list of all the unique terms, and then list the documents in which each term appears. The result looks like this:

Term        | Doc_1 | Doc_2 | Doc_3
------------+-------+-------+------
Java        |   X   |       |
is          |   X   |   X   |   X
the         |   X   |   X   |   X
best        |   X   |   X   |   X
programming |   X   |   X   |   X
language    |   X   |   X   |   X
PHP         |       |   X   |
Javascript  |       |       |   X

This structure consists of a list of all the unique words in the documents, and, for each word, a list of the documents in which it appears.

This structure, in which the value of an attribute determines the position of the records, is an inverted index. A file with an inverted index is called an inverted file.

Converting the above into a figure illustrates the structure of the inverted index, as shown in the following figure. There are a few core terms to understand:

  • Term: the smallest storage and query unit in an index. For English it is a word; for Chinese it generally refers to a word produced by segmentation.

  • Term Dictionary: also called the dictionary, it is the collection of all terms. A search engine's usual unit of indexing is the word: the term dictionary is the set of strings of all the words that have appeared in the document collection, and each entry in it records some information about the word itself and a pointer to its inverted list.

  • Posting List: a document usually consists of many words, and the posting list records in which documents a word appears and at which positions.

    Each record is called a posting. A posting list records not only document IDs but also information such as term frequency.

  • Inverted File: the posting lists of all words are often stored sequentially in a file on disk; this file is called the inverted file, and it is the physical file in which the inverted index is stored.

From the figure above, we can see that an inverted index consists mainly of two parts: the term dictionary and the posting lists. These are two important data structures in Lucene and the cornerstone of fast retrieval. They are stored in two places: the dictionary in memory and the inverted file on disk.
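To make this concrete, here is a minimal Python sketch that builds such an inverted index for the three example documents. It is purely illustrative: a real Lucene segment also records term frequencies and positions, and stores the postings compressed.

# Toy inverted index for the three example documents above
docs = {
    1: "Java is the best programming language",
    2: "PHP is the best programming language",
    3: "Javascript is the best programming language",
}

inverted_index = {}  # term -> postings list (document IDs, in order)
for doc_id, content in docs.items():
    # trivial "tokenizer": lowercase and split on whitespace
    for term in content.lower().split():
        postings = inverted_index.setdefault(term, [])
        if doc_id not in postings:  # list each document once per term
            postings.append(doc_id)

print(inverted_index["java"])         # [1]
print(inverted_index["programming"])  # [1, 2, 3]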

ES core concepts

With some of the basics in place, we now officially move on to today's protagonist: Elasticsearch.

ES is an open-source search engine written in Java. It uses Lucene internally for indexing and searching; by encapsulating Lucene, it hides the complexity and instead exposes a simple, consistent set of RESTful APIs.

However, Elasticsearch is more than Lucene, and it’s not just a full-text search engine. 

It can be accurately described as follows:

  • a distributed, real-time document store where each field can be indexed and searched.

  • A distributed real-time analytics search engine.

  • Capable of scaling to hundreds of nodes and supporting petabytes of structured or unstructured data.

The official site describes Elasticsearch as a distributed, scalable, near-real-time search and data analysis engine.

Let's look at the core concepts behind how Elasticsearch is distributed, scalable, and near real time.

Cluster

Setting up an ES cluster is very simple: it does not depend on third-party coordination or management components, but implements cluster management itself.

An ES cluster consists of one or more Elasticsearch nodes; each node joins the cluster by being configured with the same cluster.name, whose default value is "elasticsearch".

Make sure to use different cluster names in different environments; otherwise nodes may end up joining the wrong cluster.

Each running instance of the Elasticsearch service is a node. A node's name is set via node.name; if it is not set, a random universally unique identifier is assigned as its name at startup.
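As a minimal sketch (names hypothetical), the relevant lines in elasticsearch.yml would be:

# nodes with the same cluster.name join the same cluster
cluster.name: my-production-cluster
# optional; a random UUID is used as the name if omitted
node.name: node-1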

(1) Discovery mechanism

This raises a question: how does ES internally join the different nodes that share a cluster.name setting into one cluster? The answer is Zen Discovery.

Zen Discovery is Elasticsearch’s built-in default discovery module (the discovery module is responsible for discovering nodes in the cluster and electing master nodes).

It provides unicast and file-based discovery and can be extended to support cloud environments and other forms of discovery through plug-ins.

Zen Discovery integrates with other modules, for example, all communication between nodes is done using the Transport module. Nodes use the discovery mechanism to find other nodes by pinging.

Elasticsearch is configured to use unicast discovery by default to prevent nodes from inadvertently joining the cluster. Only nodes running on the same machine automatically form a cluster.

If the cluster's nodes run on different machines, you can use unicast discovery and provide Elasticsearch with a list of nodes it should try to contact.

When a node contacts a member of the unicast list, it obtains the status of all the nodes in the cluster, then contacts the master node and joins the cluster.

This means the unicast list does not need to contain all the nodes in the cluster; it just needs enough nodes that a newly joining node can reach one of them and talk to it.

If you use the candidate master nodes as the unicast list, you only need to list three of them. This is configured in the elasticsearch.yml file:

discovery.zen.ping.unicast.hosts: ["host1", "host2:port"]

After starting, a node pings first; if discovery.zen.ping.unicast.hosts is set, it pings the hosts in that setting, otherwise it tries pinging a few ports on localhost.

Elasticsearch supports running multiple nodes on the same host, and a ping response contains basic information about the node and which node it considers to be the master.

At the start of an election, each node's notion of the master is considered first; the rule is simple: sort candidates lexicographically by node ID and take the first. If no node has a notion of the master yet, the master is selected from all nodes using the same rule.

One constraint here is discovery.zen.minimum_master_nodes: if the number of participating nodes does not reach this minimum, the above process is repeated until there are enough nodes to start an election.

In the end, a master will definitely be elected; if there is only one local node, it elects itself.

If the current node is the master, it waits until the node count reaches discovery.zen.minimum_master_nodes before providing services.

If the current node is not the master, it tries to join the master. Elasticsearch calls this process of service discovery and election Zen Discovery.

Since it supports clusters of any size (1 to N), ES cannot restrict the node count to odd numbers the way ZooKeeper does, so it cannot use a voting mechanism for election; instead, it uses a rule.

As long as all nodes follow the same rule and the information they see is consistent, the master they select will be consistent.

But the problem in distributed systems is that information is not always consistent, so split-brain problems easily arise.

Most solutions set a quorum value and require the number of available nodes to exceed the quorum (usually more than half of all nodes) before the cluster serves external traffic.

In Elasticsearch, this quorum is configured via discovery.zen.minimum_master_nodes.

(2) Node roles

Each node can be either a candidate master node or a data node. This is set in the configuration file /config/elasticsearch.yml, and both settings default to true:

node.master: true  # whether the node is a candidate master node
node.data: true    # whether the node is a data node

Data nodes are responsible for storing data and for related operations, such as adding, deleting, modifying, querying, and aggregating data, so data nodes have high machine-configuration requirements and consume a lot of CPU, memory, and I/O.

Often as the cluster grows, more data nodes need to be added to improve performance and availability.

A candidate master node can be elected as the master node. Only candidate master nodes in the cluster have the right to vote and to be elected; other nodes do not participate in the election.

The master node is responsible for creating and deleting indexes, tracking which nodes are part of the cluster, deciding which shards are allocated to which nodes, tracking the status of the nodes in the cluster, and so on. A stable master node is very important to the health of the cluster.

A node can be both a candidate master node and a data node, but data nodes consume a lot of CPU, memory, and I/O, so a node that serves as both data node and master node may affect the master node and, in turn, the state of the entire cluster.

Therefore, to improve the health of the cluster, we should divide and isolate the roles of the nodes in an Elasticsearch cluster. A few lower-spec machines can serve as the group of candidate master nodes, with roles configured as sketched below.
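A sketch of the role settings in elasticsearch.yml, assuming a dedicated-role deployment:

# on a dedicated candidate master node: eligible for election, stores no data
node.master: true
node.data: false

# on a dedicated data node: stores data, never elected master
node.master: false
node.data: true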

The master node and the other nodes check on each other by pinging: the master is responsible for pinging all the other nodes to determine whether any node has died, and the other nodes ping the master to determine whether it is still available.

Although node roles are distinguished, user requests can be sent to any node, and that node takes responsibility for distributing the request and collecting the results; the master node does not need to forward them.

Such a node can be called a coordinating node. Coordinating nodes do not need to be specified or configured; any node in the cluster can play this role.

(3) Split brain

If, due to the network or other reasons, multiple master nodes are elected in the cluster at the same time and data updates become inconsistent, the phenomenon is called split brain: different nodes in the cluster diverge in their choice of master, and multiple competing masters appear.

The “split brain” problem may be caused by the following reasons:

  • Network problems: network delays between parts of the cluster cause some nodes to lose access to the master; they assume the master has died, elect a new master, mark the shards and replicas on the old master red, and allocate new primary shards.

  • Node load: the master node's role is both master and data. When traffic is heavy, ES may stop responding (a feigned-death state), causing long delays; other nodes then get no response from the master, conclude it has died, and re-elect a new master.

  • Memory reclamation: the master node's role is both master and data. When the ES process on a data node occupies a large amount of memory, it triggers large-scale JVM garbage collection, making the ES process unresponsive.

To avoid split brain, we can start from these causes and optimize in the following ways:

  • Appropriately lengthen the response time to reduce misjudgment. The parameter discovery.zen.ping_timeout sets the node-status response timeout; the default is 3s, and it can be raised appropriately.

    If the master does not respond within the timeout, the node is judged to be dead. Raising the parameter (for example, discovery.zen.ping_timeout: 6s) can appropriately reduce false positives.

  • Trigger elections properly. We need to set the parameter discovery.zen.minimum_master_nodes in the configuration file of the cluster's candidate master nodes.

    This parameter specifies how many candidate master nodes must participate in an election for a master to be elected. The default is 1; the official recommendation is (master_eligible_nodes / 2) + 1, where master_eligible_nodes is the number of candidate master nodes.

    This prevents split brain while keeping the cluster as available as possible, because the election can proceed as long as no fewer than discovery.zen.minimum_master_nodes candidates are alive.

    When fewer than this number are alive, no election can be triggered and the cluster cannot be used, but this avoids shard chaos.

  • Role separation. Separating the candidate master nodes from the data nodes, as described above, reduces the burden on the master node, prevents its feigned death, and reduces false judgments that the master has “died”. A configuration sketch covering these settings follows.
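Putting these measures together, a sketch of the relevant elasticsearch.yml settings for a cluster with three candidate master nodes might be:

# raise the ping timeout from the default 3s to reduce misjudgment
discovery.zen.ping_timeout: 6s
# quorum: (3 candidate masters / 2) + 1 = 2
discovery.zen.minimum_master_nodes: 2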

Shards

ES supports petabyte-scale full-text search. When the amount of data in an index is too large, ES splits the data horizontally across different blocks; each split block is called a shard.

This is similar to MySQL's sharding of databases and tables, except that MySQL sharding needs third-party components, while ES implements the feature internally.

When writing to a multi-shard index, routing determines which shard to write to, so the number of shards must be specified when the index is created; once determined, it cannot be changed.

The number of shards, and the number of replicas described below, can both be configured through Settings when the index is created. By default, ES creates 5 primary shards for an index and one replica for each of them.

PUT /myIndex
{
   "settings" : {
      "number_of_shards" : 5,
      "number_of_replicas" : 1
   }
}

Sharding gives ES indexes scale and performance: each shard is a Lucene index file, and each shard has one primary shard and zero or more replicas.

Replicas

A replica is a copy of a shard. Each primary shard has one or more replica shards; when the primary shard fails, a replica can take over data queries and other operations.

The primary shard and the corresponding replica shard will not be on the same node, so the maximum number of replica shards is N-1 (where N is the number of nodes).

Document creation, indexing, and deletion requests are write operations, which must be completed on the primary shard before they can be replicated to the corresponding replica shards.

To improve write throughput, ES performs this process concurrently; to resolve the data conflicts that concurrent writes can cause, ES uses optimistic locking: each document has a _version number that is incremented when the document is modified.

Once all replica shards report a successful write, success is reported to the coordinating node, which reports success to the client.
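To illustrate the optimistic locking, in ES 6.x a write request can carry the _version it expects (index and document names here are hypothetical); if the document's current version no longer matches, the request fails instead of silently overwriting a concurrent change:

PUT /my_index/_doc/1?version=2
{
  "title": "updated title"
}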

As can be seen from the above figure, in order to achieve high availability, the master node avoids placing the primary and replica shards on the same node.

Suppose that node Node1 goes down or becomes unreachable over the network; then the primary shard S0 on it is also unavailable.

Fortunately, the other two nodes still work properly. ES re-elects a new master node, and between them these two nodes have all the data we need for S0.

The replica shard of S0 is promoted to primary, and this promotion happens instantaneously. The cluster status becomes Yellow.

Why is our cluster status Yellow rather than Green? Although both primary shards are present, each primary shard was configured with two replica shards, and at the moment each has only one. So the cluster cannot be in the Green state.

If we also shut down Node2, our program can still run without losing any data, because Node3 keeps a copy of every shard.

If we restart Node1, the cluster can reallocate the missing replica shards and return to its original normal state.

If Node1 still holds its previous shards, it will try to reuse them; however, the shards on Node1 are now replica shards rather than primaries, and if the data changed in the meantime, only the modified data files need to be copied over from the primary shard.

Summary:

  • Sharding exists to increase the volume of data that can be handled and to make horizontal scaling easy; replication exists to improve cluster stability and concurrency.

  • Replicas are multiplication: the more there are, the more is consumed, but also the safer the data. Shards are division: the more shards, the less data per shard and the more dispersed it is.

  • The more replicas, the higher the cluster's availability; but since each shard is effectively a Lucene index file, shards consume file handles, memory, and CPU.

    Data synchronization between shards also consumes network bandwidth, so more shards and replicas is not always better.

Mapping

A mapping defines, for the fields in an index, information such as storage type, tokenization, and whether the field value is stored; like a schema in a database, it describes the fields or properties a document may have and the data type of each field.

However, whereas a relational database must specify field types when a table is created, ES can either leave field types unspecified and guess them dynamically, or have them specified when the index is created.

Mapping that recognizes field types automatically from the data format is called dynamic mapping, and mapping in which we define the field types when creating the index is called static mapping, or explicit mapping.

Before explaining how to use dynamic and static mapping, let's first look at what field types exist in ES, and then discuss why static mapping is needed instead of dynamic mapping when creating an index.

There are mainly the following types of field data types in ES (v6.8):

Text: a field for indexing full-text values, such as an email body or a product description. These fields are analyzed: they pass through a tokenizer that converts the string into a list of individual terms before it is indexed.

The analysis process allows Elasticsearch to search for individual words within each full-text field. Text fields are not used for sorting and are rarely used for aggregations.

Keyword: a field for indexing structured content, such as email addresses, host names, status codes, postal codes, or tags. Keyword fields are typically used for filtering, sorting, and aggregations, and they can only be searched by their exact value.

From this understanding of field types, we know that some fields need to be defined explicitly: for example, whether a field is Text or Keyword makes a big difference; a time field may need its time format specified; and some fields need a specific tokenizer, and so on.

Dynamic mapping cannot do all of this precisely, and automatic recognition often differs somewhat from what we expect. A small illustration follows.
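A quick illustration (index name hypothetical): index a document into a brand-new index and let dynamic mapping guess, then inspect the result. ES will typically guess age as long and created as date; if you want age to be integer, or created to accept a custom format, you need an explicit mapping.

PUT /my_index/_doc/1
{
  "age": 25,
  "created": "2019-07-01"
}

GET /my_index/_mapping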

So a complete index creation should specify the number of shards and replicas as well as the mapping definition, as follows:

PUT my_index
{
  "settings" : {
    "number_of_shards" : 5,
    "number_of_replicas" : 1
  },
  "mappings" : {
    "_doc" : {
      "properties" : {
        "title":    { "type": "text" },
        "name":     { "type": "text" },
        "age":      { "type": "integer" },
        "created":  {
          "type":   "date",
          "format": "strict_date_optional_time||epoch_millis"
        }
      }
    }
  }
}

Basic use of ES

The first thing to consider when deciding to use Elasticsearch is the version. Excluding 0.x and 1.x, Elasticsearch currently has the following commonly used stable major versions: 2.x, 5.x, 6.x, and 7.x (current).

You may notice there is no 3.x or 4.x: ES jumped straight from 2.4.6 to 5.0.0. This was actually done to unify the versions of the ELK (ElasticSearch, Logstash, Kibana) technology stack, so as not to confuse users.

When Elasticsearch was on 2.x (the last 2.x release, 2.4.6, came out on July 25, 2017), Kibana was already on 4.x (Kibana 4.6.5 was also released on July 25, 2017).

The next major version of Kibana was bound to be 5.x, so Elasticsearch released its own next major version directly as 5.0.0.

After this unification, choosing a version is no longer a dilemma: we choose a version of Elasticsearch and then choose the same version of Kibana, without worrying about version incompatibility.

Elasticsearch is built in Java, so in addition to keeping the ELK stack versions consistent, we also need to pay attention to the JDK version when choosing an Elasticsearch version, because each major version requires a different JDK version; version 7.2 currently supports JDK 11.

Installation and use

(1) Download and decompress Elasticsearch. It can be used directly after unzipping, without installation. The decompressed directory layout is as follows:

  • bin: binary executables, including the startup command and the plugin-installation command.

  • config: The configuration file directory.

  • data: The data storage directory.

  • lib: Dependent package directory.

  • logs: Log file directory.

  • modules: Module libraries, such as x-pack modules.

  • plugins: Plugins directory.

(2) Run bin/elasticsearch in the installation directory to start ES.

(3) ES runs on port 9200 by default. Request curl http://localhost:9200/ or enter http://localhost:9200 in a browser to get a JSON object containing information such as the current node, cluster, and version.

{
  "name" : "U7fp3O9",
  "cluster_name" : "elasticsearch",
  "cluster_uuid" : "-Rj8jGQvRIelGd9ckicUOA",
  "version" : {
    "number" : "6.8.1",
    "build_flavor" : "default",
    "build_type" : "zip",
    "build_hash" : "1fad4e1",
    "build_date" : "2019-06-18T13:16:52.517138Z",
    "build_snapshot" : false,
    "lucene_version" : "7.7.0",
    "minimum_wire_compatibility_version" : "5.6.0",
    "minimum_index_compatibility_version" : "5.0.0"
  },
  "tagline" : "You Know, for Search"
}
Cluster health status

To check cluster health, we can run GET /_cluster/health in the Kibana console and get the following:

{ "
cluster_name""wujiajian",
  "status" : "yellow",
  "timed_out" : false,
  "number_of_nodes" : 1 ,
  "number_of_data_nodes" : 1,
  "active_primary_shards" : 9,
  "active_shards" : 9 ,
  "relocating_shards" : 0,
  "initializing_shards" : 0,
  "unassigned_shards" : 5,
   "delayed_unassigned_shards" : 0,
  "number_of_pending_tasks" : 0,
  "number_of_in_flight_fetch" : 0 ,
  "task_max_waiting_in_queue_millis" : 0,
  "active_shards_percent_as_number" : 64.28571428571429}

The cluster status is indicated by green, yellow, and red:

  • Green: the cluster is healthy and intact, everything is fully functional, and all shards and replicas work fine.

  • Yellow: warning status. All primary shards are functional, but at least one replica is not working properly. The cluster works fine, but high availability is compromised to some extent.

  • Red: the cluster is unavailable. One or more shards and their replicas are unexpectedly unavailable; queries can still be executed, but the results returned will be inaccurate, and write requests assigned to the missing shards will fail, eventually leading to data loss.

When the cluster status is red, it will continue to serve search requests from available shards, but you need to fix those unallocated shards as soon as possible.

ES mechanism principles

Having covered ES's basic concepts and basic operations, we may still have many questions:

  • How does ES work internally?

  • How are primary and replica shards synchronized?

  • What is the process of creating an index?

  • How does ES distribute index data across different shards, and how is that index data stored?

  • Why is ES a near-real-time search engine, while document CRUD (create-read-update-delete) operations are real-time?

  • How does Elasticsearch ensure that updates are persisted without data loss in the event of a power outage?

  • And why doesn't deleting a document free up space immediately?

With these questions in mind, we move on to the rest of the article.

Write index principle

The following figure depicts a three-node cluster with 12 shards in total: 4 primary shards (S0, S1, S2, S3) and 8 replica shards (R0, R1, R2, R3), two replicas per primary shard. Node 1 is the master node, responsible for the state of the entire cluster.

Writes can only go to the primary shard, which then synchronizes the data to its replica shards. With four primary shards, by what rule does ES decide which shard a given piece of data is written to?

Why is this index data written to S0 rather than S1 or S2? Why is that piece of data written to S3 rather than S0?

First of all, it certainly cannot be random; otherwise we would not know where to look when we need to retrieve the document later.

In fact, this process is determined by the following formula:

shard = hash(routing) % number_of_primary_shards

Routing is a variable value, which defaults to the document's _id and can also be set to a custom value.

The routing value is run through a hash function to produce a number, which is then divided by number_of_primary_shards (the number of primary shards) to obtain a remainder.

This remainder, always between 0 and number_of_primary_shards - 1, is the shard where the document we seek lives.

This explains why we determine the number of primary shards when we create the index and never change this number: because if the number changes, then all previously routed values will be invalid and the document will no longer be found.
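A minimal Python sketch of the formula (ES actually runs the routing value through the Murmur3 hash; Python's built-in hash, which is even randomized per process for strings, is used here purely for illustration):

def pick_shard(routing, number_of_primary_shards):
    # the remainder is always in 0 .. number_of_primary_shards - 1
    return hash(routing) % number_of_primary_shards

# with 4 primary shards, every routing value maps to exactly one shard
for doc_id in ["doc-1", "doc-2", "doc-3"]:
    print(doc_id, "-> S%d" % pick_shard(doc_id, 4))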

Since every node in the ES cluster knows, through the formula above, where each document in the cluster resides, every node is able to handle read and write requests.

When a write request is sent to some node, that node becomes the coordinating node mentioned earlier; it uses the routing formula to calculate which shard the write belongs to and forwards the request to the node holding that shard's primary.

Suppose that for a piece of data the routing formula yields shard = hash(routing) % 4 = 0.

The specific flow is as follows:

  • The client sends a write request to node ES1 (the coordinating node); the routing formula yields 0, so the data should be written to primary shard S0.

  • Node ES1 forwards the request to ES3, the node where primary shard S0 is located; ES3 accepts the request and writes it to disk.

  • The data is then replicated concurrently to the two R0 replica shards, with data conflicts controlled via optimistic locking. Once all replica shards report success, node ES3 reports success to the coordinating node, which reports success to the client.

Storage principle

The above describes how ES processes index writes internally; this happens in ES's memory, and after the data has been allocated to specific shards and replicas, it must ultimately be stored on disk so that nothing is lost in a power failure.

The storage path can be set in the configuration file /config/elasticsearch.yml; by default, data is stored under the data folder of the installation directory.

Using the default is not recommended, because an ES upgrade could then cause all the data to be lost:

path.data: /path/to/data  # index data
path.logs: /path/to/logs  # log files

(1) Segment storage

Index documents are stored on disk in the form of segments. What is a segment? The index file is split into multiple subfiles, each called a segment; each segment is itself an inverted index, and segments are immutable: once index data has been written to disk, it can no longer be modified.

This segment-based storage at the bottom layer almost completely avoids locking on reads and writes, greatly improving read and write performance.

When segments are written to disk, a commit point is generated: a file that records all the segments present after the commit.

Once a segment has a commit point, it has only read permission and loses write permission. Conversely, while a segment is in memory it can only be written to and cannot be read, meaning it cannot be searched.

The concept of segments was developed primarily because in early full-text indexes a large inverted index of an entire collection of documents was built and written to disk.

If the index was updated, a completely new index had to be built to replace the old one. This approach is inefficient at large data volumes, and because building an index is costly, the data could not be updated frequently, so timeliness could not be guaranteed.

Index files are stored in segments and cannot be modified, so what about additions, updates, and deletions?

  • Addition: new documents are easy to handle; since the data is new, a new segment is simply added for the current documents.

  • Deletion: since segments cannot be modified, documents are not removed from old segments; instead, a new .del file lists the segment information of the deleted documents.

    A document marked as deleted can still be matched by a query, but it is removed from the result set before the final results are returned.

  • Update: old segments cannot be modified to reflect a document update; an update is effectively a delete plus an add. The old document is marked as deleted in the .del file, and the new version of the document is indexed into a new segment.

    Both versions of the document may be matched by a query, but the deleted old version is removed before the result set is returned.

Setting segments to be unmodifiable has certain advantages and disadvantages, mainly in the following aspects:

  • No locks are needed. If you never update the index, there is no need to worry about multiple processes modifying data at the same time.

  • Once an index is read into the kernel’s file system cache, it remains where it is due to its immutability. As long as there is enough space in the file system cache, most read requests will directly request memory and will not hit disk. This provides a big performance boost.

  • Other caches, such as the Filter cache, remain valid for the lifetime of the index. They don’t need to be rebuilt every time the data changes, because the data doesn’t change.

  • Writing a single large inverted index allows the data to be compressed, reducing disk I/O and the amount of index that must be cached in memory.

The disadvantages of segment immutability are as follows:

  • When old data is deleted, it is not removed immediately but only marked as deleted in the .del file. The old data can only be removed when segments are merged, which wastes a lot of space.

  • If a piece of data is updated frequently, and each update marks the old version and adds a new one, a lot of space is wasted.

  • Each time new data is added, a new segment is needed to store it. When the number of segments grows too large, the consumption of server resources such as file handles becomes very large.

  • Query results include matches from all segments, and old data marked as deleted must be excluded from the result set, which adds to the burden of the query.

(2) Delayed write strategy

Having introduced the storage format, what is the process by which an index is written to disk? Is it fsynced directly to disk?

The answer is obviously no: if data were written directly to disk, disk I/O consumption would severely hurt performance.

Then, under heavy write volume, ES would stall and be unable to respond to queries quickly; if that were the case, ES would not be called a near-real-time full-text search engine.

To improve write performance, ES does not add a segment to disk for each new piece of data; instead it adopts a delayed write strategy.

Whenever a new piece of data arrives, it is first written to memory; between memory and disk sits the file system cache.

When the default interval (1 second) elapses or the data in memory reaches a certain amount, a Refresh is triggered: the in-memory data is written as a new segment to the file system cache, and the segment is later flushed to disk, generating a commit point.

The memory here uses ES’s JVM memory, while the file caching system uses the operating system’s memory.

New data continues to be written to memory, but the data in memory is not stored in segments and therefore cannot be retrieved.

Flushing from memory to the file caching system generates new segments and opens them for search without waiting for them to be flushed to disk.

In Elasticsearch, the lightweight process of writing and opening a new segment is called Refresh (i.e., memory flush to the file caching system).

By default, each shard is automatically refreshed once per second. That’s why we say that Elasticsearch is near real-time search, because changes in documents are not immediately visible to the search, but become visible within a second.

We can also trigger a Refresh manually: POST /_refresh refreshes all indexes, and POST /nba/_refresh refreshes a specified index.

Tips: Although a refresh is a much lighter operation than a commit, it still has a performance overhead. Manual refresh is useful when writing tests, but don't refresh manually after every document in a production environment; and not every use case needs a refresh every second.

Perhaps you are using Elasticsearch to index a large number of log files, and you want to optimize indexing speed rather than near-real-time search.

In that case, you can lower an index's refresh frequency by increasing the value of refresh_interval in its Settings when creating it, for example refresh_interval = "30s". Note the time unit after the value; otherwise the default unit is milliseconds. Setting refresh_interval = -1 turns off the index's automatic refresh.
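The refresh interval can also be changed on an existing index through the settings API; a sketch (index name hypothetical):

PUT /my_index/_settings
{ "index": { "refresh_interval": "30s" } }

Setting it back to "1s" restores the default behavior, and "-1" disables automatic refresh entirely.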

Although the delayed write strategy reduces how often data is written to disk and improves overall write capability, the file system cache is still memory, operating-system memory, and memory always risks losing data in a power failure or other abnormal situation.

To avoid data loss, Elasticsearch adds a translog (transaction log), which records all data that has not yet been persisted to disk.

With the transaction log added, the whole index-writing process is as shown in the figure above:

  • After a new document is indexed, it is first written to memory, but to guard against data loss, a copy of the data is appended to the transaction log.

    New documents keep being written to memory, and each is also recorded in the transaction log. At this point, the new data cannot yet be retrieved or queried.

  • When the default refresh interval elapses or the data in memory reaches a certain amount, a Refresh is triggered: the data in memory is flushed as a new segment to the file system cache, and the memory is cleared. Although the new segment has not yet been committed to disk, it can now serve document retrieval, and it can no longer be modified.

  • As new document indexes keep being written, a Flush is triggered when the log data exceeds 512 MB in size or 30 minutes in age.

    The in-memory data is written as a new segment to the file system cache, the data in the file system cache is flushed to disk via fsync, a commit point is generated, the old log file is deleted, and an empty new log is created.

In this way, after a power loss or a restart, ES not only loads the persisted segments according to the commit point, but also replays the records in the translog to re-persist any unpersisted data to disk, avoiding the possibility of data loss.

(3) Segment merging

Because the automatic refresh process creates a new segment every second, the number of segments skyrockets in a short time, and having too many segments causes trouble.

Each segment consumes file handles, memory, and CPU run cycles. What’s more, each search request must take turns examining each segment and then merging the query results, so the more segments, the slower the search.

Elasticsearch solves this problem by periodically merging segments behind the scenes. Small segments are merged into large segments, which are then merged into larger segments.

Segment merging purges old deleted documents from the file system. Deleted documents are not copied to a new large segment. Indexing and searching are not interrupted during the merge process.

Segment merging occurs automatically when indexing and searching, and the merge process selects a small subset of similarly sized segments and merges them into larger segments behind the scenes, which can be either uncommitted or committed.

When merging ends, the old segments are deleted, the new segment is flushed to disk, a new commit point is written that includes the new segment and excludes the old, smaller segments, and the new segment is opened for search.

Segment merging is computationally intensive and eats up a lot of disk I/O, which can slow down write rates and affect search performance if left unchecked.

Elasticsearch limits resources on the merge process by default, so the search still has enough resources to perform well.
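If you do need to merge manually, for example on an old index that no longer receives writes, the force-merge API can reduce it to a single segment (index name hypothetical); use it sparingly, since it bypasses the throttling of background merging:

POST /my_index/_forcemerge?max_num_segments=1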

Performance optimization

Storage devices

Disk is often a bottleneck on modern servers. Elasticsearch uses disk heavily, and the more throughput your disk can handle, the more stable your node will be.

Here are some tips for optimizing disk I/O:

  • Use SSDs. As mentioned elsewhere, they are far better than mechanical disks.

  • Use RAID 0. Striped RAID increases disk I/O, at the obvious cost that the failure of a single drive takes down the whole array. Do not use mirrored or parity RAID, since replicas already provide that function.

  • In addition, use multiple hard drives and allow Elasticsearch to stripe data to them through multiple path.data directory configurations.

  • Do not use remotely mounted storage such as NFS or SMB/CIFS; the latency it introduces runs completely counter to performance.

  • If you're using EC2, beware of EBS. Even SSD-backed EBS is usually slower than local instance storage.

Internal index optimization

To find a term quickly, Elasticsearch first sorts all the terms and then looks a term up by binary search, with time complexity log N, just like looking a word up in a dictionary: this is the Term Dictionary.

So far this looks similar to how a traditional database uses a B-Tree. But if there are too many terms, the term dictionary becomes huge, and keeping it all in memory is unrealistic, hence the Term Index.

The term index is like the index page of a dictionary, listing which terms begin with 'A' and which page they are on; it can be understood as a tree.

The tree does not contain all terms, only term prefixes. The term index lets you quickly locate an offset into the term dictionary, from which you scan sequentially.

In memory, the term index is compressed as an FST (finite state transducer), which stores all the terms in bytes; this effectively reduces its size so that the term index fits in memory, at the cost of more CPU during lookups.

Posting lists stored on disk are also compressed to reduce the storage space they occupy.
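A tiny Python sketch of the term-dictionary idea: keep the terms sorted and binary-search them. Lucene's real on-disk structures and the FST are far more compact than a plain sorted list; this only illustrates the log N lookup.

import bisect

terms = sorted(["best", "is", "java", "javascript",
                "language", "php", "programming", "the"])

def find_term(term):
    # binary search: O(log N) comparisons, like flipping to a dictionary page
    i = bisect.bisect_left(terms, term)
    return i if i < len(terms) and terms[i] == term else -1

print(find_term("java"))  # position in the sorted term dictionary
print(find_term("rust"))  # -1: not present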

Adjust configuration parameters

Recommendations for tuning configuration parameters are as follows:

  • Give each document an ordered, well-compressible sequential-pattern ID, and avoid random IDs such as UUID-4, which compress poorly and noticeably slow Lucene down.

  • Disable doc values for indexed fields that need neither aggregation nor sorting. Doc values are an ordered list of document => field-value mappings.

  • For fields that do not need fuzzy retrieval, use the Keyword type instead of Text, which avoids tokenizing these texts before indexing.

  • If your search results do not need near-real-time accuracy, consider setting index.refresh_interval to 30s per index.

    If you are doing a bulk import, you can turn Refresh off during the import by setting this value to -1, and also turn off replicas by setting index.number_of_replicas: 0. Don't forget to turn them back on when finished (a sketch follows after this list).

  • Avoid deep-pagination queries; use Scroll for paging instead. During an ordinary paged query, an empty priority queue of size from + size is created, and each shard returns from + size results, which by default contain only the document ID and the score, to the coordinating node.

    With N shards, the coordinating node re-sorts (from + size) × N results and picks out the documents actually needed. When from is large, the sorting process becomes heavy and consumes CPU severely.

  • Reduce the mapping fields; only include fields that need to be retrieved, aggregated, or sorted. Other fields can live in another store such as HBase: after getting results in ES, go to HBase to fetch those fields.

  • Specify a routing value at index-creation and query time, so queries can be narrowed to specific shards, improving query efficiency. When choosing the routing key, pay attention to keeping data evenly distributed.
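A sketch of the bulk-import tuning mentioned above (index name hypothetical): relax the settings before the import, then restore them afterwards.

PUT /my_index/_settings
{ "index": { "refresh_interval": "-1", "number_of_replicas": 0 } }

(run the bulk import)

PUT /my_index/_settings
{ "index": { "refresh_interval": "1s", "number_of_replicas": 1 } }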

JVM tuning

JVM tuning recommendations are as follows:

  • Ensure that the minimum heap size (Xms) and the maximum heap size (Xmx) are the same, to prevent the program from resizing the heap at runtime.

    By default, Elasticsearch is set to 1 GB of heap memory after installation. This can be changed in the /config/jvm.options file, but preferably to no more than 50% of physical memory and no more than 32 GB (see the snippet after this list).

  • GC uses CMS by default, which is concurrent but has stop-the-world pauses; you can consider switching to the G1 collector.

  • ES relies heavily on the filesystem cache to make search fast. In general, you should make sure that at least half of the available physical memory is left to the filesystem cache.
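A sketch of the corresponding config/jvm.options fragment, assuming a machine with 32 GB of RAM, half of which is left to the filesystem cache:

# keep Xms and Xmx equal so the heap never resizes at runtime
-Xms16g
-Xmx16g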

Source | r6a.cn/cmsA
