  • Question 1: I want to migrate from the current Elasticsearch 6.7 commercial version on Alibaba Cloud to my own self-built 7.10 environment. The certificates differ, so remote CCR is not possible. Is there a real-time synchronization tool, or is Logstash the only option?

  • Question 2: Are there any components or solutions for synchronizing index data between two Elasticsearch clusters?

These questions come up often. They involve migrating or synchronizing index data across versions, networks, and clusters. Let's break them down:

2.1 Cross-version

7.X is the current mainstream version, but early business systems often stay on 6.X, 5.X, or even 2.X and 1.X.

When synchronizing data, pay attention to the differences between 7.X and earlier versions.

Version 7.X has gone through 12+ minor-version iterations (7.0 through 7.12); version 7.0 was released on 2019-04-10, more than two years ago.

A core point to focus on for synchronization, and the official statement puts it best: "Before 7.0.0, the mapping definition included a type name. Elasticsearch 7.0.0 and later no longer accept a default mapping."

  • Version 6.X: the type concept still exists and can be user-defined.

  • Version 7.X: type is fixed to _doc.

Practical example: specifying a type when writing data in 7.X:

PUT test-002/mytype/1
{
  "title": "testing"
}

There will be a warning like this:

 #! [types removal] Specifying types in document index requests is deprecated, use the typeless endpoints instead (/{index}/_doc/{id}, /{index}/_doc, or /{index}/_create/{id}). 

https://www.elastic.co/guide/en/elasticsearch/reference/current/mapping.html
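The fix, as the warning itself indicates, is to use the typeless endpoint in 7.X:

```
PUT test-002/_doc/1
{
  "title": "testing"
}
```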

2.2 Cross-network

The two clusters are not on the same local area network: one is in the cloud, the other is on-premises.

This is one of the common business scenarios; I have handled it myself more than once.

2.3 Cross-cluster

The source and destination data are distributed across two different clusters.

3. Synchronization scheme comparison

We will walk through the following synchronization schemes hands-on.

3.0 Test Environment Preparation

For convenience of demonstration, the environment has been simplified. For complex environments, the principle is the same.

  • Cluster 1: cloud, single-node source cluster: 172.21.0.14:19022.

  • Cluster 2: cloud, single-node target cluster: 172.21.0.14:19205.

  • The two clusters share one cloud server (CPU: 4 cores, memory: 8 GB).

Both clusters run the same version: 7.12.0.

Test data: 1,000,000 documents (automatically generated by script).

An individual record looks like this:

"_source" : {
  "name" : "9UCROh3",
  "age" : 16,
  "last_updated" : 1621579460000
}
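The generation script itself is not shown in the original; a minimal Python sketch that produces documents of the same shape as the record above, in Elasticsearch bulk NDJSON format, could look like this (index name, document count, and field values are illustrative):

```python
import json
import random
import string
import time

def gen_bulk_lines(index, n):
    """Yield NDJSON lines for the Elasticsearch _bulk API:
    an action line followed by a source line per document."""
    for i in range(n):
        yield json.dumps({"index": {"_index": index, "_id": str(i)}})
        yield json.dumps({
            "name": "".join(random.choices(string.ascii_letters + string.digits, k=7)),
            "age": random.randint(1, 99),
            "last_updated": int(time.time() * 1000),  # epoch millis, like 1621579460000
        })

# Write a small batch to a file that can be POSTed to /_bulk
with open("test_data.ndjson", "w") as f:
    for line in gen_bulk_lines("test_data", 1000):
        f.write(line + "\n")
```

The resulting file can be loaded with `curl -H 'Content-Type: application/x-ndjson' -XPOST 'http://<host>/_bulk' --data-binary @test_data.ndjson`.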

3.1 Solution 1: Reindex cross-cluster synchronization

3.1.1 Reindex precondition: set a whitelist

Whitelist the source cluster in the target cluster's elasticsearch.yml (reindex.remote.whitelist is a static setting and can only be configured there):

reindex.remote.whitelist: "172.21.0.14:19022"

Note: do not run the following in Kibana Dev Tools unless you have increased the default timeout, because synchronizing this much data takes longer than the default request timeout.
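One common way to avoid client timeouts on large reindexes (my suggestion, not part of the original workflow) is to submit the reindex as a background task with wait_for_completion=false and poll the task API, e.g. via curl:

```
curl -XPOST "http://172.21.0.14:19205/_reindex?wait_for_completion=false" \
  -H 'Content-Type: application/json' -d'
{
  "source": {
    "remote": { "host": "http://172.21.0.14:19022" },
    "index": "test_data"
  },
  "dest": { "index": "test_data_from_reindex" }
}'

# The response contains {"task": "<node_id>:<task_id>"}; check progress with:
curl "http://172.21.0.14:19205/_tasks/<node_id>:<task_id>"
```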

3.1.2 Reindex synchronization

POST _reindex
{
  "source": {
    "remote": {
      "host": "http://172.21.0.14:19022"
    },
    "index": "test_data",
    "size": 10000,
    "slice": {
      "id": 0,
      "max": 5
    }
  },
  "dest": {
    "index": "test_data_from_reindex"
  }
}

The two core parameters are:

  • size: the default scroll batch size is 1000; here it is set 10 times larger, to 10000.

  • slice: splits one large request into smaller requests that execute concurrently. (PS: my usage here is not strictly correct.)
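One caveat worth knowing: reindex from a remote cluster does not support manual or automatic slicing, so the slice block above only takes effect for local index-to-index reindexing. For a local copy, the five slices could be launched concurrently like this (a sketch; the destination index name is illustrative):

```
for id in 0 1 2 3 4; do
  curl -s -XPOST "http://172.21.0.14:19205/_reindex" \
    -H 'Content-Type: application/json' -d"
  {
    \"source\": { \"index\": \"test_data\", \"slice\": { \"id\": ${id}, \"max\": 5 } },
    \"dest\": { \"index\": \"test_data_sliced\" }
  }" &
done
wait
```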

3.1.3 Reindex synchronization conclusion

> Script test: reindex synchronized 1,000,000 documents in 34 s.

3.2 Solution 2: Elasticdump synchronization

https://github.com/elasticsearch-dump/elasticsearch-dump

3.2.1 elasticdump installation notes

  • elasticdump depends on Node.js; a Node version of 8.0+ is required.

[root@VM-0-14-centos test]# node -v
v12.13.1
[root@VM-0-14-centos test]# npm -v
6.12.1
[root@VM-0-14-centos test]# elasticdump --help
elasticdump: Import and export tools for elasticsearch
version: 6.71.0
Usage: elasticdump --input SOURCE --output DESTINATION [OPTIONS]

3.2.2 elasticdump synchronization practice

elasticdump \
  --input=http://172.21.0.14:19022/test_data \
  --output=http://172.21.0.14:19205/test_data_from_dump \
  --type=analyzer

elasticdump \
  --input=http://172.21.0.14:19022/test_data \
  --output=http://172.21.0.14:19205/test_data_from_dump \
  --type=mapping

elasticdump \
  --input=http://172.21.0.14:19022/test_data \
  --output=http://172.21.0.14:19205/test_data_from_dump \
  --type=data \
  --concurrency=5 \
  --limit=10000


The parameters above are largely self-explanatory:

  • input: source cluster index.

  • output: target cluster index.

  • analyzer: synchronize the analyzers.

  • mapping: synchronize the mapping schema.

  • data: synchronize the data.

  • concurrency: number of concurrent requests.

  • limit: number of documents synchronized per request; the default is 100.
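elasticdump also accepts a --searchBody option to restrict which documents are exported. A sketch of an incremental-style sync (assuming last_updated is mapped as a date field), copying only documents updated in the last day:

```
elasticdump \
  --input=http://172.21.0.14:19022/test_data \
  --output=http://172.21.0.14:19205/test_data_from_dump \
  --type=data \
  --searchBody='{"query": {"range": {"last_updated": {"gte": "now-1d"}}}}'
```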

3.2.3 elasticdump synchronization conclusion

elasticdump synchronized 1,000,000 documents in 106 s.

3.3 Solution 3: ESM tool synchronization

ESM (Elasticsearch Dumper) is an open-source tool from medcl, written in Go.

Address: https://github.com/medcl/esm

3.3.1 ESM installation notes

Requires Go version >= 1.7.

3.3.2 ESM synchronization practice

esm -s http://172.21.0.14:19022 -d http://172.21.0.14:19205 -x test_data -y test_data_from_esm -w=5 -b=10 -c 10000

  • w: concurrency.

  • b: bulk size, in MB.

  • c: scroll batch size.

3.3.3 ESM synchronization conclusion

ESM synchronized 1,000,000 documents in 38 s, which is extremely fast.

esm -s http://172.21.0.14:19022 -d http://172.21.0.14:19205 -x test_data -y test_data_from_esm -w=5 -b=10 -c 10000
[05-19 13:44:58] [INF] [main.go:474,main] start data migration..
Scroll 1000000 / 1000000 [==========================================] 100.00% 38s
Bulk 999989 / 1000000 [==========================================] 100.00% 38s
[05-19 13:45:36] [INF] [main.go:505,main] data migration finished.

During synchronization the CPU usage spiked, indicating that the concurrency parameter took effect.

3.4 Solution 4: Logstash synchronization

3.4.1 Logstash synchronization notes

This article is based on Logstash 7.12.0; the relevant plugins, logstash-input-elasticsearch and logstash-output-elasticsearch, come pre-installed and do not need to be installed separately.

Note: the configured input and output are plugin names and must be lowercase. Many blog posts get this wrong; verify in practice.

3.4.2 Logstash synchronization practice

input {
  elasticsearch {
    hosts => ["172.21.0.14:19022"]
    index => "test_data"
    size => 10000
    scroll => "5m"
    codec => "json"
    docinfo => true
  }
}

filter {}

output {
  elasticsearch {
    hosts => ["172.21.0.14:19205"]
    index => "test_data_from_logstash"
  }
}

3.4.3 Logstash synchronization conclusion

Logstash synchronized 1,000,000 documents in 74 s.
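Returning to the opening question about near-real-time synchronization: the logstash-input-elasticsearch plugin also supports a cron-style schedule parameter plus a query filter, which allows periodic incremental pulls. A sketch (assuming last_updated is mapped as a date field; the 5-minute interval is illustrative):

```
input {
  elasticsearch {
    hosts => ["172.21.0.14:19022"]
    index => "test_data"
    query => '{ "query": { "range": { "last_updated": { "gte": "now-5m" } } } }'
    schedule => "*/5 * * * *"
    docinfo => true
  }
}

output {
  elasticsearch {
    hosts => ["172.21.0.14:19205"]
    index => "test_data_from_logstash"
    document_id => "%{[@metadata][_id]}"
  }
}
```

Reusing the source _id as document_id keeps repeated pulls idempotent: re-synced documents overwrite instead of duplicating.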

3.5 Solution 5: Snapshot & Restore synchronization

3.5.1 Snapshot & Restore configuration notes

Configure the snapshot storage path in advance in the elasticsearch.yml configuration file:

path.repo: ["/home/elasticsearch/elasticsearch-7.12.0/backup"]

For detailed configuration, see: "Dry goods | Elasticsearch 7.X cluster and index backup and recovery."

3.5.2 Snapshot & Restore practice

# On the source cluster: create the snapshot repository
PUT /_snapshot/my_backup
{
  "type": "fs",
  "settings": {
    "location": "/home/elasticsearch/elasticsearch-7.12.0/backup"
  }
}

# Create the snapshot
PUT /_snapshot/my_backup/snapshot_testdata_index?wait_for_completion=true
{
  "indices": "test_data_from_dump",
  "ignore_unavailable": true,
  "include_global_state": false,
  "metadata": {
    "taken_by": "mingyi",
    "taken_because": "backup before upgrading"
  }
}

# On the other cluster: restore the snapshot
curl -XPOST "http://172.21.0.14:19022/_snapshot/my_backup/snapshot_testdata_index/_restore"

3.5.3 Snapshot & Restore conclusion

  • Snapshot creation time: 2 s.

  • Snapshot restore time: under 1 s.
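Whichever route is taken, it is worth verifying the result, for example by checking the snapshot details and the restored document count (index and snapshot names as used above):

```
GET /_snapshot/my_backup/snapshot_testdata_index

GET test_data_from_dump/_count
```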

4. Summary

This article presented five solutions for data synchronization (simulated) between Elasticsearch clusters across networks, and verified them in a practical environment.

The preliminary verification results (1,000,000 documents each) are as follows:

  • reindex: 34 s.

  • ESM: 38 s.

  • Logstash: 74 s.

  • elasticdump: 106 s.

  • Snapshot & Restore: snapshot 2 s + restore under 1 s (file transfer time not counted).

Of course, these conclusions are not absolute and are for reference only.

Each synchronization tool is essentially a combination of scroll + bulk + multi-threading.

The essential differences lie in the development language, the implementation of concurrent processing, and so on:

  • reindex: built into Elasticsearch (Java).

  • ESM: developed in Go.

  • Logstash: based on Ruby + Java.

  • elasticdump: based on JavaScript.

Snapshots involve off-site copies of files, where the speed constraint is network bandwidth, so they are not included in this comparison.
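To make the "scroll + bulk" essence concrete, here is a minimal single-threaded sketch in Python (my own illustration, not code from any of the tools above; cluster URLs and index names are placeholders):

```python
import json
import urllib.request

def to_bulk_body(hits, dest_index):
    """Convert one page of scroll hits into an Elasticsearch _bulk body (NDJSON)."""
    lines = []
    for hit in hits:
        lines.append(json.dumps({"index": {"_index": dest_index, "_id": hit["_id"]}}))
        lines.append(json.dumps(hit["_source"]))
    return "\n".join(lines) + "\n"

def _post(url, body, ctype="application/json"):
    req = urllib.request.Request(url, data=body.encode(),
                                 headers={"Content-Type": ctype})
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())

def sync(src, dest, src_index, dest_index, size=1000):
    """Scroll through src_index on the source cluster, bulk-writing each page
    to dest_index on the destination cluster."""
    page = _post(f"{src}/{src_index}/_search?scroll=5m",
                 json.dumps({"size": size, "query": {"match_all": {}}}))
    while page["hits"]["hits"]:
        _post(f"{dest}/_bulk",
              to_bulk_body(page["hits"]["hits"], dest_index),
              ctype="application/x-ndjson")
        page = _post(f"{src}/_search/scroll",
                     json.dumps({"scroll": "5m", "scroll_id": page["_scroll_id"]}))
```

The real tools add concurrency on top of this loop (threads, slices, or goroutines), which is where their speed differences come from.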

How to choose? After reading this article, you should have a good sense.

  • The reindex scheme requires configuring a whitelist; snapshot and restore require configuring the snapshot repository and transferring its files.

  • ESM, Logstash, and elasticdump synchronization require no special configuration.

The elapsed time is related to cluster size, the hardware configuration of each node, the data types involved, and any write optimizations applied.

How do you synchronize data in actual development? Feel free to leave a comment and discuss.