Question 1: I want to migrate from my current Alibaba Cloud Elasticsearch 6.7 commercial deployment to my own self-built 7.10 environment. The certificates are different, so remote CCR is not possible. Is there a real-time synchronization tool, or is Logstash the only option?

Question 2: Are there any components or solutions for synchronizing index data between two Elasticsearch clusters?

These questions come up often. They involve migrating or synchronizing index data across versions, networks, and clusters. Let's break it down.
2.1 Cross-version
7.X is the current mainstream version, while early business systems may still be on 6.X, 5.X, or even 2.X and 1.X.
When synchronizing data, one question matters: what is different between 7.X and earlier versions?
Version 7.X has gone through 12+ minor releases (7.0 through 7.12), and 7.0 was released on 2019-04-10, more than two years ago.
The core point to watch for when synchronizing is the removal of mapping types. The official statement puts it best: "Before 7.0.0, the mapping definition included a type name. Elasticsearch 7.0.0 and later no longer accept a default mapping."
- Version 7.X: the type is fixed as _doc.
- Version 6.X: the type concept still exists and can be user-defined.
Practical example: specifying a type when writing data in 7.X:
PUT test-002/mytype/1
{
  "title": "testing"
}
There will be a warning like this:
#! [types removal] Specifying types in document index requests is deprecated, use the typeless endpoints instead (/{index}/_doc/{id}, /{index}/_doc, or /{index}/_create/{id}).
https://www.elastic.co/guide/en/elasticsearch/reference/current/mapping.html
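As the warning suggests, in 7.X the typeless endpoint should be used instead. A minimal equivalent write (same index and document as above, just without a custom type):

PUT test-002/_doc/1
{
  "title": "testing"
}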
2.2 Cross-network
The two clusters are not on the same local network; one is in the cloud and the other is on-premises.
This is a common business scenario, and one I have handled myself at least once.
2.3 Cross-cluster
The source and destination data live in two different clusters.
3. Comparison of synchronization solutions
Let's walk through the following synchronization solutions hands-on, interpreting them as we go.
3.0 Actual Environment Preparation
For demonstration purposes, the environment is simplified; the same principles apply to more complex environments.
- Cluster 1: cloud, single-node source cluster: 172.21.0.14:19022.
- Cluster 2: cloud, single-node target cluster: 172.21.0.14:19205.
- The two clusters share one cloud server: 4 CPU cores, 8 GB of memory.
- Both clusters run the same version: 7.12.0.
- Test data: 1 million documents (generated by a script).
An individual record looks like this:

"_source" : {
  "name" : "9UCROh3",
  "age" : 16,
  "last_updated" : 1621579460000
}
3.1 Solution 1: reindex cross-cluster synchronization
3.1.1 Reindex precondition: set a whitelist
On the target cluster only, whitelist the source cluster in elasticsearch.yml:

reindex.remote.whitelist: "172.21.0.14:19022"

Note: do not run the reindex below in Kibana Dev Tools unless you have increased the default timeout; for a long-running job, submit it asynchronously instead (see the sketch after the parameter list below).
3.1.2 Reindex synchronization

POST _reindex
{
  "source": {
    "remote": {
      "host": "http://172.21.0.14:19022"
    },
    "index": "test_data",
    "size": 10000,
    "slice": {
      "id": 0,
      "max": 5
    }
  },
  "dest": {
    "index": "test_data_from_reindex"
  }
}
The two core parameters are:
- size: the scroll batch size; the default is 1000, and here it is set 10 times larger, to 10000.
- slice: splits one large request into smaller requests that execute concurrently. (PS: my usage here is not strict.)
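To avoid the Kibana Dev Tools timeout mentioned in 3.1.1, the same reindex can be submitted as a background task and polled for progress; a minimal sketch using the hosts and index names above:

POST _reindex?wait_for_completion=false
{
  "source": {
    "remote": {
      "host": "http://172.21.0.14:19022"
    },
    "index": "test_data"
  },
  "dest": {
    "index": "test_data_from_reindex"
  }
}

# The response returns a task id; substitute it for <task_id> to check progress.
GET _tasks/<task_id>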
3.1.3 Reindex synchronization conclusion
In the scripted test, reindex synchronized 1 million documents in 34 s.
3.2 Solution 2: Elasticdump synchronization
https://github.com/elasticsearch-dump/elasticsearch-dump
3.2.1 elasticdump installation notes
[root@VM-0-14-centos test]# node -v
v12.13.1
[root@VM-0-14-centos test]# npm -v
6.12.1
[root@VM-0-14-centos test]# elasticdump --help
elasticdump: Import and export tools for elasticsearch
version: 6.71.0
Usage: elasticdump --input SOURCE --output DESTINATION [OPTIONS]
... ...
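elasticdump is distributed as an npm package, so with Node.js and npm in place (versions as shown above), a global install is typically all that is needed:

npm install -g elasticdump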
3.2.2 elasticdump synchronization in practice

elasticdump \
  --input=http://172.21.0.14:19022/test_data \
  --output=http://172.21.0.14:19205/test_data_from_dump \
  --type=analyzer

elasticdump \
  --input=http://172.21.0.14:19022/test_data \
  --output=http://172.21.0.14:19205/test_data_from_dump \
  --type=mapping

elasticdump \
  --input=http://172.21.0.14:19022/test_data \
  --output=http://172.21.0.14:19205/test_data_from_dump \
  --type=data \
  --concurrency=5 \
  --limit=10000
The parameters above are largely self-explanatory:
- input: the source cluster index.
- output: the target cluster index.
- analyzer: synchronize the analyzers.
- mapping: synchronize the mapping schema.
- data: synchronize the data.
- concurrency: the number of concurrent requests.
- limit: the number of documents synchronized per request; the default is 100.
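A quick way to confirm the copy is complete is to compare document counts on the two clusters, for example:

curl "http://172.21.0.14:19022/_cat/count/test_data?v"
curl "http://172.21.0.14:19205/_cat/count/test_data_from_dump?v"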
3.2.3 elasticdump synchronization conclusion
elasticdump synchronized 1 million documents in 106 s.
3.3 Solution 3: ESM tool synchronization
ESM is an open-source tool from medcl (Elasticsearch Dumper), written in Go.
Address: https://github.com/medcl/esm
3.3.1 ESM tool installation notes
Requires Go version >= 1.7.
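The post does not show the installation itself. As a rough sketch under stated assumptions: on GOPATH-era Go (>= 1.7, before modules) such tools were commonly fetched and built with go get, and prebuilt binaries may also be available on the repository's releases page; check the repo for the exact method.

# Assumes GOPATH-style Go; on module-aware Go, prefer a prebuilt release binary.
go get github.com/medcl/esm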
3.3.2 ESM tool synchronization in practice

esm -s http://172.21.0.14:19022 -d http://172.21.0.14:19205 -x test_data -y test_data_from_esm -w=5 -b=10 -c 10000

- w: concurrency.
- b: bulk size, in MB.
- c: scroll batch size.
3.3.3 ESM tool synchronization conclusion
ESM synchronized 1 million documents in 38 s, which is extremely fast.

esm -s http://172.21.0.14:19022 -d http://172.21.0.14:19205 -x test_data -y test_data_from_esm -w=5 -b=10 -c 10000
test_data
[05-19 13:44:58] [INF] [main.go:474,main] start data migration..
Scroll 1000000 / 1000000 [==========================================] 100.00% 38s
Bulk 999989 / 1000000 [==========================================] 100.00% 38s
[05-19 13:45:36] [INF] [main.go:505,main] data migration finished.

During synchronization, CPU usage spiked, indicating that the concurrency parameter took effect.
3.4 Solution 4: logstash synchronization
3.4.1 logstash synchronization notes
This article uses logstash 7.12.0; the required plugins, logstash-input-elasticsearch and logstash-output-elasticsearch, are bundled with it and do not need to be installed separately.
Note: the plugin names configured inside input and output must be lowercase. Many blog posts get this wrong, so verify in practice.
3.4.2 logstash synchronization configuration

input {
  elasticsearch {
    hosts => ["172.21.0.14:19022"]
    index => "test_data"
    size => 10000
    scroll => "5m"
    codec => "json"
    docinfo => true
  }
}
filter {}
output {
  elasticsearch {
    hosts => ["172.21.0.14:19205"]
    index => "test_data_from_logstash"
  }
}
3.4.3 logstash synchronization conclusion
logstash synchronized 1 million documents in 74 s.
3.5 Solution 5: snapshot & restore synchronization
3.5.1 Snapshot & restore configuration notes
Configure the snapshot storage path in elasticsearch.yml in advance:

path.repo: ["/home/elasticsearch/elasticsearch-7.12.0/backup"]

For detailed configuration, see: Practical guide | Elasticsearch 7.X cluster and index backup and restore.
3.5.2 Snapshot & restore in practice

# On one cluster: register the repository and create the snapshot
PUT /_snapshot/my_backup
{
  "type": "fs",
  "settings": {
    "location": "/home/elasticsearch/elasticsearch-7.12.0/backup"
  }
}

PUT /_snapshot/my_backup/snapshot_testdata_index?wait_for_completion=true
{
  "indices": "test_data_from_dump",
  "ignore_unavailable": true,
  "include_global_state": false,
  "metadata": {
    "taken_by": "mingyi",
    "taken_because": "backup before upgrading"
  }
}

# On the other cluster: restore the snapshot
curl -XPOST "http://172.21.0.14:19022/_snapshot/my_backup/snapshot_testdata_index/_restore"
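Note that the restoring cluster must also have a repository named my_backup registered against the same shared location before the _restore call will succeed. Progress can then be checked with the standard status APIs, for example:

curl "http://172.21.0.14:19022/_snapshot/my_backup/snapshot_testdata_index/_status"
curl "http://172.21.0.14:19022/_cat/recovery/test_data_from_dump?v"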
3.5.3 Snapshot & restore conclusion
- Snapshot creation time: 2 s.
- Snapshot restore time: under 1 s.
4. Summary
This article presents five solutions for synchronizing Elasticsearch data across networks and clusters, and verifies them in a hands-on (simulated) environment.
The preliminary results are as follows:
- reindex: 1 million documents in 34 s.
- ESM: 1 million documents in 38 s.
- logstash: 1 million documents in 74 s.
- elasticdump: 1 million documents in 106 s.
- snapshot & restore: snapshot in 2 s, restore in under 1 s (file transfer not counted).
Of course, these conclusions are not absolute and are for reference only.
Each synchronization tool is essentially a combination of scroll + bulk + multi-threading.
The essential differences lie in the development language, the implementation of concurrency, and so on.
- reindex: built into Elasticsearch, implemented in Java.
- ESM: developed in Go.
- logstash: based on Ruby + Java.
- elasticdump: based on JavaScript (Node.js).
Snapshots involve copying files between machines, and the speed is constrained by network bandwidth, so they are not included in the timing comparison.
How to choose? After reading this article, you should have a good sense of it.
- The reindex solution requires configuring a whitelist, and snapshot & restore requires configuring the snapshot repository and transferring the snapshot files.
- ESM, logstash, and elasticdump require no special configuration.
How long synchronization takes depends on cluster size, the hardware of each node, the data types involved, and any write optimizations applied.
How do you synchronize data in your own projects? Feel free to leave a comment and discuss.