Abstract: This article is compiled from "Tens of Billions of Real-time Data Entering the Lake", a talk given by Chen Junjie, senior engineer on Tencent's Data Lake R&D team, at the Flink Meetup in Shanghai on April 17.

1. Tencent Data Lake introduction

As can be seen from the figure above, the platform is fairly large: it covers data access, upper-layer analysis, intermediate management (such as task management, analysis management, and engine management), and, at the lowest layer, the table format.

2. Landing scenarios for tens of billions of data

1. Traditional platform architecture

As shown in the figure above, traditional platform architectures came down to two options: the Lambda architecture and the Kappa architecture.

  • In the Lambda architecture, batch and streaming are separated, so operations has to maintain two clusters, one for Spark/Hive and one for Flink. This causes several problems:

    • first, the operations and maintenance cost is high;
    • second, the development cost is high: the business side has to write both Spark and Flink (or SQL), which overall is not friendly to data analysts.

  • The Kappa architecture is essentially a message queue: data goes through the underlying transport and is analyzed downstream. Built on Kafka, it is relatively fast and delivers a degree of real-time performance.

Both architectures have their pros and cons, but the biggest shared problem is that storage is not unified, which leaves the data pipeline fragmented. Our platform has now adopted Iceberg; the sections below walk through the problems we encountered and how we solved them, scenario by scenario.

2. Scenario 1: Mobile QQ security data entering the lake

Mobile QQ security data entering the lake is a very typical scenario.

The current business flow is: the TubeMQ message queue lands into an Iceberg ODS layer through Flink; Flink then joins it with the user dimension table to build a wide table for queries, which is stored in COS and may feed further BI analysis.
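The widening join at the heart of this pipeline, enriching each raw message with user-dimension attributes, can be sketched as follows. This is an illustrative Python sketch, not the production Flink job; all field and function names are hypothetical:

```python
# Illustrative sketch of the widening join (hypothetical names): each raw
# message from the queue is enriched with user-dimension attributes to form
# a "wide" row for downstream queries and BI analysis.
def widen(event, user_dim):
    """Join one ODS event with the user dimension table (a lookup by uid)."""
    user = user_dim.get(event["uid"], {})
    return {
        **event,
        "province": user.get("province"),   # dimension attributes
        "city": user.get("city"),
        "risk_level": user.get("risk_level"),
    }

user_dim = {42: {"province": "GD", "city": "SZ", "risk_level": "low"}}
events = [{"uid": 42, "ts": 1, "action": "login"}]
wide_rows = [widen(e, user_dim) for e in events]
```

In the real job the dimension table has 2.8 billion rows, so the lookup is a distributed join rather than an in-memory dict.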

This flow may look ordinary, but keep in mind that the user dimension table in Mobile QQ has 2.8 billion rows and the message queue carries tens of billions of messages per day, so it faces real challenges.

■ Small File Challenge

1. The Flink writer generates small files

Flink writes without a shuffle, and the scattered, out-of-order data produces many small files.

2. Strict latency requirements

Latency requirements keep the checkpoint interval short and the commit interval small, so the resulting files are tiny.
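A back-of-the-envelope calculation (with hypothetical but representative numbers, not the actual production configuration) shows why a short commit interval explodes the file count:

```python
# Rough file-count estimate for a partitioned streaming sink: every
# checkpoint commits at least one file per writer per touched partition.
writers = 100                 # Flink writer parallelism (assumed)
partitions_per_writer = 50    # partitions each writer touches per commit (assumed)
checkpoint_interval_s = 60    # short interval to meet latency targets

commits_per_day = 24 * 3600 // checkpoint_interval_s          # 1440 commits
files_per_day = writers * partitions_per_writer * commits_per_day
print(files_per_day)  # 7,200,000 new files per day
```

Even with modest parallelism, millions of files per day accumulate, which is exactly the explosion described below.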

3. Small files explode

Within a few days, the number of metadata files and small data files exploded simultaneously, putting huge pressure on the cluster.

4. Merging small files amplifies the problem

To solve the small-file problem, we enabled an Action to merge them, which in turn generated even more files.

5. No time to delete data

Deleting snapshots and orphan files requires scanning too many files, putting huge pressure on the NameNode.


■ Solution 1: Flink synchronous merge

  • add a small-file merge operator;

  • add an automatic snapshot cleanup mechanism.


■ Solution 2: Spark asynchronous merge

  • add background services for small-file merging and orphan-file deletion;

  • add small-file filtering logic to delete small files gradually;

  • add per-partition merge logic to avoid generating too many delete files at once and causing the task to OOM.
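The per-partition logic in the last bullet can be sketched as follows: processing one partition at a time, in bounded batches, so a run never materializes the full file list in memory. This is an illustrative sketch with hypothetical callback names, not the actual Spark service:

```python
def cleanup_by_partition(list_partitions, list_files, delete_file, batch_size=1000):
    """Delete obsolete small files partition by partition, in bounded batches,
    so a single run never holds the whole file list in memory (OOM risk).

    list_partitions: () -> iterable of partition keys
    list_files:      partition -> iterable of file paths to delete
    delete_file:     path -> None
    """
    deleted = 0
    for part in list_partitions():
        batch = []
        for path in list_files(part):
            batch.append(path)
            if len(batch) >= batch_size:
                for p in batch:
                    delete_file(p)
                deleted += len(batch)
                batch.clear()
        for p in batch:          # flush the final partial batch
            delete_file(p)
        deleted += len(batch)
    return deleted
```

Bounding work per partition is what keeps the delete-file count per task small enough to avoid the OOM mentioned above.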

■ Flink synchronous merge

After all the data files are committed, a Commit Result is produced. We take the Commit Result, generate a compaction task, fan it out to multiple Task Managers to do the rewrite work, and finally commit the result to the Iceberg table.

The key, of course, is what the CompactTaskGenerator does. At first we wanted to merge as much as possible, so we scanned the whole table, touching a huge number of files; but the table is very large and full of small files, and one full sweep immediately brought the whole Flink job down.

So we came up with an incremental scan: on each merge, start from the previous Replace Operation, compute the increment added since then, and see which files match the rewrite policy.

There are many configurations here, such as how many snapshots to accumulate or how many files to trigger a merge; users can set these themselves, and we also provide defaults so users benefit from the feature without noticing it.
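The incremental scan described above can be sketched like this: instead of scanning the whole table, the generator only examines snapshots appended since the last replace (compaction) operation and picks files below a size threshold. The data structures and thresholds here are hypothetical; the real CompactTaskGenerator works against Iceberg's snapshot API:

```python
SMALL_FILE_BYTES = 32 * 1024 * 1024  # rewrite files under 32 MB (example threshold)

def plan_compaction(snapshots, last_replace_id, min_input_files=5):
    """Pick small files added since the last 'replace' snapshot.

    snapshots: list of (snapshot_id, operation, [(path, size_bytes), ...]),
    ordered oldest-to-newest. Returns file paths to rewrite, or [] when
    there are too few candidates to be worth a compaction task.
    """
    candidates = []
    seen_last_replace = last_replace_id is None   # no prior compaction: scan all
    for snap_id, op, files in snapshots:
        if not seen_last_replace:
            seen_last_replace = snap_id == last_replace_id
            continue  # skip everything up to and including the last replace
        if op == "append":
            candidates += [p for p, size in files if size < SMALL_FILE_BYTES]
    return candidates if len(candidates) >= min_input_files else []
```

Because each run only walks the snapshots added since the previous rewrite, the cost stays proportional to the increment rather than to the full table.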

■ Fanout Writer's pit

With the Fanout Writer, a large data volume may mean multiple partition levels. Mobile QQ data, for example, is partitioned by province and city; even after that the partitions were still large, so they were further split into buckets. Each Task Manager may then cover many partitions, each partition opens its own writer, and the sheer number of writers exhausts memory.

Here we do two things:

  • The first is KeyBy support: key the stream by the partitions the user sets, so that rows for the same partition gather on one Task Manager and it no longer opens so many partition writers. This approach does bring some performance loss.

  • The second is an LRU Writer, maintaining a Map of writers in memory.
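The LRU Writer can be sketched with an ordered map: keep at most N open partition writers per Task Manager and close the least recently used one when the cap is hit. This is an illustrative sketch; the real implementation manages Iceberg fanout writer instances, and the factory and cap here are hypothetical:

```python
from collections import OrderedDict

class LRUWriterPool:
    """Bound the number of open per-partition writers to cap memory use."""

    def __init__(self, open_writer, max_open=8):
        self.open_writer = open_writer   # factory: partition -> writer
        self.max_open = max_open
        self.writers = OrderedDict()     # partition -> writer, in LRU order

    def get(self, partition):
        if partition in self.writers:
            self.writers.move_to_end(partition)            # mark recently used
            return self.writers[partition]
        if len(self.writers) >= self.max_open:
            _, evicted = self.writers.popitem(last=False)  # least recently used
            evicted.close()                                # flush and close its file
        w = self.open_writer(partition)
        self.writers[partition] = w
        return w
```

Closing an evicted writer flushes its file, so eviction trades a few extra (small) files for a hard cap on writer memory.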

3. Scenario 2: News platform index analysis

The figure above shows a streaming online-indexing architecture for news articles based on Iceberg. On the left, Spark collects the dimension table from HDFS; on the right is the access system. After ingestion, Flink does a window-based join with the dimension table and writes the result into the index pipeline.

■ Functions

  • quasi-real-time detail layer;

  • real-time streaming consumption;

  • streaming MERGE INTO;

  • multidimensional analysis;

  • offline analysis.

■ Scenario

The scenario above has the following characteristics:

  • Order of magnitude: the index single table exceeds 100 billion rows, a single batch is 20 million, and the daily average is 100 billion;

  • Latency: end-to-end data visibility within minutes;

  • Data sources: full loads, quasi-real-time increments, and message streams;

  • Consumption patterns: streaming consumption, batch loading, point lookups, row updates, and multidimensional analysis.

■ Challenge: MERGE INTO

Users raised a demand for MERGE INTO, so we thought it through from three angles:

  • Functionality: merge the stream table produced after each batch join into the real-time index table for downstream use;

  • Performance: the downstream has strict requirements on index freshness, so the merge must be able to keep up with the upstream batch consumption window;

  • Usability: a Table API? An Action API? Or a SQL API?

■ Solution

  • design a JoinRowProcessor, borrowing from Delta Lake;

  • use Iceberg's WAP (write-audit-publish) mechanism to write temporary snapshots;

  • optionally skip the cardinality check;

  • optionally hash without sorting when writing;

  • support the DataFrame API;

  • support SQL on Spark 2.4;

  • use the community version on Spark 3.0.
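The MERGE INTO semantics needed here, matched rows updated and unmatched rows inserted, can be sketched as a keyed upsert. This is a plain-Python sketch of the behavior only, not the Spark/Iceberg implementation:

```python
def merge_into(target, source, key="id"):
    """Sketch of: MERGE INTO target USING source ON target.id = source.id
    WHEN MATCHED THEN UPDATE SET * WHEN NOT MATCHED THEN INSERT *."""
    by_key = {row[key]: dict(row) for row in target}
    for row in source:
        if row[key] in by_key:
            by_key[row[key]].update(row)   # WHEN MATCHED: update in place
        else:
            by_key[row[key]] = dict(row)   # WHEN NOT MATCHED: insert
    return list(by_key.values())
```

The performance question in the text is whether this upsert, run per batch over a 100-billion-row index table, can finish within the upstream batch consumption window.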

4. Scenario 3: Advertising data analysis

■ Advertising data has the following main characteristics:

  • Order of magnitude: PB-scale data, on the order of 100 billion records per day, with a single record around 2 KB;

  • Source: incremental ingestion into the lake via Spark Streaming;

  • Data characteristics: labels keep increasing and the schema keeps changing;

  • Usage: interactive query analysis.

■ Challenges encountered and corresponding solutions


  • Challenge 1: deeply nested schema, nearly 10,000 columns after flattening, and writes OOM immediately.

    Solution: by default each Parquet page size is 1 MB; it needs to be set according to the executor memory instead.
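A quick calculation (with illustrative numbers) shows why ~10,000 flattened columns blow up memory at the 1 MB default page size, and why the page size has to be derived from the executor's memory budget:

```python
# Each open Parquet column chunk buffers roughly one page in memory while
# writing, so the writer's footprint scales with columns * page size.
columns = 10_000
default_page_bytes = 1 * 1024 * 1024             # Parquet default: 1 MB pages

buffer_bytes = columns * default_page_bytes      # ~10 GB of write buffers -> OOM

# Sizing from the executor memory budget instead (hypothetical 2 GB budget):
executor_budget_bytes = 2 * 1024 * 1024 * 1024
page_bytes = executor_budget_bytes // columns    # ~210 KB per page
```

With the page size chosen from the memory budget, the same 10,000-column write fits in roughly the budgeted 2 GB of buffers.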

  • Challenge 2: 30 days of data nearly burst the cluster.

    Solution: provide actions for lifecycle management, with files distinguished by lifecycle (metadata lifecycle versus data lifecycle).

  • Query-side optimizations:
    1) column projection;
    2) predicate pushdown.



3. Future plans

Our future plans fall into two parts: the kernel side and the platform side.

1. On the kernel side

On the kernel side, we hope to deliver the following:


More data ingestion

  • incremental ingestion into the lake;

  • V2 Format support;

  • Row Identity support.

Faster queries

  • index support;

  • Alluxio acceleration-layer support;

  • merge-on-read (MOR) optimization.

Better data governance

  • governance actions;

  • SQL Extension support;

  • better metadata management.

2. On the platform side

On the platform side, we have the following plans:

■ Data governance as a service

■ Incremental ingestion support

  • consuming CDC data into the lake;

  • Flink consuming CDC into the lake.

■ Metric monitoring and alerting

4. Summary

After large-scale production use, we can sum up in three respects:


  • Availability: real operation across multiple business lines has confirmed that Iceberg can withstand daily volumes of tens or even hundreds of billions of records.

  • Ease of use: the barrier to entry is still fairly high, and more work is needed before users can adopt it easily.

  • Scenario support: the supported ingestion scenarios are still fewer than Hudi's, and incremental reads are also lacking; community effort is needed to close the gap.