Lenovo is a member of Lenovo Group, a global PC leader, and is committed to building a technology-driven, flexible, agile, first-class service experience intelligent logistics ecological platform, providing integrated logistics solutions for industrial enterprises, and becoming an intelligent supply chain technology enterprise serving Chinese and global customers. The big data team has gradually introduced a variety of OLAP analysis engines to better meet the needs. DorisDB stands out from the many OLAP analysis engines, which uses comprehensive vectorized computing technology to be a very powerful new generation MPP database. Through the introduction of DorisDB, a new unified data service platform is built, which greatly reduces the complexity of data link development and greatly improves the efficiency of BI analysis.
,
The
evolution of OLAP engine in Lenovo
Phase I
Before 2018, the total amount of data of Liansheng Zhida is not particularly large, this stage uses the traditional relational database (SQL Server), the data warehouse system has not yet been established, and the implementation of many data requirements is met by the development of SQL scripts.
However, with the increasing complexity of the business and the rapid growth of data volumes, this model quickly hit a bottleneck. The most important thing is that the query response time is getting slower and slower. For example, a task that used to take 10 minutes or 20 minutes to run now takes an hour or more, and the query efficiency is severely reduced. In addition, there are bottlenecks in data storage capacity, which cannot meet the data storage requirements that grow rapidly with the business.
In the second phase
of
2019, as the data warehouse was built and improved on the Hadoop/Hive system, ETL tasks were all transferred to the Hadoop cluster, and dozens of Presto were used to complete OLAP analysis in this stage. Presto naturally shares metadata information with Hive and uses physical data storage together, and flexible queries for a large number of logarithmic bins are done using Presto. The front-end BI layer uses Tableau to directly connect to Presto to achieve data analysis and mining.
In the third stage
,
in 2021, the Liansheng big data team carried out the overall design and construction of the offline data warehouse, which not only needed to do low-latency BI reports, but also meet the complex queries of Adhoc, and also had high requirements for efficient detailed queries. At this stage, we introduce the hot DorisDB product in the OLAP circle according to the scenario, which can not only do Presto’s Adhoc multi-table association query and complex nested subquery, but also provide better single-indicator detailed query and multi-dimensional materialized view volume acceleration than ClickHouse to meet the needs of ultra-fast BI analysis.
Data analytics architecture
Current status of
OLAP system
The entire data analysis system consists of data collection, data storage and calculation, data query and analysis and data application.
Original architecture diagram:
class=”rich_pages wxw-img js_insertlocalimg” src=”https://mmbiz.qpic.cn/mmbiz_png/Sq4ia0xXeMC5svjOx12UQ7HiaaCU6QddoDmqD5qTrdicgLk7DCUMic7LYoQ3OzANgbmQ6a5Go5aWibCopSwA87ziaBSA/640?wx_fmt=png”>
Data acquisition
-
read RDBMS through Sqoop to import Hive.
-
Use Flume to synchronize log files to Hive.
-
Through crawling technology, the data on the network is crawled, stored in the RDBMS, and then read by Sqoop RDBMS and imported to Hive.
Data storage and computing
offline data processing
: Hive’s highly scalable batch processing capabilities are used to undertake all the ETL and data model processing work of offline data warehouses.
Data query and analysis
The data sharing layer mainly provides the underlying data storage and query sharing interface of external services. The data after offline ETL is written to the RDBMS or MPP database, which is oriented to a variety of downstream services, providing OLAP query and analysis capabilities for different scenarios such as Tableau BI, multi-dimensional fixed reports, and Adhoc ad hoc queries. The application side perfectly serves the BI reporting platform, ad-hoc query and analysis platform and data visualization platform (Control Tower) The
data application layer of the
data application layer
is mainly for management and operation personnel, and the query requires low latency response and the demand is also iterating. Ad hoc queries for data analysts require OLAP engines to support complex SQL processing and the ability to quickly select data from massive data.
Comparison
of OLAP analysis tools
Pros
-
strong single-table query performance, suitable for OLAP multidimensional analysis query based on large and wide tables.
-
Contains a rich MergeTree Family with support for pre-aggregation.
-
Ideal for large-scale log detail write analysis.
Cons
does not support real deletion and update.
The join way is not very friendly.
The concurrency capacity is relatively low.
The MergeTree merger is incomplete.
DorisDB
advantages
> Both single-table query and multi-table query performance are strong, and can support both wide table query scenarios and complex multi-table queries.
Supports high concurrency queries.
Supports real-time data micro-batch ETL processing.
Both streaming and batch data writes are strong.
Compatible with MySQL protocol and standard SQL.
Cons
DorisDB application practice
in SEC data centers
The core data of channel warehouse distribution management (SEC) comes from two blocks: one is the consumer business; The second is SMB small and medium-sized enterprises (Think, Yangtian). Based on these data, according to the needs of different business scenarios, relevant business statistical indicators are summarized and query and analysis services are provided to the outside world.
Before the introduction of DorisDB, the
original solution
used a large number of Hive tasks to clean and process business logic, and part of the data after cleaning and processing was retained in Hive, and part of the data was written to MySQL/SQL Server to achieve data landing. The front-end BI connects to Hive, MysSQL, SQL Server, etc. through the Presto calculation engine to achieve report analysis and data visualization.
Technical pain points
The original architecture mainly has the following two problems:
-
the data logic is not well integrated and merged, the maintenance workload is large, and new requirements cannot be responded to quickly.
-
Presto’s response time is slow on Tableau complex reports with more SQL, which cannot meet the needs of business real-time data viewing.
Therefore, we hope to optimize the original system, the core idea is to use an OLAP engine to unify this layer, the requirements for the OLAP engine are relatively high:
-
It can support flexible queries with multi-dimensional combination, and the response time is less than 100ms.
-
It is better to support multi-table association.
-
The amount of data queried by a single table is more than 1 billion, and the response time is less than 100ms.
> It can support the data write requirements of large throughput.
After a lot of research, DorisDB is a good fit for the overall requirements of the data center. DorisDB’s efficient query capabilities can provide integrated services for data center data reporting. The new architecture has the following advantages:
-
clear structure, RDBMS focuses on data cleaning, business logic computing is migrated from Hive to DorisDB, and DorisDB is the end point of data business logic.
-
It can maintain a unified data caliber, one data input, and multiple APP interface outputs.
-
MPP distributed architecture can better support distributed aggregation and correlated queries.
-
It has good compatibility with Tableau and can meet the core BI analysis needs.
DorisDB-based solution
post-upgrade architecture diagram:

datasheet design
1) Data Model Design
DorisDB itself provides three data models: detail model/aggregation model/update model. For SEC business, the current detailed model is the mainstay, and if there are other scenarios in the future, consider applying other models.
2) Data partitioning/bucket splitting
DorisDB provides data partitioning and bucket splitting functions, which can improve the performance of detailed query in historical inventory and turnover scenarios. For example, a common query scenario for historical inventory query is to query the inventory turnover in a certain period of time in the past, we can partition according to the outbound time in DorisDB, filter out unnecessary partitioned data, reduce the amount of data in the entire query for rapid positioning, minimize the data range covered by the query statement, partitioning, bucketing, prefix indexing and other capabilities, which can greatly improve the concurrency of point checks. These features are also of great significance for business growth and facing high concurrency scenarios that may occur in the future. Querying the historical trajectory data of a material barcode (SN) can quickly retrieve all the historical storage trajectory information of the barcode, helping us efficiently complete the full life cycle of the supply chain.
Materialized
view
We use DorisDB materialized view to build in real time and on demand, flexibly increase the characteristics of deletion and transparency, and establish a materialized view based on inventory material SN granularity, product type characteristic granularity, warehouse granularity, and distributor granularity. Based on these materialized views, queries can be greatly accelerated.
Data
import data import DorisDB
here uses two solutions:
1) On the basis of the Broker Load provided by DorisDB, the tables of the offline data warehouse Hive are imported into DorisDB.
2) Import data from SQL Server and MySQL to DorisDB through DataX tools.
DorisDB uses effects
flexible modeling to improve development efficiency
Combined with the wide table model and the star model, the wide table and materialized view can ensure report performance and concurrency, while the star model can allow the AP to model as in TP, directly carry out association queries, do not need to rely on wide table preparation in all scenarios, and improve data consistency and development efficiency. In addition, many tables are in MySQL, and we expose queries through DorisDB external tables, eliminating the process of data import, greatly reducing the development and migration cycle of the business side. DorisDB’s distributed Join capability is very strong, combined with the ability of View to build a unified view layer, query different BI reports under the surface, improve the consistency of indicator caliber, and reduce repeated development.
The BI experience is excellent
The early part of the BI visualization was built based on SQL Server and MySQL. After the continuous optimization and enrichment of requirements of some Kanban boards, coupled with multi-dimensional flexible condition filtering, each loading is very slow, and some Tableau reports take a long time to load, which is unacceptable to the business. After the introduction of DorisDB, we use DataX to import SQL Server data into DorisDB, where we use the DorisDB-Writer plugin, the underlying encapsulated Stream-Load interface, and the vectorization import efficiency is very high. MySQL can be imported into select streaming through foreign tables, or it can be directly queried by foreign tables, which is very convenient. Tableau charts are out in seconds, and the experience has made a qualitative leap.
Low O&M costs
Data centers are a very core online service, so they have very high requirements for high availability and flexible capacity expansion. DorisDB supports multiple copies of data, and the simple architecture of FE and BE is only two roles, which can ensure the high availability of the entire cluster in the event of a single node failure. In addition, DorisDB can be elastically scaled online under the scale of big data, and there is no Down Time when expanding, which will not affect online business, which is also very needed.
Summary
Lenovo began to investigate DorisDB in April this year (2021), and the POC testing phase used 1/4 of the resources, which perfectly replaced the Presto cluster of dozens of nodes, and DorisDB has been online and running stably. After the introduction of DorisDB, it realizes the unification of data services, greatly simplifies the offline data processing link, and also ensures the query delay requirements, which will be used to improve data services and query capabilities in more business scenarios. Finally, thanks to Keystone Technology for its strong support, I also hope that DorisDB will become better and better as a new generation MPP database leader with strong performance!

. Application of DorisDB in the Business Area of the Medium Mobile IoT PGW Real-Time Session
