(Complex enterprise data analysis architecture)
First, the data analysis performance is not up to standard.
As businesses become more data-driven, they ask for more kinds of analysis: multi-dimensional analysis, real-time analysis, high-concurrency queries, and ad hoc queries. In many of these scenarios, existing systems perform poorly and cannot deliver a fast analysis experience.
Second, the flexibility of data analysis is insufficient.
To deliver a fast analysis experience, you often have to build large, wide tables for each scenario or do complex preprocessing, and both cost you analysis flexibility. In flexible scenarios such as self-service BI in particular, star schemas and snowflake schemas are irreplaceable, yet existing systems struggle to support these modeling approaches with high performance.
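The trade-off between wide tables and star schemas can be sketched in a few lines. The example below is only an illustration using Python's built-in sqlite3 module, not StarRocks itself; the table names (dim_store, fact_sales, wide_sales) and the data are hypothetical. It shows that a pre-joined wide table answers the same query as the star schema, but goes stale as soon as a dimension changes.

```python
import sqlite3

def build_demo():
    """Create a tiny star schema plus a pre-joined wide table (hypothetical data)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE dim_store (store_id INTEGER PRIMARY KEY, region TEXT);
        CREATE TABLE fact_sales (store_id INTEGER, amount INTEGER);
        INSERT INTO dim_store VALUES (1, 'north'), (2, 'south');
        INSERT INTO fact_sales VALUES (1, 10), (1, 20), (2, 5);
        -- The wide table freezes the join result at build time.
        CREATE TABLE wide_sales AS
            SELECT f.amount, d.region
            FROM fact_sales f JOIN dim_store d ON f.store_id = d.store_id;
    """)
    return conn

# Same question asked two ways: join at query time, or read the wide table.
STAR_QUERY = """
    SELECT d.region, SUM(f.amount)
    FROM fact_sales f JOIN dim_store d ON f.store_id = d.store_id
    GROUP BY d.region ORDER BY d.region
"""
WIDE_QUERY = """
    SELECT region, SUM(amount) FROM wide_sales
    GROUP BY region ORDER BY region
"""

conn = build_demo()
before_star = conn.execute(STAR_QUERY).fetchall()
before_wide = conn.execute(WIDE_QUERY).fetchall()

# A dimension change: store 2 is reassigned to a new region.
conn.execute("UPDATE dim_store SET region = 'east' WHERE store_id = 2")
after_star = conn.execute(STAR_QUERY).fetchall()  # reflects the change
after_wide = conn.execute(WIDE_QUERY).fetchall()  # stale until rebuilt
```

The star schema picks up the dimension change immediately, while the wide table has to be rebuilt. That rebuild cost is the flexibility lost when wide tables are the only way to get fast queries.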
Third, the complexity of the data architecture is too high.
To meet the many analytical needs of your business, you have to build several systems and use them in combination. This makes the analytics layer very complex and drives up development, maintenance, and usage costs. In addition, with the rise of real-time analysis scenarios, you need to build offline and real-time data pipelines side by side. Problems such as data synchronization, data consistency, keeping computation logic in sync, handling abnormal data, and operating multiple systems follow immediately, and you end up exhausted just coping with them.
Fourth, the elasticity of data analysis capabilities is insufficient.
Your data keeps growing, so the data analysis system has to be expanded continuously. Different business lines generate different query volumes, and the SLA of each one has to be guaranteed. Some businesses also see traffic peaks during promotions and anniversaries: how do you support the business well and still keep costs down? These questions have probably given you plenty of headaches.
The root cause of these problems is that the old big data architecture can no longer keep up with today's rapid business growth. Patching the old underlying architecture solves only part of the problem. To break the deadlock for good, a new “extremely fast and unified” data architecture is needed. “Extremely fast” means comprehensively improving the performance of data processing and analysis; “unified” means merging complex, disparate data architectures into one simple architecture.
To this end, we have decided to upgrade our core product DorisDB to StarRocks and make it fully open source (search for “StarRocks” on GitHub), and to work with big data practitioners around the world to build a new generation of extremely fast, unified data analysis architecture!
StarRocks pioneered a new level of blazing-fast unified analytics
At the beginning of 2020, no one believed that an enterprise's data analysis architecture could be unified, but we were convinced that “fast unified analysis” was possible. After nearly 20 months of round-the-clock effort, the team overcame many “impossible” technical problems and, through independent research and development of a new generation of technology, built StarRocks into an epoch-making product: “a new generation of ultra-fast, all-scenario MPP database”.
> A newly designed, fully vectorized MPP query engine that delivers extremely fast single-table and multi-table query performance.
StarRocks' self-developed, fully vectorized MPP engine greatly improves query performance, reaching 3 to 5 times or more that of systems without native vectorization (Kylin / Druid / Elasticsearch / Impala-Kudu / Presto / Greenplum). ClickHouse has a vectorized engine, but it lacks full MPP support and is weak at multi-table queries; here too, StarRocks' multi-table query performance is 3 to 5 times higher or more.
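To make the idea of vectorized execution concrete, here is a minimal sketch in plain Python. It is not StarRocks code, and the function names and data are hypothetical. It only contrasts the two processing shapes: evaluating a filter and a sum one row at a time, versus working on batches of column values with a selection vector. In a real engine the batch form is what lets the CPU use tight loops, caches, and SIMD instructions; pure Python will not show that speedup.

```python
def sum_row_at_a_time(rows, min_qty):
    """Row-oriented execution: touch every field of every row, one row per step."""
    total = 0
    for row in rows:
        if row["qty"] > min_qty:
            total += row["amount"]
    return total

def sum_vectorized(columns, min_qty, batch_size=1024):
    """Column-oriented execution: process one batch of column values per step."""
    qty, amount = columns["qty"], columns["amount"]
    total = 0
    for start in range(0, len(qty), batch_size):
        qty_batch = qty[start:start + batch_size]
        amount_batch = amount[start:start + batch_size]
        # Step 1: evaluate the filter over the whole batch, producing a selection vector.
        selection = [i for i, q in enumerate(qty_batch) if q > min_qty]
        # Step 2: aggregate only the selected positions of the other column.
        total += sum(amount_batch[i] for i in selection)
    return total

# Hypothetical data: the same table stored row-wise and column-wise.
columns = {"qty": [1, 3, 5, 2, 4], "amount": [10, 20, 30, 40, 50]}
rows = [{"qty": q, "amount": a} for q, a in zip(columns["qty"], columns["amount"])]

row_result = sum_row_at_a_time(rows, min_qty=2)
vec_result = sum_vectorized(columns, min_qty=2, batch_size=2)
```

Both functions compute SUM(amount) WHERE qty > 2 and return the same value; only the order in which data is touched differs.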
> A newly designed real-time columnar storage engine with outstanding real-time update and query performance.
Under real-time updates, StarRocks' query performance is 3 to 5 times or more that of other products.
Where other systems cannot sustain highly concurrent queries, StarRocks can support tens of thousands of concurrent queries per second.
> A newly designed cost-based optimizer (CBO) that supports extremely fast ad hoc queries with second-level latency.
StarRocks can reach more than 5 times the performance of Presto, the mainstream ad hoc query system, and can answer queries within seconds.
> A newly designed modern materialized view with flexible, transparent precomputation-based acceleration.
Other products cannot accelerate queries transparently very well, and their precomputation comes with high development and management costs. StarRocks has made many innovations in its modern materialized views so that queries can be accelerated flexibly and transparently.
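“Transparent” acceleration means the user keeps querying the base table while the system decides whether a precomputed result can answer the query. The toy sketch below illustrates that idea only; it is not how StarRocks' optimizer is implemented, and every name in it (aggregate, MaterializedView, query) is hypothetical. A SUM grouped by a subset of the view's dimensions is re-aggregated from the much smaller view; anything else falls back to the base rows.

```python
from collections import defaultdict

def aggregate(rows, dims, measure):
    """GROUP BY dims, SUM(measure) over a list of dict rows."""
    out = defaultdict(int)
    for row in rows:
        out[tuple(row[d] for d in dims)] += row[measure]
    return dict(out)

class MaterializedView:
    """Precomputed SUM(measure) grouped by a fixed set of dimensions."""
    def __init__(self, base_rows, dims, measure):
        self.dims = tuple(dims)
        self.measure = measure
        grouped = aggregate(base_rows, self.dims, measure)
        # Store the view as rows so it can be re-aggregated later.
        self.rows = [dict(zip(self.dims, key), **{measure: total})
                     for key, total in grouped.items()]

def query(base_rows, views, dims, measure):
    """Answer from a view when its dimensions cover the query, else scan the base."""
    for view in views:
        if view.measure == measure and set(dims) <= set(view.dims):
            # SUM is decomposable, so partial sums can be summed again safely.
            return aggregate(view.rows, dims, measure), "view"
    return aggregate(base_rows, dims, measure), "base"

# Hypothetical base table.
base = [
    {"day": "mon", "city": "paris", "user": "a", "amount": 10},
    {"day": "mon", "city": "paris", "user": "b", "amount": 5},
    {"day": "tue", "city": "paris", "user": "a", "amount": 7},
    {"day": "tue", "city": "rome", "user": "c", "amount": 3},
]
views = [MaterializedView(base, dims=("day", "city"), measure="amount")]

by_city, city_source = query(base, views, dims=("city",), measure="amount")
by_user, user_source = query(base, views, dims=("user",), measure="amount")
```

The caller never mentions the view. The query by city is served from the precomputed rows, while the query by user, which the view cannot answer, silently falls back to the base table.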
Through these unique technical capabilities, StarRocks truly achieves extremely fast unified analysis:
The new OLAP multi-dimensional analysis experience removes the “wide tables only” limitation: precomputation, wide tables, star schemas, and snowflake schemas all get an extremely fast analysis experience.
The new real-time data analysis experience truly supports real-time updates and deletes while still ensuring extremely fast query performance.
The new high-concurrency query experience breaks through the limitation that traditional OLAP cannot handle high concurrency, and supports thousands of simultaneous users.
The new simplified, unified OLAP architecture greatly reduces the complexity of use and of operations and maintenance, and improves development and usage efficiency.
> StarRocks can efficiently support OLAP multi-dimensional analysis, real-time data analysis, high-concurrency queries, ad hoc queries, and other scenarios at the same time, with analysis performance 3 to 5 times or more that of previous-generation products of the same type.
“Extremely fast unified analysis” is not the end, but a new starting point
Building on the current extremely fast, unified data analysis architecture, our next goal is to create a “new generation of stream-batch fusion Lakehouse”. In today's mainstream data processing pipelines, real-time data processing and offline data processing are separate. To handle both, enterprises often end up with complex system architectures that are hard to maintain. We want to bring these two ways of processing data together in StarRocks.
We will design a new cloud-native architecture that brings real-time and offline data together so that both can be managed efficiently.
Snowflake, the cloud-native benchmark, has built an advanced architecture that separates storage from compute for offline data scenarios, but that architecture has major shortcomings in supporting real-time data analysis. We will design a next-generation cloud-native architecture that supports high-performance writes and reads for both real-time and offline data.
We will also design a new vectorized compute engine for stream-batch fusion that can perform extremely fast batch processing and stream processing at the same time.
By building a new vectorized batch processing engine, we aim for batch processing 5 to 10 times or more faster than Apache Spark. The engine will also integrate streaming semantics and use vectorization to improve stream processing performance. Users will no longer have to put up with the complexity of running Spark and Flink separately for batch processing and stream processing!
“Dare to try boldly, achieve the impossible” is the value we have always lived by. Over the next year and a half or so, we will work with the community to build a new StarRocks, so that an enterprise's offline data and real-time data can be processed by the same architecture, the same semantics, and the same engine, and the data architecture becomes comprehensively “extremely fast and unified”. Things that were simple to begin with should be simple again!
Alone we go fast; together we go far
In order to realize these great dreams, we will build a StarRocks open source ecosystem around the world and attract talented, like-minded people to take part in building the community. We will spare no effort to encourage more users worldwide to join the community, to understand and evaluate StarRocks, and to use and improve it. We will also help data engineers and data analysts around the world build next-generation solutions for all kinds of data analytics scenarios on top of StarRocks.
If you share our dream, follow us now, get involved in building the community, and star StarRocks on GitHub. Let's create a new era of “extremely fast and unified” big data together, and say no to the impossible!
StarRocks – Together with the future, the sea of stars!