StarRocks version 2.0 has arrived, too.

2021 has just ended. Looking back on everything that happened this year, everyone who took part in building StarRocks will surely feel stirred:

2021

At the end of January 2021, the vectorized StarRocks 1.0 was released. I remember there was still a bit of “excitement” at the time: the newborn product already had single-table query performance comparable to the world’s fastest open source system. Over the past year, we have been committed to redefining single-table query speed. In version 2.0, StarRocks introduces low-cardinality string query optimization based on global dictionaries, along with a large number of CPU instruction-level optimizations. In single-table query scenarios, version 2.0 reaches about twice the performance of the previous version and significantly surpasses the original “world’s fastest open source system”. This is the latest result of StarRocks’ continued effort to “achieve the impossible”. With the release of 2.0, every community member who took part in the development is thrilled, and everyone is welcome to test it and experience this refreshingly fast analysis for themselves!
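To make the idea of dictionary-based optimization for low-cardinality strings concrete, here is a minimal conceptual sketch in Python. It is not StarRocks’ implementation: the function names and sample data are hypothetical, and the real engine maintains its dictionaries globally and works on columnar data. The sketch only shows the general principle that a column with few distinct strings can be replaced by small integer codes, so that grouping and comparison run on integers and strings are restored only for the final result.

```python
from collections import Counter


def build_dictionary(values):
    """Assign a small integer code to each distinct string (hypothetical helper)."""
    return {value: code for code, value in enumerate(sorted(set(values)))}


def encode(values, dictionary):
    """Replace every string in the column with its integer code."""
    return [dictionary[value] for value in values]


# A low-cardinality string column: many rows, few distinct values.
cities = ["Beijing", "Shanghai", "Beijing", "Shenzhen", "Shanghai", "Beijing"]

dictionary = build_dictionary(cities)
codes = encode(cities, dictionary)

# The GROUP BY-style aggregation runs on integer codes, not on strings.
counts_by_code = Counter(codes)

# Strings are restored only once, for the final (small) result set.
reverse = {code: value for value, code in dictionary.items()}
counts = {reverse[code]: n for code, n in counts_by_code.items()}
```

The smaller the number of distinct values relative to the number of rows, the more work is saved by operating on codes instead of strings.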

Test environment: StarRocks with 1 FE and 3 BEs, versions 1.19 and 2.0; ClickHouse in an equivalent 3-node configuration, version 21.9

To get blazing-fast analysis results, many business users are forced to flatten multi-table data into large wide tables, and behind this lies untold hardship for data engineers. In December 2019, StarRocks set out to overturn its own design and build a new CBO (cost-based optimizer), so that users could get a blazing-fast analysis experience on multi-table data without complex preprocessing. The difficulty is like “climbing Mount Everest from the north slope”, and developers will know exactly what that means. Yet for blazing-fast analysis in all scenarios, it is an obstacle that cannot be bypassed. After more than a year of hard work, the CBO optimizer in version 2.0 has largely matured: it delivers multi-fold performance improvements for more types of complex multi-table queries, and its completeness and stability are greatly improved. Compared with other open source systems, StarRocks can achieve a 5-10x performance advantage. StarRocks has made an unprecedented leap from “single-table speed” to “multi-table speed”!

In May 2021, as demand for data updates in real-time analytics scenarios kept growing, StarRocks set off again! At that time, OLAP systems often used the merge-on-read approach to handle data updates, but sacrificing query performance for better import performance was not the best solution. And so the Primary Key model was born! The new storage engine uses delete-and-insert to apply data updates, which can improve query performance by 3-10x in real-time update scenarios. After 6 months of polishing, version 2.0 officially releases the Primary Key real-time update feature. Users no longer have to struggle with “real-time updates”!
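The difference between the two update strategies can be illustrated with a toy Python model. This is only a conceptual sketch under simplifying assumptions (in-memory lists, one value per key); the class and method names are hypothetical and do not reflect StarRocks’ storage engine. With merge-on-read, writes simply append new versions and every read has to reconcile them; with delete-and-insert, the write locates the old row through a primary key index and marks it deleted, so a read only has to skip flagged rows.

```python
class MergeOnReadTable:
    """Writes append versions; reads must merge all versions per key."""

    def __init__(self):
        self.versions = []  # append-only list of (key, value)

    def upsert(self, key, value):
        self.versions.append((key, value))  # cheap write

    def scan(self):
        latest = {}
        for key, value in self.versions:  # merge work happens on every read
            latest[key] = value
        return latest


class DeleteAndInsertTable:
    """Writes mark the old row deleted; reads only skip deleted rows."""

    def __init__(self):
        self.rows = []     # list of (key, value)
        self.deleted = []  # delete flag for each row position
        self.index = {}    # primary key -> position of the live row

    def upsert(self, key, value):
        old_position = self.index.get(key)
        if old_position is not None:
            self.deleted[old_position] = True  # "delete" the old version
        self.index[key] = len(self.rows)
        self.rows.append((key, value))         # "insert" the new version
        self.deleted.append(False)

    def scan(self):
        return {k: v for (k, v), dead in zip(self.rows, self.deleted) if not dead}


merge_on_read = MergeOnReadTable()
delete_and_insert = DeleteAndInsertTable()
for key, value in [("u1", 10), ("u2", 20), ("u1", 15)]:
    merge_on_read.upsert(key, value)
    delete_and_insert.upsert(key, value)
```

Both tables return the same result for a scan; what differs is where the cost is paid. The delete-and-insert model does extra work at write time (the index lookup) so that reads do not need to merge versions.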

In June 2021, development of the Pipeline execution engine was put on the agenda. This feature is dedicated to greatly improving StarRocks’ concurrency and complex query performance on multi-core machines. It is work built from scratch, involving a great deal of exploration that no one has done before, driven by the same determination to “achieve the impossible”. The Pipeline execution engine will arrive in version 2.1. We will keep you in suspense for now and hope users will share plenty of real-world feedback. Our promise: satisfaction guaranteed, no empty claims!

Stability is the foundation for large-scale adoption, and over the past half year StarRocks has spared no effort to address stability problems comprehensively. In version 2.0, we redesigned memory management to fundamentally solve the BE OOM problem. With the release of version 2.0, we believe everyone will be able to use StarRocks with greater ease in the new year!

In September 2021, StarRocks open-sourced its code and began a new chapter in building a global community! Let’s start with a few numbers:

  • Open source for 114 days, with 75 contributors in total, 40+ active contributors per month, 1,238 commits, and 1,900 stars.

  • Organized 8 online and offline community meetups, reaching more than 5,000 people.

  • The community has attracted 85 large users (with a market capitalization or valuation of more than one billion dollars) to use StarRocks, and the number is still growing rapidly.

Look to the past and look to the future.

StarRocks is committed to pioneering a new data architecture that is blazing fast and unified, making data-driven work faster, more flexible, and more real-time across the board! The goal for 2022 is to go from leader in “multi-table” scenarios to leader in “all scenarios” and become the world’s No. 1 blazing-fast, all-scenario analytical database, helping more users achieve blazing-fast unified analysis. Some may question this and call it immodest or imprudent. StarRocks was founded less than 2 years ago. It is in its period of vigorous growth, the age of daring to think and daring to act, and we firmly believe that only great dreams can make great products!

So what exactly will we do? Put plainly, five things:
1
Resource management

StarRocks has performed well in a variety of analysis scenarios, and users are running more and more of their business workloads on it. No business team wants to be affected by other workloads, and the platform team does not want to maintain multiple clusters. What can be done?

StarRocks is introducing a new resource management mechanism that supports “resource groups”. You can set up a separate resource group for each workload that needs isolation. This guarantees that the workload gets a sufficient resource quota without interference from other workloads, and that it does not interfere with the resources of other workloads in turn, so different workloads can run in the same cluster. On the one hand, this relieves the platform team of the burden of operating multiple clusters; on the other hand, it lets different workloads share a cluster easily and improves resource utilization.
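As a rough illustration of quota-based isolation, the following Python sketch models resource groups with a single memory quota each. It is a toy model under stated assumptions, not StarRocks’ mechanism or API: the `ResourceGroup` class, its fields, and the group names are all hypothetical, and a real system would also need to manage CPU, concurrency, and other resources.

```python
class ResourceGroup:
    """Hypothetical resource group with one memory quota (in MB)."""

    def __init__(self, name, memory_limit_mb):
        self.name = name
        self.memory_limit_mb = memory_limit_mb
        self.used_mb = 0

    def try_admit(self, requested_mb):
        """Admit a query only if it fits in this group's own quota."""
        if self.used_mb + requested_mb > self.memory_limit_mb:
            return False
        self.used_mb += requested_mb
        return True

    def release(self, released_mb):
        """Return memory to the group when a query finishes."""
        self.used_mb = max(0, self.used_mb - released_mb)


# Two workloads share one cluster but have separate quotas.
reporting = ResourceGroup("reporting", memory_limit_mb=100)
adhoc = ResourceGroup("adhoc", memory_limit_mb=50)

first_adhoc = adhoc.try_admit(40)          # fits within the ad hoc quota
second_adhoc = adhoc.try_admit(20)         # rejected: ad hoc quota exhausted
reporting_query = reporting.try_admit(90)  # unaffected by the ad hoc workload
```

Because each group is checked only against its own quota, a burst of ad hoc queries cannot consume the resources reserved for reporting, which is the isolation property the paragraph above describes.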


2
Multi-table materialized view

In our recent large-scale user interviews and surveys, multi-table materialized views were one of the most requested features. Thinking what users think and worrying about what users worry about is StarRocks’ top priority in requirements design.

When some queries are found to be underperforming, users can speed them up by creating materialized views. StarRocks’ current materialized views are built synchronously and routed to automatically at query time. Synchronous building means that when data in the base table is updated, the data in the materialized view is updated at the same time. Automatic query-time routing means that when planning a query, StarRocks calculates the cost of different query plans and selects the most suitable materialized view to serve that query. This accelerates user queries transparently.
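The routing idea can be sketched in a few lines of Python. This is a simplified illustration, not StarRocks’ optimizer: the `Candidate` type, the table and view names, and the use of row count as the only cost are all assumptions made for the example. A candidate (the base table or a materialized view) can serve a query if it contains every dimension and measure the query needs, and among usable candidates the cheapest one is chosen.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    """Hypothetical description of a base table or a materialized view."""
    name: str
    dimensions: frozenset
    measures: frozenset
    row_count: int  # used here as a stand-in for the plan cost


def can_answer(candidate, dimensions, measures):
    """A candidate is usable if it covers everything the query asks for."""
    return dimensions <= candidate.dimensions and measures <= candidate.measures


def choose_source(candidates, dimensions, measures):
    """Pick the usable candidate with the lowest estimated cost."""
    usable = [c for c in candidates if can_answer(c, dimensions, measures)]
    if not usable:
        raise ValueError("no table or materialized view can answer this query")
    return min(usable, key=lambda c: c.row_count)


candidates = [
    Candidate("sales", frozenset({"date", "city", "product"}),
              frozenset({"sum_amount"}), 1_000_000),
    Candidate("mv_city_date", frozenset({"date", "city"}),
              frozenset({"sum_amount"}), 10_000),
    Candidate("mv_city", frozenset({"city"}),
              frozenset({"sum_amount"}), 100),
]

by_city = choose_source(candidates, frozenset({"city"}), frozenset({"sum_amount"}))
by_city_date = choose_source(candidates, frozenset({"date", "city"}),
                             frozenset({"sum_amount"}))
by_product = choose_source(candidates, frozenset({"product"}),
                           frozenset({"sum_amount"}))
```

The user writes the query against the base table in every case; the choice of `mv_city`, `mv_city_date`, or `sales` happens during planning, which is what makes the acceleration transparent.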

However, materialized views currently cannot span multiple tables, nor do they support more flexible expressions. In 2022, StarRocks will complete support for multi-table materialized views and for more flexible expressions in materialized views. StarRocks hopes to simplify the entire data transformation process through materialized views. In the past, data engineers may have needed to build a data model by hand. With more flexible materialized view expressions, analysts can obtain the final model directly by creating various materialized views. This makes data analysis more agile.

Beyond multi-table materialized views, StarRocks will also introduce capabilities such as intelligent recommendation. By analyzing user queries, StarRocks will recommend materialized views for users to create in order to accelerate those queries.


3
Separation of storage and computation

StarRocks’ current architecture still couples storage and compute. This approach gives users the best possible query performance. However, because the two are coupled, resources cannot be allocated on demand, which sometimes leads to unnecessary cost.

As more and more of today’s user infrastructure is built on public or private clouds, OLAP systems should also keep pace with this trend, make better use of the resource elasticity that cloud environments provide, and bring users more resource savings and flexibility.

StarRocks plans to move toward a storage-compute separated architecture in 2022. What we want to do is challenging and groundbreaking across the industry: on the one hand, StarRocks’ storage-compute separation needs to support both offline and real-time workloads; on the other hand, it needs to work for both public cloud and private deployments. In addition, StarRocks’ architecture needs to better support multi-cloud environments.

We will work with our community partners to meet these varied needs with a single technical architecture.


4
Blazing fast data lake analysis

At present, StarRocks mostly serves as a data warehouse. Users import their higher-value data into StarRocks for blazing-fast analysis, while raw data of lower value stays in the data lake. As a result, users need not only blazing-fast analysis of data lake data, but also joint analysis across data warehouse data and data lake data.

To give users a better lakehouse analysis experience, StarRocks will focus on strengthening its data lake analytics capabilities in the new year. We hope these efforts will let users not only run blazing-fast analysis on data lakes, but also perform unified analysis across data lakes and data warehouses through StarRocks.

The StarRocks community has already worked with Alibaba Cloud to complete the first phase of development for Iceberg query support. According to the latest tests, it delivers a 5x performance improvement over Trino. Support for Hudi and further improvements will follow. StarRocks sincerely invites interested community partners to join in building these features. In addition, Alibaba Cloud EMR will open a public beta of its StarRocks service in January, and services from more cloud vendors are on the way, so stay tuned!

5
Batch and stream integration

Many users are looking forward to applying StarRocks’ blazing-fast capabilities to data processing scenarios (such as workflows currently implemented with Spark or Flink), and quite a few have already begun doing so in practice.

StarRocks will strengthen its batch processing and stream processing capabilities in 2022 (which does not mean solving every batch processing scenario for every user). At a scale of hundreds of nodes, StarRocks is confident it can provide an integrated batch and stream processing solution. Users will be able to process raw data with StarRocks and then analyze the processed data with StarRocks as well. By then, StarRocks will connect the whole path from blazing-fast data processing to blazing-fast data analysis, achieving unification at more levels.

The mighty pass may seem as hard as iron, yet today we stride across it from the very start.

Let’s look forward to these five major goals being realized one by one in the new year. StarRocks hopes to work with more community developers and users to scale new heights on the road to creating the future!

2022, here we come!