
Why is Shopee Druid evolving to a cloud-native architecture? What does it take to become cloud-native? And what pitfalls might be encountered along the way?

This talk focuses on those three questions: it starts from the problems Shopee Druid ran into on the physical-machine architecture, explores the advantages of a cloud-native architecture, and then dives into the technical details of the cloud-native design as well as some best practices from rolling it out.

At ApacheCon Asia 2022, Jiayi from the Shopee Data Infra team shared Apache Druid's evolution to a cloud-native architecture at Shopee. This article is based on that talk.

Druid is a high-performance, real-time analytical database. Its high performance mainly comes from columnar storage, bitmap inverted indexes, data compression, SIMD vectorized execution, its caching system, and so on.

Druid’s Lambda architecture also enables it to support the writing and querying of real-time data.

At the same time, Druid, as an OLAP engine, has a rich set of query operators built in to meet a wide variety of analysis needs.

As you can see from this Venn diagram, Druid combines some of the features of a time series database, a data warehouse, and a full-text search engine:

Of course, the internal designs differ in many details, which I will not go into here.

Now that you have a preliminary understanding of Druid, let's look at the problems Shopee previously encountered with services built on the physical-machine architecture.

They mainly fall into four areas: stability, efficiency, cost, and security.

The first is the stability issue.

This is a monitoring chart of Druid query QPS. As you can see, around noon a sudden spike in query traffic far exceeded the pre-set alert threshold. We later traced it to bugs in the business side's scripts, which were sending large numbers of meaningless requests.

I believe everyone has run into similar situations to some degree: right before, or even during, a meal, a wave of alert calls comes in and you have to drop everything to fix the production issue; and if the failure happens at the particularly unfortunate hour of the middle of the night, the alert calls become a relentless string of wake-up calls.

Let’s look at another example.

This screenshot is an alert from our Druid slow-query script. Most of the content is redacted; we only need to look at the highlighted parts. They show a heavy query trying to analyze a full year of data, which consumed a lot of server resources and affected the normal query requests of other businesses.

Especially after we opened the SQL client to end users, we even ran into the occasional unrestricted select * query, which is practically a DDoS attack on the cluster.

There are many similar stability cases, which can be divided into three categories.

The first category is query-related, for example:

The second category is write-related, for example:

The third category is the cluster itself, for example:

As you can see, the causes of stability problems are numerous and nearly impossible to guard against completely.

After analyzing the stability problem, let’s look at the efficiency problem.

When a performance bottleneck is caused by insufficient resources, we have to reach out to the owners of the affected businesses on the spot; the communication overhead is high, so efficiency is low. And even once everyone is aligned, scaling out or in still requires manual intervention and is hard to complete in a short time.

With all services in the same large cluster, it is hard to rank them by priority, because every business naturally regards itself as the most important. As a result, automatic traffic degradation is difficult to achieve.

In the end, these efficiency problems seriously constrain business development.

Next is the cost issue that companies need to take into account.

First look at the cost of machine resources:

Deploying multiple instances on the same physical machine is extremely difficult because of port conflicts and identical resource preferences; with simple co-location, resources are still wasted; and resources cannot be tailored to each business's scale and write rate.

As a result, if resources are not provisioned in time, the business's performance requirements cannot be met; and if too many resources are provisioned, it is hard to keep utilization high.

Second, look at labor costs:

Building a physical-machine cluster is complex and prone to inconsistencies: the same service may run fine on the old cluster yet hit inexplicable problems after migrating to a new one. Each dedicated cluster consumes countless person-days to build, and once a dedicated physical-machine cluster is finally up, maintenance costs climb sharply as the number of clusters grows.

Bear in mind that even Druid, which automatically discovers new nodes and rebalances load, still puts pressure on the O&M engineers; it is even worse for engines that lack such features.

Ultimately, the issue of cost will reduce the competitiveness of our services.

Finally, let’s look at security.

In our clusters running older versions, authentication was not enabled, so a misoperation by one business could affect other businesses and cause unpredictable consequences. Even with authentication enabled, it is still not 100% safe and reliable, because there is no physical isolation.

Older versions also carry many known security vulnerabilities, including major zero-day vulnerabilities such as the Log4j one.

With so many businesses sharing such large clusters, upgrades face great resistance and risk. Upgrades therefore keep getting postponed, the gap with the latest version keeps widening, and compatibility obstacles appear: we have to upgrade to an intermediate version before we can reach the latest one, which multiplies the complexity of the upgrade and brings operational risks that are increasingly hard to assess.

Staying on an old version also means missing out on the benefits of new releases. For example, for the heavy-query problem mentioned earlier, newer versions of Druid support separate fast and slow query lanes, which keep individual unreasonable heavy queries from affecting other normal queries and relieve some of the pain points, as the configuration sketch below illustrates.
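
A minimal configuration sketch of that laning feature follows. The property names are from Druid's query-laning documentation in recent versions, while the values and file path here are placeholders rather than our production settings:

```python
# Hedged sketch: enable Druid's "hilo" query laning on the Broker so that
# low-priority (heavy) queries cannot occupy every scheduler thread.
# Property names follow the Druid query-laning docs; values and the file
# path are placeholders, not production settings.
broker_laning_props = {
    # total threads the query scheduler may use for concurrent queries
    "druid.query.scheduler.numThreads": "40",
    # "hilo" strategy: queries with priority < 0 go into the "low" lane
    "druid.query.scheduler.laning.strategy": "hilo",
    # the low lane may use at most this percentage of scheduler threads
    "druid.query.scheduler.laning.maxLowPercent": "20",
}

# Append the settings to the Broker's runtime.properties (path is hypothetical).
with open("conf/druid/cluster/query/broker/runtime.properties", "a") as f:
    for key, value in broker_laning_props.items():
        f.write(f"{key}={value}\n")
```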

These are the problems in most physical machine architectures. Next, let’s analyze how cloud-native architecture solves these problems.

Before we formally started the move to cloud native, we had already done a lot of work at the kernel level, but it was hard to achieve a cure-the-disease-at-its-root effect.

We therefore conducted thorough research and testing, weighing trade-offs in many technical details to find the most suitable solution. We also collected opinions from the various stakeholders, and overall the response was enthusiastic and supportive of the architecture upgrade.

As the saying goes, chaos is a ladder: it was precisely the disorder and turbulence of the shared physical-machine clusters that gave the business side enough motivation to land the new architecture together with us.

The chart above shows the key stakeholders and their respective demands.

In the spirit of customer first, let's start with the needs of the business side. Top of the list is stability: the ability to withstand peak traffic during big promotions and to scale out automatically within seconds.

Then come the needs of the operations side. They want observability, so that the health and performance metrics of each component can be observed clearly in real time; sufficiently high cluster resource utilization; and support for flexible alerting policies such as dynamic thresholds and tiered alerts.

Finally, there are the needs of the kernel side. We want to focus solely on the kernel and have everything outside Druid kernel development fully managed; to support CI/CD continuous integration; and to deliver Docker images instead of the previous pure-code delivery model, improving overall iteration efficiency.

The advantages of the cloud-native architecture cover high stability, high efficiency, low cost, and security; below we detail how it meets the needs of all parties.

The first is high stability.

We set up a dedicated Druid on K8s cluster for each core business. Because the resources of each dedicated cluster are isolated, the problem of resource contention is fundamentally solved.

We have also pushed customized optimization for each business's characteristics as far as possible. When a service fails, it recovers automatically within seconds, transparently to users. Further, we have built HA clusters across different data centers to achieve IDC-level high availability for the few core businesses with the most stringent stability requirements.

Beyond the usual customization of cluster parameters, we may also run into scenarios such as having to align compression algorithm versions.

Suppose the business upstream of Druid compresses data with version A of the ZSTD algorithm while Druid defaults to version B; Druid then has to be adjusted to match the business's version, otherwise the data cannot be deserialized properly.

Under the previous shared-cluster model, if a newly onboarded business used yet another version C, alignment would become impossible, and we would have to push the business to transform all of its upstream and downstream components, at a very high cost.

And the problem is not just cost: if some component in the full pipeline also happens to run in a shared-cluster model and cannot be adjusted, the whole pipeline simply cannot be brought up.
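
For illustration only, here is a minimal sketch of the kind of compatibility smoke test this alignment implies, written with the Python zstandard bindings (in Druid itself the alignment happens at the Java/native library level); the sample file path is hypothetical:

```python
# Hedged sketch: check that a sample message compressed by the upstream
# business can be decoded by the ZSTD library version pinned in our
# dedicated cluster's image. The sample file path is hypothetical.
import io
import zstandard


def can_decode(blob: bytes) -> bool:
    """Return True if the pinned zstandard build can decode the upstream frame."""
    try:
        with zstandard.ZstdDecompressor().stream_reader(io.BytesIO(blob)) as reader:
            reader.read()
        return True
    except zstandard.ZstdError:
        return False


if __name__ == "__main__":
    with open("samples/upstream_zstd_message.bin", "rb") as f:  # hypothetical dump
        blob = f.read()
    print(f"zstandard {zstandard.__version__}: decode ok = {can_decode(blob)}")
```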

Of course, many more such cases arise in practice, and I will not list them all here.

Another advantage of a cloud-native architecture is efficiency.

Since all projects in the same dedicated cluster belong to the same department or project group, it is easier to evaluate and rank their priorities, which in turn makes automatic traffic degradation easier to implement.

Moreover, because we support automatic scaling out and in based on load, we no longer need to ask each business in advance how much extra traffic a big promotion will bring, nor worry about inaccurate forecasts.
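
As a hedged sketch of what such an autoscaling policy can look like (component name, namespace, replica bounds, and threshold are illustrative, not our production values), a HorizontalPodAutoscaler targeting a stateless Druid component such as the Broker could be generated like this:

```python
# Hedged sketch: an HPA manifest for a stateless Druid component (here a
# hypothetical "druid-broker" Deployment). Names and thresholds are
# illustrative, not Shopee's production values.
import yaml  # pip install pyyaml

broker_hpa = {
    "apiVersion": "autoscaling/v2",
    "kind": "HorizontalPodAutoscaler",
    "metadata": {"name": "druid-broker", "namespace": "druid-business-a"},
    "spec": {
        "scaleTargetRef": {
            "apiVersion": "apps/v1",
            "kind": "Deployment",
            "name": "druid-broker",
        },
        "minReplicas": 2,   # keep redundancy even at low load
        "maxReplicas": 10,  # cap growth during promotion peaks
        "metrics": [{
            "type": "Resource",
            "resource": {
                "name": "cpu",
                "target": {"type": "Utilization", "averageUtilization": 70},
            },
        }],
    },
}

print(yaml.safe_dump(broker_hpa, sort_keys=False))
```

Stateful components such as Historical generally warrant more conservative scaling, since each replica carries a local segment cache that must be refilled after rescheduling.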

As the Druid on K8s clusters grew, we made a cross-sectional comparison at one point in time and found that the cloud-native clusters used fewer machines than the old physical-machine clusters yet carried a larger volume of business writes.

In the bar chart above, the physical-machine cluster is on the left and the cloud-native cluster on the right; orange represents peak write throughput and purple the total write volume. Both the peak and the total are higher for the cloud-native architecture.

From this angle, it reflects the higher machine resource utilization of Druid's cloud-native architecture.

Let’s look at the labor cost aspect.

Because building a physical-machine cluster is fairly complex, it takes about a month from starting the build to fully delivering a production environment. Even after the process becomes well practiced and heavily scripted, it still takes several days.

With the cloud-native architecture, a cluster can be deployed with one click through CI/CD, achieving minute-level delivery.
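
Here is a minimal sketch of what "one click" can reduce to in a CI/CD job, assuming a Helm chart for the Druid cluster; the chart path, release name, namespace, and values file are hypothetical:

```python
# Hedged sketch: the deploy step a CI/CD job might run to stand up or upgrade
# one dedicated Druid cluster from a Helm chart. Chart path, release name,
# namespace, and values file are hypothetical.
import subprocess


def deploy_druid_cluster(release: str, namespace: str, values_file: str) -> None:
    """Install or upgrade one dedicated Druid cluster in its own namespace."""
    subprocess.run(
        [
            "helm", "upgrade", "--install", release,
            "./charts/druid-cluster",   # hypothetical chart path
            "--namespace", namespace,
            "--create-namespace",
            "-f", values_file,          # per-business overrides
            "--wait",                   # block until resources are ready
        ],
        check=True,
    )


if __name__ == "__main__":
    deploy_druid_cluster("druid-business-a", "druid-business-a",
                         "values/business-a.yaml")
```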

In the comparison chart, the x-axis is the number of clusters and the y-axis the person-days consumed; the yellow line is the physical-machine architecture and the green line the cloud-native architecture. It gives a more intuitive feel for the gap between the two.

Finally, there are the advantages in terms of security.

First of all, we enable authentication by default to ensure the data security of the business.

Second, because the dedicated cloud-native clusters are lighter, it is easier for us to keep up with the latest Druid kernel version, in which known security vulnerabilities have already been fixed, making the clusters more secure and reliable.

In addition, running in containers provides physical isolation, avoids resource contention, and confines the fault domain. We no longer need to worry about one business's misoperation affecting others, and the risk of going live is further reduced.

Of course, the advantages of the new architecture go far beyond these, and their compounding effect has lifted our service quality to a completely different level.

Next, let's summarize the opportunities and challenges encountered in the process of going cloud native.

First, the challenges:

We had to introduce the concept of cloud native, meet the needs of all stakeholders, and design a new cloud-native architecture; overcome many technical difficulties to build a K8s foundation from zero to one; replace the previous pure-code iteration with Docker images and use Helm charts for container orchestration and management of the Druid clusters; and finally, fully validate and test most business scenarios, establish benchmark users, then drive large-scale business migration while continuing to optimize per-business customizations.

Second, the opportunities:

Building a complex cloud-native architecture was an opportunity to further exercise and improve our architectural capabilities; solving concrete technical problems while putting K8s and cloud native into practice also raised our technical level; the forward-looking containerized operating model puts us ahead in the Druid open-source community's own cloud-native journey; and migrating businesses one by one gave us the chance to understand their real usage scenarios and pain points in depth, deepening our understanding of the business.

By this point, I believe it is clear why we evolved to a cloud-native architecture, and what opportunities and challenges we encountered while landing it.

Next, let’s move on to the architecture level and take a look at Shopee Druid’s overall cloud-native architecture design and the interaction between the components.

This is our overall architecture diagram, which is divided into six layers: the business layer, the platform layer, the visualization layer, the engine layer, the GitOps layer, and the K8s layer.

The top layer is the business layer, which mainly includes user behavior analysis, product recommendation, sales data analysis, brand analysis, network performance analysis, core index analysis, advertising revenue analysis, application trajectory analysis, cross-border e-commerce analysis, content recommendation, etc.

As you can see, Druid’s application scenarios inside Shopee are very diverse.

The second layer is the platform layer, which includes DataStudio data analysis, TrinoDB federated query engine, DataHub data integration, DataMap and Metamart metadata management, and more.

The visualization layer includes Druid’s own web UI, K8s Dashboard, Apache Superset, Grafana, and front-end pages for business implementation.

Next is the engine layer. Below Druid, PostgreSQL serves as the metadata store, HDFS serves as the underlying storage holding the full business data, and ZooKeeper serves as the configuration hub.

To the right of Druid are real-time ingestion from Kafka and offline imports from HDFS; Spark-based data analysis is also supported.

Finally, in the lower right corner, Elasticsearch stores the logs, Druid monitors itself through the metric system, and Grafana handles alerting.
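
To make that wiring concrete, the sketch below renders the kind of common.runtime.properties entries that point Druid at PostgreSQL, HDFS, and ZooKeeper. The property keys follow standard Druid configuration, while the hosts and paths are placeholders rather than our real endpoints:

```python
# Hedged sketch: the Druid common.runtime.properties entries that wire the
# engine layer together. Keys follow standard Druid configuration; hosts and
# paths below are placeholders, not Shopee's real endpoints.
common_runtime_properties = {
    # PostgreSQL as the metadata store
    "druid.metadata.storage.type": "postgresql",
    "druid.metadata.storage.connector.connectURI":
        "jdbc:postgresql://pg.example.internal:5432/druid",
    # HDFS as deep storage holding the full business data
    "druid.storage.type": "hdfs",
    "druid.storage.storageDirectory": "/druid/segments",
    # ZooKeeper as the configuration hub / coordination service
    "druid.zk.service.host":
        "zk1.example.internal,zk2.example.internal,zk3.example.internal",
}


def render_properties(props: dict) -> str:
    """Render a dict as a Java-style .properties file body."""
    return "\n".join(f"{k}={v}" for k, v in props.items()) + "\n"


print(render_properties(common_runtime_properties))
```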

Let’s look at the GitOps layer. We divide the entire release process into four parts: development, testing, pre-release, and production.

Unlike traditional releases, we deliver Docker images and use Harbor as the image registry. Clusters on the cloud-native architecture run in containers, which keeps the test and production environments consistent; there is no longer the embarrassment of something passing tests yet failing to run in production.

At the bottom are the Kubernetes clusters. We have built peer K8s clusters in different IDC rooms, and on top of them built dedicated Druid clusters for each core business.

Our splitting logic is by department: a department may have multiple projects, and each project may contain multiple DataSource tables.

For example, suppose projects 2 and 3 belong to the same department, with project 2 doing metric monitoring and project 3 computing actual business data. When overall resources hit a bottleneck, we can easily prioritize and automatically degrade project 2 to protect the more critical project.

That covers the layered architecture diagram. So how do the internal components interact and cooperate with each other?

Let’s start with the traffic entrance.

Typically, read and write requests are initiated from a visualization page or a backend program and received by Druid.

As you can see, the components inside Druid all run with multiple replicas, and the architecture has no single point of failure. This is also why Druid lends itself to cloud native more readily than many other databases: its components are clearly delineated, so they map easily onto Pods for lifecycle management; each component is highly cohesive, and new nodes are automatically discovered and added to the distributed cluster, which makes HPA or VPA scaling strategies easier to apply.

ZooKeeper acts as the configuration center and is also responsible for Overlord and Coordinator leader election and for task distribution.

PostgreSQL is the metadata store, holding metadata for DataSources, Segments, and Tasks.

HDFS provides the underlying storage: the full business data is kept in the HDFS cluster and loaded onto Historical data nodes according to retention rules to speed up queries.
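
For example, a rule set that keeps the most recent 30 days loaded on Historical nodes with two replicas, while older data stays only in deep storage, could be pushed to the Coordinator roughly as follows; the Coordinator URL and datasource name are hypothetical, and the rule format follows Druid's retention-rule API:

```python
# Hedged sketch: push a retention rule set to the Druid Coordinator so that
# only recent data stays loaded on Historical nodes. Coordinator URL and
# datasource name are hypothetical.
import requests  # pip install requests

COORDINATOR = "http://druid-coordinator.example.internal:8081"
DATASOURCE = "orders_events"

rules = [
    {   # keep the last 30 days loaded, 2 replicas in the default tier
        "type": "loadByPeriod",
        "period": "P30D",
        "includeFuture": True,
        "tieredReplicants": {"_default_tier": 2},
    },
    {   # everything older stays only in HDFS deep storage
        "type": "dropForever",
    },
]

resp = requests.post(
    f"{COORDINATOR}/druid/coordinator/v1/rules/{DATASOURCE}",
    json=rules,
    headers={"X-Druid-Author": "data-infra", "X-Druid-Comment": "30d retention"},
    timeout=10,
)
resp.raise_for_status()
```

The period and replica count would of course be tuned per business and per DataSource.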

Next up is the System Monitoring section.

There are currently three levels of metric monitoring, namely the physical machine level, the service level and the Druid Metric level:

We usually find that combining different monitors yields a 1+1>2 effect. Still, trade-offs have to be made for the actual situation: iterate gradually and introduce a new monitoring component only when it brings a new analytical perspective, rather than piling components up blindly, because only by keeping architectural complexity under control can system risk be reduced effectively.

Finally, the K8s part. The Ingress serves as the traffic entry point and a higher-level abstraction over Services; a Service load-balances and hands traffic to kube-proxy, which then forwards it to the smallest scheduling unit, the Pod.

Pods are divided into stateless and stateful: the former include the Router, Broker, Coordinator, and Overlord, while the latter include the MiddleManager, Historical, ZooKeeper, and PostgreSQL. The stateful Pods also need to declare PersistentVolumes so that their data survives, and to avoid cross-node access we usually add node affinity to improve data locality.

Stateful Pods are grouped into StatefulSets, while stateless Pods are grouped into ReplicaSets; to ease version control and lifecycle management, the Deployment abstraction is layered on top of the ReplicaSet.
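
As a condensed, hedged sketch of those two points (names, image tag, label key, and storage size are illustrative), a Historical StatefulSet with a volume claim template and node affinity might be generated like this:

```python
# Hedged sketch: a condensed StatefulSet for Druid Historical nodes showing a
# volumeClaimTemplate for the segment cache and node affinity for data
# locality. Names, sizes, labels, and the image tag are illustrative.
import yaml  # pip install pyyaml

historical_sts = {
    "apiVersion": "apps/v1",
    "kind": "StatefulSet",
    "metadata": {"name": "druid-historical", "namespace": "druid-business-a"},
    "spec": {
        "serviceName": "druid-historical",
        "replicas": 3,
        "selector": {"matchLabels": {"app": "druid-historical"}},
        "template": {
            "metadata": {"labels": {"app": "druid-historical"}},
            "spec": {
                "affinity": {
                    "nodeAffinity": {
                        # schedule only onto nodes labelled for Druid data
                        "requiredDuringSchedulingIgnoredDuringExecution": {
                            "nodeSelectorTerms": [{
                                "matchExpressions": [{
                                    "key": "node-role/druid-data",
                                    "operator": "In",
                                    "values": ["true"],
                                }],
                            }],
                        },
                    },
                },
                "containers": [{
                    "name": "historical",
                    "image": "harbor.example.internal/druid/druid:25.0.0",
                    "volumeMounts": [{
                        "name": "segment-cache",
                        "mountPath": "/opt/druid/var/segment-cache",
                    }],
                }],
            },
        },
        # each replica gets its own PersistentVolume for the local segment cache
        "volumeClaimTemplates": [{
            "metadata": {"name": "segment-cache"},
            "spec": {
                "accessModes": ["ReadWriteOnce"],
                "resources": {"requests": {"storage": "500Gi"}},
            },
        }],
    },
}

print(yaml.safe_dump(historical_sts, sort_keys=False))
```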

All of this cluster state is stored in etcd on the K8s master nodes. The Controller Manager maintains the cluster state, the Scheduler handles scheduling, and the API Server provides the unified entry point.

At this point, the Druid cloud-native architecture should no longer be a "black box" to you.

That wraps up the architecture. The next section briefly describes how Shopee packages these services into a complete solution that meets the needs of diverse business scenarios.

First, we provide Druid on K8s exclusive clusters for each core business to ensure stability, security, efficiency and low cost.

Second, we also build a companion dedicated Grafana on K8s cluster with built-in basic monitoring panels such as write TPS and query QPS, and we grant the business Admin permissions so that the business side can design additional monitoring panels to suit its own scenarios.

In addition, some businesses need to do secondary development of Grafana plugins. In the dedicated mode this is easier to upgrade and iterate on, and even if a buggy new plugin brings Grafana down, other businesses are unaffected.

Of course, in order to support a variety of business scenarios, we provide a wealth of visualization solutions in addition to Grafana:

There’s always something that fits our business needs well.

By evolving to a cloud-native architecture, we have ensured high stability, high performance, achieved high efficiency and low cost, greatly improved service quality, and promoted business development.

At the architectural level, we solved problems that could not be handled at the kernel level, or that would have been too expensive to solve there. If you run into a similar dilemma, it may be worth stepping back and approaching it from another angle; you may be rewarded with unexpected results.

Beyond continuing to push architecture upgrades and kernel improvements, we will keep investing in integrating with and being integrated by other systems, collaborating with the open-source community, and building the team's influence.

To help readers better understand the architecture design, here is a brief introduction to the basic concepts involved in this article, including terminology, Kubernetes core components, and commonly used plugins.

The first is Druid’s related terminology:

This is followed by cloud-native related terminology:

Let’s go through the core components of K8s.

Let’s take a look at the common plugins for K8s.

These two plug-ins are mandatory for most scenarios:

Installing these two plugins significantly improves efficiency, so they are highly recommended:

Finally, there is the optional Federation, which allows us to implement cluster federation across Availability Zones.

The author of this article

Jiayi is a big data technology expert on the Shopee Data Infrastructure team.

Join us

The Shopee Data Infrastructure team focuses on building stable, efficient, secure, and easy-to-use big data infrastructure and platforms for the company.

Our work includes: real-time data pipeline support and Kafka and Flink development; development and maintenance of Hadoop-ecosystem components such as HDFS and Spark; O&M of the Linux operating system and of big data components; development and business support for OLAP components such as Presto, Druid, Trino, Elasticsearch, and ClickHouse; and development of big data platform systems such as resource management and task scheduling. For more about Data Infra openings, please search the Shopee careers website.
