Author: vivo Internet server team – Chen Dongxing, Li Haoxuan, Chen Jinxia

As business grows more complex, performance optimization has become a required skill for every engineer. Where does performance optimization begin? How do you trace a performance bottleneck from its symptoms? How do you verify that an optimization is effective? This article shares the performance tuning practice from the vivo push recommendation project and, we hope, offers some useful reference points.

1. Background

In push recommendation, the online service receives events from Kafka for users who need to be reached, and then selects the most suitable articles to push to those target users. The service is developed in Java and is CPU-intensive.

As the business has grown, rising request concurrency and an increasing amount of model computation led to performance bottlenecks: a serious backlog in Kafka consumption, failure to reach target users in time, and an inability to keep up with business growth. Performance optimization became urgent.

2. Optimization metrics and approach

Our performance metric is throughput, measured as TPS. From the classic formula TPS = concurrency / average response time (RT), there are two ways to improve TPS:

Increase concurrency, for example by raising the number of parallel threads on a single machine or by scaling out the number of machines horizontally;

Reduce the average response time (RT), which covers both application-thread (business logic) execution time and the JVM's own GC time.
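As a quick sanity check of the formula (illustrative numbers, not our production figures): with 200 requests in flight and an average RT of 100 ms, TPS = 200 / 0.1 s = 2,000 requests per second; at the same concurrency, halving RT to 50 ms doubles TPS to 4,000.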

In practice, our machines' CPU utilization was already very high, above 80%, and the expected benefit of increasing per-machine concurrency was limited, so the main effort went into reducing RT.

The sections below describe, for hot code and JVM GC in turn, how we analyzed and located the performance bottlenecks, and how three measures improved throughput by 100%.

3. Hot code optimization

How do you quickly find the most time-consuming hotspot code in an application? Using Alibaba's open-source Arthas tool, we captured a CPU flame graph of the online service.

Flame graph notes: a flame graph is an SVG image generated from perf sampling results that visualizes the CPU's call stacks.

The y-axis represents the call stack, with each layer being one function. The deeper the call stack, the taller the flame; the function currently executing is at the top and its parent functions are below.

The x-axis represents the number of samples: the wider a function's span on the x-axis, the more often it was sampled, i.e., the longer it ran. Note that the x-axis does not represent the passage of time; all call stacks are merged and sorted alphabetically.

To read a flame graph, look for the functions on the top layer with the widest spans. Any wide "plateau" marks a function that may have a performance problem.

Colors carry no special meaning; warm colors are conventionally used simply because a flame graph depicts how busy the CPU is.

3.1 Optimization 1: Try to avoid the native String.split method

3.1.1 Performance bottleneck analysis

From the flame graph, we first found that 13% of the CPU time was spent on the java.lang.String.split method.

Anyone familiar with performance optimization will know that the native split method is a notorious performance killer: it is relatively inefficient and consumes a lot of resources when called frequently.

However, our feature processing genuinely requires frequent splitting. How should it be optimized?

By analyzing the split source code and our project's usage patterns, we found three optimization points:

(1) Our business does not use regular expressions, yet the native split treats a delimiter of two or more characters as a regular expression by default, and regular expressions are notoriously inefficient.

(2) When the delimiter is a single character (and not a regex metacharacter), the native String.split takes an optimized fast path, but some of the internal conversions on that path are redundant and costly for our actual business scenario.

Concretely, the fast path splits with the String.indexOf and String.substring methods, stores the pieces in an ArrayList, and finally converts the ArrayList to a String[] for output. In our business we usually want a List anyway, so there are two needless conversions between List and String[].

(3) At the call sites where split is invoked most frequently, only the first piece of the result is needed. The native split method (and various utility classes) offers an overload with a limit parameter that returns early once the limit is reached, but the str.split(delim)[0] pattern used in the business code does not get the best performance.

3.1.2 Optimization scheme

For our business scenarios, we built a custom, performance-optimized implementation of split.

Compared to the native String.split, there are several main changes:

Regular expression support is dropped; only splitting by a literal delimiter is supported;

The result is returned directly as a List. The splitting itself mirrors the native implementation's single-character fast path, using the String.indexOf and String.substring methods, but each piece is appended to a List whose reference is returned directly, removing the extra data conversions;

A splitFirst method is provided, which further improves performance when the business scenario only needs the first piece before the delimiter.
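Below is a minimal sketch of such a splitter (the class name is ours for illustration; the production version handles more edge cases, such as the native split's removal of trailing empty strings):

```java
import java.util.ArrayList;
import java.util.List;

public final class FastSplit {

    // Splits by a literal delimiter (no regex) and returns a List directly,
    // avoiding the ArrayList -> String[] -> List round trips.
    public static List<String> split(String str, String delim) {
        List<String> result = new ArrayList<>();
        int start = 0;
        int idx;
        while ((idx = str.indexOf(delim, start)) != -1) {
            result.add(str.substring(start, idx));
            start = idx + delim.length();
        }
        result.add(str.substring(start));
        return result;
    }

    // Returns only the text before the first delimiter, stopping the scan
    // early instead of splitting the whole string.
    public static String splitFirst(String str, String delim) {
        int idx = str.indexOf(delim);
        return idx < 0 ? str : str.substring(0, idx);
    }
}
```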

3.1.3 Microbenchmarking

How do we verify the optimization? JMH was chosen as the micro-benchmarking tool, with the native String.split and Apache Commons' StringUtils.split as the baselines for comparison.
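A skeleton of the kind of JMH benchmark used, with illustrative inputs rather than the actual test data:

```java
import java.util.concurrent.TimeUnit;
import org.apache.commons.lang3.StringUtils;
import org.openjdk.jmh.annotations.*;

@BenchmarkMode(Mode.Throughput)
@OutputTimeUnit(TimeUnit.MILLISECONDS)
@State(Scope.Benchmark)
@Fork(1)
@Warmup(iterations = 3)
@Measurement(iterations = 5)
public class SplitBenchmark {

    // Illustrative input; the real benchmark used business feature strings.
    private String input = "field1,field2,field3,field4,field5";

    @Benchmark
    public String[] nativeSplit() {
        return input.split(",");
    }

    @Benchmark
    public String[] apacheSplit() {
        // StringUtils.split treats the second argument as a set of
        // separator characters, equivalent here for a single char.
        return StringUtils.split(input, ",");
    }

    @Benchmark
    public java.util.List<String> customSplit() {
        return FastSplit.split(input, ","); // the sketch from 3.1.2
    }
}
```

The test results were as follows: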

Use a single character as the delimiter

As the results show, the native implementation performs about the same as Apache's utility class, while the custom implementation improves performance by roughly 50%.

Use multiple characters as the delimiter

With a two-character delimiter, the native implementation's throughput drops sharply, to about 1/3 of its single-character result; Apache's implementation also falls, to about 2/3 of its single-character result, while the custom implementation stays essentially level.

Use a single character as the delimiter and simply return the first split result

With a single-character delimiter and only the first piece needed, the custom splitFirst delivers roughly 2x the throughput of the native implementation, and roughly 5x that of doing a full native split and taking the first element.

3.1.4 End-to-end optimization effect

Having verified the benefit with micro-benchmarks, we deployed the optimization to the online service to validate the overall end-to-end performance gain.

Collecting flame graphs with Arthas again showed the split method's share of CPU time reduced to about 2%. Overall, end-to-end time consumption fell by 31.77% and throughput rose by 45.24%, a particularly significant gain.

3.2 Optimization 2: Speed up map lookups

3.2.1 Performance bottleneck analysis

From the flame graph, we found that the HashMap.getOrDefault method also took an outsized share of time, reaching 20%, mainly in queries against the weight map. Two factors explain this:

The service genuinely calls it at high frequency: feature crossing expands the number of features, and a single machine performs roughly 10 million lookups per second.

The weight map itself is also very large, holding more than 10 million entries and occupying a large chunk of memory. The probability of hash collisions rises with size, and lookup cost on a collision degrades from O(1) to O(n) (linked list) or O(log n) (red-black tree).

HashMap is itself a very efficient map implementation. We first tried tuning the load factor (loadFactor) and switching to other map implementations, but saw no significant benefit.

How can you improve the performance of the get method?

3.2.2 Optimization scheme

During the analysis, we noticed that the query map's keys (the keys produced by feature crossing) are strings with an average length above 20. String's equals method compares the backing char[] character by character, so the longer the key, the slower each comparison.

Could we shorten the keys, or even switch to a numeric type? A quick micro-benchmark suggested the idea was feasible.

We took this to the algorithm team, who as it happened had the same need: while migrating to a new training framework they had found string features particularly inefficient and wanted to replace them with numeric types.

A plan was quickly settled:

The algorithm team maps each feature key to a long value, using a custom hash implementation to minimize the probability of collisions;

The algorithm team trains a new weight map for the model, which can retain more entries while matching the baseline model's performance indicators;

Once the new model matches the baseline's performance indicators, it is released to the online service as a grayscale rollout, the weight map's keys are switched to the long type, and the performance indicators are verified. A sketch of the long-key idea follows.
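To illustrate the mechanism (all names are ours, and the team's actual hash is a custom implementation; we substitute the well-known 64-bit FNV-1a purely for illustration):

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

public final class FeatureKeyMapper {

    // 64-bit FNV-1a: one well-known way to map a feature string to a long
    // with a low collision probability.
    public static long toLongKey(String featureKey) {
        long hash = 0xcbf29ce484222325L;
        for (byte b : featureKey.getBytes(StandardCharsets.UTF_8)) {
            hash ^= (b & 0xffL);
            hash *= 0x100000001b3L;
        }
        return hash;
    }

    public static void main(String[] args) {
        Map<Long, Float> weights = new HashMap<>();
        weights.put(toLongKey("user_age_x_article_topic"), 0.137f);
        // The lookup now compares a single long instead of walking
        // a 20+ character string on every equals call.
        Float w = weights.getOrDefault(toLongKey("user_age_x_article_topic"), 0f);
        System.out.println(w);
    }
}
```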

3.2.3 Optimization effect

With 30% more feature entries (the model's effect exceeding the baseline), the engineering side also saw a significant benefit:

End-to-end time consumption fell by 20.67% and throughput rose by 26.09%. Memory usage also improved, with the weight map's footprint shrinking by 30%.

4. JVM GC optimization

Java introduced garbage collection to free application developers from manual dynamic memory management: they need not worry about allocating and releasing memory, or about the lifetime of allocated objects. This eliminates an entire class of memory-management errors, at the cost of some additional runtime overhead.

When developing small systems, the performance overhead of GC is negligible; but scaled up to large systems, especially those with large data volumes, many threads, and high transaction rates, GC overhead cannot be ignored and may even become a serious performance bottleneck.

The figure above models an ideal system that is perfectly scalable except for garbage collection. The red line represents an application that spends only 1% of its time in GC on a single-processor system; on a 32-processor system this becomes a throughput loss of more than 20%. The magenta line shows that an application spending 10% of its time in GC (not an excessive share on a single processor) loses more than 75% of its throughput when scaled to 32 processors.
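These figures are just Amdahl's law at work: if GC is a serial fraction f of the total work, the time per unit of work on 32 processors is (1 - f)/32 + f, versus an ideal 1/32. With f = 1% that is 0.99/32 + 0.01 ≈ 0.041 against the ideal 0.031, roughly a 24% throughput loss; with f = 10% it is 0.9/32 + 0.1 ≈ 0.128, roughly a 76% loss.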

Therefore, the JVM GC is also an important performance optimization measure.

Our recommendation service uses high-spec compute (64 cores, 256 GB), so the impact of GC is considerable. Collecting and monitoring the online service's GC data revealed that our GC situation was quite bad, with cumulative YGC (young GC) time of about 10 s per minute.

Why was the GC overhead so large, and how could GC time be reduced?

4.1 Optimization 3: Use an off-heap cache instead of an on-heap cache

4.1.1 Performance bottleneck analysis

We dumped the service's live heap objects and analyzed them with the MAT (Memory Analyzer) tool, finding two objects that were especially large, together accounting for 76.8% of the live heap. Specifically:

The largest object is the local cache, which stores commonly used data at a fine granularity, with tens of millions of entries per machine. It uses the Caffeine cache component with the auto-refresh period set to one hour, the goal being to minimize the number of IO queries (a configuration sketch follows this list);

The second largest object is the model weight map itself, which is resident in memory and never updated in place; once a new model is loaded, the old one is unloaded.
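For context, a minimal sketch of the kind of Caffeine configuration described for the local cache (class and loader names are hypothetical):

```java
import com.github.benmanes.caffeine.cache.Caffeine;
import com.github.benmanes.caffeine.cache.LoadingCache;
import java.time.Duration;

public class LocalCacheConfig {

    // A loading cache with a 1-hour auto-refresh, as described above;
    // loadFromStorage is a hypothetical IO-backed loader.
    LoadingCache<String, byte[]> cache = Caffeine.newBuilder()
            .maximumSize(10_000_000)                 // tens of millions of entries per machine
            .refreshAfterWrite(Duration.ofHours(1))  // auto-refresh period
            .build(this::loadFromStorage);

    private byte[] loadFromStorage(String key) {
        // ... query the remote store (details omitted)
        return new byte[0];
    }
}
```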

4.1.2 Optimization scheme

How can we cache as much data as possible while avoiding excessive GC pressure?

We decided to move the cached objects out of the heap, so they would no longer be bounded by the in-heap memory limit; moreover, off-heap memory is not managed by the JVM GC, so an oversized cache no longer affects GC. After some research, we adopted the mature open-source off-heap caching component OHC.

(1) OHC introduction

Overview

OHC, short for off-heap cache, is a caching framework developed for Apache Cassandra in 2015 and later split out of the Cassandra project as a standalone library. Project address: https://github.com/snazy/ohc

Features

Data is stored off-heap, with only a small amount of metadata on-heap, so the cache does not affect GC

Supports setting an expiration time for each cache entry

Supports LRU and W-TinyLFU eviction policies

Can maintain a very large number of cache entries

Supports asynchronous loading of cache entries

Read and write speeds are at the microsecond level

(2) OHC usage

Quick start: a minimal usage sketch is shown below.
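This sketch uses OHC's public builder API, assuming a simple UTF-8 String serializer of our own (OHC requires a CacheSerializer for both keys and values):

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import org.caffinitas.ohc.CacheSerializer;
import org.caffinitas.ohc.OHCache;
import org.caffinitas.ohc.OHCacheBuilder;

public class OhcQuickStart {

    // A minimal length-prefixed UTF-8 serializer for String keys and values.
    static final CacheSerializer<String> STRING_SERIALIZER = new CacheSerializer<String>() {
        public void serialize(String value, ByteBuffer buf) {
            byte[] bytes = value.getBytes(StandardCharsets.UTF_8);
            buf.putInt(bytes.length);
            buf.put(bytes);
        }
        public String deserialize(ByteBuffer buf) {
            byte[] bytes = new byte[buf.getInt()];
            buf.get(bytes);
            return new String(bytes, StandardCharsets.UTF_8);
        }
        public int serializedSize(String value) {
            return 4 + value.getBytes(StandardCharsets.UTF_8).length;
        }
    };

    public static void main(String[] args) throws Exception {
        OHCache<String, String> cache = OHCacheBuilder.<String, String>newBuilder()
                .keySerializer(STRING_SERIALIZER)
                .valueSerializer(STRING_SERIALIZER)
                .build();
        cache.put("user:1", "profile-data"); // value is serialized into off-heap memory
        System.out.println(cache.get("user:1"));
        cache.close();
    }
}
```

Note that every get deserializes the value from off-heap memory, which is why a fast serialization protocol matters.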

Among the optional configuration items, our service sets the capacity to 12 GB, the segmentCount (number of segments) to 1024, and uses Kryo as the serialization protocol, as in the configuration sketch below.
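A sketch of a builder configuration matching those settings (the Kryo-backed serializers are hypothetical placeholders):

```java
import org.caffinitas.ohc.CacheSerializer;
import org.caffinitas.ohc.OHCache;
import org.caffinitas.ohc.OHCacheBuilder;

public class WeightCacheFactory {

    // Builds the off-heap cache with the settings described above.
    public static OHCache<Long, float[]> build(CacheSerializer<Long> kryoKeySerializer,
                                               CacheSerializer<float[]> kryoValueSerializer) {
        return OHCacheBuilder.<Long, float[]>newBuilder()
                .capacity(12L * 1024 * 1024 * 1024)  // 12 GB of off-heap memory
                .segmentCount(1024)                  // more segments reduce lock contention
                .keySerializer(kryoKeySerializer)    // Kryo-backed in our service
                .valueSerializer(kryoValueSerializer)
                .build();
    }
}
```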

4.1.3 Optimization effect

After switching to the off-heap cache, the service's YGC time dropped to about 800 ms per minute, and overall end-to-end throughput rose by about 20%.

4.2 A follow-up question

In the JVM GC optimization, we moved locally cached objects out of the Java heap and achieved a good performance gain. Remember the other giant object mentioned above, the model weight map? Can it be moved out of the Java heap as well?

The answer is yes. We rewrote the model inference computation in C++, including the storage and lookup of the weight map and the calculation of the ranking score. The C++ code is built into an .so library that the Java program calls through native methods, moving the weight map out of the JVM heap, with a good performance benefit.
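A minimal sketch of what the Java side of such a native binding can look like (the library name and method signature are hypothetical):

```java
public class NativeModelScorer {

    static {
        // Loads libmodel_infer.so from java.library.path; the weight map
        // lives in native memory, outside the JVM heap.
        System.loadLibrary("model_infer");
    }

    // Implemented in C++: looks up the weights for the given feature keys
    // and returns the ranking score.
    public static native float score(long[] featureKeys);
}
```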

5. Conclusion

Through the three measures described above, we improved the service's capacity and performance via both hotspot-code optimization and JVM GC tuning, doubling overall throughput (a 100% increase) and reaching the goal for this stage.

Performance tuning, however, is never finished, and every business scenario and system differs; no single article can cover every optimization case. We hope the practical experience shared here, such as how to choose an optimization direction, how to start the analysis, and how to verify the gains, offers some useful reference.
