Hello everyone, I’m Brother Tom.

Improving system performance and squeezing the most out of hardware resources is a programmer's perennial pursuit. Today, let's talk about performance optimization.

This series is divided into three parts and moves from the shallow to the deep, covering performance optimization in all its aspects rather than just the code level. I hope you can take something away from it.

Software design and development is, in a sense, the art of trade-offs — of "taking" and "giving".

Performance is no different: just as designing a building to withstand intensity-9 earthquakes comes at additional cost, a high-performance software system means a higher implementation cost, and sometimes conflicts with other quality attributes such as security, scalability, and observability.

Most of the time, what we need is to optimize the system to the desired level before the business encounters bottlenecks.

So, what are the technical directions and means of performance optimization?

Performance optimization is often a trade-off between “time” and “space”.

This section is divided into two parts, and in the first part, it explains six common means of exchanging “time” and “space”:

In the next part, four advanced sections are introduced, most of which are related to improving parallelism:

For each performance-optimization technique, I picked a fitting picture of a Naruto character or jutsu; name the character or jutsu in the comments and I'll send you a small star.

(Note: all pictures are from the anime "Naruto"; some have added text to aid understanding and are used for technical discussion only.)


The principle of indexing is to trade extra storage space for query time: it adds overhead when writing data, but typically reduces the time complexity of reads from O(n) to O(log n) or even O(1).

Indexes are widely used not only in databases, but also throughout front-end and back-end development.

When the data set is large, searching without an index is like looking up a word in a Xinhua dictionary that has no table of contents and whose content is out of order: you have to turn page after page to find it.

With an index, it is like first using the pinyin table of contents to find the right page, then turning straight to it.

The table of contents of a book is a typical tree structure. So what data structures do common indexes in the software world use, and in which scenarios?

The database primary key battle: auto-increment vs UUID. The primary key is a very important index in many databases, especially RDBMSs such as MySQL, and we often face this dilemma: use an auto-increment ID or a random UUID as the primary key?

Auto-increment IDs give the best performance, but they are hard to keep globally unique after sharding databases and tables, and the increment pattern may leak business information; UUIDs, on the other hand, are not human-readable and take up more storage space.

The result of the dispute is to find a compromise that combines the advantages of both:

Using the snowflake algorithm to generate a globally unique ID in a distributed environment as the business table's primary key is an acceptable compromise: it is relatively storage-friendly and the IDs are roughly time-ordered (trend-increasing), but it introduces extra complexity — once again a trade-off.
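
To get a feel for how such IDs are composed, here is a simplified, single-node sketch of a snowflake-style generator (41-bit timestamp, 10-bit worker ID, 12-bit sequence). The custom epoch is an arbitrary choice, and a real implementation would also have to handle clock rollback:

```java
public class SnowflakeIdSketch {
    private static final long EPOCH = 1640995200000L; // custom epoch (2022-01-01), an arbitrary choice
    private final long workerId;       // 10 bits: which node generated the ID
    private long lastTimestamp = -1L;
    private long sequence = 0L;        // 12 bits: per-millisecond counter

    public SnowflakeIdSketch(long workerId) {
        this.workerId = workerId & 0x3FF;
    }

    public synchronized long nextId() {
        long now = System.currentTimeMillis();
        if (now == lastTimestamp) {
            sequence = (sequence + 1) & 0xFFF;       // up to 4096 IDs per millisecond
            if (sequence == 0) {                     // exhausted: spin until the next millisecond
                while ((now = System.currentTimeMillis()) <= lastTimestamp) { }
            }
        } else {
            sequence = 0;
        }
        lastTimestamp = now;
        // layout: 41-bit timestamp | 10-bit worker ID | 12-bit sequence
        return ((now - EPOCH) << 22) | (workerId << 12) | sequence;
    }
}
```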

Back to database indexes: what should we pay attention to when creating them?

Beyond databases, index thinking also applies in code: for searches over large in-memory collections, data structures such as Set, Map, and Tree are effectively hash or tree indexes, and they perform far better than traversing a list or array directly.
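
As a minimal sketch of this idea (the User record and the data volume are made up for illustration), building a HashMap "index" over a list turns repeated O(n) scans into O(1) lookups:

```java
import java.util.*;
import java.util.stream.*;

public class IndexDemo {
    record User(long id, String name) {}

    public static void main(String[] args) {
        List<User> users = LongStream.range(0, 1_000_000)
                .mapToObj(i -> new User(i, "user-" + i))
                .collect(Collectors.toList());

        // O(n): scan the whole list for every lookup
        User slow = users.stream().filter(u -> u.id() == 999_999).findFirst().orElseThrow();

        // Build a hash "index" once: O(n) to build, then O(1) per lookup
        Map<Long, User> byId = users.stream()
                .collect(Collectors.toMap(User::id, u -> u));
        User fast = byId.get(999_999L);

        System.out.println(slow.name() + " / " + fast.name());
    }
}
```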

Caching improves performance on the same principle as indexing: extra storage space in exchange for query time. Caching is everywhere — imagine how many layers of cache are involved when we open this article in a browser.

The common caches listed here are only a few of its many forms: from cheap disks to expensive CPU caches, all of them are designed to trade space for valuable time.

Since caching is so good, the question arises: Is caching a “silver bullet”?

No. As Phil Karlton once said:

There are only two hard things in Computer Science: cache invalidation and naming things.

Beyond the extra complexity it brings, caching also forces us to deal with the problem of cache invalidation.
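
As a toy illustration of an in-process cache with a built-in eviction policy — not any particular production cache — Java's LinkedHashMap can be turned into a small LRU cache. Note that eviction here is purely size-based; real caches still need TTLs or explicit invalidation on writes to stay consistent with the source of truth:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// A minimal in-process LRU cache: bounded capacity, evicts the least recently used entry.
public class LruCache<K, V> extends LinkedHashMap<K, V> {
    private final int capacity;

    public LruCache(int capacity) {
        super(16, 0.75f, true);   // accessOrder = true -> iteration order is least-recently-used first
        this.capacity = capacity;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        return size() > capacity; // evict once we exceed the configured capacity
    }
}
```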

In addition to caching in the usual sense, pooling techniques for object reuse can also be seen as a variant of caching.

Common examples include the runtime constant pools of the JVM and V8, database connection pools, HTTP connection pools, thread pools, Golang's sync.Pool object pool, and so on.

Pooled reuse is a common optimization: when a resource is needed, take one straight from the existing pool and tweak it slightly or use it as-is, instead of creating it from scratch.
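
A minimal sketch of the pooling idea (ignoring the validation, eviction, and blocking behavior that real pools such as connection pools provide):

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.function.Supplier;

// A minimal generic object pool: borrow a cached instance if one is available,
// otherwise create a new one; return it to the pool after use.
public class SimplePool<T> {
    private final BlockingQueue<T> idle;
    private final Supplier<T> factory;

    public SimplePool(int capacity, Supplier<T> factory) {
        this.idle = new ArrayBlockingQueue<>(capacity);
        this.factory = factory;
    }

    public T borrow() {
        T obj = idle.poll();                 // reuse a pooled instance if present
        return obj != null ? obj : factory.get();
    }

    public void release(T obj) {
        idle.offer(obj);                     // silently drop the object if the pool is full
    }
}
```

Usage would look like `SimplePool<StringBuilder> pool = new SimplePool<>(16, StringBuilder::new)`; remember that borrowed objects must be reset before reuse — forgetting that is a classic source of pooling bugs.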

After these two "space for time" techniques, let's look at one that goes the other way — "time for space": compression.

The principle of compression is to spend computation time to represent the same data in a more compact encoding.

Why trade time for space? Isn’t time the most valuable resource?

Take a video website as an example: without any compression encoding, the sheer volume of data would, under limited bandwidth, take far longer to transmit over the network than the time spent encoding and compressing it.

Although compressing data costs time in exchange for smaller storage, the smaller footprint brings greater time savings in another dimension.

This example is essentially a trade-off between the processing burden on the operating system kernel and network devices versus the CPU/GPU burden of compressing and decompressing.

We usually use lossless compression in our code, such as the following scenarios:
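
For instance, gzip-compressing a large, repetitive text payload before storing or transmitting it. A minimal Java sketch with made-up demo data:

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPOutputStream;

public class GzipDemo {
    public static void main(String[] args) throws Exception {
        // Highly repetitive text compresses very well; random data barely compresses at all.
        byte[] raw = "hello performance ".repeat(1000).getBytes(StandardCharsets.UTF_8);

        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (GZIPOutputStream gzip = new GZIPOutputStream(bos)) {
            gzip.write(raw);                    // spend CPU time...
        }
        byte[] compressed = bos.toByteArray();  // ...to get a much smaller payload

        System.out.printf("raw=%d bytes, gzip=%d bytes%n", raw.length, compressed.length);
    }
}
```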

Information theory tells us that the limit of lossless compression is information entropy. Further volume reduction can only come at the cost of losing some information, that is, lossy compression.

So, what are the applications of lossy compression?

Besides lossy/lossless compression, there is another approach — the extreme form of compression: fundamentally reducing the data, or deleting it altogether.

What can be reduced is reduced:

If you can delete it, delete it:

After all, as Kelsey Hightower once said:

No code is the best way to write secure and reliable applications. Write nothing; deploy nowhere.

Prefetching is usually used together with caching. Its principle goes one step beyond the cache's space-for-time trade by adding a "time for time" trade: time spent fetching data in advance is exchanged for a faster first load.

When you can predict that some data is likely to be needed at some point in the future, fetching it ahead of time to where it will be used can greatly improve user experience or server response speed.

Whether to prefetch is like the difference between a buffet and made-to-order cooking: at a buffet you take dishes that are already prepared, while at an ordinary restaurant you sit down and wait for the dishes to be cooked.

So, in which practical scenarios will prefetch be used?
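
One common back-end form is warming a cache asynchronously at startup, so the first real request does not pay the cold-load cost. A minimal sketch, where loadFromDatabase() is a hypothetical slow loader standing in for a real data source:

```java
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;

public class PrefetchDemo {
    private final Map<String, String> cache = new ConcurrentHashMap<>();

    public void warmUpAsync() {
        // Spend time now, in the background, so later reads are already warm.
        CompletableFuture.runAsync(() -> {
            for (String key : new String[]{"home-page", "hot-items", "config"}) {
                cache.put(key, loadFromDatabase(key));
            }
        });
    }

    public String get(String key) {
        // Falls back to a slow load only if the prefetch guessed wrong.
        return cache.computeIfAbsent(key, this::loadFromDatabase);
    }

    private String loadFromDatabase(String key) {
        try { Thread.sleep(100); } catch (InterruptedException ignored) { }  // simulate a slow source
        return "value-of-" + key;
    }
}
```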

There is no free lunch, though — prefetching has side effects too.

Just as preheating an oven takes time and extra electricity, the side effects of prefetching/warming up in code are usually slower startup, occupation of otherwise idle computing resources, and the possibility that what was fetched is never actually needed.

The principle of peak shaving and valley filling is also "time for time": use the off-peak (valley) time to absorb the peaks.

Peak shaving is the mirror image of prefetching: prefetching does the work in advance, while peak shaving and valley filling does it after the fact. It is like the Three Gorges Dam withstanding a short, huge flood and then, after the rain stops, slowly opening the floodgates to release the water. "Peak shaving and valley filling" in the software world works the same way, only it is achieved not with a dam but with message queues, asynchronous processing, and similar techniques.
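
A minimal in-process sketch of the idea: a burst of requests is buffered in a bounded queue and drained at a steady rate by a consumer. Production systems would of course use a real message queue (Kafka, RocketMQ, etc.) rather than an in-memory buffer:

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class PeakShavingDemo {
    public static void main(String[] args) throws InterruptedException {
        BlockingQueue<String> queue = new ArrayBlockingQueue<>(10_000);

        // Consumer: drains the backlog at its own steady pace, regardless of burst size.
        Thread consumer = new Thread(() -> {
            try {
                while (true) {
                    String task = queue.take();
                    Thread.sleep(5);                        // simulate a fixed per-task cost
                    System.out.println("processed " + task);
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();          // stop when asked to
            }
        });
        consumer.start();

        // Producer: 1,000 requests arrive "at once"; the queue absorbs the spike.
        for (int i = 0; i < 1_000; i++) {
            if (!queue.offer("request-" + i)) {
                System.out.println("buffer full, rejecting request-" + i);
            }
        }

        Thread.sleep(2_000);                                 // let the consumer drain part of the backlog
        consumer.interrupt();
        consumer.join();
    }
}
```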

There are several common types of problems, let’s look at each corresponding solution separately:

Batch processing can also be seen as "time for time": it works by eliminating repeated overhead and is a kind of compression of the execution process. At the expense of each individual batch operation taking longer, the total time for the whole is reduced.

The application of batch processing is also very extensive, and we will start with the front end:

Batching is so useful that the question becomes: how big should each batch be?

There is no definitive answer, but I can share some personal experience.

In short, how large a batch can keep the response time of a single batch acceptable while maximizing overall throughput has to be benchmarked under real conditions; there is no universal answer. The side effects of batching are that the processing logic becomes more complicated, especially where transactions and concurrency are involved, and that an array or queue is needed to buffer a batch of data, which consumes extra storage space.
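
As one concrete back-end example, JDBC batch inserts replace one network round trip per row with one round trip per batch. A minimal sketch in which the JDBC URL, credentials, table, and batch size are all placeholder assumptions to be tuned by benchmarking:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.util.List;

public class BatchInsertDemo {
    private static final int BATCH_SIZE = 500;   // assumed value; tune with real measurements

    public static void insertAll(List<String> names) throws Exception {
        try (Connection conn = DriverManager.getConnection(
                     "jdbc:mysql://localhost:3306/demo", "user", "pass");
             PreparedStatement ps = conn.prepareStatement("INSERT INTO t_user(name) VALUES (?)")) {
            int count = 0;
            for (String name : names) {
                ps.setString(1, name);
                ps.addBatch();                   // buffer the row instead of executing immediately
                if (++count % BATCH_SIZE == 0) {
                    ps.executeBatch();           // one round trip for the whole batch
                }
            }
            ps.executeBatch();                   // flush the remaining tail
        }
    }
}
```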

Earlier, we summarized six universal performance optimization methods, including indexing, compression, caching, prefetching, peak shaving and valley filling, and batch processing, and briefly explained the principle and practical application of each technical means.

Before opening the last article, we need to figure out:

Where do programs spend their time and space while running?

A person blinks an eye for about 100 milliseconds, while a modern 1-core CPU can execute hundreds of millions of instructions in the blink of an eye.

Modern CPUs are already very powerful, with frequencies at the GHz level — that is, billions of clock cycles per second.

Even though some CPU instructions take multiple clock cycles, pipelining means that on average roughly one instruction completes per clock cycle, so a 3 GHz CPU core can execute on the order of 2 to 4 billion instructions per second.

Programs also need RAM to run, and may use persistent storage, the network, and so on. With new technologies and processes, this hardware keeps getting more capable — larger and faster CPU caches, the leap in read/write throughput and latency of NVMe SSDs over SATA disks, and so on. Just how capable is it?

There is a great website, “Latency Numbers Every Programmer Should Know”, which provides a visual view of the specific values of cache, memory, hard disk, and network time overhead from 1990 to the present.

https://colin-scott.github.io/personal_website/research/interactive_latency.html

The image below is a screenshot of 2020 and is indeed “a number that every developer should know”.

Here are a few very critical pieces of data:

Seeing the orders of magnitude of the gap between different pieces of hardware, it’s easy to understand some of the technical means of performance optimization.

For example, a network transmission costs about 5,000 times as much as a main-memory access, so it is not hard to see why firing off HTTP requests inside a loop will get your pay docked.

Scale this up to time spans we can actually perceive to appreciate the 5,000-fold gap: if a main-memory access took 1 day, a LAN data transfer would take 13.7 years.

And if you want to transmit more data over the network, there is a fixed interval between every two frames (the Interpacket Gap) during which an idle signal is transmitted; the data link layer uses it to tell two packets apart. The specific values can be found on the relevant Wikipedia entry; here are a few familiar network speeds to get a feel for it:

However, looking only at the hardware's upper limits is not very meaningful: there are many layers of abstraction between code and machine instructions. Just sending a one-byte packet over a TCP connection involves countless pieces of infrastructure-level hardware and software, from the operating system kernel down to the wire. At the application level there are no very precise numbers for how long a single operation takes, but the empirical ranges are still worth referencing:

In the history of computing, non-volatile storage has evolved faster than Moore's Law. Outside of embedded devices, database systems, and the like, most scenarios now have little need to optimize persistent-storage footprint, so here we focus on another, scarcer form of storage — RAM, i.e. main memory.

In the JVM, for example, there are a lot of objects in the heap that we create.

On machines with more than 32 GB of heap, compressed object pointers are disabled and each object pointer becomes 8 bytes (as does the Klass pointer in the object header), so it is not hard to understand why JVM performance drops sharply once the heap exceeds 32 GB.

For example, an object with 8 int fields occupies 48 bytes (12 + 32 + 4), so 100,000 such objects take about 4.58 MB of memory. That may not sound like much, but in the heap of a real Java service the memory occupied by objects is usually far larger than this, and most of it goes to array and collection types such as char[].
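
The arithmetic behind those numbers, spelled out as comments (assuming a 64-bit HotSpot JVM with compressed oops enabled; exact figures vary by JVM version and flags):

```java
// Rough heap-size arithmetic for one instance of this class:
public class EightInts {
    int a, b, c, d, e, f, g, h;
    // object header: 12 bytes (8-byte mark word + 4-byte compressed Klass pointer)
    // 8 int fields:  32 bytes
    // padding to the 8-byte alignment boundary: 4 bytes
    // total: 48 bytes per instance
    // 100,000 instances: 48 * 100_000 = 4_800_000 bytes ≈ 4.58 MB
}
```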

Beyond the heap memory, it’s another world.

From the operating system process's perspective, there are many other big memory consumers that no runtime can escape: each thread needs an MB-scale thread stack, running programs and data are cached, and the input and output devices in use need buffers...

The memory footprint of the code we "write" is only the tip of the iceberg; the real footprint is always larger, and space-utilization issues exist at every layer.

For example, even if our Java code merely calls response.getWriter().print("OK") to return 2 bytes to the browser, every layer of the network protocol stack wraps the data and adds its own headers, so the number of bytes that finally reach the browser is far more than the original 2: the IP header alone is at least 20 bytes, and the Ethernet frame at the data link layer adds at least 18 bytes.

If the data is too large, each layer's protocol also imposes size limits such as the maximum transmission unit (MTU): an IPv4 packet can be at most 64 KB, beyond which it must be split for sending and reassembled at the receiver, and the extra headers reduce space utilization further (IPv6 offers a Jumbogram mechanism allowing single packets of up to about 4 GB, so less is "wasted").

How big is this "waste"? The link below has a table: for a 1460-byte payload, a wired-to-wireless network conversion adds at least 120 bytes of headers, so **space utilization < 92.4%**.

https://en.wikipedia.org/wiki/Jumbo_frame

This phenomenon is very common: the higher the level of abstraction a technology platform provides, and the more advanced its capabilities, the lower the "information density" of its underlying implementation usually is.

Java's object headers are the price of using the JVM; go a step further to dynamically typed languages and even more space is traded for flexibility. Auto-resizing hash tables, powerful reflection, and the like all pay a price in space as well.

For example, binary data-exchange protocols are generally more space-efficient than plain-text ones, yet most vendors still use plain-text protocols such as JSON and XML, trading information redundancy for readability. Even binary formats carry redundancy; better protocols and compression algorithms can only approach the limit of compression — information entropy.

Yet even knowing where time and space are consumed does not fully explain why software always tends to exhaust hardware resources. There is a law that does explain it — the very law that hammered Moore's Law.

It’s Andy-Bill’s law.

“What Andy gives, what Bill takes.”

Andy refers to former Intel CEO Andy Grove, and Bill refers to Bill Gates.

The meaning is that software grows faster than hardware and will always eat up whatever the hardware provides.

20 years ago, even the most powerful computers could not necessarily run a racing game;

10 years ago, personal computers could already run 3D racing games with decent graphics;

Now, autonomous driving + 5G cloud driving is about to become a reality.

Behind this are countless leaps in hardware technology, and all kinds of software that have eaten that hardware up.

That’s why we have to change our phones every two or three years: it’s not the machines that get stuck as they age, it’s the bloodthirsty software that’s at work.

Therefore, even if the modern hardware level has been so strong, performance optimization is still necessary.

The more complex the software and the higher the level of abstraction, the more the underlying infrastructure needs to be fully optimized.

For most developers, high-level code is gradually moving toward low-code and visual development, so the impact of a single line of code keeps growing, and inefficient code eats ever more hardware resources.

This is the most hardcore article in the series. My technical level is limited, so there may be omissions or errors — corrections are welcome. As before, I use Naruto illustrations and naming to aid understanding:

(Note: these chuunibyou-sounding prefixes just borrow terms from "Naruto" to describe the technical approaches.)

Let all hardware resources be spent on genuinely useful logical computation, rather than on irrelevant work or idling.

From transistors to integrated circuits, drivers, operating systems, and even layers of abstractions of high-level programming languages, each layer of abstraction brings stronger versatility and higher development efficiency, mostly at the expense of operational efficiency.

But when writing code in a high-level language, we can, while preserving readability and maintainability, write in a way that is more efficient and better suited to the runtime environment, thereby reducing unnecessary performance loss — this is the knowledge and mindset passed on by books such as Effective XXX, More Effective XXX, and High Performance XXX.

Getting down to technical details, the following four subsections explain how to reduce "useless work", avoid idling, and squeeze the hardware.

Reduce system calls and context switching to keep the CPU focused.

Take a look at two posts on stackoverflow:

https://stackoverflow.com/questions/21887797/what-is-the-overhead-of-a-context-switch
https://stackoverflow.com/questions/23599074/system-calls-overhead

For most Internet application services, the time goes not to computation but to I/O.

Reduce I/O waits and let each part do its own job: threads dedicated to I/O focus on I/O, threads dedicated to computation focus on computing, and epoll fetches ready tasks in batches (see: event-driven)

Reduce the CPU's burden with DMA and zero copy — New I/O (NIO), Redis's single-threaded model (even in 6.0), Node.js

Avoid unnecessary scheduling – Context Switch

CPU affinity, so that the CPU is more focused
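
One small, concrete illustration of avoiding unnecessary scheduling: submitting CPU-bound work to a pool sized near the core count, instead of spawning a thread per task, keeps the CPU from burning time on context switches. A sketch:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class BoundedPoolDemo {
    public static void main(String[] args) {
        int cores = Runtime.getRuntime().availableProcessors();
        ExecutorService pool = Executors.newFixedThreadPool(cores);  // bounded, roughly one thread per core

        for (int i = 0; i < 10_000; i++) {
            final int taskId = i;
            pool.submit(() -> cpuBoundWork(taskId));  // tasks queue up instead of each getting a thread
        }
        pool.shutdown();
    }

    private static long cpuBoundWork(int seed) {
        long x = seed;
        for (int i = 0; i < 1_000_000; i++) x = x * 31 + i;  // simulate pure computation
        return x;
    }
}
```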

Use more efficient data structures, algorithms, and third-party components to transform the program itself.

From coding techniques such as logical short-circuiting, using a Map instead of traversing a List, and narrowing lock scope, to applying classic algorithms such as Fisher-Yates and Dijkstra — paying attention to every line of code lets quantitative changes become qualitative. What's more, a single algorithm can be enough to improve a system's performance by one or two orders of magnitude.
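
For example, here is a sketch of the Fisher-Yates shuffle mentioned above — an unbiased, in-place O(n) shuffle, versus naive approaches (such as sorting by a random key) that cost more or introduce bias:

```java
import java.util.concurrent.ThreadLocalRandom;

public class Shuffle {
    public static void fisherYates(int[] a) {
        ThreadLocalRandom rnd = ThreadLocalRandom.current();
        for (int i = a.length - 1; i > 0; i--) {
            int j = rnd.nextInt(i + 1);      // pick an index uniformly from [0, i]
            int tmp = a[i]; a[i] = a[j]; a[j] = tmp;  // swap it into position i
        }
    }
}
```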

Adapt to local conditions and the specific runtime environment

In the browser, the main optimization direction is I/O, UI rendering engine, and JS execution engine.

The less I/O the better: where WebSocket can be used, don't poll with Ajax; where Ajax can be used, don't refresh the entire page;

For UI rendering, reduce reflow and repaint — for example, the virtual DOM of MVVM frameworks such as Vue and React trades extra computation for the most economical DOM operations;

For the JS execution engine, avoid highly dynamic constructs such as eval or arbitrarily modifying an object's properties or prototype.

Front-end optimization has a great tool: Lighthouse, now built into the developer tools of recent Chrome versions, which generates a performance-optimization report with one click; just follow its recommendations.

The Node.js environment, which is quite similar to the browser environment:

https://segmentfault.com/a/1190000007621011#articleHeader11

Java

Linux

Take advantage of language features and runtime environments – such as writing code that is good for JIT

Reduce memory allocation and reclamation, and reduce insertions and deletions in lists

In embedded environments with limited RAM, time is sometimes not the problem; instead, time is traded for space to save RAM.

Broaden the view: step outside the program and its runtime environment, and analyze systemically, from the whole, which optimization has the best cost-benefit ratio, where the potential entry points are, and what resources and technologies can be brought to bear.

One of the simplest approaches is to spend money on better or more hardware infrastructure — something developers often overlook. A few tips:

The first point is very important: software performance follows the barrel (weakest-link) principle, so we must find out which hardware resource is the bottleneck and spend the money where it counts.

If it is a performance problem caused by a server-side bandwidth bottleneck, it is useless to upgrade the multi-core CPU.

I have a performance-optimization case of my own: a Node.js server running complex business logic was switched from an AWS m4 instance type to a c4 type. It had only half the memory, yet CPU usage dropped by 20% and the price was lower than before — two birds with one stone.

This is because the Node.js main thread's computation runs on only one CPU core. A CPU-profile flame graph located the service's bottleneck in the main thread's computation, so raising single-core performance had an immediate effect; the business did not use much memory, so schemes that tuned the v8 engine's memory parameters could not help at all.

That said, such cases are rare; most of the time you simply have to spend more money on higher-spec servers. Apart from solving the problem directly with money, the remaining approaches are harder:

Do some of these techniques conjure extra space and time out of thin air?

There is no free lunch in the world. Even the optimization techniques that look like getting something for nothing require extra labor, and the side effect may be the engineer's receding hairline. Fortunately, I don't know many complex optimization tricks, so my own hairline is fine.

This section only sketches some directions; several of them go very deep into technical detail and cannot be expanded here. But even when a single machine is squeezed dry it may still not be enough to support the business, and then distributed clusters have to enter the picture — which is why the three technical directions introduced next are all related to parallelization.

The horizontal scaling in this section and the sharding in the next should be seen as overall performance improvements rather than single-point optimizations: the extra components they introduce actually reduce the performance of handling an individual request.

However, when the scale of the business is large to a certain extent, even the best stand-alone hardware cannot withstand the peak of traffic, so it has to be expanded horizontally, after all, “everyone collects firewood and the flame is high.”

The theoretical background is that silicon-based semiconductors are already approaching physical limits; as Moore's Law weakens, the effect of Amdahl's Law becomes ever more apparent:

https://en.wikipedia.org/wiki/Amdahl%27s_law

Horizontal capacity expansion inevitably introduces load balancing

Horizontal scaling is for stateless components, while sharding is for stateful components. Both improve parallelism in principle, but sharding is harder.

Load balancing is then no longer simple weighted round-robin, but evolves into a coordinator for the individual shards

In some business scenarios, such as inventory, implementing the normal logic means the gains from horizontal scaling are very limited, because inventory has to be locked, deducted, and then unlocked.

Ticketing systems are similar: to avoid overselling, a lock is needed, and that lock shackles the ability to scale horizontally.

Whether in a single process or in distributed microservices, locks are a major factor limiting parallelism. Take the flash-sale scenario mentioned earlier: inventory is limited and overselling could cause very large economic losses, but using a distributed lock means that even if the service scales out to thousands of instances, countless requests still end up blocked on that serial distributed-lock component, and no amount of horizontal scaling helps.

Avoiding the race condition altogether is the ultimate solution.

For the flash-sale scenario above, prefetching inventory to each server is one way of alleviating the race condition: there is still multi-thread locking after the inventory lands in server memory, but the lock granularity is much finer and concurrency improves.
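
A minimal sketch of that finer granularity: each server deducts from its locally prefetched slice of inventory with a lock-free compare-and-set loop instead of holding a coarse distributed lock (the prefetched quota itself is assumed to come from the central inventory store):

```java
import java.util.concurrent.atomic.AtomicInteger;

public class LocalInventory {
    private final AtomicInteger remaining;

    public LocalInventory(int prefetchedQuota) {
        this.remaining = new AtomicInteger(prefetchedQuota);  // slice of stock fetched from the central store
    }

    /** Try to deduct one item; returns false when this server's slice is sold out. */
    public boolean tryDeduct() {
        while (true) {
            int current = remaining.get();
            if (current <= 0) return false;                   // sold out locally, never oversell
            if (remaining.compareAndSet(current, current - 1)) return true;  // lock-free CAS deduction
        }
    }
}
```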

Look at software development from an ROI perspective — initial labor cost, later maintenance cost, compute-resource cost, and so on — and choose a suitable solution rather than the highest-performing one.

This article summarizes common performance-optimization methods from personal experience, and they are only the tip of the iceberg. It is impossible to design and implement a perfect high-performance system up front; as the software iterates and the volume grows, use load testing, various tools (profiling, vmstat, iostat, netstat), and monitoring to gradually find the system's bottlenecks, and choose the optimization means that suit the circumstances — that is the right way.

Every gain has a cost, something is always traded away, and some techniques must be used with caution. Linux performance guru Brendan Gregg has repeatedly emphasized avoiding premature optimization and over-optimization.

Keep observing continuously, and focus optimization effort where the input-output ratio is high.

In addition to these possible means of design and implementation, it is also important to select high-performance frameworks and components for technology selection.

The same goes for the hardware performance of the deployment infrastructure: choosing appropriate servers, networks, and other infrastructure often achieves twice the result with half the effort — for example, the various letter-prefixed instance types offered by cloud vendors, the bandwidth and stability of network equipment, the I/O capability of disks, and so on.

Most of the time we should choose the higher-performance approach, but occasionally we deliberately break the rules. Finally, let's close with a quote from the first chapter of Effective Java:

Learn the basic rules first, and then you can know when you can break them.

About me: Brother Tom, a former Alibaba P7 technical expert and an "offer harvester" who took part in many Taobao Double 11 promotions. Follow me and I will keep publishing classic original articles to help you advance toward a big-name company.
