
From the article series "Flink: From Beginner to Proficient"

Today I want to talk about some ideas and practical lessons on how IT professionals analyze and solve technical problems. I said long ago that what developers need at the later stages of their careers is not simply the ability to design and develop new business functions, but the ability to analyze and solve problems. That ability itself has two aspects:

  • one is the analysis and resolution of IT system operation problems and failures;

  • the other is the ability to turn complex business problems into technical solutions.

In earlier articles on thinking skills, I specifically argued that IT professionals should work on improving their thinking ability, which includes two levels: the analysis and cognition of things, and the independent analysis and solving of problems.

  • the first level, in the IT field, is mainly the ability of architecture design: turning real business requirements and scenarios into an abstract architecture-design language and architecture models;

  • the second level, in the IT field, is the ability, when facing problems or failures, to analyze, diagnose, hypothesize, and verify, and to resolve them quickly.

Many of today's IT professionals actually lack both abilities. They can neither independently produce an overall architecture design, decomposing the business they are responsible for top-down with divide-and-conquer modeling and design, nor quickly localize a critical fault or problem in the production environment and find the root cause to resolve it. Instead, they spend a great deal of time on repetitive, transactional work and on a feverish pursuit of new technologies.

In fact, I have never been opposed to keeping up an interest in learning new technologies. But for any new technology, if your actual working environment gives you no opportunity to practice it, then the performance, security, reliability, and similar problems that any new technology inevitably exhibits are ones you can never truly verify and solve hands-on.

In that case, the new technology stays at the theoretical stage and does not mean much.

For the core logic of problem analysis and solving, you can first refer to my earlier article "Problem Analysis and Solving Logic: McKinsey's Seven-Step Method Is Just the Beginning", which describes the core logic of problem analysis in detail in combination with McKinsey's seven-step method.

First, the key points of technical problem solving

I have written quite a few articles on technical problem analysis and diagnosis, and the problems in them are basically all drawn from real project practice.

Even now, some of those problems have never been fully localized and finally solved, even after we called in Oracle experts and consultants; expertise does not mean a technical problem can be solved immediately.

To put it simply, if you can resolve a technical problem just by searching the web for the exception or problem keyword, it is not a truly challenging technical problem.

For the solution of technical problems, based on my previous practice in problem localization and analysis, I still want to talk through some key points and the logic of thinking behind them.

1. Early accumulation of a large amount of personal practical experience

This is very important: no knowledge base and no search can replace the accumulation of personal knowledge and experience.

Why is work experience valuable?

Usually it is because you have accumulated a great deal of practice in a professional field, and a great deal of experience in analyzing and solving problems. That experience helps you quickly anticipate and localize a problem when it appears, including proposing the most likely hypothetical path.

Much of today's problem solving is unstructured: you prioritize the most likely hypothesis and then verify whether it actually resolves the problem.

Experienced people are the quickest to propose the most likely hypothetical path, avoiding attempts down impossible detours. Suppose a problem has five independent hypothetical paths, A through E, and the most likely one is A. When you are slow to solve the problem, it is often because you assumed and tried path A last; an experienced person picks hypothesis A for verification right at the start.

To accumulate this experience, review each problem promptly after it is solved, and abstract it into experience and methodology.

The focus of problem localization is narrowing the scope and defining boundaries; the most important thing once a problem appears is to localize it quickly.

For example, when a business system query fails, quickly determine whether it is a problem of infrastructure resources, of the database and middleware, or of the program.

If it is a program problem, immediately determine whether it lies in the front end, in the logic layer, or in the database.

Only by quickly determining the boundary and localizing the problem can you solve it in a targeted way. And any problem should be localized back to its root cause, not patched at the surface, like treating the head only because the head aches.

So how do you narrow down and quickly determine boundaries?

For example, take the simplest scenario: the processing passes through two stages, A then B. How do you quickly determine whether the problem arises in stage A or in stage B?

For this problem, the following localization methods and ideas are worth borrowing:

  • Substitution: replace A with a known-good A1; if the problem disappears, the problem lies in stage A;

  • Breakpoint: set a breakpoint between A and B and monitor the intermediate output to judge whether A's output is normal;

  • Hypothesis: assume stage A has the problem, adjust stage A's parameters, and observe whether the problem disappears.

Of course there are many other localization methods, but among all the methods of localizing problems and determining boundaries, the most effective is still the analogue of binary search, which most quickly helps us narrow the scope and localize the problem.
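The binary-search idea can be put in code. The sketch below is purely illustrative (the stage pipeline and the `okUpTo` probe are invented names, not anything from this article): given N stages and a check that tells whether the data is still correct after stage i, binary search finds the first faulty stage in O(log N) probes instead of N.

```java
import java.util.function.IntPredicate;

public class FaultBisect {
    // Binary-search for the first stage whose output is already bad.
    // okUpTo.test(i) == true means the data is still correct after stage i
    // (a "breakpoint" check between stage i and stage i+1).
    static int firstBadStage(int stageCount, IntPredicate okUpTo) {
        int lo = 0, hi = stageCount - 1, firstBad = -1;
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;
            if (okUpTo.test(mid)) {
                lo = mid + 1;     // fault is further downstream
            } else {
                firstBad = mid;   // fault is at mid or earlier
                hi = mid - 1;
            }
        }
        return firstBad;          // -1 if every stage checks out
    }

    public static void main(String[] args) {
        // Hypothetical 8-stage pipeline where stage 5 first corrupts the data:
        int bad = firstBadStage(8, i -> i < 5);
        System.out.println("first bad stage = " + bad); // prints: first bad stage = 5
    }
}
```

Each probe here stands in for one round of the substitution or breakpoint method above, which is exactly why halving the range beats checking stages one by one.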

Let's further illustrate the above logic with a software application bug scenario, as shown in the figure below.

Why is it hard to analyze and localize a bug?

The problem may be introduced by an error in our input, by the running state of the hardware and software environment we face, or by our program's actual processing.

And even once you narrow it down to program processing, the fault could be caused by multiple points: the logic layer, the data-access layer, or the database.

2. Make good use of search engines

Understand that almost any technical problem you encounter has been encountered by someone before you, who stepped into the same pit and summarized and shared it on the Internet. So making good use of the Internet and of search engines, searching on the problem's keywords, remains a key way of solving technical problems.

Even when search engines do not help us reach the final fix, the search process itself often teaches us a great deal about the technical problem.

A key point in searching is the choice of keywords. You will rarely choose accurately in one go; you have to try and iterate until you can describe the problem precisely, and the answers that come back during the search often help you refine the keywords further.

For example, for system operation failures or problems, the description of keywords should include:

  • extract keyword information from the error logs of the database, the middleware, and the business system;

  • add keyword information about the environment, background, and scenario in which the problem occurs, to narrow the search;

  • mine more meaningful keyword information describing similar issues from the pages you find.

At the same time, for technical problems, products with an official knowledge base deserve to be searched first. For Oracle-related technical problems, for example, we search the official Oracle Support site first, alongside sites like Stack Overflow; these sites tend to have the most complete problem-solving write-ups.

For technical articles in general, foreign sites are relatively more comprehensive, and Baidu is comparatively weak here: much foreign content cannot even be found through it. In that case, try Google or Bing.

3. Technical problem solving and review

In the early stage of an Oracle SOA project, we hit a problem: after a service was encapsulated and registered into the OSB, the message content that the client consumed when calling the service was truncated.

Because the problem occurred with low probability, and the consuming system itself has a retry mechanism, it did not immediately affect the operation and use of the OSB services. And although to this day the cause has not been pinned down among client or server configuration, load balancing, the network, the message itself, or a bug in the OSB suite itself, the whole troubleshooting and analysis process was still meaningful.

In the course of the troubleshooting and analysis, we came to understand the meaning of the various timeouts, some key OSB configuration, message parsing, long versus short connections for HTTP POST, and parts of the Tomcat configuration; the analysis also exposed some weaknesses in our own process of analyzing technical problems, for reference in later work.

4. Determining the problem boundary is always the most important step


The client sends packets and the server receives them; the current phenomenon is that the packets in the client's log are complete, while the packets in the log on the OSB are incomplete.

So is it the client, the server, or something in the network transmission? Determining the boundary of this issue is quite important.

In fact, over those days of analysis and troubleshooting, the boundary was never finally confirmed, so we could never be sure where the problem lay, and in the end it was never clearly resolved.

For the incomplete message described above, there are two conventional ways to determine the boundary:

  • one is to modify the program code to produce more detailed logging;

  • the other is to add trace monitoring.

For example, run an HTTP or TCP trace on the client side and another on the server side; the trace information from both sides can finally determine the boundary of the problem.
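One cheap way to get that boundary information, when a full tcpdump or Wireshark trace is not possible, is to interpose a byte-counting relay between the two sides and compare the counters with what each side claims to have sent. This is only an illustrative sketch with invented names; the core `pump` loop is demonstrated with in-memory streams, and in a real check each stream would come from a Socket.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

public class CountingRelay {
    // Copy everything from `in` to `out`, returning how many bytes passed
    // through. Placed between two parties (client -> relay -> server), with
    // one pump per direction, the two counters tell you on which leg
    // bytes went missing.
    static long pump(InputStream in, OutputStream out) throws IOException {
        byte[] buf = new byte[8192];
        long total = 0;
        for (int n; (n = in.read(buf)) != -1; ) {
            out.write(buf, 0, n);
            total += n;
        }
        out.flush();
        return total;
    }

    public static void main(String[] args) throws IOException {
        // Demonstrated with in-memory streams; in a real relay the streams
        // would come from two Sockets and each pump would run on its own thread.
        byte[] msg = "hello-osb".getBytes();
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        long seen = pump(new ByteArrayInputStream(msg), sink);
        System.out.println("bytes relayed: " + seen); // prints: bytes relayed: 9
    }
}
```

If the relay's counter matches the client's byte count but not the server's log, the truncation is on the server leg, and vice versa.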

However, it is difficult to do this in a production environment, for two reasons:

  • the interface service is called with high concurrency, producing a huge volume of trace logs, and this interface is far from the only one being called;

  • too many parties' resources need to be coordinated, making a joint investigation and trace difficult.

5. It is important to be able to reproduce the problem

Reproducing a fault is the basis for analyzing and localizing it; problems that occur randomly and only occasionally are often the hardest to solve. Once you face a problem you need to localize it, and for that you need to be able to reproduce it, so that you can keep debugging or tracing.

In the course of solving this problem, the exception appeared only by chance, intermittently and irregularly, which caused us a great deal of trouble.

During the troubleshooting I exported and analyzed the error logs together with the normal instances before and after them, and examined which server nodes, callers, and time windows were affected, but found no obvious pattern.

At the same time the problem was highly random: often the first call failed while the very same message succeeded on the second or third attempt, and the truncation length of the message differed each time. That made it hard to pin down the specific scenarios in which calls failed.

Because the problem could not be reproduced under specific input conditions, it was difficult to analyze and localize it further, difficult to carry out targeted tracing and boundary determination, and difficult to analyze it in the test environment or to verify modified parameters and conditions there.


All of the above made the problem hard to localize and analyze quickly; all we could do was search broadly on scenario plus exception keywords, collect the possible solutions, and try them one by one to see whether any worked.

But this approach has a huge problem: because the issue does not reproduce in the test environment, any candidate fix can only be verified in production, and the rules absolutely forbid modifying configuration and adjusting parameters in production at will.

This is also why many large IT projects reserve roughly three months of trial operation after go-live: during trial operation, daily maintenance and configuration changes in the production environment are not yet strictly controlled, which makes it easier to analyze and resolve problems promptly.

6. Exactly matching abnormal scenarios are hard to find online

Because the project used the Oracle SOA Suite 12c products, which are not yet widely deployed in China, Baidu turned up nothing useful, and even Google and Bing did not yield much information.

Therefore, in troubleshooting this problem we basically went through every relevant knowledge point on the Oracle Support site, and tried a variety of keywords in search engines, including:

  • Weblogic Tomcat Post Timeout KeepAliveOSB-;

  • Persistent connection timeout OSB-382030;

  • Failed to parse XML text, etc.

However, no identical scenario was found.

For the closest scenario, "Failed to parse XML document", we made the suggested adjustments, setting KeepAlive to False and the Post Timeout to 120 seconds, but the problem remained: any request whose POST had not completed when the 120-second timeout expired still timed out.

Being unable to find a truly matching scenario also made it hard to test and verify along the lines suggested online. And on this issue the Oracle consultants could offer nothing more useful than running a TCP trace.

7. Missing key technical fundamentals leads to unreasonable hypotheses

In ordinary problem solving, search engines usually surface an exactly similar scenario, and we only have to follow the troubleshooting steps it suggests.

That makes resolution very efficient: we do not need to master the specific underlying principles, only to choose the right keywords, find the best-matching content, and work through it.

But the special thing this time was that the search engines simply could not produce a closely similar article. That meant we had to formulate various reasonable hypotheses about the problem ourselves and test them one by one.

So how do you come up with reasonable hypotheses?

It involves the underlying TCP protocol; the meaning and principle of each timeout value; the parameter configuration of the Tomcat server; the parsing process of the OSB proxy service; the key WebLogic parameters and their meaning; the load-balancing strategy; and even Docker containers and IP mapping.

For example, during troubleshooting I considered the hypothesis that Tomcat's MaxPostSize needed adjusting. But the exception occurs when Tomcat sends a POST request to the WebLogic Server, and for that request Tomcat's MaxPostSize has no effect at all; only the POST size limit on WebLogic matters. The hypothesis was inherently unreasonable, and to judge such hypotheses quickly you must have this key basic technical knowledge and background accumulated in advance.

The same goes for Keep-Alive connections: would the Keep-Alive timeout setting affect the failing service calls? Because we had no deep understanding of the exact semantics of Keep-Alive long connections and of the various kinds of timeout, it was hard to judge whether they had any impact, and we could only try to rule the possibilities out one by one. All of this made it difficult to localize the root cause quickly.

8. Problems are hard to solve when they require coordinating external stakeholders

This was another key obstacle in solving the service interface problem. Interface service problems typically involve multiple factors and vendors: the consuming business system, the providing system, the OSB service bus, the network, and the load-balancing devices.

Investigating such a problem often requires coordinating many parties' resources to cooperate at an agreed time, which directly makes the investigation difficult; it can hardly be completed on one individual's strength.

In the implementation of large projects like this, interface problems usually get real attention only after they have had a serious impact; only then do all parties commit their own resources to a joint troubleshooting team, and only then does the problem finally get analyzed and solved.

Although this particular problem has still not been finally solved, the whole analysis process was meaningful, and it is summarized in this article.

Second, a problem review: too many open file handles

As mentioned earlier, once a problem is solved it should be reviewed promptly. A review is not a simple summary of the fix; it is a walk back through the entire analysis and thought process: which pits were stepped in, which detours were taken, and what those lessons mean for solving later problems.

Problem description: the server responds slowly and service calls time out. The error logs contain both a "too many open files" IOException and a "socket receive timed out" connection-timeout error.


1. Check the health of the application server

After taking on the problem, since the symptoms were slow service responses and call timeouts, the first thing to check was the health of the application server itself, so we used jstat and similar tools to check the server's CPU and memory usage. The server itself checked out completely healthy.
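From inside a JVM, roughly the same health numbers that jstat and jstack report can be read through the standard management beans. A minimal sketch (the class name is invented):

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryUsage;

public class JvmHealth {
    // A quick in-process health snapshot: heap usage and live thread count,
    // roughly what jstat/jstack would show from outside the process.
    static String snapshot() {
        MemoryUsage heap = ManagementFactory.getMemoryMXBean().getHeapMemoryUsage();
        int threads = ManagementFactory.getThreadMXBean().getThreadCount();
        return String.format("heap=%dMB/%dMB threads=%d",
                heap.getUsed() >> 20, heap.getCommitted() >> 20, threads);
    }

    public static void main(String[] args) {
        System.out.println(snapshot());
    }
}
```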

2. Check the database connection pool and thread pool

Next we checked the database connection pool and thread pool: although there was some queuing, the connection pool itself still had plenty of headroom and was nowhere near exhausted.
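For a plain `ThreadPoolExecutor` that kind of check looks like the sketch below. The real pools in this story are managed by WebLogic and OSB, so this is only an analogy with invented names:

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class PoolCheck {
    // Snapshot the numbers to look at when asking "is the pool exhausted
    // or merely queueing?": active vs. max threads, queue depth, throughput.
    static String snapshot(ThreadPoolExecutor pool) {
        return String.format("active=%d/%d queued=%d completed=%d",
                pool.getActiveCount(), pool.getMaximumPoolSize(),
                pool.getQueue().size(), pool.getCompletedTaskCount());
    }

    public static void main(String[] args) throws InterruptedException {
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
                2, 4, 60, TimeUnit.SECONDS, new ArrayBlockingQueue<>(10));
        for (int i = 0; i < 6; i++) {
            pool.submit(() -> {
                try { Thread.sleep(200); } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
        }
        // e.g. active=2/4 queued=4 completed=0 (exact numbers are timing-dependent)
        System.out.println(snapshot(pool));
        pool.shutdown();
        pool.awaitTermination(5, TimeUnit.SECONDS);
    }
}
```

Queuing with headroom, as observed here, reads as "busy but not saturated"; a queue at capacity with active == max would point at pool exhaustion instead.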

3. Error log check

After that check, we returned to the error logs, because there were currently two errors:

Problem A, too many open files, and problem B, service connection timeout. The key question was whether A caused B, B caused A, or A and B were two unrelated problems that happened to appear together; at this point there was no certain conclusion.

That meant chasing the root cause down both problem paths, then summarizing and converging the findings.

Bear in mind that for problem B, the connection timeout, the knowledge base on Oracle's official support site lists six or seven scenarios to rule out, so the troubleshooting was quite difficult.

Also, this was a new problem on an old server, not a problem on a freshly built one, so we had to consider whether it was related to newly deployed services and applications.

4. Review recent code changes

Now back to the too-many-open-files problem. The basic checks told us that file handles were being opened excessively, so the next step was to review the newly added modifications and changes for places where a file handle is opened but never closed.
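On Linux, the growth in open handles can be watched even without lsof by counting the entries under /proc/self/fd (this does not apply to the HP-UX box in this story; the sketch and its names are purely illustrative of the leak pattern):

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.stream.Stream;

public class FdLeakDemo {
    // Count this process's open file descriptors by listing /proc/self/fd.
    // Linux-specific: a cheap stand-in for lsof when it is not installed.
    static long openFdCount() throws IOException {
        try (Stream<Path> fds = Files.list(Paths.get("/proc/self/fd"))) {
            return fds.count();
        }
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("leak-demo", ".txt");
        long before = openFdCount();
        // The leak pattern from this article: handles opened and never closed.
        InputStream[] leaked = new InputStream[20];
        for (int i = 0; i < 20; i++) leaked[i] = Files.newInputStream(tmp);
        System.out.println("descriptors grew by " + (openFdCount() - before));
        for (InputStream in : leaked) in.close(); // the fix: always close
        Files.delete(tmp);
    }
}
```

Sampling this counter before and after a suspect code path is a quick way to confirm that the path, and not something else, is what leaks.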

5. The exception "too many open files": locating which files are involved

The code review found no such case. The natural next step was to determine exactly which file handles were being opened and not closed.

The way to check that is lsof log data, but we found that on our HP-UX minicomputer the command was not even available. With no alternative, we first simply raised the maximum open-file limit, but the problem persisted.

Note that at this point we stopped here: instead of thinking further about how to fix this problem, we went off to analyze the service timeout problem.

Note: when analyzing and diagnosing problems, stick to the standard solution path and do not give up easily because of an obstacle; you will find that you eventually return to this critical path.

On the service timeout problem we took another detour: we directly worked through and eliminated each problem scenario in the Metalink troubleshooting guide, making many changes to middleware parameters and settings, yet in the end the problem was still not solved.

The detour happened mainly because we had not properly analyzed the specific causes of this server's service timeouts, and had no analysis tied to the actual scenario.

6. Further analyze the scenario and boundary of the problem

So back to analyzing the current scenario: which services had we newly deployed on this server, and should they be ruled out one by one? Did all services time out, or only individual ones? Exactly where did the timeout occur? The question of where the boundary lies had to be clarified before there could be any follow-up.

1) Boundary confirmation: not all service calls time out

The first finding was that not all services had timeout problems; the ones that timed out were mainly of certain types. So we examined the timed-out services and their timeout logs, including the specific firewall settings, services running long transactions, and so on.

2) Boundary confirmation: not only new services time out, old ones do too

The timed-out services were not all newly added; they included old services that had long been running, which was indeed a very strange phenomenon. After adjusting the service timeout parameters we kept observing the server's health and error logs, and the too-many-open-files errors kept appearing.

At that point it was clear we had to go back to the too-many-open-files exception log for further analysis.

7. Back to installing the lsof component on HP-UX

Detailed analysis requires detailed logs for localization, and at this point it was clear we first had to install the lsof component. We found the materials, installed lsof on HP-UX, and collected detailed lsof output once the component was in place.

We then captured fresh lsof data roughly every hour, and analysis of it showed that file handles were indeed growing continuously without being released.

At this point two small branch questions emerged:

  • one: the lsof data showed Oracle database connections using the 1524 backdoor port. At first we suspected a problem with the use of this port, but later rejected that hypothesis;

  • two: back to the file handle question itself.

The crux of the question was: which file handles keep increasing?

Detailed analysis of the lsof log later showed sixty-odd file handles constantly increasing and being opened; the same handles were opened repeatedly, with identical file inode values each time. The key question became: how do you find out which files they are through the file inode?

Because the actual log contained only the path of the file and not its name, we were stuck at this point for a while.

With no other way forward, we studied the specific meaning of each field in the lsof log in detail and looked for a way to find the specific file through its inode.

The first thought: every file in the file system itself has an inode, and you can list it with the ls command. So we could export the inode information of all files via ls, then match it against the file handles in lsof to find the specific files.

Following this idea, we compared all the file attribute information and finally found which files the handles kept opening.
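The inode-matching step can also be automated. On Unix-like platforms the JDK exposes a file's inode through the "unix:ino" attribute, so a small program can build an inode-to-path index and resolve the inodes taken from lsof lines (a sketch with invented names, assuming a Unix platform):

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Map;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class InodeIndex {
    // Map inode -> path for every regular file under root, so an inode
    // copied from an lsof line can be resolved to a concrete file.
    // ("unix:ino" is available on Unix-like platforms only.)
    static Map<Long, Path> byInode(Path root) throws IOException {
        try (Stream<Path> walk = Files.walk(root)) {
            return walk.filter(Files::isRegularFile)
                       .collect(Collectors.toMap(
                               InodeIndex::inodeOf, p -> p, (a, b) -> a));
        }
    }

    static long inodeOf(Path p) {
        try {
            return (Long) Files.getAttribute(p, "unix:ino");
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("inode-demo");
        Path f = Files.createFile(dir.resolve("suspect.xml"));
        long ino = inodeOf(f);
        System.out.println(ino + " -> " + byInode(dir).get(ino));
    }
}
```

This is the programmatic equivalent of the `ls -i` export-and-compare done by hand above.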

Once you see which files keep being opened, the root cause is basically found: the files were related to our underlying service component. The next step was to look at how these files were opened, and to release the resources promptly after use.

8. Final localization in source code: file handling in the saxReader class

Source-code analysis showed the specific problem: the files were being opened continuously, but the file handles were never closed explicitly.

As to why this had not generally been a problem in our production environment, that relates to how the saxReader class handles files: saxReader does close the file, but exactly when is not clear.

Whether the handles get reclaimed during a full GC is only a hypothesis, and we did not verify it further for the time being; but the finding that files were being opened continuously in large numbers definitely called for a code change.
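The fix pattern, closing the handle yourself instead of relying on the parser, can be sketched with the JDK's own SAX API. The project used dom4j's saxReader; the same idea applies there by handing it an InputStream that you close yourself rather than a File. Names below are invented:

```java
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

public class SafeXmlRead {
    // Parse from a stream we open and close ourselves (try-with-resources),
    // instead of handing the parser a File and hoping it closes the handle.
    static String rootElement(Path xml) throws Exception {
        String[] root = new String[1];
        try (InputStream in = Files.newInputStream(xml)) {
            SAXParserFactory.newInstance().newSAXParser().parse(in,
                new DefaultHandler() {
                    @Override
                    public void startElement(String uri, String local,
                            String qName, Attributes atts) {
                        if (root[0] == null) root[0] = qName;
                    }
                });
        } // the descriptor is released here, whatever the parser did
        return root[0];
    }

    public static void main(String[] args) throws Exception {
        Path tmp = Files.createTempFile("msg", ".xml");
        Files.write(tmp, "<order><id>42</id></order>".getBytes(StandardCharsets.UTF_8));
        System.out.println(rootElement(tmp)); // prints: order
        Files.delete(tmp);
    }
}
```

With the stream's lifetime owned by the caller, handle release no longer depends on the parser's (or the garbage collector's) timing.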

After modifying the code and redeploying, we observed no further "too many open files" IO exceptions. However, some service timeout exceptions continued to appear.

9. Distinguish a problem's causes from its effects

Once service timeout exceptions no longer triggered too-many-open-files exceptions, it suddenly became clear that problems A and B reflected two independent issues in our application. The two may have influenced each other, but each problem was independent, with its own root cause.

When we analyzed the timeout problem further using the detailed service-call logs, we found that most services were normal and only individual services had call timeouts, so we went looking for causes specific to those services.

Since only individual services were affected, we could reasonably suspect the service provider's system, so we localized and investigated the provider's service capability. We finally found the cause: an operation on their side had deadlocked their database, leaving the service waiting on a lock. Starting from this hypothesis, we subsequently confirmed that this was indeed the reason.

At this step essentially all the root causes of the problem were confirmed. Through this round of localization, analysis, and resolution, our approach to diagnosing service and application performance problems improved further, from CPU and memory to IO, from service exception logs to detailed call logs, forming a basically complete diagnostic method.

Third, a problem review: service call timeout

Recently, while tracking an OSB service running timeout, I came across a very strange scene: calls to the business system returned a timeout after 1500 seconds. When OSB itself wraps a service, we configure two timeouts, as follows:


  • Socket Read Timeout: set to 600s;

  • Connection Timeout: set to 30s.

In other words, nothing in the OSB configuration is set to a timeout of 1500 seconds.
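The two OSB settings correspond to the two knobs a plain JDK HTTP client exposes; a sketch by analogy only (HttpURLConnection, not OSB's actual transport configuration; the URL is a placeholder):

```java
import java.net.HttpURLConnection;
import java.net.URL;

public class TimeoutConfig {
    // Connection timeout: how long to wait for the TCP handshake.
    // Socket read timeout: how long to wait for bytes of the response.
    static HttpURLConnection configure(URL url) throws Exception {
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setConnectTimeout(30_000);  // 30s, the Connection Timeout above
        conn.setReadTimeout(600_000);    // 600s, the Socket Read Timeout above
        return conn;
    }

    public static void main(String[] args) throws Exception {
        // openConnection() does not touch the network, so this runs offline.
        HttpURLConnection c = configure(new URL("http://example.invalid/svc"));
        System.out.println(c.getConnectTimeout() + " " + c.getReadTimeout()); // prints: 30000 600000
    }
}
```

Keeping the two timeouts distinct matters for the analysis that follows: a 1500s failure fits neither knob on its own.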

Later I asked the business system team, and the answer was that the business system has a timeout setting of 5 minutes, that is, 300s. But even so, the error should then come back after 300s, not 1500s.

At first we kept testing the hypothesis that a 300s timeout retried 5 times produced the 1500s we saw, so we went over every OSB configuration parameter again; the result was that we found no timeout configured at 5 minutes and no retry count of 5 anywhere.

When configuring OSB business services it is indeed possible to configure retries, but our current settings were:

  • maximum number of retries: 0;

  • support application retries: true.

But since the maximum number of retries is 0, no retry should occur even with that checkbox set to true.

And indeed, the returns from calling the other business systems' interface services showed no retries happening. Moreover, in joint tests with the business system, deselecting the checkbox still produced the 1500-second error, so we tentatively concluded this parameter had little to do with it.

Checking the logs in detail, the overall timeline is:

• 2018-10-24 11:25:38 — the call starts

• Oct 24, 2018 11:35:49,172 AM GMT+08:00 — the 600s hang is reported

• Oct 24, 2018 11:50:46,140 AM GMT+08:00 — Connection Reset is reported

At the 600s mark, the configured Read Timeout, the following exception message is reported:

WatchRule: (log.severityString == 'Error') and ((log.messageId == 'WL-000337') or (log.messageId == 'BEA-000337'))

WatchData: MESSAGE = [STUCK] ExecuteThread: '7' for queue: 'weblogic.kernel.Default (self-tuning)' has been busy for "610" seconds working on the request "Workmanager: SBDefaultResponseWorkManager, Version: 0, Scheduled=false, Started=true, Started time: 610222 ms", which is more than the configured time (StuckThreadMaxTime) of "600" seconds in "server-failure-trigger". Stack trace: Method)

This message corresponds to the 600s timeout, i.e. the Socket Read Timeout. After it is reported, the same stuck-thread warning appears again 60 seconds later:


<[STUCK] ExecuteThread: '7' for queue: 'weblogic.kernel.Default (self-tuning)' has been busy for "670" seconds working on the request "Workmanager: SBDefaultResponseWorkManager, Version: 0, Scheduled=false, Started=true, Started time: 670227 ms", which is more than the configured time (StuckThreadMaxTime) of "600" seconds in "server-failure-trigger". Stack trace:

After that, no further records of thread 7 can be found in the log, and after an interval of 900 seconds the Connection Reset error occurs:

Connection reset
at org.glassfish.jersey.client.internal.HttpUrlConnector$

That is, the preliminary analysis is that the service call itself times out on the business system side within 5 minutes, but the business system neither handles this nor closes the connection, so the broken call is not perceived on the OSB side.

The OSB side therefore only times out at the 600s mark, and this timeout does not reflect the problem detected on the business system (or whatever else caused the thread to be stuck and the connection suspended). After waiting another 900 seconds, the Connection Reset occurred.

Based on this analysis, we searched for 900s-related settings and found a 900s shrink frequency setting in the WebLogic DataSource connection pool: the number of seconds to wait before shrinking a connection pool that has grown to meet demand. If set to 0, shrinking is disabled. Our value is set to 900s.

Searching further turned up the following:

A large number of "Connection for pool 'SCDS' closed" messages can be observed in the WebLogic Server log, indicating that the system closes a batch of connections at a certain moment. This is generally done when physical connections are broken (WebLogic's configured pool shrinkage does the same; if not configured, the default is a 900s check, and your configuration file shows pool shrinkage is not configured). From the thread name, it appears that the application's own thread closed the connections.

That is, the connection suspended at 600s is only formally closed and reclaimed when the WebLogic connection pool's shrink check runs about 900 seconds later, which produces the Connection Reset error.

This hypothesis was not tested further at this point, but from the overall process and log analysis it seemed plausible.

In this analysis, our biggest error was inferring from the 300s and 1500s values that 5 retries were the cause, then spending a long time looking for why retries would happen and checking and verifying the retry configuration.

Simply put, the initial hypothesis was wrong, and too many detours were taken in verifying it. So we had to come back to the problem itself.

The analysis above concluded that a thread blocked and suspended at 600 seconds waits another 900 seconds for the connection to be reclaimed, which from the timeline adds up to the 1500-second timeout.

To confirm this hypothesis, we changed the Read Timeout to 400 seconds; the service timeout error should then appear at 1300 seconds, but the test result was still a 1500-second timeout.

    Therefore, the previous assumption does not hold.

For this timeout, there is no 5-minute timeout setting on the OSB cluster side, but checking the F5 load balancer's timeout documentation shows that F5 devices have an idle timeout setting, which defaults to 300 seconds.
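The two hypotheses can be separated with simple arithmetic. The pool-shrink hypothesis predicts that the observed timeout tracks the Read Timeout (read + 900s), while an F5 idle-timeout explanation predicts a value independent of it. A sketch of the check (the factor of 5 is just the observed 1500/300 ratio, my assumption, not documented F5 behavior):

```python
OBSERVED_TIMEOUT_S = 1500
POOL_SHRINK_S = 900      # WebLogic DataSource shrink frequency
F5_IDLE_TIMEOUT_S = 300  # F5 default idle timeout per its documentation

def shrink_hypothesis(read_timeout_s: int) -> int:
    # Thread stuck at the read timeout, connection reclaimed
    # by the pool shrink check 900s later.
    return read_timeout_s + POOL_SHRINK_S

# With the original 600s read timeout the prediction matches the observation...
assert shrink_hypothesis(600) == OBSERVED_TIMEOUT_S
# ...but lowering the read timeout to 400s should then have given 1300s,
# while the test still showed 1500s: the hypothesis is falsified.
assert shrink_hypothesis(400) != OBSERVED_TIMEOUT_S

# 1500s is, however, an exact multiple of the F5 300s idle timeout,
# consistent with repeated 300s expiries independent of the read timeout.
assert OBSERVED_TIMEOUT_S == 5 * F5_IDLE_TIMEOUT_S
```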

When diagnosis fails to produce a clear, reasonable hypothesis, it is necessary to return to the process and links through which the problem arises, and use divide and conquer to determine the specific problem points and boundaries.

Therefore, to solve this problem, the first thing is to determine whether it is related to the load-balancing devices. The current service call passes through both the ESB service cluster and the business system's service cluster, as follows:

That is, a service request follows the sequence 1->2->3->4, passing through two load-balancing devices (1 and 3). The overall call timeout can therefore be related to the configuration of any of the four nodes 1, 2, 3 and 4.

For further verification, we tried to isolate the problem by calling directly along the following path:

In this mode, we tested by calling through the management system and through SOAPUI respectively. The overall call succeeds and a successful instance is returned.

A retry phenomenon appears when calling through the management system, but no retry is observed when calling through SOAPUI.

Secondly, the client call still hits a 5-minute timeout and returns a Connection Reset error. But at this point the service itself is still running: the 2->4 leg continues and completes successfully, so a successful service-run instance can be seen.

• Bypass the clusters on both the ESB side and the business system side, and call the interface service directly along 2->4

In this mode, calling the interface service through SOAPUI succeeds: a successful instance is returned, and the client also gets a successful response.

That is, there is a successful instance and the client also receives success, which is exactly the outcome we want.

Back in the initial call mode, still invoking through SOAPUI, we find the call is retried and the service run ultimately fails, reporting a 1500-second timeout error.

A connection timeout error is also reported on the client, consistent with the phenomenon we saw initially. However, it is not yet clear why a retry was initiated after 5 minutes, or whether the load balancer initiated it.

On the load balancer we see a tcp_tw_recycle parameter configured, but for now we are not sure whether the automatically triggered retry is related to it; online articles recommend against enabling this parameter.

After analysis, this timeout problem is essentially caused by the load balancer's timeout setting. The solution is therefore simple: adjust the load-balancing timeout for both clusters, ensuring that it is greater than the Read Timeout in the OSB service configuration.
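The fix can be stated as a single invariant to check across both clusters. This is a hypothetical helper for illustration, not an OSB or F5 API:

```python
def timeout_config_ok(lb_idle_timeout_s: int, osb_read_timeout_s: int) -> bool:
    # The load balancer must not give up on an idle connection before
    # the OSB read timeout has had a chance to fire.
    return lb_idle_timeout_s > osb_read_timeout_s

# The original configuration violates the invariant: F5 idles out at
# 300s while OSB is still willing to wait 600s for a response.
assert not timeout_config_ok(300, 600)
# After the fix, e.g. a 700s LB idle timeout against the 600s read timeout
# (700s is an illustrative value, not from the original):
assert timeout_config_ok(700, 600)
```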

    In the end, the problem was solved.

    4. Analysis of JVM memory overflow problem

There are many writeups of Java JVM memory-overflow problems online. It is in fact a very common problem, and a standard methodology for solving and diagnosing it has formed.

The best approach is to follow those steps rather than formulating hypotheses from your own experience, because in that case your hypotheses are likely guesswork and will waste a lot of time.

Even if you lack experience with the problem, follow the general methodological steps to solve it.

    Going back to the memory overflow problem, the general steps for this problem are as follows:

> Here, we basically return to the general problem-solving method.

Because it is a production issue, and because it involves a commercial product, I cannot reproduce it in a test environment or do static code checks. Therefore, analysis of the Java GC memory-reclamation logs is still required first.
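GC log analysis usually starts by extracting heap-before, heap-after, and pause time from each collection line. A minimal parser for the classic `-XX:+PrintGCDetails`-style line format (the sample line is fabricated for illustration; real formats vary by JVM version and collector):

```python
import re

# Matches lines like: [GC (Allocation Failure)  512K->128K(2048K), 0.0012345 secs]
GC_LINE = re.compile(
    r"\[(?P<kind>GC|Full GC)[^\]]*?"
    r"(?P<before>\d+)K->(?P<after>\d+)K\((?P<total>\d+)K\), "
    r"(?P<secs>[\d.]+) secs\]"
)

def parse_gc_line(line: str):
    """Extract heap sizes (KB) and pause time from one GC log line, or None."""
    m = GC_LINE.search(line)
    if not m:
        return None
    return {
        "kind": m.group("kind"),
        "before_kb": int(m.group("before")),
        "after_kb": int(m.group("after")),
        "total_kb": int(m.group("total")),
        "pause_s": float(m.group("secs")),
    }

sample = "[GC (Allocation Failure)  512K->128K(2048K), 0.0012345 secs]"
rec = parse_gc_line(sample)
# If after_kb stays close to before_kb across many Full GC lines,
# little memory is being reclaimed: the classic OutOfMemory precursor.
```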

For a detailed analysis of the JVM memory overflow problem, please refer to: "From appearance to root cause – the whole process of JVM memory overflow problem analysis and solution of a software system".

    5. Analysis and diagnosis of business system performance problems

If a business system has no performance problems before going live but serious performance problems after, the potential causes mainly come from the following aspects:

• Large concurrent access to services leads to performance bottlenecks.

• After go-live, data accumulates in the system database over time, and performance bottlenecks appear as the data volume grows.

• Other key environment changes, such as the network-bandwidth impact we often talk about.

For this reason, when we find a performance problem, we first need to determine whether it occurs even for a single user in the non-concurrent state, or only under concurrency.

Single-user performance problems are often easier to test and verify; for concurrency problems, we can run pressure tests in the test environment to judge performance under load.
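The single-user vs concurrent distinction can be rehearsed even without the real system: time an operation once in isolation, then under N concurrent callers, and compare the latency distributions. A toy harness using only the standard library (the workload function is a stand-in for a real service call, not from the original system):

```python
import time
from concurrent.futures import ThreadPoolExecutor

def workload() -> float:
    """Stand-in for one service call; returns its latency in seconds."""
    start = time.perf_counter()
    time.sleep(0.01)  # simulate 10ms of service work
    return time.perf_counter() - start

def measure(concurrency: int, calls: int) -> list[float]:
    """Run `calls` invocations with `concurrency` workers; collect latencies."""
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        return list(pool.map(lambda _: workload(), range(calls)))

single = measure(concurrency=1, calls=5)
concurrent = measure(concurrency=5, calls=5)
# If per-call latency degrades sharply only in the concurrent run, the
# bottleneck is contention (locks, pools, the database), not the code path.
```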

For detailed business system performance problem analysis and diagnosis, please refer to: Business System Performance Problem Diagnosis and Optimization Analysis.

    The above are some thoughts and summaries on the analysis and solution of technical problems for your reference.

Reply with keywords such as Face, ClickHouse, ES, Flink, Spring, Java, Kafka, or Monitoring on the public account (zhisheng) to view more articles on each topic.

    Like + watch, less bugs 👇

    Buy Me A Coffee