This article describes how to use the ELK Stack to build a log monitoring system that handles terabytes of logs per day. In an enterprise microservices environment, running hundreds or even thousands of services counts as relatively small. In production, logs play a very important role: they are used for exception troubleshooting, performance optimization, and business analysis.
However, with hundreds of services running in production, each one simply storing its logs locally, it is hard to find the node that holds the logs you need when troubleshooting a problem, and it is equally hard to mine the business value hidden in those logs.
Collecting all logs into one place for centralized management, then processing them and turning the results into data that operations and development can use, is a feasible way to manage logs and assist operations, and it is an urgent need for the enterprise.
Based on these requirements, we built the log monitoring system shown above:
Unified log collection, filtering, and cleaning.
Visualization, monitoring, alerting, and log search.
The functional flow is shown in the figure above:
Instrument each service node and collect the relevant logs in real time.
A unified log collection service filters and cleans the logs, then provides the visual interface and alerting functions.
① On the log-file collection side we use FileBeat. Operations staff configure it through our back-office management interface; each machine runs one FileBeat, and the mapping between a FileBeat's logs and Kafka topics can be one-to-one or many-to-one, with different policies configured according to daily log volume.
In addition to business service logs, we also collect MySQL slow-query logs and error logs, as well as logs from other third-party services such as Nginx.
Finally, integrated with our automated release platform, every FileBeat process is published and started automatically.
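The one-to-one versus many-to-one topic policy described above can be sketched as a simple rule keyed on daily log volume. This is an illustrative assumption, not the actual production configuration: the 50 GB/day threshold and topic names are made up for the example.

```java
// Hypothetical sketch of the topic-assignment policy: a service whose daily
// log volume exceeds a threshold gets its own Kafka topic (one-to-one);
// smaller services share a common topic (many-to-one).
// Threshold and topic names are assumptions for illustration.
public class TopicPolicy {
    static final long DEDICATED_TOPIC_THRESHOLD_GB = 50;

    static String topicFor(String serviceName, long dailyLogVolumeGb) {
        if (dailyLogVolumeGb >= DEDICATED_TOPIC_THRESHOLD_GB) {
            return "logs-" + serviceName;   // dedicated topic: one-to-one
        }
        return "logs-shared";               // shared topic: many-to-one
    }
}
```

In practice such a rule would live in the back-office management interface and be pushed to each FileBeat's output configuration.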
② For call stacks, traces, and process monitoring metrics we use an agent-based approach: Elastic APM, so the business side does not need to change any program code.
For a business system already in production, having to change code in order to add monitoring is undesirable and unacceptable.
Elastic APM can collect an HTTP interface's call trace, the internal method call stack, the SQL executed, and process metrics such as CPU and memory usage.
Some may wonder: since Elastic APM can collect most other logs, why still use FileBeat?
First, the information Elastic APM collects does help us locate more than 80% of problems, but it does not support every language (C, for example).
Second, it cannot collect the non-error logs and so-called key logs you want. For example, when an error occurs during an interface call, you may want to see the logs immediately before and after the error; there are also logs printed specifically to help analyze business operations.
Third, custom business exceptions. These are not system exceptions; they belong to the business domain, yet APM reports them as system errors.
If you later alert on system errors, these business exceptions will reduce alerting accuracy, and you cannot simply filter them out, because there are many kinds of custom business exceptions.
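One common way to keep business exceptions out of system-error alerting is to give them a shared marker base class and exclude it at the alerting layer. A minimal sketch, with class and method names as assumptions:

```java
// Sketch: distinguish custom business exceptions from genuine system errors
// so they do not pollute system-error alerting. Names are illustrative.
public class AlertFilter {
    // Marker base class; all custom business exceptions extend it.
    static class BusinessException extends RuntimeException {
        BusinessException(String message) { super(message); }
    }

    /** Only non-business exceptions should trigger a system alert. */
    static boolean shouldAlert(Throwable t) {
        return !(t instanceof BusinessException);
    }
}
```

With a single marker type, new business exception classes are excluded automatically, avoiding a filter list that grows with every new exception type.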
③ At the same time, we made two extensions to the agent to collect more detailed GC, stack, memory, and thread information.
④ For server metrics collection we use Prometheus.
⑤ Because we run a SaaS platform with many services, many service logs cannot be unified or standardized, which is partly a historical problem. For a system unrelated to the business systems to integrate with them, directly or indirectly, demanding that they change their code to adapt to it is not feasible.
Good design makes itself compatible with others rather than forcing them to change. Many logs are meaningless; for example, to make troubleshooting easier during development, a marker log is printed inside if/else branches just to indicate whether the if block or the else block was taken.
Some services even print debug-level logs. With limited cost and resources, collecting all logs is unrealistic, and even if resources allowed it, the expense over a year would be substantial.
Therefore, we filter, clean, and dynamically adjust log collection priority. First, all logs are collected into a Kafka cluster with a short retention period.
We currently set it to one hour; one hour of data is within what our resources can accept for now.
⑥ Log Streams is our stream-processing service for log filtering and cleaning. Why do we need an ETL filter?
Because our Log Service resources are limited. But wait: the original logs were already scattered across each service's local storage media, consuming resources there.
Now we are merely collecting them, and after collection each service can release some of the resources its logs occupied.
True, this just shifts resources from each service to the Log Service without increasing the total.
But that is only theory. For online services, scaling resources up is easy; scaling them down is not, and in practice it is extremely difficult.
So the log resources used on each service cannot be reallocated to the Log Service in the short term. In that case, the Log Service must hold roughly the amount of resources currently consumed by all services' logs combined.
The longer logs are stored, the more resources they consume. If, in the short term, the cost of solving a non-essential, non-business problem exceeds the benefit of solving it, no leader or company with a limited budget will adopt that solution.
Therefore, from a cost perspective, we introduce filters in Log Streams to drop worthless log data, reducing the resource cost of the Log Service.
For the technology, we use Kafka Streams for ETL stream processing, with filtering and cleaning rules configured dynamically through an interface.
The general rules are as follows:
Centered on an error's time point, open a window in the stream processing and collect non-error-level logs at a configurable number of points N before and after it, radiating outward; by default only info-level logs are included.
Each service can configure up to 100 key logs; by default, all key logs are collected.
For slow SQL, configure different time-cost filters per business category.
Compute real-time statistics on service SQL according to business needs, for example the query frequency of similar SQL within one hour during peak periods. This gives DBAs a basis for optimizing the database, such as creating indexes for frequently queried SQL.
During peak hours, logs are dynamically cleaned and filtered by weighted metrics: service type, log level, the maximum log volume allowed per service within a time period, and the time period itself.
Dynamically shrink the time window for different time periods.
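The first rule above (keep error logs, and keep info logs only near an error's time point) can be sketched as a plain predicate over timestamped entries. This is a simplified, non-streaming version of logic that would run inside a Kafka Streams `filter()`; the record shape, field names, and a millisecond window are assumptions.

```java
import java.util.List;

// Simplified sketch of the error-window rule: keep every ERROR entry, and
// keep INFO entries only if they fall within `windowMs` of some error
// (radiating before and after the error's time point). In production this
// logic would sit inside a Kafka Streams windowed filter.
public class ErrorWindowFilter {
    record LogEntry(long timestampMs, String level, String message) {}

    static List<LogEntry> filter(List<LogEntry> entries, long windowMs) {
        List<Long> errorTimes = entries.stream()
                .filter(e -> e.level().equals("ERROR"))
                .map(LogEntry::timestampMs)
                .toList();
        return entries.stream()
                .filter(e -> e.level().equals("ERROR")
                        || (e.level().equals("INFO") && errorTimes.stream()
                            .anyMatch(t -> Math.abs(e.timestampMs() - t) <= windowMs)))
                .toList();
    }
}
```

The streaming version keeps per-key state for recent error timestamps instead of scanning a list, but the keep/drop decision is the same.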
Generate indexes according to each service's log-file rules. For example, if a service's logs are split into debug, info, error, and xx_keyword files, the generated indexes are likewise debug, info, error, and xx_keyword, with the date as a suffix. The goal is to let developers find logs under the names they habitually use.
Log index generation rules:
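The naming rule can be expressed as a small helper. The separator and date pattern below are assumptions; the article only specifies that the log type is kept and the date is appended as a suffix.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;

// Sketch of the index-naming rule: <service>-<log type>-<date>, so the
// Elasticsearch index names mirror the log file names (debug/info/error/
// xx_keyword) that developers already use. Separator and date pattern
// are illustrative assumptions.
public class IndexNamer {
    private static final DateTimeFormatter DATE =
            DateTimeFormatter.ofPattern("yyyy.MM.dd");

    static String indexFor(String service, String logType, LocalDate day) {
        return service + "-" + logType + "-" + day.format(DATE);
    }
}
```

A daily date suffix also makes retention simple: expired indexes can be dropped wholesale instead of deleting documents.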
⑦ For the visual interface we mainly use Grafana, which supports many data sources, including Prometheus and Elasticsearch; its integration with Prometheus can be described as seamless. Kibana we mainly use for visual analysis of APM data.
The log visualization looks as follows: