Introduction: The life cycle of log data includes collection, access, transmission, and application. The stability of this data has a crucial impact on the company's report construction, decision analysis, and conversion strategies. This article introduces the current state of Baidu's log middle office and its adoption inside the company, with an in-depth discussion of how data accuracy is built. It covers the stability work done at each stage from data generation to final business application: improving the timeliness of data reporting, persistence at the access layer, and stream computing along the data flow.


1. Brief description

1.1 Positioning of the log middle office

The log middle office is a one-stop service for dot (event-tracking) data. It manages the whole life cycle of dot data and, with only simple development work, provides log collection, transmission, management, and query and analysis. It suits business scenarios such as product operation analysis, R&D performance analysis, and operations and maintenance management, helping app-side and server-side customers explore data, mine its value, and anticipate the future.

1.2 Access status

The log middle office already covers most key products inside the company, including full dot coverage of the Baidu APP, Mini Programs, and matrix apps. The results so far are as follows:

  • Access: covers almost all existing apps, Mini Programs, newly incubated apps, and externally acquired apps
  • Service scale: hundreds of billions of log entries per day, peak QPS in the millions, service stability of 99.9995%

1.3 Explanation of terms

Client: a software system that users can use directly, usually deployed on terminal devices such as users' mobile phones or PCs. Examples are the Baidu APP and Mini Programs.

Server: a service that responds to network requests initiated by clients, usually deployed on cloud servers.

Log middle office: here, specifically the client-side ("end") log middle office, which covers capabilities for the whole life cycle of client logs. It includes core components such as the dot SDK, the dot server, and the log management platform.

Dot SDK: responsible for collecting, packaging, and reporting dot logs. By log source it is divided into an APP-side SDK and an H5-side SDK; by scenario it is divided into a general dot SDK, a performance SDK, and a Mini Program SDK. Users integrate whichever SDKs they need.

Dot server: the server that receives logs. It is the core module on the server side of the log middle office.

Feature/model service: the log middle office forwards the dot data needed for policy model computation to the downstream policy recommendation middle office in real time. The feature/model service is the entry module of the policy recommendation middle office.

1.4 Service panorama

The log service mainly consists of the basic layer, the management platform, business data applications, and product support. Built around these layers, the Baidu client log reporting specification was formulated and released in June 2021.

  • Basic layer: provides the APP-SDK, JS-SDK, performance SDK, and general SDK so that all kinds of dot requirements can be onboarded quickly. Relying on basic big data services, dot data is distributed to the various consuming parties.
  • Platform layer: the management platform manages and maintains dot metadata and controls the whole life cycle. The online layer forwards data in real time and offline, and relies on sensible traffic control and monitoring to keep service stability at 99.995%.
  • Business capabilities: dot log data is output to the data center, performance platform, strategy middle platform, growth center, and others, supporting product decision analysis, client quality monitoring, strategy-driven growth, and other areas.
  • Business support: covers key apps, newly incubated matrix apps, and horizontal general-purpose components.

2. Core goal of the log middle office

As introduced above, the log middle office carries the dot logs of all apps inside Baidu and stands at the very front of data production. Beyond full functional coverage and fast, flexible access, its core challenge is data accuracy. Across the whole path, from data produced on the client, through access and processing in the log middle office, to downstream applications, the log middle office has to take responsibility for every data quality problem. Data accuracy can be broken down into two parts:

  • No duplication: data must never be duplicated in the strict sense. This means preventing duplicates caused by system-level retries and by recovery from architecture failures;
  • No loss: data must never be lost in the strict sense. This means preventing loss caused by system-level failures, code-level bugs, and so on.

Achieving close to 100% no duplication and no loss at the system level raises a number of difficult problems.

2.1 Log middle office architecture

Dot data entering the log middle office travels from the client to the online service and is finally forwarded downstream (in real time or offline). Along the way it passes through the following links:

  • Depending on how the data is used, it falls into the following types:
      • Real time:
          • Quasi-real-time stream (message queue): for downstream data analysis. Features: high (minute-level) timeliness, strict data accuracy required. Typical applications: R&D platform, trace platform.
          • Pure real-time stream (RPC proxy): for downstream policy applications. Features: second-level timeliness, a certain degree of data loss tolerated. Typical application: recommendation architecture.
      • Offline: the offline master table holding the full set of logs. Features: day-level or hour-level timeliness, strict data accuracy required.
      • Other: some degree of timeliness and accuracy required.

2.2 Problems faced

Judging from the architecture above, the log middle office has the following problems:

  • Giant module: the dot server carries all the data processing logic, and its functions are tightly coupled:
      • Many functions: access and persistence, business logic processing, and all kinds of forwarding (RPC, message queue, writing PB to disk);
      • Heavy fan-out: there are 10+ business fan-out streams, all forwarded by the dot server.
  • Direct connection to message queues: from a business perspective, sending directly to a message queue carries a risk of message loss and cannot meet the business requirement of no loss.
  • No service tiering:
      • Core and non-core businesses are coupled in architecture and deployment;
      • They iterate together and affect each other.

3. Implementing no data loss

3.1 Theoretical basis for no data loss

3.1.1 The "only two loss points" theory

  • Terminal: because of the objective environment on mobile devices (white screens, crashes, processes that cannot stay resident, unpredictable start-up cycles, and so on), there is a certain probability that client messages are lost.
  • Access layer: server failures (service restarts, machine faults) are unavoidable, so there is also a certain probability of data loss here.
  • Computing layer: after the access layer, processing is built on a streaming framework and must guarantee no duplication and no loss in the strict sense. The terminal and the access layer are therefore the only two places where data can be lost.

3.1.2 Optimization direction of the log middle office architecture

Data access level:

  • Persist data first, process business logic afterwards
  • Reduce logic complexity

Downstream forwarding level:

  • Real-time stream class: no loss in the strict sense
  • High-timeliness class: guarantee data timeliness and tolerate possible partial loss
  • Resource isolation: physically isolate the deployments of different services so that they do not affect each other
  • Prioritization: tier different types of data according to how much the business depends on them

3.2 Architecture decomposition

Based on the analysis of the current state of the log middle office, combined with the "only two loss points" theory of the log dot service, we decomposed and restructured the existing architecture.

3.2.1 Dot server decomposition (reducing data loss at the access layer)

Based on the no-duplication, no-loss theory above, the log access layer was built along the following lines to ensure, as far as possible, that data is neither duplicated nor lost.

  • Log persistence first: minimize data loss caused by server failures at the access layer;
  • Giant service decomposition: keep the access layer simple and lightweight to avoid stability problems caused by too much business logic;
  • Flexible and easy to use: design a sound stream computing architecture around business requirements, on the premise of no duplication and no loss.

3.2.1.1 Log persistence first

All data currently fanned out by the log middle office must be persisted first; this is the basic requirement of the log access layer. For real-time streams, data must be "lost as little as possible" while business data forwarding still meets minute-level latency.

  1. Persistence: before any real business processing, the access layer persists the data first, so that data is lost "as little as possible".
  2. Real-time streaming: avoid connecting directly to message queues. Prefer writing to disk and forwarding to the message queue via minos, which keeps the delay at minute level at most while losing as little data as possible.
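
The persist-first idea can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the class `AccessLayer`, the function `forward_from_disk`, the JSON-lines file format, and the `send` callback are hypothetical stand-ins, and the forwarder only imitates the role that minos plays in the text. It is not the middle office's actual implementation.

```python
import json
import os
import uuid


class AccessLayer:
    """Hypothetical access-layer handler: persist first, process later."""

    def __init__(self, log_path):
        self.log_path = log_path

    def receive(self, payload):
        # Attach an id so that later stages could deduplicate retried records.
        record = {"log_id": uuid.uuid4().hex, "payload": payload}
        line = json.dumps(record, sort_keys=True)
        with open(self.log_path, "a", encoding="utf-8") as f:
            f.write(line + "\n")
            f.flush()
            os.fsync(f.fileno())  # ask the OS to write to disk before we acknowledge
        return record["log_id"]


def forward_from_disk(log_path, send, offset=0):
    """Stand-in for the disk-to-queue forwarder; returns the new byte offset."""
    with open(log_path, "rb") as f:
        f.seek(offset)
        while True:
            line = f.readline()
            if not line or not line.endswith(b"\n"):
                break  # end of file, or a partially written line to retry later
            send(json.loads(line))
            offset = f.tell()  # advance only after the record was handed over
    return offset
```

Because the offset advances only after `send` returns, a crash between the two steps can resend a record but cannot skip one. That is at-least-once delivery, which is why a deduplication step is still needed downstream.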

3.2.1.2 Giant service decomposition and function sinking

To reduce the stability risk caused by excessive feature iteration in the log service, and to meet downstream needs for flexible subscription, the fan-out of the log middle office has to be designed sensibly. We split the online service one step further:

  • Real-time streaming business: dot messages flow through access layer → fan-out layer → business layer → business.
      • Access layer: single function. The design goal is to lose as little data as possible by persisting data at the earliest possible moment;
      • Fan-out layer: guarantees flexible downstream subscription by splitting and regrouping data (currently fanned out by dot id);
      • Business layer: subscribes to and combines fan-out layer data, implements the business's own requirements, and produces and forwards dot data downstream;
  • High-timeliness business: the real-time policy recommendation business is split into its own service, which supports RPC data forwarding, guarantees very high timeliness, and keeps the data forwarding SLA above 99.95%;
  • Other businesses: data monitoring, VIP, grayscale, and similar businesses have lower requirements on timeliness and loss rate, so this part can be split out into a separate service;
  • Technology selection: given the characteristics of stream computing, we chose the streamcompute architecture to ensure that data is "neither duplicated nor lost" throughout the pipeline after it passes the access layer.

The log service can therefore be decomposed further, as shown in the following figure:

3.2.1.3 Stream computing considerations

To guarantee strict stability of the data flow, we rely on a stream computing architecture so that data is neither duplicated nor lost at all during business computation, while still meeting the business's needs for obtaining data in different scenarios. Given the characteristics of the log middle office, we designed the stream computing architecture as follows:

  • Dot server: sends the real-time stream through a message queue as a single fan-out to the streaming framework (the entry of the offload flow).
  • Offload flow: splits dot data and outputs it to different message queues according to traffic volume. This keeps data fan-out flexible, and downstream consumers can subscribe as they need:
      • Dots whose QPS exceeds a threshold, horizontal dots, and the like: output to a dedicated message queue for flexible fan-out;
      • Dots with lower QPS: aggregated and output to an aggregation queue to save resources.
  • Business flow: if a business has its own stream computing requirements, it can deploy a separate job so that each job's resources are isolated.
      • Input: subscribes to and combines different fan-out data from the offload flow for computation;
      • Output: after the data has been combined and computed, it is output to the business message queue, which the business side subscribes to and processes itself;
      • Business filter: the final operator before data is sent to the business layer. It is responsible for end-to-end no duplication of each data flow, covering retries on the dot server side and on the system side. The dot server generates a unique identifier for each dot record (similar to an MD5 of the record), and the business flow's filter operator performs global deduplication on it.
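
The business filter's deduplication can be sketched as follows. This is a minimal illustration under stated assumptions, not the streamcompute operator itself: `record_id` and `DedupFilter` are hypothetical names, the identifier is computed as an MD5 over a canonical JSON form of the record (the text only says the identifier is "similar to" an MD5), and an in-memory set stands in for whatever global state the streaming framework actually maintains.

```python
import hashlib
import json


def record_id(record):
    """Hypothetical unique id: MD5 over a canonical JSON form of the record.

    Assumes each record carries enough fields (timestamp, device id,
    sequence number, ...) that two distinct events never serialize identically.
    """
    canonical = json.dumps(record, sort_keys=True, separators=(",", ":"))
    return hashlib.md5(canonical.encode("utf-8")).hexdigest()


class DedupFilter:
    """Last operator before the business layer: drops ids it has already seen."""

    def __init__(self):
        # Assumption: unbounded in-memory state. A real job would bound it
        # (for example with a time window) and keep it in fault-tolerant state.
        self._seen = set()

    def accept(self, log_id):
        if log_id in self._seen:
            return False  # a retry or replay of an already delivered record
        self._seen.add(log_id)
        return True


def filter_stream(records):
    """Yield each record once, in first-seen order."""
    dedup = DedupFilter()
    for record in records:
        if dedup.accept(record_id(record)):
            yield record
```

Generating the identifier once at the dot server, rather than in each downstream job, means that a record resent by any retry along the path keeps the same id and can be recognized by the final filter.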

3.2.2 Dot SDK data reporting optimization (reducing data loss in client reporting)

Client-side dotting carries a certain risk of data loss because of the environment the client runs in. In particular, when dot calls are highly concurrent, not all data can be sent to the server immediately. The client therefore stages business dot data in a local database and then sends it to the server asynchronously, which gives asynchronous sending with local persistence first. However, the app may be exited or uninstalled at any moment, and the longer data stays on the device, the less valuable it is to the business and the more likely it is to be lost. Data reporting is therefore optimized in the following directions:

  • More reporting opportunities: a separate scheduled polling task, carrying cached data along when a business dot is reported, and a trigger when cached messages reach a threshold (determined experimentally). Together these increase the opportunities to report cached data and minimize the time messages spend cached locally.
  • More messages per report: while keeping the size and number of reports reasonable (thresholds obtained by experimental comparison), the number of messages per report is tuned so that as many messages as possible reach the server at the first attempt.
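
The three reporting triggers and the per-report batch size can be sketched as below. All names and values here are assumptions for illustration: `DotReporter`, the `realtime` flag, and the default thresholds are hypothetical, a Python list stands in for the local database, and `send` is a caller-supplied function that returns True on success. The real SDK's thresholds were obtained experimentally and are not given in the text.

```python
import time


class DotReporter:
    """Hypothetical client-side reporter; a list stands in for the local database."""

    def __init__(self, send, cache_threshold=20, max_batch=50,
                 interval_s=30.0, clock=time.monotonic):
        self.send = send                        # takes a list of records, returns True on success
        self.cache_threshold = cache_threshold  # cached-count trigger
        self.max_batch = max_batch              # messages per report
        self.interval_s = interval_s            # polling period
        self.clock = clock
        self.cache = []
        self.last_flush = clock()

    def log(self, record, realtime=False):
        self.cache.append(record)  # local persistence first
        if realtime:
            self.flush()  # a business dot is reported now and carries cached data along
        elif len(self.cache) >= self.cache_threshold:
            self.flush()  # cached messages reached the threshold

    def tick(self):
        """Called periodically by a scheduler: the scheduled polling trigger."""
        if self.cache and self.clock() - self.last_flush >= self.interval_s:
            self.flush()

    def flush(self):
        while self.cache:
            batch = self.cache[: self.max_batch]
            if not self.send(batch):
                break  # keep the records cached and retry on the next trigger
            del self.cache[: len(batch)]
        self.last_flush = self.clock()
```

Records are removed from the cache only after `send` reports success, so a failed upload leaves them in place for the next trigger instead of dropping them.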

Continuous optimization of the client-side sending logic has brought large gains in timeliness: the convergence rate improved by 2%+ on both client platforms.

4. Outlook

The sections above introduced some of the work the log middle office has done to guarantee data accuracy. We will continue to dig into the system's risk points, for example:

  • Data loss caused by disk failure: the access layer will next address disk failures and, building on the company's data persistence capabilities, lay the foundation for no data loss in the strict sense.

We hope the log middle office will keep improving and contribute to the accurate use of dot data in the business.

Reference: "A brief discussion of the no-duplication, no-loss implementation of the log middle office", Baidu Developer Center, InfoQ writing platform