IV. Results and planning
The status quo of the tracking framework was: the business side triggers a tracking event, an asynchronous thread is started, and after the data is assembled it is stored and reported at the same time; whenever a report completes, half of all remaining events in the database are reported again, with no check on the count and no regard for whether the fetched events are already being reported, and so the cycle repeats. On top of that, the tracking code is scattered across multiple modules that call into each other.
An analysis of each of these steps surfaced the following issues:
Inaccurate tracking data: lost events, duplicate reports, unlimited retries, and similar problems;
Low reporting success rate: half the database is reported at once, and the resulting concurrency easily congests the network interface or triggers OOM;
Poor portability: internal submodules are tightly coupled to each other, and even to business code, so the framework cannot be reused in other apps.
Summarizing the state of the existing tracking framework led to the following goals for the refactor:
Fix event loss: while also eliminating duplicate reports and unlimited retries, guarantee that every tracking event is eventually reported;
Raise the reporting success rate: the old framework succeeded only 70% of the time; the new framework should exceed 99%.
The framework as a whole is divided into three layers.
Interface layer
Defines the capabilities the SDK provides and acts as the bridge between the SDK and external business parties, exposing the SDK initialization entry, code-instrumented tracking interfaces, and SDK configuration.
Business core libraries
This layer includes a management module and three subsystems: data collection, data storage, and data reporting. Each subsystem is independent and handles only its own logic, while the management module coordinates the collaboration between the subsystems and is also responsible for constructing and initializing each of them.
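As a rough sketch of this layout (all names here are illustrative, not the framework's actual API), the manager owns the three subsystems and drives an event through them:

```java
// Illustrative sketch of the manager-plus-subsystems layout; all names are hypothetical.
interface CollectSystem { TrackEvent assemble(String name, String payload); }
interface StoreSystem   { void save(TrackEvent e); }
interface ReportSystem  { void report(TrackEvent e); }

class TrackEvent {
    final String name;
    final String payload;
    TrackEvent(String name, String payload) { this.name = name; this.payload = payload; }
}

// The manager constructs and wires the subsystems and coordinates their collaboration.
class TrackManager {
    private final CollectSystem collector;
    private final StoreSystem store;
    private final ReportSystem reporter;

    TrackManager(CollectSystem c, StoreSystem s, ReportSystem r) {
        collector = c; store = s; reporter = r;
    }

    // A tracked event flows: collect -> store -> report.
    void track(String name, String payload) {
        TrackEvent e = collector.assemble(name, payload);
        store.save(e);
        reporter.report(e);
    }
}
```

Each subsystem stays unaware of the others; only the manager knows the flow between them.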
The underlying library
The timeliness of tracking events varies case by case. Some events are not time-sensitive and can go through the database; others are time-sensitive and must not wait for the unified batch report, which would otherwise concentrate reports and trigger momentary anomaly alarms on the interface. The new framework therefore exposes two external interfaces: non-time-sensitive events are written to the database, while time-sensitive events bypass it.
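A minimal sketch of the two entry points, with in-memory lists standing in for the database and the report channel (class and method names are assumptions, not the framework's actual API):

```java
// Hypothetical public API: two entry points with different timeliness guarantees.
class Tracker {
    private final java.util.List<String> db = new java.util.ArrayList<>();      // stand-in for SQLite
    private final java.util.List<String> network = new java.util.ArrayList<>(); // stand-in for the report channel

    // Non-time-sensitive: persist first, report later with the scheduled batch task.
    void track(String event) { db.add(event); }

    // Time-sensitive: skip the database and report immediately.
    void trackImmediate(String event) { network.add(event); }

    int pending() { return db.size(); }
    int sent() { return network.size(); }
}
```

The split keeps latency-sensitive events off the batch path while everything else rides the durable store.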
To solve the event-loss problem, the reporting module combines real-time reporting with scheduled reporting: data that fails to report in real time is picked up by the scheduled task, so events are not stranded when the user performs no further operations.
1. Real-time reporting
When the app actively calls a code-instrumented tracking interface, or a monitor is triggered, the business data is carried into the new framework. It first enters the collection system, where the data's length and format are checked; once validation passes, the data is merged with the app's common data (device information, user information, etc.);
After the collection system finishes assembling the data, the management module schedules it into the storage system, which first converts the data to the format required for storage and then submits a storage task to a thread pool to queue for execution; while the data is being stored, it also enters the reporting flow in parallel;
The data coming out of the storage system is converted into a reporting task and handed to a single-threaded pool to queue for reporting. To prevent too many tasks from causing OOM, the pool's queue is capped at 128; once the cap is exceeded, the earliest task is discarded. After a report finishes, the management module either updates the record's report count (on failure) or deletes the record (on success);
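The bounded single-threaded reporting pool described above maps directly onto `java.util.concurrent`; a sketch, assuming the cap of 128 and a drop-oldest overflow policy:

```java
import java.util.concurrent.*;

public class ReportExecutor {
    // Single worker thread; bounded queue of 128; when the queue is full,
    // DiscardOldestPolicy evicts the earliest pending task and retries the new one.
    static ExecutorService newReportExecutor() {
        return new ThreadPoolExecutor(
                1, 1,                                   // single-threaded: reports run in order
                0L, TimeUnit.MILLISECONDS,
                new ArrayBlockingQueue<>(128),          // cap pending report tasks at 128
                new ThreadPoolExecutor.DiscardOldestPolicy());
    }
}
```

`DiscardOldestPolicy` matches the behavior described above: when the queue is full, the earliest pending report is dropped in favor of the newest (the dropped record stays in the database, so the scheduled task can still pick it up later).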
2. Scheduled reporting
The new framework registers listeners during initialization, mainly to report tracking events automatically;
Once initialization completes, the scheduled task is started as well;
When the timer fires, the framework first checks how much data in the database is pending report. If there is pending data, it is reported in batches: each batch reads a fixed number of records, and the next batch is read and reported only after the previous batch finishes, which prevents reading too much data into memory and causing OOM;
Data that is reported successfully is deleted; on failure the report count is updated, and any record whose report count reaches the upper limit is deleted;
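The batch loop and the retry ceiling above can be sketched as follows; `EventDao`, `Sender`, and the two constants are hypothetical stand-ins for the framework's actual storage and network layers:

```java
import java.util.List;

// Sketch of one run of the scheduled batch reporter. All names are illustrative.
class BatchReporter {
    static final int BATCH_SIZE = 20;   // fixed number of records per batch (assumed value)
    static final int MAX_UPLOADS = 5;   // preset uploadCount ceiling (assumed value)

    interface EventDao {
        List<long[]> loadBatch(int limit);   // pending records as [id, uploadCount] pairs
        void delete(long id);
        void bumpUploadCount(long id);
    }

    interface Sender { boolean send(long id); }  // true if the report succeeded

    static void runOnce(EventDao dao, Sender sender) {
        List<long[]> batch;
        // Read the next batch only after the previous one has finished reporting,
        // so at most BATCH_SIZE records sit in memory at a time.
        while (!(batch = dao.loadBatch(BATCH_SIZE)).isEmpty()) {
            for (long[] row : batch) {
                long id = row[0];
                long uploadCount = row[1];
                if (sender.send(id)) {
                    dao.delete(id);                  // success: drop the record
                } else if (uploadCount + 1 >= MAX_UPLOADS) {
                    dao.delete(id);                  // retry ceiling reached: give up
                } else {
                    dao.bumpUploadCount(id);         // failure: count the attempt, retry later
                }
            }
        }
    }
}
```

Because every failed attempt either bumps `uploadCount` toward the ceiling or deletes the record, the loop is guaranteed to terminate even when the network keeps failing.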
The new framework combines real-time reporting with a scheduled task to ensure no event is lost. But how do we prevent the scheduled task from picking up data that is currently being reported or still queued for reporting?
The design maintains a timestamp that records the generation time of the most recent event whose real-time report has completed. Every event generated before this timestamp has already been attempted, so any data fetched by this time filter consists only of failed reports and cannot overlap with data still queued for real-time reporting.
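The watermark idea fits in a few lines (names are illustrative):

```java
// Watermark sketch: the scheduled task only fetches records created strictly
// before the generation time of the last real-time-reported event.
class Watermark {
    private volatile long lastRealtimeDoneTs = 0;

    // Called when a real-time report finishes for an event generated at eventTs.
    void onRealtimeReportDone(long eventTs) {
        if (eventTs > lastRealtimeDoneTs) lastRealtimeDoneTs = eventTs;
    }

    // Everything created before the watermark has already been attempted once,
    // so what remains in the database from that range is failed reports only.
    boolean eligibleForScheduledReport(long createdTs) {
        return createdTs < lastRealtimeDoneTs;
    }
}
```

In the database this would translate into a `WHERE created_ts < watermark` clause on the scheduled task's query.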
Under the old framework, two tracking records could carry exactly the same data. How do we tell a duplicate report apart from a user clicking twice in quick succession?
In the new framework, each log carries a unique identifier generated from device information and a timestamp, so we can easily tell whether a log was reported twice or the user genuinely operated twice within an instant; the identifier can also be decoded back into the information it was generated from.
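One possible shape for such an identifier, assuming a simple device-id-plus-timestamp format (the framework's real format is not specified here):

```java
// Illustrative unique log id: device identifier plus generation timestamp,
// joined so both parts can be recovered later. The format is an assumption.
class LogId {
    static String make(String deviceId, long timestampMs) {
        return deviceId + "-" + timestampMs;
    }

    // Decode the id back into the information it was generated from.
    static String deviceOf(String id) {
        return id.substring(0, id.lastIndexOf('-'));
    }

    static long timestampOf(String id) {
        return Long.parseLong(id.substring(id.lastIndexOf('-') + 1));
    }
}
```

A device can in principle generate two identical events within the same millisecond, so a production version would typically append a monotonically increasing sequence number as well.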
So how do we avoid the old framework's unlimited-retry problem? Each log carries a report counter (uploadCount); once a log exceeds the preset number of attempts, it is deleted.
1. Accurate data
To verify the stability and robustness of the new framework, we use the uploadCount field to count the report attempts of all logs, calculate how many attempts a log needs on average before it is reported successfully, and derive the retransmission rate.
Finally, the new framework keeps a table that records, day by day, how many events the current device sent and how many succeeded; once these records land in the database, the success rate can be computed as the ratio of the device's success count to its send count.
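Both metrics reduce to simple ratios; a sketch, where the average-attempts formula is an assumption about how "attempts per successful report" is derived from uploadCount:

```java
// Metric arithmetic sketch. successRate follows the send/success table described
// above; averageAttempts is an assumed formula: total attempts over successful logs.
class Metrics {
    static double successRate(long sent, long succeeded) {
        return sent == 0 ? 0 : (double) succeeded / sent;
    }

    static double averageAttempts(long totalAttempts, long succeededLogs) {
        return succeededLogs == 0 ? 0 : (double) totalAttempts / succeededLogs;
    }
}
```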
At the same order of magnitude of PV and UV, the new framework reports about 20% more tracking events than the old one.
2. High success rate
The success rate of the new framework reached 99.912%, and the retransmission rate was 1.2%.
3. High portability and scalability
At present, both the Dada merchant app and the knight app have integrated the new framework. Integration and joint debugging took about 2 person-days, the required configuration code is kept to roughly 50 lines, and the framework has already gone through two non-intrusive extensions.
While the new framework is already running stably in multiple apps with good results, a few areas can still be optimized:
Add a tracking-data list and search function to make the verification stage faster and more accurate;
Integrate tracking into common components and bind data to them, improving tracking development efficiency;
For PV and exposure events, borrow ideas from SPM to improve the design and development efficiency of tracking.
Going forward, more and more apps will adopt the new tracking framework, and the Dada Express team will continue to optimize it to improve efficiency across the board.