In the Internet era everything is connected, and the network security situation grows more severe by the day. Security is the cornerstone of an enterprise, and risk control plays the role of its “police”, using a range of technologies and methods to protect the interests of the enterprise's users from harm.

Risk control decision-making is the entry point to the risk control middle platform. It provides event access for business risk scenarios, visual orchestration of complex decisions, a rich set of feature variables, scene recognition services, and other functions. A traditional risk control engine requires a development or algorithm background to use. The decision engine described in this article, once built, requires neither: strategy operators can configure it and apply it to business decisions on their own, fighting the black industry (organized fraud) in real time.

A risk event corresponds to a risk domain: a concrete abstraction of one specific business event area, which makes it easier for strategy operators to manage the countermeasure rules that belong to that domain.

For example, suppose you want to protect a viral (fission) marketing campaign, such as “share this content to your circle of friends and receive a gift worth 88 yuan”. The risk domain can be roughly divided into initiating the share (initiation frequency, cheating), accepting the share (fans, groups, frequency), and receiving the reward after sharing (frequency, groups, attributed group). Each step has different risk characteristics and needs its own dedicated strategies for prevention and control.

As above, once the risk control team and the business team have communicated thoroughly, they roughly know the risk points of the business mechanics and can clearly divide them into “risk events”. How to manage the countermeasure strategies for each domain efficiently and stably is then a major challenge for risk control R&D.

To combat the black industry, strategy operators need to change prevention and control methods constantly, and they expect each change to take effect immediately. So how can production changes be deployed safely and efficiently? To iterate quickly, the early team expressed changes as XML configuration. The advantage is high extensibility. The disadvantages are obvious: only developers can modify the decision flow configuration, on behalf of the strategy staff, who either do not understand the XML “code” or do not dare to touch it.

As the team expanded and professional UED and front-end developers joined, we were able to make the work visual and operable. Following the industry's BPMN 2.0 workflow design specification, we converted the tedious, complex XML into a decision flow orchestration interface. This greatly increased the efficiency of the strategy operators in their fight and also freed the back-end developers, who no longer have to hand-edit XML and worry about mistakes.

The “policy node” is the most important node in the decision flow. It is associated with a large number of rules for fighting the black industry, which raises the questions of how the rules work together and how the decision result is produced. Over the course of that fight, our predecessors distilled a set of modes for deploying rules quickly and maximizing their effect against the black industry without “killing by mistake” (wrongly blocking) good users.

A strategy runs in one of two modes: scorecard and worst match. Below I explain in detail how each mode works and where it fits.

The scorecard here is not the data-model scorecard but the expert scorecard. In plain terms, each rule carries a score assigned from expert experience, the scores of the rules that hit are added up, and if the final score falls in the rejection interval the request is rejected. In the early days, when “black labels” for users are scarce and you have to rely directly on expert experience, this mode is a good choice.

A scorecard is a matter of probability: the more suspicious rules a user hits, the more points they accumulate, and the higher the indicated risk. For example, to get around risk control, the black industry uses large numbers of proxy IPs or tampers with GPS positioning to confuse detection, and uses devices such as “cat pools” (SIM card pools) to supply large numbers of phone numbers. At the same time, many normal users now own two mobile phones, and some fly frequently between cities for work, so their location keeps changing. The scoring rules formulated by the strategist are as follows:

The scoring ranges are as follows

As you can see, normal users hit nothing and score 0, and users with multiple devices (2 to 3) basically score 20 and will not be disturbed. Even a normal user who travels frequently between cities will not be in Shanghai one second and in Guangzhou the next, so there are still traces that can be used against the black industry.
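The expert scorecard can be sketched as a sum of per-rule scores compared against a rejection threshold. This is a minimal sketch, not the engine's actual configuration: the rule names, the feature fields, and every number except “multiple devices scores 20” are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List, Tuple

Event = Dict[str, Any]


@dataclass(frozen=True)
class ScoredRule:
    name: str
    score: int                     # points added when the rule hits
    hits: Callable[[Event], bool]  # predicate over the event's features


# Hypothetical rules and weights. Only "2 to 3 devices scores 20" is from the text.
SCORED_RULES: List[ScoredRule] = [
    ScoredRule("multi_device", 20, lambda e: e["device_count"] >= 2),
    ScoredRule("proxy_ip", 40, lambda e: e["uses_proxy_ip"]),
    ScoredRule("city_jump_within_seconds", 50, lambda e: e["city_jump_within_seconds"]),
]
REJECT_FROM = 60  # hypothetical lower bound of the rejection interval


def scorecard(event: Event) -> Tuple[int, str]:
    """Add up the scores of all rules that hit, then map the total to a decision."""
    total = sum(rule.score for rule in SCORED_RULES if rule.hits(event))
    return total, ("reject" if total >= REJECT_FROM else "pass")
```

With these assumed weights, a user with two phones scores 20 and passes, while a request that hits the device, proxy, and location rules together lands well inside the rejection interval.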

The “worst match” mode is similar to anyMatch in stream processing: as soon as any one rule hits, the request is rejected. The rules in this mode must be highly certain and interpretable, so that a hit is essentially conclusive grounds for rejection. For example, if the same device both initiates and accepts an invitation, that is fraud and obviously violates the rules of the campaign, so it can be rejected immediately.

Choose the worst match mode with caution. Its rules should be the distillation of expert experience, verified many times in production, with high recall and precision, so that good users are not rejected by mistake. Otherwise a large number of customers will be wrongly rejected, which in serious cases hinders business development and badly damages brand value.
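Worst match can be sketched as a short-circuit over the rule list that stops at the first hit. The field names and the second rule here are hypothetical; only the “same device initiates and accepts” rule comes from the example above.

```python
from typing import Any, Callable, Dict, List, Optional, Tuple

Event = Dict[str, Any]
NamedRule = Tuple[str, Callable[[Event], bool]]

# Hypothetical field names. The first rule is the example from the text.
WORST_MATCH_RULES: List[NamedRule] = [
    ("same_device_invites_itself", lambda e: e["inviter_device"] == e["invitee_device"]),
    ("invitee_device_is_emulator", lambda e: e["invitee_device"].startswith("emulator-")),
]


def worst_match(event: Event, rules: List[NamedRule] = WORST_MATCH_RULES) -> Tuple[str, Optional[str]]:
    """Reject on the first rule that hits and report which rule it was."""
    for name, hits in rules:
        if hits(event):
            return "reject", name
    return "pass", None
```

Returning the name of the rule that hit keeps the decision interpretable, which matters in this mode because every rejection has to be explainable.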

A policy is an abstraction of one risk point in the current risk scenario. In the invitation scenario, for example, there can be a device policy, a mobile phone number policy, a group policy, and so on. A policy is a package of rules and is responsible for managing the life cycle of those rules.

Rules are the smallest “atomic” units in the risk control decision flow. A rule is composed as follows:

Example: a device policy package contains the following rule: the deduplicated count of all logged-in devices is greater than or equal to 4. The comparison is as follows:
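One way to picture such a rule is as a feature variable, a comparison operator, and a configured threshold. The sketch below assumes that three-part shape; the class, the feature name `distinct_login_devices`, and the operator table are illustrative and not the engine's actual rule schema.

```python
import operator
from dataclasses import dataclass
from typing import Any, Dict, Iterable

OPERATORS = {
    ">=": operator.ge,
    ">": operator.gt,
    "==": operator.eq,
    "<=": operator.le,
    "<": operator.lt,
}


@dataclass(frozen=True)
class Rule:
    feature: str    # name of the feature variable (left side)
    op: str         # comparison operator
    threshold: Any  # configured value (right side)

    def hit(self, features: Dict[str, Any]) -> bool:
        return OPERATORS[self.op](features[self.feature], self.threshold)


def distinct_count(values: Iterable[str]) -> int:
    """Deduplicated count, e.g. of device IDs seen at login."""
    return len(set(values))


# "Deduplicated count of all logged-in devices >= 4"
device_rule = Rule("distinct_login_devices", ">=", 4)
features = {"distinct_login_devices": distinct_count(["d1", "d2", "d2", "d3", "d4"])}
rule_hit = device_rule.hit(features)
```

Here the five logins come from four distinct devices, so the rule hits.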

A single rule hit on its own may be too intrusive for users. In that case rules need to be judged in combination, that is, as a rule group. A rule group can evaluate its rules with AND/OR logic: satisfy all, satisfy any one, or custom. Custom supports complex conditional expressions such as 1 || (2 && 3) || 4, where the numbers refer to rules, to meet the needs of different rule combinations.
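A custom expression like 1 || (2 && 3) || 4 can be evaluated with a small recursive-descent parser over the individual rule results. This is a sketch under the assumption that numbers are rule indexes and that && binds tighter than ||; the function names are hypothetical.

```python
import re
from typing import Dict, List, Optional


def eval_rule_group(expr: str, hits: Dict[int, bool]) -> bool:
    """Evaluate an expression such as "1 || (2 && 3) || 4" against rule results."""
    if re.search(r"[^\d\s|&()]", expr):
        raise ValueError(f"unsupported character in {expr!r}")
    tokens: List[str] = re.findall(r"\d+|\|\||&&|[()]", expr)
    pos = 0

    def peek() -> Optional[str]:
        return tokens[pos] if pos < len(tokens) else None

    def take() -> str:
        nonlocal pos
        if pos >= len(tokens):
            raise ValueError("unexpected end of expression")
        pos += 1
        return tokens[pos - 1]

    def parse_or() -> bool:
        value = parse_and()
        while peek() == "||":
            take()
            value = parse_and() or value  # parse first so tokens are always consumed
        return value

    def parse_and() -> bool:
        value = parse_atom()
        while peek() == "&&":
            take()
            value = parse_atom() and value
        return value

    def parse_atom() -> bool:
        tok = take()
        if tok == "(":
            value = parse_or()
            if take() != ")":
                raise ValueError("expected ')'")
            return value
        if not tok.isdigit():
            raise ValueError(f"unexpected token {tok!r}")
        return bool(hits[int(tok)])

    result = parse_or()
    if pos != len(tokens):
        raise ValueError("unexpected trailing tokens")
    return result


def rule_group(mode: str, hits: Dict[int, bool], expr: str = "") -> bool:
    """The three modes named in the text: satisfy all, satisfy any one, custom."""
    if mode == "all":
        return all(hits.values())
    if mode == "any":
        return any(hits.values())
    return eval_rule_group(expr, hits)
```

Parsing the expression instead of passing it to `eval` keeps operator-entered configuration from executing arbitrary code.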

“List nodes” are an important function in the decision flow and also one of the most dangerous defensive actions.

Why is a list needed? Suppose you are a known-bad account. Without a list, the whole decision flow has to run on every request, which wastes a great deal of computation and cost. To improve performance and reduce cost, you are marked black directly, and a list node is placed at the entrance of the decision flow. You can think of it as a large “cache” module: a blacklisted user is rejected immediately without running the rest of the decision flow. By the same reasoning, users judged to be high-value and low-risk can be whitelisted and passed immediately with no waiting. Only genuinely new users or “wavering” customers need to run the decision flow to have their risk judged.

Black and white lists are simple, crude, and very effective, but simple and crude also means they easily cause problems, and one careless step can land you in a pit of your own digging. A carelessly added blacklist entry may hit a large share of normal users, and likewise a carelessly added whitelist entry may open the door to malicious users.

So how did these lists come about?

Blacklists are extracted from historical malicious data: devices, mobile phone numbers, IPs, and so on. We also build the blacklist library together with third-party partners, who have after all been accumulating such data for a long time.

For whitelists, the simplest and bluntest approach is to look at the four quadrants of value and risk. High-value, low-risk users are most likely our target customers (not absolutely, since this can also be a disguise), and this group of users can be whitelisted directly.

A list must be time-limited, and there must be barriers within it, which can be understood as domain isolation. What does that mean? A user may be bad in the current scene but good in other scenarios, so it is only necessary to isolate and block them in the current scene. Time limits deal with black industry operators who know how to “farm” accounts, or who start out disguised as high-value users. The risk control program needs to regularly review and recalculate which users should be whitelisted or blacklisted, so that rewards and punishments are clear: wrong no one if at all possible, and indulge no one.
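A list node with both properties can be sketched as a lookup keyed by scene and user, where every entry carries an expiry time. The class and function names are hypothetical, and a production list would live in a shared store instead of a process-local dictionary.

```python
from typing import Callable, Dict, Optional, Tuple


class ListNode:
    def __init__(self) -> None:
        # (scene, user_id) -> (color, expires_at). Keying by scene gives domain isolation.
        self._entries: Dict[Tuple[str, str], Tuple[str, float]] = {}

    def add(self, scene: str, user_id: str, color: str, ttl_seconds: float, now: float) -> None:
        self._entries[(scene, user_id)] = (color, now + ttl_seconds)

    def lookup(self, scene: str, user_id: str, now: float) -> Optional[str]:
        entry = self._entries.get((scene, user_id))
        if entry is None:
            return None
        color, expires_at = entry
        if now >= expires_at:  # expired entries no longer count
            del self._entries[(scene, user_id)]
            return None
        return color


def decide(lists: ListNode, scene: str, user_id: str, now: float,
           run_flow: Callable[[], str]) -> str:
    """Short-circuit on a list hit; only unknown users run the full decision flow."""
    color = lists.lookup(scene, user_id, now)
    if color == "black":
        return "reject"
    if color == "white":
        return "pass"
    return run_flow()
```

Passing `now` explicitly is a choice made for this sketch so that expiry is easy to test; a real node would read a clock.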

The decision flow graph needs “branch nodes” to route the flow. Data nodes (start, list, and policy nodes are all data nodes) are responsible for emitting computed data. A branch node takes that data and, according to its conditional expressions, routes the flow to the corresponding downstream node.

When the decision flow reaches a branch node, it evaluates the conditional expression on each outgoing branch, and as soon as one condition is satisfied, execution continues down that branch.
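Branch routing can be sketched as an ordered list of conditions paired with target nodes. Evaluating the conditions in order and taking the first match is an assumption of this sketch, as are the node names and score thresholds.

```python
from typing import Any, Callable, Dict, List, Tuple

NodeData = Dict[str, Any]
Branch = Tuple[Callable[[NodeData], bool], str]

# Hypothetical branches leaving a policy node that emitted a score.
BRANCHES: List[Branch] = [
    (lambda d: d["score"] >= 60, "reject_node"),
    (lambda d: d["score"] >= 30, "review_node"),
    (lambda d: True, "pass_node"),  # default branch
]


def route(data: NodeData, branches: List[Branch] = BRANCHES) -> str:
    """Return the downstream node of the first branch whose condition holds."""
    for condition, target in branches:
        if condition(data):
            return target
    raise LookupError("no branch condition matched")
```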

Fork & Join is an advanced node type in decision flow orchestration. The decision flow is really a large DAG (directed acyclic graph). If every path were executed one after another, it would take too long: the business side leaves risk control no more than 200 ms for a decision, yet risk control involves a large amount of computation and I/O. Here you can configure Fork/Join nodes to execute paths concurrently and then merge the results, shortening the overall path time.
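The Fork/Join idea can be sketched with a thread pool: fork submits every path at once, join waits for all of them and merges their outputs. The two feature functions are hypothetical stand-ins for I/O-bound lookups, with `time.sleep` simulating latency.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Any, Callable, Dict, List

Event = Dict[str, Any]


def device_features(event: Event) -> Dict[str, Any]:
    time.sleep(0.05)  # simulated I/O latency
    return {"device_count": 2}


def ip_features(event: Event) -> Dict[str, Any]:
    time.sleep(0.05)  # simulated I/O latency
    return {"uses_proxy_ip": False}


def fork_join(event: Event, paths: List[Callable[[Event], Dict[str, Any]]]) -> Dict[str, Any]:
    """Run all paths concurrently (fork), then merge their results (join)."""
    merged: Dict[str, Any] = {}
    with ThreadPoolExecutor(max_workers=len(paths)) as pool:
        futures = [pool.submit(path, event) for path in paths]
        for future in futures:
            merged.update(future.result())
    return merged
```

Run this way, the two paths take roughly as long as the slower one instead of the sum of both.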

Decision flow performance optimization is very challenging, and this is only one small optimization point. Space is limited here, so a later article will cover performance optimization specifically.

Stability is a cliché, all the more so for a risk control system that holds the power of “life and death” over requests. Risk control places system stability above strategy execution. That is, the fallback strategy is to pass: rather than affect the normal user experience, we allow some black users through, since there are various ways to fish them out and ban them after the fact (provided that the offline response is fast enough and the system is good enough, because the black industry itself is very efficient).

The business leaves no more than 200 ms for risk control strategy execution, yet risk control has to process a large amount of computational logic within those 200 ms. The two requirements pull against each other.

Risk control is one of the few systems I have seen that really pushes concurrency to the limit. To save time, a large number of computations run in parallel with timeouts, and the idea of trading space for time is taken to the extreme: values are computed in advance and stored, so that at decision time they are read straight from memory. The hardest thing for risk control R&D is making sure the current implementation never exceeds the timeout, which is a considerable technical challenge.
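A time budget with a pass-through fallback can be sketched by waiting on a future with a timeout. The function name and return shape are hypothetical. Treating errors the same way as timeouts is an assumption of this sketch that follows the “fallback is to pass” principle above.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout
from typing import Any, Callable, Dict, Tuple

Event = Dict[str, Any]
Decision = Tuple[str, str]  # (result, reason)

POOL = ThreadPoolExecutor(max_workers=4)
BUDGET_SECONDS = 0.2  # the 200 ms the business side allows


def decide_with_budget(flow: Callable[[Event], Decision], event: Event,
                       budget: float = BUDGET_SECONDS) -> Decision:
    """Run the decision flow, but never make the caller wait longer than the budget."""
    future = POOL.submit(flow, event)
    try:
        return future.result(timeout=budget)
    except FutureTimeout:
        return "pass", "timeout"  # fail open: do not block a possibly good user
    except Exception:
        return "pass", "error"


def slow_flow(event: Event) -> Decision:
    time.sleep(0.3)  # simulated slow dependency
    return "reject", "late"


def fast_flow(event: Event) -> Decision:
    return "reject", "rule_hit"
```

The timed-out task keeps running in the pool after the caller has moved on, so a real system also has to bound that work, for example with per-call timeouts on the underlying I/O.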

Putting a strategy online is a “high-risk” action. If the operators' offline analysis is wrong, a large number of good users may be intercepted in production, leading to a flood of customer complaints and heavy losses. The decision flow should therefore be designed from the start to support gray (canary) release of new versions, ramping traffic gradually from 0 to 100 percent so that losses are minimized.
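Gradual traffic ramping can be sketched by hashing each user into one of 100 stable buckets and sending the buckets below the configured percentage to the new version. Bucketing by user ID and the version labels are assumptions of this sketch.

```python
import hashlib


def bucket(user_id: str) -> int:
    """Map a user to a stable bucket in 0..99 (the same user always gets the same bucket)."""
    digest = hashlib.sha256(user_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % 100


def pick_version(user_id: str, gray_percent: int,
                 stable: str = "v1", gray: str = "v2") -> str:
    """Send roughly `gray_percent` percent of users to the new decision flow version."""
    return gray if bucket(user_id) < gray_percent else stable
```

Because the bucket is derived from the user ID, a user who sees the new version at 10 percent keeps seeing it as the ramp moves to 50 and 100 percent.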

When a production problem occurs, first check what production changes were made recently. If a decision flow change is the cause, version rollback must be supported so that service can be restored as quickly as possible.
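Rollback presupposes that every published decision flow configuration is kept as an immutable version. A minimal sketch of that idea follows; the class name and the in-memory history are hypothetical.

```python
from typing import Any, Dict, List


class DecisionFlowVersions:
    def __init__(self) -> None:
        self._history: List[Dict[str, Any]] = []
        self._active = -1  # index into _history, -1 means nothing published yet

    def publish(self, config: Dict[str, Any]) -> int:
        """Store a copy of the config as a new version and make it active."""
        self._history.append(dict(config))
        self._active = len(self._history) - 1
        return self._active + 1  # 1-based version number

    def rollback(self) -> int:
        """Re-activate the previous version without deleting the newer one."""
        if self._active <= 0:
            raise RuntimeError("no earlier version to roll back to")
        self._active -= 1
        return self._active + 1

    def active(self) -> Dict[str, Any]:
        if self._active < 0:
            raise RuntimeError("no version published")
        return self._history[self._active]
```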

For large promotions or sudden traffic bursts, rate limiting and circuit breaking are needed per risk scenario. A set of anti-avalanche mechanisms can be developed with reference to the industry's open-source Sentinel to ensure system stability.
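Per-scenario rate limiting can be sketched with a fixed one-second window counter. This is a generic illustration and not Sentinel's API; the class name and the limits are hypothetical.

```python
from typing import Dict, Tuple


class ScenarioRateLimiter:
    def __init__(self, limits_per_second: Dict[str, int]) -> None:
        self._limits = limits_per_second
        # scene -> (window start in whole seconds, requests counted in that window)
        self._windows: Dict[str, Tuple[int, int]] = {}

    def allow(self, scene: str, now: float) -> bool:
        """Return True if the request fits in the current one-second window."""
        window = int(now)
        start, count = self._windows.get(scene, (window, 0))
        if start != window:  # a new second has begun, reset the counter
            start, count = window, 0
        if count >= self._limits[scene]:
            self._windows[scene] = (start, count)
            return False
        self._windows[scene] = (start, count + 1)
        return True
```

A request that is not allowed would skip the decision flow and take the fallback result, so that a traffic burst in one scenario cannot drag down the others.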

Monitoring is a cliché, but it really matters. Imagine you do not monitor the rejection results of decisions in some scenario: if a change in production causes a large number of rejections and users cannot proceed normally, the consequences are frightening to think about.
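Rejection monitoring can be sketched as a sliding window over recent decisions that raises an alert when the rejection rate crosses a threshold. The class name, window size, and threshold values are hypothetical and would be tuned per scenario.

```python
from collections import deque
from typing import Deque


class RejectRateMonitor:
    def __init__(self, window: int = 1000, threshold: float = 0.2,
                 min_samples: int = 100) -> None:
        self._recent: Deque[bool] = deque(maxlen=window)  # True means rejected
        self._threshold = threshold
        self._min_samples = min_samples

    def record(self, decision: str) -> bool:
        """Record one decision and return True if an alert should fire."""
        self._recent.append(decision == "reject")
        if len(self._recent) < self._min_samples:
            return False  # too few samples to judge
        rate = sum(self._recent) / len(self._recent)
        return rate > self._threshold
```

One monitor per scenario keeps a spike in a single scene from being averaged away by healthy traffic elsewhere.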

The decision engine is the brain of risk control. For risk control to fight the black industry efficiently, the decision engine is its public face.

At present the decision engine supports configuration and orchestration, and we are working to move it toward intelligence and automation, to help business staff deploy rules better and more efficiently. We are also thinking about how to let businesses connect to risk control quickly and without noticing it, or at least with minimal intrusion (low cost), which is another challenge. The people behind the black industry are human too and evolve through confrontation. More and more new methods will challenge the barriers of risk control and security, and there is a long way to go.

• Performance optimization essentials – flame map
• How did I get into the business of risk control
• Flink in the risk control scene real-time feature landing practice[1]

Follow the official account: Gollum chicken technical column

Personal Technical Blog:

[1] Flink in the risk control scene real-time feature landing practice: