Because of their strong nonlinear fitting ability and interpretability, ensemble tree models are widely used and perform well on tasks such as tabular data mining.
Model interpretability is highly valued in some fields (such as financial risk control). To interpret a tree model, we can often export the model's tree structure or use interpretability frameworks such as SHAP:
For related work on tree-model interpretability, see the following paper: https://www.cs.sjtu.edu.cn/~kzhu/papers/kzhu-infocode.pdf
The following are several commonly used frameworks that help interpret models:
SHAP is based on the Shapley value, a game-theory concept proposed by economist Lloyd Shapley. Its core idea is to compute each feature's marginal contribution to the model output, and then interpret the "black-box" model at both the global and local level. A few lines of code can show how the model's variables influence its decisions. Taking the Interest (historical interest rate) feature as an example: the higher the feature value (blue is low, red is high), the higher the corresponding SHAP value, meaning the decision result is closer to 1 (in this financial risk-control example, a larger value means the borrower is more likely to default).
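To make the underlying game-theory idea concrete, the sketch below computes exact Shapley values for a toy two-feature "coalition game" in pure Python (the payoff numbers are made up; this illustrates the definition, not the shap library's faster tree-specific algorithm):

```python
from itertools import permutations

def shapley_values(players, v):
    # Shapley value = each player's marginal contribution v(S + p) - v(S),
    # averaged over every possible ordering in which the coalition forms.
    phi = {p: 0.0 for p in players}
    perms = list(permutations(players))
    for order in perms:
        coalition = set()
        for p in order:
            before = v(frozenset(coalition))
            coalition.add(p)
            phi[p] += v(frozenset(coalition)) - before
    return {p: phi[p] / len(perms) for p in players}

# Hypothetical payoffs for each feature coalition (made-up numbers):
payoff = {frozenset(): 0, frozenset({"a"}): 10,
          frozenset({"b"}): 20, frozenset({"a", "b"}): 40}
v = lambda s: payoff[s]
print(shapley_values(["a", "b"], v))  # {'a': 15.0, 'b': 25.0}
```

Note the "efficiency" property: the two values sum to v({a, b}) = 40, i.e. the contributions exactly account for the full model output. In practice, `shap.TreeExplainer` computes these values efficiently for tree models, and `shap.summary_plot` draws the blue/red chart described above.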
LIME was one of the earliest well-known methods in the field of interpretability. It helps explain what a machine learning model has learned and why it makes a particular prediction. LIME currently supports interpreting tabular data, text classifiers, and image classifiers.
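LIME's core idea, sampling perturbations around one instance and fitting a proximity-weighted linear surrogate to the black-box outputs, can be sketched in a few lines of NumPy (a toy black-box function stands in for a real model; this is the idea, not the lime library itself):

```python
import numpy as np

def local_surrogate(f, x0, n_samples=2000, scale=0.1, seed=0):
    # Sample points near x0, query the black box f, weight samples by
    # proximity, and fit a weighted linear model as a local explanation.
    rng = np.random.default_rng(seed)
    X = x0 + rng.normal(0.0, scale, size=(n_samples, x0.size))
    y = np.array([f(x) for x in X])
    d = np.linalg.norm(X - x0, axis=1)
    w = np.exp(-(d ** 2) / (2 * scale ** 2))        # proximity kernel
    A = np.hstack([X, np.ones((n_samples, 1))])     # add intercept column
    sw = np.sqrt(w)
    coef, *_ = np.linalg.lstsq(A * sw[:, None], y * sw, rcond=None)
    return coef[:-1]  # local slope per feature (intercept dropped)

# Toy nonlinear "model": near x0 = (1, 2) its local gradient is (2, 3)
g = local_surrogate(lambda x: x[0] ** 2 + 3 * x[1], np.array([1.0, 2.0]))
print(g)  # approximately [2. 3.]
```

The fitted slopes approximate the black box's local behavior, which is exactly the kind of per-feature explanation LIME reports for a single prediction.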
Shapash is a Python library that aims to make machine learning interpretable and understandable by everyone. It offers several types of visualizations with clear labels that anyone can read, so data scientists can understand their models more easily and share results, while end users can understand how the model reaches a judgment through a standard summary.
The Shapash library generates an interactive dashboard that collects many visualizations related to SHAP/LIME explanations. It can use SHAP or LIME as its backend; in other words, it mainly provides better-looking charts on top of them.
ELI5 is a Python library that helps debug machine learning classifiers and explain their predictions. It currently supports the following machine learning frameworks: scikit-learn, XGBoost, LightGBM, CatBoost, and Keras.
However, the interpretability of tree models is also limited. Even after understanding a tree model's decision logic, it is not as easy as with logistic regression (LR) to adjust feature binning or the model itself to conform to business logic (for example, the business may expect that the lower the income, the more likely a credit card is to become overdue, while the model may have learned the opposite logic and needs to be adjusted).
Once we find that the tree structure or SHAP values do not conform to business logic, adjusting the tree structure accordingly is tricky, because what a tree model learns is usually rather complex. So in many cases we can only tear down the original model, redo data cleaning, filtering, and feature selection, and retrain a new model until its feature decisions can be explained by the business.
This article briefly explores a method for quickly adjusting the structure of a LightGBM tree model.
First, export the structure of a single LightGBM tree and the corresponding model file:
The score value of leaf 5 is 0.147.
The higher the score value, the closer the leaf's decision result is to 1 (in this financial risk-control example, a larger value means the borrower is more likely to default).
In financial risk control, the explainability of decisions is very important, and sometimes we may find that a certain leaf node's decision does not fit the business interpretation. For example, if the business believes that the higher the interest rate, the lower the probability of default, then the node in the figure above does not match business experience. (Note: this is just a hypothesis; in fact, the decision of the node in the figure does match business experience.)
In that case, the fastest way to fine-tune the tree model is to prune off that leaf node directly, letting only the remaining internal nodes make the decision.
So how can we quickly adjust a LightGBM tree structure by hand (for example, prune it)?
There is a clever pruning trick here: while keeping the original tree structure, modify the score value of the target leaf nodes to the score value of their parent node. This is logically equivalent to "pruning".
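The trick can be sketched on the dict structure returned by `booster.dump_model()` (the key names `leaf_index`, `leaf_value`, `internal_value`, `left_child`, `right_child` follow LightGBM's dump format; the example tree below is hand-built for illustration). To actually score with the pruned model, you would apply the same change to the `leaf_value` entries in the saved `model.txt` and reload it with `lgb.Booster(model_file=...)`:

```python
def prune_to_parent(node, target_leaf_index):
    # Walk a dump_model()-style tree; when the target leaf is found,
    # overwrite its value with its parent split's internal_value.
    # The split remains, but that branch no longer changes the score --
    # logically equivalent to pruning the leaf.
    if "leaf_index" in node:          # reached a leaf without a match
        return False
    for key in ("left_child", "right_child"):
        child = node[key]
        if child.get("leaf_index") == target_leaf_index:
            child["leaf_value"] = node["internal_value"]
            return True
        if prune_to_parent(child, target_leaf_index):
            return True
    return False

# Hand-built tree in dump_model() format (values are made up):
tree = {
    "internal_value": 0.10,
    "left_child": {"leaf_index": 0, "leaf_value": -0.20},
    "right_child": {
        "internal_value": 0.30,
        "left_child": {"leaf_index": 1, "leaf_value": 0.05},
        "right_child": {"leaf_index": 2, "leaf_value": 0.50},
    },
}
prune_to_parent(tree, target_leaf_index=2)
print(tree["right_child"]["right_child"]["leaf_value"])  # 0.3
```

After the edit, both children of the parent split carry scores close to the parent's own value, so the offending split no longer drives the decision in that branch.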
We then compare the model's performance on the test set before and after pruning: AUC decreased by 1%, and KS barely changed.
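For reference, both metrics can be computed directly from test-set scores; a small self-contained sketch in pure NumPy (the labels and scores are toy values, not the experiment's data):

```python
import numpy as np

def auc_and_ks(y_true, score):
    """AUC via pairwise ranking; KS as the max TPR-FPR gap over thresholds."""
    y_true, score = np.asarray(y_true), np.asarray(score)
    pos, neg = score[y_true == 1], score[y_true == 0]
    # AUC = P(random positive outranks random negative); ties count half
    gt = (pos[:, None] > neg[None, :]).mean()
    eq = (pos[:, None] == neg[None, :]).mean()
    auc = gt + 0.5 * eq
    thresholds = np.unique(score)
    tpr = np.array([(pos >= t).mean() for t in thresholds])
    fpr = np.array([(neg >= t).mean() for t in thresholds])
    return auc, np.abs(tpr - fpr).max()

auc, ks = auc_and_ks([0, 0, 1, 1], [0.10, 0.40, 0.35, 0.80])
print(auc, ks)  # 0.75 0.5
```

Running this on the test set's predicted scores before and after pruning gives the comparison quoted above.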
Pruning to reduce model complexity, or to conform to reasonable business experience, acts as a form of regularization: it reduces the influence of statistical noise (less overfitting) and yields better generalization.
Of course, this method suits small-scale ensemble tree models. If you have a large-scale ensemble with hundreds of trees, manually adjusting the structure of each one is unrealistic.