About the author

McZhao is a senior R&D manager at Ctrip, focusing on natural language processing technology.


As consumer trends and internet-celebrity memes keep emerging, NLP tasks on e-commerce platforms frequently encounter words that have never been seen before. These words are absent from the system’s existing lexicon and are called “unregistered words” (out-of-vocabulary words).

On the one hand, words missing from the lexicon degrade the segmentation quality of dictionary-based word segmenters, which in turn hurts the quality of text recall and of highlighting prompts, that is, the accuracy of the user’s text search and the interpretability of the search results.

On the other hand, mainstream NLP deep learning models such as BERT/Transformer often use character vectors rather than word vectors when handling Chinese. In theory word vectors should work better, but in practice character vectors win, precisely because of unregistered words. If the lexicon were more complete, word vectors would outperform character vectors.

In summary, new word discovery is a problem we need to solve now.

1. Traditional unsupervised methods

Chinese new word discovery already has fairly mature solutions in industry. The input corpus is cut into n-gram fragments to produce candidate segments; some statistical features of these fragments are computed, and those features determine whether a fragment is a word.

The mainstream industry practice is to compute and inspect indicators along three dimensions: heat, cohesion, and the richness of left and right neighbors. Many articles online describe these three indicators; a brief introduction follows, and details can be found in the two new-word-discovery articles by Hello NLP and Smooth NLP.

1.1 Heat

Word frequency is used to represent heat. Count the occurrences of all fragments across the whole corpus; high-frequency fragments are often words.
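As a minimal sketch (the toy corpus, fragment-length cap, and function name here are illustrative assumptions), counting the heat of every candidate fragment looks like this:

```python
from collections import Counter

def fragment_counts(corpus, max_len=4):
    """Count every character fragment up to max_len characters long."""
    counts = Counter()
    for sentence in corpus:
        for i in range(len(sentence)):
            for j in range(i + 1, min(i + max_len, len(sentence)) + 1):
                counts[sentence[i:j]] += 1
    return counts

corpus = ["上海虹桥商务区", "上海虹桥机场", "虹桥商务区酒店"]
counts = fragment_counts(corpus)
print(counts["虹桥"])  # "虹桥" (Hongqiao) occurs once in each of the 3 sentences
```

High-count fragments such as “虹桥” become strong word candidates; single-pass counting like this scales linearly with corpus size for a fixed `max_len`.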

1.2 Cohesion

Pointwise mutual information (PMI) is used to measure cohesion:

For example, to judge whether “Hanting” is a word, we compute log(P(“Hanting”) / (P(“Han”) · P(“Ting”))). The probability that “Hanting” is a word is proportional to the heat of “Hanting” and inversely proportional to the heat of “Han” and “Ting”. This is easy to understand with the most common Chinese character, “de” (的): almost any character co-occurs with “de” frequently, but that does not make “x-de” or “de-x” a word; the high popularity of “de” in the denominator suppresses such collocations.
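A hedged sketch of this PMI score, with toy counts standing in for real corpus statistics (the numbers and the worst-split convention are illustrative assumptions):

```python
import math
from collections import Counter

def pmi(fragment, counts, total):
    """log( P(fragment) / (P(left) * P(right)) ) over the worst split,
    i.e. the weakest internal bond of the fragment."""
    p_frag = counts[fragment] / total
    best = float("inf")
    for k in range(1, len(fragment)):
        p_left = counts[fragment[:k]] / total
        p_right = counts[fragment[k:]] / total
        best = min(best, math.log(p_frag / (p_left * p_right)))
    return best

# Toy statistics: "汉庭" (Hanting) is frequent relative to its characters,
# while "的" (de) is so hot that it suppresses "我的" as a word candidate.
counts = Counter({"汉庭": 10, "汉": 20, "庭": 15, "的": 500, "我": 30, "我的": 12})
total = 1000
print(pmi("汉庭", counts, total) > pmi("我的", counts, total))  # True
```

The huge count of “的” in the denominator drives the PMI of “我的” down, exactly the suppression effect described above.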

1.3 Left and right neighbor richness

Left and right adjacency entropy represents the richness of the neighboring characters, i.e. the randomness of the distribution of characters appearing to the left or right of the candidate fragment. The left entropy and right entropy can be used separately, or combined into a single indicator.

For example, the heat and cohesion of the fragment “Shangri-La” are very high, and so are the heat and cohesion of its sub-fragment “Shangri”. But because “Shangri” is followed by “La” in most cases, its right adjacency entropy is very low, which suppresses it as a word; we can thus judge that the three characters “Shangri” cannot stand alone.
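A minimal sketch of adjacency entropy (the toy corpus and helper name are illustrative assumptions): in this corpus “Shangri” (香格里) is always followed by “La” (拉), so its right entropy collapses to zero.

```python
import math
from collections import Counter

def adjacency_entropy(fragment, corpus, side="right"):
    """Shannon entropy of the characters adjacent to fragment occurrences."""
    neighbors = Counter()
    for sentence in corpus:
        start = sentence.find(fragment)
        while start >= 0:
            if side == "right" and start + len(fragment) < len(sentence):
                neighbors[sentence[start + len(fragment)]] += 1
            elif side == "left" and start > 0:
                neighbors[sentence[start - 1]] += 1
            start = sentence.find(fragment, start + 1)
    total = sum(neighbors.values())
    if total == 0:
        return 0.0
    return sum(c / total * math.log(total / c) for c in neighbors.values())

corpus = ["香格里拉酒店", "入住香格里拉", "预订香格里拉大酒店"]
print(adjacency_entropy("香格里", corpus))  # 0.0: always followed by "拉"
```

By contrast, the full fragment “香格里拉” is followed by varied characters (酒, 大, or sentence end), so its right entropy stays positive and it survives as a word candidate.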

2. Limitations of the classical methods

The problem with the classical approach is that the threshold parameters must be set manually. After an NLP expert understands the probability distribution of fragments in the current corpus, the indicators are combined by a formula or used independently, and thresholds are set as the judgment criterion; judging by this standard can indeed achieve high accuracy.

However, the probability distributions and word frequencies are not static. As the corpus grows richer, or as the weighted heat of the corpus (usually the corresponding product heat) fluctuates, the expert-set parameters and thresholds must be adjusted continually. This wastes a great deal of manpower and turns AI engineers into full-time threshold tuners.

3. New word discovery based on deep learning

3.1 Word frequency probability distribution plot

The three indicators of the industry algorithms above all derive from a single underlying feature: word frequency. In statistics, simple but key statistics are often displayed as pictures, such as histograms or box plots; even without a model, a human can often make the correct judgment at a glance. So we cut the corpus into all fragments up to a limited length, normalize each fragment’s word frequency to 0-255, and map the fragments into a two-dimensional matrix: the row is the starting character, the column is the ending character, each pixel is one fragment, and the brightness of the pixel is the heat of the candidate fragment.
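The mapping just described can be sketched as follows (normalizing by the maximum count and capping the map at 24×24 are assumptions for illustration):

```python
import numpy as np
from collections import Counter

def freq_map(sentence, counts, size=24):
    """Row = index of the starting character, column = index of the ending
    character; pixel brightness = fragment frequency normalized to 0-255."""
    img = np.zeros((size, size), dtype=np.uint8)
    peak = max(counts.values())
    for i in range(len(sentence)):
        for j in range(i, min(len(sentence), size)):
            frag = sentence[i:j + 1]
            if frag in counts:
                img[i, j] = int(255 * counts[frag] / peak)
    return img

# Toy counts: the fragment "上海" is half as hot as the character "上".
counts = Counter({"上": 4, "海": 2, "上海": 2})
img = freq_map("上海", counts)
print(img[0, 0], img[0, 1])  # 255 127
```

Only the upper triangle (start index ≤ end index) is populated, which is why word blocks show up as the isosceles right triangles described below.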

The above figure is the word frequency probability distribution map of the short sentence “Ramada Pudong Airport”. We are pleasantly surprised to find that, with the naked eye alone, we can roughly distinguish some brighter isosceles right-triangle blocks, such as “Pudong”, “Pudong Airport”, “Airport”, and “Ramada Hotel”. These blocks indicate that the corresponding fragments are exactly the words we need.

3.2 Classic image segmentation algorithms

Looking at the word frequency probability distribution map, we can turn the problem of segmenting a short sentence into an image segmentation problem. Early image segmentation algorithms, like the new word discovery algorithms above, were threshold-based detection of grayscale changes at edges; with the development of the field, deep learning algorithms are now standard, of which the best known is U-Net.

The first half of U-Net uses convolutional downsampling to extract multiple layers of features at different granularities; the second half upsamples and concatenates these features at the same resolution; finally, pixel-level classification results are obtained with a fully connected layer plus Softmax.

3.3 New word discovery algorithms based on convolutional networks

Segmenting the word frequency probability distribution map is similar to segmenting an image: both cut out regions that are adjacent in position and similar in grayscale. Short-sentence segmentation can therefore borrow from image segmentation and use a fully convolutional network. We use convolution because, when cutting short sentences or images, we care most about local information, i.e. the pixels near the cutting edges. We use a multi-layer network because multi-layer pooling can express threshold judgments over features at different levels. For example, when cutting terrain on a map we must consider both the slope (first derivative/difference) and the change of slope (second derivative/difference); thresholding the two and combining them is not a simple linear weighting but a serial, stacked network.
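The slope analogy can be made concrete in one dimension: applying the same difference kernel twice in series (a serial composition, not a weighted sum) yields the second derivative. A tiny numpy illustration:

```python
import numpy as np

d1 = np.array([1.0, -1.0])                       # first-order difference kernel
terrain = np.array([0.0, 0.0, 1.0, 3.0, 6.0])    # rising "terrain" heights

slope = np.convolve(terrain, d1, mode="valid")      # first derivative
curvature = np.convolve(slope, d1, mode="valid")    # second derivative
print(slope)      # [0. 1. 2. 3.]
print(curvature)  # [1. 1. 1.]
```

Stacked convolutional layers generalize exactly this: each layer can threshold its own order of difference, which a single linear combination of the input cannot do.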

For the new word discovery scenario, we design the following algorithm:

1) Zero-pad the word frequency distribution map of the short sentence to 24×24;

2) Apply two 3×3 convolutional layers, each outputting 4 channels;

3) Concatenate the outputs of the convolutional layers, apply another 3×3 convolution, and output a single channel;

4) The loss function is a logistic loss (sigmoid cross-entropy on logits), so the last layer can be used directly for classification without a softmax output.
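The four steps above can be sketched in Keras as follows (the layer names, ReLU activations, and “same” padding are assumptions not specified in the article):

```python
import tensorflow as tf

def build_model(size=24):
    inp = tf.keras.Input(shape=(size, size, 1))
    # Step 2: two 3x3 convolutional layers, 4 channels each.
    c1 = tf.keras.layers.Conv2D(4, 3, padding="same", activation="relu",
                                name="Conv1")(inp)
    c2 = tf.keras.layers.Conv2D(4, 3, padding="same", activation="relu",
                                name="Conv2")(c1)
    # Step 3: concatenate, then one more 3x3 convolution down to 1 channel.
    merged = tf.keras.layers.Concatenate()([c1, c2])
    logits = tf.keras.layers.Conv2D(1, 3, padding="same", name="Conv3")(merged)
    model = tf.keras.Model(inp, logits)
    # Step 4: sigmoid cross-entropy on logits, so no softmax layer is needed.
    model.compile(optimizer="adam",
                  loss=tf.keras.losses.BinaryCrossentropy(from_logits=True))
    return model

model = build_model()
print(model.output_shape)  # (None, 24, 24, 1): one logit per pixel
```

Each output pixel is an independent binary logit for “is this fragment a word”, matching the per-pixel binary classification described below.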

Compared to U-Net, there are the following differences:

1) Downsampling and upsampling are dropped: the sentences to be segmented are generally short, and the resolution of the word frequency distribution map is low, so the model is simplified accordingly.

2) U-Net performs three-way classification (chunk 1, chunk 2, on the edge), while this algorithm only needs binary classification (whether a pixel is a word). The outputs therefore differ as well: U-Net outputs contiguous chunks and dividing lines, while we only need certain individual points to be positive.

The following figure shows the prediction results after training. We can see that in the output, the pixels corresponding to the words “Shanghai” (row “Shang”, column “Hai”), “Hongqiao”, and “business district” are recognized.

Feeding the landmarks in Ctrip’s landmark library into the trained model automatically segments them and discovers some new words, as shown in the following figure; despite a few bad cases, the overall accuracy is acceptable.

After these words were imported into the lexicon, the accuracy of search word segmentation increased, and the lexicon coverage of segmentation results increased. Since search segmentation generally prefers over-recall to missed recall, the industry practice is to segment aggressively for recall and resolve precision through subsequent ranking; thus, although segmentation accuracy improved, the precision of search results did not noticeably improve in the eyes of users. However, it does fix some incorrect highlighting prompts caused by segmentation errors.

4. Internal analysis of the model

To explore how the model works, we can look at the convolutional kernels of the middle layer. We first simplify the number of kernels in the model’s convolutional layers from 4 to 1; after training, the middle layer can be inspected through the TensorFlow API: model.get_layer('Conv2').__dict__. We find that the kernel of the Conv2 layer is as follows:
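A self-contained sketch of reading the kernel weights back out (here with an untrained single-kernel layer named 'Conv2', mirroring the article’s setup; `get_weights()` is a cleaner alternative to digging through `__dict__`):

```python
import tensorflow as tf

inp = tf.keras.Input(shape=(24, 24, 1))
out = tf.keras.layers.Conv2D(1, 3, padding="same", name="Conv2")(inp)
model = tf.keras.Model(inp, out)

# Returns [kernel, bias] as numpy arrays.
kernel, bias = model.get_layer("Conv2").get_weights()
print(kernel.shape)        # (3, 3, 1, 1): height, width, in_channels, out_channels
print(kernel[:, :, 0, 0])  # the 3x3 weight matrix
```

With a trained model, `kernel[:, :, 0, 0]` is the 3×3 matrix whose rows are analyzed in the text.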

We can see that the first and second rows have opposite effects in the model, corresponding to a weighted difference between the pixel’s previous row and its current row: the larger this grayscale difference, the more likely the string represented by the pixel is a word.

We can also see that the absolute value 0.04505884 in the first row, second column is relatively small, possibly because the positive first-row-minus-second-row parameters and the negative third-column-minus-second-column parameters cancel each other out.

5. Room for optimization

This article has described a very simple fully convolutional network model, which leaves plenty of room for improvement.

The first is to expand the range of features. For example, the input feature in this article is only word frequency; if left and right adjacency entropy were also included in the input features, segmentation would be more accurate.

The second is to increase the depth of the network. Model analysis showed that the first convolutional layer mainly handles cases created by zero-padded pixels, so only one layer effectively convolves the real heat values; a 3×3 kernel there can only see first-order differences and ignores the rows and columns two steps away from the current pixel. The kernel size could be enlarged, or the network deepened, to give the model a larger receptive field, though deepening the network also brings the risk of overfitting.

Finally, this model can not only supplement the lexicon to improve segmentation quality, but also serve directly as a segmentation reference: its predictions can be applied in two steps of the segmentation pipeline, candidate recall and segmentation-path scoring.

【Recommended Reading】

Exploration and practice of Ctrip’s entity linking technology

Hundreds of business scenarios, the application of semantic matching technology in Ctrip’s intelligent customer service

Ctrip’s AI inference performance automation optimization practice

The practice of the weakly supervised learning framework Snorkel in the task of “automatic labeling” of large-scale text datasets
