The DSSM twin-tower model that every recommendation system runs into

Posted May 25, 2020 · 10 min read

This article is from the OPPO Internet Basic Technology Team. Please credit the author when reprinting. You are also welcome to follow our public account, OPPO_tech, where we share OPPO's cutting-edge Internet technology and activities.

This article introduces the DSSM twin-tower model we use for business interest modeling in our project. As one of the most popular models in the recommendation field, the twin-tower architecture is widely used in the recommendation systems of major companies because it performs well and is very industry-friendly.

By building two independent sub-networks for users and items, the user embeddings and item embeddings produced by the two trained "towers" can be cached in an in-memory database. At serving time, only a similarity computation in memory is needed. The DSSM twin-tower model is thus a model that anyone working in recommendation will eventually meet.

The article is organized into the following parts:

  • Why study the DSSM twin-tower model
  • Theoretical background of the DSSM model
  • The DSSM twin-tower model in the recommendation field
  • Hands-on: the twin-tower model for advertising recommendation
  • Summary

1. Why study the DSSM twin-tower model

The OPPO tagging team mainly serves advertisers, and its goal is to deliver better ad conversion for them. Two kinds of modeling are involved here.

One is natural interest modeling: from users' on-device behavior we derive user-item associations, and by labeling different data sources we derive item-tag associations; joining the two yields user-tag associations, i.e. interest tags for each user. This is equivalent to recommending an audience to advertisers along the tag dimension.

The other is business interest modeling. Built on top of natural interest modeling, it recommends audiences to advertisers along the advertising dimension, and this is where the currently popular DSSM twin-tower model comes in.

Take the YouTube video recommendation system as an example. A typical recommendation system has two stages.

The first stage is the recall (candidate generation) model. Its job is coarse screening: from a massive video pool, select a subset of videos the user might be interested in. In terms of scale, this might mean going from tens of millions of videos down to a few hundred.

The second stage is the ranking model. Its job is to score the few hundred candidates found above more precisely, perhaps narrowing them down to a few dozen. Sorting these by score produces the user's candidate playlist, completing the video recommendation task.

The DSSM twin-tower model we use in advertising recommendation selects an audience for each advertiser along the advertising dimension: it finds millions of users for an ad out of billions of candidates, so it is a recall model.

2. Theoretical background of the DSSM model

2.1 Principle of the DSSM model

DSSM (Deep Structured Semantic Models), also called the deep semantic matching model, was first proposed in a Microsoft paper that applied it to computing semantic similarity in the NLP field.

The principle of the DSSM deep semantic matching model is simple: collect massive exposure and click logs of user search queries and documents from a search engine, then use deep networks to build a query embedding from the query-side features and a doc embedding from the doc-side features. At online inference time, semantic similarity is expressed as the cosine distance between the two semantic vectors. The resulting model can both produce low-dimensional sentence embeddings and predict the semantic similarity of two sentences.

2.2 Overall structure of the DSSM deep semantic matching model

Overall, the DSSM model can be divided into three layers: the input layer, the presentation layer, and the matching layer. The structure is shown below:

2.2.1 Input layer

The input layer converts text into low-dimensional vectors that a deep network can consume. Since Chinese and English differ substantially in NLP, the input layer handles them differently.

(1) The English case

The English input layer uses Word Hashing, which is based on letter n-grams; its main purpose is to reduce the dimensionality of the input vector. For example, take the word "boy" and mark its start and end with "#", giving "#boy#". With n set to 3, the word decomposes into the three letter trigrams "#bo", "boy", "oy#", which are then represented as an n-gram vector.
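A minimal sketch of this letter-trigram decomposition in Python (the function name is my own, not from the paper):

```python
def word_hashing_ngrams(word, n=3):
    """Split a word into letter n-grams, with '#' marking start and end."""
    padded = f"#{word}#"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

print(word_hashing_ngrams("boy"))  # ['#bo', 'boy', 'oy#']
```

Every word, seen or unseen, maps into the same fixed trigram space, which is what keeps the input dimension bounded.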

One problem with Word Hashing is collisions: two different words can end up with the same n-gram vector representation. The figure below shows the vector-space size and word-collision statistics for 2-gram and 3-gram Word Hashing on English dictionaries of different sizes:

As you can see, with 2-grams on a 500K-word dictionary, i.e. splitting words at the granularity of two letters, the vector space is compressed to 1,600 dimensions at the cost of 1,192 colliding words (a collision means two words end up with exactly the same vector representation). With 3-grams, the vector space is compressed to 30K dimensions, with only 22 colliding words. Weighing the two, the paper uses 3-gram segmentation.

(2) The Chinese case

The Chinese input layer is very different from the English one. The first problem it faces is word segmentation. If you do want to segment, jieba or Peking University's pkuseg are recommended, but many recent models skip segmentation altogether; for example, the Chinese pre-trained BERT model uses single characters as the minimum granularity.
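To illustrate the character-granularity approach (the one BERT's Chinese model takes), no segmenter is needed at all; this toy vocabulary builder is my own illustration:

```python
def char_tokenize(text):
    # Character granularity: every Chinese character is its own token,
    # which side-steps the word-segmentation problem entirely.
    return list(text)

def build_vocab(corpus):
    # Assign each distinct character an integer id for the input layer.
    vocab = {}
    for sentence in corpus:
        for ch in char_tokenize(sentence):
            vocab.setdefault(ch, len(vocab))
    return vocab

print(build_vocab(["推荐系统"]))  # {'推': 0, '荐': 1, '系': 2, '统': 3}
```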

2.2.2 Presentation layer

The DSSM presentation layer uses a BOW (bag-of-words) model, so word-order information is discarded. This is an obvious weakness: two sentences can contain exactly the same words yet mean completely different things. For example, "I love my girlfriend" and "my girlfriend loves me" may describe rather different situations.
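The weakness is easy to demonstrate: under a bag-of-words representation, the two example sentences are indistinguishable.

```python
from collections import Counter

# "I love my girlfriend" vs. "my girlfriend loves me", at character granularity:
s1 = list("我爱女朋友")
s2 = list("女朋友爱我")

# Same characters, different meanings -- BOW cannot tell them apart.
print(Counter(s1) == Counter(s2))  # True
```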

The following figure is the structure of the DSSM presentation layer:

At the bottom, the term vector is mapped by Word Hashing into a 30K-dimensional vector space. It then passes through two 300-dimensional hidden layers, and finally a 128-dimensional vector is output.
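That stack (30K word-hashing input, two 300-dimensional hidden layers, a 128-dimensional output) can be sketched with plain NumPy. The random weights here are placeholders for trained parameters; tanh is the activation the paper uses:

```python
import numpy as np

rng = np.random.default_rng(0)

# Layer sizes from the paper: 30K word-hashing input -> 300 -> 300 -> 128.
sizes = [30000, 300, 300, 128]
weights = [rng.normal(0, 0.01, (m, n)) for m, n in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros(n) for n in sizes[1:]]

def tower_forward(x):
    """One tower of the presentation layer: fully connected layers + tanh."""
    h = x
    for W, b in zip(weights, biases):
        h = np.tanh(h @ W + b)
    return h  # 128-dimensional semantic vector

x = np.zeros(30000)
x[[5, 17, 123]] = 1.0          # toy input: three active letter trigrams
print(tower_forward(x).shape)  # (128,)
```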

2.2.3 Matching layer

Now that query and doc have each been converted into a 128-dimensional semantic vector, how do we compute their semantic similarity? We simply take the cosine of the angle between the two vectors:

R(Q, D) = cos(q, d) = (q · d) / (|q| × |d|)
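In code, with NumPy:

```python
import numpy as np

def cosine_similarity(q, d):
    """cos(q, d) = (q . d) / (|q| * |d|), ranging over [-1, 1]."""
    return float(np.dot(q, d) / (np.linalg.norm(q) * np.linalg.norm(d)))

q = np.array([0.2, 0.5, 0.1])
d = np.array([0.4, 1.0, 0.2])   # same direction as q
print(cosine_similarity(q, d))  # 1.0 (up to floating-point error)
```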

2.3 The advantages and disadvantages of the DSSM model

First, the advantages of the DSSM model:

  • It solves the dictionary-explosion problem of methods such as LSA, LDA, and autoencoders, reducing computational complexity, since the number of English words far exceeds the number of letter n-grams;
  • For Chinese, using characters as the finest granularity lets the model reuse each character's semantics, reduces the dependence on word segmentation, and improves generalization;
  • Letter n-grams handle new words well, giving strong robustness;
  • It optimizes the semantic-embedding mapping with a supervised objective;
  • It eliminates manual feature engineering;
  • Supervised training gives higher precision. Traditional input layers map words with pre-trained embeddings (e.g. Word2vec word vectors) or topic models (e.g. LDA topic vectors) and then concatenate or sum the per-word vectors; since Word2vec and LDA are trained unsupervised, they introduce error into the model.

Next, the shortcomings of the DSSM model:

  • Word Hashing can cause word collisions;
  • The bag-of-words model discards context and word-order information, which is why variants such as CNN-DSSM and LSTM-DSSM exist;
  • Search-engine ranking is determined by many factors, and higher-ranked docs are clicked more often simply because users see them first; defining positive and negative samples by clicks alone is therefore noisy, and the model is hard to converge;
  • The effect is hard to control. Being end-to-end saves manual feature engineering, but it also means the model's behavior is not easily controllable.

3. The DSSM twin-tower model in the recommendation field

3.1 From NLP across into recommendation

The DSSM deep semantic matching model was first applied to semantic-similarity tasks in NLP. Since semantic matching is itself a ranking problem, which coincides with the recommendation scenario, the model was naturally carried over into the recommendation field. There, DSSM uses two relatively independent networks to build a user embedding from user-side features and an item embedding from item-side features, which is why it is called a twin-tower model.

3.2 The simple DSSM twin-tower model (2015)

The biggest feature of the twin-tower model is that the user and item sides are two independent sub-networks, which makes it very industry-friendly: the two towers can be cached separately, and online prediction only requires a similarity computation in memory. Below is the simple DSSM twin-tower structure from 2015:

3.3 Baidu's twin-tower model

Baidu's twin-tower model uses complex networks to embed user-side features and ad-side features separately, forming two independent towers; user features and ad features do not interact before the final cross layer. The idea is to bring in many features while training the complex networks offline, then store the resulting user embeddings and item embeddings in an in-memory database such as redis. Online prediction then uses a lightweight model such as LR or a shallow NN, or simply a convenient distance computation. This is also how many large companies in the industry build their recommendation systems.
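A toy sketch of this serving pattern, using a plain dict in place of redis (key names and dimensions are illustrative; in production the vectors would be serialized, e.g. via ndarray.tobytes, and written with redis SET):

```python
import numpy as np

embedding_store = {}  # stand-in for an in-memory database such as redis

def cache_embedding(key, vec):
    # Offline: write a trained tower's output under a stable key.
    embedding_store[key] = np.asarray(vec, dtype=np.float32)

cache_embedding("user:1001", [0.1, 0.3, 0.5])
cache_embedding("ad:42", [0.2, 0.1, 0.4])

# Online: prediction reduces to a similarity between two cached vectors.
u, a = embedding_store["user:1001"], embedding_store["ad:42"]
score = float(np.dot(u, a) / (np.linalg.norm(u) * np.linalg.norm(a)))
```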

3.4 Google's twin-tower model (2019)

In 2019, Google published its own twin-tower model. The core idea of the paper: in a large-scale recommendation system, use a twin-tower model over user-item pairs to learn the association between the [user, context] vector and the [item] vector. For large-scale streaming data, the paper proposes an in-batch softmax loss together with a streaming frequency-estimation method to better adapt to the item distribution.

The paper uses the twin-tower model to build a YouTube video recommendation system: the user-side tower builds a user embedding from the user's viewing history, the video-side tower builds a video embedding from video features, and the two towers are separate networks.
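The in-batch softmax idea can be sketched as follows: for each (user, item) pair in a batch, the other items in the same batch act as negatives. The temperature value is illustrative, and the paper's streaming frequency-based logit correction is omitted in this sketch:

```python
import numpy as np

def in_batch_softmax_loss(user_emb, item_emb, temperature=0.05):
    """For each (user_i, item_i) pair, treat the other items in the batch
    as negatives and apply a softmax over cosine-similarity logits."""
    u = user_emb / np.linalg.norm(user_emb, axis=1, keepdims=True)
    v = item_emb / np.linalg.norm(item_emb, axis=1, keepdims=True)
    logits = (u @ v.T) / temperature                    # (B, B) similarities
    logits -= logits.max(axis=1, keepdims=True)         # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))          # diagonal = positives

rng = np.random.default_rng(0)
loss = in_batch_softmax_loss(rng.normal(size=(8, 32)), rng.normal(size=(8, 32)))
```

When every user is perfectly aligned with its own item and orthogonal to the rest, the loss approaches zero; random towers give a large loss, which is what training then drives down.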

4. Hands-on: the twin-tower model for advertising recommendation

4.1 The advertising recommendation business scenario

Everything above builds up to this section: the DSSM twin-tower model for our advertising recommendation. In our advertising business, the user side takes the user's historical behavior on ads (clicks, downloads, payments, etc.) as input and produces a fixed-length user embedding; similarly, the ad side produces an ad embedding of the same length. Both are stored in the redis in-memory database.

At online inference time, given an ad, we compute its similarity against the full set of users, find the subset of users with the "closest distance", and serve the ad to that audience, completing the advertising recommendation task.
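A NumPy sketch of this retrieval step (sizes are toy values; at real scale an approximate nearest-neighbor index would replace the brute-force scan):

```python
import numpy as np

def top_n_users(ad_emb, user_embs, n):
    """Indices of the n users closest to the ad by cosine similarity."""
    u = user_embs / np.linalg.norm(user_embs, axis=1, keepdims=True)
    scores = u @ (ad_emb / np.linalg.norm(ad_emb))
    top = np.argpartition(-scores, n - 1)[:n]  # unordered top n
    return top[np.argsort(-scores[top])]       # sort those n by score

rng = np.random.default_rng(1)
users = rng.normal(size=(1000, 32))            # toy full user base
ad = rng.normal(size=32)
print(top_n_users(ad, users, n=5))             # the 5 closest user indices
```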

4.2 Overall structure of the model

The overall structure of the model, shown in the following figure, is again divided into three layers: the input layer, the presentation layer, and the matching layer.

4.2.1 Input layer

Training is split across two different "towers", which are in fact two different neural networks. One tower generates the user embedding: it takes user feature data as input, where user features consist of dense features and sparse features; the dense features are one-hot encoded, the sparse features are embedded into a low-dimensional space (64 or 32 dimensions), and the results are then concatenated. The ad side is handled analogously.
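A sketch of the sparse-feature handling on one side, with hypothetical sizes (a 10,000-id vocabulary embedded to 32 dimensions) and an untrained random embedding table standing in for learned parameters:

```python
import numpy as np

rng = np.random.default_rng(0)

VOCAB, EMB_DIM = 10_000, 32
embedding_table = rng.normal(0, 0.01, (VOCAB, EMB_DIM))  # learned in training

def build_input(dense_vec, sparse_ids):
    """Embed each sparse id, pool by averaging, and concatenate with the
    already-encoded dense features to form the tower's input."""
    pooled = embedding_table[sparse_ids].mean(axis=0)
    return np.concatenate([np.asarray(dense_vec, dtype=np.float32), pooled])

x = build_input([0.5, 1.0, 0.0], [17, 256, 9001])
print(x.shape)  # (35,): 3 dense dims + 32 embedding dims
```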

As for which features go in: it is not a matter of what you want, but of what you have. The truly complex part of the whole project is this feature work, which I will not go into here.

4.2.2 Presentation layer

Once the concatenated features are obtained, each side feeds them into its own deep network. The user features and ad features are converted into fixed-length vectors by their respective fully connected layers, yielding a user embedding and an ad embedding of the same dimension. The number and sizes of layers inside each tower may differ, but the output dimensions must match so that the matching layer can operate on them. In our project, both the user embedding and the ad embedding are 32-dimensional.

4.2.3 Matching layer

After training, you obtain the user embeddings and ad embeddings and store them in an in-memory database such as redis. To recommend an audience for a particular ad, compute the cosine similarity between that ad's embedding and every user's embedding, and select the closest N users as the ad's audience, completing the advertising recommendation task. During training, the cosine score is passed through a sigmoid and compared with the true label via logloss to check whether the network has converged; model evaluation mainly uses the AUC metric.
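The training-time objective described here (cosine score through a sigmoid, then logloss against the click label) looks like this in NumPy:

```python
import numpy as np

def logloss_from_cos(cos_scores, labels, eps=1e-12):
    """Squash cosine scores through a sigmoid, then binary cross-entropy."""
    p = 1.0 / (1.0 + np.exp(-np.asarray(cos_scores, dtype=np.float64)))
    y = np.asarray(labels, dtype=np.float64)
    return float(-np.mean(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps)))

# Positive pairs should score high and negatives low for the loss to shrink.
loss = logloss_from_cos([0.9, -0.7, 0.2], [1, 0, 1])
```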

To summarize, this section described how we use the DSSM twin-tower model for advertising recommendation. The model is divided into an input layer, a presentation layer, and a matching layer: the input layer processes the data into features; the presentation layer turns them into a user embedding and an ad embedding via deep networks; and the matching layer performs the ad-audience matching.

4.3 Some thoughts

There are many variants of the DSSM twin-tower model, such as CNN-DSSM and LSTM-DSSM. Our presentation layer uses a two-layer fully connected network as the feature extractor, while the Transformer is now widely regarded as the strongest feature extractor in deep learning. Could a Transformer be added in the future?

5. Summary

This article introduced the DSSM twin-tower model we use for business interest modeling. As one of the most popular models in the recommendation field, its biggest strengths are good performance and industry-friendliness, which is why it is widely used in the recommendation systems of major companies.

By building two independent sub-networks for users and items, the user embeddings and item embeddings from the two trained towers are cached in an in-memory database; online prediction then only requires a similarity computation in memory.

We first introduced the theory of the DSSM semantic matching model, originally applied to semantic-similarity tasks in NLP; then, since matching is a ranking problem, it was brought into the recommendation field, from the simple 2015 twin-tower model to later variants; finally, we walked through applying the DSSM twin-tower model to the advertising recommendation scenario.

6. References

1. Learning Deep Structured Semantic Models for Web Search using Clickthrough Data

2. Sampling-Bias-Corrected Neural Modeling for Large Corpus Item Recommendations
