MLIM: A CTR prediction model describing evolution law of user interest

Abstract

With the advent of the digital economy era, business systems such as web advertising and recommendation systems need to predict the click through rate (CTR) of items. However, current CTR prediction research does not mine user behavior sufficiently, so the resulting user interest representation lacks accuracy. In this paper, we propose a CTR prediction model, called MLIM, which deeply mines the evolution law of user interest. Specifically, we first use BiGRU to obtain the low-level user interest representation in the interest extraction layer. In the interest evolution layer, we then model user interest collaboratively with an attention mechanism, BiGRU and a sliding time window to obtain a multi-level user interest representation with richer information, which helps improve the accuracy of CTR prediction. Comprehensive experiments on two real datasets show that the proposed model achieves better performance than the mainstream baselines that integrate user behavior analysis.

Keywords

User behavior modeling, interest representation, CTR, personalized recommendation, precision marketing

1. Introduction

As a prominent representative of the digital economy, e-commerce is growing rapidly and has become an important driving force for the innovative development of China’s economy and society. E-commerce provides a platform for the vigorous development of online advertising. Online advertising is advertising placed on online media, and is also an information dissemination activity among advertisers, e-commerce media and users by means of the Internet. In this activity, brand image, goods, services and other advertising-related information are spread to the target audience through the Web.

The predicted CTR is the key reference index for online advertising placement. CTR prediction refers to predicting the probability of a user clicking on an advertisement (item) in a specific context [1], and is usually regarded as a binary classification problem: given information about the user, the advertisement, and the context, predict the probability that the current user clicks on the advertisement.

At present, researchers in academia and industry generally use machine learning methods to learn, from historical data, the patterns behind users’ clicks on advertisements. Among the traditional methods, logistic regression and factorization machines are widely used [2–6]. As deep learning has worked well in graphics and images, speech recognition and natural language processing, some researchers have also applied it to CTR prediction. In 2016, FNN [7], SNN [7] and other models appeared, and further research and improvement produced more complex CTR models such as PNN [8], ONN [9] and NFM [10]. These methods used multiple neural network components to learn the deep cross features in data with a single-channel structure. Since these methods could not fully learn the multi-level cross features in advertising data, some researchers went on to propose Wide & Deep Learning [11], DeepFM [12], Deep & Cross Network [13], xDeepFM [14] and other CTR models with a dual-channel structure. Most of the above methods focused on extracting cross features from different levels, and paid little attention to user behavior analysis and user interest preferences. Therefore, some researchers proposed CTR models that include user behavior modeling, such as DIN [15], DIEN [16] and DSIN [17]. However, these methods still had some problems, such as insufficient exploration of the evolution law of user interest, and a lack of foresight about the future and of guidance for users.

In view of this, this study proposes a novel CTR prediction model, named MLIM (Multi-Level Interest Modeling), which is based on the large-scale user behavior data in the recommendation scenario and can fully mine the potential user preferences. Specifically, we design an interest extraction layer, which uses a BiGRU [18] network to model the forward and backward dependencies in the behavior sequence, so as to obtain the low-level user interest representation. On the basis of the low-level user interest representation, we also design an interest evolution layer, which continues to extract user interest using an attention mechanism, a BiGRU network and a sliding time window [19] to obtain a multi-level user interest representation reflecting the evolution law. This study can not only enhance the user experience, but also bring more traffic and economic benefits to online media. At the same time, it provides a theoretical basis and application ideas for digital marketing.

The main contributions of this paper are summarized as follows:

∙ We propose a novel CTR prediction model describing the evolution law of user interest. The model adopts a two-stage structure to model user behavior, focuses on describing the evolution law of interest, and obtains an interest representation containing rich information, which helps improve the accuracy of CTR prediction.

∙ In order to effectively capture the interest evolution law from user behaviors, we design a dedicated interest evolution layer. First, we use an attention mechanism to model the correlation between each user interest point and the current target item, and obtain the long-term user interest representation with diversity. Second, we use BiGRU to model the potential correlation between interest points to obtain the user interest representation with relevance. Third, we use a sliding time window to model the user interest in different time stages, and obtain the short-term user interest representation with concentration and periodicity, which provides some foresight about the future and guidance for users.

∙ We conduct comprehensive experiments on real datasets, and the results show that the proposed model achieves better results than current advanced and mainstream CTR models that incorporate user behavior analysis. In addition, to verify the effectiveness of the proposed model, we study in depth how key parameters and the model structure influence performance.

The rest of this paper is organized as follows. Section 2 introduces the related work. Section 3 describes the MLIM model and its architecture in detail. Section 4 presents comprehensive experiments that verify the performance of the proposed model. Section 5 summarizes the study and discusses future work.

2. Related work

Due to its good nonlinear characteristics and multi-layer nature, a deep neural network can combine features automatically and can, in theory, approximate the underlying structure of the data arbitrarily closely. Some researchers have used it for CTR prediction and recommendation [7–17,20–34].

Jiang et al. proposed a CTR prediction model DBNLR [20], which used a deep belief network to learn potential association relationships in data, obtained the high-order feature representation, and then used logistic regression to calculate the click through rate. Zhang et al. proposed two CTR prediction models, FNN and SNN [7], both of which used a deep neural network to automatically learn effective patterns from categorical features; they differed in the structure of the embedding layer that processes the input. Qu et al. proposed a novel product-based CTR prediction model, PNN [8]. Compared with FNN, it added an inner product layer on top of the embedding layer and captured rich low-order feature interactions through inner products. On the basis of PNN, Yang et al. proposed the ONN model [9], which used operation-aware embedding to replace the ordinary embedding layer. For each feature field, sufficient coefficients should be trained in the operation-aware embedding layer to generate enough intermediate result vectors for subsequent inner product operations. He et al. proposed a novel CTR prediction model for sparse data, NFM [10], which used a Bi-Interaction layer to calculate the quadratic term in FM after the embedding vectors of the categorical data were obtained at the embedding layer. Then, the obtained results were input into the deep neural network to capture the nonlinear relationship between features, and the first-order terms and offset term of FM were added to the output of the last layer of the deep neural network. Jiang also proposed an intelligent recommendation approach for online advertising [21], which first used an embedding mapping network to process sparse input data, and then used FM to model low-order feature interactions. On this basis, a stacked denoising autoencoder was adopted to learn high-order feature interactions, and finally logistic regression was adopted to calculate the click through rate.

Deep CTR models mostly follow a paradigm similar to embedding & MLP. Based on this basic paradigm, more and more models focus on learning feature interactions. Cheng et al. proposed the Wide & Deep Learning model [11], in which the Wide model was used to model low-order feature interactions to ensure the memorization ability of the model, while the Deep model was used to model high-order feature interactions to ensure the generalization ability of the model. Since Wide & Deep Learning required manual feature engineering, Guo et al. proposed an end-to-end learning model, DeepFM [12], which integrated a factorization machine and an MLP into a new neural network architecture, in which the Wide and Deep parts shared the same input and no special feature engineering was required. Wang et al. proposed the Deep & Cross Network model [13], which introduced a separate cross network to explicitly perform feature crossing in each layer while retaining the advantages of MLP. On the basis of DeepFM, Lian et al. proposed the xDeepFM model [14], which designed a compressed interaction network structure named CIN for explicit feature interaction modeling, providing feature combination capability together with MLP. With the success of the attention mechanism in natural language processing, some researchers began to introduce it into deep recommendation models [23–28].

The above CTR prediction models can effectively fit the complex nonlinear relationships in the data and obtain better prediction results because they adopt deep structures to learn high-order feature interactions. However, they ignore the mining of user behavior information. In a recommendation system, in addition to user, item and context information, there is also a large amount of user behavior data. Making full use of these behavior data helps to understand the user’s intent accurately, obtain a high-quality user interest representation, and improve the accuracy of recommendations [15–17].

Zhou et al. proposed a deep interest network for CTR prediction, DIN [15], which used the deep network with attention to adaptively learn the user interest representation from the historical behavior related to the target advertisement. Zhou et al. continued to improve DIN and proposed a deep interest evolution network model DIEN [16]. In this model, the interest extraction layer was designed to obtain interest points from historical behavior sequences and the interest evolution layer was designed to capture the interest evolution process related to the target advertisement. Feng et al. proposed a deep session interest network model DSIN [17], which used a self-attention module with bias coding to extract the user’s interest features in each session, and used Bi-LSTM to capture the interaction and evolution of the user’s interest in multiple historical sessions.

Some scholars studied techniques that combine users’ long-term preference and short-term intention [29–31], and achieved good recommendation results. Bogina et al. introduced dwell time and a recurrent neural network into session-based recommendation [30], taking into account how long the user stays on an item in the session. The longer the dwell time, the more interested the user was assumed to be. Yu et al. proposed an adaptive personalized recommendation model, SLi-Rec, based on long-term and short-term preferences [31].

The above CTR models with user preference modeling learn the user interest representation from user behavior and input it into a deep neural network together with user, advertisement, context and other information for CTR prediction, achieving good results. However, these methods still have some problems, such as a poor representation of the evolution law of user interest, and a lack of foresight about the future and of guidance for users. In this study, we fully mine user preferences from a large number of user behaviors and accurately describe the properties of the evolution law of user interest, namely relevance, diversity, concentration and periodicity, so as to improve the accuracy of CTR prediction.

3. MLIM model

Interest points of the same user often show diversity, and different interest points have different effects on CTR prediction results. A user’s interest shows a certain internal correlation over time, and also becomes concentrated over short periods due to particular events. Using the strengths of the attention mechanism, the GRU network and the sliding time window [19], and drawing on the ideas of [11,16], we propose the MLIM model, which can fully mine user preferences from behavior sequence data, obtain a user interest representation reflecting relevance, diversity, concentration and periodicity, and improve the accuracy of CTR prediction.

3.1. MLIM model architecture

The architecture of MLIM model is shown in Fig. 1.

Fig. 1. MLIM model architecture.

Figure 1 shows that the proposed model is a stacked structure composed of the embedding layer, the user interest representation module, the MLP network and the full connection layer. The role of each module in CTR prediction modeling is as follows:

(1) Embedding layer: corresponds to the olive green part at the bottom of Fig. 1, which is used to transform input various ad data features into low-dimensional dense vectors, as detailed in Section 3.2.

(2) User interest representation module: contains the interest extraction layer and interest evolution layer, corresponding to the dark yellow part in the middle left part of Fig. 1 and the blue, pink and grass green parts in the middle part respectively, which are used to mine user preferences from behavioral data and obtain interest representation. See Section 3.3 and 3.4 for detailed descriptions.

(3) MLP network: corresponding to the dark pink part in the upper part of Fig. 1, it is a deep neural network composed of multiple perceptrons [35], which is used to implicitly learn feature interactions and obtain high-order abstract features containing complex information. See Section 3.5 for details.

(4) Full connection layer: corresponding to the white part at the top of Fig. 1, it is the full connection layer with an output node, which is used to calculate the predicted click through rate. See Section 3.6 for details.

As can be seen from the above, due to the introduction of user interest representation module, MLIM model can obtain accurate user interest representation describing the evolution law from user behavior. At the same time, by using the strong nonlinear fitting ability of MLP, it can learn the potential relationship in complex data, and improve the accuracy of CTR prediction to a certain extent. In the following Sections, we will introduce each submodule and its role in the CTR prediction modeling.

3.2. Feature representation based on embedding layer

In this study, the main features used are: user profile, user behavior, advertisement and context. Each of these four types of features contains multiple fields. For example, the fields of the user profile include gender, age, etc.; the field of user behavior contains the list of IDs of the goods the user has accessed; the fields of the context include time, etc.; and the fields of the advertisement include ad ID, store ID, etc. It is worth noting that an advertisement is also regarded as a goods item.

We use an embedding layer to transform the above four types of features into embedding vectors that are convenient for deep learning model processing. As shown in Fig. 2, the input of embedding layer is the above four types of features composed of one-hot coding of categorical fields and numerical fields. The output is the obtained user behavior sequence items embedding vectors, user profile embedding vector, advertisement embedding vector and context embedding vector.

Fig. 2. Embedding layer architecture.

Each categorical field feature can be encoded into a one-hot vector, for example, the “female” feature in the user profile is encoded as $[0, 1]$ , and each numerical feature is directly concatenated after normalization. The one-hot vectors and numerical features from the user profile, user behavior, advertisement and context fields are concatenated to form the input vectors $x_{u}, x_{b}, x_{a}, x_{c}$ . It is worth noting that $x_{b} = (b_{1}, b_{2}, \dots b_{t}, \dots, b_{T}) \in R^{K \times T}$ , where $b_{t}$ represents the one-hot vector corresponding to the t-th behavior, T is the number of items in the user’s historical behaviors, and K is the total number of items that the user can click on.

In the embedding layer, we adopt look_up table embedding mapping network to map the high-dimensional one-hot vector to the low-dimensional space $R^{k} (k ≪ m)$ . The network is a fully connected bipartite graph, and its structure is shown in Fig. 3.

Fig. 3. Look_up table embedding mapping network.

As can be seen from Fig. 3, each field feature in the mapping network corresponds to a trainable look_up table matrix $M \in R^{m \times k}$ , which defines a linear mapping from $R^{m}$ to $R^{k}$ : $x \mapsto x M$ . Specifically, let the matrix $M_{U}$ be the user look_up table matrix, and $x_{i}$ be the one-hot vector of a categorical feature of the user profile, then its embedding vector $v_{i}$ can be obtained from $x_{i} M_{U}$ , as described in Eq. (1): $\begin{array}{l} (1) & v_{i} = x_{i} M_{U} = \begin{bmatrix} 0 & 0 & 1 & 0 & 0 \end{bmatrix} \times \begin{bmatrix} 0.112 & 0.226 & 0.012 \\ 0.213 & 0.016 & 0.029 \\ 0.033 & 0.107 & 0.048 \\ 0.311 & 0.143 & 0.021 \\ 0.124 & 0.045 & 0.128 \end{bmatrix} = \begin{bmatrix} 0.033 & 0.107 & 0.048 \end{bmatrix} \end{array}$

After each categorical field feature is transformed by the above mapping network, a k-dimensional embedding vector is generated, and the embedding vectors corresponding to all F categorical features are concatenated into an $F \times k$ dimensional embedding vector, namely the user embedding vector $e_{u} = [v_{1}; v_{2}; \dots; v_{i}; \dots; v_{F}]$ . Similarly, the advertisement embedding vector $e_{a}$ and the context embedding vector $e_{c}$ can be obtained. In particular, for behavior feature $b_{t}$ , if $b_{t} [j_{t}] = 1$ , its corresponding embedding vector is $e_{t}$ , and the ordered list of item embedding vectors of a user’s behaviors can be represented as $e_{b} = (e_{1}, e_{2}, \dots e_{t}, \dots, e_{T})$ .
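The look-up in Eq. (1) can be checked numerically. The following is a minimal sketch in NumPy (an assumption made for illustration; the experiments in this paper use TensorFlow). It reproduces the example matrix and shows that multiplying a one-hot vector by the look_up table is equivalent to selecting one row. The helper name `embed_fields` and the second one-hot vector are hypothetical.

```python
import numpy as np

# Look_up table from Eq. (1): m = 5 categories, k = 3 embedding dimensions.
M = np.array([
    [0.112, 0.226, 0.012],
    [0.213, 0.016, 0.029],
    [0.033, 0.107, 0.048],
    [0.311, 0.143, 0.021],
    [0.124, 0.045, 0.128],
])

x_i = np.array([0, 0, 1, 0, 0])  # one-hot vector with the third category active

v_matmul = x_i @ M               # v_i = x_i M, as written in Eq. (1)
v_lookup = M[np.argmax(x_i)]     # equivalent row selection, no multiplication


def embed_fields(one_hots, tables):
    """Hypothetical helper: embed each field and concatenate, as done for e_u."""
    return np.concatenate([x @ T for x, T in zip(one_hots, tables)])


# Two illustrative fields that reuse the same table give an (F * k) = 6-dim vector.
e_u = embed_fields([x_i, np.array([1, 0, 0, 0, 0])], [M, M])
```

In practice the row-selection form is preferred because it avoids materializing the high-dimensional one-hot vector.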

3.3. Low-level user interest representation based on BiGRU

In the embedding layer, we have obtained the ordered items embedding vector of user behavior. In the following interest extraction layer, we extract the low-level user interest representation from the items embedding vector. The architecture of the interest extraction layer is shown in Fig. 4. The input of this layer is the ordered items embedding vector $e_{b} = (e_{1}, e_{2}, \dots e_{t}, \dots, e_{T})$ , and the output is the obtained low-level user interest representation.

Fig. 4. Interest extraction layer architecture.

Because the user behavior sequence changes gradually over time, an RNN is well suited to modeling the change process and state of such sequence data. As a variant of RNN, GRU (Gated Recurrent Unit) alleviates the vanishing gradient problem and trains faster. Therefore, in the interest extraction layer, we adopt GRU to extract the low-level user interest representation. The structure of the GRU network is shown in Fig. 5.

Fig. 5. GRU structure.

The GRU network can be formally described as follows, $\begin{array}{l} (2) & z_{t} = σ (W^{z} \cdot [i_{t}, h_{t - 1}]), \\ (3) & r_{t} = σ (W^{r} \cdot [i_{t}, h_{t - 1}]), \\ (4) & {\tilde{h}}_{t} = tanh (W^{h} \cdot [i_{t}, r_{t} \circ h_{t - 1}]), \\ (5) & h_{t} = (1 - z_{t}) \circ h_{t - 1} + z_{t} \circ {\tilde{h}}_{t}, \end{array}$ where σ is the sigmoid activation function, ∘ is the element-wise product, $i_{t}$ is the input vector of the GRU with $i_{t} = e_{b} [t]$ , the t-th behavior in the user behavior sequence, $[\cdot, \cdot]$ denotes vector concatenation, $W^{z}, W^{r}, W^{h}$ are the corresponding weight matrices, $z_{t}$ and $r_{t}$ are the outputs of the update gate and reset gate respectively, and $h_{t}$ is the t-th hidden state (interest point).
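To make Eqs. (2)–(5) concrete, the following is a minimal NumPy sketch of one GRU step and of a pass over a behavior sequence. It follows the equations as written, so bias terms are omitted. The function names, the zero initial state and the sizes are assumptions for illustration, not the settings used in the experiments.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def gru_step(i_t, h_prev, W_z, W_r, W_h):
    """One GRU step following Eqs. (2)-(5); no bias terms, as in the text."""
    concat = np.concatenate([i_t, h_prev])
    z_t = sigmoid(W_z @ concat)                                   # Eq. (2), update gate
    r_t = sigmoid(W_r @ concat)                                   # Eq. (3), reset gate
    h_tilde = np.tanh(W_h @ np.concatenate([i_t, r_t * h_prev]))  # Eq. (4), candidate state
    return (1.0 - z_t) * h_prev + z_t * h_tilde                   # Eq. (5), new hidden state


def gru_forward(e_b, W_z, W_r, W_h, n_hidden):
    """Run the GRU over a behavior sequence e_b of shape (T, n_E)."""
    h = np.zeros(n_hidden)  # assumed zero initial state
    states = []
    for i_t in e_b:
        h = gru_step(i_t, h, W_z, W_r, W_h)
        states.append(h)
    return np.stack(states)  # shape (T, n_hidden), one interest point per step


rng = np.random.default_rng(0)
T, n_E, n_H = 6, 4, 3  # illustrative sizes
W_z, W_r, W_h = (rng.normal(scale=0.1, size=(n_H, n_E + n_H)) for _ in range(3))
e_b = rng.normal(size=(T, n_E))
H = gru_forward(e_b, W_z, W_r, W_h, n_H)
```

Because each new state is a convex combination of the previous state and a tanh output, every entry of the hidden states stays strictly inside (-1, 1) when the initial state is zero.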

Because recurrent neural networks gradually forget earlier inputs, the last state loses part of the information in the sequence. We use BiGRU to overcome this defect. As shown in Fig. 6, BiGRU contains two subnetworks that process the behavior sequence from left to right and from right to left respectively, corresponding to forward and backward transmission. The forward and backward outputs are combined by concatenation, and the output of the BiGRU network is shown in Eq. (6), $\begin{array}{l} (6) & h_{i} = [{\vec{h}}_{i}; {\overset{\leftarrow}{h}}_{i}], \end{array}$ where ${\vec{h}}_{i}$ is the representation vector of the i-th interest point output by the hidden layer of the forward GRU, ${\overset{\leftarrow}{h}}_{i}$ is the representation vector of the i-th interest point output by the hidden layer of the backward GRU, and [] denotes the vector concatenation from head to tail.

Fig. 6. BiGRU network.
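The bidirectional combination in Eq. (6) can be sketched as follows. This is a minimal NumPy illustration under stated assumptions: the two directions have separate, randomly initialized weights, the initial states are zero, and the names `gru_states`, `bigru_states` and `init_params` are hypothetical.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def gru_states(seq, W_z, W_r, W_h, n_hidden):
    """Hidden states of a bias-free GRU (Eqs. (2)-(5)) over seq of shape (T, n_E)."""
    h, out = np.zeros(n_hidden), []
    for i_t in seq:
        z = sigmoid(W_z @ np.concatenate([i_t, h]))
        r = sigmoid(W_r @ np.concatenate([i_t, h]))
        h_tilde = np.tanh(W_h @ np.concatenate([i_t, r * h]))
        h = (1.0 - z) * h + z * h_tilde
        out.append(h)
    return np.stack(out)


def bigru_states(seq, fwd_params, bwd_params, n_hidden):
    """Concatenate forward and backward states per step, as in Eq. (6)."""
    h_fwd = gru_states(seq, *fwd_params, n_hidden)
    # The backward GRU reads the sequence right to left; its outputs are flipped
    # back so that row t of both arrays refers to the same behavior.
    h_bwd = gru_states(seq[::-1], *bwd_params, n_hidden)[::-1]
    return np.concatenate([h_fwd, h_bwd], axis=1)  # shape (T, 2 * n_hidden)


rng = np.random.default_rng(0)
T, n_E, n_H = 5, 4, 3  # illustrative sizes


def init_params():
    return [rng.normal(scale=0.1, size=(n_H, n_E + n_H)) for _ in range(3)]


fwd_params, bwd_params = init_params(), init_params()
e_b = rng.normal(size=(T, n_E))
H = bigru_states(e_b, fwd_params, bwd_params, n_H)
```

With concatenation, each interest point has twice the hidden dimension of a single GRU, which later layers have to account for.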

We also use an auxiliary loss to supervise the learning of the interest state $h_{t}$ (see the dotted box in the right part of Fig. 4). Consider N pairs of behavior embedding vector sequences: $\{e_{b}^{i}, {\hat{e}}_{b}^{i}\} \in D_{B}, i \in \{1, 2, \dots, N\}$ , where $e_{b}^{i} \in R^{T \times n_{E}}$ denotes the positive sample sequence with click behavior, ${\hat{e}}_{b}^{i} \in R^{T \times n_{E}}$ denotes the negative sample sequence without click behavior, T is the number of historical behaviors, $n_{E}$ is the embedding vector dimension, $e_{b}^{i} [t] \in G$ is the embedding vector of the t-th item that the user clicks, G denotes the whole item set, and ${\hat{e}}_{b}^{i} [t] \in G - e_{b}^{i} [t]$ is the embedding vector of an item sampled from the item set excluding the item clicked by user i in step t. The auxiliary loss is shown in Eq. (7), $\begin{array}{l} (7) & L_{aux} = - \frac{1}{N} \sum_{i = 1}^{N} \sum_{t} ( log σ (h_{t}^{i}, e_{b}^{i} [t + 1]) + log (1 - σ (h_{t}^{i}, {\hat{e}}_{b}^{i} [t + 1])) ), \end{array}$ where $σ (x_{1}, x_{2}) = \frac{1}{1 + exp (- [x_{1}, x_{2}])}$ is the sigmoid activation function, and $h_{t}^{i}$ is the hidden state corresponding to the t-th behavior of user i obtained through BiGRU.
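A minimal NumPy sketch of the auxiliary loss in Eq. (7) is given below. It rests on two assumptions that the text does not fix: the score inside σ is taken to be the inner product of its two arguments, and hidden states and item embeddings share the same dimension d. The small constant `eps` is added only for numerical safety.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def auxiliary_loss(H, E_pos, E_neg, eps=1e-8):
    """Eq. (7) for a batch of N users.

    H, E_pos, E_neg: arrays of shape (N, T, d). Assumption: the score inside
    sigma is an inner product, so hidden states and embeddings share dimension d.
    """
    h = H[:, :-1, :]            # h_t for t = 1..T-1
    pos_next = E_pos[:, 1:, :]  # clicked item at step t + 1
    neg_next = E_neg[:, 1:, :]  # sampled non-clicked item at step t + 1
    p_pos = sigmoid(np.sum(h * pos_next, axis=-1))
    p_neg = sigmoid(np.sum(h * neg_next, axis=-1))
    per_user = np.sum(np.log(p_pos + eps) + np.log(1.0 - p_neg + eps), axis=1)
    return -np.mean(per_user)   # -(1/N) * sum over users


rng = np.random.default_rng(0)
N, T, d = 2, 5, 4  # illustrative sizes
E_pos = rng.normal(size=(N, T, d))
E_neg = rng.normal(size=(N, T, d))

# All-zero hidden states score every item at 0.5, so the loss is
# approximately 2 * (T - 1) * ln 2.
loss_uninformed = auxiliary_loss(np.zeros((N, T, d)), E_pos, E_neg)
```

A hidden state that points toward the next clicked item and away from the sampled negative lowers this loss, which is the supervision signal the interest extraction layer receives.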

$L_{aux}$ is added to the objective function in Section 3.6 as the auxiliary loss function and takes part in the parameter optimization of the whole CTR prediction model. With the help of the auxiliary loss, each hidden state $h_{t}$ has sufficient expressive power to represent the user’s interest state after taking action $i_{t}$ , and the T hidden states together form the low-level interest representation sequence $(h_{1}, h_{2}, \dots, h_{T})$ output by the interest extraction layer.

3.4. Multi-level user interest representation based on multi-components collaborative modeling

Although the low-level user interest representation sequence is obtained in the interest extraction layer, each interest point in the sequence has the same influence on the final output. And low-level user interest representation cannot fully reflect the diversity, concentration and periodicity of interest. Therefore, in the interest evolution layer, we innovatively adopt attention mechanism, BiGRU and sliding time window to continue modeling, as shown in Fig. 7. The input of this layer is T interest points $(h_{1}, h_{2}, \dots, h_{T})$ output by the interest extraction layer, and the output is the multi-level user interest representation with richer information.

Fig. 7. Interest evolution layer architecture.

∙ Long term interest representation based on attention mechanism

Firstly, in order to model the influence of different interest points on CTR prediction results, we use attention mechanism to learn the correlation between the current target item and each user interest point to obtain the long-term user interest representation $h_{Attention}$ . As shown in Eq. (8): $\begin{array}{l} (8) & h_{Attention} = h_{t} \cdot a_{t}, \end{array}$ where $h_{t}$ is the t-th hidden state of the interest extraction layer, $a_{t}$ is the attention score, and · denotes the vector-scalar product.

The attention function used is as follows: $\begin{array}{l} (9) & a_{t} = \frac{exp (h_{t} W e_{a})}{\sum_{j = 1}^{T} exp (h_{j} W e_{a})}, \end{array}$ where $e_{a}$ is the embedding vector of target advertisement, $W \in R^{n_{H} \times n_{A}}$ is the trainable transformation matrix, $n_{H}$ is the dimension of hidden state, and $n_{A}$ is the dimension of $e_{a}$ . The attention score reflects the relationship between $e_{a}$ and input $h_{t}$ , and the higher the correlation is, the higher the attention score is.
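Eqs. (8) and (9) amount to a softmax over bilinear scores followed by a per-step scaling of the interest points. The NumPy sketch below illustrates this on a toy input; the function names and the example values are assumptions, and the max-shift before the exponential is a standard numerical safeguard that does not change the result.

```python
import numpy as np


def attention_scores(H, W, e_a):
    """Softmax attention of Eq. (9): a_t proportional to exp(h_t W e_a)."""
    logits = H @ W @ e_a            # shape (T,)
    logits = logits - logits.max()  # shift for numerical stability
    weights = np.exp(logits)
    return weights / weights.sum()


def attended_states(H, W, e_a):
    """Eq. (8): scale each interest point h_t by its attention score a_t."""
    a = attention_scores(H, W, e_a)
    return a[:, None] * H, a


# Toy example: T = 3 interest points, n_H = n_A = 2, identity transformation.
H = np.array([[0.0, 1.0],
              [2.0, 0.0],
              [1.0, 1.0]])
W = np.eye(2)
e_a = np.array([1.0, 0.0])
h_attention, a = attended_states(H, W, e_a)
```

Here the second interest point is the most aligned with the target embedding, so it receives the largest attention score.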

Long-term user interest representation $h_{Attention}$ also depicts the diversity of user interest to a certain extent.

∙ User interest representation based on BiGRU and input with attention score

Secondly, we continue to use BiGRU to model the potential correlation between different interest points and further mine the dependency relationship in user interest. We take the long-term user interest representation with attention score $h_{Attention}$ as the input of this module. The process of input signal passing through BiGRU is the same as the BiGRU based modeling process in the interest extraction layer, and the final hidden state $h_{T}^{'}$ obtained is the user interest representation with relevance.

In this process, the local activation of each input step can strengthen the interest related to the target advertisement, weaken the interference caused by interest drift, and also describe the influence of each interest point on the final result to a certain extent.

∙ Short term interest representation based on sliding time window

Then, in order to capture the concentration and periodicity that user interest shows as it changes over time, we model the user’s short-term interest in different periods based on a sliding time window, as shown in Eq. (10): $\begin{array}{l} (10) & h_{Timewindow} = \frac{1}{N} \sum_{j = 1}^{N} \frac{1}{S_{j}} \sum_{i = 1}^{S_{j}} ω (e_{a}, h_{i}) \cdot h_{i}, \end{array}$ where N is the size of the time window, $S_{j}$ is the number of items the user interacted with on the j-th day of the window, and $ω (e_{a}, h_{i})$ is the weight factor, specifically the similarity between the low-level interest $h_{i}$ and the target embedding vector $e_{a}$ , which is initially calculated by the dot product.
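Eq. (10) can be sketched in NumPy as follows. Several details are assumptions made for illustration: interest points are grouped by day, days without interactions contribute nothing but still count toward N, and $h_{i}$ and $e_{a}$ share a dimension so that the dot-product weight is defined. The helper `sliding_windows` is hypothetical.

```python
import numpy as np


def time_window_interest(daily_states, e_a):
    """Eq. (10) for one window.

    daily_states: list of N arrays, one per day in the window, each of shape
    (S_j, d) holding the low-level interest points h_i of that day.
    """
    N = len(daily_states)
    total = np.zeros_like(e_a, dtype=float)
    for H_day in daily_states:
        if len(H_day) == 0:
            continue  # assumption: an empty day adds nothing
        omega = H_day @ e_a                              # omega(e_a, h_i) per item
        total += (omega[:, None] * H_day).mean(axis=0)   # (1/S_j) * sum_i omega * h_i
    return total / N


def sliding_windows(days, size):
    """All consecutive windows of `size` days (the sliding time window)."""
    return [days[s:s + size] for s in range(len(days) - size + 1)]


# Toy example with a window of N = 2 days and d = 2.
day_1 = np.array([[1.0, 0.0]])
day_2 = np.array([[0.0, 1.0], [2.0, 0.0]])
e_a = np.array([1.0, 0.0])
h_timewindow = time_window_interest([day_1, day_2], e_a)
```

In the toy example the daily averages are [1, 0] and [2, 0], so the window representation is their mean.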

∙ Multi-level user interest representation

Finally, we concatenate interest representation $h_{Attention}$ , $h_{T}^{'}$ with $h_{Timewindow}$ to form the final multi-level interest representation $I_{Interest}$ output from this layer, as shown in Eq. (11). $\begin{array}{l} (11) & I_{Interest} = [h_{T}^{'}; h_{Attention}; h_{Timewindow}], \end{array}$

3.5. High-order feature extraction based on MLP

MLP has strong nonlinear fitting ability and can implicitly learn the potential association relationship between features. We concatenate multi-level user interest representation $I_{Interest}$ with user embedding vector $e_{u}$ , advertisement embedding vector $e_{a}$ and context embedding vector $e_{c}$ to form a joint input vector $ge = [I_{Interest}; e_{u}; e_{a}; e_{c}]$ , and then feed it into MLP to learn high-order feature interactions.

The forward propagation process of the MLP network is formally described as follows: $\begin{array}{l} (12) & h^{2} = σ (W^{(2)} ge + b^{2}), \\ (13) & h^{l + 1} = σ (W^{(l + 1)} h^{l} + b^{l + 1}), \end{array}$ where l is the index of the hidden layer, $W^{(l)}$ and $b^{l}$ are respectively the connection weight matrix and bias of the l-th hidden layer of the MLP, σ is the nonlinear activation function, and $h^{l}$ is the output vector of the l-th hidden layer.
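The forward pass in Eqs. (12)–(13) can be sketched in NumPy as below. The activation is PReLU, which Section 4.4 reports using; its slope is fixed at 0.25 here as an assumption, whereas in training it would be learned. The input width of 32 and the random weights are illustrative only.

```python
import numpy as np


def prelu(x, alpha=0.25):
    """PReLU with a fixed slope; in training, alpha would be a learned parameter."""
    return np.where(x > 0, x, alpha * x)


def mlp_forward(ge, weights, biases, activation=prelu):
    """Eqs. (12)-(13): h <- activation(W h + b), starting from the joint input ge."""
    h = ge
    for W, b in zip(weights, biases):
        h = activation(W @ h + b)
    return h


rng = np.random.default_rng(0)
sizes = [32, 200, 100, 50]  # input width is illustrative; 200-100-50 as on Books
weights = [rng.normal(scale=0.01, size=(n_out, n_in))
           for n_in, n_out in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros(n_out) for n_out in sizes[1:]]
h_prime = mlp_forward(rng.normal(size=sizes[0]), weights, biases)
```

The output of the last hidden layer plays the role of the high-order abstract feature passed to the final prediction layer.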

After the combination learning of the MLP, the high-order abstract feature $h^{'}$ is obtained.

3.6. Objective function

When the high-order abstract feature $h^{'}$ passes through the top fully connected layer, the sigmoid function is used to calculate the CTR prediction value $P (h^{'})$ , and the parameters of the CTR model are then optimized against the objective function to obtain a stable and convergent model. CTR prediction is a typical binary classification problem, and we use the negative log-likelihood function as the loss function, $\begin{array}{l} (14) & L_{target} = - \frac{1}{N} \sum_{(h^{'}, y) \in D}^{N} (y log P (h^{'}) + (1 - y) log (1 - P (h^{'}))), \end{array}$ where D is the training set with N samples, and $y \in \{0, 1\}$ is the user’s click label.

In the interest extraction layer (Section 3.3), in order to make up for the deficiency of the traditional negative log-likelihood function, we introduced the auxiliary loss function $L_{aux}$ (see Eq. (7)) to supervise the BiGRU-based extraction of the low-level user interest representation. Therefore, the objective function of the CTR model is given in Eq. (15): $\begin{array}{l} (15) & L = L_{target} + α L_{aux} + β ‖ W ‖_{2}, \end{array}$ where α is a hyperparameter that balances the user interest representation task in Section 3.3 against the CTR prediction task, $‖ W ‖_{2}$ is the regularization term, and β is the penalty factor that balances the regularization term against the loss function.
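Eqs. (14) and (15) can be combined in a short NumPy sketch. The regularization term is taken here as the 2-norm over all weight entries, which is an assumption about how $‖ W ‖_{2}$ is computed; the default values of `alpha` and `beta` are the ones reported in Section 4.4, and the clipping constant `eps` is added only for numerical safety.

```python
import numpy as np


def target_loss(p, y, eps=1e-8):
    """Negative log-likelihood of Eq. (14) for predictions p and labels y."""
    p = np.clip(p, eps, 1.0 - eps)
    return -np.mean(y * np.log(p) + (1.0 - y) * np.log(1.0 - p))


def total_loss(p, y, l_aux, weights, alpha=0.5, beta=1e-4):
    """Eq. (15): L = L_target + alpha * L_aux + beta * ||W||_2.

    Assumption: ||W||_2 is the 2-norm over all entries of all weight arrays.
    """
    w_norm = np.sqrt(sum(np.sum(W ** 2) for W in weights))
    return target_loss(p, y) + alpha * l_aux + beta * w_norm


# Toy values: two samples predicted at 0.5, an auxiliary loss of 2.0,
# and a single weight vector with norm 5.
p = np.array([0.5, 0.5])
y = np.array([1.0, 0.0])
loss = total_loss(p, y, l_aux=2.0, weights=[np.array([3.0, 4.0])], alpha=0.5, beta=0.1)
```

For the toy values the three terms are ln 2, 1.0 and 0.5, which makes the relative weight of each component easy to inspect.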

4. Experiments

In this section, we introduce the experiments in detail, including the datasets, evaluation metrics, baseline methods, experimental parameter settings, results analysis, and ablation study.

4.1. Datasets and task description

The Amazon product dataset¹ consists of product reviews and metadata from Amazon, which contains rich user behavior. We use two subsets, Books and Electronics, to verify the effect of the proposed model.

¹ Amazon product dataset [EB/OL]. http://jmcauley.ucsd.edu/data/amazon/.

∙ Books dataset

Review data in the Books dataset contains the “reviewerID”, “asin”, “reviewerName”, “helpful”, “reviewText”, “overall”, “summary”, “unixReviewTime” and “reviewTime” fields; the metadata contains the “asin”, “title”, “price”, “imUrl”, “related”, “salesRank”, “brand” and “categories” fields.

∙ Electronics dataset

Review data and Metadata in the Electronics dataset contain the same fields as the Books Dataset.

The statistics of the above datasets are shown in Table 1.

Table 1. Statistics of datasets used in this paper

Dataset	Users	Items	Categories	Samples
Books	603,668	367,983	1,601	8,898,041
Electronics	192,403	63,001	801	1,689,188

∙ Task description

We treat reviews as behaviors and sort a user’s reviews by time. Suppose that all the behaviors of a user u are $(b_{1}, b_{2}, \dots b_{k}, \dots, b_{n})$ , and the task is to predict whether the user u will write the (k + 1)-th review item by using the first k review items (behaviors). In the training phase, training dataset with $k = 1, 2, \dots, n - 2$ is generated for each user. In the test phase, given the first $n - 1$ review items, predict whether the last one will be written or not. Features used include item_id, category_id, item_id list for user reviews, and category_id list.
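The sample construction described above can be sketched in plain Python. The function name `build_samples` and the string placeholders are hypothetical, and the sketch covers only the positive (history, target) pairs; how negative examples are sampled is not shown.

```python
def build_samples(behaviors):
    """Split one user's time-ordered behaviors (b_1..b_n) into train/test samples.

    Each sample is (history, target): the first k behaviors and behavior k + 1.
    Training uses k = 1..n-2; the test sample uses the first n-1 behaviors.
    """
    n = len(behaviors)
    train = [(behaviors[:k], behaviors[k]) for k in range(1, n - 1)]
    test = (behaviors[:n - 1], behaviors[n - 1])
    return train, test


# A user with n = 5 reviews yields three training samples and one test sample.
train, test = build_samples(["b1", "b2", "b3", "b4", "b5"])
```

Holding out the last behavior of every user keeps the test target strictly later in time than anything seen in training for that user.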

4.2. Evaluation metrics

We use AUC (area under the receiver operating characteristic curve) [36] as the main evaluation metric of the current prediction task. The larger the AUC, the better the discrimination ability of the model and the higher the accuracy of CTR prediction. In addition, we use RMSE (root mean squared error) [37] as an auxiliary evaluation metric, which measures the deviation between the predicted value and the real value; a smaller RMSE is better.
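For reference, both metrics can be computed with a few lines of plain Python. The sketch below uses the pairwise definition of AUC (the probability that a randomly chosen positive sample is scored above a randomly chosen negative one, with ties counted as one half), which is quadratic in the number of samples and therefore meant only for small illustrative inputs. The function names are hypothetical and not taken from the paper's implementation.

```python
import math

def auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative sample")
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(pos) * len(neg))

def rmse(labels, scores):
    """Root mean squared error between predicted probabilities and labels."""
    squared = sum((y - s) ** 2 for y, s in zip(labels, scores))
    return math.sqrt(squared / len(labels))
```

AUC depends only on the ranking of the scores, while RMSE also penalizes predicted probabilities that are poorly calibrated, which is why the two metrics complement each other.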

4.3. Baseline methods

Since the contributions of the proposed model lie in modeling user behavior and deeply mining user preferences, we mainly compare against mainstream CTR models that include user behavior analysis or that can model user behavior.

$LR$. Logistic regression handles large-scale features well and is widely used in industry, so we take it as a weak baseline model.

$PNN$ [8]. PNN can be regarded as an improvement of the embedding & MLP paradigm; it introduces a product layer after the embedding layer to fully model the low-order feature interactions.

$Wide \& Deep$ [11]. Wide & Deep learning is a model widely used in industry. It consists of two parts: i) the wide model, which handles manually designed cross features; ii) the deep model, which automatically extracts the nonlinear relationships between features. Following the practice in [12], we take the cross product of user behavior and candidate item as the input of the wide module.

$DIN$ [15]. DIN uses the attention mechanism to activate the user behavior related to the target advertisement and adaptively learns the user interest representation.

$DIEN$ [16]. On the basis of DIN, DIEN continues to explore the relevance and diversity between user behaviors by using AUGRU.

4.4. Experimental parameters settings and sensitivity analysis

We implement the proposed model in TensorFlow. To determine the hyperparameters of the MLIM model, we tune one parameter at a time while keeping the others fixed. The MLIM model is trained from scratch, and we randomly initialize its parameters from a Gaussian distribution with a mean of 0 and a standard deviation of 0.01. Based on experience, we set the number of hidden layers of the MLP to 3, use PReLU as the activation function, and experiment with different hidden layer shapes. As shown in Fig. 8, the best result on the Books dataset is obtained when the hidden layer units are [200–100–50], possibly because this shape suits the current dataset better. For the learning rate, we test the values $[0.0001, 0.001, 0.005, 0.01]$ and find that 0.001 works best. As shown in Fig. 9, the model also works better with the Adam optimizer [38] and a drop rate of 0.8. As shown in Fig. 10, we test sliding time window sizes from 2 to 5, and a size of 4 gives the best result. Based on experience, we set the balance factor α to 0.5. For the penalty factor β, we test the values $[0.0001, 0.001, 0.01, 0.1]$, and 0.0001 works best. On the Electronics dataset, the best result is obtained when the hidden layer units are [100–100–100]; the other hyperparameter settings are the same as for the Books dataset.

Fig. 8.

Hyperparameters settings and sensitivity analysis I.

Fig. 9.

Hyperparameters settings and sensitivity analysis II.

Fig. 10.

Hyperparameters settings and sensitivity analysis III.
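The tuning procedure of Section 4.4, in which one hyperparameter is varied while the others stay fixed, can be sketched as a simple one-at-a-time search. The candidate lists in `GRID` are the ranges quoted in the text; the function name, the dictionary keys, and the `evaluate` callable (which would train the model and return a validation score such as AUC) are assumptions made for illustration and are not the authors' code.

```python
def one_at_a_time_search(grid, defaults, evaluate):
    """Tune each hyperparameter in turn while holding the others fixed.

    grid:     dict mapping a parameter name to its list of candidate values
    defaults: dict mapping a parameter name to its starting value
    evaluate: callable taking a config dict and returning a score (higher is better)
    """
    best = dict(defaults)
    for name, candidates in grid.items():
        scored = []
        for value in candidates:
            trial = dict(best)      # copy so earlier choices stay fixed
            trial[name] = value
            scored.append((evaluate(trial), value))
        # Keep the candidate with the highest score for this parameter.
        best[name] = max(scored, key=lambda pair: pair[0])[1]
    return best

# Candidate ranges quoted in Section 4.4 (key names are hypothetical).
GRID = {
    "learning_rate": [0.0001, 0.001, 0.005, 0.01],
    "window_size": [2, 3, 4, 5],
    "beta": [0.0001, 0.001, 0.01, 0.1],
}
```

This kind of search needs far fewer training runs than a full grid over all combinations, at the cost of possibly missing interactions between hyperparameters.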

4.5. Results and analysis

To ensure a fair comparison, we adopt a consistent experimental environment for the baseline models. When a baseline model includes a deep neural network, it uses the same embedding dimension, hidden layer shape (number of layers and units per layer), learning rate, drop rate and penalty factor as the proposed model.

Table 2 shows the results of each model on the Books and Electronics datasets. All deep models achieve a higher AUC than the LR model, which has a simple structure and limited expressive ability, except Wide & Deep on the Books dataset; this largely supports the effectiveness of deep neural networks in nonlinear transformation and feature extraction. The Wide & Deep model with manually designed features performs poorly, especially on the Books dataset (AUC of 0.5571), while PNN achieves a higher AUC than Wide & Deep on both datasets thanks to the low-order feature interactions modeled in the product layer, which suggests that modeling feature interactions at multiple levels helps improve the accuracy of CTR prediction.

Among the deep learning based baselines, PNN and Wide & Deep perform worse than DIN and DIEN, which supports the importance of extracting user interest from user behaviors. Users generally do not express their interest explicitly. The DIN model designs a local activation unit that strengthens the relevant user interest by soft-searching for user behaviors related to the candidate advertisement, and thus obtains a user interest representation that adapts to the candidate. Compared with PNN and Wide & Deep, this structure greatly improves the expressive ability of the model. Although DIN activates the user behaviors related to the candidate advertisement, it ignores the sequential information in user behaviors. DIEN uses a two-layer GRU structure to capture the evolution of user interest and achieves slightly better performance than DIN.

Table 2
Results of all models on both datasets

Model	Books_AUC	Books_RMSE	Electronics_AUC	Electronics_RMSE
LR	0.6476	0.5781	0.6335	0.5922
PNN	0.7131	0.5502	0.7211	0.5425
Wide & Deep	0.5571	0.7108	0.7106	0.5368
DIN	0.7427	0.4871	0.7328	0.4803
DIEN	0.7535	0.4401	0.7433	0.4543
MLIM	0.7571	0.4329	0.7489	0.4447

DIN and DIEN obtain the user interest representation by modeling user behavior and are effective in improving the accuracy of CTR prediction. On the basis of DIEN, the proposed model further mines the evolution law of interest. On the one hand, we improve the two-layer GRU and use HBiGRU to mine the relevance of interest. On the other hand, we use the attention mechanism and the sliding time window to model the diversity, concentration and periodicity of interest, so as to obtain a final user interest representation with richer information. As a result, MLIM obtains the best AUC and RMSE among all compared models on both datasets, which contain rich user behavior.
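To make the size of these differences concrete, the short sketch below computes absolute AUC gains directly from the values in Table 2. The dictionary simply copies the AUC columns of the table; the helper name `auc_gain` is hypothetical.

```python
# AUC values copied from Table 2.
AUC = {
    "LR": {"Books": 0.6476, "Electronics": 0.6335},
    "PNN": {"Books": 0.7131, "Electronics": 0.7211},
    "Wide & Deep": {"Books": 0.5571, "Electronics": 0.7106},
    "DIN": {"Books": 0.7427, "Electronics": 0.7328},
    "DIEN": {"Books": 0.7535, "Electronics": 0.7433},
    "MLIM": {"Books": 0.7571, "Electronics": 0.7489},
}

def auc_gain(model, baseline, dataset):
    """Absolute AUC difference (model minus baseline), rounded to 4 decimals."""
    return round(AUC[model][dataset] - AUC[baseline][dataset], 4)
```

For example, `auc_gain("MLIM", "DIEN", "Books")` gives 0.0036 and `auc_gain("MLIM", "DIEN", "Electronics")` gives 0.0056, so the improvement over the strongest baseline is consistent across both datasets but small in absolute terms.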

4.6. Ablation study

In this section, we explore how the following components of the user interest modeling part of the proposed model affect prediction performance: single GRU, BiGRU-attm, HBiGRU, AI-HBiGRU, HBiGRU-AM, HBiGRU-AMTW, and the auxiliary loss.

Fig. 11.

AUC values obtained by different interest modeling modules in MLIM model.

Fig. 11 shows the prediction results obtained with different modules for modeling the user interest representation on the two datasets. On both datasets, BiGRU-attm performs better than a single GRU, because adding an attention score to each interest point improves the accuracy of the interest representation to some extent. HBiGRU achieves slightly better results than BiGRU-attm, which illustrates the benefit of using a hierarchical BiGRU to learn the correlation between user interest points. Because the attention score is used to influence the input of each step in the second BiGRU, AI-HBiGRU achieves better results than both BiGRU-attm and HBiGRU.

Although AI-HBiGRU improves on the previous variants, some information is still lost in the process of interest evolution. Therefore, we further explore how prediction performance is affected by adding a long-term user interest representation based on the attention mechanism and a short-term user interest representation based on the sliding time window. Adding the long-term user interest representation yields a certain performance improvement on the Electronics dataset. Further adding the short-term user interest representation yields a significant performance improvement on the Books dataset, but only a limited improvement on the Electronics dataset. This may be because user interest in the Books dataset is more concentrated and phased, while user interest in the Electronics dataset tends to be more stable. A consistent pattern can be found on both datasets: adding the long-term and short-term user interest representation modules produces a multi-level interest representation with richer information, which improves the accuracy of CTR prediction to different degrees.

We further explore the influence of the auxiliary loss, for which the negative examples are generated by random sampling. It can be observed from Fig. 12 that the global loss L and the auxiliary loss $L_{aux}$ follow a similar trend on the Books dataset and show a general downward trend on the Electronics dataset, which indicates that both the global loss for CTR prediction and the auxiliary loss for interest representation play a role. It can also be seen from Fig. 12 that the auxiliary loss improves the performance on both datasets to a certain extent, which reflects the importance of supervision information for learning sequential interest and embedding representations.

Fig. 12.

Change curve of loss and auxiliary loss during training.
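The random negative sampling used for the auxiliary loss can be sketched as follows. This minimal example assumes one negative item is drawn per predicted step and that it must differ from the true next item, which is a common convention but is not stated in this section; the function name `sample_negatives` and its parameters are hypothetical.

```python
import random

def sample_negatives(sequence, item_pool, rng=None):
    """Draw one random negative item for every next-step target in `sequence`.

    Returns a list aligned with sequence[1:], i.e. one negative per step
    at which the next behavior is predicted.
    """
    if len(set(item_pool)) < 2:
        raise ValueError("item_pool needs at least two distinct items")
    rng = rng or random.Random()
    negatives = []
    for target in sequence[1:]:
        candidate = rng.choice(item_pool)
        while candidate == target:  # resample until it differs from the true item
            candidate = rng.choice(item_pool)
        negatives.append(candidate)
    return negatives
```

Passing a seeded generator, for example `random.Random(0)`, makes the sampled negatives reproducible across runs.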

5. Conclusions

5.1. Main conclusions

This work studies CTR prediction modeling in web advertising and recommendation scenarios with rich user behavior data. Previous content-based deep CTR models generally do not fully mine user interest and preference information. This study proposes a deep CTR prediction model called MLIM, which focuses on modeling the evolution law of user interest. Specifically, we design an interest extraction layer to capture the low-level interest sequence and use an auxiliary loss to supervise the interest states. On this basis, we design an interest evolution layer to obtain a multi-level user interest representation that reflects diversity, relevance, concentration and periodicity. Comprehensive experiments show that the proposed model improves prediction performance to a certain extent.

5.2. Managerial insights

This study has some managerial implications as well. Precision marketing helps advertisers spend their advertising budget in the right place and is an important means for enterprises to improve their performance. Mining user preferences is an effective way to achieve precision marketing. By describing the evolution law of user interest in CTR modeling, the method proposed in this paper can, to a certain extent, anticipate future user behavior and guide users, and it enriches data-driven marketing decisions. However, users’ needs change over time, so precision marketing can only be relatively accurate.

5.3. Future work

Given the conclusions and insights of this study, our work can be extended as follows. Future research could 1) mine more side information to obtain a more accurate user interest representation, and 2) study more feature combination learning methods.

Acknowledgements

This work is supported by the Scientific Research Project of Guizhou University of Finance and Economics (No. 2020XYB08) and the Scientific Research Initiation Project for Introduction Talents of Guizhou University of Finance and Economics (No. 2021YJ049).

References

1. Su, Jin et al., Improving click-through rate prediction accuracy in online advertising by transfer learning, in: ACM International Conference on Web Intelligence (WI 2017), San Francisco, USA, 2017, pp. 1018–1025. doi:10.1145/3106426.3109037.

2. Chapelle, Modeling delayed feedback in display advertising, in: Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (SIGKDD 2014), New York, USA, 2014, pp. 1097–1105. doi:10.1145/2623330.2623634.

3. Yan, W.J. Li, G.R. Xue and Han, Coupled group lasso for web-scale CTR prediction in display advertising, in: Proceedings of the 31st International Conference on Machine Learning (ICML 2014), Beijing, China, 2014, pp. 802–810.

4. R.J. Oentaryo, E.P. Lim, J.W. Low, Lo and Finegold, Predicting response in mobile advertising with hierarchical importance-aware factorization machine, in: Proceedings of the 7th ACM International Conference on Web Search & Data Mining (WSDM 2014), New York City, USA, 2014, pp. 123–132.

5. Pan, Chen, Liu, Xu, Ma and Lin, Sparse factorization machines for click-through rate prediction, in: IEEE 16th International Conference on Data Mining (ICDM 2016), Barcelona, Spain, 2016, pp. 400–409.

6. Juan, Lefortier and Chapelle, Field-aware factorization machines in a real-world online advertising system, in: Proceedings of the 26th International Conference on World Wide Web (WWW 2017), Perth, Australia, 2017, pp. 680–688.

7. Zhang, Du and Wang, Deep learning over multi-field categorical data: A case study on user response prediction, in: The 38th European Conference on Information Retrieval (ECIR 2016), Padua, Italy, 2016, pp. 45–57.

8. Qu, Cai, Ren and Zhang, Product-based neural networks for user response prediction, in: IEEE 16th International Conference on Data Mining (ICDM 2016), Barcelona, Spain, 2016, pp. 1149–1154. doi:10.1109/ICDM.2016.0151.

9. Yang, Xu, Shen, Shen and Zhao, Operation-aware neural networks for user response prediction, Neural Networks 121 (2020), 161–168. doi:10.1016/j.neunet.2019.09.020.

10. He and T.S. Chua, Neural factorization machines for sparse predictive analytics, in: Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2017), Shinjuku, Tokyo, Japan, 2017, pp. 355–364.

11. H.T. Cheng, Koc, Harmsen, Shaked et al., Wide & deep learning for recommender systems, in: Proceedings of the 1st Workshop on Deep Learning for Recommender Systems, New York, NY, USA, 2016, pp. 7–10. doi:10.1145/2988450.2988454.

12. Guo, Tang, Ye, Li and He, DeepFM: A factorization-machine based neural network for CTR prediction, in: Proceedings of the 26th International Joint Conference on Artificial Intelligence (IJCAI 2017), Melbourne, Australia, 2017, pp. 1725–1731.

13. Wang, Fu, Fu and Wang, Deep & cross network for ad click predictions, in: The 23rd ACM International Conference on Knowledge Discovery and Data Mining (SIGKDD 2017), Halifax, NS, Canada, 2017, pp. 12:1–12:7.

14. Lian, Zhou, Zhang, Chen, Xie and Sun, xDeepFM: Combining explicit and implicit feature interactions for recommender systems, in: The 24th SIGKDD Conference on Knowledge Discovery and Data Mining (KDD 2018), London, United Kingdom, 2018, pp. 1754–1763.

15. Zhou, Song, Zhu, Fan, Zhu, Ma, Yan, Jin and Li, DIN: Deep interest network for click-through rate prediction, in: The 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (SIGKDD 2018), London, UK, 2018, pp. 1059–1068. doi:10.1145/3219819.3219823.

16. Zhou, Mou, Fan, Pi, Bian, Zhou, Zhu and Gai, DIEN: Deep interest evolution network for click-through rate prediction, in: The Thirty-Third AAAI Conference on Artificial Intelligence (AAAI 2019), Honolulu, Hawaii, USA, 2019, pp. 5941–5948.

17. Feng, Lv, Shen, Wang, Sun, Zhu and Yang, Deep session interest network for click-through rate prediction, in: The 28th International Joint Conference on Artificial Intelligence (IJCAI 2019), Macao, China, 2019, pp. 2301–2307.

18. She and Jia, A BiGRU method for remaining useful life prediction of machinery, Measurement 167 (2020), 108277. doi:10.1016/j.measurement.2020.108277.

19. Wang and Wu, Dynamic imbalanced business credit evaluation based on Learn++ with sliding time window and weight sampling and FCM with multiple kernels, Information Sciences 520 (2020), 305–323. doi:10.1016/j.ins.2020.02.011.

20. Jiang, Gao and Dai, Research on CTR prediction for contextual advertising based on deep architecture model, Control Engineering and Applied Informatics 18(1) (2016), 11–19.

21. Jiang and Gao, An intelligent recommendation approach for online advertising based on hybrid deep neural network and parallel computing, Cluster Computing 23(3) (2020), 1987–2000. doi:10.1007/s10586-019-02959-5.

22. Guo, Ye, Su, Liu, Sun and Xiang, Visualizing and understanding deep neural networks in CTR prediction, in: Proceedings of the SIGIR 2018 Workshop on eCommerce, Co-Located with the 41st International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR eCom 2018), Tokyo, Japan, 2018.

23. Xiao, Ye, He, Zhang, Wu and T.S. Chua, Attentional factorization machines: Learning the weight of feature interactions via attention networks, in: The Twenty-Sixth International Joint Conference on Artificial Intelligence (IJCAI 2017), Melbourne, Australia, 2017, pp. 3119–3125.

24. Ying, Zhuang, Zhang, Liu, Xu, Xie, Xiong and Wu, Sequential recommender system based on hierarchical attention networks, in: The 27th International Joint Conference on Artificial Intelligence (IJCAI 2018), Stockholm, Sweden, 2018, pp. 3926–3932.

25. Song, Shi, Xiao, Duan, Xu, Zhang and Tang, AutoInt: Automatic feature interaction learning via self-attentive neural networks, in: The 28th ACM International Conference on Information and Knowledge Management (CIKM 2019), Beijing, China, 2019, pp. 1161–1170.

26. Shi, Tang and Liu, Functional and contextual attention-based LSTM for service recommendation in mashup creation, IEEE Transactions on Parallel and Distributed Systems 30(5) (2019), 1077–1090. doi:10.1109/TPDS.2018.2877363.

27. Xing, Liu, Wang, Zhao and Li, A hierarchical attention model for rating prediction by leveraging user and product reviews, Neurocomputing 332 (2019), 417–427. doi:10.1016/j.neucom.2018.12.027.

28. Zhao, Xiong, Zu, Ju, Li and Li, A hierarchical attention recommender system based on cross-domain social networks, Complexity 2020 (2020), 9071624.

29. Devooght and Bersini, Long and short-term recommendations with recurrent neural networks, in: Proceedings of the 25th Conference on User Modeling, Adaptation and Personalization, Bratislava, Slovakia, 2017, pp. 13–21. doi:10.1145/3079628.3079670.

30. Bogina and Kuflik, Incorporating dwell time in session-based recommendations with recurrent neural networks, in: Proceedings of the 1st Workshop on Temporal Reasoning in Recommender Systems Co-Located with 11th International Conference on Recommender Systems (RecSys 2017), Como, Italy, 2017, pp. 57–59.

31. Yu, Lian, Mahmoody, Liu and Xie, Adaptive user modeling with long and short-term preferences for personalized recommendation, in: The Twenty-Eighth International Joint Conference on Artificial Intelligence (IJCAI 2019), Macao, China, 2019, pp. 4213–4219.

32. Zhang, Grorec: A group-centric intelligent recommender system integrating social, mobile and big data technologies, IEEE Transactions on Services Computing 9(5) (2016), 786–795. doi:10.1109/TSC.2016.2592520.

33. Zhang, Li, Wang and M.S. Hossain, MASR: Multi-aspect-aware session-based recommendation for intelligent transportation services, IEEE Transactions on Intelligent Transportation Systems 99 (2020), 1–10. doi:10.1109/TITS.2020.3041950.

34. Deng, Shi, Chen, Kwak and Tang, Recommender system for marketing optimization, World Wide Web 23 (2020), 1497–1517.

35. Dash and H.S. Behera, A comprehensive study on evolutionary algorithm-based multilayer perceptron for real-world data classification under uncertainty, Expert Systems 36(1) (2019), e12327.

36. B.-S. Ke, A.J. Chiang and Y.-C.I. Chang, Influence analysis for the area under the receiver operating characteristic curve, Journal of Biopharmaceutical Statistics 28(4) (2018), 722–734. doi:10.1080/10543406.2017.1377728.

37. Vranjes, Rimac-Drlje and Vranjes, Foveation-based content adaptive root mean squared error for video quality assessment, Multimedia Tools and Applications 77(16) (2018), 21053–21082. doi:10.1007/s11042-017-5544-6.

38. D.P. Kingma and Ba, Adam: A method for stochastic optimization, in: 3rd International Conference on Learning Representations (ICLR 2015), San Diego, USA, 2015, pp. 1–15.
