Notes on introduction to industrial recsys
2025-02-09
The following notes are based on the wonderful introductory course to industrial recommendation system by Shusen Wang in the context of Xiaohongshu (Rednote). The course is in Chinese, and these notes are written in English to aid in interview preparation. This is not a comprehensive review, nor does it strictly follow the course structure.
Retrieval
-
Item CF
-
Idea: if the user likes item1, and item1 is similar to item2 => the user may like item2
-
The user’s interest in the item is measured as a weighted sum of the user’s past preferences for previously interacted items (
$like$
), where the weights are determined by the similarity between the new item and the previously interacted items ($sim$
)$$\text{Interest($user$ in $item$)} = \sum_j like(user, item_j) \cdot sim (item_j, item) \qquad (1) $$
where
$sim(i_1, i_2)$
can be defined using the cosine similarity of the preference vectors between users who interacted with these two items:$$sim(i_1, i_2) = \frac{\sum_{v \in \mathcal{V}} like(v, i_1) \cdot like(v, i_2)}{\sqrt{\sum_{u_1 \in \mathcal{W_1}} like^2(u_1, i_1) \cdot \sum_{u_2 \in \mathcal{W_2}} like^2(u_2, i_2)}} \qquad (2) $$
with
-
$\mathcal{W_1}$
: users who like item$i_1$
-
$\mathcal{W_2}$
: users who like item$i_2$
-
$\mathcal{V} = \mathcal{W_1} \cap \mathcal{W_2}$
If we do not differentiate the level of preferences (
$like$
),$sim(i_1, i_2)$
simplifies to $$sim(i_1, i_2) = \frac{|\mathcal{V}|}{\sqrt{|\mathcal{W_1}| \cdot |\mathcal{W_2}|}} \qquad (3)$$ -
-
Indexing
Maintain two index tables for retrieval (resulting in extensive offline computations but minimal online computations)
-
user-item table: the
$n$
items the user interacted recently -
item-item table: top
$k$
most similar items
-
-
-
Swing CF
-
Idea: Similar to item CF, but also considers whether users belong to the same circle when computing item similarity. The influence of users within the same circle should be downweighted.
Given
-
$\mathcal{J_i}$
: the set of items user$u_i$
likes -
$\mathcal{I} = \mathcal{J_1} \cap \mathcal{J_2}$
Define the overlap between user
$u_1$
and$u_2$
as the size of their intersection$$Overlap(u_1, u_2) = |\mathcal{I}| = |\mathcal{J_1} \cap \mathcal{J_2}| \qquad (4)$$
Similarity between item
$i_1$
and$i_2$
is then defined as$$sim(i_1, i_2) = \sum_{u_1 \in \mathcal{V}} \sum_{u_2 \in \mathcal{V}} \frac{1}{\alpha + overlap(u_1, u_2)} \qquad (5)$$
with
$\alpha$
being a hyperparameter to be tuned. -
-
-
User CF
-
Idea: if user1 is similar to user2, and user2 likes an item => user1 may like the item
Compared to item CF (eq. 1:
$\sum_j like(user, item_j) \cdot sim (item_j, item)$
), the similarity is now based on the users' side instead of items'.
$$\text{Interest(
$user$
in$item$
)} = \sum_j sim(user, user_j) \cdot like (user_j, item) \qquad (6)$$- User similarity (cosine similarity based on the interacted items)
$$sim(u_1, u_2) = \frac{overlap(u_1, u_2)}{\sqrt{|\mathcal{J_1}| \cdot |\mathcal{J_2}|}} = \frac{\sum_{l \in \mathcal{I}} 1}{\sqrt{|\mathcal{J_1}| \cdot |\mathcal{J_2}|}} \qquad (7)$$
-
Need to lower the weight of popular items. They are unable to reflect the unique interests of users
$$sim(u_1, u_2) = \frac{\sum_{l \in I} \frac{1}{\log(1 + n_l)}}{\sqrt{|\mathcal{J_1}| \cdot |\mathcal{J_2}|}} \qquad (8)$$
with
$n_l$
being the number of users who liked item$l$
(reflects popularity)
-
-
Two-tower model
-
Training methods
-
Pointwise: view each sample (positive or negative) independently, do binary classification
- based on heuristics, #positive/ #negative samples = 1/2 or 1/3
-
Pairwise: the model takes one positive and one negative sample at each time
-
Training
-
Let
$\mathbf{b^+}$
denote the embedding of a positive item,$\mathbf{b^-}$
the embedding of a negative item,$\mathbf{a}$
the embedding of a user, and$\cos(\mathbf{a}, \mathbf{b})$
the cosine similarity between the embedding vectors$\mathbf{a}$
and$\mathbf{b}$
. For this triplet: -
Idea: encourage
$cos(\mathbf{a}, \mathbf{b^+})$
to be greater than$cos(\mathbf{a}, \mathbf{b^-})$
-
Triplet hinge loss:
$L(\mathbf{a}, \mathbf{b^+}, \mathbf{b^-}) = \max\{0, \cos(\mathbf{a}, \mathbf{b^-}) + m - \cos(\mathbf{a}, \mathbf{b^+})\} \qquad (9)$
with
$m$
being the margin enforced between positive and negative pairs, a hyperparameter to be tuned -
Triplet logistic loss:
$L(\mathbf{a}, \mathbf{b^+}, \mathbf{b^-}) = \log(1 + \exp[\sigma \cdot (\cos(\mathbf{a}, \mathbf{b^-}) - \cos(\mathbf{a}, \mathbf{b^+}))]) \qquad (10)$
with
$\sigma >0$
being a hyperparameter controlling the shape of the loss-
$\exp(\cdot)$
: small violations are penalized lightly, large violations are penalized much more heavily -
$\log(1 + x)$
: compresses large values into a smaller range, ensuring the loss is bounded and stable -
Advantages
-
a soft margin (in contrast to the hard margin in the hinge loss)
-
a smooth gradient, beneficial for optimization than the hinge loss
-
numerical stability
-
-
-
-
-
ref: Huang, Jui-Ting, et al. Embedding-based retrieval in facebook search. SIGKDD 2020
-
-
Listwise: The model takes one positive sample, and multiple negative samples at each time
-
Training
-
Idea: encourage
$\cos(\mathbf{a}, \mathbf{b^+})$
to be as large as possible, and$\cos(\mathbf{a}, \mathbf{b_1^-}), \cos(\mathbf{a}, \mathbf{b_2^-}), \cdots, \cos(\mathbf{a}, \mathbf{b_n^-})$
to be as small as possible -
With softmax as the activation function
$\cos(\mathbf{a}, \mathbf{b^+})$
-> softmax ->$s^+$
(groud truth label:$y^+ = 1$
)$\cos(\mathbf{a}, \mathbf{b_1^-})$
-> softmax ->$s_1^-$
(groud truth label:$y_1^- = 0$
)$\cos(\mathbf{a}, \mathbf{b_2^-})$
-> softmax ->$s_2^-$
(groud truth label:$y_2^- = 0$
)$$\cdots$$
$\cos(\mathbf{a}, \mathbf{b_n^-})$
-> softmax ->$s_n^-$
(groud truth label:$y_n^- = 0$
)
Cross_entropy_loss
$(y, s) = -y^+ \log s^+ - \sum_{i=1}^n y_i^- \log s_i^- = -\log s_i^+$
-
-
ref: (Youtube) Yi, Xinyang, et al. Sampling-bias-corrected neural modeling for large corpus item recommendations. Recsys 2019
-
-
-
Selecting positive and negative samples
-
A well-chosen selection strategy can have a greater impact than refining the model architecture
-
Positive samples: items from the impressions that the user interacted with
-
Issue: A small number of items account for the majority of the clicks, therefore most of the positive samples are just the popular ones
-
Solution: Oversampling the unpopular items, or downsampling the popular ones
-
-
Negative samples
The goal of the retrieval stage is to distinguish between items the user may or may not be interested in, rather than evaluating the degree of the user’s preference for each item.
-
Simple negative samples: items which were not retrieved
-
Sampling from all the items
Since the retrieval stage only returns a very small fraction of the items, the items not retrieved
$\approx$
all items=> we can simply do sampling from all the items
-
Uniform sampling (❌): unfair to unpopular items
-
Non-uniform sampling (✅)
-
To penalize popular items, the negative sampling rate should be positively correlated with their popularity (e.g., number of clicks)
Sampling rate
$\propto$
(number of clicks)$^{0.75}$
(0.75 is based on heuristics)
-
-
-
In-batch negative sampling
Suppose user_1 clicked item_1, user_2 clicked item_2, …, user_n clicked item_n. We then have n positive pairs:
$$ (\mathbf{a}_1, \mathbf{b}_1), (\mathbf{a}_2, \mathbf{b}_2), \cdots, (\mathbf{a}_n, \mathbf{b}_n) $$
Use the unmatched pairs as negative samples, which leads to n(n-1) simple negative samples:
$$ (\mathbf{a}_i, \mathbf{b}_j), \quad \forall i \neq j $$
-
Issue: The sampling rate
$\propto \text{(number of clicks)}$
instead of$\propto \text{(number of clicks)}^{0.75}$
(over-sampling the popular items as negative samples) -
One possible solution:
Let
$p_i \propto \text{number of clicks}$
be the probability item$i$
being sampledTo estimate the interest of a user on item
$i$
, instead of$\cos(\mathbf{a}, \mathbf{b_i})$
, adjust to$\cos(\mathbf{a}, \mathbf{b_i}) - \log p_i$
during training, and keep$\cos(\mathbf{a}, \mathbf{b_i})$
during online inference (ref)
-
-
-
More difficult samples
-
Items retrieved but filtered out in the 1st stage ranking (difficult)
-
Items with low scores in the 2nd stage ranking (very difficult)
-
-
Typical proportions of negative samples in training data
-
50% from all the items (simple negative samples)
-
50% from the more difficult samples
-
-
Impression without clicks (❌): No click does not necessarily mean no interest. These samples should be used for ranking instead of retrieval. Using these samples in retrieval can degrade performance.
-
Xiaohongshu generally follows the approaches introduced in the two aforementioned references and has achieved excellent results
-
-
-
Online serving
-
Item embeddings are precomputed offline and stored in a database, as item features remain relatively stable
-
User embeddings are typically computed online per request to account for the dynamic nature of user interests
-
-
Model update strategies
-
Full update
-
Trains previous day’s model for 1 epoch using data from that day
-
Requires minimal support for data streams and system resources
-
-
Incremental update
-
Updates model parameters every few minutes using online learning
-
Primarily updates the embedding layer, while other parameters (e.g., in the FC layer) remain fixed
-
-
Combining full and incremental updates (recommended)
-
Full updates use randomly shuffled data from the entire day
-
Incremental updates continuously adapt the model throughout the day by processing data sequentially from morning to evening. However, shorter time intervals can introduce statistical bias
-
-
-
Combining self-supervised learning with two-tower
-
ref: (Google) Yao, Tiansheng, et al. Self-supervised learning for large-scale item recommendations. CIKM 2021 (widely recognized as effective)
-
Idea: The embeddings of the same item should remain similar even after different feature transformations, while embeddings of different items should remain distinct
-
Feature transformation
-
Random mask: randomly selects and masks certain categorical features
-
Dropout: applicable to discrete features with multiple possible values
-
Complementary transformation
-
Masking associated features
- Based on mutual information to selectively mask related attributes
-
-
-
-
- ref 1: (Bytedance) Gao, Weihao, et al. Deep Retrieval: Learning A Retrievable Structure for Large-Scale Recommendations, CIKM 2021 (deployed in production)
similar to
-
Other retrieval channels
-
Geo hash / same-city retrieval
-
Followed author / Interacted author / similar author retrieval
-
Retrieval from cached items
-
Ranking
-
Interaction between users and notes
-
System records num. of impressions / clicks / likes / collections / shares
-
Metrics: the rate of clicks, likes, etc.
-
Click-through rate: #clicks / #impressions
-
Like rate: #likes / #clicks
…
-
-
The ranking model first predicts these rates and then combine them (e.g. weighted average), rank and threshold based on the combined scores
-
-
Multi-task learning
-
Training
-
Loss function: weighted average cross entropy
$$\sum_1^N w_i \cdot y_i \log p_i$$
where
-
$w_i$
: weight of task$i$
-
$y_i$
: true label for task$i$
-
$p_i$
: the prediction for task$i$
-
$N$
: the total number of tasks
-
-
Issues: unbalanced classes
-
The majority of impressions do not result in clicks
-
The majority of clicks do not result in collections
…
-
-
Solution
- Downsampling negative samples => saves computations
-
-
Calibrating prediction
-
Let
$n_+$
and$n_-$
denote the number of positive and negative samples, respectively$(n_- \gg n_+)$
. With$\alpha \in (0, 1)$
being the downsampling rate for negative samples, we use$\alpha \cdot n_-$
negative samples for training. The predicted CTR will be greater than the actual CTR due to the smaller number of negative samples used -
Expectation of real CTR (before negative downsampline):
$\displaystyle \mathbb{E}[p_{true}] = \frac{n_+}{n_+ + n_-}$
-
Expectation of predicted CTR:
$\displaystyle \mathbb{E}[p_{pred}] = \frac{n_+}{n_+ + \alpha \cdot n_-}$
=> Calibrated
$\displaystyle \mathbb{E}[p_{true}] = \frac{\alpha \cdot p_{pred}}{(1 - p_{pred}) + \alpha \cdot p_{pred}}$
-
ref: He, Xinran, et al. Practical lessons from predicting clicks on ads at facebook, ADKDD 2014
-
-
-
Multi-gate Mixture-of-Experts (MMoE)
A variant of the mixture-of-experts (MoE) model designed to improve multi-task learning. It enhances traditional MoE by using multiple gating networks instead of a single shared gate.
-
Architecture
-
Input features from multiple aspects (users, items, etc.) are concatenated and fed into specialized ‘experts’ neural networks (FC networks, same architecture w/o sharing parameters)
-
Task-specific gating networks learn to assign weights (output of softmax layer) to the experts, enabling a weighted average output tailored to each specific task
-
-
Issue: polarization
-
Outputs from softmax function (weights): one element
$\rightarrow1$
, all others$\rightarrow0$
=> only using one expert => same as one multi-task NN -
Solution: dropout on the output of softmax
- The probability of a value output from the softmax being masked is
$p$
=> the probability of abandoning an ‘expert’ being$p$
- The probability of a value output from the softmax being masked is
-
ref 1: (Google) Ma, Jiaqi, et al. Modeling Task Relationships in Multi-task Learning with Multi-gate Mixture-of-Experts, KDD 2018
-
ref 2: (Google) Zhao, Zhe, et al. Recommending What Video to Watch Next: A Multitask Ranking System, RecSys 2019
-
-
-
Video
-
metrics: time, complete rate, in addition to for text-based metrics (e.g. click, like)
-
ref: Covington, Paul, et al. Deep Neural Networks for YouTube Recommendations, RecSys 2016 (best in practice, needs adjustment)
-
Modeling video watch time
cross entropy loss p = sigmoid(z) <-> y=t/(1+t) metric1 metric2 ... / prediction prediction z | | | | FC layer FC layer FC layer FC layer /|\ | Neural network /|\ | [User features; Video features; Statistical features; ...]
-
Let
$z$
be the output of the last FC layer. Let$p = sigmoid(z) = \frac{z}{1 + exp(z)}$
-
Let
$t$
be the observed actual watch time ($t = 0$
if no click happens) -
Training: minimize cross-entropy loss
$$\displaystyle - \frac{t}{1+t} \log p - \frac{1}{1+t} \log (1-p) $$
-
Inference:
$exp(z)$
as the prediction of watch time
-
-
Modeling video complete rate
-
Method 1: regression
- Video length=10min, actual watch time=4min => actual complete rate
$y = 0.4$
Let
$p$
be the prediction$$\text{loss} = y \log p + (1 - y) \log (1-p)$$
- Video length=10min, actual watch time=4min => actual complete rate
-
Method 2: binary classification
- e.g. > 80% as positive samples (1), otherwise negative (0)
-
-
Incoporating the predicted complete rate into the combination of predictions
-
Should not directly incorporate the score
-
Longer videos are less likely to complete than shorter videos => unfair to longer videos
-
Need to adjust the score, e.g.
$$p_{finish} = \frac{\text{predicted complete rate}}{f(\text{video length})}$$
where
$f(\text{video length})$
is negatively associated with video length
-
-
-
Features of ranking model
-
User profile (Mostly static)
-
Item profile (Static)
-
Statistical features (Dynamic)
In the recent 7 days / 1 day / 1 hour
- User behavior: number of impressions, clicks, likes, …
- Item behavior: number of impressions / clicks, bucketing based on gender / age / geography
-
Context
- GeoHash, city, time (bucketing / quantization), device information (model, OS)
-
Feature preprocessing
-
$\log(1+x)$
transform to reduce skewness in heavily tailed distributions (e.g. impressions, clicks) -
Smoothing to stabilize fluctuations caused by randomness
-
-
-
-
1st stage ranking
-
Related
-
2nd stage ranking (early feature fusion)
-
Input the concatenated features into one NN (shared bottom, large and complicated) -> input to multiple heads (FC layer + sigmoid) for different tasks
- Heavy online computations: score n notes => n inferences
-
-
Two-tower for retrieval (late feature fusion)
-
User features & item features are input to different NNs respectively; only compute similary at the last stage
-
Light online computations: online inference only needs user tower for one inference for a user, item embeddings are precomputed and saved in vector database without requiring online inference
-
Less precise than early feature concatenation, suitable for retrieval
-
-
-
-
Three-tower model for 1st stage ranking
-
ref: (Alimama) Wang, Zhe, et al., COLD: Towards the next generation of pre-ranking system, DLP-KDD 2020 (adapted version used by rednote, performance is between retrieval and 2nd stage ranking)
-
Architecture
-
User tower (very large)
-
Input: user features, context features
-
One user => one inference
-
-
Item tower (large)
-
Input: item features (static)
-
n items => n inferences theoretically. However, can cache embeddings on parameter server to save most of the online inferences
-
-
Interaction tower (small)
-
Input: statistical features of users and items (dynamic), and interations between user & item features
-
n items => n inferences
-
typically 1 layer & of small size
-
=> output embeddings concatenation & cross
=> multiple heads (FC layer + sigmoid) for different tasks (n items => n inferences)
-
-
-
Feature interaction
-
Deep & cross networks (DCN V2)
-
ref: Wang, Ruoxi, et al. DCN V2: Improved Deep & Cross Network and Practical Lessons for Web-scale Learning to Rank Systems, WWW 2021 (applicable for retrieval and ranking, widely adopted in industry)
-
Cross layer
$$\mathbf{x}_{i+1} = \mathbf{x_0} \circ (\mathbf{W}_i \mathbf{x}_i + \mathbf{b}_i ) + \mathbf{x}_i $$
where
-
$\mathbf{x}_{0}$
is the very first input to the NN,$\mathbf{x}_{i}$
is the input to the$i$
th cross layer layer -
$\mathbf{W} \mathbf{x}_i + \mathbf{b}$
is the FC layer,$\mathbf{W}$
and$\mathbf{b}$
are the parameters -
$\circ$
: Hadamard product -
Skip connection (
$+\mathbf{x}_i$
): similar to Resnet, helps prevent gradient vanishing
-
-
-
DCN v2 applications in multi-tower architectures
-
User tower & item tower: both can incorporate DCN v2
-
Multi-task learning
-
The shared bottom in multi-tasking learning for ranking
-
The expert networks in MMoE
-
-
-
LHUC: Learning Hiden Unit Contributions (effective in industry, but only for 2nd stage ranking)
-
Originally used for speech classification, Kuaishou applied it to 2nd stage ranking (PPNet)
speech -> FC -> vec1 signal layer \ /item Hadamard -> vec3 -> FC layer -> vec4 -> Hadamard -> final output features product product / /|\ spaker -> NN --> vec2 | features ---------------> NN -------------------------> vec5 --------- /user features NN: multiple FC layers -> 2 * sigmoid -> output
-
SENet (for ranking)
-
ref: Hu, Jie, et al, Squeeze-and-Excitation Networks, TPAMI 2018
-
SENet: field-wise weighting for discrete feature embeddings
- field: a group of related features that share the same semantic meaning
|-----------------------------------------------------| /|\ \|/ feature --> avg --> FC+ReLU --> FC+sigmoid --> row-wise multiplication embedding pooling (m/r,1) (m,1) (m,k) (m*k) (m,1) (r > 1) (values in [0,1])
- embeddings dimensions (
$k$
) can be different (i.e.$k_1, k2$
, …)
-
-
Bilinear Cross (for ranking)
-
Type 1:
$ f_{ij} = \mathbf{x}^T_{i} \cdot \mathbf{W}_{ij} \cdot \mathbf{x}_j$
-
$m$
fields,$m^2$
real-valued measure of interactions -
$m$
fields,$m^2 / 2$
$\mathbf{W}_{ij}$
parameter matrices => can only be applied to selected features
-
-
Type 2:
$ \mathbf{f}_{ij} = \mathbf{x}_{i} \circ (\mathbf{W}_{ij} \cdot \mathbf{x}_j)$
$m$
fields,$m^2$
vectors denoting interactions
-
-
FiBiNET (SENet + Bilinear Cross)
-
Application in RedNote
-
SENet & Bilinear cross improved performance of RedNote recsys
-
Rednote applied a customized version of FiBiNet, incorporating substantial changes in model architecture and bilinear cross and obtained improved 2nd stage ranking
-
User behavior
Last-N feature (user feature)
-
ref: Covington, Paul, et al. Deep Neural Networks for YouTube Recommendations, RecSys 2016
-
Boosts performance of retrieval and ranking. Applicable for two-tower for retrieval, three-tower for 1st stage ranking and 2nd stage ranking
-
Last N items the user interacted with => embeddings of the N items => average (one embedding)
- The averaged embedding indicates the interest of the user (simple average)
-
Xiaohongshu computes an averaged embedding for Last N clicks, likes, collections of item IDs (and others) respectively => concatenate as a user feature
-
Simple average
-
Works for two/ three-tower model
-
Use the average of Last-N embeddings as input to the user tower
-
-
DIN (Deep Interest Network)
-
ref: (Alibaba) Zhou et al. Deep interest network for click-through rate prediction, KDD 2018
-
Weighted average instead of simple average
-
Weights: the similarity of the candidate item and each of the user’s Last-N items
-
In essense: attention
-
-
Attention does not work for user tower in two/ three-tower model
- requires Last-N + candidate items, but user tower does not have the candidate item
-
Computation is positively associated with the length of the user behavior sequence
=> Can only work for short-term interest
-
-
SIM (Search-based Interest Model)
-
ref: (Alibaba) Qi et al. Search-based User Interest Modeling with Lifelong Sequential Behavior Data for Click-Through Rate Prediction, CIKM 2020 (recognized as effective in the industry, also used by Rednote)
-
Can keep the long history of a user (
$n \rightarrow $
thousands) -
For every candidate item, do quick look-up in the Last-N records of the user
-
Last-N -> Top-K => input to attention layer
-
Computation reduces from N to K
-
-
Step 1: Look-up
-
Method 1: Hard search
-
Only keep Last-N items that are in the same category as the candidate item
-
Fast, simple, no need for training
-
-
Method 2: Soft search
-
Items => embeddings
-
Candidate item as the query, keep the closest k items in Last-N using KNN
-
Better AUC, but more complicated. More demanding for infra
-
-
-
Step 2: Attention
-
Candidate item embedding & Top K interacted items embeddings => input to attention layer
-
The use of time information
-
$\delta$
: The time interval between the moment of user interaction and the current moment -
Descretize
$\delta$
into predefined intervals (e.g. recent 1, 7, 30 days, etc.) and compute the corresponding embedding$\mathbf{d}$
from it. Concatenate$\mathbf{d}$
with the item embedding -
Due to the possible long sequence used in SIM, time information is essential to capture relavance and temporal dynamics
-
-
-
-
Summary
-
Long sequences (long-term interest) better than short sequences (short-term interest)
-
Attention is better than simple average
-
Soft search v.s. hard search? depends on infra
-