User representation modeling is a critical component of personalized recommendation systems, and making it both representative and scalable is a key challenge frequently discussed in ML system design interviews. At WWW '24, Meta published an insightful paper [1] on their online user modeling framework, which presents valuable design discussions and considerations. In this post, I'll deep-dive into this work using the Machine Learning System Design structure to help us refine our user modeling designs, whether for interviews or real-world applications.
Problem Formulation
Design a large-scale user representation system for a social media platform with billions of users. The system should support various downstream personalized recommendation and ranking models.
Problem Clarification
- Scale and scope
- User scale: the system needs to serve billions of users, including hundreds of millions of daily active users (DAU).
- Downstream task scale: support a broad range of personalization models, such as ads ranking, feed recommendations, and other content-driven recommendations.
- Quality requirements
- Generate representative and fresh user embeddings to ensure high ranking quality in downstream tasks
- Embeddings should be generalizable and consistent across multiple downstream tasks to maintain cohesive personalization efforts.
- Feature dynamics
- User features are expected to change frequently and significantly over time
- Latency constraints
- The system should be able to handle requests with a serving latency of 30ms or less.
- Offline vs. online
- Given the dynamic nature of user features, the system should support online inference
Objectives
ML Objectives
- Inputs: vast amounts of user behavior data and profile features collected from the platform
- Outputs: user embeddings that support hundreds of downstream ranking and recommendation models
- Embedding quality: ensure high-quality embeddings that improve the performance of downstream ranking tasks.
- Embedding freshness: maintain up-to-date user embeddings that reflect the most recent user behavior and feature changes
Non-functional Objectives
- Latency: support low-latency inference within the ~30ms budget, or "latency-free" inference that takes model computation off the request's critical path
- Infrastructure efficiency: Optimize storage, training throughput, and memory usage to ensure that the system can handle large-scale production loads effectively.
Architecture
The high-level architecture follows an upstream-downstream paradigm, as illustrated in the following diagram:
- Upstream:
- Train a small number of large-scale upstream user models with sophisticated architectures under diverse supervision signals, such as clicks and conversions.
- The user models process a vast amount of user-side signal (features) to synthesize compact user embeddings (representations).
- The primary goal is to centralize and share user representations across multiple downstream models
- Online inference:
- Apply an Online Asynchronous Platform that enables "latency-free", asynchronous serving, keeping embeddings fresh while meeting the latency budget regardless of model complexity
- Downstream:
- Downstream models consume the embeddings generated by the upstream models as input features
- By utilizing these pre-learned embeddings, downstream models can focus on optimizing their specific tasks without needing to handle raw user data directly
Data and features
The system processes an extensive range of user-side input features collected from the platform, which fall into two types:
- Dense Features: numerical or continuous variables, such as the frequency of clicks on a particular page.
- Sparse Features: categorical or binary variables with high cardinality, such as user ID and page ID.
Both types of features are passed through their respective embedding layers to generate input embeddings, which are then fed into the user models.
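To make this concrete, here's a minimal PyTorch sketch of such input embedding layers. The class name, feature cardinalities, and embedding dimension are illustrative assumptions, not details from the paper:

```python
# A minimal sketch of input embedding layers for dense and sparse features,
# assuming PyTorch; all names and dimensions are illustrative.
import torch
import torch.nn as nn

class InputEmbeddings(nn.Module):
    def __init__(self, num_dense, sparse_cardinalities, dim=64):
        super().__init__()
        # Dense features: project the numeric vector into the embedding space.
        self.dense_proj = nn.Sequential(nn.Linear(num_dense, dim), nn.ReLU())
        # Sparse features: one embedding table per high-cardinality field
        # (e.g., user ID, page ID).
        self.sparse_tables = nn.ModuleList(
            nn.Embedding(card, dim) for card in sparse_cardinalities
        )

    def forward(self, dense_x, sparse_ids):
        # dense_x: (batch, num_dense); sparse_ids: (batch, num_sparse_fields)
        embs = [self.dense_proj(dense_x)]
        embs += [table(sparse_ids[:, i]) for i, table in enumerate(self.sparse_tables)]
        return torch.stack(embs, dim=1)  # (batch, num_fields, dim)
```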
Model and training
The model is built upon the widely recognized DLRM architecture [2] and can be split into two components:
User tower
The user tower is a pyramid-like structure of successive Interaction Modules that progressively compress the input features.
- Interaction Modules: the core of the user tower within the upstream models. Each interaction module is composed of various feature extractors running in parallel, which capture both low-level and high-level feature interactions with different techniques, such as:
- Multi-Layer Perceptrons (MLPs)
- Dot Compression with Attention
- Deep Cross Networks (DCN)
- MLP-Mixers
- Residual Connections: used to retain information from previous layers, ensuring that important features are not lost during compression
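Below is a hedged PyTorch sketch of one interaction module. It uses only an MLP and pairwise dot-product interactions as stand-in extractors (the paper also uses DCN and MLP-Mixers), and all names and dimensions are assumptions:

```python
# A sketch of one interaction module: parallel feature extractors whose
# outputs are concatenated, compressed to fewer "fields" (the pyramid),
# and combined with a residual connection to retain earlier information.
import torch
import torch.nn as nn

class InteractionModule(nn.Module):
    def __init__(self, num_fields, dim, out_fields):
        super().__init__()
        flat = num_fields * dim
        self.mlp = nn.Sequential(nn.Linear(flat, flat), nn.ReLU())  # MLP extractor
        # Compress concatenated extractor outputs into fewer fields.
        self.compress = nn.Linear(2 * flat + num_fields * num_fields, out_fields * dim)
        self.residual = nn.Linear(flat, out_fields * dim)
        self.out_fields, self.dim = out_fields, dim

    def forward(self, x):                        # x: (batch, num_fields, dim)
        flat = x.flatten(1)
        mlp_out = self.mlp(flat)                 # MLP extractor output
        dots = torch.bmm(x, x.transpose(1, 2))   # pairwise dot interactions
        features = torch.cat([flat, mlp_out, dots.flatten(1)], dim=1)
        out = self.compress(features) + self.residual(flat)  # residual keeps prior info
        return out.view(-1, self.out_fields, self.dim)
```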
Mixed tower
The mixed tower adopts a Deep and Hierarchical Ensemble Network (DHEN) architecture [3] to combine all other features with the user embeddings produced by the user tower. Notably, the mixed tower does not consume user features directly, which serves two key objectives:
- Enhancing upstream training throughput
- Encouraging the model to refine user representations primarily through the user tower
Model training
The model is trained using a multitask cross-entropy loss:

$$\mathcal{L} = \sum_{t=1}^{T} w_t \cdot \mathrm{CE}\left(y_t, \hat{y}_t\right)$$

where $w_t$ is the weight of task $t$, denoting the relative importance of each task in the final loss, and $\mathrm{CE}(y_t, \hat{y}_t)$ is the cross-entropy between the labels and predictions of that task.
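A minimal sketch of this loss in PyTorch, assuming binary tasks (e.g., click, conversion) and an illustrative task-weight vector:

```python
# A minimal sketch of a multitask weighted cross-entropy loss; the function
# name and tensor layout are illustrative assumptions.
import torch
import torch.nn.functional as F

def multitask_loss(logits, labels, task_weights):
    # logits, labels: (batch, num_tasks); labels are float {0., 1.}
    # task_weights: (num_tasks,) relative importance of each task
    per_task = F.binary_cross_entropy_with_logits(
        logits, labels, reduction="none"
    ).mean(dim=0)                       # mean cross-entropy per task: (num_tasks,)
    return (task_weights * per_task).sum()
```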
The model is trained offline in a recurring manner, which allows it to retain historical patterns while incrementally adapting to evolving user preferences.
Model serving
Serving architecture
Given the frequent changes in user features within the user tower and the system's tight inference budget, we design the Online Asynchronous Platform following an Async Serving paradigm.
The serving workflow operates as follows:
- 0: the downstream model receives a user request
- W1: the Compute Center computes the current embedding using the latest snapshot
- W2: the embedding is written to the Feature Store
- R1: the Client immediately reads the previous embeddings for this user from Feature Store
- R2, R3: the Client forwards the average pooled previous embeddings to the downstream model, without waiting for the Compute Center’s inference.
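To make the read/write split concrete, here is a minimal sketch. The FeatureStore interface (read/append) and the thread-based write path are assumptions for illustration, not Meta's implementation:

```python
# A minimal sketch of async serving: the read path returns pooled, previously
# computed embeddings immediately, while the write path recomputes the
# embedding in the background. The FeatureStore API is hypothetical.
import threading

class AsyncEmbeddingServer:
    def __init__(self, feature_store, user_model, history_len=3):
        self.store = feature_store      # key-value store of user embeddings
        self.model = user_model         # latest upstream user-tower snapshot
        self.history_len = history_len  # number of past versions to pool

    def on_request(self, user_id, user_features):
        # Write path (W1/W2): refresh the embedding asynchronously;
        # the request does not wait for this computation.
        threading.Thread(
            target=self._recompute, args=(user_id, user_features), daemon=True
        ).start()
        # Read path (R1-R3): immediately return the average-pooled
        # previously computed embeddings (assumed non-empty here).
        history = self.store.read(user_id, last_n=self.history_len)
        return sum(history) / len(history)

    def _recompute(self, user_id, user_features):
        embedding = self.model(user_features)   # expensive inference (W1)
        self.store.append(user_id, embedding)   # persist for future reads (W2)
```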
Trade-offs
The writing and reading flows run in parallel, separating the expensive, latency-intensive write path from the read path. This async workflow trades a small amount of real-time freshness for greater scaling potential, while still providing fresher embeddings than traditional offline inference.
Embedding distribution shifts
As model training recurs, user embeddings computed from the same inference input change over time. This leads to a potential mismatch between embeddings used during downstream training and those used during inference. Two mitigation strategies are available:
- Ensure a new version of the embedding features has been seen by downstream models during training before it is used in inference. However, this increases complexity and dependencies across the whole system.
- Reduce the difference between new and old embeddings. This can be done with several approaches (see the pooling sketch after this list):
- regularization
- average pooling
- distillation
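For example, average pooling across the last few embedding snapshots is nearly a one-liner; the snapshot list interface is an assumption:

```python
# A minimal sketch of average pooling across recent embedding versions
# to smooth distribution shift between model retrainings.
import torch

def pooled_embedding(snapshots):
    """snapshots: list of (dim,) tensors from the last N model versions."""
    return torch.stack(snapshots).mean(dim=0)
```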
Other scale up strategies
- Reduce memory usage
- Lower the dimensionality and quantity of generated user embeddings.
- Quantize embeddings from fp32 to fp16 to save memory.
- Leverage Distributed Inference (DI) for scaling up the user tower.
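A minimal sketch of the fp32-to-fp16 quantization step, assuming PyTorch tensors; the function names are illustrative:

```python
# Quantize embeddings to fp16 before writing to the feature store, halving
# embedding storage at a small precision cost; upcast on read if needed.
import torch

def quantize_for_storage(embedding: torch.Tensor) -> torch.Tensor:
    return embedding.to(torch.float16)   # store as fp16

def dequantize_for_serving(stored: torch.Tensor) -> torch.Tensor:
    return stored.to(torch.float32)      # upcast for downstream consumption
```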
Evaluation metrics
- Offline metrics
- Normalized Entropy (NE): the model's average log loss normalized by the entropy of the background CTR, measuring prediction accuracy; lower is better (see the sketch after this list).
- Feature Importance Ranking: evaluate the significance of generated embeddings in downstream models
- Online metrics
- Business metrics: the online performance of generated embeddings can be evaluated through the business metrics of downstream tasks. Meta's products use metrics tailored to their objectives, so we won't expand on them here.
- Model freshness: to assess the impact of model freshness on performance, conduct online A/B testing experiments comparing different serving strategies.
- Latency consideration: since the system employs asynchronous serving, model inference is off the request's critical path, so latency can be excluded from the online evaluation metrics.
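For reference, NE divides the model's average log loss by the log loss of a baseline that always predicts the average empirical CTR. A minimal NumPy sketch (the function name is illustrative):

```python
# A minimal sketch of Normalized Entropy (NE): average log loss normalized
# by the entropy of the background positive rate; lower is better.
import numpy as np

def normalized_entropy(labels, preds, eps=1e-12):
    labels = np.asarray(labels, dtype=float)
    preds = np.clip(np.asarray(preds, dtype=float), eps, 1 - eps)
    log_loss = -np.mean(labels * np.log(preds) + (1 - labels) * np.log(1 - preds))
    p = np.clip(labels.mean(), eps, 1 - eps)    # background CTR
    baseline = -(p * np.log(p) + (1 - p) * np.log(1 - p))
    return log_loss / baseline
```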
Topics we didn’t cover
In the context of a Machine Learning System Design (MLSD) interview, we typically don’t focus on experiments and evaluation results. However, to better understand the effectiveness and trade-offs of the proposed designs, here’s a summary of the related discussions from the paper:
- Offline Evaluation:
- Higher gains were observed in Click-Through Rate (CTR) tasks since the embeddings were trained on CTR data.
- Smaller gains were noted for Instagram models due to domain discrepancies.
- For smaller models, the benefits of large-scale user representation sharing were more pronounced.
- Asynchronous Serving:
- Asynchronous serving outperformed both frozen models and offline serving.
- The gap between real-time and asynchronous serving was small, with only about a 10% average performance loss.
- Embedding Distribution Shift:
- Averaging across three embeddings significantly reduced embedding shift and improved performance compared to relying on the most recent embedding alone.
Missed topics in this paper
While the paper primarily focuses on scalability and generalization, some frequent MLSD interview topics weren’t covered in depth, including:
- Effective modeling of user behavior histories.
- Handling cold-start problems and new users.
- Monitoring and conducting online experiments.
- …
We’ll explore these topics in greater detail in upcoming blog posts, diving into other relevant works!
Have questions or ideas related to this deep-dive? Are there any other interesting areas you’d like to explore? Let’s continue the discussion in the comments!
References
- [1] Zhang, W., Li, D., Liang, C., Zhou, F., Zhang, Z., Wang, X., Li, R., Zhou, Y., Huang, Y., Liang, D. and Wang, K., 2024. Scaling User Modeling: Large-scale Online User Representations for Ads Personalization in Meta. In Companion Proceedings of the ACM on Web Conference 2024 (pp. 47-55).
- [2] Naumov, M., Mudigere, D., Shi, H.J.M., Huang, J., Sundaraman, N., Park, J., Wang, X., Gupta, U., Wu, C.J., Azzolini, A.G. and Dzhulgakov, D., 2019. Deep learning recommendation model for personalization and recommendation systems. arXiv preprint arXiv:1906.00091.
- [3] Zhang, B., Luo, L., Liu, X., Li, J., Chen, Z., Zhang, W., Wei, X., Hao, Y., Tsang, M., Wang, W. and Liu, Y., 2022. DHEN: A deep and hierarchical ensemble network for large-scale click-through rate prediction. arXiv preprint arXiv:2203.11014.