In many ML design scenarios, challenges around model supervision are common. These include issues such as noisy labels, limited labeled data, or labels that are only indirectly related to the target tasks. Demonstrating the ability to apply self-supervised and semi-supervised learning techniques to tackle these problems can significantly differentiate candidates by showcasing data-efficient and robust model designs. Among these techniques, contrastive learning stands out as one of the most powerful tools.
In this post, we will explore practical strategies for applying contrastive learning in the context of Machine Learning System Design (MLSD), providing both depth and breadth to help you prepare for ML interviews.
Contrastive Learning One-liner
In one sentence,
Contrastive learning is a training approach that teaches models to group similar data points together while pushing dissimilar ones apart.
By learning the similar/dissimilar relationships, models can better capture meaningful data patterns and representations that can be transferred to downstream tasks.
When Can We Use Contrastive Learning?
We should consider contrastive learning in scenarios such as the following:
- Many ML System Design problems require building embeddings for users, documents, videos, or products. In other cases, we may need to design inference-efficient systems that serve lightweight classifiers on top of pre-computed embeddings.
- When suitable pre-trained encoder models aren’t available, or when domain-specific encoders are preferred, contrastive learning can help by identifying contrastive signals in the available data and training custom encoders.
- When dealing with large datasets that lack explicit labels, such as using purchase history to build a product search model, contrastive learning enables the model to learn from unsupervised or indirect data relationships. This reduces the dependency on labeled data while still producing effective, high-performing models.
- Contrastive learning can be critical when designing distance-based or similarity-based models, such as k-NN or retrieval systems, by training the underlying tower or representation models.
A Recipe for Contrastive Learning in MLSD
How do we design a contrastive learning algorithm tailored to our needs? While there are many contrastive learning variants across different tasks, we can follow a unified framework to create a customized version. Below, we introduce the core design recipe and discuss the key ingredients to adapt it to specific scenarios.
Defining Contrastive Relationships
The starting point is to define positive (similar) and negative (dissimilar) data points. These relationships can be derived from the following sources (a small sketch of the supervised case follows the list):
- Supervised labels: Data points with the same labels are positives, while those with different labels are negatives.
- Weak/self-supervision: Implicit signals such as user behavior, hashtags on posts, or videos and their captions can serve as cues for contrastive relationships.
- Data augmentation: Even when no direct supervision is available, we can generate contrastive positive pairs by augmenting the data. We will discuss this in the next section.
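As a concrete illustration of the supervised case, here is a minimal sketch (PyTorch; the tensor names are illustrative) that derives positive/negative pair masks from class labels within a batch:

```python
import torch

def contrastive_masks(labels: torch.Tensor):
    """Given integer class labels for a batch, mark which pairs are positives
    (same label) and which are negatives (different label)."""
    same_label = labels.unsqueeze(0) == labels.unsqueeze(1)              # [B, B]
    self_pairs = torch.eye(len(labels), dtype=torch.bool, device=labels.device)
    positive_mask = same_label & ~self_pairs   # exclude trivial (i, i) pairs
    negative_mask = ~same_label
    return positive_mask, negative_mask

# Example: a batch of 5 samples covering 3 classes
pos, neg = contrastive_masks(torch.tensor([0, 1, 0, 2, 1]))
```

The same masks can drive weak-supervision setups as well, e.g., by treating co-engaged items or a video and its caption as sharing a pseudo-label.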
Data augmentation & sampling
Building meaningful positive and negative samples is essential in contrastive learning. Below are the key points to consider when designing the data sampling approach:
In the absence of explicit positive supervision, we generate ‘similar’ data by adding noise to the original anchor samples. The noise should modify non-essential aspects of the data while preserving the semantics we aim to capture (see the augmentation sketch after this list). For example:
- Text data: Back translation, token masking, or embedding dropout.
- Image data: Cropping, color distortion, or patch sampling.
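For instance, a SimCLR-style two-view augmentation pipeline for images might look like the sketch below (torchvision transforms; the crop size and jitter strengths are illustrative, not the published settings):

```python
import torchvision.transforms as T

# Each call applies a fresh random perturbation, so running it twice on the
# same image yields a positive pair with the same underlying semantics.
augment = T.Compose([
    T.RandomResizedCrop(224),
    T.RandomHorizontalFlip(),
    T.RandomApply([T.ColorJitter(0.8, 0.8, 0.8, 0.2)], p=0.8),
    T.RandomGrayscale(p=0.2),
    T.GaussianBlur(kernel_size=23),
    T.ToTensor(),
])

def two_views(pil_image):
    return augment(pil_image), augment(pil_image)
```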
To train strong models, we need challenging negative samples: those whose embeddings are close to the anchor sample, sometimes even closer than the positive ones. Techniques for emphasizing hard negatives include (a simple reweighting sketch follows this list):
- Importance sampling, where we increase the sampling probability of hard negatives.
- Filtering out very easy negatives.
- Increasing the loss weight on hard negatives.
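One simple way to emphasize hard negatives, sketched below, is to reweight each negative's loss contribution by its similarity to the anchor (a softmax-style importance weight; `beta` is an illustrative knob, not a standard hyperparameter name):

```python
import torch

def hard_negative_weights(anchor, negatives, beta=1.0):
    """anchor: [D], negatives: [N, D]; both assumed L2-normalized.
    Returns per-negative weights (summing to N) that grow with similarity
    to the anchor, so harder negatives contribute more to the loss."""
    sims = negatives @ anchor                              # [N] cosine similarities
    return torch.softmax(beta * sims, dim=0) * negatives.size(0)
```

Setting `beta = 0` recovers uniform weighting; larger values focus the loss on the hardest negatives.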
Because labels are unknown at sampling time, randomly sampled “negatives” may actually be false negatives. If prior information about the positive/negative ratio is available, we can debias the objective by correcting the negative term with additional positive samples (details in [1]; a rough sketch follows).
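A rough sketch of the debiasing idea in [1], assuming the class prior `tau_plus` (the probability that a randomly drawn "negative" is actually a positive) is known:

```python
import math
import torch

def debiased_infonce(pos_sim, extra_pos_sims, neg_sims, tau_plus, t=0.5):
    """pos_sim: similarity(anchor, positive), scalar tensor.
    extra_pos_sims: [M] similarities to additional positive samples.
    neg_sims: [N] similarities to sampled (possibly false) negatives.
    tau_plus: assumed prior probability that a random sample is a positive."""
    n = neg_sims.numel()
    pos = torch.exp(pos_sim / t)
    # Correct the negative term by subtracting the estimated false-negative mass.
    ng = (torch.exp(neg_sims / t).mean()
          - tau_plus * torch.exp(extra_pos_sims / t).mean()) / (1.0 - tau_plus)
    ng = torch.clamp(ng, min=math.e ** (-1.0 / t))   # numerical floor used in [1]
    return -torch.log(pos / (pos + n * ng))
```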
Encoders
The encoders are the main output of contrastive learning, converting data samples into low-dimensional embeddings. The choice of encoder depends on the input modality and task (e.g., BERT for sentence encoding, ViT for vision tasks).
Contrastive loss
The main intuition behind contrastive losses is to penalize the distance between an anchor and its positives while rewarding the distance to its negatives. Based on the contrastive units and the positive/negative samples they consider, some commonly used contrastive losses are summarized below (a minimal NT-Xent sketch follows the list):
- Pairwise loss [2]: minimizes the embedding distance between two samples from the same class and pushes it above a margin for samples from different classes.
- Triplet loss [3]: pushes the anchor-positive distance plus a margin to be smaller than the anchor-negative distance.
- N-pair loss [4]: extends the triplet setup to N-1 negative samples and 1 positive sample per anchor.
- InfoNCE (Noise Contrastive Estimation) [5]: uses a softmax loss to identify the positive sample among a set of noise (negative) examples.
- NT-Xent [6]: InfoNCE with cosine similarity on normalized embeddings; a temperature parameter controls the relative importance of the distances between point pairs.
- Soft-Nearest-Neighbors loss [7]: extends to arbitrary numbers of positive (M) and negative (N) examples.
- Lifted Structured loss [8]: considers all pairwise edges within a training batch.
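As a concrete reference point, here is a minimal in-batch NT-Xent sketch (two views per example, in-batch negatives; a simplified version of the SimCLR formulation, not the official implementation):

```python
import torch
import torch.nn.functional as F

def nt_xent(z1, z2, temperature=0.5):
    """z1, z2: [B, D] embeddings of two augmented views of the same batch.
    Each anchor's positive is its counterpart in the other view; the remaining
    2B - 2 embeddings serve as in-batch negatives."""
    z = F.normalize(torch.cat([z1, z2], dim=0), dim=1)        # [2B, D]
    sim = z @ z.t() / temperature                             # cosine similarities
    mask = torch.eye(sim.size(0), dtype=torch.bool, device=sim.device)
    sim = sim.masked_fill(mask, float("-inf"))                # drop self-similarity
    b = z1.size(0)
    targets = torch.cat([torch.arange(b) + b, torch.arange(b)]).to(sim.device)
    return F.cross_entropy(sim, targets)
```

The other losses above differ mainly in how many positives and negatives enter this computation and in how distances are aggregated.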
Model training
Because contrastive learning relies on weaker supervision, training needs to be data-efficient: each update should expose the model to as many examples as possible, especially negative examples. Commonly used strategies include training with large batches (more in-batch negatives) and maintaining a memory bank of embeddings from past batches.
Momentum Contrast (MoCo[11]) builds upon the memory bank concept by introducing a more stable and efficient way of maintaining negative examples during contrastive learning. While a traditional memory bank stores embeddings from past batches, MoCo enhances this by using a momentum-updated encoder to generate the embeddings.
In MoCo, two models are used: a query encoder and a key encoder. The query encoder is updated normally with back-propagation, while the key encoder is updated more slowly, using a momentum mechanism. This slow update helps maintain a consistent set of embeddings in the memory bank, ensuring that negative samples evolve gradually rather than shifting drastically between iterations. This stability improves the quality of the contrastive learning process by providing a more reliable pool of negative examples, leading to better generalization and more robust feature representations.
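A minimal sketch of the two MoCo ingredients described above, the momentum update of the key encoder and a FIFO queue of negative keys (simplified; parameter names are illustrative):

```python
import torch

@torch.no_grad()
def momentum_update(query_encoder, key_encoder, m=0.999):
    """Move the key encoder's parameters slowly toward the query encoder's."""
    for q_param, k_param in zip(query_encoder.parameters(), key_encoder.parameters()):
        k_param.data.mul_(m).add_(q_param.data, alpha=1.0 - m)

@torch.no_grad()
def enqueue_dequeue(queue, new_keys):
    """FIFO memory bank of shape [K, D]: append the newest keys, drop the oldest."""
    return torch.cat([queue, new_keys], dim=0)[new_keys.size(0):]
```

In each step, keys for the current batch come from the key encoder, are contrasted against the queue with an InfoNCE-style loss, and are then pushed into the queue.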
Notable Contrastive Learning Works Using This Recipe
To demonstrate how our contrastive learning recipe can be applied to real-world scenarios, let’s review a few well-known contrastive learning models. These examples highlight how different models have adopted and adapted the core principles of contrastive learning to solve specific tasks.
SimCLR
- Problem: learn visual representations without labeled data.
- Contrastive Relationships: Positive pairs are created from the same image with different data augmentation operations. Other pairs are treated as negative pairs.
- Data Augmentation & Sampling
- Data augmentation: image augmentations including random crop, resize with random flip, color distortions, and Gaussian blur.
- Negative data sampling: in-batch negatives
- Encoders: ResNet encoders are used due to their effectiveness in image representation
- Contrastive Loss: NT-Xent, which scales well to large batches of in-batch negatives and uses a temperature to sharpen the distinction between positive and negative pairs.
- Model Training: large batch sizes (a training-step sketch follows this block).
- Paper: Ting Chen et al. “A Simple Framework for Contrastive Learning of Visual Representations” ICML 2020.
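Putting the recipe together, a single SimCLR-style training step might look like the sketch below (it reuses the hypothetical `nt_xent` helper from the loss section and assumes the backbone is a ResNet with its classification head removed; the small MLP projection head follows the paper's design):

```python
import torch.nn as nn

class SimCLRModel(nn.Module):
    """Backbone + projection head; the loss is computed on the projected
    embeddings, while the backbone output is reused for downstream tasks."""
    def __init__(self, backbone, feat_dim=2048, proj_dim=128):
        super().__init__()
        self.backbone = backbone
        self.projector = nn.Sequential(
            nn.Linear(feat_dim, feat_dim), nn.ReLU(), nn.Linear(feat_dim, proj_dim)
        )

    def forward(self, x):
        return self.projector(self.backbone(x))

def training_step(model, view1, view2, optimizer):
    # view1, view2: two augmented versions of the same image batch
    loss = nt_xent(model(view1), model(view2))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```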
MoCo
- Problem: learn visual representations without labeled data and without requiring very large batches.
- Contrastive Relationships: Similar to SimCLR
- Data Augmentation & Sampling
- Data augmentation: Similar to SimCLR
- Negative data sampling: FIFO Memory Bank
- Encoders: a query encoder and a momentum-updated key encoder for stability across batches; the key encoder produces the embeddings stored in the memory bank.
- Contrastive Loss: InfoNCE
- Model Training: Momentum Contrast
- Paper: Kaiming He et al. “Momentum Contrast for Unsupervised Visual Representation Learning” CVPR 2020.
SimCSE
- Problem: learn high-quality sentence embeddings with little or no labeled data.
- Contrastive Relationships: two sources of positive pairs: (1) labeled data from NLI datasets, and (2) data augmentation via dropout; negatives come from other sentences in the batch.
- Data Augmentation & Sampling
- Data augmentation: unlike images, augmenting discrete text data is challenging. Instead, SimCSE encodes the same sentence twice with different dropout masks; the dropout noise perturbs the internal representation enough to form effective positive pairs while preserving semantic meaning (see the sketch after this block).
- Negative data sampling: in-batch negatives; contradiction pairs from the supervised NLI data serve as hard negatives.
- Encoder: Pre-trained BERT models
- Contrastive Loss: NT-Xent
- Model Training: Large batch size
- Paper: Tianyu Gao et al. “SimCSE: Simple Contrastive Learning of Sentence Embeddings.” arXiv preprint arXiv:2104.08821 (2021).
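The dropout-as-augmentation trick is easy to sketch: encode the same batch twice with dropout active and treat the two resulting embeddings as a positive pair (assuming a Hugging Face BERT encoder with CLS pooling, i.e., the unsupervised SimCSE setup):

```python
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoder = AutoModel.from_pretrained("bert-base-uncased")
encoder.train()  # keep dropout ON so two passes give two noisy views

sentences = ["contrastive learning builds useful embeddings."]
batch = tokenizer(sentences, return_tensors="pt", padding=True, truncation=True)

z1 = encoder(**batch).last_hidden_state[:, 0]  # CLS embedding, first pass
z2 = encoder(**batch).last_hidden_state[:, 0]  # second pass, different dropout mask
# (z1[i], z2[i]) are positives; (z1[i], z2[j]) for j != i act as in-batch negatives.
```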
DPR
- Problem: learn high-quality query and passage embeddings for passage retrieval over large-scale document collections, e.g., for open-domain QA.
- Contrastive Relationships: supervised positive data pairs consist of a query and its relevant passage, while negatives are irrelevant passages from other queries.
- Data Augmentation & Sampling
- Data augmentation: DPR mostly relies on supervised question-answering datasets rather than augmentation.
- Negative data sampling: in-batch negatives plus sampling tricks to obtain hard negatives:
- using BM25 to fetch passages that share terms with the question but do not contain the answer
- using positive passages paired with other queries
- Encoders: Separate BERT encoders for queries and passages optimize for retrieval performance
- Contrastive Loss: a softmax-based negative log-likelihood over the positive passage and the negatives (sketched after this block), which pulls the query embedding toward the correct passage and away from the negative passages.
- Model Training: Large batch size + hard negative samples
- Paper: Vladimir Karpukhin et al. “Dense Passage Retrieval for Open-Domain Question Answering” EMNLP 2020.
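A minimal sketch of in-batch negative training for a dual-encoder retriever like DPR: the query-passage similarity matrix has the gold passages on its diagonal, and the loss is the negative log-likelihood of the correct passage (simplified; the actual setup also mixes in BM25 hard negatives as noted above):

```python
import torch
import torch.nn.functional as F

def dual_encoder_loss(query_emb, passage_emb):
    """query_emb, passage_emb: [B, D]; passage_emb[i] is the gold passage
    for query_emb[i], and every other passage in the batch is a negative."""
    scores = query_emb @ passage_emb.t()                   # [B, B] dot-product scores
    targets = torch.arange(scores.size(0), device=scores.device)
    return F.cross_entropy(scores, targets)                # NLL of the positive passage
```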
CLIP
- Problem: learn joint vision and language representations that enable zero-shot transfer, where the model generalizes to new tasks without additional training on task-specific datasets.
- Contrastive Relationships: Positive pairs are formed from an image and its corresponding text description, while negatives are formed from unrelated image-text pairs.
- Data Augmentation & Sampling
- Data augmentation: CLIP doesn’t rely heavily on augmentations since it works with paired image-text data.
- Negative data sampling: in-batch negatives
- Encoders: Separate encoders for text and images align cross-modal representations.
- Contrastive Loss: a symmetric NT-Xent/InfoNCE-style loss over image-text similarities (sketched after this block).
- Model Training: Large batch size
- Paper: Alec Radford, et al. “Learning Transferable Visual Models From Natural Language Supervision” arXiv preprint arXiv:2103.00020 (2021)
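CLIP's objective can be sketched as a symmetric version of the same idea: cross-entropy in both the image-to-text and text-to-image directions over a temperature-scaled cosine-similarity matrix (simplified; the real model learns the temperature):

```python
import torch
import torch.nn.functional as F

def clip_style_loss(image_emb, text_emb, temperature=0.07):
    """image_emb, text_emb: [B, D]; row i of each forms a matched image-text pair."""
    image_emb = F.normalize(image_emb, dim=1)
    text_emb = F.normalize(text_emb, dim=1)
    logits = image_emb @ text_emb.t() / temperature         # [B, B]
    targets = torch.arange(logits.size(0), device=logits.device)
    loss_i2t = F.cross_entropy(logits, targets)             # image -> matching text
    loss_t2i = F.cross_entropy(logits.t(), targets)         # text -> matching image
    return (loss_i2t + loss_t2i) / 2
```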
To summarize our discussion, we’ve talked about the “what, when, and how” of applying contrastive learning in ML System Design interviews. The key takeaway is the flexibility of the recipe we’ve outlined, which you can adapt to different interview scenarios. While we’ve focused on practical and interview-friendly insights, further technical details can be found in the references for each method.
However, knowing the recipe alone is not enough: there are also pitfalls and follow-up questions to handle whenever we use contrastive learning in our interviews. We will cover them, along with more real interview examples, in the next part of this discussion. Stay tuned!
Have thoughts or questions about your own contrastive learning design? Let’s discuss them in the comments!
References
- Ching-Yao Chuang et al. “Debiased Contrastive Learning.” NeurIPS 2020.
- Sumit Chopra, Raia Hadsell and Yann LeCun. “Learning a similarity metric discriminatively, with application to face verification.” CVPR 2005.
- Florian Schroff, Dmitry Kalenichenko and James Philbin. “FaceNet: A Unified Embedding for Face Recognition and Clustering.” CVPR 2015.
- Kihyuk Sohn et al. “Improved Deep Metric Learning with Multi-class N-pair Loss Objective” NIPS 2016.
- Aaron van den Oord, Yazhe Li & Oriol Vinyals. “Representation Learning with Contrastive Predictive Coding” arXiv preprint arXiv:1807.03748 (2018).
- Ting Chen et al. “A Simple Framework for Contrastive Learning of Visual Representations” ICML 2020.
- Nicholas Frosst, Nicolas Papernot and Geoffrey Hinton. “Analyzing and Improving Representations with the Soft Nearest Neighbor Loss.” ICML 2019
- Hyun Oh Song et al. “Deep Metric Learning via Lifted Structured Feature Embedding.” CVPR 2016.
- Alec Radford, et al. “Learning Transferable Visual Models From Natural Language Supervision” arXiv preprint arXiv:2103.00020 (2021)
- Zhirong Wu et al. “Unsupervised Feature Learning via Non-Parametric Instance Discrimination.” CVPR 2018.
- Kaiming He et al. “Momentum Contrast for Unsupervised Visual Representation Learning” CVPR 2020.
- Tianyu Gao et al. “SimCSE: Simple Contrastive Learning of Sentence Embeddings.” arXiv preprint arXiv:2104.08821 (2021).
- Vladimir Karpukhin et al. “Dense Passage Retrieval for Open-Domain Question Answering.” EMNLP 2020.
- Prannay Khosla et al. “Supervised Contrastive Learning.” NeurIPS 2020.
- Weng, Lilian. (May 2021). Contrastive representation learning. Lil’Log. https://lilianweng.github.io/posts/2021-05-31-contrastive/.
- Rui Zhang et al. “Contrastive Data and Learning for Natural Language Processing” NAACL 2022 tutorial. [reading list]