Cheat Sheet for Clarifications in ML Design Interviews

Most ML System Design interviews start with a brief prompt, such as “Design a recommendation system for XXX.” This underscores the critical role of clarifying questions. On one hand, interviewers use the clarification phase to assess candidates’ communication skills and how effectively they articulate objectives. On the other hand, candidates rely on this step to gather the necessary context to steer the discussion in the right direction and at the appropriate level of detail.

Unfortunately, I’ve seen many candidates overlook key clarification questions, leading them to make incorrect assumptions and flawed design decisions. To help avoid these pitfalls, I’ve compiled a checklist of essential clarification questions that can guide you through these interviews successfully.

General checklists

  • What are the business goals and objectives for this task?
  • What are the use cases?
  • What are the expected user interactions with the model?
    • Inputs/outputs

  • What data will be used as features?
  • What are the potential sources of training data?
    • What is the quality of these sources?
  • What does the data distribution look like? (see the sketch after this group)
    • Is the data balanced?
    • Will the distribution drift over time?
    • Does the data have bias?
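
As a quick illustration of how you might sanity-check balance and drift once you can touch the data, here is a minimal Python sketch. The column names, the synthetic data, and the use of the Population Stability Index are illustrative assumptions, not a required approach.

```python
# Minimal sketch: quantifying class balance and distribution drift.
# Column names ("label", "score") and the synthetic data are assumptions.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical training snapshot vs. a recent production sample.
train = pd.DataFrame({
    "label": rng.choice([0, 1], size=10_000, p=[0.95, 0.05]),  # heavily imbalanced
    "score": rng.normal(0.0, 1.0, size=10_000),
})
recent = pd.DataFrame({
    "label": rng.choice([0, 1], size=10_000, p=[0.90, 0.10]),
    "score": rng.normal(0.3, 1.0, size=10_000),                # mean shift -> drift
})

# 1) Class balance: share of each label in the training data.
print(train["label"].value_counts(normalize=True))

# 2) Drift: Population Stability Index (PSI) of a numeric feature,
#    comparing the recent sample against the training distribution.
def psi(expected: np.ndarray, actual: np.ndarray, bins: int = 10) -> float:
    edges = np.quantile(expected, np.linspace(0, 1, bins + 1))
    edges[0], edges[-1] = -np.inf, np.inf
    e = np.histogram(expected, edges)[0] / len(expected)
    a = np.histogram(actual, edges)[0] / len(actual)
    e, a = np.clip(e, 1e-6, None), np.clip(a, 1e-6, None)
    return float(np.sum((a - e) * np.log(a / e)))

# A PSI above ~0.2 is a common rule-of-thumb flag for meaningful drift.
print(f"PSI(score) = {psi(train['score'].to_numpy(), recent['score'].to_numpy()):.3f}")
```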

  • What are the target user profiles?

  • What data labeling resources are available?
  • Are there any existing models or services that can be leveraged?

  • Active users
  • Model inference traffic (see the back-of-envelope sketch below)
  • Latency requirements
  • Data sizes

  • Budget limitations
  • Hardware limitations
  • Privacy & legal constraints
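
If the interviewer gives you rough numbers for users and traffic, it is worth converting them into load estimates on the spot. Below is a minimal back-of-envelope sketch; every number in it (users, requests per user, peak factor) is an assumed placeholder, not a figure from any real system.

```python
# Back-of-envelope sizing from assumed traffic numbers (all values illustrative).
daily_active_users = 50_000_000     # assumed DAU
requests_per_user_per_day = 20      # assumed queries / feed refreshes per user
peak_to_average_ratio = 3           # assumed peak factor

requests_per_day = daily_active_users * requests_per_user_per_day
average_qps = requests_per_day / 86_400          # seconds per day
peak_qps = average_qps * peak_to_average_ratio

print(f"average inference QPS ~ {average_qps:,.0f}")
print(f"peak inference QPS    ~ {peak_qps:,.0f}")
```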

Topic-specific checklists

Recommendation systems

  • Properties of items
    • Content type/modality
  • Do we have to consider cold start?
  • Are there diversity or exploration requirements?

Search and ranking

  • Is personalization required?
  • Properties of candidate documents
    • Scale (number of documents)
    • Content type/modality
    • Size (per document)

Forecasting

  • Time horizon: 1 hour, 1 day, 1 week
  • Steps of prediction: rolling forecasts vs. direct multi-step prediction (contrasted in the sketch below)
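
If multi-step forecasting comes up, be ready to contrast the two strategies. The sketch below is a minimal toy example using scikit-learn; the lag count, horizon, and choice of a linear model are arbitrary assumptions.

```python
# Rolling (recursive) vs. direct multi-step forecasting on a toy series.
# Lags, horizon, and the linear model are illustrative assumptions.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
y = np.cumsum(rng.normal(size=200))   # toy random-walk series
lags, horizon = 5, 3

# Lag-feature matrix: row i holds y[i : i+lags]; targets are later points.
X = np.stack([y[i:i + lags] for i in range(len(y) - lags - horizon)])
history = y[-lags:]                   # most recent observations

# 1) Rolling / recursive: one 1-step-ahead model, fed its own predictions.
one_step = LinearRegression().fit(X, y[lags:len(y) - horizon])
window, rolling = list(history), []
for _ in range(horizon):
    pred = one_step.predict([window[-lags:]])[0]
    rolling.append(pred)
    window.append(pred)

# 2) Direct: a separate model trained for each step ahead (h = 1..horizon).
direct = [
    LinearRegression()
    .fit(X, y[lags + h - 1:len(y) - horizon + h - 1])
    .predict([history])[0]
    for h in range(1, horizon + 1)
]

print("rolling:", np.round(rolling, 2))
print("direct: ", np.round(direct, 2))
```

The rolling approach reuses a single one-step model but feeds back its own predictions, so errors can compound over the horizon; the direct approach avoids that by training one model per step, at the cost of more models to train and maintain.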

Classification

  • Properties of labels
    • Multi-label vs. single-label (see the encoding sketch below)
    • Do the labels have semantic relationships with each other?
    • Can there be new labels in the future?
  • Amount and quality of labeled data
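
The single-label vs. multi-label choice changes the target representation (and, downstream, the loss, metrics, and serving format). A minimal sketch with made-up class names and scikit-learn encoders:

```python
# Single-label -> one integer id per example; multi-label -> a 0/1 indicator
# vector per example. Class names are made up for illustration.
from sklearn.preprocessing import LabelEncoder, MultiLabelBinarizer

single = ["cat", "dog", "cat", "bird"]
multi = [["cat"], ["cat", "dog"], ["bird", "dog"], []]  # an example may have 0..n labels

print(LabelEncoder().fit_transform(single))        # [1 2 1 0]
print(MultiLabelBinarizer().fit_transform(multi))  # one indicator row per example
```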

Regression

  • Constraints on target values
    • Is the target variable bounded?
    • Are there outliers in the target values?

The checklists above cover the critical points that will help you minimize uncertainties before diving into the design discussion. Keep in mind that you’ll typically have ~5 minutes to ask these questions. If some answers can be inferred from the scenario and context, feel free to skip them and focus on the most important ones.

Do you have any questions about these clarification points? Or is there anything I missed? Let’s discuss and clarify in the comments!
