One-Sentence Definition
Supervised learning is a type of machine learning where the model is trained on labeled data -- input-output pairs where the correct answer is provided -- so it can learn to predict outputs for new, unseen inputs.
How It Works
The workflow is straightforward. You start with a dataset where each example has an input (features) and a corresponding label (the correct answer). For an email spam detector, the input is the email text and metadata; the label is "spam" or "not spam." For a house price predictor, the inputs are square footage, location, and number of bedrooms; the label is the sale price.
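In code, such a dataset is just a collection of (features, label) pairs. A minimal sketch for the house-price example, with made-up numbers:

```python
# Hypothetical labeled dataset for a house-price predictor: each
# example pairs a feature tuple with its label (the sale price).
dataset = [
    # (square_feet, bedrooms) -> sale price in $1000s (illustrative values)
    ((1400, 3), 250),
    ((2000, 4), 340),
    ((850, 2), 160),
]

# Separate the inputs from the labels
features, labels = zip(*dataset)
print(features[0])  # (1400, 3)
print(labels[0])    # 250
```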
During training, the model makes predictions, compares them to the true labels using a loss function, and updates its parameters to reduce the error. This loop repeats over the dataset for multiple passes (epochs). The model is then evaluated on a held-out test set it has never seen to measure how well it generalizes.
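The predict-compare-update loop above can be sketched with a one-parameter linear model trained by gradient descent on squared error. The tiny dataset, learning rate, and epoch count are all illustrative, not from any real system:

```python
# Minimal sketch of the supervised training loop: predict, measure
# error with a loss function, update parameters, repeat for epochs,
# then check generalization on held-out data.
train = [(1.0, 2.0), (2.0, 4.1), (3.0, 5.9)]  # (input x, label y), roughly y = 2x
test = [(4.0, 8.0)]                            # held-out example, never trained on

w, b = 0.0, 0.0   # model parameters: prediction = w*x + b
lr = 0.05         # learning rate

for epoch in range(200):          # multiple passes (epochs) over the data
    for x, y in train:
        pred = w * x + b
        err = pred - y            # gradient of squared loss w.r.t. the prediction
        w -= lr * err * x         # update parameters to reduce the error
        b -= lr * err

# Evaluate on the held-out test example
x, y = test[0]
test_error = abs((w * x + b) - y)
print(round(w, 2), round(test_error, 2))
```

After training, `w` lands near 2 and the held-out error is small, which is the generalization the test set is there to measure.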
Supervised learning divides into two main categories. Classification assigns inputs to discrete categories: spam vs. not spam, cat vs. dog, malignant vs. benign. Regression predicts continuous values: house prices, stock returns, temperature forecasts. Both use the same train-evaluate-deploy cycle, but with different loss functions and evaluation metrics.
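The difference in evaluation metrics is easy to see side by side: regression is scored with a continuous loss such as mean squared error, classification with a discrete metric such as accuracy. The predictions and labels below are made up for illustration:

```python
# Regression: predicted vs. true house prices (in $1000s),
# scored with mean squared error
preds_r = [310.0, 205.0, 450.0]
truth_r = [300.0, 210.0, 440.0]
mse = sum((p - t) ** 2 for p, t in zip(preds_r, truth_r)) / len(truth_r)

# Classification: predicted vs. true spam labels, scored with accuracy
preds_c = ["spam", "not spam", "spam", "not spam"]
truth_c = ["spam", "spam", "spam", "not spam"]
acc = sum(p == t for p, t in zip(preds_c, truth_c)) / len(truth_c)

print(mse)  # 75.0
print(acc)  # 0.75
```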
Classic supervised learning algorithms include logistic regression, decision trees, random forests, and support vector machines. Deep-learning models (CNNs, transformers) are also supervised learners when trained on labeled data. The supervised fine-tuning (SFT) stage of LLM training -- where the model learns from curated instruction-response pairs -- is a direct application of supervised learning.
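To make one of the classic algorithms concrete, here is logistic regression reduced to its core: a sigmoid over a linear score, trained with the cross-entropy gradient. The 1-D toy dataset (label 1 when x is positive) and hyperparameters are illustrative:

```python
import math

# Toy binary classification data: (input x, label)
data = [(-2.0, 0), (-1.0, 0), (1.0, 1), (2.0, 1)]

w, b = 0.0, 0.0
lr = 0.1
for _ in range(500):
    for x, y in data:
        p = 1 / (1 + math.exp(-(w * x + b)))  # sigmoid: predicted P(label = 1)
        w -= lr * (p - y) * x                 # cross-entropy loss gradient step
        b -= lr * (p - y)

def predict(x):
    """Classify as 1 when the predicted probability exceeds 0.5."""
    return int(1 / (1 + math.exp(-(w * x + b))) > 0.5)

print(predict(-1.5), predict(1.5))  # 0 1
```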
Why It Matters
Supervised learning is the workhorse of applied machine learning. Most production ML systems -- fraud detection, medical diagnosis, search ranking, content moderation, credit scoring -- are supervised models trained on historical labeled data. It is the most mature, best-understood ML paradigm, with well-established best practices for data splitting, cross-validation, hyperparameter tuning, and deployment.
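One of those best practices, k-fold cross-validation, is simple to sketch: split the data into k folds, train on k-1 of them, validate on the held-out fold, and rotate so every example is validated exactly once. The round-robin split below is a minimal illustration, not a production splitter:

```python
def k_fold_splits(data, k):
    """Yield (train, validation) splits, one per fold."""
    folds = [data[i::k] for i in range(k)]  # round-robin assignment to k folds
    for i in range(k):
        val = folds[i]
        train = [x for j, f in enumerate(folds) if j != i for x in f]
        yield train, val

data = list(range(10))
for train, val in k_fold_splits(data, k=5):
    assert len(train) == 8 and len(val) == 2
    assert set(train) | set(val) == set(data)  # together they cover the dataset
```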
The main bottleneck is labeled data. Creating high-quality labels is expensive and time-consuming, which is why techniques like self-supervised learning (used in LLM pre-training) and semi-supervised learning (mixing labeled and unlabeled data) have become increasingly important.
Key Takeaway
Supervised learning trains models on labeled examples to predict outcomes for new data, and it is the most widely deployed form of machine learning in production systems worldwide.
Part of the AI Weekly Glossary.