ML in 30 Days: A Clear, Beginner-Friendly Roadmap to Machine Learning
A practical 30-day roadmap to understand Machine Learning from first principles, covering data, models, evaluation, and real-world workflows without unnecessary math or hype.
ML in 30 Days
A clear, beginner-friendly roadmap to understand Machine Learning end to end
This page follows the exact ML in 30 Days Instagram series. Use it to track progress, revise concepts, and move from theory to practice.
PHASE 1: Foundations (Days 1-7)
Day 1: What is Machine Learning?
Traditional Programming vs Machine Learning
In traditional programming, you write explicit rules: "If income > 50000 AND age > 30, then approve loan." The computer follows your rules exactly.
In machine learning, you show the computer examples of inputs and outputs, and it learns the patterns itself. Instead of saying "what to do," you say "here's what happened before; figure out why."
The Core Idea
Machine Learning is teaching computers to learn patterns from data rather than being explicitly programmed for each decision. You provide data (experiences) and the algorithm develops its own understanding (knowledge).
Real-World Examples
- Email spam filters learn from millions of emails
- Netflix recommends shows based on your viewing history
- Banks detect fraud by learning normal transaction patterns
- Voice assistants recognize speech by learning from thousands of voice samples
Key Insight: ML shines when rules are too complex to write manually, or when the rules change frequently and you'd need to constantly rewrite code.
Day 2: Types of Machine Learning
1. Supervised Learning - Learning with a Teacher
You provide labeled examples: both the input AND the correct output. The model learns to predict outputs from inputs.
Examples:
- Email → Spam/Not Spam (labeled by you)
- House size → Price (historical sales data)
- Image → Cat/Dog (manually labeled photos)
2. Unsupervised Learning - Learning without a Teacher
You provide only inputs. The model finds hidden patterns or structures on its own; no labels are provided.
Examples:
- Group customers by purchasing behavior (no pre-defined groups)
- Compress data by finding common patterns
- Detect anomalies (things that don't fit the pattern)
3. Reinforcement Learning - Learning from Experience
An agent takes actions in an environment and learns from rewards/punishments. It discovers through trial and error what works best.
Examples:
- AlphaGo playing Go
- Robots learning to walk
- Game AI learning strategies
Quick Comparison
| Type | Data | Goal | Analogy |
|---|---|---|---|
| Supervised | Labeled | Predict | Learning with answer key |
| Unsupervised | Unlabeled | Discover patterns | Finding groups/clusters |
| Reinforcement | Actions + Rewards | Maximize reward | Learning from consequences |
Day 3: Supervised vs Unsupervised Learning
When to Use Supervised Learning
Use supervised learning when you have:
- Historical data with known outcomes
- A clear target variable you want to predict
- Enough labeled examples to train on
Two Categories of Supervised Learning:
1. Classification - Predict categories/classes
   - Email: Spam or Not Spam
   - Tumor: Malignant or Benign
   - Customer: Will Buy or Won't Buy
   - Output is discrete (finite set of options)
2. Regression - Predict continuous values
   - House price prediction
   - Temperature forecasting
   - Sales estimation
   - Output is a number on a continuous scale
When to Use Unsupervised Learning
Use unsupervised learning when you:
- Don't have labels/outcomes available
- Want to explore data and discover patterns
- Need to segment customers/users into natural groups
- Want to reduce data complexity for visualization
Common Unsupervised Techniques:
1. Clustering - Group similar data points
   - Customer segmentation
   - Image compression (group similar pixels)
2. Dimensionality Reduction - Simplify without losing information
   - PCA (Principal Component Analysis)
   - Make high-dimensional data visualizable
The Critical Difference: Supervised = predicting known categories/values. Unsupervised = discovering unknown structure.
Day 4: Features & Labels
What is a Feature?
A feature (also called a variable, attribute, or predictor) is a measurable property or characteristic of the phenomenon you're observing.
Example: Predicting House Prices
| Feature | Type | Description |
|---|---|---|
| Square feet | Numeric | Size of the house |
| Number of bedrooms | Numeric | Count of rooms |
| Location | Categorical | Neighborhood/zip code |
| Age of house | Numeric | Years since built |
| Number of bathrooms | Numeric | Count of bathrooms |
What is a Label?
The label (or target/ground truth) is the value you're trying to predict. It's the "answer" for supervised learning.
Continuing the House Example:
- Label: Sale price ($350,000, $420,000, etc.)
Feature Engineering β Crafting Good Features
The quality of your features often matters more than the algorithm you choose.
Good Features:
- Relevant to the prediction task
- Reliable (consistent, not noisy)
- Available for new data you'll predict on
- Understandable
Feature Engineering Examples (see the sketch after this list):
- Instead of raw dates, use "days since event"
- Instead of full address, use "distance to downtown"
- Combine related features (e.g., bedrooms + bathrooms = total rooms)
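A minimal pandas sketch of these transformations; the column names, dates, and values below are made up purely for illustration:

```python
import pandas as pd

# Hypothetical listings data (invented columns and values)
df = pd.DataFrame({
    "listed_date": pd.to_datetime(["2024-01-10", "2024-03-05"]),
    "bedrooms": [3, 2],
    "bathrooms": [2, 1],
})

# Raw date -> "days since event"
df["days_since_listing"] = (pd.Timestamp("2024-06-01") - df["listed_date"]).dt.days

# Combine related features into one
df["total_rooms"] = df["bedrooms"] + df["bathrooms"]
print(df)
```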
Feature Types:
1. Numeric/Continuous - Can take any value in a range
   - Price, temperature, age
2. Categorical/Discrete - Finite set of categories
   - Color (red, blue, green), Yes/No, Rating (1-5)
3. Ordinal - Categories with meaningful order
   - Education level, satisfaction rating
Day 5: Training Data vs Test Data
The Core Problem
We train a model on some data, but we care about how it performs on NEW, UNSEEN data. This is called generalization.
The Solution: Train-Test Split
Split your data into:
- Training Set (typically 70-80%) → Used to teach the model
- Test Set (typically 20-30%) → Used to evaluate performance
Why This Matters
Training Data → Model learns patterns
Test Data → Model is evaluated on patterns it has NEVER seen
Critical Rule: Never touch your test set during training. Using test data for training = cheating.
Visual Example with 100 data points:
All Data (100 samples)
├── Training Set (80 samples) → Model learns from this
└── Test Set (20 samples) → Only used ONCE at the end
The Train-Test Split Code:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
features, labels, test_size=0.2, random_state=42
)
random_state ensures reproducibility: you get the same split every time.
Validation Set (Bonus)
For complex models, you often use three splits:
- Training Set (60-70%) → Learn parameters
- Validation Set (10-20%) → Tune hyperparameters
- Test Set (10-20%) → Final evaluation
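One common way to get the three splits is to call train_test_split twice; a minimal sketch, assuming X and y hold your full features and labels:

```python
from sklearn.model_selection import train_test_split

# First carve off the test set (20% of all data)
X_temp, X_test, y_temp, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Then split the remaining 80% into train and validation
# (0.25 of the remaining 80% = 20% of the original data)
X_train, X_val, y_train, y_val = train_test_split(X_temp, y_temp, test_size=0.25, random_state=42)
```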
Day 6: What is a Model?
Model = Mathematical Representation of Patterns
A machine learning model is a mathematical function that takes inputs (features) and produces outputs (predictions). It captures the relationship between X (features) and y (label).
Simple Example: Linear Regression
y = mx + b
Where:
- m = slope (weight/coefficient)
- b = intercept (bias)
- x = feature (square footage)
- y = prediction (house price)
The model "learns" m and b from training data.
Models are Templates, Not Rules
Think of models as flexible templates that mold themselves to fit your data:
- Linear Regression = straight line template
- Decision Tree = flowchart template
- Neural Network = complex pattern-matching template
The Learning Process
- Start with random/fresh parameters
- Make predictions on training data
- Calculate prediction error (how wrong?)
- Adjust parameters to reduce error
- Repeat until error is minimized
Key Insight: The model doesn't "know" anything about houses. It simply finds the mathematical relationship that best maps square footage → price based on examples.
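A minimal sketch of that idea: fitting m and b to a handful of made-up square-footage/price pairs with NumPy (the numbers are invented for illustration):

```python
import numpy as np

# Made-up square footage vs. price data
sqft  = np.array([800, 1000, 1200, 1500, 1800])
price = np.array([160_000, 195_000, 230_000, 290_000, 350_000])

# Least-squares fit of y = m*x + b; the "learning" is just finding m and b
m, b = np.polyfit(sqft, price, deg=1)
print(f"learned slope m = {m:.1f}, intercept b = {b:.1f}")
print(f"prediction for 1,300 sqft: {m * 1300 + b:,.0f}")
```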
Analogy: Learning to Catch
- Traditional programming: Someone tells you "move your hand to coordinates (x,y)"
- Machine learning: You throw 100 balls, miss most, adjust your movement each time, eventually get better
The model is your "catching strategy," learned from experience (data), not explicitly programmed.
Day 7: Bias in Data
What is Data Bias?
Bias in ML is systematic error that skews results in a particular direction. It comes from the data, not the algorithm.
Common Types of Data Bias:
1. Selection Bias
- Training data doesn't represent real-world distribution
- Example: Training a face detector only on photos of young people, then failing on elderly faces
2. Label Bias
- Human labels are inconsistent or prejudiced
- Example: Historical hiring data reflecting past discrimination
3. Confirmation Bias
- Collect/interpret data to confirm existing beliefs
- Example: Only tracking positive customer reviews
4. Survivorship Bias
- Only analyzing "successful" cases, ignoring failures
- Example: Studying only successful startups to predict success
Real-World Consequence Examples:
- COMPAS Recidivism Algorithm: Higher false positive rate for Black defendants
- Amazon Hiring Tool: Biased against women (trained on 10 years of resumes)
- Facial Recognition: Poor performance on darker skin tones (underrepresented in training data)
The Fix:
- Audit your data - Who/what is represented?
- Diversify data collection - Ensure broad representation
- Test on multiple groups - Check performance equity
- Acknowledge limitations - Be transparent about biases
Key Insight: A model is only as good as the data it's trained on. "Garbage in, garbage out."
Notes (Foundations)
Write concepts in your own words.
If you can explain it simply, you understand it.
PHASE 2: Data & Core Concepts (Days 8-14)
Day 8: Data Preprocessing
Why Preprocess Data?
Raw data is messy. Real-world data has:
- Missing values
- Inconsistent formats
- Outliers
- Duplicate entries
- Irrelevant columns
Preprocessing prepares data for your model to learn effectively.
Common Preprocessing Steps:
1. Handling Missing Values
Options:
- Remove rows with missing values (if few)
- Fill with mean/median/mode (simple imputation)
- Use advanced techniques (KNN imputation, iterative imputation)
# Option 1: Drop rows with missing values
df = df.dropna()
# Option 2: Fill with mean
df['column'] = df['column'].fillna(df['column'].mean())
2. Encoding Categorical Variables
Convert text categories to numbers:
- Label Encoding: cat → 0, dog → 1, bird → 2
- One-Hot Encoding: Creates binary columns
from sklearn.preprocessing import LabelEncoder
encoder = LabelEncoder()
df['category_encoded'] = encoder.fit_transform(df['category'])
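One-hot encoding is often done with pandas' get_dummies; a minimal sketch, assuming the same df with a 'category' column as above:

```python
import pandas as pd

# Each category value becomes its own 0/1 column
df = pd.get_dummies(df, columns=['category'], prefix='category')
```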
3. Handling Duplicates
df = df.drop_duplicates()
4. Fixing Inconsistent Data
# Standardize text
df['city'] = df['city'].str.lower().str.strip()
# Fix typos in categories
df['country'] = df['country'].replace({'usa': 'United States'})
The Preprocessing Pipeline:
Raw Data → Clean → Transform → Feature Engineer → Model-Ready Data
Key Insight: Data scientists spend 60-80% of their time on data preprocessing. It's not glamorous, but it makes or breaks your model.
Day 9: Train-Test Split
Recap and Deep Dive
We discussed train-test split in Day 5. Now let's understand it more deeply.
Why 80/20 Split?
- Too little training data → model can't learn patterns
- Too little test data → unreliable performance estimate
- 80/20 is a good starting point (also common: 70/30, 75/25)
The Stratified Split
When classes are imbalanced, use stratified sampling:
train_test_split(X, y, test_size=0.2, stratify=y)
This ensures the train and test sets have the same class distribution.
The Data Leakage Problem
CRITICAL: Never let information from test data influence training.
Bad Examples:
- Computing mean on entire dataset before splitting
- Normalizing using training + test combined
- Feature engineering using test data knowledge
Correct Approach:
# Split FIRST
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
# THEN compute statistics on training data only
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)  # Uses training stats!
Practical Splitting Strategy:
1. Hold out test set (never touch until final evaluation)
2. Use validation set for hyperparameter tuning
3. Train on remaining data
4. Report test set performance once
Day 10: Underfitting
What is Underfitting?
Underfitting is when your model is too simple to capture patterns in the data. It performs poorly on both training AND test data.
The Model is "Underpowered"
Like trying to fit a straight line to curved data: the model lacks the capacity to learn the true pattern.
Symptoms of Underfitting:
- High training error
- High test error
- Model ignores important features
- Patterns in data are obvious but model misses them
Visual Example:
The data points form a curve, but the model draws a straight line:
A linear model cannot capture the curved relationship in the data.
Causes of Underfitting:
- Model too simple → Linear model for non-linear data
- Not enough features → Missing important predictors
- Too much regularization → Penalizing complexity too much
- Insufficient training → Stopped too early
How to Fix Underfitting:
- Use a more complex model
- Add more relevant features
- Reduce regularization
- Train longer (more epochs/iterations)
Example: Underfitting vs Good Fit vs Overfitting
| Model | Equation | Description |
|---|---|---|
| Underfitting | y = mx + b (straight line) | Straight line on curved data |
| Good Fit | Low-degree polynomial | Captures the curve |
| Overfitting | Complex polynomial | Wiggly line touching every point |
Key Insight: Underfitting is the model saying "I can't learn this." Overfitting is the model saying "I memorized this, but I don't understand."
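A minimal sketch of the same comparison on synthetic data (toy numbers, not from the original series): fit polynomials of increasing degree and watch train vs. test error diverge as complexity grows.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Toy curved data: y = x^2 plus noise (made up for illustration)
rng = np.random.default_rng(0)
X = np.linspace(-3, 3, 60).reshape(-1, 1)
y = X.ravel() ** 2 + rng.normal(0, 1, size=60)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=42)

for degree in [1, 2, 15]:  # too simple, about right, very flexible
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_tr, y_tr)
    print(f"degree {degree:>2}: "
          f"train MSE = {mean_squared_error(y_tr, model.predict(X_tr)):.2f}, "
          f"test MSE = {mean_squared_error(y_te, model.predict(X_te)):.2f}")
```

Typically the degree-1 fit underfits (both errors high), degree 2 fits well, and the high-degree fit drives training error down while test error climbs.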
Day 11: Mean, Median & Standard Deviation
These are fundamental statistics for understanding data distribution.
Mean (Average)
Sum of all values divided by count.
Example: Test scores: 70, 80, 90, 70, 90 → Mean = 400 / 5 = 80
Median (Middle Value)
The value that separates the higher half from the lower half. Sort and find the middle.
Example: Test scores: 70, 80, 90, 70, 90
Sorted: 70, 70, 80, 90, 90 → Median = 80 (the middle value)
Mean vs Median:
- Mean is sensitive to outliers
- Median is robust to outliers
Example: Incomes: $30k, $40k, $50k, $60k, $1M
- Mean: $236k (misleading; most people earn far less)
- Median: $50k (more representative)
Standard Deviation (SD)
Measures how spread out values are from the mean (roughly, the square root of the average squared deviation from the mean).
Low SD: Values cluster near the mean
High SD: Values are widely spread
Example (verify it with the snippet below):
- Class A scores: 78, 79, 80, 81, 82 → SD ≈ 1.6
- Class B scores: 50, 70, 80, 90, 110 → SD ≈ 22
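A minimal NumPy sketch that computes these statistics (sample standard deviation, ddof=1):

```python
import numpy as np

class_a = np.array([78, 79, 80, 81, 82])
class_b = np.array([50, 70, 80, 90, 110])

for name, scores in [("Class A", class_a), ("Class B", class_b)]:
    print(name,
          "mean:", scores.mean(),
          "median:", np.median(scores),
          "sample SD:", round(scores.std(ddof=1), 1))
```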
Why These Matter in ML:
- Feature scaling (normalize to similar ranges)
- Outlier detection (values far from mean/median)
- Understanding data distribution
- Choosing appropriate models
Day 12: Correlation
What is Correlation?
Correlation measures how two variables change together. It tells you if one variable can predict another.
Correlation Coefficient (r)
Ranges from -1 to +1:
| Value | Meaning |
|---|---|
| +1.0 | Perfect positive correlation |
| +0.7 | Strong positive correlation |
| +0.3 | Weak positive correlation |
| 0.0 | No correlation |
| -0.3 | Weak negative correlation |
| -0.7 | Strong negative correlation |
| -1.0 | Perfect negative correlation |
Positive Correlation
As one variable increases, the other increases. Example: Height ↑ → Weight ↑
Negative Correlation
As one variable increases, the other decreases. Example: Hours of exercise ↑ → Body fat ↓
No Correlation
Variables move independently. Example: Shoe size vs. IQ score
Correlation ≠ Causation
Just because two things correlate doesn't mean one causes the other!
Spurious Correlation Example: Ice cream sales and shark attacks both increase in summer, but ice cream doesn't cause shark attacks.
Visualizing Correlation:
import seaborn as sns
import matplotlib.pyplot as plt
# Correlation matrix heatmap
corr_matrix = df.corr()
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm')
Feature Selection with Correlation (a small sketch follows this list):
- Remove highly correlated features (redundancy)
- Keep features strongly correlated with target
- Remove features uncorrelated with target
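A minimal sketch of dropping one feature from each highly correlated pair, assuming df is a pandas DataFrame and 0.9 is an arbitrary threshold:

```python
import numpy as np

# Absolute correlations, upper triangle only (so each pair is counted once)
corr = df.corr(numeric_only=True).abs()
upper = corr.where(np.triu(np.ones(corr.shape), k=1).astype(bool))

# Drop any feature that is >0.9 correlated with an earlier feature
to_drop = [col for col in upper.columns if (upper[col] > 0.9).any()]
df_reduced = df.drop(columns=to_drop)
```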
Day 13: Overfitting
What is Overfitting?
Overfitting is when your model learns the training data too well, including its noise and quirks. It memorizes rather than generalizes.
The Model is "Overly Complex"
It captures random fluctuations in training data that aren't real patterns.
Symptoms of Overfitting:
- Very low training error
- High test error
- Model works perfectly on training data, poorly on new data
Visual Example:
Overfitting: The model memorizes noise instead of learning patterns.
A complex model that fits every point, including noise.
Causes of Overfitting:
- Model too complex → Deep tree on small data
- Too many features → More predictors than samples
- Training too long → Continued learning after patterns are found
- Insufficient data → Not enough examples to learn true patterns
How to Fix Overfitting:
- Regularization → Penalize complexity
- Get more data → More examples = better generalization
- Feature selection → Remove irrelevant features
- Cross-validation → Better performance estimation
- Simplify model → Reduce model complexity
- Early stopping → Stop training before memorizing
The Bias-Variance Tradeoff Visualized:
- Too simple → misses patterns (a straight line misses the curved pattern)
- Just right → captures the trend (a smooth curve captures the pattern)
- Too complex → memorizes noise (a jagged line follows every point)
Day 14: Bias vs Variance
Understanding the Two Sources of Error
All prediction errors can be decomposed into:
Total Error = Bias² + Variance + Irreducible Error
**Bias-Variance Tradeoff**
Watch how bias and variance change with model complexity. The optimal model complexity minimizes total error.
Bias (Underfitting)
Bias is error from overly simplistic assumptions. The model is "biased" toward missing important patterns.
High Bias Symptoms:
- Misses relevant relationships
- Underestimates/overestimates systematically
- Performs poorly on all data
Variance (Overfitting)
Variance is error from sensitivity to noise in training data. The model "varies" too much with different training sets.
High Variance Symptoms:
- Captures random noise
- Performs differently on different training sets
- Training error is low, test error is high
The Bias-Variance Tradeoff
| Situation | Problem | Solution |
|---|---|---|
| High Bias + Low Variance | Underfitting (Consistent but wrong) | More complex model |
| Low Bias + High Variance | Overfitting (Flexible but unstable) | Regularization, more data |
| Low Bias + Low Variance | Good Fit (Right balance) | Optimal! |
Practical Implications:
| Situation | Problem | Solution |
|---|---|---|
| High train error, high test error | Underfitting | More complex model, more features |
| Low train error, high test error | Overfitting | Regularization, more data, simpler model |
| High train error, low test error | Rare (possible data leakage) | Check data pipeline |
Key Insight: You cannot simultaneously minimize both bias and variance perfectly. The goal is to find the sweet spot where total error is minimized.
Notes (Data & Concepts)
Focus on why things break, not formulas.
PHASE 3: ML Algorithms (Days 15-21)
Day 15: Linear Regression
What It Does
Linear Regression finds the best-fitting straight line through your data points. It predicts a continuous value based on input features.
The Equation
y = mx + b
Where:
- y = predicted value (target)
- x = input feature
- m = slope (weight/coefficient)
- b = intercept (bias)
Multiple Linear Regression (multiple features):
y = w1*x1 + w2*x2 + ... + wn*xn + b
**Linear Regression**
The model learns to find the best-fit line through data points by minimizing the sum of squared errors.
How It Works
The algorithm finds the line (or hyperplane) that minimizes the mean squared error (MSE):
MSE = (1/n) Σ (yᵢ - ŷᵢ)²
where ŷᵢ is the predicted value.
Simple Example: Predicting Ice Cream Sales
| Temperature (°C) | Sales ($) |
|---|---|
| 20 | 150 |
| 25 | 200 |
| 30 | 280 |
| 35 | 350 |
Model learns a straight-line relationship of roughly Sales ≈ 13.6 × Temperature - 129 from the table above.
When to Use Linear Regression:
- Target is continuous (not categorical)
- Features are linearly related to target
- You need interpretability
- Baseline model for comparison
When NOT to Use:
- Relationships are non-linear
- Target has complex interactions
- Outliers heavily influence results
Code Example:
from sklearn.linear_model import LinearRegression
model = LinearRegression()
model.fit(X_train, y_train)
predictions = model.predict(X_test)
# Interpret coefficients
print(f"Coefficient: {model.coef_[0]}")
print(f"Intercept: {model.intercept_}")Day 16: Logistic Regression
What It Does Despite the Name
Despite "regression" in the name, this is a classification algorithm. It predicts the probability of belonging to a class.
The Sigmoid Function
Logistic regression uses the sigmoid function to squash outputs between 0 and 1:
σ(z) = 1 / (1 + e^(-z))
where z is the linear combination of features (z = w·x + b).
**Sigmoid Function**
The sigmoid function maps any real number to a probability between 0 and 1.
Binary Classification Example: Spam Detection
| Feature | Value |
|---|---|
| Has word "free" | 1 |
| Number of exclamation marks | 3 |
| From unknown sender | 1 |
Prediction: SPAM (probability > 0.5 threshold)
Decision Boundary
Typically, we use 0.5 as the threshold:
- Probability ≥ 0.5 → Class 1
- Probability < 0.5 → Class 0
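If false positives are expensive, you can apply a stricter threshold yourself; a minimal sketch, assuming model is a fitted LogisticRegression and X_test is your test features:

```python
# Probability of class 1 for each test example
proba = model.predict_proba(X_test)[:, 1]

# Raise the threshold from 0.5 to 0.7: fewer positives, fewer false alarms
predictions = (proba >= 0.7).astype(int)
```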
Multiclass Classification
Logistic Regression can be extended to 3+ classes:
- One-vs-Rest (OvR): Train one classifier per class
- Multinomial: Directly model class probabilities
When to Use Logistic Regression:
- Binary classification problems
- Need probability estimates
- Interpretable model (coefficients show feature importance)
- Well-separated classes
Code Example:
from sklearn.linear_model import LogisticRegression
model = LogisticRegression()
model.fit(X_train, y_train)
probabilities = model.predict_proba(X_test)
predictions = model.predict(X_test)
Day 17: K-Nearest Neighbors (KNN)
The Intuition
KNN makes predictions based on similarity. "Tell me who your neighbors are, and I'll tell you who you are."
How It Works
- Choose K (number of neighbors)
- For a new data point, find the K closest points
- Vote: The majority class among neighbors wins
**KNN Decision Boundary**
The new point (green circle) is classified based on the majority class of its K nearest neighbors.
Distance Metrics
The most common is Euclidean distance:
d(p, q) = sqrt( Σ (pᵢ - qᵢ)² )
Voting example: with K=3, 2 A's (blue) and 1 B (red) → predict A. With K=5, 2 A's and 3 B's → predict B.
Choosing K
- Small K: Sensitive to noise, may overfit
- Large K: Smoother boundaries, may underfit
- Common: Try K = 3, 5, 7, sqrt(n) and compare them with cross-validation (see the sketch below)
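A minimal sketch for comparing K values, assuming X_train and y_train come from your train-test split:

```python
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Keep the K with the best cross-validated accuracy
for k in [3, 5, 7, 9]:
    knn = KNeighborsClassifier(n_neighbors=k)
    score = cross_val_score(knn, X_train, y_train, cv=5, scoring='accuracy').mean()
    print(f"K={k}: mean CV accuracy = {score:.3f}")
```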
When to Use KNN:
- Small to medium datasets
- Quick baseline model
- No training phase (lazy learner)
- Multi-class classification
When NOT to Use:
- Large datasets (slow prediction)
- High-dimensional data
- Features on very different scales
Code Example:
from sklearn.neighbors import KNeighborsClassifier
model = KNeighborsClassifier(n_neighbors=5)
model.fit(X_train, y_train)
predictions = model.predict(X_test)
Day 18: Decision Trees
What It Is
A flowchart-like structure where each node represents a feature test, each branch is an outcome, and each leaf is a prediction.
**Decision Tree Splitting**
A decision tree splits the feature space into rectangular regions, each assigned to a class.
Key Concepts:
Information Gain (ID3/C4.5): Measures reduction in uncertainty after a split
Gini Impurity (CART): Measures probability of misclassification
Pruning: Removing branches that have little predictive power to prevent overfitting
When to Use Decision Trees:
- Interpretability is important
- Non-linear relationships
- Mixed feature types (numeric + categorical)
- Fast training and prediction
Limitations:
- Prone to overfitting
- Sensitive to small data changes
- Can create biased trees with imbalanced data
Code Example:
from sklearn.tree import DecisionTreeClassifier
model = DecisionTreeClassifier(max_depth=5, random_state=42)
model.fit(X_train, y_train)
predictions = model.predict(X_test)
# Visualize
from sklearn.tree import plot_tree
plot_tree(model, feature_names=features, filled=True)
Day 19: Random Forest
What It Is
An ensemble method that combines multiple decision trees to create a more powerful and robust model.
The Wisdom of Crowds
- Individual trees: Good, but can make mistakes
- Forest of trees: Mistakes tend to cancel out
How It Works (Bagging):
- Bootstrap: Sample data with replacement for each tree
- Feature Randomness: Each tree sees random subset of features
- Aggregate: Majority vote (classification) or average (regression)
Visual Concept:
Tree 1 ─┐
Tree 2 ─┼──→ Final Prediction (Vote/Average)
Tree 3 ─┤
Tree 4 ─┘
Why Random Forest Works:
- Reduces overfitting: Individual trees overfit, but averaging reduces it
- Handles non-linearity: Trees capture complex patterns
- Robust to outliers: Individual trees may be affected, forest is not
- Feature importance: Shows which features matter most
When to Use Random Forest:
- Most classification/regression tasks
- Good default choice
- Need robust predictions
- Feature importance analysis
Code Example:
from sklearn.ensemble import RandomForestClassifier
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)
predictions = model.predict(X_test)
# Feature importance
importances = model.feature_importances_
Day 20: Support Vector Machines (SVM)
What It Is
SVM finds the optimal hyperplane that best separates classes in the feature space.
**SVM Maximizing Margin**
SVM finds the hyperplane that maximizes the margin between classes. The points closest to the boundary are called support vectors.
The Key Idea: Maximize the Margin
SVM doesn't just find any separating line; it finds the line with the largest margin between classes.
The Kernel Trick
SVM can separate non-linear data by mapping it to higher dimensions:
A circular pattern in 2D becomes linearly separable when mapped to higher dimensions.
Common Kernels:
- Linear: Straight-line separation
- RBF (Radial Basis Function): Flexible, curved boundaries
- Polynomial: Curved surfaces
When to Use SVM:
- Binary classification
- High-dimensional data
- Clear margin of separation
- Small to medium datasets
Limitations:
- Slow on large datasets
- Sensitive to parameter choice
- Requires feature scaling (see the Pipeline sketch below)
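Because of that scaling requirement, a common pattern is to bundle the scaler and the SVM into a single Pipeline so the scaler is fit only on training data; a minimal sketch, assuming X_train, X_test, and y_train exist:

```python
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Scaling and the classifier travel together; fit() scales using training data only
model = make_pipeline(StandardScaler(), SVC(kernel='rbf', C=1.0, gamma='scale'))
model.fit(X_train, y_train)
predictions = model.predict(X_test)
```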
Code Example:
from sklearn.svm import SVC
model = SVC(kernel='rbf', C=1.0, gamma='scale')
model.fit(X_train_scaled, y_train)
predictions = model.predict(X_test_scaled)
Day 21: Naive Bayes
What It Is
A probabilistic classifier based on Bayes' Theorem with a "naive" assumption of feature independence.
Bayes' Theorem
P(A|B) = P(B|A) × P(A) / P(B)
**Bayes' Theorem**
The posterior probability P(A|B) depends on both the prior probability P(A) and the likelihood of the evidence P(B|A).
The Naive Assumption
Assume all features are independent given the class. This simplifies calculation dramatically (even though it is rarely true in practice).
Why It is Called "Naive"
Real-world features are often correlated, but Naive Bayes ignores this. Surprisingly, this works well anyway!
Text Classification Example: Spam Detection
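As a minimal sketch with a tiny made-up corpus (illustrative only): CountVectorizer turns each email into word counts, which Multinomial Naive Bayes then models.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Tiny invented corpus, purely for illustration
emails = ["win a free prize now", "meeting at 10 tomorrow",
          "free money click now", "project update attached"]
labels = [1, 0, 1, 0]  # 1 = spam, 0 = not spam

vectorizer = CountVectorizer()
X_counts = vectorizer.fit_transform(emails)  # word-count features

model = MultinomialNB()
model.fit(X_counts, labels)
print(model.predict(vectorizer.transform(["free prize tomorrow"])))  # likely [1] = spam
```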
Types of Naive Bayes:
| Type | Best For |
|---|---|
| Gaussian | Continuous features (assumes normal distribution) |
| Multinomial | Word counts, text classification |
| Bernoulli | Binary features (present/absent) |
When to Use Naive Bayes:
- Text classification (spam, sentiment)
- Multi-class classification
- Fast training and prediction
- Works well with small data
Code Example:
from sklearn.naive_bayes import MultinomialNB
model = MultinomialNB()
model.fit(X_train, y_train)
predictions = model.predict(X_test)
Notes (Algorithms)
Ask yourself:
"What kind of problem does this solve best?"
PHASE 4: Evaluation & Learning Process (Days 22-29)
Day 22: Model Evaluation Metrics
How Do You Know If Your Model Is Good?
You need metrics to quantify performance. Different problems require different metrics.
Regression Metrics:
1. Mean Absolute Error (MAE)
MAE = (1/n) Σ |yᵢ - ŷᵢ|
Interpretation: Average absolute difference between predictions and actual values. In the same units as the target.
2. Mean Squared Error (MSE)
MSE = (1/n) Σ (yᵢ - ŷᵢ)²
Penalizes larger errors more heavily. Units are squared.
3. Root Mean Squared Error (RMSE)
RMSE = √MSE
Interpretation: Back to original units. More interpretable than MSE.
4. R² Score (Coefficient of Determination)
R² = 1 - (sum of squared residuals / total sum of squares)
Interpretation: Proportion of variance explained. 1.0 = perfect, 0 = no better than always predicting the mean.
Classification Metrics:
1. Accuracy
Accuracy = (TP + TN) / (TP + TN + FP + FN)
Interpretation: Proportion of correct predictions. Good for balanced classes.
2. Precision
Precision = TP / (TP + FP)
Interpretation: Of all positive predictions, how many are correct? Important when false positives are costly.
3. Recall (Sensitivity)
Recall = TP / (TP + FN)
Interpretation: Of all actual positives, how many did we find? Important when false negatives are costly.
4. F1 Score
F1 = 2 × (Precision × Recall) / (Precision + Recall)
Interpretation: Harmonic mean of precision and recall. Good for imbalanced classes.
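All of these are available in sklearn.metrics; a minimal sketch, assuming y_test and predictions come from an earlier train-test split (use the regression or classification half as appropriate):

```python
import numpy as np
from sklearn.metrics import (mean_absolute_error, mean_squared_error, r2_score,
                             accuracy_score, precision_score, recall_score, f1_score)

# Regression (y_test and predictions are continuous values)
print("MAE :", mean_absolute_error(y_test, predictions))
print("RMSE:", np.sqrt(mean_squared_error(y_test, predictions)))
print("R2  :", r2_score(y_test, predictions))

# Classification (y_test and predictions are class labels)
print("Accuracy :", accuracy_score(y_test, predictions))
print("Precision:", precision_score(y_test, predictions))
print("Recall   :", recall_score(y_test, predictions))
print("F1       :", f1_score(y_test, predictions))
```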
Choosing the Right Metric:
| Problem Type | Common Metrics |
|---|---|
| Regression | MAE, RMSE, R² |
| Binary Classification (balanced) | Accuracy |
| Binary Classification (imbalanced) | Precision, Recall, F1 |
| Multi-class | Macro/Micro F1 |
Day 23: Confusion Matrix
What Is a Confusion Matrix?
A table that describes the performance of a classification model. It shows actual vs predicted classifications.
Binary Classification Example:
|  | Predicted Negative | Predicted Positive |
|---|---|---|
| Actual Negative | TN | FP |
| Actual Positive | FN | TP |
Where:
- TN = True Negative (correctly predicted negative)
- TP = True Positive (correctly predicted positive)
- FN = False Negative (missed positive - Type II Error)
- FP = False Positive (wrongly predicted positive - Type I Error)
Example: Cancer Detection (100 patients)
Calculations:
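As an illustration with hypothetical numbers (not from the original series), suppose that out of 100 patients: TP = 8, FN = 2, FP = 10, TN = 80. Then:
- Accuracy = (TP + TN) / 100 = 88 / 100 = 0.88
- Precision = TP / (TP + FP) = 8 / 18 ≈ 0.44
- Recall = TP / (TP + FN) = 8 / 10 = 0.80
High recall with low precision: most cancers are caught, but there are many false alarms.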
Multi-Class Confusion Matrix:
For 3+ classes, you get an NΓN matrix showing all class predictions.
Visualizing:
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
cm = confusion_matrix(y_test, predictions)
disp = ConfusionMatrixDisplay(confusion_matrix=cm)
disp.plot()
Why It Matters:
- Shows not just accuracy, but types of errors
- Identifies which classes are confused with each other
- Reveals class imbalance issues
Day 24: ROC Curve & AUC
ROC Curve
Receiver Operating Characteristic curve plots:
- True Positive Rate (Recall) on Y-axis
- False Positive Rate on X-axis
At various classification thresholds.
AUC (Area Under the Curve)
The area under the ROC curve. Single number summary:
| AUC | Meaning |
|---|---|
| 0.5 | Random guessing (useless) |
| 0.7-0.8 | Fair |
| 0.8-0.9 | Good |
| 0.9+ | Excellent |
**ROC Curve**
The ROC curve shows the tradeoff between True Positive Rate and False Positive Rate at different thresholds. The diagonal line represents random guessing.
Why Use ROC-AUC?
- Threshold-independent evaluation
- Works well for imbalanced datasets
- Shows tradeoff between TPR and FPR
When to Use:
- Binary classification
- Comparing multiple models
- Imbalanced classification problems
Code Example:
from sklearn.metrics import roc_auc_score, roc_curve
import matplotlib.pyplot as plt
# Get probabilities for positive class
y_proba = model.predict_proba(X_test)[:, 1]
# AUC Score
auc = roc_auc_score(y_test, y_proba)
# Plot ROC Curve
fpr, tpr, thresholds = roc_curve(y_test, y_proba)
plt.plot(fpr, tpr, label=f"AUC = {auc:.2f}")
plt.plot([0, 1], [0, 1], linestyle="--")  # random-guessing baseline
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.legend()
plt.show()
Day 25: Cross-Validation
The Problem with Single Split
A single train-test split gives one performance estimate. But what if that split is unlucky?
**K-Fold Cross-Validation**
Watch how K-fold CV uses each fold as test set exactly once, giving a more reliable performance estimate.
K-Fold Cross-Validation Solution
Split data into K folds, train K times:
| Fold | Split |
|---|---|
| 1 | [Test] [Train] [Train] [Train] [Train] |
| 2 | [Train] [Test] [Train] [Train] [Train] |
| 3 | [Train] [Train] [Test] [Train] [Train] |
| 4 | [Train] [Train] [Train] [Test] [Train] |
| 5 | [Train] [Train] [Train] [Train] [Test] |
Final Score = Average of K scores
Common Values of K:
- K=5: Good balance of speed and reliability
- K=10: Standard, very reliable
- K=LOOCV: Leave-One-Out (K=n), very slow but uses maximum data
Stratified K-Fold
For classification, use stratified K-fold to maintain class distribution in each fold.
Nested Cross-Validation
For hyperparameter tuning:
Outer Loop: Evaluate model performance
└── Inner Loop: Tune hyperparameters
Code Example:
from sklearn.model_selection import cross_val_score
scores = cross_val_score(
model, X, y,
cv=5, # 5-fold
scoring='accuracy'
)
print(f"Scores: {scores}")
print(f"Mean: {scores.mean():.3f}")
print(f"Std: {scores.std():.3f}")Why It Matters:
- More reliable performance estimate
- Uses all data for training and testing
- Reduces variance of performance estimate
Day 26: Feature Scaling
Why Scale Features?
Many algorithms are sensitive to feature scales:
- Distance-based algorithms (KNN, SVM)
- Gradient descent algorithms
- Regularization
Without Scaling:
Age: 25-65 (small range)
Income: 20,000-200,000 (large range)
→ Income dominates completely
Types of Scaling:
1. Standardization (Z-score normalization)
z = (x - mean) / standard deviation
Result: Mean = 0, Std = 1
2. Min-Max Normalization
x_scaled = (x - min) / (max - min)
Result: Range [0, 1]
3. Robust Scaling
x_scaled = (x - median) / IQR
Uses median and IQR (outlier-resistant).
Code Example:
from sklearn.preprocessing import StandardScaler, MinMaxScaler
# Standardization
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
# Min-Max Scaling
scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)
Important: Fit the scaler on training data only, then transform both train and test with the same scaler.
When to Scale:
- Yes: KNN, SVM, K-Means
- Yes: Logistic Regression, Linear Regression (with regularization)
- Yes: Neural Networks
- Yes: PCA, clustering
- No: Tree-based models (Decision Tree, Random Forest) - invariant to scale
Day 27: Gradient Descent
What Is Gradient Descent?
An optimization algorithm that minimizes a loss function by iteratively moving in the direction of steepest descent.
The Intuition
Imagine being blindfolded on a mountain and trying to reach the bottom by feeling the slope under your feet. Gradient descent is the mathematical version of this.
The Update Rule
θ = θ - α · ∇J(θ)
Where:
- θ = model parameters (weights)
- α = learning rate (step size)
- ∇J(θ) = gradient (slope of the loss function)
Visual Example:
**Gradient Descent**
Watch how gradient descent iteratively moves toward the minimum of the loss function. The learning rate determines step size.
Types of Gradient Descent:
1. Batch Gradient Descent
- Uses all data per iteration
- Stable, but slow for large datasets
2. Stochastic Gradient Descent (SGD)
- Uses one sample per iteration
- Fast, noisy, can escape local minima
3. Mini-Batch Gradient Descent
- Uses small batches (32, 64, 128 samples)
- Best of both worlds β most common in practice
Learning Rate Matters:
- Too small: Converges very slowly
- Too large: Oscillates or diverges
- Just right: Converges efficiently
Code Example (Concept):
# Simplified stochastic gradient descent for y = w * x (toy numbers for illustration)
import numpy as np
x = np.array([1.0, 2.0, 3.0, 4.0])
y = 2.0 * x                      # true relationship: w = 2
w, learning_rate, epochs = 0.0, 0.01, 100
for epoch in range(epochs):
    for xi, yi in zip(x, y):
        prediction = w * xi
        error = prediction - yi
        gradient = error * xi    # derivative of squared error w.r.t. w
        w -= learning_rate * gradient
print(w)  # converges toward 2.0
Day 28: Regularization
What Is Regularization?
A technique to prevent overfitting by adding a penalty to the loss function that discourages complex models.
The Bias-Variance Tradeoff Again
Regularization intentionally increases bias to reduce variance, finding a better total error.
Types of Regularization:
1. L1 Regularization (Lasso)
Adds absolute value of weights to loss:
Loss = MSE + λ × Σ|weights|
Effect: Pushes some weights to exactly zero (feature selection)
2. L2 Regularization (Ridge)
Adds squared weights to loss:
Loss = MSE + λ × Σ(weights)²
Effect: Shrinks weights toward zero (but rarely to exactly zero)
3. Elastic Net (Combination)
Loss = MSE + λ1 × Σ|weights| + λ2 × Σ(weights)²
**Regularization Effect**
Adjust lambda (regularization strength) to see how it affects the model complexity. Higher lambda = simpler model = less overfitting.
Lambda (λ) Controls Strength:
- λ = 0: No regularization (risk of overfitting)
- λ = large: Very strong regularization (risk of underfitting)
Code Example:
from sklearn.linear_model import Ridge, Lasso, ElasticNet
# Ridge (L2)
ridge = Ridge(alpha=1.0)
ridge.fit(X_train, y_train)
# Lasso (L1)
lasso = Lasso(alpha=1.0)
lasso.fit(X_train, y_train)
# Elastic Net
elastic = ElasticNet(alpha=1.0, l1_ratio=0.5)
elastic.fit(X_train, y_train)
Day 29: Hyperparameter Tuning
What Are Hyperparameters?
Parameters set BEFORE training starts (not learned from data):
- Learning rate
- Number of trees in Random Forest
- K in KNN
- Regularization strength
- Max depth of Decision Tree
Why Tuning Matters
Small changes can dramatically affect performance:
Default Random Forest: 82% accuracy
Tuned Random Forest: 89% accuracy
Tuning Methods:
1. Grid Search
Try all combinations in a predefined grid.
from sklearn.model_selection import GridSearchCV
param_grid = {
'n_estimators': [50, 100, 200],
'max_depth': [5, 10, 15],
'min_samples_split': [2, 5, 10]
}
grid_search = GridSearchCV(
RandomForestClassifier(),
param_grid,
cv=5,
scoring='accuracy'
)
grid_search.fit(X_train, y_train)
2. Random Search
Randomly sample hyperparameters (often more efficient).
from sklearn.model_selection import RandomizedSearchCV
random_search = RandomizedSearchCV(
RandomForestClassifier(),
param_distributions=param_grid,
n_iter=30,
cv=5
)
random_search.fit(X_train, y_train)
3. Bayesian Optimization (Advanced)
Uses probability models to smartly select promising hyperparameters.
Important Rules:
- Tune on validation set, not test set
- Use cross-validation for small datasets
- Don't tune too many parameters at once
Notes (Evaluation)
This phase is what separates
"tutorial ML" from "real ML".
PHASE 5: The Big Picture (Day 30)
Day 30: End-to-End Machine Learning Workflow
Putting it all together, this is how real ML projects work:
The Complete Pipeline:
1. Define Problem - What are we predicting? What data do we need?
2. Collect Data - Get, scrape, or buy data relevant to the problem
3. Explore Data (EDA) - Understand distributions, correlations, patterns
4. Preprocess Data - Clean, handle missing values, encode categories, scale
5. Split Data - Train/validation/test split
6. Choose Baseline - Simple model to beat (linear regression, majority class)
7. Try Multiple Models - Compare 3-5 different algorithms
8. Evaluate & Tune - Use validation set, cross-validation, hyperparameter tuning
9. Final Evaluation - Evaluate on test set ONCE
10. Deploy & Monitor - Put in production, track performance over time
Key Insight:
| Stage | Time |
|---|---|
| Data Collection & Cleaning | 60% |
| Model Building & Tuning | 20% |
| Evaluation & Deployment | 20% |
Real ML is 80% data work.
Final Project Idea:
Build a complete ML pipeline end-to-end:
- Collect data (Kaggle dataset or scrape)
- Clean and preprocess
- Try 3+ models
- Tune the best one
- Evaluate on held-out test set
- Write a short report explaining your choices
Notes (Workflow)
ML is a process, not a script.
PROJECT IDEAS
ML basics, features, labels, train/test split
| Project | Difficulty | Time |
|---|---|---|
| Predict house prices using fake/simple data | ★ | 2-3 hours |
| Classify students as pass/fail using manual rules | ★ | 1-2 hours |
| Manually label a small dataset (build intuition) | ★ | 1 hour |
| Build a simple guessing game based on rules | ★ | 1 hour |
Skills Practiced: Understanding inputs → outputs, thinking in features, manual pattern recognition
Preprocessing, correlation, underfitting/overfitting
| Project | Difficulty | Time |
|---|---|---|
| Clean a messy CSV (handle missing values, outliers) | ★★ | 3-4 hours |
| Visualize correlations with matplotlib/seaborn | ★★ | 2-3 hours |
| Create synthetic data and identify underfitting vs overfitting | ★★ | 3-4 hours |
| Exploratory data analysis (EDA) on a real dataset | ★★ | 4-5 hours |
Skills Practiced: Data cleaning, visualization, statistical thinking, model behavior recognition
Linear Regression, KNN, Decision Trees, Random Forest, SVM, Naive Bayes
| Project | Difficulty | Time |
|---|---|---|
| Linear Regression from scratch (NumPy only) | ★★★ | 4-5 hours |
| Spam classifier using Naive Bayes | ★★★ | 4-5 hours |
| Build a Decision Tree classifier | ★★★ | 3-4 hours |
| Compare KNN vs Logistic Regression on same dataset | ★★★ | 4-5 hours |
| Titanic survival prediction with Random Forest | ★★★ | 5-6 hours |
Skills Practiced: Algorithm implementation, model comparison, feature interpretation
Metrics, cross-validation, regularization, hyperparameter tuning
| Project | Difficulty | Time |
|---|---|---|
| Evaluate model with Confusion Matrix, Precision, Recall, F1 | ★★ | 2-3 hours |
| Plot and interpret ROC-AUC curves | ★★★ | 3-4 hours |
| Implement 5-fold cross-validation from scratch | ★★★ | 4-5 hours |
| Feature scaling experiment (with/without StandardScaler) | ★★ | 2-3 hours |
| Grid search vs Random search comparison | ★★★ | 4-5 hours |
Skills Practiced: Model evaluation, robust validation, performance optimization
Apply the complete ML workflow to a real problem:
- House prices, stock prices, demand forecasting (Focus: Regression, feature engineering)
- Customer churn, fraud detection, disease diagnosis (Focus: Metrics, class imbalance)
- Customer segmentation, document grouping (Focus: Unsupervised learning)
- Movie/product recommendations (Focus: Collaborative filtering)
- Define the problem clearly
- Load and explore data (EDA)
- Preprocess and engineer features
- Train multiple models
- Evaluate rigorously (metrics + validation)
- Document your reasoning and choices
Focus on reasoning, not accuracy. A well-reasoned wrong answer is more valuable than a lucky correct one.
How to Use This Page
- One day = one concept
- Tick only when you understand
- Revisit notes weekly
- Build at least 1 small project
Final Reminder
You do NOT need to know everything.
If you can explain the core concepts simply, apply the end-to-end workflow, and evaluate your results honestly, you already know Machine Learning.