ML in 30 Days: A Clear, Beginner-Friendly Roadmap to Machine Learning

A practical 30-day roadmap to understand Machine Learning from first principles — covering data, models, evaluation, and real-world workflows without unnecessary math or hype.


ML in 30 Days 🚀

A clear, beginner-friendly roadmap to understand Machine Learning end to end

This page follows the exact ML in 30 Days Instagram series. Use it to track progress, revise concepts, and move from theory → practice.


🧠 PHASE 1 — Foundations (Days 1–7)

Day 1: What is Machine Learning?

Traditional Programming vs Machine Learning

In traditional programming, you write explicit rules: "If income > 50000 AND age > 30, then approve loan." The computer follows your rules exactly.

In machine learning, you show the computer examples of inputs and outputs, and it learns the patterns itself. Instead of saying "what to do," you say "here's what happened before—figure out why."

The Core Idea

Machine Learning is teaching computers to learn patterns from data rather than being explicitly programmed for each decision. You provide data (experiences) and the algorithm develops its own understanding (knowledge).

Real-World Examples

  • Email spam filters learn from millions of emails
  • Netflix recommends shows based on your viewing history
  • Banks detect fraud by learning normal transaction patterns
  • Voice assistants recognize speech by learning from thousands of voice samples

Key Insight: ML shines when rules are too complex to write manually, or when the rules change frequently and you'd need to constantly rewrite code.


Day 2: Types of Machine Learning

1. Supervised Learning — Learning with a Teacher

You provide labeled examples: both the input AND the correct output. The model learns to predict outputs from inputs.

Examples:

  • Email → Spam/Not Spam (labeled by you)
  • House size → Price (historical sales data)
  • Image → Cat/Dog (manually labeled photos)

2. Unsupervised Learning — Learning without a Teacher

You provide only inputs. The model finds hidden patterns or structures on its own—no labels provided.

Examples:

  • Group customers by purchasing behavior (no pre-defined groups)
  • Compress data by finding common patterns
  • Detect anomalies (things that don't fit the pattern)

3. Reinforcement Learning — Learning from Experience

An agent takes actions in an environment and learns from rewards/punishments. It discovers through trial and error what works best.

Examples:

  • AlphaGo learning to play Go
  • Robots learning to walk
  • Game AI learning strategies

Quick Comparison

Type | Data | Goal | Analogy
Supervised | Labeled | Predict | Learning with an answer key
Unsupervised | Unlabeled | Discover patterns | Finding groups/clusters
Reinforcement | Actions + Rewards | Maximize reward | Learning from consequences

Day 3: Supervised vs Unsupervised Learning

When to Use Supervised Learning

Use supervised learning when you have:

  • Historical data with known outcomes
  • A clear target variable you want to predict
  • Enough labeled examples to train on

Two Categories of Supervised Learning:

  1. Classification — Predict categories/classes

    • Email: Spam or Not Spam
    • Tumor: Malignant or Benign
    • Customer: Will Buy or Won't Buy
    • Output is discrete (finite set of options)
  2. Regression — Predict continuous values

    • House price prediction
    • Temperature forecasting
    • Sales estimation
    • Output is a number on a continuous scale

When to Use Unsupervised Learning

Use unsupervised learning when you:

  • Don't have labels/outcomes available
  • Want to explore data and discover patterns
  • Need to segment customers/users into natural groups
  • Want to reduce data complexity for visualization

Common Unsupervised Techniques:

  1. Clustering — Group similar data points

    • Customer segmentation
    • Image compression (group similar pixels)
  2. Dimensionality Reduction — Simplify without losing information

    • PCA (Principal Component Analysis)
    • Make high-dimensional data visualizable

The Critical Difference: Supervised = predicting known categories/values. Unsupervised = discovering unknown structure.


Day 4: Features & Labels

What is a Feature?

A feature (also called a variable, attribute, or predictor) is a measurable property or characteristic of the phenomenon you're observing.

Example: Predicting House Prices

Feature | Type | Description
Square feet | Numeric | Size of the house
Number of bedrooms | Numeric | Count of rooms
Location | Categorical | Neighborhood/zip code
Age of house | Numeric | Years since built
Number of bathrooms | Numeric | Count of bathrooms

What is a Label?

The label (or target/ground truth) is the value you're trying to predict. It's the "answer" for supervised learning.

Continuing the House Example:

  • Label: Sale price ($350,000, $420,000, etc.)

Feature Engineering — Crafting Good Features

The quality of your features often matters more than the algorithm you choose.

Good Features:

  • Relevant to the prediction task
  • Reliable (consistent, not noisy)
  • Available for new data you'll predict on
  • Understandable

Feature Engineering Examples:

  • Instead of raw dates, use "days since event"
  • Instead of full address, use "distance to downtown"
  • Combine related features (e.g., bedrooms + bathrooms = total rooms)
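
A minimal pandas sketch of the first and third ideas above, assuming a hypothetical DataFrame with sale_date, bedrooms, and bathrooms columns (the column names and reference date are made up for illustration):

import pandas as pd

# Hypothetical raw data
df = pd.DataFrame({
    'sale_date': pd.to_datetime(['2023-01-10', '2023-06-01']),
    'bedrooms': [3, 4],
    'bathrooms': [2, 3],
})

# "Days since event" instead of a raw date
df['days_since_sale'] = (pd.Timestamp('2024-01-01') - df['sale_date']).dt.days

# Combine related features into one
df['total_rooms'] = df['bedrooms'] + df['bathrooms']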

Feature Types:

  1. Numeric/Continuous — Can take any value in a range

    • Price, temperature, age
  2. Categorical/Discrete — Finite set of categories

    • Color (red, blue, green), Yes/No, Rating (1-5)
  3. Ordinal — Categories with meaningful order

    • Education level, satisfaction rating

Day 5: Training Data vs Test Data

The Core Problem

We train a model on some data, but we care about how it performs on NEW, UNSEEN data. This is called generalization.

The Solution: Train-Test Split

Split your data into:

  • Training Set (typically 70-80%) — Used to teach the model
  • Test Set (typically 20-30%) — Used to evaluate performance

Why This Matters

Training Data → Model learns patterns
Test Data     → Model is evaluated on patterns it has NEVER seen

Critical Rule: Never touch your test set during training. Using test data for training = cheating.

Visual Example with 100 data points:

All Data (100 samples)
├── Training Set (80 samples) ──→ Model learns from this
└── Test Set (20 samples) ──────→ Only used ONCE at the end

The Train-Test Split Code:

from sklearn.model_selection import train_test_split
 
X_train, X_test, y_train, y_test = train_test_split(
    features, labels, test_size=0.2, random_state=42
)

random_state ensures reproducibility—you get the same split every time.

Validation Set (Bonus)

For complex models, you often use three splits:

  • Training Set (60-70%) — Learn parameters
  • Validation Set (10-20%) — Tune hyperparameters
  • Test Set (10-20%) — Final evaluation
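
One common way to get three splits is to call train_test_split twice; a minimal sketch reusing the features and labels variables from above (the 60/20/20 proportions are just one example):

from sklearn.model_selection import train_test_split

# First carve out the final test set (20% of all data)
X_temp, X_test, y_temp, y_test = train_test_split(features, labels, test_size=0.2, random_state=42)

# Then split the remainder into train (60% overall) and validation (20% overall)
X_train, X_val, y_train, y_val = train_test_split(X_temp, y_temp, test_size=0.25, random_state=42)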

Day 6: What is a Model?

Model = Mathematical Representation of Patterns

A machine learning model is a mathematical function that takes inputs (features) and produces outputs (predictions). It captures the relationship between X (features) and y (label).

Simple Example: Linear Regression

y = mx + b

Where:

  • m = slope (weight/coefficient)
  • b = intercept (bias)
  • x = feature (square footage)
  • y = prediction (house price)

The model "learns" m and b from training data.

Models are Templates, Not Rules

Think of models as flexible templates that mold themselves to fit your data:

  • Linear Regression = straight line template
  • Decision Tree = flowchart template
  • Neural Network = complex pattern-matching template

The Learning Process

  1. Start with random/fresh parameters
  2. Make predictions on training data
  3. Calculate prediction error (how wrong?)
  4. Adjust parameters to reduce error
  5. Repeat until error is minimized

Key Insight: The model doesn't "know" anything about houses. It simply finds the mathematical relationship that best maps square footage β†’ price based on examples.

Analogy: Learning to Catch

  • Traditional programming: Someone tells you "move your hand to coordinates (x,y)"
  • Machine learning: You throw 100 balls, miss most, adjust your movement each time, eventually get better

The model is your "catching strategy"—learned from experience (data), not explicitly programmed.


Day 7: Bias in Data

What is Data Bias?

Bias in ML is systematic error that skews results in a particular direction. It comes from the data, not the algorithm.

Common Types of Data Bias:

1. Selection Bias

  • Training data doesn't represent real-world distribution
  • Example: Training a face detector only on photos of young people, then failing on elderly faces

2. Label Bias

  • Human labels are inconsistent or prejudiced
  • Example: Historical hiring data reflecting past discrimination

3. Confirmation Bias

  • Collect/interpret data to confirm existing beliefs
  • Example: Only tracking positive customer reviews

4. Survivorship Bias

  • Only analyzing "successful" cases, ignoring failures
  • Example: Studying only successful startups to predict success

Real-World Consequence Examples:

  • COMPAS Recidivism Algorithm: Higher false positive rate for Black defendants
  • Amazon Hiring Tool: Biased against women (trained on 10 years of resumes)
  • Facial Recognition: Poor performance on darker skin tones (underrepresented in training data)

The Fix:

  1. Audit your data — Who/what is represented?
  2. Diversify data collection — Ensure broad representation
  3. Test on multiple groups — Check performance equity
  4. Acknowledge limitations — Be transparent about biases

Key Insight: A model is only as good as the data it's trained on. "Garbage in, garbage out."

πŸ“ Notes (Foundations)

Write concepts in your own words.
If you can explain it simply, you understand it.


📊 PHASE 2 — Data & Core Concepts (Days 8–14)

Day 8: Data Preprocessing

Why Preprocess Data?

Raw data is messy. Real-world data has:

  • Missing values
  • Inconsistent formats
  • Outliers
  • Duplicate entries
  • Irrelevant columns

Preprocessing prepares data for your model to learn effectively.

Common Preprocessing Steps:

1. Handling Missing Values

Options:

  • Remove rows with missing values (if few)
  • Fill with mean/median/mode (simple imputation)
  • Use advanced techniques (KNN imputation, iterative imputation)
# Option 1: Drop rows with missing values
df = df.dropna()
 
# Option 2: Fill with mean
df['column'] = df['column'].fillna(df['column'].mean())

2. Encoding Categorical Variables

Convert text categories to numbers:

  • Label Encoding: cat → 0, dog → 1, bird → 2
  • One-Hot Encoding: Creates binary columns
from sklearn.preprocessing import LabelEncoder
 
encoder = LabelEncoder()
df['category_encoded'] = encoder.fit_transform(df['category'])
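
For one-hot encoding, pandas' get_dummies is a common shortcut; a minimal sketch on the same hypothetical 'category' column:

import pandas as pd

# One-Hot Encoding: one binary column per category value
df = pd.get_dummies(df, columns=['category'], prefix='category')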

3. Handling Duplicates

df = df.drop_duplicates()

4. Fixing Inconsistent Data

# Standardize text
df['city'] = df['city'].str.lower().str.strip()
 
# Fix typos in categories
df['country'] = df['country'].replace({'usa': 'United States'})

The Preprocessing Pipeline:

Raw Data → Clean → Transform → Feature Engineer → Model-Ready Data

Key Insight: Data scientists spend 60-80% of their time on data preprocessing. It's not glamorous, but it makes or breaks your model.


Day 9: Train–Test Split

Recap and Deep Dive

We discussed train-test split in Day 5. Now let's understand it more deeply.

Why 80/20 Split?

  • Too little training data → model can't learn patterns
  • Too little test data → unreliable performance estimate
  • 80/20 is a good starting point (also common: 70/30, 75/25)

The Stratified Split

When classes are imbalanced, use stratified sampling:

train_test_split(X, y, test_size=0.2, stratify=y)

This ensures the train and test sets have the same class distribution.

The Data Leakage Problem

CRITICAL: Never let information from test data influence training.

Bad Examples:

  • Computing mean on entire dataset before splitting
  • Normalizing using training + test combined
  • Feature engineering using test data knowledge

Correct Approach:

# Split FIRST
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
 
# THEN compute statistics on training data only
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)  # Uses training stats!

Practical Splitting Strategy:

1. Hold out test set (never touch until final evaluation)
2. Use validation set for hyperparameter tuning
3. Train on remaining data
4. Report test set performance once

Day 10: Underfitting

What is Underfitting?

Underfitting is when your model is too simple to capture patterns in the data. It performs poorly on both training AND test data.

The Model is "Underpowered"

Like trying to fit a straight line to curved data—the model lacks the capacity to learn the true pattern.

Symptoms of Underfitting:

  • High training error
  • High test error
  • Model ignores important features
  • Patterns in data are obvious but model misses them

Visual Example:

The data points form a curve, but the model draws a straight line: a linear model cannot capture the curved relationship in the data.

Causes of Underfitting:

  1. Model too simple — Linear model for non-linear data
  2. Not enough features — Missing important predictors
  3. Too much regularization — Penalizing complexity too much
  4. Insufficient training — Stopped too early

How to Fix Underfitting:

  • Use a more complex model
  • Add more relevant features
  • Reduce regularization
  • Train longer (more epochs/iterations)

Example: Underfitting vs Good Fit vs Overfitting

Model | Equation | Description
Underfitting | y = 2x | Straight line on curved data
Good Fit | y = 2x + 0.5x² | Captures the curve
Overfitting | Complex polynomial | Wiggly line touching every point
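
A minimal sketch of the comparison in the table above, using synthetic curved data (the curve, noise level, and polynomial degree are just illustrative choices):

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
X = np.linspace(0, 4, 100).reshape(-1, 1)
y = 2 * X.ravel() + 0.5 * X.ravel() ** 2 + rng.normal(0, 0.3, 100)  # curved data + noise

# Underfit: straight line on curved data
linear = LinearRegression().fit(X, y)

# Better fit: a degree-2 polynomial can capture the curve
poly = make_pipeline(PolynomialFeatures(degree=2), LinearRegression()).fit(X, y)

print("linear MSE:", mean_squared_error(y, linear.predict(X)))
print("poly MSE:  ", mean_squared_error(y, poly.predict(X)))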

Key Insight: Underfitting is the model saying "I can't learn this." Overfitting is the model saying "I memorized this, but I don't understand."


Day 11: Mean, Median & Standard Deviation

These are fundamental statistics for understanding data distribution.

Mean (Average)

Sum of all values divided by count.

\text{mean} = \frac{x_1 + x_2 + \dots + x_n}{n}

Example: Test scores: 70, 80, 90, 70, 90

\text{mean} = \frac{70 + 80 + 90 + 70 + 90}{5} = 80

Median (Middle Value)

The value that separates the higher half from the lower half. Sort and find the middle.

Example: Test scores: 70, 80, 90, 70, 90

Sorted: 70, 70, 80, 90, 90

\text{Median} = 80

Mean vs Median:

  • Mean is sensitive to outliers
  • Median is robust to outliers

Example: Incomes: $30k, $40k, $50k, $60k, $1M

  • Mean: $236k (misleading—most people earn far less)
  • Median: $50k (more representative)

Standard Deviation (SD)

Measures how spread out values are from the mean.

Low SD: Values cluster near the mean
High SD: Values are widely spread

Example:

  • Class A scores: 78, 79, 80, 81, 82 → SD ≈ 1.6
  • Class B scores: 50, 70, 80, 90, 110 → SD ≈ 22
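
These can be checked directly with NumPy; a quick sketch using the test scores from the earlier example:

import numpy as np

scores = np.array([70, 80, 90, 70, 90])
print(np.mean(scores))         # 80.0
print(np.median(scores))       # 80.0
print(np.std(scores, ddof=1))  # sample standard deviation = 10.0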

Why These Matter in ML:

  • Feature scaling (normalize to similar ranges)
  • Outlier detection (values far from mean/median)
  • Understanding data distribution
  • Choosing appropriate models

Day 12: Correlation

What is Correlation?

Correlation measures how two variables change together. It tells you if one variable can predict another.

Correlation Coefficient (r)

Ranges from -1 to +1:

Value | Meaning
+1.0 | Perfect positive correlation
+0.7 | Strong positive correlation
+0.3 | Weak positive correlation
0.0 | No correlation
-0.3 | Weak negative correlation
-0.7 | Strong negative correlation
-1.0 | Perfect negative correlation

Positive Correlation

As one variable increases, the other increases. Example: Height ↑ → Weight ↑

Negative Correlation

As one variable increases, the other decreases. Example: Hours of exercise ↑ → Body fat ↓

No Correlation

Variables move independently. Example: Shoe size ↔ IQ score

Correlation ≠ Causation

Just because two things correlate doesn't mean one causes the other!

Spurious Correlation Example: Ice cream sales and shark attacks both increase in summer—but ice cream doesn't cause shark attacks.

Visualizing Correlation:

import seaborn as sns
import matplotlib.pyplot as plt
 
# Correlation matrix heatmap
corr_matrix = df.corr()
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm')

Feature Selection with Correlation:

  • Remove highly correlated features (redundancy)
  • Keep features strongly correlated with target
  • Remove features uncorrelated with target
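
A minimal sketch of these ideas, assuming a hypothetical DataFrame df whose label column is named 'target' (the 0.9 and 0.05 thresholds are arbitrary examples, not fixed rules):

import pandas as pd

corr = df.corr(numeric_only=True)

# Correlation of each feature with the target
target_corr = corr['target'].drop('target').abs()

# Keep features at least weakly correlated with the target
selected = target_corr[target_corr > 0.05].index.tolist()

# Flag highly correlated (redundant) feature pairs among the survivors
redundant = [
    (a, b)
    for a in selected for b in selected
    if a < b and abs(corr.loc[a, b]) > 0.9
]
print(selected, redundant)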

Day 13: Overfitting

What is Overfitting?

Overfitting is when your model learns the training data too well—including its noise and quirks. It memorizes rather than generalizes.

The Model is "Overly Complex"

It captures random fluctuations in training data that aren't real patterns.

Symptoms of Overfitting:

  • Very low training error
  • High test error
  • Model works perfectly on training data, poorly on new data

Visual Example:

Overfitting: a complex model fits every training point, including noise, memorizing quirks instead of learning the underlying pattern.

Causes of Overfitting:

  1. Model too complex — Deep tree on small data
  2. Too many features — More predictors than samples
  3. Training too long — Continued learning after patterns are found
  4. Insufficient data — Not enough examples to learn true patterns

How to Fix Overfitting:

  • Regularization — Penalize complexity
  • Get more data — More examples = better generalization
  • Feature selection — Remove irrelevant features
  • Cross-validation — Better performance estimation
  • Simplify model — Reduce model complexity
  • Early stopping — Stop training before memorizing
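
One way to see overfitting directly is to compare train and test accuracy as model complexity grows; a minimal sketch using a decision tree of increasing depth on synthetic data (the dataset and depths are arbitrary choices):

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

for depth in [2, 5, None]:  # None = grow until leaves are pure (most complex)
    tree = DecisionTreeClassifier(max_depth=depth, random_state=42).fit(X_train, y_train)
    print(depth, tree.score(X_train, y_train), tree.score(X_test, y_test))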

The Bias-Variance Tradeoff Visualized:

  • Underfitting: too simple, misses patterns (straight line misses the curved pattern)
  • Good Fit: just right, captures the trend (smooth curve captures the pattern)
  • Overfitting: too complex, memorizes noise (jagged line follows every point)


Day 14: Bias vs Variance

Understanding the Two Sources of Error

All prediction errors can be decomposed into:

\text{Total Error} = \text{Bias}^2 + \text{Variance} + \text{Irreducible Error}


Bias-Variance Tradeoff

Watch how bias and variance change with model complexity. The optimal model complexity minimizes total error.

Bias (Underfitting)

Bias is error from overly simplistic assumptions. The model is "biased" toward missing important patterns.

High Bias Symptoms:

  • Misses relevant relationships
  • Underestimates/overestimates systematically
  • Performs poorly on all data

Variance (Overfitting)

Variance is error from sensitivity to noise in training data. The model "varies" too much with different training sets.

High Variance Symptoms:

  • Captures random noise
  • Performs differently on different training sets
  • Training error is low, test error is high

The Bias-Variance Tradeoff

Situation | Problem | Solution
High Bias + Low Variance | Underfitting (consistent but wrong) | More complex model
Low Bias + High Variance | Overfitting (flexible but unstable) | Regularization, more data
Low Bias + Low Variance | Good Fit (right balance) | Optimal!

Practical Implications:

Situation | Problem | Solution
High train error, high test error | Underfitting | More complex model, more features
Low train error, high test error | Overfitting | Regularization, more data, simpler model
High train error, low test error | Rare (possible data leakage) | Check data pipeline

Key Insight: You cannot simultaneously minimize both bias and variance perfectly. The goal is to find the sweet spot where total error is minimized.

πŸ“ Notes (Data & Concepts)

Focus on why things break, not formulas.


🤖 PHASE 3 — ML Algorithms (Days 15–21)

Day 15: Linear Regression

What It Does

Linear Regression finds the best-fitting straight line through your data points. It predicts a continuous value based on input features.

The Equation

y = mx + b

Where:

  • y = predicted value (target)
  • x = input feature
  • m = slope (weight/coefficient)
  • b = intercept (bias)

Multiple Linear Regression (multiple features):

y = b_0 + b_1 x_1 + b_2 x_2 + \dots + b_n x_n


Linear Regression: Finding the Best Fit Line

The model learns to find the best-fit line through data points by minimizing the sum of squared errors.

How It Works

The algorithm finds the line (or hyperplane) that minimizes the sum of squared errors (MSE).

\text{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2

Where \hat{y}_i is the predicted value.

Simple Example: Predicting Ice Cream Sales

Temperature (°C) | Sales ($)
20 | 150
25 | 200
30 | 280
35 | 350

Model learns (approximately): Sales = 10 × Temperature − 50

When to Use Linear Regression:

  • Target is continuous (not categorical)
  • Features are linearly related to target
  • You need interpretability
  • Baseline model for comparison

When NOT to Use:

  • Relationships are non-linear
  • Target has complex interactions
  • Outliers heavily influence results

Code Example:

from sklearn.linear_model import LinearRegression
 
model = LinearRegression()
model.fit(X_train, y_train)
predictions = model.predict(X_test)
 
# Interpret coefficients
print(f"Coefficient: {model.coef_[0]}")
print(f"Intercept: {model.intercept_}")

Day 16: Logistic Regression

What It Does Despite the Name

Despite "regression" in the name, this is a classification algorithm. It predicts the probability of belonging to a class.

The Sigmoid Function

Logistic regression uses the sigmoid function to squash outputs between 0 and 1:

P(\text{class}) = \frac{1}{1 + e^{-z}}

Where z is the linear combination of features.


Sigmoid Function: Squashing Values to [0, 1]

The sigmoid function maps any real number to a probability between 0 and 1.

Binary Classification Example: Spam Detection

Feature | Value
Has word "free" | 1
Number of exclamation marks | 3
From unknown sender | 1

z = (0.5 \times 1) + (0.3 \times 3) + (0.8 \times 1) = 2.2

P(\text{spam}) = \frac{1}{1 + e^{-2.2}} \approx 0.90

Prediction: SPAM (probability > 0.5 threshold)
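
This hand calculation can be checked in a couple of lines (the weights are the made-up ones from the example above):

import numpy as np

z = 0.5 * 1 + 0.3 * 3 + 0.8 * 1   # 2.2
p_spam = 1 / (1 + np.exp(-z))     # sigmoid of z, roughly 0.90
print(z, p_spam)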

Decision Boundary

Typically, we use 0.5 as the threshold:

  • P > 0.5 → Class 1
  • P < 0.5 → Class 0

Multiclass Classification

Logistic Regression can be extended to 3+ classes:

  • One-vs-Rest (OvR): Train one classifier per class
  • Multinomial: Directly model class probabilities

When to Use Logistic Regression:

  • Binary classification problems
  • Need probability estimates
  • Interpretable model (coefficients show feature importance)
  • Well-separated classes

Code Example:

from sklearn.linear_model import LogisticRegression
 
model = LogisticRegression()
model.fit(X_train, y_train)
probabilities = model.predict_proba(X_test)
predictions = model.predict(X_test)

Day 17: K-Nearest Neighbors (KNN)

The Intuition

KNN makes predictions based on similarity. "Tell me who your neighbors are, and I'll tell you who you are."

How It Works

  1. Choose K (number of neighbors)
  2. For a new data point, find the K closest points
  3. Vote: The majority class among neighbors wins


KNN: Finding Nearest Neighbors

The new point (green circle) is classified based on the majority class of its K nearest neighbors.

Distance Metrics

The most common is Euclidean distance:

d = \sqrt{(x_1 - x_2)^2 + (y_1 - y_2)^2}

KNN Classification: K=3 vs K=5

With K=3: 2 A's (blue), 1 B (red) → Predict A. With K=5: 2 A's, 3 B's → Predict B.

Choosing K

  • Small K: Sensitive to noise, may overfit
  • Large K: Smoother boundaries, may underfit
  • Common: Try K = 3, 5, 7, sqrt(n)

When to Use KNN:

  • Small to medium datasets
  • Quick baseline model
  • No training phase (lazy learner)
  • Multi-class classification

When NOT to Use:

  • Large datasets (slow prediction)
  • High-dimensional data
  • Features on very different scales

Code Example:

from sklearn.neighbors import KNeighborsClassifier
 
model = KNeighborsClassifier(n_neighbors=5)
model.fit(X_train, y_train)
predictions = model.predict(X_test)

Day 18: Decision Trees

What It Is

A flowchart-like structure where each node represents a feature test, each branch is an outcome, and each leaf is a prediction.


Decision Tree: Splitting Feature Space

A decision tree splits the feature space into rectangular regions, each assigned to a class.

Key Concepts:

Information Gain (ID3/C4.5): Measures reduction in uncertainty after a split

IG = H(\text{parent}) - \sum \frac{|S_i|}{|S|} H(S_i)

Gini Impurity (CART): Measures probability of misclassification

\text{Gini} = \sum p_i (1 - p_i) = 1 - \sum p_i^2
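
A small sketch of the Gini calculation for a list of class labels (gini_impurity is a hypothetical helper written here for illustration, not a scikit-learn function):

import numpy as np

def gini_impurity(labels):
    # Gini = 1 - sum(p_i^2) over the class proportions p_i
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1 - np.sum(p ** 2)

print(gini_impurity([0, 0, 1, 1]))  # 0.5, maximally mixed for two classes
print(gini_impurity([0, 0, 0, 0]))  # 0.0, a pure node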

Pruning: Removing branches that have little predictive power to prevent overfitting

When to Use Decision Trees:

  • Interpretability is important
  • Non-linear relationships
  • Mixed feature types (numeric + categorical)
  • Fast training and prediction

Limitations:

  • Prone to overfitting
  • Sensitive to small data changes
  • Can create biased trees with imbalanced data

Code Example:

from sklearn.tree import DecisionTreeClassifier
 
model = DecisionTreeClassifier(max_depth=5, random_state=42)
model.fit(X_train, y_train)
predictions = model.predict(X_test)
 
# Visualize
from sklearn.tree import plot_tree
plot_tree(model, feature_names=features, filled=True)

Day 19: Random Forest

What It Is

An ensemble method that combines multiple decision trees to create a more powerful and robust model.

The Wisdom of Crowds

  • Individual trees: Good, but can make mistakes
  • Forest of trees: Mistakes tend to cancel out

How It Works (Bagging):

  1. Bootstrap: Sample data with replacement for each tree
  2. Feature Randomness: Each tree sees random subset of features
  3. Aggregate: Majority vote (classification) or average (regression)

Visual Concept:

        Tree 1 ─┐
        Tree 2 ─┼───→ Final Prediction (Vote/Average)
        Tree 3 ─┤
        Tree 4 ─┘

Why Random Forest Works:

  • Reduces overfitting: Individual trees overfit, but averaging reduces it
  • Handles non-linearity: Trees capture complex patterns
  • Robust to outliers: Individual trees may be affected, forest is not
  • Feature importance: Shows which features matter most

When to Use Random Forest:

  • Most classification/regression tasks
  • Good default choice
  • Need robust predictions
  • Feature importance analysis

Code Example:

from sklearn.ensemble import RandomForestClassifier
 
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)
predictions = model.predict(X_test)
 
# Feature importance
importances = model.feature_importances_

Day 20: Support Vector Machines (SVM)

What It Is

SVM finds the optimal hyperplane that best separates classes in the feature space.


SVM: Finding the Optimal Hyperplane

SVM finds the hyperplane that maximizes the margin between classes. The points closest to the boundary are called support vectors.

The Key Idea: Maximize the Margin

SVM doesn't just find any separating line—it finds the line with the largest margin between classes.

The Kernel Trick

SVM can separate non-linear data by mapping it to higher dimensions:

The Kernel Trick: Mapping to Higher Dimensions

A circular pattern in 2D becomes linearly separable when mapped to higher dimensions.

Common Kernels:

  • Linear: Straight-line separation
  • RBF (Radial Basis Function): Flexible, curved boundaries
  • Polynomial: Curved surfaces

When to Use SVM:

  • Binary classification
  • High-dimensional data
  • Clear margin of separation
  • Small to medium datasets

Limitations:

  • Slow on large datasets
  • Sensitive to parameter choice
  • Requires feature scaling

Code Example:

from sklearn.svm import SVC
 
model = SVC(kernel='rbf', C=1.0, gamma='scale')
model.fit(X_train_scaled, y_train)
predictions = model.predict(X_test_scaled)

Day 21: Naive Bayes

What It Is

A probabilistic classifier based on Bayes' Theorem with a "naive" assumption of feature independence.

Bayes' Theorem

P(\text{Class} \mid \text{Features}) = \frac{P(\text{Features} \mid \text{Class}) \times P(\text{Class})}{P(\text{Features})}


Bayes' Theorem: Updating Probabilities

The posterior probability depends on both the prior probability and the likelihood of evidence.

The Naive Assumption

Assume all features are independent given the class. This simplifies calculation dramatically (even though it is rarely true in practice).

P(x_1, x_2, ..., x_n \mid y) = P(x_1 \mid y) \times P(x_2 \mid y) \times ... \times P(x_n \mid y)

Why It is Called "Naive"

Real-world features are often correlated, but Naive Bayes ignores this. Surprisingly, this works well anyway!

Text Classification Example: Spam Detection

P(\text{Spam} \mid \text{words}) \propto P(\text{words} \mid \text{Spam}) \times P(\text{Spam})

P(\text{Spam} \mid \text{words}) \propto P(\text{free} \mid \text{Spam}) \times P(\text{money} \mid \text{Spam}) \times P(\text{urgent} \mid \text{Spam}) \times P(\text{Spam})

Types of Naive Bayes:

Type | Best For
Gaussian | Continuous features (assumes normal distribution)
Multinomial | Word counts, text classification
Bernoulli | Binary features (present/absent)

When to Use Naive Bayes:

  • Text classification (spam, sentiment)
  • Multi-class classification
  • Fast training and prediction
  • Works well with small data

Code Example:

from sklearn.naive_bayes import MultinomialNB
 
model = MultinomialNB()
model.fit(X_train, y_train)
predictions = model.predict(X_test)

πŸ“ Notes (Algorithms)

Ask yourself:
"What kind of problem does this solve best?"


📈 PHASE 4 — Evaluation & Learning Process (Days 22–29)

Day 22: Model Evaluation Metrics

How Do You Know If Your Model Is Good?

You need metrics to quantify performance. Different problems require different metrics.

Regression Metrics:

1. Mean Absolute Error (MAE)

\text{MAE} = \frac{1}{n} \sum_{i=1}^{n} |y_i - \hat{y}_i|

Interpretation: Average absolute difference between predictions and actual values. In the same units as target.

2. Mean Squared Error (MSE)

\text{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2

Penalizes larger errors more heavily. Units are squared.

3. Root Mean Squared Error (RMSE)

\text{RMSE} = \sqrt{\text{MSE}}

Interpretation: Back to original units. More interpretable than MSE.

4. RΒ² Score (Coefficient of Determination)

R^2 = 1 - \frac{\text{SS}_{\text{residual}}}{\text{SS}_{\text{total}}}

Interpretation: Proportion of variance explained. 1.0 = perfect, 0 = predicts mean always.

Classification Metrics:

1. Accuracy

\text{Accuracy} = \frac{\text{TP} + \text{TN}}{\text{TP} + \text{TN} + \text{FP} + \text{FN}}

Interpretation: Proportion of correct predictions. Good for balanced classes.

2. Precision

\text{Precision} = \frac{\text{TP}}{\text{TP} + \text{FP}}

Interpretation: Of all positive predictions, how many are correct? Important when false positives are costly.

3. Recall (Sensitivity)

\text{Recall} = \frac{\text{TP}}{\text{TP} + \text{FN}}

Interpretation: Of all actual positives, how many did we find? Important when false negatives are costly.

4. F1 Score

\text{F1} = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}

Interpretation: Harmonic mean of precision and recall. Good for imbalanced classes.

Choosing the Right Metric:

Problem Type | Common Metrics
Regression | MAE, RMSE, R²
Binary Classification (balanced) | Accuracy
Binary Classification (imbalanced) | Precision, Recall, F1
Multi-class | Macro/Micro F1
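
Most of these metrics are one-liners in scikit-learn; a minimal sketch with a tiny set of made-up labels and predictions:

from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

print("Accuracy: ", accuracy_score(y_true, y_pred))
print("Precision:", precision_score(y_true, y_pred))
print("Recall:   ", recall_score(y_true, y_pred))
print("F1:       ", f1_score(y_true, y_pred))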

Day 23: Confusion Matrix

What Is a Confusion Matrix?

A table that describes the performance of a classification model. It shows actual vs predicted classifications.

Binary Classification Example:

\begin{bmatrix} \text{TN} & \text{FP} \\ \text{FN} & \text{TP} \end{bmatrix}

Where:

  • TN = True Negative (Correctly predicted negative)
  • TP = True Positive (Correctly predicted positive)
  • FN = False Negative (Missed positive - Type II Error)
  • FP = False Positive (Wrongly predicted positive - Type I Error)

Example: Cancer Detection (100 patients)

\begin{bmatrix} 85 & 5 \\ 3 & 7 \end{bmatrix}

Calculations:

\text{Accuracy} = \frac{85 + 7}{100} = 92\%

\text{Precision} = \frac{7}{7 + 5} \approx 58\%

\text{Recall} = \frac{7}{7 + 3} = 70\%

Multi-Class Confusion Matrix:

For 3+ classes, you get an N×N matrix showing all class predictions.

Visualizing:

from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
 
cm = confusion_matrix(y_test, predictions)
disp = ConfusionMatrixDisplay(confusion_matrix=cm)
disp.plot()

Why It Matters:

  • Shows not just accuracy, but types of errors
  • Identifies which classes are confused with each other
  • Reveals class imbalance issues

Day 24: ROC Curve & AUC

ROC Curve

Receiver Operating Characteristic curve plots:

  • True Positive Rate (Recall) on Y-axis
  • False Positive Rate on X-axis

At various classification thresholds.

\text{TPR} = \frac{\text{TP}}{\text{TP} + \text{FN}} \quad (\text{Recall})

\text{FPR} = \frac{\text{FP}}{\text{FP} + \text{TN}} \quad (\text{False Alarm Rate})

AUC (Area Under the Curve)

The area under the ROC curve. Single number summary:

AUC | Meaning
0.5 | Random guessing (useless)
0.7-0.8 | Fair
0.8-0.9 | Good
0.9+ | Excellent


ROC Curve: Tradeoff Between TPR and FPR

The ROC curve shows the tradeoff between True Positive Rate and False Positive Rate at different thresholds. The diagonal line represents random guessing.

Why Use ROC-AUC?

  • Threshold-independent evaluation
  • Works well for imbalanced datasets
  • Shows tradeoff between TPR and FPR

When to Use:

  • Binary classification
  • Comparing multiple models
  • Imbalanced classification problems

Code Example:

import matplotlib.pyplot as plt
from sklearn.metrics import roc_auc_score, roc_curve
 
# Get probabilities for positive class
y_proba = model.predict_proba(X_test)[:, 1]
 
# AUC Score
auc = roc_auc_score(y_test, y_proba)
 
# Plot ROC Curve
fpr, tpr, thresholds = roc_curve(y_test, y_proba)
plt.plot(fpr, tpr, label=f"AUC = {auc:.3f}")
plt.plot([0, 1], [0, 1], linestyle='--', label="Random guessing")
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.legend()
plt.show()

Day 25: Cross-Validation

The Problem with Single Split

A single train-test split gives one performance estimate. But what if that split is unlucky?


K-Fold Cross-Validation

Watch how K-fold CV uses each fold as test set exactly once, giving a more reliable performance estimate.

K-Fold Cross-Validation Solution

Split data into K folds, train K times:

Fold | Split
1 | [Test] [Train] [Train] [Train] [Train]
2 | [Train] [Test] [Train] [Train] [Train]
3 | [Train] [Train] [Test] [Train] [Train]
4 | [Train] [Train] [Train] [Test] [Train]
5 | [Train] [Train] [Train] [Train] [Test]

Final Score = Average of K scores

Common Values of K:

  • K=5: Good balance of speed and reliability
  • K=10: Standard, very reliable
  • K = n (LOOCV): Leave-One-Out, very slow but uses maximum data

Stratified K-Fold

For classification, use stratified K-fold to maintain class distribution in each fold.

Nested Cross-Validation

For hyperparameter tuning:

Outer Loop: Evaluate model performance
  └─ Inner Loop: Tune hyperparameters

Code Example:

from sklearn.model_selection import cross_val_score
 
scores = cross_val_score(
    model, X, y, 
    cv=5,  # 5-fold
    scoring='accuracy'
)
 
print(f"Scores: {scores}")
print(f"Mean: {scores.mean():.3f}")
print(f"Std: {scores.std():.3f}")

Why It Matters:

  • More reliable performance estimate
  • Uses all data for training and testing
  • Reduces variance of performance estimate

Day 26: Feature Scaling

Why Scale Features?

Many algorithms are sensitive to feature scales:

  • Distance-based algorithms (KNN, SVM)
  • Gradient descent algorithms
  • Regularization

Without Scaling:

Age: 25-65 (small range)
Income: 20,000-200,000 (large range)

\text{Distance} = \sqrt{(\text{Age}_{\text{diff}})^2 + (\text{Income}_{\text{diff}})^2}

→ Income dominates completely

Types of Scaling:

1. Standardization (Z-score normalization)

z = \frac{x - \mu}{\sigma}

Result: Mean = 0, Std = 1

2. Min-Max Normalization

x_{\text{scaled}} = \frac{x - \min(x)}{\max(x) - \min(x)}

Result: Range [0, 1]

3. Robust Scaling

Uses median and IQR (outlier-resistant):

x_{\text{scaled}} = \frac{x - \text{median}(x)}{\text{IQR}}

Code Example:

from sklearn.preprocessing import StandardScaler, MinMaxScaler
 
# Standardization
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
 
# Min-Max Scaling
scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)

Important: Fit scaler on training data only, transform both train and test with same scaler.

When to Scale:

  • KNN, SVM, K-Means
  • Logistic Regression, Linear Regression (with regularization)
  • Neural Networks
  • PCA, clustering

When Scaling Is Not Needed:

  • Tree-based models (Decision Tree, Random Forest) — invariant to scale


Day 27: Gradient Descent

What Is Gradient Descent?

An optimization algorithm that minimizes a loss function by iteratively moving in the direction of steepest descent.

The Intuition

Imagine being blindfolded on a mountain and trying to reach the bottom by feeling the slope under your feet. Gradient descent is the mathematical version of this.

The Update Rule

\theta_{\text{new}} = \theta_{\text{old}} - \alpha \times \nabla

Where:

  • θ = model parameters (weights)
  • α = learning rate (step size)
  • ∇ = gradient (slope of loss function)

Visual Example:


Gradient Descent: Finding the Minimum

Watch how gradient descent iteratively moves toward the minimum of the loss function. The learning rate determines step size.

Types of Gradient Descent:

1. Batch Gradient Descent

  • Uses all data per iteration
  • Stable, but slow for large datasets

2. Stochastic Gradient Descent (SGD)

  • Uses one sample per iteration
  • Fast, noisy, can escape local minima

3. Mini-Batch Gradient Descent

  • Uses small batches (32, 64, 128 samples)
  • Best of both worlds — most common in practice

Learning Rate Matters:

  • Too small: Converges very slowly
  • Too large: Oscillates or diverges
  • Just right: Converges efficiently

Code Example (Concept):

# Simplified stochastic gradient descent for linear regression (no intercept term)
import numpy as np

def sgd_linear_regression(X, y, learning_rate=0.01, epochs=100):
    weights = np.zeros(X.shape[1])
    for epoch in range(epochs):
        for x_i, y_i in zip(X, y):           # one sample at a time (SGD)
            prediction = x_i @ weights       # model output for this sample
            error = prediction - y_i         # how far off the prediction is
            gradient = error * x_i           # gradient of the squared error w.r.t. weights
            weights -= learning_rate * gradient
    return weights

Day 28: Regularization

What Is Regularization?

A technique to prevent overfitting by adding a penalty to the loss function that discourages complex models.

The Bias-Variance Tradeoff Again

Regularization intentionally increases bias to reduce variance, finding a better total error.

Types of Regularization:

1. L1 Regularization (Lasso)

Adds absolute value of weights to loss:

Loss = MSE + λ × Σ|weights|

Effect: Pushes some weights to exactly zero (feature selection)

2. L2 Regularization (Ridge)

Adds squared weights to loss:

Loss = MSE + λ × Σ(weights)²

Effect: Shrinks weights toward zero (but rarely to exactly zero)

3. Elastic Net (Combination)

\text{Loss} = \text{MSE} + \lambda_1 \sum |w_i| + \lambda_2 \sum w_i^2


Regularization: Controlling Model Complexity

Adjust lambda (regularization strength) to see how it affects the model complexity. Higher lambda = simpler model = less overfitting.

Lambda (λ) Controls Strength:

  • λ = 0: No regularization (risk of overfitting)
  • λ = large: Very strong regularization (risk of underfitting)

Code Example:

from sklearn.linear_model import Ridge, Lasso, ElasticNet
 
# Ridge (L2)
ridge = Ridge(alpha=1.0)
ridge.fit(X_train, y_train)
 
# Lasso (L1)
lasso = Lasso(alpha=1.0)
lasso.fit(X_train, y_train)
 
# Elastic Net
elastic = ElasticNet(alpha=1.0, l1_ratio=0.5)
elastic.fit(X_train, y_train)

Day 29: Hyperparameter Tuning

What Are Hyperparameters?

Parameters set BEFORE training starts (not learned from data):

  • Learning rate
  • Number of trees in Random Forest
  • K in KNN
  • Regularization strength
  • Max depth of Decision Tree

Why Tuning Matters

Small changes can dramatically affect performance:

Default Random Forest:     82% accuracy
Tuned Random Forest:        89% accuracy

Tuning Methods:

1. Grid Search

Try all combinations in a predefined grid.

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
 
param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [5, 10, 15],
    'min_samples_split': [2, 5, 10]
}
 
grid_search = GridSearchCV(
    RandomForestClassifier(),
    param_grid,
    cv=5,
    scoring='accuracy'
)
grid_search.fit(X_train, y_train)

2. Random Search

Randomly sample hyperparameters (often more efficient).

from sklearn.model_selection import RandomizedSearchCV
 
random_search = RandomizedSearchCV(
    RandomForestClassifier(),
    param_distributions=param_grid,
    n_iter=30,
    cv=5
)
random_search.fit(X_train, y_train)

3. Bayesian Optimization (Advanced)

Uses probability models to smartly select promising hyperparameters.

Important Rules:

  • Tune on validation set, not test set
  • Use cross-validation for small datasets
  • Don't tune too many parameters at once

πŸ“ Notes (Evaluation)

This phase is what separates
"tutorial ML" from "real ML".


🧩 PHASE 5 — The Big Picture (Day 30)

Day 30: End-to-End Machine Learning Workflow

Putting it all together—this is how real ML projects work:

The Complete Pipeline:

  1. Define Problem — What are we predicting? What data do we need?

  2. Collect Data — Get, scrape, or buy data relevant to the problem

  3. Explore Data (EDA) — Understand distributions, correlations, patterns

  4. Preprocess Data — Clean, handle missing values, encode categories, scale

  5. Split Data — Train/validation/test split

  6. Choose Baseline — Simple model to beat (linear regression, majority class)

  7. Try Multiple Models — Compare 3-5 different algorithms

  8. Evaluate & Tune — Use validation set, cross-validation, hyperparameter tuning

  9. Final Evaluation — Evaluate on test set ONCE

  10. Deploy & Monitor — Put in production, track performance over time
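
A compressed sketch of steps 4–9 on a small built-in dataset (the dataset and candidate models are just placeholders for whatever your problem needs):

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)

candidates = {
    "logistic": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    "random_forest": RandomForestClassifier(n_estimators=200, random_state=42),
}

# Compare models with cross-validation on the training data only
for name, model in candidates.items():
    scores = cross_val_score(model, X_train, y_train, cv=5, scoring="accuracy")
    print(name, scores.mean())

# Refit the chosen model (whichever scored best) and evaluate ONCE on the held-out test set
best = candidates["random_forest"].fit(X_train, y_train)
print("test accuracy:", best.score(X_test, y_test))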

Key Insight:

Stage | Time
Data Collection & Cleaning | 60%
Model Building & Tuning | 20%
Evaluation & Deployment | 20%

Real ML is 80% data work.

Final Project Idea:

Build a complete ML pipeline end-to-end:

  1. Collect data (Kaggle dataset or scrape)
  2. Clean and preprocess
  3. Try 3+ models
  4. Tune the best one
  5. Evaluate on held-out test set
  6. Write a short report explaining your choices

πŸ“ Notes (Workflow)

ML is a process, not a script.


πŸ› οΈ PROJECT IDEAS

Pro Tip: Building projects is where theory becomes real understanding

ML basics, features, labels, train/test split

Project | Difficulty | Time
Predict house prices using fake/simple data | ⭐ | 2-3 hours
Classify students as pass/fail using manual rules | ⭐ | 1-2 hours
Manually label a small dataset (build intuition) | ⭐ | 1 hour
Build a simple guessing game based on rules | ⭐ | 1 hour

Skills Practiced: Understanding inputs → outputs, thinking in features, manual pattern recognition

Preprocessing, correlation, underfitting/overfitting

Project | Difficulty | Time
Clean a messy CSV (handle missing values, outliers) | ⭐⭐ | 3-4 hours
Visualize correlations with matplotlib/seaborn | ⭐⭐ | 2-3 hours
Create synthetic data and identify underfitting vs overfitting | ⭐⭐ | 3-4 hours
Exploratory data analysis (EDA) on a real dataset | ⭐⭐ | 4-5 hours

Skills Practiced: Data cleaning, visualization, statistical thinking, model behavior recognition

Linear Regression, KNN, Decision Trees, Random Forest, SVM, Naive Bayes

Project | Difficulty | Time
Linear Regression from scratch (NumPy only) | ⭐⭐⭐ | 4-5 hours
Spam classifier using Naive Bayes | ⭐⭐⭐ | 4-5 hours
Build a Decision Tree classifier | ⭐⭐⭐ | 3-4 hours
Compare KNN vs Logistic Regression on same dataset | ⭐⭐⭐ | 4-5 hours
Titanic survival prediction with Random Forest | ⭐⭐⭐ | 5-6 hours

Skills Practiced: Algorithm implementation, model comparison, feature interpretation

Metrics, cross-validation, regularization, hyperparameter tuning

Project | Difficulty | Time
Evaluate model with Confusion Matrix, Precision, Recall, F1 | ⭐⭐ | 2-3 hours
Plot and interpret ROC-AUC curves | ⭐⭐⭐ | 3-4 hours
Implement 5-fold cross-validation from scratch | ⭐⭐⭐ | 4-5 hours
Feature scaling experiment (with/without StandardScaler) | ⭐⭐ | 2-3 hours
Grid search vs Random search comparison | ⭐⭐⭐ | 4-5 hours

Skills Practiced: Model evaluation, robust validation, performance optimization

Apply the complete ML workflow to a real problem

Prediction

House prices, stock prices, demand forecasting

Focus: Regression, feature engineering

Classification

Customer churn, fraud detection, disease diagnosis

Focus: Metrics, class imbalance

Clustering

Customer segmentation, document grouping

Focus: Unsupervised learning

Recommendation

Movie/product recommendations

Focus: Collaborative filtering

Capstone Requirements:
  1. Define the problem clearly
  2. Load and explore data (EDA)
  3. Preprocess and engineer features
  4. Train multiple models
  5. Evaluate rigorously (metrics + validation)
  6. Document your reasoning and choices

Focus on reasoning, not accuracy. A well-reasoned wrong answer is more valuable than a lucky correct one.


🎯 How to Use This Page

  • One day = one concept
  • Tick only when you understand
  • Revisit notes weekly
  • Build at least 1 small project

🧠 Final Reminder

You do NOT need to know everything.

If you can do these three things, you already know Machine Learning:

✅ Understand the workflow
✅ Choose the right model
✅ Evaluate properly

You already know Machine Learning. 🚀
