ML in 30 Days: A Clear, Beginner-Friendly Roadmap to Machine Learning

A practical 30-day roadmap to understand Machine Learning from first principles — covering data, models, evaluation, and real-world workflows without unnecessary math or hype.


ML in 30 Days 🚀

A clear, beginner-friendly roadmap to understand Machine Learning end to end

This page follows the exact ML in 30 Days Instagram series. Use it to track progress, revise concepts, and move from theory → practice.


🧠 PHASE 1 — Foundations (Days 1–7)

Day 1: What is Machine Learning?

Traditional Programming vs Machine Learning

In traditional programming, you write explicit rules: "If income > 50000 AND age > 30, then approve loan." The computer follows your rules exactly.

In machine learning, you show the computer examples of inputs and outputs, and it learns the patterns itself. Instead of saying "what to do," you say "here's what happened before—figure out why."

The Core Idea

Machine Learning is teaching computers to learn patterns from data rather than being explicitly programmed for each decision. You provide data (experiences) and the algorithm develops its own understanding (knowledge).

Real-World Examples

  • Email spam filters learn from millions of emails
  • Netflix recommends shows based on your viewing history
  • Banks detect fraud by learning normal transaction patterns
  • Voice assistants recognize speech by learning from thousands of voice samples

Key Insight: ML shines when rules are too complex to write manually, or when the rules change frequently and you'd need to constantly rewrite code.


Day 2: Types of Machine Learning

1. Supervised Learning — Learning with a Teacher

You provide labeled examples: both the input AND the correct output. The model learns to predict outputs from inputs.

Examples:

  • Email → Spam/Not Spam (labeled by you)
  • House size → Price (historical sales data)
  • Image → Cat/Dog (manually labeled photos)

2. Unsupervised Learning — Learning without a Teacher

You provide only inputs. The model finds hidden patterns or structures on its own—no labels provided.

Examples:

  • Group customers by purchasing behavior (no pre-defined groups)
  • Compress data by finding common patterns
  • Detect anomalies (things that don't fit the pattern)

3. Reinforcement Learning — Learning from Experience

An agent takes actions in an environment and learns from rewards/punishments. It discovers through trial and error what works best.

Examples:

  • AlphaGo learning to play Go
  • Robots learning to walk
  • Game AI learning strategies

Quick Comparison

Type | Data | Goal | Analogy
Supervised | Labeled | Predict | Learning with an answer key
Unsupervised | Unlabeled | Discover patterns | Finding groups/clusters
Reinforcement | Actions + Rewards | Maximize reward | Learning from consequences

Day 3: Supervised vs Unsupervised Learning

When to Use Supervised Learning

Use supervised learning when you have:

  • Historical data with known outcomes
  • A clear target variable you want to predict
  • Enough labeled examples to train on

Two Categories of Supervised Learning:

  1. Classification — Predict categories/classes

    • Email: Spam or Not Spam
    • Tumor: Malignant or Benign
    • Customer: Will Buy or Won't Buy
    • Output is discrete (finite set of options)
  2. Regression — Predict continuous values

    • House price prediction
    • Temperature forecasting
    • Sales estimation
    • Output is a number on a continuous scale

When to Use Unsupervised Learning

Use unsupervised learning when you:

  • Don't have labels/outcomes available
  • Want to explore data and discover patterns
  • Need to segment customers/users into natural groups
  • Want to reduce data complexity for visualization

Common Unsupervised Techniques:

  1. Clustering — Group similar data points

    • Customer segmentation
    • Image compression (group similar pixels)
  2. Dimensionality Reduction — Simplify without losing information

    • PCA (Principal Component Analysis)
    • Make high-dimensional data visualizable

The Critical Difference: Supervised = predicting known categories/values. Unsupervised = discovering unknown structure.


Day 4: Features & Labels

What is a Feature?

A feature (also called a variable, attribute, or predictor) is a measurable property or characteristic of the phenomenon you're observing.

Example: Predicting House Prices

Feature | Type | Description
Square feet | Numeric | Size of the house
Number of bedrooms | Numeric | Count of rooms
Location | Categorical | Neighborhood/zip code
Age of house | Numeric | Years since built
Number of bathrooms | Numeric | Count of bathrooms

What is a Label?

The label (or target/ground truth) is the value you're trying to predict. It's the "answer" for supervised learning.

Continuing the House Example:

  • Label: Sale price ($350,000, $420,000, etc.)

Feature Engineering — Crafting Good Features

The quality of your features often matters more than the algorithm you choose.

Good Features:

  • Relevant to the prediction task
  • Reliable (consistent, not noisy)
  • Available for new data you'll predict on
  • Understandable

Feature Engineering Examples:

  • Instead of raw dates, use "days since event"
  • Instead of full address, use "distance to downtown"
  • Combine related features (e.g., bedrooms + bathrooms = total rooms)
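
A minimal pandas sketch of the first and third ideas above, assuming a hypothetical DataFrame with sale_date, bedrooms, and bathrooms columns (the column names and reference date are made up for illustration):

import pandas as pd

# Hypothetical raw data
df = pd.DataFrame({
    'sale_date': pd.to_datetime(['2023-01-10', '2023-06-01']),
    'bedrooms': [3, 4],
    'bathrooms': [2, 3],
})

# "Days since event" instead of a raw date
df['days_since_sale'] = (pd.Timestamp('2024-01-01') - df['sale_date']).dt.days

# Combine related features into one
df['total_rooms'] = df['bedrooms'] + df['bathrooms']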

Feature Types:

  1. Numeric/Continuous — Can take any value in a range

    • Price, temperature, age
  2. Categorical/Discrete — Finite set of categories

    • Color (red, blue, green), Yes/No, Rating (1-5)
  3. Ordinal — Categories with meaningful order

    • Education level, satisfaction rating

Day 5: Training Data vs Test Data

The Core Problem

We train a model on some data, but we care about how it performs on NEW, UNSEEN data. This is called generalization.

The Solution: Train-Test Split

Split your data into:

  • Training Set (typically 70-80%) — Used to teach the model
  • Test Set (typically 20-30%) — Used to evaluate performance

Why This Matters

Training Data → Model learns patterns
Test Data     → Model is evaluated on patterns it has NEVER seen

Critical Rule: Never touch your test set during training. Using test data for training = cheating.

Visual Example with 100 data points:

All Data (100 samples)
├── Training Set (80 samples) ──→ Model learns from this
└── Test Set (20 samples) ──────→ Only used ONCE at the end

The Train-Test Split Code:

from sklearn.model_selection import train_test_split
 
X_train, X_test, y_train, y_test = train_test_split(
    features, labels, test_size=0.2, random_state=42
)

random_state ensures reproducibility—you get the same split every time.

Validation Set (Bonus)

For complex models, you often use three splits:

  • Training Set (60-70%) — Learn parameters
  • Validation Set (10-20%) — Tune hyperparameters
  • Test Set (10-20%) — Final evaluation
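
One common way to get three splits is to call train_test_split twice; a minimal sketch reusing the features and labels variables from above (the 60/20/20 proportions are just one example):

from sklearn.model_selection import train_test_split

# First carve out the final test set (20% of all data)
X_temp, X_test, y_temp, y_test = train_test_split(features, labels, test_size=0.2, random_state=42)

# Then split the remainder into train (60% overall) and validation (20% overall)
X_train, X_val, y_train, y_val = train_test_split(X_temp, y_temp, test_size=0.25, random_state=42)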

Day 6: What is a Model?

Model = Mathematical Representation of Patterns

A machine learning model is a mathematical function that takes inputs (features) and produces outputs (predictions). It captures the relationship between X (features) and y (label).

Simple Example: Linear Regression

y = mx + b

Where:

  • m = slope (weight/coefficient)
  • b = intercept (bias)
  • x = feature (square footage)
  • y = prediction (house price)

The model "learns" m and b from training data.

Models are Templates, Not Rules

Think of models as flexible templates that mold themselves to fit your data:

  • Linear Regression = straight line template
  • Decision Tree = flowchart template
  • Neural Network = complex pattern-matching template

The Learning Process

  1. Start with random/fresh parameters
  2. Make predictions on training data
  3. Calculate prediction error (how wrong?)
  4. Adjust parameters to reduce error
  5. Repeat until error is minimized

Key Insight: The model doesn't "know" anything about houses. It simply finds the mathematical relationship that best maps square footage β†’ price based on examples.

Analogy: Learning to Catch

  • Traditional programming: Someone tells you "move your hand to coordinates (x,y)"
  • Machine learning: You throw 100 balls, miss most, adjust your movement each time, eventually get better

The model is your "catching strategy"—learned from experience (data), not explicitly programmed.


Day 7: Bias in Data

What is Data Bias?

Bias in ML is systematic error that skews results in a particular direction. It comes from the data, not the algorithm.

Common Types of Data Bias:

1. Selection Bias

  • Training data doesn't represent real-world distribution
  • Example: Training a face detector only on photos of young people, then failing on elderly faces

2. Label Bias

  • Human labels are inconsistent or prejudiced
  • Example: Historical hiring data reflecting past discrimination

3. Confirmation Bias

  • Collect/interpret data to confirm existing beliefs
  • Example: Only tracking positive customer reviews

4. Survivorship Bias

  • Only analyzing "successful" cases, ignoring failures
  • Example: Studying only successful startups to predict success

Real-World Consequence Examples:

  • COMPAS Recidivism Algorithm: Higher false positive rate for Black defendants
  • Amazon Hiring Tool: Biased against women (trained on 10 years of resumes)
  • Facial Recognition: Poor performance on darker skin tones (underrepresented in training data)

The Fix:

  1. Audit your data — Who/what is represented?
  2. Diversify data collection — Ensure broad representation
  3. Test on multiple groups — Check performance equity
  4. Acknowledge limitations — Be transparent about biases

Key Insight: A model is only as good as the data it's trained on. "Garbage in, garbage out."

πŸ“ Notes (Foundations)

Write concepts in your own words.
If you can explain it simply, you understand it.


📊 PHASE 2 — Data & Core Concepts (Days 8–14)

Day 8: Data Preprocessing

Why Preprocess Data?

Raw data is messy. Real-world data has:

  • Missing values
  • Inconsistent formats
  • Outliers
  • Duplicate entries
  • Irrelevant columns

Preprocessing prepares data for your model to learn effectively.

Common Preprocessing Steps:

1. Handling Missing Values

Options:

  • Remove rows with missing values (if few)
  • Fill with mean/median/mode (simple imputation)
  • Use advanced techniques (KNN imputation, iterative imputation)
# Option 1: Drop rows with missing values
df = df.dropna()
 
# Option 2: Fill with mean
df['column'] = df['column'].fillna(df['column'].mean())

2. Encoding Categorical Variables

Convert text categories to numbers:

  • Label Encoding: cat → 0, dog → 1, bird → 2
  • One-Hot Encoding: Creates binary columns
from sklearn.preprocessing import LabelEncoder
 
encoder = LabelEncoder()
df['category_encoded'] = encoder.fit_transform(df['category'])
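
For one-hot encoding, pandas' get_dummies is a common shortcut; a minimal sketch on the same hypothetical 'category' column:

import pandas as pd

# One-Hot Encoding: one binary column per category value
df = pd.get_dummies(df, columns=['category'], prefix='category')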

3. Handling Duplicates

df = df.drop_duplicates()

4. Fixing Inconsistent Data

# Standardize text
df['city'] = df['city'].str.lower().str.strip()
 
# Fix typos in categories
df['country'] = df['country'].replace({'usa': 'United States'})

The Preprocessing Pipeline:

Raw Data → Clean → Transform → Feature Engineer → Model-Ready Data

Key Insight: Data scientists spend 60-80% of their time on data preprocessing. It's not glamorous, but it makes or breaks your model.


Day 9: Train–Test Split

Recap and Deep Dive

We discussed train-test split in Day 5. Now let's understand it more deeply.

Why 80/20 Split?

  • Too little training data → model can't learn patterns
  • Too little test data → unreliable performance estimate
  • 80/20 is a good starting point (also common: 70/30, 75/25)

The Stratified Split

When classes are imbalanced, use stratified sampling:

train_test_split(X, y, test_size=0.2, stratify=y)

This ensures the train and test sets have the same class distribution.

The Data Leakage Problem

CRITICAL: Never let information from test data influence training.

Bad Examples:

  • Computing mean on entire dataset before splitting
  • Normalizing using training + test combined
  • Feature engineering using test data knowledge

Correct Approach:

# Split FIRST
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
 
# THEN compute statistics on training data only
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)  # Uses training stats!

Practical Splitting Strategy:

1. Hold out test set (never touch until final evaluation)
2. Use validation set for hyperparameter tuning
3. Train on remaining data
4. Report test set performance once

Day 10: Underfitting

What is Underfitting?

Underfitting is when your model is too simple to capture patterns in the data. It performs poorly on both training AND test data.

The Model is "Underpowered"

Like trying to fit a straight line to curved data—the model lacks the capacity to learn the true pattern.

Symptoms of Underfitting:

  • High training error
  • High test error
  • Model ignores important features
  • Patterns in data are obvious but model misses them

Visual Example:

The data points form a curve, but the model draws a straight line: a linear model cannot capture the curved relationship in the data.

Causes of Underfitting:

  1. Model too simple — Linear model for non-linear data
  2. Not enough features — Missing important predictors
  3. Too much regularization — Penalizing complexity too much
  4. Insufficient training — Stopped too early

How to Fix Underfitting:

  • Use a more complex model
  • Add more relevant features
  • Reduce regularization
  • Train longer (more epochs/iterations)

Example: Underfitting vs Good Fit vs Overfitting

Model | Equation | Description
Underfitting | y = 2x | Straight line on curved data
Good Fit | y = 2x + 0.5x² | Captures the curve
Overfitting | Complex polynomial | Wiggly line touching every point
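
A minimal sketch of the comparison in the table above, using synthetic curved data (the curve, noise level, and polynomial degree are just illustrative choices):

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
X = np.linspace(0, 4, 100).reshape(-1, 1)
y = 2 * X.ravel() + 0.5 * X.ravel() ** 2 + rng.normal(0, 0.3, 100)  # curved data + noise

# Underfit: straight line on curved data
linear = LinearRegression().fit(X, y)

# Better fit: a degree-2 polynomial can capture the curve
poly = make_pipeline(PolynomialFeatures(degree=2), LinearRegression()).fit(X, y)

print("linear MSE:", mean_squared_error(y, linear.predict(X)))
print("poly MSE:  ", mean_squared_error(y, poly.predict(X)))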

Key Insight: Underfitting is the model saying "I can't learn this." Overfitting is the model saying "I memorized this, but I don't understand."


Day 11: Mean, Median & Standard Deviation

These are fundamental statistics for understanding data distribution.

Mean (Average)

Sum of all values divided by count.

\text{mean} = \frac{x_1 + x_2 + \dots + x_n}{n}

Example: Test scores: 70, 80, 90, 70, 90

\text{mean} = \frac{70 + 80 + 90 + 70 + 90}{5} = 80

Median (Middle Value)

The value that separates the higher half from the lower half. Sort and find the middle.

Example: Test scores: 70, 80, 90, 70, 90

Sorted: 70, 70, 80, 90, 90

\text{Median} = 80

Mean vs Median:

  • Mean is sensitive to outliers
  • Median is robust to outliers

Example: Incomes: $30k, $40k, $50k, $60k, $1M

  • Mean: $236k (misleading—most people earn far less)
  • Median: $50k (more representative)

Standard Deviation (SD)

Measures how spread out values are from the mean.

Low SD: Values cluster near the mean
High SD: Values are widely spread

Example:

  • Class A scores: 78, 79, 80, 81, 82 → SD ≈ 1.6
  • Class B scores: 50, 70, 80, 90, 110 → SD ≈ 22
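
These can be checked directly with NumPy; a quick sketch using the test scores from the earlier example:

import numpy as np

scores = np.array([70, 80, 90, 70, 90])
print(np.mean(scores))         # 80.0
print(np.median(scores))       # 80.0
print(np.std(scores, ddof=1))  # sample standard deviation = 10.0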

Why These Matter in ML:

  • Feature scaling (normalize to similar ranges)
  • Outlier detection (values far from mean/median)
  • Understanding data distribution
  • Choosing appropriate models

Day 12: Correlation

What is Correlation?

Correlation measures how two variables change together. It tells you if one variable can predict another.

Correlation Coefficient (r)

Ranges from -1 to +1:

Value | Meaning
+1.0 | Perfect positive correlation
+0.7 | Strong positive correlation
+0.3 | Weak positive correlation
0.0 | No correlation
-0.3 | Weak negative correlation
-0.7 | Strong negative correlation
-1.0 | Perfect negative correlation

Positive Correlation

As one variable increases, the other increases. Example: Height ↑ → Weight ↑

Negative Correlation

As one variable increases, the other decreases. Example: Hours of exercise ↑ → Body fat ↓

No Correlation

Variables move independently. Example: Shoe size ↔ IQ score

Correlation ≠ Causation

Just because two things correlate doesn't mean one causes the other!

Spurious Correlation Example: Ice cream sales and shark attacks both increase in summer—but ice cream doesn't cause shark attacks.

Visualizing Correlation:

import seaborn as sns
import matplotlib.pyplot as plt
 
# Correlation matrix heatmap
corr_matrix = df.corr()
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm')

Feature Selection with Correlation:

  • Remove highly correlated features (redundancy)
  • Keep features strongly correlated with target
  • Remove features uncorrelated with target
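
A minimal sketch of these ideas, assuming a hypothetical DataFrame df whose label column is named 'target' (the 0.9 and 0.05 thresholds are arbitrary examples, not fixed rules):

import pandas as pd

corr = df.corr(numeric_only=True)

# Correlation of each feature with the target
target_corr = corr['target'].drop('target').abs()

# Keep features at least weakly correlated with the target
selected = target_corr[target_corr > 0.05].index.tolist()

# Flag highly correlated (redundant) feature pairs among the survivors
redundant = [
    (a, b)
    for a in selected for b in selected
    if a < b and abs(corr.loc[a, b]) > 0.9
]
print(selected, redundant)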

Day 13: Overfitting

What is Overfitting?

Overfitting is when your model learns the training data too well—including its noise and quirks. It memorizes rather than generalizes.

The Model is "Overly Complex"

It captures random fluctuations in training data that aren't real patterns.

Symptoms of Overfitting:

  • Very low training error
  • High test error
  • Model works perfectly on training data, poorly on new data

Visual Example:

Overfitting: a complex model fits every training point, including noise, memorizing quirks instead of learning the underlying pattern.

Causes of Overfitting:

  1. Model too complex — Deep tree on small data
  2. Too many features — More predictors than samples
  3. Training too long — Continued learning after patterns are found
  4. Insufficient data — Not enough examples to learn true patterns

How to Fix Overfitting:

  • Regularization — Penalize complexity
  • Get more data — More examples = better generalization
  • Feature selection — Remove irrelevant features
  • Cross-validation — Better performance estimation
  • Simplify model — Reduce model complexity
  • Early stopping — Stop training before memorizing
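
One way to see overfitting directly is to compare train and test accuracy as model complexity grows; a minimal sketch using a decision tree of increasing depth on synthetic data (the dataset and depths are arbitrary choices):

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

for depth in [2, 5, None]:  # None = grow until leaves are pure (most complex)
    tree = DecisionTreeClassifier(max_depth=depth, random_state=42).fit(X_train, y_train)
    print(depth, tree.score(X_train, y_train), tree.score(X_test, y_test))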

The Bias-Variance Tradeoff Visualized:

  • Underfitting: too simple, misses patterns (straight line misses the curved pattern)
  • Good Fit: just right, captures the trend (smooth curve captures the pattern)
  • Overfitting: too complex, memorizes noise (jagged line follows every point)


Day 14: Bias vs Variance

Understanding the Two Sources of Error

All prediction errors can be decomposed into:

\text{Total Error} = \text{Bias}^2 + \text{Variance} + \text{Irreducible Error}


Bias-Variance Tradeoff

Watch how bias and variance change with model complexity. The optimal model complexity minimizes total error.

Bias (Underfitting)

Bias is error from overly simplistic assumptions. The model is "biased" toward missing important patterns.

High Bias Symptoms:

  • Misses relevant relationships
  • Underestimates/overestimates systematically
  • Performs poorly on all data

Variance (Overfitting)

Variance is error from sensitivity to noise in training data. The model "varies" too much with different training sets.

High Variance Symptoms:

  • Captures random noise
  • Performs differently on different training sets
  • Training error is low, test error is high

The Bias-Variance Tradeoff

Situation | Problem | Solution
High Bias + Low Variance | Underfitting (consistent but wrong) | More complex model
Low Bias + High Variance | Overfitting (flexible but unstable) | Regularization, more data
Low Bias + Low Variance | Good Fit (right balance) | Optimal!

Practical Implications:

Situation | Problem | Solution
High train error, high test error | Underfitting | More complex model, more features
Low train error, high test error | Overfitting | Regularization, more data, simpler model
High train error, low test error | Rare (possible data leakage) | Check data pipeline

Key Insight: You cannot simultaneously minimize both bias and variance perfectly. The goal is to find the sweet spot where total error is minimized.

πŸ“ Notes (Data & Concepts)

Focus on why things break, not formulas.


🤖 PHASE 3 — ML Algorithms (Days 15–21)

Day 15: Linear Regression

What It Does

Linear Regression finds the best-fitting straight line through your data points. It predicts a continuous value based on input features.

The Equation

y = mx + b

Where:

  • y = predicted value (target)
  • x = input feature
  • m = slope (weight/coefficient)
  • b = intercept (bias)

Multiple Linear Regression (multiple features):

y = b_0 + b_1 x_1 + b_2 x_2 + \dots + b_n x_n


Linear Regression: Finding the Best Fit Line

The model learns to find the best-fit line through data points by minimizing the sum of squared errors.

How It Works

The algorithm finds the line (or hyperplane) that minimizes the sum of squared errors (MSE).

\text{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2

Where \hat{y}_i is the predicted value.

Simple Example: Predicting Ice Cream Sales

Temperature (°C) | Sales ($)
20 | 150
25 | 200
30 | 280
35 | 350

Model learns (approximately): Sales = 10 × Temperature − 50

When to Use Linear Regression:

  • Target is continuous (not categorical)
  • Features are linearly related to target
  • You need interpretability
  • Baseline model for comparison

When NOT to Use:

  • Relationships are non-linear
  • Target has complex interactions
  • Outliers heavily influence results

Code Example:

from sklearn.linear_model import LinearRegression
 
model = LinearRegression()
model.fit(X_train, y_train)
predictions = model.predict(X_test)
 
# Interpret coefficients
print(f"Coefficient: {model.coef_[0]}")
print(f"Intercept: {model.intercept_}")

Day 16: Logistic Regression

What It Does Despite the Name

Despite "regression" in the name, this is a classification algorithm. It predicts the probability of belonging to a class.

The Sigmoid Function

Logistic regression uses the sigmoid function to squash outputs between 0 and 1:

P(\text{class}) = \frac{1}{1 + e^{-z}}

Where z is the linear combination of features.


Sigmoid Function: Squashing Values to [0, 1]

The sigmoid function maps any real number to a probability between 0 and 1.

Binary Classification Example: Spam Detection

Feature | Value
Has word "free" | 1
Number of exclamation marks | 3
From unknown sender | 1

z = (0.5 \times 1) + (0.3 \times 3) + (0.8 \times 1) = 2.2

P(\text{spam}) = \frac{1}{1 + e^{-2.2}} \approx 0.90

Prediction: SPAM (probability > 0.5 threshold)
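
This hand calculation can be checked in a couple of lines (the weights are the made-up ones from the example above):

import numpy as np

z = 0.5 * 1 + 0.3 * 3 + 0.8 * 1   # 2.2
p_spam = 1 / (1 + np.exp(-z))     # sigmoid of z, roughly 0.90
print(z, p_spam)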

Decision Boundary

Typically, we use 0.5 as the threshold:

  • P > 0.5 → Class 1
  • P < 0.5 → Class 0

Multiclass Classification

Logistic Regression can be extended to 3+ classes:

  • One-vs-Rest (OvR): Train one classifier per class
  • Multinomial: Directly model class probabilities

When to Use Logistic Regression:

  • Binary classification problems
  • Need probability estimates
  • Interpretable model (coefficients show feature importance)
  • Well-separated classes

Code Example:

from sklearn.linear_model import LogisticRegression
 
model = LogisticRegression()
model.fit(X_train, y_train)
probabilities = model.predict_proba(X_test)
predictions = model.predict(X_test)

Day 17: K-Nearest Neighbors (KNN)

The Intuition

KNN makes predictions based on similarity. "Tell me who your neighbors are, and I'll tell you who you are."

How It Works

  1. Choose K (number of neighbors)
  2. For a new data point, find the K closest points
  3. Vote: The majority class among neighbors wins


KNN: Finding Nearest Neighbors

The new point (green circle) is classified based on the majority class of its K nearest neighbors.

Distance Metrics

The most common is Euclidean distance:

d = \sqrt{(x_1 - x_2)^2 + (y_1 - y_2)^2}

KNN Classification: K=3 vs K=5

With K=3: 2 A's (blue), 1 B (red) → Predict A. With K=5: 2 A's, 3 B's → Predict B.

Choosing K

  • Small K: Sensitive to noise, may overfit
  • Large K: Smoother boundaries, may underfit
  • Common: Try K = 3, 5, 7, sqrt(n)

When to Use KNN:

  • Small to medium datasets
  • Quick baseline model
  • No training phase (lazy learner)
  • Multi-class classification

When NOT to Use:

  • Large datasets (slow prediction)
  • High-dimensional data
  • Features on very different scales

Code Example:

from sklearn.neighbors import KNeighborsClassifier
 
model = KNeighborsClassifier(n_neighbors=5)
model.fit(X_train, y_train)
predictions = model.predict(X_test)

Day 18: Decision Trees

What It Is

A flowchart-like structure where each node represents a feature test, each branch is an outcome, and each leaf is a prediction.


Decision Tree: Splitting Feature Space

A decision tree splits the feature space into rectangular regions, each assigned to a class.

Key Concepts:

Information Gain (ID3/C4.5): Measures reduction in uncertainty after a split

IG = H(\text{parent}) - \sum \frac{|S_i|}{|S|} H(S_i)

Gini Impurity (CART): Measures probability of misclassification

\text{Gini} = \sum p_i (1 - p_i) = 1 - \sum p_i^2
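
A small sketch of the Gini calculation for a list of class labels (gini_impurity is a hypothetical helper written here for illustration, not a scikit-learn function):

import numpy as np

def gini_impurity(labels):
    # Gini = 1 - sum(p_i^2) over the class proportions p_i
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1 - np.sum(p ** 2)

print(gini_impurity([0, 0, 1, 1]))  # 0.5, maximally mixed for two classes
print(gini_impurity([0, 0, 0, 0]))  # 0.0, a pure node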

Pruning: Removing branches that have little predictive power to prevent overfitting

When to Use Decision Trees:

  • Interpretability is important
  • Non-linear relationships
  • Mixed feature types (numeric + categorical)
  • Fast training and prediction

Limitations:

  • Prone to overfitting
  • Sensitive to small data changes
  • Can create biased trees with imbalanced data

Code Example:

from sklearn.tree import DecisionTreeClassifier
 
model = DecisionTreeClassifier(max_depth=5, random_state=42)
model.fit(X_train, y_train)
predictions = model.predict(X_test)
 
# Visualize
from sklearn.tree import plot_tree
plot_tree(model, feature_names=features, filled=True)

Day 19: Random Forest

What It Is

An ensemble method that combines multiple decision trees to create a more powerful and robust model.

The Wisdom of Crowds

  • Individual trees: Good, but can make mistakes
  • Forest of trees: Mistakes tend to cancel out

How It Works (Bagging):

  1. Bootstrap: Sample data with replacement for each tree
  2. Feature Randomness: Each tree sees random subset of features
  3. Aggregate: Majority vote (classification) or average (regression)

Visual Concept:

        Tree 1 ─┐
        Tree 2 ─┼───→ Final Prediction (Vote/Average)
        Tree 3 ─┤
        Tree 4 ─┘

Why Random Forest Works:

  • Reduces overfitting: Individual trees overfit, but averaging reduces it
  • Handles non-linearity: Trees capture complex patterns
  • Robust to outliers: Individual trees may be affected, forest is not
  • Feature importance: Shows which features matter most

When to Use Random Forest:

  • Most classification/regression tasks
  • Good default choice
  • Need robust predictions
  • Feature importance analysis

Code Example:

from sklearn.ensemble import RandomForestClassifier
 
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)
predictions = model.predict(X_test)
 
# Feature importance
importances = model.feature_importances_

Day 20: Support Vector Machines (SVM)

What It Is

SVM finds the optimal hyperplane that best separates classes in the feature space.


SVM: Finding the Optimal Hyperplane

SVM finds the hyperplane that maximizes the margin between classes. The points closest to the boundary are called support vectors.

The Key Idea: Maximize the Margin

SVM doesn't just find any separating line—it finds the line with the largest margin between classes.

The Kernel Trick

SVM can separate non-linear data by mapping it to higher dimensions:

The Kernel Trick: Mapping to Higher Dimensions

A circular pattern in 2D becomes linearly separable when mapped to higher dimensions.

Common Kernels:

  • Linear: Straight-line separation
  • RBF (Radial Basis Function): Flexible, curved boundaries
  • Polynomial: Curved surfaces

When to Use SVM:

  • Binary classification
  • High-dimensional data
  • Clear margin of separation
  • Small to medium datasets

Limitations:

  • Slow on large datasets
  • Sensitive to parameter choice
  • Requires feature scaling

Code Example:

from sklearn.svm import SVC
 
model = SVC(kernel='rbf', C=1.0, gamma='scale')
model.fit(X_train_scaled, y_train)
predictions = model.predict(X_test_scaled)

Day 21: Naive Bayes

What It Is

A probabilistic classifier based on Bayes' Theorem with a "naive" assumption of feature independence.

Bayes' Theorem

P(\text{Class} \mid \text{Features}) = \frac{P(\text{Features} \mid \text{Class}) \times P(\text{Class})}{P(\text{Features})}


Bayes' Theorem: Updating Probabilities

The posterior probability depends on both the prior probability and the likelihood of evidence.

The Naive Assumption

Assume all features are independent given the class. This simplifies calculation dramatically (even though it is rarely true in practice).

P(x_1, x_2, ..., x_n \mid y) = P(x_1 \mid y) \times P(x_2 \mid y) \times ... \times P(x_n \mid y)

Why It is Called "Naive"

Real-world features are often correlated, but Naive Bayes ignores this. Surprisingly, this works well anyway!

Text Classification Example: Spam Detection

P(\text{Spam} \mid \text{words}) \propto P(\text{words} \mid \text{Spam}) \times P(\text{Spam})

P(\text{Spam} \mid \text{words}) \propto P(\text{free} \mid \text{Spam}) \times P(\text{money} \mid \text{Spam}) \times P(\text{urgent} \mid \text{Spam}) \times P(\text{Spam})

Types of Naive Bayes:

Type | Best For
Gaussian | Continuous features (assumes normal distribution)
Multinomial | Word counts, text classification
Bernoulli | Binary features (present/absent)

When to Use Naive Bayes:

  • Text classification (spam, sentiment)
  • Multi-class classification
  • Fast training and prediction
  • Works well with small data

Code Example:

from sklearn.naive_bayes import MultinomialNB
 
model = MultinomialNB()
model.fit(X_train, y_train)
predictions = model.predict(X_test)

πŸ“ Notes (Algorithms)

Ask yourself:
"What kind of problem does this solve best?"


📈 PHASE 4 — Evaluation & Learning Process (Days 22–29)

Day 22: Model Evaluation Metrics

How Do You Know If Your Model Is Good?

You need metrics to quantify performance. Different problems require different metrics.

Regression Metrics:

1. Mean Absolute Error (MAE)

\text{MAE} = \frac{1}{n} \sum_{i=1}^{n} |y_i - \hat{y}_i|

Interpretation: Average absolute difference between predictions and actual values. In the same units as target.

2. Mean Squared Error (MSE)

\text{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2

Penalizes larger errors more heavily. Units are squared.

3. Root Mean Squared Error (RMSE)

\text{RMSE} = \sqrt{\text{MSE}}

Interpretation: Back to original units. More interpretable than MSE.

4. RΒ² Score (Coefficient of Determination)

R^2 = 1 - \frac{\text{SS}_{\text{residual}}}{\text{SS}_{\text{total}}}

Interpretation: Proportion of variance explained. 1.0 = perfect, 0 = predicts mean always.

Classification Metrics:

1. Accuracy

\text{Accuracy} = \frac{\text{TP} + \text{TN}}{\text{TP} + \text{TN} + \text{FP} + \text{FN}}

Interpretation: Proportion of correct predictions. Good for balanced classes.

2. Precision

\text{Precision} = \frac{\text{TP}}{\text{TP} + \text{FP}}

Interpretation: Of all positive predictions, how many are correct? Important when false positives are costly.

3. Recall (Sensitivity)

\text{Recall} = \frac{\text{TP}}{\text{TP} + \text{FN}}

Interpretation: Of all actual positives, how many did we find? Important when false negatives are costly.

4. F1 Score

\text{F1} = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}

Interpretation: Harmonic mean of precision and recall. Good for imbalanced classes.

Choosing the Right Metric:

Problem Type | Common Metrics
Regression | MAE, RMSE, R²
Binary Classification (balanced) | Accuracy
Binary Classification (imbalanced) | Precision, Recall, F1
Multi-class | Macro/Micro F1
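
Most of these metrics are one-liners in scikit-learn; a minimal sketch with a tiny set of made-up labels and predictions:

from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

print("Accuracy: ", accuracy_score(y_true, y_pred))
print("Precision:", precision_score(y_true, y_pred))
print("Recall:   ", recall_score(y_true, y_pred))
print("F1:       ", f1_score(y_true, y_pred))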

Day 23: Confusion Matrix

What Is a Confusion Matrix?

A table that describes the performance of a classification model. It shows actual vs predicted classifications.

Binary Classification Example:

\begin{bmatrix} \text{TN} & \text{FP} \\ \text{FN} & \text{TP} \end{bmatrix}

Where:

  • TN = True Negative (Correctly predicted negative)
  • TP = True Positive (Correctly predicted positive)
  • FN = False Negative (Missed positive - Type II Error)
  • FP = False Positive (Wrongly predicted positive - Type I Error)

Example: Cancer Detection (100 patients)

\begin{bmatrix} 85 & 5 \\ 3 & 7 \end{bmatrix}

Calculations:

\text{Accuracy} = \frac{85 + 7}{100} = 92\%

\text{Precision} = \frac{7}{7 + 5} \approx 58\%

\text{Recall} = \frac{7}{7 + 3} = 70\%

Multi-Class Confusion Matrix:

For 3+ classes, you get an N×N matrix showing all class predictions.

Visualizing:

from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
 
cm = confusion_matrix(y_test, predictions)
disp = ConfusionMatrixDisplay(confusion_matrix=cm)
disp.plot()

Why It Matters:

  • Shows not just accuracy, but types of errors
  • Identifies which classes are confused with each other
  • Reveals class imbalance issues

Day 24: ROC Curve & AUC

ROC Curve

Receiver Operating Characteristic curve plots:

  • True Positive Rate (Recall) on Y-axis
  • False Positive Rate on X-axis

At various classification thresholds.

\text{TPR} = \frac{\text{TP}}{\text{TP} + \text{FN}} \quad (\text{Recall})

\text{FPR} = \frac{\text{FP}}{\text{FP} + \text{TN}} \quad (\text{False Alarm Rate})

AUC (Area Under the Curve)

The area under the ROC curve. Single number summary:

AUC | Meaning
0.5 | Random guessing (useless)
0.7-0.8 | Fair
0.8-0.9 | Good
0.9+ | Excellent


ROC Curve: Tradeoff Between TPR and FPR

The ROC curve shows the tradeoff between True Positive Rate and False Positive Rate at different thresholds. The diagonal line represents random guessing.

Why Use ROC-AUC?

  • Threshold-independent evaluation
  • Works well for imbalanced datasets
  • Shows tradeoff between TPR and FPR

When to Use:

  • Binary classification
  • Comparing multiple models
  • Imbalanced classification problems

Code Example:

import matplotlib.pyplot as plt
from sklearn.metrics import roc_auc_score, roc_curve
 
# Get probabilities for positive class
y_proba = model.predict_proba(X_test)[:, 1]
 
# AUC Score
auc = roc_auc_score(y_test, y_proba)
 
# Plot ROC Curve
fpr, tpr, thresholds = roc_curve(y_test, y_proba)
plt.plot(fpr, tpr, label=f"AUC = {auc:.3f}")
plt.plot([0, 1], [0, 1], linestyle='--', label="Random guessing")
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.legend()
plt.show()

Day 25: Cross-Validation

The Problem with Single Split

A single train-test split gives one performance estimate. But what if that split is unlucky?


K-Fold Cross-Validation

Watch how K-fold CV uses each fold as test set exactly once, giving a more reliable performance estimate.

K-Fold Cross-Validation Solution

Split data into K folds, train K times:

Fold | Split
1 | [Test] [Train] [Train] [Train] [Train]
2 | [Train] [Test] [Train] [Train] [Train]
3 | [Train] [Train] [Test] [Train] [Train]
4 | [Train] [Train] [Train] [Test] [Train]
5 | [Train] [Train] [Train] [Train] [Test]

Final Score = Average of K scores

Common Values of K:

  • K=5: Good balance of speed and reliability
  • K=10: Standard, very reliable
  • K = n (LOOCV): Leave-One-Out, very slow but uses maximum data

Stratified K-Fold

For classification, use stratified K-fold to maintain class distribution in each fold.

Nested Cross-Validation

For hyperparameter tuning:

Outer Loop: Evaluate model performance
  └─ Inner Loop: Tune hyperparameters

Code Example:

from sklearn.model_selection import cross_val_score
 
scores = cross_val_score(
    model, X, y, 
    cv=5,  # 5-fold
    scoring='accuracy'
)
 
print(f"Scores: {scores}")
print(f"Mean: {scores.mean():.3f}")
print(f"Std: {scores.std():.3f}")

Why It Matters:

  • More reliable performance estimate
  • Uses all data for training and testing
  • Reduces variance of performance estimate

Day 26: Feature Scaling

Why Scale Features?

Many algorithms are sensitive to feature scales:

  • Distance-based algorithms (KNN, SVM)
  • Gradient descent algorithms
  • Regularization

Without Scaling:

Age: 25-65 (small range)
Income: 20,000-200,000 (large range)

\text{Distance} = \sqrt{(\text{Age}_{\text{diff}})^2 + (\text{Income}_{\text{diff}})^2}

→ Income dominates completely

Types of Scaling:

1. Standardization (Z-score normalization)

z = \frac{x - \mu}{\sigma}

Result: Mean = 0, Std = 1

2. Min-Max Normalization

x_{\text{scaled}} = \frac{x - \min(x)}{\max(x) - \min(x)}

Result: Range [0, 1]

3. Robust Scaling

Uses median and IQR (outlier-resistant):

x_{\text{scaled}} = \frac{x - \text{median}(x)}{\text{IQR}}

Code Example:

from sklearn.preprocessing import StandardScaler, MinMaxScaler
 
# Standardization
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
 
# Min-Max Scaling
scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)

Important: Fit scaler on training data only, transform both train and test with same scaler.

When to Scale:

  • KNN, SVM, K-Means
  • Logistic Regression, Linear Regression (with regularization)
  • Neural Networks
  • PCA, clustering

When Scaling Is Not Needed:

  • Tree-based models (Decision Tree, Random Forest) — invariant to scale


Day 27: Gradient Descent

What Is Gradient Descent?

An optimization algorithm that minimizes a loss function by iteratively moving in the direction of steepest descent.

The Intuition

Imagine being blindfolded on a mountain and trying to reach the bottom by feeling the slope under your feet. Gradient descent is the mathematical version of this.

The Update Rule

\theta_{\text{new}} = \theta_{\text{old}} - \alpha \times \nabla

Where:

  • θ = model parameters (weights)
  • α = learning rate (step size)
  • ∇ = gradient (slope of loss function)

Visual Example:


Gradient Descent: Finding the Minimum

Watch how gradient descent iteratively moves toward the minimum of the loss function. The learning rate determines step size.

Types of Gradient Descent:

1. Batch Gradient Descent

  • Uses all data per iteration
  • Stable, but slow for large datasets

2. Stochastic Gradient Descent (SGD)

  • Uses one sample per iteration
  • Fast, noisy, can escape local minima

3. Mini-Batch Gradient Descent

  • Uses small batches (32, 64, 128 samples)
  • Best of both worlds — most common in practice

Learning Rate Matters:

  • Too small: Converges very slowly
  • Too large: Oscillates or diverges
  • Just right: Converges efficiently

Code Example (Concept):

# Simplified stochastic gradient descent for linear regression (no intercept term)
import numpy as np

def sgd_linear_regression(X, y, learning_rate=0.01, epochs=100):
    weights = np.zeros(X.shape[1])
    for epoch in range(epochs):
        for x_i, y_i in zip(X, y):           # one sample at a time (SGD)
            prediction = x_i @ weights       # model output for this sample
            error = prediction - y_i         # how far off the prediction is
            gradient = error * x_i           # gradient of the squared error w.r.t. weights
            weights -= learning_rate * gradient
    return weights

Day 28: Regularization

What Is Regularization?

A technique to prevent overfitting by adding a penalty to the loss function that discourages complex models.

The Bias-Variance Tradeoff Again

Regularization intentionally increases bias to reduce variance, finding a better total error.

Types of Regularization:

1. L1 Regularization (Lasso)

Adds absolute value of weights to loss:

Loss = MSE + λ × Σ|weights|

Effect: Pushes some weights to exactly zero (feature selection)

2. L2 Regularization (Ridge)

Adds squared weights to loss:

Loss = MSE + λ × Σ(weights)²

Effect: Shrinks weights toward zero (but rarely to exactly zero)

3. Elastic Net (Combination)

\text{Loss} = \text{MSE} + \lambda_1 \sum |w_i| + \lambda_2 \sum w_i^2


Regularization: Controlling Model Complexity

Adjust lambda (regularization strength) to see how it affects the model complexity. Higher lambda = simpler model = less overfitting.

Lambda (λ) Controls Strength:

  • λ = 0: No regularization (risk of overfitting)
  • λ = large: Very strong regularization (risk of underfitting)

Code Example:

from sklearn.linear_model import Ridge, Lasso, ElasticNet
 
# Ridge (L2)
ridge = Ridge(alpha=1.0)
ridge.fit(X_train, y_train)
 
# Lasso (L1)
lasso = Lasso(alpha=1.0)
lasso.fit(X_train, y_train)
 
# Elastic Net
elastic = ElasticNet(alpha=1.0, l1_ratio=0.5)
elastic.fit(X_train, y_train)

Day 29: Hyperparameter Tuning

What Are Hyperparameters?

Parameters set BEFORE training starts (not learned from data):

  • Learning rate
  • Number of trees in Random Forest
  • K in KNN
  • Regularization strength
  • Max depth of Decision Tree

Why Tuning Matters

Small changes can dramatically affect performance:

Default Random Forest:     82% accuracy
Tuned Random Forest:        89% accuracy

Tuning Methods:

1. Grid Search

Try all combinations in a predefined grid.

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
 
param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [5, 10, 15],
    'min_samples_split': [2, 5, 10]
}
 
grid_search = GridSearchCV(
    RandomForestClassifier(),
    param_grid,
    cv=5,
    scoring='accuracy'
)
grid_search.fit(X_train, y_train)

2. Random Search

Randomly sample hyperparameters (often more efficient).

from sklearn.model_selection import RandomizedSearchCV
 
random_search = RandomizedSearchCV(
    RandomForestClassifier(),
    param_distributions=param_grid,
    n_iter=30,
    cv=5
)
random_search.fit(X_train, y_train)

3. Bayesian Optimization (Advanced)

Uses probability models to smartly select promising hyperparameters.

Important Rules:

  • Tune on validation set, not test set
  • Use cross-validation for small datasets
  • Don't tune too many parameters at once

πŸ“ Notes (Evaluation)

This phase is what separates
"tutorial ML" from "real ML".


🧩 PHASE 5 — The Big Picture (Day 30)

Day 30: End-to-End Machine Learning Workflow

Putting it all together—this is how real ML projects work:

The Complete Pipeline:

  1. Define Problem — What are we predicting? What data do we need?

  2. Collect Data — Get, scrape, or buy data relevant to the problem

  3. Explore Data (EDA) — Understand distributions, correlations, patterns

  4. Preprocess Data — Clean, handle missing values, encode categories, scale

  5. Split Data — Train/validation/test split

  6. Choose Baseline — Simple model to beat (linear regression, majority class)

  7. Try Multiple Models — Compare 3-5 different algorithms

  8. Evaluate & Tune — Use validation set, cross-validation, hyperparameter tuning

  9. Final Evaluation — Evaluate on test set ONCE

  10. Deploy & Monitor — Put in production, track performance over time
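
A compressed sketch of steps 4–9 on a small built-in dataset (the dataset and candidate models are just placeholders for whatever your problem needs):

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)

candidates = {
    "logistic": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    "random_forest": RandomForestClassifier(n_estimators=200, random_state=42),
}

# Compare models with cross-validation on the training data only
for name, model in candidates.items():
    scores = cross_val_score(model, X_train, y_train, cv=5, scoring="accuracy")
    print(name, scores.mean())

# Refit the chosen model (whichever scored best) and evaluate ONCE on the held-out test set
best = candidates["random_forest"].fit(X_train, y_train)
print("test accuracy:", best.score(X_test, y_test))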

Key Insight:

Stage | Time
Data Collection & Cleaning | 60%
Model Building & Tuning | 20%
Evaluation & Deployment | 20%

Real ML is 80% data work.

Final Project Idea:

Build a complete ML pipeline end-to-end:

  1. Collect data (Kaggle dataset or scrape)
  2. Clean and preprocess
  3. Try 3+ models
  4. Tune the best one
  5. Evaluate on held-out test set
  6. Write a short report explaining your choices

πŸ“ Notes (Workflow)

ML is a process, not a script.


πŸ› οΈ PROJECT IDEAS

Pro Tip: Building projects is where theory becomes real understanding

ML basics, features, labels, train/test split

Project | Difficulty | Time
Predict house prices using fake/simple data | ⭐ | 2-3 hours
Classify students as pass/fail using manual rules | ⭐ | 1-2 hours
Manually label a small dataset (build intuition) | ⭐ | 1 hour
Build a simple guessing game based on rules | ⭐ | 1 hour

Skills Practiced: Understanding inputs → outputs, thinking in features, manual pattern recognition

Preprocessing, correlation, underfitting/overfitting

Project | Difficulty | Time
Clean a messy CSV (handle missing values, outliers) | ⭐⭐ | 3-4 hours
Visualize correlations with matplotlib/seaborn | ⭐⭐ | 2-3 hours
Create synthetic data and identify underfitting vs overfitting | ⭐⭐ | 3-4 hours
Exploratory data analysis (EDA) on a real dataset | ⭐⭐ | 4-5 hours

Skills Practiced: Data cleaning, visualization, statistical thinking, model behavior recognition

Linear Regression, KNN, Decision Trees, Random Forest, SVM, Naive Bayes

Project | Difficulty | Time
Linear Regression from scratch (NumPy only) | ⭐⭐⭐ | 4-5 hours
Spam classifier using Naive Bayes | ⭐⭐⭐ | 4-5 hours
Build a Decision Tree classifier | ⭐⭐⭐ | 3-4 hours
Compare KNN vs Logistic Regression on same dataset | ⭐⭐⭐ | 4-5 hours
Titanic survival prediction with Random Forest | ⭐⭐⭐ | 5-6 hours

Skills Practiced: Algorithm implementation, model comparison, feature interpretation

Metrics, cross-validation, regularization, hyperparameter tuning

Project | Difficulty | Time
Evaluate model with Confusion Matrix, Precision, Recall, F1 | ⭐⭐ | 2-3 hours
Plot and interpret ROC-AUC curves | ⭐⭐⭐ | 3-4 hours
Implement 5-fold cross-validation from scratch | ⭐⭐⭐ | 4-5 hours
Feature scaling experiment (with/without StandardScaler) | ⭐⭐ | 2-3 hours
Grid search vs Random search comparison | ⭐⭐⭐ | 4-5 hours

Skills Practiced: Model evaluation, robust validation, performance optimization

Apply the complete ML workflow to a real problem

Prediction

House prices, stock prices, demand forecasting

Focus: Regression, feature engineering

Classification

Customer churn, fraud detection, disease diagnosis

Focus: Metrics, class imbalance

Clustering

Customer segmentation, document grouping

Focus: Unsupervised learning

Recommendation

Movie/product recommendations

Focus: Collaborative filtering

Capstone Requirements:
  1. Define the problem clearly
  2. Load and explore data (EDA)
  3. Preprocess and engineer features
  4. Train multiple models
  5. Evaluate rigorously (metrics + validation)
  6. Document your reasoning and choices

Focus on reasoning, not accuracy. A well-reasoned wrong answer is more valuable than a lucky correct one.


🎯 How to Use This Page

  • One day = one concept
  • Tick only when you understand
  • Revisit notes weekly
  • Build at least 1 small project

🧠 Final Reminder

You do NOT need to know everything.

If you can do these three things, you already know Machine Learning:

✅ Understand the workflow
✅ Choose the right model
✅ Evaluate properly

You already know Machine Learning. 🚀
