From Notebook to Production: How I Actually Deploy ML Models
Your model working in a notebook means nothing. I break down what it actually takes to deploy ML systems in production — the messy parts, the real tradeoffs, and what I wish I knew earlier.
Introduction
The first time I deployed a model outside a notebook, it broke in six different ways before I could even call it "running".
Not because the model was bad. Not because my code was wrong.
It broke because I had never actually thought about what production means.
Most people learn ML through notebooks — Jupyter, Colab, whatever. You load your data, clean it, train a model, print an accuracy score, feel great.
But that's not a product. That's a science experiment.
By the end of this post, you'll understand:
- The real mental model for production ML systems
- Why the gap between experimentation and deployment is enormous
- What actually matters when real users depend on your model
What People Think ML Deployment Is (And Why That's Wrong)
Most people (including past me) think deploying an ML model means:
- Saving the model as a .pkl file
- Writing a Flask route around it
- Putting it on a server somewhere
- Done
That's not wrong. But it's dangerously incomplete.
Here's what production ML actually is:
- It's a system, not a model
- It's a pipeline that runs continuously, not a script you run once
- It's something that breaks in production even when it works in your notebook
In a notebook, everything is controlled. Your data is clean. Your environment is frozen. There's no latency pressure. No one's calling your model at 3am with malformed inputs.
Production is the opposite of that.
Production is chaos dressed in a nice API.
Why Notebooks Lie to You
Your notebook is lying to you and it's doing it politely.
When you train a model locally, you control everything:
- The data is already cleaned
- The features are already engineered
- The environment has every dependency installed
- There's no one sending weird inputs you didn't anticipate
None of that is true in production.
Real users send missing values. Real data pipelines break. Real servers have different Python versions than your laptop. Real traffic spikes at unpredictable times.
The notebook creates an illusion of progress.
You see 92% accuracy and think you're done. But you haven't tested what happens when a column is null. Or when the model gets a string where it expected a float. Or when the input distribution shifts because it's December and holiday shopping has changed your users' behavior.
The notebook doesn't show you any of that.
Notebook = ideal conditions. Production = adversarial conditions.
Your model is being tested in a quiet lab. It needs to perform in the middle of a storm.
The Real Shift: From Model to System
This is the insight that changed how I think about ML work.
Your model is not the product. Your model is one component inside a product.
Think of it like this: if you're building a car, the engine matters. But the engine alone is useless without fuel, wheels, steering, and brakes. You don't sell an engine, you sell a car.
Same with ML.
The actual system looks like this:
Raw Data → Data Pipeline → Feature Engineering → Model → Prediction → Post-Processing → Response
Every single arrow in that diagram is a failure point.
Most ML engineers spend 80% of their time building the model and 20% on everything else. In production, that ratio needs to flip.
The features you engineer matter more than the model you pick.
The data pipeline you build matters more than the features you engineer.
The monitoring you set up matters more than the accuracy you achieved.
This was hard for me to accept. I wanted to optimize the model. Turns out, I needed to be an engineer first and a data scientist second.
Packaging the Model (The First Reality Check)
So you've trained your model. Now what?
First, you have to save it in a way that can be loaded somewhere else, by someone else (or by your own code in a different environment).
This is called serialization. And it's messier than it sounds.
Your options:
- pickle — the default, but brittle. Tied to your exact Python version and environment. I've had pickle files break because I updated scikit-learn by a minor version.
- joblib — better for large numpy arrays, still Python-specific.
- ONNX — a framework-agnostic format that lets you export from PyTorch or sklearn and run anywhere. More setup, but more portable.
- TorchScript — for PyTorch models specifically, good for production.
- SavedModel — TensorFlow's own format.
What I've learned the hard way: pickle is fine for experiments, not for production.
The moment you need to serve a model to someone else, or on a different machine, or six months from now, you'll want a more stable format.
The second problem is dependency hell.
Your model was trained in an environment with specific versions of pandas, numpy, scikit-learn, etc. If production has different versions, the behavior might change. Slightly. Just enough to produce wrong predictions without crashing.
That's the worst kind of bug. Silent and slow.
The fix: containerize everything.
Docker solves this. You freeze your entire environment into a container image. Same Python, same libraries, same behavior — everywhere.
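A minimal image for a model service might look like this. It's a sketch: the file names and base image are placeholders, and requirements.txt should pin the exact versions your training run used.

```dockerfile
# Freeze the interpreter version explicitly.
FROM python:3.11-slim

WORKDIR /app

# requirements.txt should pin exact versions (e.g. scikit-learn==1.4.2)
# matching the training environment.
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Model artifact and serving code.
COPY model.joblib serve.py ./

CMD ["python", "serve.py"]
```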
Is it overhead? Yes. Is it worth it? Absolutely.
Reproducibility is harder than it sounds. But it's non-negotiable.
Serving the Model: APIs, Batch, or Streaming
Once your model is packaged, you need to expose it somehow.
There are three main patterns:
Real-Time Inference (APIs)
User makes a request → your model runs → response returned immediately.
This is what most people think of. FastAPI or Flask wrapping your model, exposed via HTTP.
Use this when:
- You need a response right now (fraud detection, recommendation, search ranking)
- Latency matters — users are waiting
The tradeoff: you have to optimize for speed. A model that takes 3 seconds to respond is useless in a user-facing product. You need to think about latency, concurrency, and load balancing.
Batch Inference
You run your model on a chunk of data at a scheduled time. Not in real-time.
Use this when:
- Latency doesn't matter (generating weekly reports, scoring customer segments overnight)
- You have a lot of data to process at once
This is much simpler to build and operate. No API to manage. Just a job that runs on a schedule.
I almost always recommend starting here if you can.
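A batch job can be almost embarrassingly plain. Here's a sketch using only the standard library; score_row is a placeholder for a real model call, and the file paths would come from your scheduler:

```python
import csv

def score_row(row):
    # Placeholder scoring rule; in practice, load the model once and
    # call predict on whole chunks for speed.
    return min(1.0, 0.01 * float(row["days_since_last_purchase"]))

def run_batch(in_path, out_path):
    """Read a CSV of customers, append a score column, write results."""
    with open(in_path, newline="") as f:
        rows = list(csv.DictReader(f))
    for row in rows:
        row["churn_score"] = score_row(row)
    with open(out_path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=list(rows[0].keys()))
        writer.writeheader()
        writer.writerows(rows)
```

Point cron or Airflow at it and you have a deployment you can rerun, inspect, and diff by hand when something looks off.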
Streaming Inference
Your model processes events as they arrive in a continuous stream (Kafka, Kinesis, etc.).
Use this when:
- You have high-volume, real-time events (clickstreams, IoT sensors)
- Batch is too slow but you don't need per-request latency
This is the most complex to build. Unless you genuinely need it, you probably don't need it.
Data Pipelines: The Silent Killer
This is the part nobody talks about in tutorials. And it's the part that kills most production ML systems.
Let me explain the core problem: training-serving skew.
Training-serving skew happens when the data your model was trained on is different from the data it sees in production.
And this happens more often than you'd think.
Example: You train a model on data where age is always filled in. In production, some users don't provide their age. Your model sees null where it expected a number. Prediction goes sideways.
Another example: You engineer a feature called "days since last purchase" during training. In production, the code that computes that feature has a different timezone assumption. Off by one day across the board. Your model's accuracy silently drops.
The dangerous thing about training-serving skew is that it doesn't throw an error. The model still makes predictions. Just wrong ones.
The fix is straightforward but tedious:
- Build your feature engineering as a shared library, used for both training and serving
- Never compute features differently in two places
- Log what features your model actually receives in production and compare to training
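Concretely, the shared library can be a single module that both the training job and the serving code import. Here's a sketch of the "days since last purchase" feature from the earlier example, with the timezone assumption made explicit:

```python
from datetime import datetime, timezone

def days_since_last_purchase(last_purchase_iso, now=None):
    """Single source of truth: imported by BOTH the training pipeline
    and the serving code, so the logic can't drift apart."""
    now = now or datetime.now(timezone.utc)
    last = datetime.fromisoformat(last_purchase_iso)
    if last.tzinfo is None:
        # Naive timestamps are declared UTC on purpose. An implicit
        # local-time assumption is exactly the off-by-one-day bug.
        last = last.replace(tzinfo=timezone.utc)
    return (now - last).days
```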
The second data pipeline problem is upstream dependencies.
Your model depends on data from a database, an API, or another service. If that source changes its schema, your pipeline breaks. If it goes down, your pipeline breaks. If it starts sending nulls where it didn't before, your model degrades.
You have to build your pipeline defensively. Validate inputs. Handle missing values explicitly. Alert when distributions shift.
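Defensive can be boringly explicit. In the sketch below, the schema and fallback defaults are invented; in practice the defaults would come from training-set medians:

```python
# Hypothetical schema: feature name -> (type, fallback default).
EXPECTED = {
    "age": (float, 35.0),
    "days_since_last_purchase": (float, 30.0),
}

def validate(raw):
    """Coerce a raw input dict into clean features, recording every
    repair so it can be logged and alerted on."""
    clean, issues = {}, []
    for name, (typ, default) in EXPECTED.items():
        value = raw.get(name)
        if value is None:
            clean[name] = default
            issues.append(f"{name}: missing, used default")
            continue
        try:
            clean[name] = typ(value)
        except (TypeError, ValueError):
            clean[name] = default
            issues.append(f"{name}: bad value {value!r}, used default")
    return clean, issues
```

The issues list is the important part: impute silently and you've hidden exactly the signal your monitoring needs.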
Monitoring: If You're Not Watching, It's Already Broken
Here's something they don't teach you in any ML course:
Your model will degrade in production, and you won't notice unless you build monitoring.
There are two kinds of drift you care about:
Data drift — the distribution of inputs changes.
Maybe it's seasonal. Maybe user behavior shifted. Maybe an upstream data source changed. Whatever the reason, the model is now receiving inputs that look different from what it was trained on.
Model drift (concept drift) — the relationship between inputs and outputs changes.
The world changes. What predicted churn six months ago might not predict it now. Economic conditions shift. Competitor launches a new product. Your training data becomes stale.
Both of these are silent. No error is thrown. The model just gets worse over time.
What you need to log:
- Inputs: What did your model actually receive? Log a sample.
- Predictions: What did your model output? Distribution of scores, class breakdowns.
- Ground truth: When you eventually find out the actual outcome, compare it to what the model predicted.
- Latency: How long is inference taking? A slowdown is often a signal something is wrong.
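A first-pass drift check doesn't need a platform. One common statistic is the population stability index (PSI) over binned feature values; the sketch below is plain Python, using the usual rule of thumb that a PSI above roughly 0.25 signals a meaningful shift.

```python
import math

def psi(expected, actual, bins=10):
    """Population stability index between a training sample ('expected')
    and a window of production inputs ('actual') for one feature."""
    lo, hi = min(expected), max(expected)
    edges = [lo + (hi - lo) * i / bins for i in range(1, bins)]

    def fractions(values):
        counts = [0] * bins
        for v in values:
            counts[sum(v > e for e in edges)] += 1
        # Smooth empty bins so the log below stays finite.
        return [max(c / len(values), 1e-6) for c in counts]

    exp, act = fractions(expected), fractions(actual)
    return sum((a - e) * math.log(a / e) for e, a in zip(exp, act))
```

Run it per feature on a schedule and alert above a threshold. It's crude, but it turns "the model quietly got worse" from a surprise into a page.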
The monitoring infrastructure is not glamorous. But it's the difference between a model that degrades silently and a model you can actually trust.
Build observability before you optimize. Every time.
How I Actually Use This
Here's my real workflow when I need to take a model to production:
1. Start with batch before real-time
If I can run the model on a schedule instead of in real-time, I do that first. It's simpler, easier to debug, and lets me validate the model's behavior before I'm on the hook for latency.
2. Use FastAPI when I need a real-time endpoint
It's fast, has built-in validation with Pydantic, and generates docs automatically. Flask is fine too, but I've moved to FastAPI for anything new.
3. Containerize from day one
Docker from the start. Not "I'll add Docker later." Later never comes, and retrofitting Docker becomes a migration project.
4. Version everything
Models, data, configs. When something breaks in production (and it will), you need to know what changed. Git for code. MLflow or DVC for model versions. Simple is fine, but something.
5. Add monitoring before launch, not after
I log inputs, outputs, and latency. I set up alerts for when prediction distributions shift outside a threshold. This takes a few hours to build. It has saved me many times.
What I'd Do Differently If I Started Today
I'd spend less time tuning models and more time on the pipeline.
Every hour I spent adjusting hyperparameters could have been spent building better feature engineering. The pipeline improvements delivered more value every time.
I'd design for failure from day one.
What happens if the model returns null? What happens if the database is down? What's the fallback? These questions should be answered before launch, not after the 3am incident.
I'd keep deployments boring and predictable.
The flashiest infra choice is usually the wrong one. Simple beats clever. Boring beats innovative. I want the deployment to be so predictable that launches feel anticlimactic.
That's a good sign.
Common Mistakes & Gotchas
- Ignoring edge cases because 'it worked on test data' — test data is clean, production is not
- Not handling missing or malformed inputs — nulls and wrong types break predictions silently
- Overengineering infra before proving value — start simple, scale when needed
- Treating deployment as an afterthought — think about serving, monitoring, and retraining from the start
Mini FAQ
Q1. How do you deploy a machine learning model to production?
To deploy an ML model to production: serialize it with a stable format like ONNX or joblib, containerize the environment with Docker, expose it via a FastAPI or batch job, validate all inputs explicitly, handle failures with sensible fallbacks, and build monitoring for drift and latency from day one — before you launch, not after.
Q2. What tools are used for ML deployment?
The most common ML deployment stack includes Docker for containerization, FastAPI or Flask for real-time APIs, MLflow or DVC for model and data versioning, and Prometheus or Datadog for monitoring. For managed serving at scale, cloud options like AWS SageMaker, GCP Vertex AI, or Azure ML handle infrastructure so you can focus on the model pipeline.
Q3. What is training-serving skew?
Training-serving skew happens when the data a model was trained on differs from what it receives in production — due to different preprocessing logic, missing values, timezone mismatches, or upstream schema changes. The model keeps running and returning predictions, but accuracy quietly degrades. It is one of the most common and hardest-to-detect failures in production ML systems.
Q4. Should I use real-time or batch inference?
Default to batch inference unless your use case has a hard latency requirement. Batch is simpler to build, cheaper to run, and much easier to debug. Use real-time inference only when users or downstream systems are waiting on the result — like fraud detection, search ranking, or live recommendations. Most internal ML use cases do not need real-time.
Q5. Why do ML models fail in production?
ML models fail in production for five main reasons: data drift (input distribution shifts over time), training-serving skew (different preprocessing in training vs. serving), missing input validation (nulls and malformed data break predictions silently), lack of monitoring (no one notices when accuracy drops), and unrealistic assumptions built during training that do not hold in the real world.
My Opinion: What Actually Matters
My honest take: most ML projects fail not because of bad models, but because of bad systems.
The model is the easy part. Data scientists love to optimize models. It's intellectually fun. You get a metric that goes up and feel good.
But the model is 20% of the work in production.
The pipeline, the monitoring, the versioning, the serving infrastructure — that's the other 80%. And that's what determines whether your model actually creates value.
This is NOT for:
- People optimizing Kaggle scores
- Teams without real users or data flow
This approach breaks when:
- You ignore data contracts and upstream dependencies
- You treat deployment as a one-time event instead of an ongoing responsibility
- You scale before you've proven value — exposing every shortcut you took
Outro
If your model isn't in production, it's just a demo.
Start small. Deploy something imperfect. Watch how it behaves. Iterate.
The fastest way to learn production ML is to be responsible for a model that's actually running somewhere that matters. Everything breaks eventually. The question is whether you find out first or your users do.
Build the monitoring. Version the models. Validate the inputs.
Everything else is details.
Credible Sources
- Google – Rules of Machine Learning: battle-tested production ML guidelines from teams that operate at massive scale.
- MLOps Community: practical insights from engineers working on real production ML — not academic papers.
- Uber Engineering Blog – Michelangelo: how Uber built their end-to-end ML platform. The scale is different, the lessons are universal.
- Netflix Tech Blog: real-world deployment and experimentation systems from teams that take ML seriously.
- AWS Machine Learning Blog: production-focused architectures from teams deploying at scale.