From Research to Production: What Code Needs Before It Can Ship

If you've ever handed off research code to an engineering team — or received it — you probably know this moment.

The model works. The accuracy is great. The paper is published or the PoC is approved. Everyone is excited.

Then someone says:

“OK, let’s ship it.”

And that's where the gap appears.

I've lived both sides of this problem throughout my career.

I've done the research work myself — experimenting with architectures, trying multiple approaches before finding one that finally works, staring at loss curves late at night hoping the training converges. I know the feeling of waking up at 3am, walking past the screen, and stopping just to check whether the run finally succeeded.

And I've also been the engineer, architect, and technical lead responsible for turning that same code into production systems that real users depend on — taking ideas from experimentation all the way to scalable products that reach end users. I've always loved that part of engineering too: turning research and ideas into something real that people can actually use.

So when I talk about the gap between research and production, I'm not taking sides.

This is not about right or wrong. There is rarely one perfect way to do things in engineering. Context matters. Teams matter. Products matter. Longevity in this field teaches adaptability much more than ideology.

Researchers are brilliant. They solve difficult problems, push boundaries, and create work that genuinely matters. The gap between research and production is not about intelligence or capability.

It’s about priorities.

Research code is optimized for experimentation and discovery.
Production code is optimized for reliability, maintainability, security, and collaboration.
Different priorities create different constraints.

Most researchers were never expected to think deeply about production concerns before — because in research environments, many of those concerns simply do not exist yet.

And honestly, that makes sense.

What research code typically looks like

There’s a notebook, a few Python scripts, maybe some code adapted from GitHub to reproduce a baseline. The model trains correctly, the metrics look good, and the experiment reaches the target.

Then you look a little closer.

Paths are hardcoded to one GPU machine.
Credentials are sitting directly inside the source code.
Training, preprocessing, inference, evaluation, and debugging utilities all live in one large script.

And honestly, this makes complete sense.

Researchers optimize for one thing: results. They work alone or in small groups, on their own machines, with their own data. The code only needs to produce the result — the paper, the proof, the accuracy number. Maintainability is rarely the priority. Often, nobody else needs to run it beyond the research environment.

Production changes the constraints.

The four gaps to close

When research code moves toward a real product, four problems surface almost immediately. None of them are ML-specific — they are standard engineering concerns that most researchers simply haven't been exposed to yet.

Security

A hardcoded API key or credential in your source code is not a minor issue — it is a security incident waiting to happen. Even in a private repository, secrets in code will fail a cybersecurity scan and stop your entire release.

The fix is straightforward: environment variables and secret management tools. But if you've only ever run code on your own machine with your own credentials, you've probably never needed to think about this before. It's not negligence — it's simply a rule that did not exist in your environment until now.

In production, many people and systems interact with the same codebase. That changes everything.

Maintainability

Hardcoded paths work perfectly — on one machine, in one environment, right now. The moment another engineer runs the same code on a different server or inside a Docker container, everything breaks.

Environment-specific settings — dataset paths, model paths, service endpoints — belong in configuration files or environment variables, not inside the code itself.

This becomes important very quickly once code moves across environments and teams.

Code quality

When one person maintains the code, formatting usually does not matter much. But research-to-production is rarely a one-time handoff.

Models evolve. Bugs are found. Research teams continue improving architectures, retraining models, or adjusting inference logic after deployment work has already started.

Without shared formatting and code quality standards, collaboration between teams becomes unnecessarily difficult over time.

A few tools help keep the repository predictable and easier for everyone to work in:

pylint— checks code quality and catches common errors
black — enforces consistent Python formatting automatically
isort — keeps imports organized so PRs don't include unnecessary diff noise
prettier — keeps Markdown, JSON, YAML, and frontend-related files consistently formatted

These are small things, but shared standards reduce friction significantly when multiple teams continue working in the same repository long-term.

Repository structure

A single script that does everything is completely fine during experimentation and research. Not every exploratory notebook or adopted GitHub implementation needs enterprise-level structure.

But the code handed over for production inference is different.

That path should be clear and separated:

model initialization preprocessing inference post-processing configuration

If everything is mixed together inside training scripts, experiments, notebooks, and debugging utilities, another engineer has to spend time figuring out which parts are actually required for production.

Good repository structure is not about aesthetics. It is what makes inference code testable, deployable, maintainable, and easier to evolve as the product grows.

It also allows teams to move faster without repeatedly rediscovering the same assumptions hidden inside the repository.

Practical checks before handoff

Most of the friction during handoff is not the model itself.

Usually, it's the engineering assumptions around it.

A few small adjustments before handoff can save the production team an enormous amount of time later.

Separate inference code from training code

Production teams usually only need the inference path, not the entire training pipeline.

If inference logic is mixed together with experiments, notebooks, training utilities, and debugging code, another engineer has to go through the repository and extract dependencies piece by piece just to deploy the model.

That becomes difficult very quickly.

There is also a security and maintenance reason for this separation.

Inference environments usually do not need the full training dependency stack. In enterprise systems, every dependency goes through vulnerability and security scanning. The more unnecessary libraries included, the larger the maintenance and security surface becomes.

A smaller inference environment is easier to maintain, safer to operate, and faster to deploy.

Move configurable values out of the code

Avoid hardcoding:

batch sizes
model locations
API endpoints
device settings

Server specifications and environments change constantly.

One mindset that helps:

write configuration as if the next engineer has never seen your repository.

Because most of the time, they haven't.

If possible, leave small notes about reasonable parameter ranges and what changing them affects.

Small context prevents a surprising amount of unnecessary debugging later.

Do not assume device_id == 0

This one sounds small, but it appears surprisingly often.

Avoid hardcoding: cuda:0

Production environments are rarely identical to local research machines.

GPU allocation may change dynamically in shared servers, Kubernetes clusters, or multi-tenant systems. GPU 0 may not even be available.

Make device selection configurable or resolve it dynamically at runtime.

Small change. Very common production issue avoided.

Push your work to Git early

This is not specifically a research-to-production gap, but it matters more than people think.

I have seen training servers crash. I have seen EC2 instances terminated because a reservation contract ended. And suddenly, weeks or months of work disappeared with the machine.

Push your work to Git regularly.

If the project is likely to become a real product, enable:

security scanning
Dependabot
vulnerability scanning for requirements.txt

Start fixing dependency issues early, especially Critical and High severity findings. It is much easier than waiting until production handoff.

In enterprise environments, dependency versions matter. Security teams will scan them eventually anyway.

Document inference assumptions and resource usage

When handing over a model, do not provide only the code and weights. Provide the operational assumptions as well.

Include:

input specification
preprocessing requirements
default inference settings
expected output format
recommended batch size

And most importantly, explain why the defaults were chosen.

The engineer receiving your code may need to adjust it for:

lower latency higher throughput
smaller GPU memory
different hardware environments
different business constraints

Small context helps people make safe adjustments without guessing.

Also provide benchmark information such as:

GPU usage
memory usage
disk usage
input size
batch size
inference speed in milliseconds
tokens per second for LLM systems

Especially in computer vision, inference speed can change significantly depending on image size and batch configuration.

If reproducibility matters, also document:

random seeds
library versions
runtime environments used during validation

These numbers become the baseline for production validation, capacity planning, and future optimization work. They also help teams detect performance regressions after refactoring or deployment changes.

Keep inference dependencies local

For production inference, avoid pulling models or assets dynamically from external sources such as Hugging Face during runtime.

Keep:

model weights
tokenizers
configs
inference assets

local and version-controlled.

There are practical reasons for this:

enterprise environments may block external network access
external services can go down or become rate-limited
upstream changes can create inconsistent inference results between deployments

Production systems should be reproducible and stable.

Stable inputs create stable systems.

The bridge, not the barrier

Every standard described here exists for a reason.

Security checks protect users and systems.

Configuration management helps code survive changing environments.

Code quality tools make collaboration possible.

Repository structure makes systems maintainable long after the original author moves on.

These are not obstacles placed between research and production.

They are the bridge.

And in my experience, once researchers understand why these standards exist — not as bureaucracy, but as protection for their own work and their own results — most of them embrace it naturally.

Because their models, experiments, and hard-earned results deserve to reach real users in a form that is stable, secure, and built to last.

The gap between research and production is real.

But it is not a wall.

Most of the time, it just needs someone willing to build the bridge.

From Research to Production: What Code Needs Before It Can Ship

What research code typically looks like

The four gaps to close

Security

Maintainability

Code quality

Repository structure

Practical checks before handoff

Separate inference code from training code

Move configurable values out of the code

Do not assume device_id == 0

Push your work to Git early

Document inference assumptions and resource usage

Keep inference dependencies local

The bridge, not the barrier

Comments (1)

More from this blog

How to Understand Latency in Computer Vision Inference Systems

Command Palette

What research code typically looks like

The four gaps to close

Security

Maintainability

Code quality

Repository structure

Practical checks before handoff

Separate inference code from training code

Move configurable values out of the code

Do not assume device_id == 0

Push your work to Git early

Document inference assumptions and resource usage

Keep inference dependencies local

The bridge, not the barrier

Comments (1)

More from this blog