Putting AI into production - privately

There’s a quiet assumption baked into a lot of AI pitches: that to get a capable model, you have to send your data somewhere else. For most industries that’s an annoyance. For a bank, it’s a non-starter.

The good news is that the assumption is no longer true. Open-weight models - Llama, Qwen, Mistral, DeepSeek, Gemma - have closed enough of the gap that, for the overwhelming majority of finance back-office work, a privately deployed model does the job. The interesting question stopped being “is the open model good enough?” and became “how do we run it well, where our data already lives?”

What “private” actually means

Private deployment isn’t a checkbox. It’s a set of concrete decisions:

The weights run in your environment. Your VPC, your on-prem cluster, your sovereign region. Inference happens next to the data, not across an API boundary.
Nothing is sent off to be trained on. No prompts, no documents, no customer records leaving the perimeter. The model is a dependency you host, the same way you host a database.
You keep the artifacts. Weights, adapters, prompts, eval sets - they’re yours. If you part ways with your vendor, the system keeps running.

None of this is exotic. It’s standard infrastructure practice applied to a new kind of workload.

The serving layer is the hard part - and it’s solved

Running a model in production is mostly an inference-serving problem: batching, KV-cache management, quantization, autoscaling, keeping the GPUs busy. This is where engines like vLLM earn their keep. A well-tuned vLLM deployment on rented or in-cloud GPUs will serve a 70B-class model at a cost that surprises people who’ve only seen per-token API bills.

# A private OpenAI-compatible endpoint, in your cloud, in one line
vllm serve Qwen/Qwen2.5-72B-Instruct \
  --tensor-parallel-size 4 \
  --max-model-len 32768 \
  --gpu-memory-utilization 0.92

Point your applications at that endpoint and the rest of the stack - RAG, agents, copilots - doesn’t care that the model is private. That’s the whole trick: the privacy boundary lives at the infrastructure layer, so everything above it stays simple.

Cost follows ownership

When compute is rented by the hour or runs on hardware you already pay for, the economics invert. You’re no longer taxed per token on every interaction; you pay for capacity and drive utilization up. For workloads with steady volume - and back-office automation is exactly that - owning the serving layer is usually cheaper and more private. You rarely have to trade one for the other.

Where to start

Pick one workload with clear value and contained scope - document extraction, a support copilot, a reconciliation assistant. Stand up a private endpoint, ground it in your data, measure it honestly, and put it in front of real users. The point isn’t a demo. The point is the first production system that proves the pattern, so the next ten are routine.

That’s the work. It’s less about the model and more about the engineering around it - which, after seventeen years in payments, is the part we find familiar.