Open source, battle-tested: the asset-light AI stack
You don't need to own a GPU farm or a hundred-person team to ship serious AI. The open-source ecosystem already did the hard parts. Here's the stack we build on.
We’re a small, senior team, and we don’t own infrastructure. That’s not a limitation we apologize for - it’s a deliberate position. The open-source AI ecosystem now contains so much battle-tested engineering, built and used by the world’s largest companies, that the leverage available to a focused team is enormous. The job is to compose it well, point it at a real problem, and run it.
Here’s the stack we reach for, and why.
Models - open weights first
Llama, Qwen, Mistral, DeepSeek, Gemma. Open-weight models you can host,
inspect and fine-tune, with a frontier API kept as an optional escape hatch for
the few tasks that genuinely need it. Open weights are what make private
deployment possible at all - you can’t run a closed model in someone else’s VPC.
Serving - keep the GPUs busy
vLLM, Ollama, llama.cpp. The serving layer is where throughput and cost
live. vLLM for production endpoints, Ollama and llama.cpp for edge, local and
on-device work. (TGI has been retired upstream; the open ecosystem has
consolidated around these.)
Agents & retrieval
LangGraph and CrewAI for controllable agent execution; LlamaIndex for
retrieval-first products. For the data layer, pgvector when you already run
Postgres, Qdrant or Milvus when you need a dedicated, scalable vector store.
Fine-tuning - cheaper than people think
LoRA / QLoRA, Unsloth, Axolotl, on top of the Hugging Face ecosystem.
Adapter-based fine-tuning means you can specialize a model on a single GPU and
ship a small adapter rather than a new base. Most “we need a custom model”
requests are really “we need a good RAG pipeline plus a small adapter and a clean
eval set.”
Evaluation & observability - the unglamorous essentials
Langfuse for tracing and observability, Ragas for RAG evals, MLflow for
experiment tracking, Guardrails for runtime constraints. This is the layer that
separates a production system from a demo. If you can’t measure whether a change
made things better or worse, you’re not engineering - you’re guessing.
The backbone is conventional
Underneath all of it, the production glue is a modern cloud-native stack:
Kubernetes and Docker for orchestration, Go and gRPC for the services,
Terraform for infrastructure-as-code, Prometheus and OpenTelemetry for
metrics and traces. The AI is new; the discipline of running reliable services
isn’t.
Compute: rented, not owned
When we need GPUs, we rent them - Runpod, Modal, Lambda - or we deploy into the client’s own cloud. We never buy hardware, and neither should most of our clients for an early system. Asset-light isn’t a cost-cutting story; it’s what lets a senior team stay focused on the work that’s actually hard: the integration, the guardrails, and the last mile into production.
Want this in production?
We help banks and fintechs ship AI privately, on their own cloud.
Start a project