Open source, battle-tested: the asset-light AI stack

We’re a small, senior team, and we don’t own infrastructure. That’s not a limitation we apologize for - it’s a deliberate position. The open-source AI ecosystem now contains so much battle-tested engineering, built and used by the world’s largest companies, that the leverage available to a focused team is enormous. The job is to compose it well, point it at a real problem, and run it.

Here’s the stack we reach for, and why.

Models - open weights first

Llama, Qwen, Mistral, DeepSeek, Gemma. Open-weight models you can host, inspect and fine-tune, with a frontier API kept as an optional escape hatch for the few tasks that genuinely need it. Open weights are what make private deployment possible at all - you can’t run a closed model in someone else’s VPC.

Serving - keep the GPUs busy

vLLM, Ollama, llama.cpp. The serving layer is where throughput and cost live. vLLM for production endpoints, Ollama and llama.cpp for edge, local and on-device work. (TGI has been retired upstream; the open ecosystem has consolidated around these.)

Agents & retrieval

LangGraph and CrewAI for controllable agent execution; LlamaIndex for retrieval-first products. For the data layer, pgvector when you already run Postgres, Qdrant or Milvus when you need a dedicated, scalable vector store.

Fine-tuning - cheaper than people think

LoRA / QLoRA, Unsloth, Axolotl, on top of the Hugging Face ecosystem. Adapter-based fine-tuning means you can specialize a model on a single GPU and ship a small adapter rather than a new base. Most “we need a custom model” requests are really “we need a good RAG pipeline plus a small adapter and a clean eval set.”

Evaluation & observability - the unglamorous essentials

Langfuse for tracing and observability, Ragas for RAG evals, MLflow for experiment tracking, Guardrails for runtime constraints. This is the layer that separates a production system from a demo. If you can’t measure whether a change made things better or worse, you’re not engineering - you’re guessing.

The backbone is conventional

Underneath all of it, the production glue is a modern cloud-native stack: Kubernetes and Docker for orchestration, Go and gRPC for the services, Terraform for infrastructure-as-code, Prometheus and OpenTelemetry for metrics and traces. The AI is new; the discipline of running reliable services isn’t.

Compute: rented, not owned

When we need GPUs, we rent them - Runpod, Modal, Lambda - or we deploy into the client’s own cloud. We never buy hardware, and neither should most of our clients for an early system. Asset-light isn’t a cost-cutting story; it’s what lets a senior team stay focused on the work that’s actually hard: the integration, the guardrails, and the last mile into production.