An AI workforce for the back office

The first wave of enterprise AI was conversational: assistants that retrieve a document and summarize it. Useful, but it leaves a person doing the actual work - opening the next system, keying in the values, clicking approve.

The shift now underway is from answering to executing. An agent doesn’t just tell an operator how to clear a reconciliation break; it pulls the two ledgers, matches the entries, drafts the adjustment, and routes the exceptions to a human. That’s a workforce, not a chatbot. And the back office of a bank is full of work shaped exactly for it:

Reconciliation - matching, breaks, adjustments
Payments operations - exceptions, returns, repairs
KYC & onboarding - document intake, checks, enrichment
AML & fraud - alert triage, case packaging
Disputes & chargebacks - evidence gathering, representment
Procurement & HR ops - the long tail of internal workflows

One conversational interface on top, real systems underneath.
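
To make the reconciliation example above concrete, here is a minimal sketch of the deterministic core such an agent might call. Everything in it is an assumption made for illustration: the `Entry` type, the idea that both ledgers share a unique `ref`, and the one-cent `tolerance` below which a difference becomes a drafted adjustment rather than an exception. A real reconciliation would have far richer matching rules.

```python
from dataclasses import dataclass
from decimal import Decimal


@dataclass(frozen=True)
class Entry:
    ref: str          # hypothetical reference shared by both ledgers, assumed unique
    amount: Decimal


def reconcile(ledger_a, ledger_b, tolerance=Decimal("0.01")):
    """Match two ledgers by reference.

    Returns (matched, adjustments, exceptions). Adjustments are only
    drafted here, never posted; exceptions are meant for a human queue.
    """
    b_by_ref = {e.ref: e for e in ledger_b}
    matched, adjustments, exceptions = [], [], []
    for a in ledger_a:
        b = b_by_ref.pop(a.ref, None)
        if b is None:
            exceptions.append((a.ref, "missing in ledger B"))
        elif a.amount == b.amount:
            matched.append(a.ref)
        elif abs(a.amount - b.amount) <= tolerance:
            # Small difference: draft an adjustment for later approval.
            adjustments.append((a.ref, b.amount - a.amount))
        else:
            exceptions.append((a.ref, "amount mismatch"))
    # Whatever is left in B had no counterpart in A.
    exceptions.extend((ref, "missing in ledger A") for ref in b_by_ref)
    return matched, adjustments, exceptions
```

The point of the shape is that the agent orchestrates the steps, while matching and tolerances live in plain code that can be tested and audited on its own.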

The hard part isn’t the agent. It’s the guardrails.

Letting software take actions inside a bank is a governance problem before it’s an AI problem. An agent that can move money is not allowed to be probabilistic about it. The architecture we keep coming back to draws a hard line:

The model decides what to do. Deterministic code decides whether it’s allowed and carries it out.

Concretely, that means:

Runtime authorization. Every action an agent attempts is checked against policy at execution time - who, what, how much, under which conditions - by code, not by a prompt.
Deterministic routing for money-moving steps. The irreversible actions go through typed, audited functions with hard limits. The LLM proposes; the rail disposes.
Human-in-the-loop where it matters. Above a value threshold, or below a confidence bar, the agent stops and asks. Escalation is a feature, not a failure.
Full observability and audit. Every step, tool call and decision is traced. When a regulator asks “why did this happen,” the answer is a log, not a shrug.
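
The four guardrails above can be sketched in a few dozen lines. This is an illustration under stated assumptions, not a reference design: `ProposedAction`, `Policy`, `authorize`, `handle` and the in-memory `AUDIT_LOG` are all hypothetical names, and a production system would back them with a real policy engine and an append-only audit store. What it shows is the split itself: the model only produces a proposal, and plain code decides whether to execute, escalate or deny.

```python
from dataclasses import dataclass
from decimal import Decimal
from enum import Enum


class Decision(Enum):
    EXECUTE = "execute"
    ESCALATE = "escalate"   # stop and ask a human
    DENY = "deny"


@dataclass(frozen=True)
class ProposedAction:
    agent: str
    name: str               # e.g. "post_adjustment" (hypothetical)
    amount: Decimal
    confidence: float       # model-reported confidence, 0..1


@dataclass(frozen=True)
class Policy:
    allowed_actions: frozenset
    hard_limit: Decimal     # never exceeded, even with approval
    review_above: Decimal   # amounts above this need a human
    min_confidence: float


AUDIT_LOG = []  # stand-in for an append-only audit store


def authorize(action, policy):
    """Deterministic check run at execution time; no model involved."""
    if action.name not in policy.allowed_actions:
        return Decision.DENY, "action not permitted"
    if action.amount > policy.hard_limit:
        return Decision.DENY, "amount exceeds hard limit"
    if action.amount > policy.review_above:
        return Decision.ESCALATE, "amount above review threshold"
    if action.confidence < policy.min_confidence:
        return Decision.ESCALATE, "confidence below bar"
    return Decision.EXECUTE, "within policy"


def handle(action, policy, execute):
    """Authorize, record the decision, and execute only if allowed."""
    decision, reason = authorize(action, policy)
    AUDIT_LOG.append({
        "agent": action.agent,
        "action": action.name,
        "amount": str(action.amount),
        "decision": decision.value,
        "reason": reason,
    })
    if decision is Decision.EXECUTE:
        execute(action)     # the typed, audited function for this action
    return decision
```

Every proposal is logged, including the denied and escalated ones, so the trail also explains what the agent was stopped from doing.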

Tools like LangGraph give you the controllable execution graph; MCP gives agents a clean, typed way to reach your systems; Langfuse and eval harnesses keep you honest about whether the agent is actually working. But the integration layer - the Go services, the gRPC interfaces, the connections into core banking and payment systems - is where reliability is won or lost. That’s old-fashioned engineering, and it’s the part demos skip.

Start narrow, instrument everything

The failure mode is trying to automate a whole department on day one. The pattern that works: take one workflow with a clear owner, a measurable outcome and a tolerable blast radius. Give the agent read access first, then a few reversible write actions behind authorization, then - only once the evals and the operators both trust it - the higher-stakes steps.
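
One way to encode that progression is as explicit stages with a promotion gate. The sketch below is an assumption-laden illustration: the stage names, the tool names in `TOOL_MIN_STAGE` and the 0.98 eval pass rate are invented for the example, and the real gate would be whatever your evals and operators agree on.

```python
from enum import IntEnum


class Stage(IntEnum):
    READ_ONLY = 1
    REVERSIBLE_WRITES = 2
    HIGH_STAKES = 3


# Hypothetical tool names mapped to the minimum stage that unlocks them.
TOOL_MIN_STAGE = {
    "read_ledger": Stage.READ_ONLY,
    "draft_adjustment": Stage.REVERSIBLE_WRITES,
    "release_payment": Stage.HIGH_STAKES,
}


def allowed_tools(stage):
    """Tools the agent may call at the given stage, in sorted order."""
    return sorted(t for t, s in TOOL_MIN_STAGE.items() if s <= stage)


def next_stage(stage, eval_pass_rate, operator_signoff, min_pass_rate=0.98):
    """Promote by one stage only when evals and operators both agree."""
    if stage == Stage.HIGH_STAKES:
        return Stage.HIGH_STAKES
    if eval_pass_rate >= min_pass_rate and operator_signoff:
        return Stage(stage + 1)
    return Stage(stage)
```

Promotion moves one step at a time and needs both signals, so a strong eval score alone never unlocks the higher-stakes tools.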

Done this way, the “AI workforce” stops being a slogan and becomes what it should be: boring, reliable software that quietly clears the queue overnight.