An AI workforce for the back office
Chatbots answer questions. Agents do the work. What it takes to let software actually execute reconciliation, onboarding and disputes - safely - inside a bank.
The first wave of enterprise AI was conversational: assistants that retrieve a document and summarize it. Useful, but it leaves a person doing the actual work - opening the next system, keying in the values, clicking approve.
The shift now underway is from answering to executing. An agent doesn’t just tell an operator how to clear a reconciliation break; it pulls the two ledgers, matches the entries, drafts the adjustment, and routes the exceptions to a human. That’s a workforce, not a chatbot. And the back office of a bank is full of work shaped exactly for it:
- Reconciliation - matching, breaks, adjustments
- Payments operations - exceptions, returns, repairs
- KYC & onboarding - document intake, checks, enrichment
- AML & fraud - alert triage, case packaging
- Disputes & chargebacks - evidence gathering, representment
- Procurement & HR ops - the long tail of internal workflows
One conversational interface on top, real systems underneath.
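To make the reconciliation example above concrete, here is a minimal sketch of the matching step. It assumes both ledgers share a unique transaction reference; the `Entry` type, the `reconcile` function and the exception labels are illustrative names, not part of any real system. Amount mismatches become drafted adjustments, and entries with no counterpart become exceptions for a human. Nothing is posted.

```python
from dataclasses import dataclass
from decimal import Decimal


@dataclass(frozen=True)
class Entry:
    ref: str          # assumed: a reference shared by both ledgers, unique per ledger
    amount: Decimal   # Decimal, not float, to avoid rounding surprises with money


def reconcile(ledger_a, ledger_b):
    """Match entries by reference.

    Returns (matched_refs, drafted_adjustments, exceptions).
    Adjustments are drafts only; this function never posts anything.
    """
    b_by_ref = {e.ref: e for e in ledger_b}
    matched, adjustments, exceptions = [], [], []

    for a in ledger_a:
        b = b_by_ref.pop(a.ref, None)
        if b is None:
            # No counterpart at all: a person has to look at this one.
            exceptions.append((a.ref, "missing in ledger B"))
        elif a.amount == b.amount:
            matched.append(a.ref)
        else:
            # Same reference, different amount: draft the difference (B minus A).
            adjustments.append((a.ref, b.amount - a.amount))

    # Anything left in B was never seen in A.
    exceptions.extend((ref, "missing in ledger A") for ref in b_by_ref)
    return matched, adjustments, exceptions
```

In this sketch the agent's job is to assemble the inputs and present the drafts; whether a drafted adjustment is ever posted is decided elsewhere.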
The hard part isn’t the agent. It’s the guardrails.
Letting software take actions inside a bank is a governance problem before it’s an AI problem. An agent that can move money is not allowed to be probabilistic about it. The architecture we keep coming back to draws a hard line:
The model decides what to do. Deterministic code decides whether it’s allowed and carries it out.
Concretely, that means:
- Runtime authorization. Every action an agent attempts is checked against policy at execution time - who, what, how much, under which conditions - by code, not by a prompt.
- Deterministic routing for money-moving steps. The irreversible actions go through typed, audited functions with hard limits. The LLM proposes; the rail disposes.
- Human-in-the-loop where it matters. Above a value threshold, or below a confidence bar, the agent stops and asks a person. Escalation is a feature, not a failure.
- Full observability and audit. Every step, tool call and decision is traced. When a regulator asks “why did this happen,” the answer is a log, not a shrug.
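The four points above can be sketched as a single deterministic gate that sits between the model's proposal and any execution. This is an illustration under stated assumptions, not a reference implementation: `ProposedAction`, `Policy`, `Gate`, the field names and the specific limits are all hypothetical, and a real deployment would load policy from a governed store and write the audit trail to durable, append-only storage rather than a list in memory.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from decimal import Decimal
from enum import Enum


class Decision(Enum):
    EXECUTE = "execute"
    ESCALATE = "escalate"   # stop and ask a human
    DENY = "deny"


@dataclass(frozen=True)
class ProposedAction:
    agent_id: str
    action: str
    amount: Decimal
    confidence: float       # assumed: a 0..1 score supplied with the proposal


@dataclass(frozen=True)
class Policy:
    allowed_actions: frozenset
    hard_limit: Decimal         # never exceeded, even with approval
    review_threshold: Decimal   # above this, a human must approve
    min_confidence: float       # below this, a human must approve


@dataclass
class Gate:
    policy: Policy
    audit_log: list = field(default_factory=list)

    def authorize(self, p: ProposedAction) -> Decision:
        """Plain code, no model call: the same input always gives the same decision."""
        if p.action not in self.policy.allowed_actions:
            decision, reason = Decision.DENY, "action not permitted"
        elif p.amount > self.policy.hard_limit:
            decision, reason = Decision.DENY, "over hard limit"
        elif p.amount > self.policy.review_threshold:
            decision, reason = Decision.ESCALATE, "over review threshold"
        elif p.confidence < self.policy.min_confidence:
            decision, reason = Decision.ESCALATE, "low confidence"
        else:
            decision, reason = Decision.EXECUTE, "within policy"

        # Every decision is recorded, including denials and escalations.
        self.audit_log.append({
            "at": datetime.now(timezone.utc).isoformat(),
            "agent": p.agent_id,
            "action": p.action,
            "amount": str(p.amount),
            "decision": decision.value,
            "reason": reason,
        })
        return decision
```

The design choice worth noting is that the model never sees or influences the limits: it only produces a `ProposedAction`, and the gate's answer is reproducible from the log entry alone.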
Tools like LangGraph give you the controllable execution graph; MCP gives agents a clean, typed way to reach your systems; Langfuse and eval harnesses keep you honest about whether it’s actually working. But the integration layer - the Go services, gRPC, the connection into core and payment systems - is where the reliability is won or lost. That’s old-fashioned engineering, and it’s the part demos skip.
Start narrow, instrument everything
The common failure mode is trying to automate a whole department on day one. The pattern that works: take one workflow with a clear owner, a measurable outcome and a tolerable blast radius, meaning a mistake stays small and recoverable. Give the agent read access first, then a few reversible write actions behind authorization, then - only once the evals and the operators both trust it - the higher-stakes steps.
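One way to make that staged rollout explicit is to assign every tool a tier and grant the agent a single tier at a time. The sketch below assumes three tiers and uses made-up tool names (`fetch_ledger`, `draft_adjustment`, `post_adjustment`); the real mapping would come from the workflow owner.

```python
from enum import IntEnum


class Tier(IntEnum):
    READ_ONLY = 1          # stage 1: look, don't touch
    REVERSIBLE_WRITE = 2   # stage 2: writes that can be undone
    HIGH_STAKES = 3        # stage 3: only after evals and operators both trust it


# Hypothetical tools for a reconciliation workflow, each pinned to a tier.
TOOL_TIER = {
    "fetch_ledger": Tier.READ_ONLY,
    "draft_adjustment": Tier.REVERSIBLE_WRITE,
    "post_adjustment": Tier.HIGH_STAKES,
}


def allowed_tools(granted: Tier) -> list[str]:
    """Return the tools available at the granted tier, including all lower tiers."""
    return sorted(name for name, tier in TOOL_TIER.items() if tier <= granted)
```

Promoting the agent is then a one-line, reviewable change to the granted tier rather than a prompt edit.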
Done this way, the “AI workforce” stops being a slogan and becomes what it should be: boring, reliable software that quietly clears the queue overnight.
Want this in production?
We help banks and fintechs ship AI privately, on their own cloud.