Four deterministic steps · one human in the loop · zero lock-in
The challenge
The client’s accounts-payable team was processing around 4,000 supplier invoices per month — PDFs, scans, and the occasional emailed photo. Each one took an operator 12–16 minutes: download, re-key, match to a PO, route for approval. Errors slipped through. New hires took weeks to ramp.
They had tried an OCR vendor the year before. The extraction was 82% accurate on a good day, which meant a human still had to re-verify every single field. The tool had become another tab to keep open rather than a workflow that actually shipped results.
The brief to us was blunt: “Ship something that actually saves us hours, not another dashboard.”
Our roadmap
- Phase 01, Week 1: Discovery & data. Shadowed 3 operators, labelled a gold set of 500 real invoices, benchmarked two LLMs against the status quo.
- Phase 02, Week 2: Agent & extraction. Built the multi-step agent: classify, extract, validate, match to PO. Structured outputs, JSON schema, deterministic retries.
- Phase 03, Week 3: Human-in-the-loop UI. Shipped a single-pane review UI with inline corrections. One keystroke approves; edits feed the evaluation set automatically.
- Phase 04, Week 4: Harden & launch. Cost caps, PII guardrails, DATEV export, on-call runbook. Rolled out to the full ops team on day 28.
How we shipped it
Week 1 — Discovery that pays for itself
Before writing a single prompt we sat with three AP operators for a day each. The real bottleneck wasn’t extraction — it was the six little decisions a human made per invoice (is this a duplicate? is the VAT ID in our vendor master? does the PO still have budget?). We captured those decisions as the spec for the agent, not the other way round.
We labelled 500 representative invoices as a gold set. That gold set stayed the source of truth for every benchmark that followed.
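As an illustration of what benchmarking against a gold set can mean in practice, here is a minimal sketch of a field-level accuracy score. The `FieldMap` shape, the `fieldLevelAccuracy` name, and the trim-then-compare rule are assumptions made for this example, not a description of the client's evaluation harness.

```typescript
// Hypothetical shape: one flat record of field name -> string value per invoice.
type FieldMap = Record<string, string>;

interface AccuracyReport {
  correct: number;
  total: number;
  accuracy: number; // correct / total, or 0 when the gold set has no fields
}

// Compare predictions with gold labels, field by field.
// A field that is missing from the prediction counts as wrong.
function fieldLevelAccuracy(
  gold: Record<string, FieldMap>,
  predicted: Record<string, FieldMap>,
): AccuracyReport {
  let correct = 0;
  let total = 0;
  for (const invoiceId of Object.keys(gold)) {
    const goldFields: FieldMap = gold[invoiceId] ?? {};
    const predictedFields: FieldMap = predicted[invoiceId] ?? {};
    for (const field of Object.keys(goldFields)) {
      const expected: string | undefined = goldFields[field];
      const actual: string | undefined = predictedFields[field];
      if (expected === undefined) continue;
      total += 1;
      // Assumed normalisation: ignore leading and trailing whitespace only.
      if (actual !== undefined && actual.trim() === expected.trim()) {
        correct += 1;
      }
    }
  }
  return { correct, total, accuracy: total === 0 ? 0 : correct / total };
}
```

Scoring per field rather than per invoice keeps one bad VAT ID from hiding nine correct fields, which is the granularity an operator actually reviews at.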
Week 2 — The agent, built boringly
Four deterministic steps, not one giant prompt:
- Classify — is this an invoice, a credit note, a statement, or something else?
- Extract — structured JSON against a Zod schema (header, line items, totals, VAT).
- Validate — sums add up, VAT IDs check out against the EU VIES registry, currency matches the vendor master.
- Match — propose a purchase order and a cost centre, with a confidence score.
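The arithmetic and currency checks in the Validate step can be sketched in plain TypeScript. This is an illustration under assumptions, not the production code: the `ExtractedInvoice` shape, the integer-cent amounts, and the `validateInvoice` name are hypothetical, and the VIES lookup is left out because it is a network call.

```typescript
// Hypothetical extraction result. Amounts are integer cents to avoid float drift.
interface LineItem {
  description: string;
  netCents: number;
}

interface ExtractedInvoice {
  currency: string;
  lineItems: LineItem[];
  netTotalCents: number;
  vatCents: number;
  grossTotalCents: number;
}

interface ValidationIssue {
  field: string;
  message: string;
}

// Returns an empty array when every check passes.
function validateInvoice(
  invoice: ExtractedInvoice,
  vendorCurrency: string, // assumed to come from the vendor master
): ValidationIssue[] {
  const issues: ValidationIssue[] = [];

  // Check 1: line items add up to the stated net total.
  const lineSum = invoice.lineItems.reduce((sum, item) => sum + item.netCents, 0);
  if (lineSum !== invoice.netTotalCents) {
    issues.push({
      field: "netTotalCents",
      message: `line items sum to ${lineSum}, header says ${invoice.netTotalCents}`,
    });
  }

  // Check 2: net plus VAT equals the gross total.
  if (invoice.netTotalCents + invoice.vatCents !== invoice.grossTotalCents) {
    issues.push({ field: "grossTotalCents", message: "net + VAT does not equal gross" });
  }

  // Check 3: invoice currency matches the vendor master.
  if (invoice.currency !== vendorCurrency) {
    issues.push({
      field: "currency",
      message: `expected ${vendorCurrency}, got ${invoice.currency}`,
    });
  }

  return issues;
}
```

Returning a list of issues instead of a boolean lets a review UI point at the exact field that failed.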
Each step is independently evaluated, independently retryable, and independently cacheable. That’s the boring part that makes it cheap to run and easy to debug at 2 a.m.
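To make "independently retryable and independently cacheable" concrete, here is one way a step wrapper could look. It is a sketch under assumptions: `withCacheAndRetry` is a hypothetical helper, the cache lives in memory, and steps are synchronous for brevity, whereas real model calls would be asynchronous and a workflow engine would normally own the retries.

```typescript
// A pipeline step is just a function from input to output.
type Step<I, O> = (input: I) => O;

// Wrap a step so that repeated inputs are served from a cache and
// failures are retried up to maxAttempts times before giving up.
function withCacheAndRetry<I, O>(
  step: Step<I, O>,
  keyOf: (input: I) => string, // assumed: a stable key, e.g. a content hash
  maxAttempts: number,
): Step<I, O> {
  const cache = new Map<string, O>();
  return (input: I): O => {
    const key = keyOf(input);
    const cached = cache.get(key);
    // An output of undefined is never treated as a cache hit.
    if (cached !== undefined) return cached;

    let lastError: unknown = new Error("maxAttempts must be at least 1");
    for (let attempt = 1; attempt <= maxAttempts; attempt++) {
      try {
        const output = step(input);
        cache.set(key, output);
        return output;
      } catch (error) {
        lastError = error; // remember the failure and try again
      }
    }
    throw lastError;
  };
}
```

Because each of the four steps would get its own wrapper, a failed Match never forces a second Extract, which is where most of the cost saving in a design like this would come from.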
Week 3 — The review UI nobody dreaded using
The operator sees the PDF on the left, the extracted fields on the right, confidence-coded. Green fields need a glance. Orange fields need a click. Red fields are pre-focused for editing. The whole thing is keyboard-driven — Tab, Enter, done.
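The colour coding boils down to a small mapping from confidence score to review state. The sketch below shows the idea; the thresholds of 0.95 and 0.8 and the names `reviewState` and `firstFieldToFocus` are illustrative assumptions, since the cut-offs used in the shipped UI are not stated here.

```typescript
type ReviewState = "green" | "orange" | "red";

interface ExtractedField {
  name: string;
  value: string;
  confidence: number; // assumed to be in the range 0..1
}

// Assumed cut-offs, chosen only for illustration.
const GLANCE_THRESHOLD = 0.95;
const CLICK_THRESHOLD = 0.8;

// Map a confidence score to the colour shown next to the field.
function reviewState(confidence: number): ReviewState {
  if (confidence >= GLANCE_THRESHOLD) return "green";
  if (confidence >= CLICK_THRESHOLD) return "orange";
  return "red";
}

// Name of the first red field, which the UI would pre-focus, or null if none.
function firstFieldToFocus(fields: ExtractedField[]): string | null {
  for (const field of fields) {
    if (reviewState(field.confidence) === "red") return field.name;
  }
  return null;
}
```

Keeping the thresholds in two named constants makes them easy to tune against the evaluation set rather than by feel.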
Every correction the operator makes writes back into a labelled evaluation set. The agent’s accuracy keeps improving without anyone running a training job.
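A feedback loop like this can be as simple as diffing what the agent proposed against what the operator approved. The following is a minimal sketch, assuming flat string fields; `EvalExample` and `correctionsToExamples` are hypothetical names, and the storage layer is omitted.

```typescript
type Fields = Record<string, string>;

// One labelled example per field the operator changed.
interface EvalExample {
  invoiceId: string;
  field: string;
  predicted: string; // empty string when the agent produced no value
  corrected: string;
}

// Diff the agent's output against the operator-approved values.
// Fields the operator left untouched produce no example.
function correctionsToExamples(
  invoiceId: string,
  predicted: Fields,
  approved: Fields,
): EvalExample[] {
  const examples: EvalExample[] = [];
  for (const field of Object.keys(approved)) {
    const after: string | undefined = approved[field];
    if (after === undefined) continue;
    const before: string | undefined = predicted[field];
    if (before !== after) {
      examples.push({ invoiceId, field, predicted: before ?? "", corrected: after });
    }
  }
  return examples;
}
```

Each example records both the wrong and the right value, so the same record can serve as a regression test for later prompt or model changes.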
Week 4 — Hardening before the handover
- Cost caps per invoice and per day, hard-enforced.
- PII redaction on anything stored in logs.
- DATEV export in the format the client’s accountants already used.
- An on-call runbook, a Grafana board, and a Slack alert for anything the agent flagged as ambiguous twice in a row.
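Hard-enforced cost caps can be expressed as a guard that every model call must pass before it runs. This is a sketch under assumptions: `CostGuard` and `tryCharge` are hypothetical names, spend is tracked in memory, and costs are integers in whatever unit the billing data uses.

```typescript
interface CostCaps {
  perInvoice: number; // maximum spend for a single invoice
  perDay: number; // maximum spend across all invoices in one day
}

class CostGuard {
  private readonly caps: CostCaps;
  private spentToday = 0;
  private readonly spentByInvoice = new Map<string, number>();

  constructor(caps: CostCaps) {
    this.caps = caps;
  }

  // Records the spend and returns true only if both caps still hold.
  // Returns false, without recording anything, if either cap would be exceeded.
  tryCharge(invoiceId: string, cost: number): boolean {
    const invoiceSpend = (this.spentByInvoice.get(invoiceId) ?? 0) + cost;
    const daySpend = this.spentToday + cost;
    if (invoiceSpend > this.caps.perInvoice || daySpend > this.caps.perDay) {
      return false;
    }
    this.spentByInvoice.set(invoiceId, invoiceSpend);
    this.spentToday = daySpend;
    return true;
  }

  // Assumed to be called by a scheduler at the start of each day.
  resetDay(): void {
    this.spentToday = 0;
    this.spentByInvoice.clear();
  }
}
```

Checking before the call rather than after is what makes a cap hard: a rejected charge means the request is never sent, and the invoice can fall back to human review instead.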
We went live on a Thursday. The ops lead DM’d us on Friday evening: “First week in two years I’m leaving on time.”
The architecture
A readable stack is a maintainable stack. Nothing exotic.
- Ingestion: Microsoft 365 connector + a watched inbox → S3-compatible bucket.
- Agent runtime: TypeScript, Anthropic Claude Sonnet for reasoning, structured outputs against a Zod schema, strict JSON validation.
- Orchestration: Temporal for retries, idempotency, and human-in-the-loop waits.
- Validation: EU VIES for VAT IDs, the client’s vendor master as a read-only source of truth.
- Review UI: Astro + HTMX on Cloudflare, deployed in under a minute per change.
- Export: DATEV CSV, posted directly into the client’s existing ERP import queue.
- Observability: OpenTelemetry traces, token-spend dashboards, eval regressions wired into CI.
No lock-in. The client owns the repo, the infra, and the evaluation set from day 1.
The results
We launched four weeks after kickoff. Measured over the first 90 days in production:
- Average handling time dropped from 14 minutes to 3 minutes per invoice.
- 99.2% field-level extraction accuracy on the gold set — and rising as corrections feed back.
- 180 operator hours freed per month — redeployed into vendor-relationship work the team had been postponing for a year.
- Zero incidents in the first 90 days. The agent refused to guess on 2.1% of invoices, which humans handled in the review UI without friction.
- EU AI Act ready — classified as limited risk, documented accordingly, with a clean audit trail for every automated decision.
What’s next
The same agent pattern is now being extended to expense reports and customer onboarding KYC — reusing 80% of the infrastructure we shipped in month one. That’s the compounding return of craft-first engineering: the second agent costs a fraction of the first.
If you’re staring at a repetitive back-office workflow and wondering whether an agent is worth the risk — talk to us. We’ll tell you honestly if it’s the wrong fit, and if it’s the right fit we’ll be in production before most vendors have scheduled a second meeting.