
Client · DACH FinTech scale-up

Cut invoice processing time by 78% with a human-in-the-loop AI agent

A German FinTech was drowning in 4,000 supplier invoices a month. In four weeks we shipped an LLM-powered triage agent with extraction, validation, and a clean human review UI — cutting handling time from 14 minutes to 3 per document and freeing 180 operator hours every month.

78%
Faster per invoice
180 h
Operator hours saved / month
99.2%
Extraction accuracy
4 weeks
From kickoff to production
Services: AI automation & agents · LLM engineering · Workflow integration
Inbox → 01 Classify → 02 Extract → 03 Validate → 04 Match → ERP

Four deterministic steps · one human in the loop · zero lock-in

The challenge

The client’s accounts-payable team was processing around 4,000 supplier invoices per month — PDFs, scans, and the occasional emailed photo. Each one took an operator 12–16 minutes: download, re-key, match to a PO, route for approval. Errors slipped through. New hires took weeks to ramp.

They had tried an OCR vendor the year before. The extraction was 82% accurate on a good day, which meant a human still had to re-verify every single field. The tool had become another tab to keep open rather than a workflow that actually shipped results.

The brief to us was blunt: “Ship something that actually saves us hours, not another dashboard.”

Our roadmap

  1. Phase 01 · Week 1: Discovery & data

    Shadowed 3 operators, labelled a gold set of 500 real invoices, benchmarked two LLMs against the status quo.

  2. Phase 02 · Week 2: Agent & extraction

    Built the multi-step agent: classify, extract, validate, match to PO. Structured outputs, JSON schema, deterministic retries.

  3. Phase 03 · Week 3: Human-in-the-loop UI

    Shipped a single-pane review UI with inline corrections. One keystroke approves; edits feed the evaluation set automatically.

  4. Phase 04 · Week 4: Harden & launch

    Cost caps, PII guardrails, DATEV export, on-call runbook. Rolled out to the full ops team on day 28.

How we shipped it

Week 1 — Discovery that pays for itself

Before writing a single prompt we sat with three AP operators for a day each. The real bottleneck wasn’t extraction — it was the six little decisions a human made per invoice (is this a duplicate? is the VAT ID in our vendor master? does the PO still have budget?). We captured those decisions as the spec for the agent, not the other way round.

We labelled 500 representative invoices as a gold set. That gold set stayed the source of truth for every benchmark that followed.

Week 2 — The agent, built boringly

Four deterministic steps, not one giant prompt:

  1. Classify — is this an invoice, a credit note, a statement, or something else?
  2. Extract — structured JSON against a Zod schema (header, line items, totals, VAT).
  3. Validate — sums add up, VAT IDs check out against the EU VIES registry, currency matches the vendor master.
  4. Match — propose a purchase order and a cost centre, with a confidence score.
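To make the Validate step concrete, here is a minimal sketch of the arithmetic and vendor-master checks. It is not the production code: the field names, the `VendorRecord` shape, and the one-cent tolerance are assumptions made for illustration, and the VIES lookup is left out because it is a network call.

```typescript
// Hypothetical shapes; the real extraction schema is not shown in this case study.
interface LineItem { description: string; netAmount: number; vatRate: number }
interface Invoice {
  vendorId: string;
  currency: string;
  lineItems: LineItem[];
  netTotal: number;
  vatTotal: number;
  grossTotal: number;
}
interface VendorRecord { vendorId: string; currency: string }

// Compare amounts in integer cents to avoid floating-point drift.
const toCents = (amount: number): number => Math.round(amount * 100);

// Returns a list of human-readable problems; an empty list means the invoice passed.
function validateInvoice(inv: Invoice, vendor: VendorRecord, toleranceCents = 1): string[] {
  const problems: string[] = [];
  const netSum = inv.lineItems.reduce((sum, li) => sum + toCents(li.netAmount), 0);
  const vatSum = inv.lineItems.reduce((sum, li) => sum + toCents(li.netAmount * li.vatRate), 0);

  if (Math.abs(netSum - toCents(inv.netTotal)) > toleranceCents) {
    problems.push("net total does not match line items");
  }
  if (Math.abs(vatSum - toCents(inv.vatTotal)) > toleranceCents) {
    problems.push("VAT total does not match line items");
  }
  const grossDiff = toCents(inv.netTotal) + toCents(inv.vatTotal) - toCents(inv.grossTotal);
  if (Math.abs(grossDiff) > toleranceCents) {
    problems.push("gross total is not net plus VAT");
  }
  if (inv.currency !== vendor.currency) {
    problems.push("currency differs from vendor master");
  }
  return problems;
}
```

Returning a list of problems rather than a boolean gives the review UI something specific to show the operator.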

Each step is independently evaluated, independently retryable, and independently cacheable. That’s the boring part that makes it cheap to run and easy to debug at 2 a.m.
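What "independently retryable and independently cacheable" can mean in code is sketched below. In the shipped system Temporal handles retries (see the architecture section); this dependency-free wrapper only illustrates the idea. The names `withRetryAndCache` and `keyOf` are hypothetical, and the sketch is synchronous for brevity, whereas real model calls would be async.

```typescript
// A pipeline step maps an input to an output and may throw on a transient failure.
type Step<I, O> = (input: I) => O;

// Wrap a step so that results are cached by key and failures are retried
// up to maxAttempts times before the error is surfaced.
function withRetryAndCache<I, O>(
  step: Step<I, O>,
  keyOf: (input: I) => string,
  maxAttempts = 3,
): Step<I, O> {
  const cache = new Map<string, O>();
  return (input: I): O => {
    const key = keyOf(input);
    if (cache.has(key)) return cache.get(key) as O; // cache hit: the step is not re-run
    let lastError: unknown;
    for (let attempt = 1; attempt <= maxAttempts; attempt++) {
      try {
        const result = step(input);
        cache.set(key, result); // only successful results are cached
        return result;
      } catch (err) {
        lastError = err;
      }
    }
    throw new Error(`step failed after ${maxAttempts} attempts: ${String(lastError)}`);
  };
}
```

Because each step gets its own wrapper, a failure in Match does not force Classify or Extract to run again for the same document.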

Week 3 — The review UI nobody dreaded using

The operator sees the PDF on the left, the extracted fields on the right, confidence-coded. Green fields need a glance. Orange fields need a click. Red fields are pre-focused for editing. The whole thing is keyboard-driven — Tab, Enter, done.
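The colour coding can be as simple as two thresholds on the per-field confidence score. The sketch below assumes cut-offs of 0.95 and 0.80, which are illustrative values and not the ones used in the project; the `ExtractedField` shape is also hypothetical.

```typescript
type Triage = "green" | "orange" | "red";

// Hypothetical shape of one extracted field as the review UI might receive it.
interface ExtractedField { name: string; value: string; confidence: number }

// Map a confidence score in [0, 1] to a colour. Thresholds are assumptions.
function triage(confidence: number, greenAt = 0.95, orangeAt = 0.8): Triage {
  if (confidence >= greenAt) return "green";
  if (confidence >= orangeAt) return "orange";
  return "red";
}

// The first red field is the one to pre-focus for editing, if there is one.
function firstFieldToFocus(fields: ExtractedField[]): ExtractedField | undefined {
  return fields.find((field) => triage(field.confidence) === "red");
}
```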

Every correction the operator makes writes back into a labelled evaluation set. The agent’s accuracy keeps improving without anyone running a training job.
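One plausible shape for that write-back is an upsert keyed by invoice and field, so the latest operator correction becomes the expected value in the evaluation set. The record shapes and the key format below are assumptions, not the project's actual storage model.

```typescript
// Hypothetical record shapes for a correction and an evaluation example.
interface Correction { invoiceId: string; field: string; predicted: string; corrected: string }
interface EvalExample { invoiceId: string; field: string; expected: string }

// Upsert the corrected value; the latest correction for an (invoice, field) pair wins.
function applyCorrection(evalSet: Map<string, EvalExample>, c: Correction): void {
  const key = `${c.invoiceId}:${c.field}`;
  evalSet.set(key, { invoiceId: c.invoiceId, field: c.field, expected: c.corrected });
}
```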

Week 4 — Hardening before the handover

  • Cost caps per invoice and per day, hard-enforced.
  • PII redaction on anything stored in logs.
  • DATEV export in the format the client’s accountants already used.
  • An on-call runbook, a Grafana board, and a Slack alert for anything the agent flagged as ambiguous twice in a row.
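"Hard-enforced" here means the check happens before the spend, not in a report afterwards. A minimal sketch of such a guard follows; `createSpendGuard` and its cent-based units are hypothetical, and a real deployment would persist the counters instead of keeping them in memory.

```typescript
interface SpendGuard {
  charge(invoiceId: string, costCents: number): void;
  spentToday(): number;
  resetDay(): void;
}

// Tracks spend per invoice and per day; charge() throws before recording
// anything if either cap would be exceeded.
function createSpendGuard(perInvoiceCapCents: number, dailyCapCents: number): SpendGuard {
  let daily = 0;
  const perInvoice = new Map<string, number>();
  return {
    charge(invoiceId, costCents) {
      const invoiceSpend = (perInvoice.get(invoiceId) ?? 0) + costCents;
      if (invoiceSpend > perInvoiceCapCents) {
        throw new Error(`invoice ${invoiceId} would exceed its cap`);
      }
      if (daily + costCents > dailyCapCents) {
        throw new Error("daily cap would be exceeded");
      }
      perInvoice.set(invoiceId, invoiceSpend);
      daily += costCents;
    },
    spentToday: () => daily,
    resetDay() {
      daily = 0;
      perInvoice.clear();
    },
  };
}
```

A rejected charge is a natural point to route the invoice to the human review queue instead of retrying the model call.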

We went live on a Thursday. The ops lead DM’d us on Friday evening: “First week in two years I’m leaving on time.”

The architecture

A readable stack is a maintainable stack. Nothing exotic.

  • Ingestion: Microsoft 365 connector + a watched inbox → S3-compatible bucket.
  • Agent runtime: TypeScript, Anthropic Claude Sonnet for reasoning, structured outputs against a Zod schema, strict JSON validation.
  • Orchestration: Temporal for retries, idempotency, and human-in-the-loop waits.
  • Validation: EU VIES for VAT IDs, the client’s vendor master as a read-only source of truth.
  • Review UI: Astro + HTMX on Cloudflare, deployed in under a minute per change.
  • Export: DATEV CSV, posted directly into the client’s existing ERP import queue.
  • Observability: OpenTelemetry traces, token-spend dashboards, eval regressions wired into CI.
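As an illustration of "eval regressions wired into CI", the gate can be a small comparison between the accuracy of the current build and a stored baseline. The function name and the 0.5-point allowed drop are assumptions; a CI script would exit non-zero when `ok` is false.

```typescript
interface GateResult { ok: boolean; message: string }

// Fail the build if accuracy on the evaluation set drops by more than maxDrop
// relative to the baseline. Accuracies are fractions in [0, 1].
function checkRegression(baseline: number, current: number, maxDrop = 0.005): GateResult {
  const drop = baseline - current;
  if (drop > maxDrop) {
    return {
      ok: false,
      message: `accuracy fell from ${baseline.toFixed(3)} to ${current.toFixed(3)}`,
    };
  }
  return { ok: true, message: `accuracy ${current.toFixed(3)} is within tolerance` };
}
```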

No lock-in. The client owns the repo, the infra, and the evaluation set from day one.

The results

Measured over the 90 days after go-live, which came four weeks after kickoff:

  • Average handling time dropped from 14 minutes to 3 minutes per invoice.
  • 99.2% field-level extraction accuracy on the gold set — and rising as corrections feed back.
  • 180 operator hours freed per month — redeployed into vendor-relationship work the team had been postponing for a year.
  • Zero incidents in the first 90 days. The agent refused to guess on 2.1% of invoices, which humans handled in the review UI without friction.
  • EU AI Act ready — classified as limited risk, documented accordingly, with a clean audit trail for every automated decision.
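For readers unfamiliar with the metric, field-level accuracy counts every labelled field separately, so one wrong VAT ID does not mark the whole invoice as wrong. A minimal sketch, assuming exact string matching and flat field records (both simplifications):

```typescript
// One document's fields as a flat name-to-value record (a simplification).
type Fields = Record<string, string>;

// Fraction of gold-set fields whose predicted value matches exactly.
// predicted[i] is assumed to correspond to gold[i].
function fieldAccuracy(predicted: Fields[], gold: Fields[]): number {
  let correct = 0;
  let total = 0;
  gold.forEach((goldDoc, i) => {
    const predictedDoc: Fields = predicted[i] ?? {};
    for (const name of Object.keys(goldDoc)) {
      total++;
      if (predictedDoc[name] === goldDoc[name]) correct++;
    }
  });
  return total === 0 ? 0 : correct / total;
}
```

Missing documents and missing fields count as errors, which keeps the metric from flattering a model that skips hard cases.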

What’s next

The same agent pattern is now being extended to expense reports and to KYC checks for customer onboarding, reusing 80% of the infrastructure we shipped in month one. That’s the compounding return of craft-first engineering: the second agent costs a fraction of the first.

If you’re staring at a repetitive back-office workflow and wondering whether an agent is worth the risk — talk to us. We’ll tell you honestly if it’s the wrong fit, and if it’s the right fit we’ll be in production before most vendors have scheduled a second meeting.


Your move

Let's make your software feel inevitable.

Tell us what you need. We reply within one working day — with a real opinion, not a sales pitch.