Invoice-triage agent: 180 hours/month back

<div class="not-prose relative my-12 overflow-hidden rounded-3xl border border-white/10 bg-ink-900/40 p-6 md:p-10 reveal">
<div aria-hidden="true" class="absolute inset-0 opacity-70" style="background: radial-gradient(closest-side at 50% 50%, rgba(232,149,74,0.22), transparent 70%);"></div>
<svg viewBox="0 0 880 220" class="relative h-auto w-full" role="img" aria-label="Invoice triage agent pipeline diagram">
<defs>
<linearGradient id="iv-line" x1="0" x2="1">
<stop offset="0%" stop-color="#E8954A" stop-opacity="0.15"/>
<stop offset="50%" stop-color="#E8954A" stop-opacity="0.8"/>
<stop offset="100%" stop-color="#6EE7E0" stop-opacity="0.7"/>
</linearGradient>
</defs>
<path d="M 90 110 L 790 110" stroke="url(#iv-line)" stroke-width="2" fill="none"/>
<path d="M 90 110 L 790 110" stroke="#E8954A" stroke-width="2" fill="none" class="dash-anim" opacity="0.8"/>
<g transform="translate(32,60)">
<rect x="4" y="6" width="52" height="68" rx="4" fill="rgba(255,255,255,0.03)" stroke="rgba(255,255,255,0.12)"/>
<rect x="0" y="0" width="52" height="68" rx="4" fill="rgba(255,255,255,0.05)" stroke="rgba(255,255,255,0.22)"/>
<line x1="10" y1="16" x2="42" y2="16" stroke="rgba(255,255,255,0.35)" stroke-width="1.5"/>
<line x1="10" y1="26" x2="42" y2="26" stroke="rgba(255,255,255,0.35)" stroke-width="1.5"/>
<line x1="10" y1="36" x2="30" y2="36" stroke="rgba(255,255,255,0.35)" stroke-width="1.5"/>
<text x="26" y="90" text-anchor="middle" fill="rgba(255,255,255,0.55)" font-size="10" font-family="ui-monospace, monospace" letter-spacing="0.18em">INBOX</text>
</g>
<g transform="translate(150,70)">
<rect width="120" height="80" rx="14" fill="rgba(232,149,74,0.06)" stroke="rgba(232,149,74,0.45)"/>
<circle cx="60" cy="30" r="12" fill="none" stroke="#E8954A" stroke-width="1.8" class="pulse-soft"/>
<circle cx="60" cy="30" r="3" fill="#E8954A"/>
<text x="60" y="60" text-anchor="middle" fill="#F5EFE6" font-size="12" font-family="inherit" font-weight="500">Classify</text>
<text x="60" y="100" text-anchor="middle" fill="rgba(232,149,74,0.7)" font-size="9" font-family="ui-monospace, monospace" letter-spacing="0.18em">01</text>
</g>
<g transform="translate(300,70)">
<rect width="120" height="80" rx="14" fill="rgba(232,149,74,0.09)" stroke="rgba(232,149,74,0.55)"/>
<rect x="20" y="24" width="36" height="4" rx="1.5" fill="#E8954A" opacity="0.85"/>
<rect x="62" y="24" width="38" height="4" rx="1.5" fill="#E8954A" opacity="0.5"/>
<rect x="20" y="34" width="56" height="4" rx="1.5" fill="#E8954A" opacity="0.7"/>
<rect x="82" y="34" width="18" height="4" rx="1.5" fill="#E8954A" opacity="0.4"/>
<text x="60" y="60" text-anchor="middle" fill="#F5EFE6" font-size="12" font-family="inherit" font-weight="500">Extract</text>
<text x="60" y="100" text-anchor="middle" fill="rgba(232,149,74,0.7)" font-size="9" font-family="ui-monospace, monospace" letter-spacing="0.18em">02</text>
</g>
<g transform="translate(450,70)">
<rect width="120" height="80" rx="14" fill="rgba(232,149,74,0.06)" stroke="rgba(232,149,74,0.45)"/>
<path d="M 48 34 l 8 8 l 18 -18" fill="none" stroke="#E8954A" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="pulse-soft"/>
<text x="60" y="60" text-anchor="middle" fill="#F5EFE6" font-size="12" font-family="inherit" font-weight="500">Validate</text>
<text x="60" y="100" text-anchor="middle" fill="rgba(232,149,74,0.7)" font-size="9" font-family="ui-monospace, monospace" letter-spacing="0.18em">03</text>
</g>
<g transform="translate(600,70)">
<rect width="120" height="80" rx="14" fill="rgba(232,149,74,0.12)" stroke="rgba(232,149,74,0.65)"/>
<circle cx="44" cy="34" r="6" fill="none" stroke="#E8954A" stroke-width="1.5"/>
<circle cx="76" cy="34" r="6" fill="none" stroke="#E8954A" stroke-width="1.5"/>
<line x1="50" y1="34" x2="70" y2="34" stroke="#E8954A" stroke-width="1.5"/>
<text x="60" y="60" text-anchor="middle" fill="#F5EFE6" font-size="12" font-family="inherit" font-weight="500">Match</text>
<text x="60" y="100" text-anchor="middle" fill="rgba(232,149,74,0.7)" font-size="9" font-family="ui-monospace, monospace" letter-spacing="0.18em">04</text>
</g>
<g transform="translate(770,60)">
<circle cx="36" cy="50" r="30" fill="rgba(110,231,224,0.1)" stroke="rgba(110,231,224,0.55)" stroke-width="1.5" class="pulse-soft"/>
<path d="M 22 50 l 10 10 l 18 -20" fill="none" stroke="#6EE7E0" stroke-width="2.5" stroke-linecap="round" stroke-linejoin="round"/>
<text x="36" y="102" text-anchor="middle" fill="rgba(110,231,224,0.75)" font-size="10" font-family="ui-monospace, monospace" letter-spacing="0.18em">ERP</text>
</g>
</svg>
<p class="relative mt-6 text-center text-xs uppercase tracking-eyebrow text-copper-400">Four deterministic steps · one human in the loop · zero lock-in</p>
</div>

The challenge

The client's accounts-payable team was processing around 4,000 supplier invoices per month — PDFs, scans, and the occasional emailed photo. Each one took an operator 12–16 minutes: download, re-key, match to a PO, route for approval. Errors slipped through. New hires took weeks to ramp.

They had tried an OCR vendor the year before. The extraction was 82% accurate on a good day, which meant a human still had to re-verify every single field. The tool had become another tab to keep open rather than a workflow that actually shipped results.

The brief to us was blunt: "Ship something that actually saves us hours, not another dashboard."

Our roadmap

<div class="not-prose relative my-12">
<ol class="relative space-y-6 md:space-y-8">
<li class="reveal group relative pl-14 md:pl-20" style="transition-delay: 0ms;">
<div aria-hidden="true" class="absolute left-[1.1rem] top-12 bottom-[-1.5rem] w-px bg-gradient-to-b from-copper-500/30 to-copper-500/5 md:left-[1.6rem]"></div>
<div class="absolute left-0 top-0 flex h-10 w-10 items-center justify-center rounded-full border border-copper-500/40 bg-ink-900 font-display text-sm font-semibold text-copper-300 shadow-[0_0_0_4px_var(--color-ink-950)] md:h-[3.25rem] md:w-[3.25rem] md:text-base">01</div>
<div class="absolute left-[2.05rem] top-[1.1rem] md:left-[2.9rem] md:top-6">
<span class="relative flex h-2 w-2">
<span class="absolute inline-flex h-full w-full animate-ping rounded-full bg-copper-400 opacity-60"></span>
<span class="relative inline-flex h-2 w-2 rounded-full bg-copper-400"></span>
</span>
</div>
<div class="rounded-2xl border border-white/10 bg-gradient-to-br from-ink-800/40 via-ink-900/60 to-ink-950/80 p-5 transition-all duration-300 group-hover:border-copper-500/40 group-hover:shadow-[0_16px_48px_-16px_rgba(232,149,74,0.3)] md:p-6">
<div class="flex flex-wrap items-center gap-x-3 gap-y-1">
<span class="font-mono text-[10px] uppercase tracking-[0.22em] text-copper-400">Phase 01</span>
<span aria-hidden="true" class="h-px w-6 bg-copper-500/30"></span>
<span class="text-xs uppercase tracking-eyebrow text-ink-300">Week 1</span>
</div>
<h3 class="mt-2 font-display text-lg leading-snug text-porcelain-100 md:text-xl">Discovery & data</h3>
<p class="mt-3 text-sm leading-relaxed text-ink-300 md:text-base md:leading-relaxed">Shadowed 3 operators, labelled a gold-set of 500 real invoices, benchmarked two LLMs against the status quo.</p>
</div>
</li>
<li class="reveal group relative pl-14 md:pl-20" style="transition-delay: 100ms;">
<div aria-hidden="true" class="absolute left-[1.1rem] top-12 bottom-[-1.5rem] w-px bg-gradient-to-b from-copper-500/30 to-copper-500/5 md:left-[1.6rem]"></div>
<div class="absolute left-0 top-0 flex h-10 w-10 items-center justify-center rounded-full border border-copper-500/40 bg-ink-900 font-display text-sm font-semibold text-copper-300 shadow-[0_0_0_4px_var(--color-ink-950)] md:h-[3.25rem] md:w-[3.25rem] md:text-base">02</div>
<div class="absolute left-[2.05rem] top-[1.1rem] md:left-[2.9rem] md:top-6">
<span class="relative flex h-2 w-2">
<span class="absolute inline-flex h-full w-full animate-ping rounded-full bg-copper-400 opacity-60"></span>
<span class="relative inline-flex h-2 w-2 rounded-full bg-copper-400"></span>
</span>
</div>
<div class="rounded-2xl border border-white/10 bg-gradient-to-br from-ink-800/40 via-ink-900/60 to-ink-950/80 p-5 transition-all duration-300 group-hover:border-copper-500/40 group-hover:shadow-[0_16px_48px_-16px_rgba(232,149,74,0.3)] md:p-6">
<div class="flex flex-wrap items-center gap-x-3 gap-y-1">
<span class="font-mono text-[10px] uppercase tracking-[0.22em] text-copper-400">Phase 02</span>
<span aria-hidden="true" class="h-px w-6 bg-copper-500/30"></span>
<span class="text-xs uppercase tracking-eyebrow text-ink-300">Week 2</span>
</div>
<h3 class="mt-2 font-display text-lg leading-snug text-porcelain-100 md:text-xl">Agent & extraction</h3>
<p class="mt-3 text-sm leading-relaxed text-ink-300 md:text-base md:leading-relaxed">Built the multi-step agent: classify, extract, validate, match to PO. Structured outputs, JSON schema, deterministic retries.</p>
</div>
</li>
<li class="reveal group relative pl-14 md:pl-20" style="transition-delay: 200ms;">
<div aria-hidden="true" class="absolute left-[1.1rem] top-12 bottom-[-1.5rem] w-px bg-gradient-to-b from-copper-500/30 to-copper-500/5 md:left-[1.6rem]"></div>
<div class="absolute left-0 top-0 flex h-10 w-10 items-center justify-center rounded-full border border-copper-500/40 bg-ink-900 font-display text-sm font-semibold text-copper-300 shadow-[0_0_0_4px_var(--color-ink-950)] md:h-[3.25rem] md:w-[3.25rem] md:text-base">03</div>
<div class="absolute left-[2.05rem] top-[1.1rem] md:left-[2.9rem] md:top-6">
<span class="relative flex h-2 w-2">
<span class="absolute inline-flex h-full w-full animate-ping rounded-full bg-copper-400 opacity-60"></span>
<span class="relative inline-flex h-2 w-2 rounded-full bg-copper-400"></span>
</span>
</div>
<div class="rounded-2xl border border-white/10 bg-gradient-to-br from-ink-800/40 via-ink-900/60 to-ink-950/80 p-5 transition-all duration-300 group-hover:border-copper-500/40 group-hover:shadow-[0_16px_48px_-16px_rgba(232,149,74,0.3)] md:p-6">
<div class="flex flex-wrap items-center gap-x-3 gap-y-1">
<span class="font-mono text-[10px] uppercase tracking-[0.22em] text-copper-400">Phase 03</span>
<span aria-hidden="true" class="h-px w-6 bg-copper-500/30"></span>
<span class="text-xs uppercase tracking-eyebrow text-ink-300">Week 3</span>
</div>
<h3 class="mt-2 font-display text-lg leading-snug text-porcelain-100 md:text-xl">Human-in-the-loop UI</h3>
<p class="mt-3 text-sm leading-relaxed text-ink-300 md:text-base md:leading-relaxed">Shipped a single-pane review UI with inline corrections. One keystroke approves; edits feed the evaluation set automatically.</p>
</div>
</li>
<li class="reveal group relative pl-14 md:pl-20" style="transition-delay: 300ms;">
<div aria-hidden="true" class="absolute left-[1.1rem] top-12 bottom-[-1.5rem] w-px bg-gradient-to-b from-copper-500/30 to-copper-500/5 md:left-[1.6rem] hidden"></div>
<div class="absolute left-0 top-0 flex h-10 w-10 items-center justify-center rounded-full border border-copper-500/40 bg-ink-900 font-display text-sm font-semibold text-copper-300 shadow-[0_0_0_4px_var(--color-ink-950)] md:h-[3.25rem] md:w-[3.25rem] md:text-base">04</div>
<div class="absolute left-[2.05rem] top-[1.1rem] md:left-[2.9rem] md:top-6">
<span class="relative flex h-2 w-2">
<span class="absolute inline-flex h-full w-full animate-ping rounded-full bg-copper-400 opacity-60"></span>
<span class="relative inline-flex h-2 w-2 rounded-full bg-copper-400"></span>
</span>
</div>
<div class="rounded-2xl border border-white/10 bg-gradient-to-br from-ink-800/40 via-ink-900/60 to-ink-950/80 p-5 transition-all duration-300 group-hover:border-copper-500/40 group-hover:shadow-[0_16px_48px_-16px_rgba(232,149,74,0.3)] md:p-6">
<div class="flex flex-wrap items-center gap-x-3 gap-y-1">
<span class="font-mono text-[10px] uppercase tracking-[0.22em] text-copper-400">Phase 04</span>
<span aria-hidden="true" class="h-px w-6 bg-copper-500/30"></span>
<span class="text-xs uppercase tracking-eyebrow text-ink-300">Week 4</span>
</div>
<h3 class="mt-2 font-display text-lg leading-snug text-porcelain-100 md:text-xl">Harden & launch</h3>
<p class="mt-3 text-sm leading-relaxed text-ink-300 md:text-base md:leading-relaxed">Cost caps, PII guardrails, DATEV export, on-call runbook. Rolled out to the full ops team on day 28.</p>
</div>
</li>
</ol>
</div>

How we shipped it

Week 1 — Discovery that pays for itself

Before writing a single prompt, we sat with three AP operators for a day each. The real bottleneck wasn't extraction. It was the six small decisions a human made per invoice, such as: is this a duplicate? Is the VAT ID in our vendor master? Does the PO still have budget? We captured those decisions as the spec for the agent, rather than bending the workflow to fit the agent.

We labelled 500 representative invoices as a gold set. That gold set stayed the source of truth for every benchmark that followed.
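
To make "source of truth for every benchmark" concrete, here is a minimal sketch of how field-level accuracy against a gold set could be scored. The `GoldExample` shape, the `normalize` rule, and the function name are illustrative assumptions, not the project's actual evaluation code.

```typescript
// Hypothetical field map: field name -> extracted or labelled value.
type Fields = Record<string, string>;

interface GoldExample {
  invoiceId: string;
  expected: Fields; // human-labelled values from the gold set
}

// Assumption: whitespace and letter case should not count as extraction errors.
function normalize(value: string | undefined): string {
  return (value ?? "").trim().replace(/\s+/g, " ").toLowerCase();
}

// Field-level accuracy = correctly extracted fields / labelled fields,
// counted across the whole gold set. A missing prediction counts as wrong.
function fieldAccuracy(
  gold: GoldExample[],
  predictions: Map<string, Fields>,
): number {
  let correct = 0;
  let total = 0;
  for (const example of gold) {
    const predicted: Fields = predictions.get(example.invoiceId) ?? {};
    for (const [name, expected] of Object.entries(example.expected)) {
      total += 1;
      if (normalize(predicted[name]) === normalize(expected)) correct += 1;
    }
  }
  return total === 0 ? 0 : correct / total;
}
```

Scoring per field rather than per invoice keeps one bad line item from hiding an otherwise correct extraction, and the same function can score any candidate model or vendor against the same labels.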

Week 2 — The agent, built boringly

Four deterministic steps, not one giant prompt:

Classify — is this an invoice, a credit note, a statement, or something else?
Extract — structured JSON against a Zod schema (header, line items, totals, VAT).
Validate — sums add up, VAT IDs check out against the EU VIES registry, currency matches the vendor master.
Match — propose a purchase order and a cost centre, with a confidence score.
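
As an illustration of the "sums add up" part of the Validate step, the check might look roughly like the sketch below. The `Invoice` shape, the one-cent tolerance, and the use of integer cents are assumptions made for the example; the VIES lookup and the vendor-master currency check are left out.

```typescript
// Hypothetical invoice shape. Amounts are integer cents to avoid float drift.
interface LineItem {
  description: string;
  netCents: number;
  vatRatePercent: number;
}

interface Invoice {
  lines: LineItem[];
  netTotalCents: number;
  vatTotalCents: number;
  grossTotalCents: number;
}

// Returns a list of human-readable problems; an empty list means the sums add up.
// toleranceCents absorbs per-line rounding differences (an assumed default of 1 cent).
function validateTotals(invoice: Invoice, toleranceCents = 1): string[] {
  const problems: string[] = [];
  const net = invoice.lines.reduce((sum, line) => sum + line.netCents, 0);
  const vat = invoice.lines.reduce(
    (sum, line) => sum + Math.round((line.netCents * line.vatRatePercent) / 100),
    0,
  );

  if (Math.abs(net - invoice.netTotalCents) > toleranceCents) {
    problems.push("net total does not match the line items");
  }
  if (Math.abs(vat - invoice.vatTotalCents) > toleranceCents) {
    problems.push("VAT total does not match the line items");
  }
  const gross = invoice.netTotalCents + invoice.vatTotalCents;
  if (Math.abs(gross - invoice.grossTotalCents) > toleranceCents) {
    problems.push("gross total is not net plus VAT");
  }
  return problems;
}
```

Returning a list of problems instead of a boolean lets a reviewer see exactly which total disagreed.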

Each step is independently evaluated, independently retryable, and independently cacheable. That's the boring part that makes it cheap to run and easy to debug at 2 a.m.
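
A rough sketch of what "independently retryable and independently cacheable" can mean in code follows. It is a plain in-process illustration, not the Temporal setup described in the architecture section: `createStepRunner`, the cache key format, and the synchronous step signature are all assumptions for brevity (real steps would be asynchronous, and `JSON.stringify` as a cache key assumes small, JSON-serializable inputs).

```typescript
// A step is a function from typed input to typed output.
type Step<I, O> = (input: I) => O;

// Builds a runner with its own cache and a bounded retry count.
function createStepRunner(maxAttempts: number) {
  const cache = new Map<string, unknown>();

  return function runStep<I, O>(name: string, step: Step<I, O>, input: I): O {
    // Cache per step name and input, so re-running one step never re-runs another.
    const key = `${name}:${JSON.stringify(input)}`;
    if (cache.has(key)) return cache.get(key) as O;

    let lastError: unknown;
    for (let attempt = 1; attempt <= maxAttempts; attempt += 1) {
      try {
        const output = step(input);
        cache.set(key, output); // only successful results are cached
        return output;
      } catch (error) {
        lastError = error; // remember the failure and try again
      }
    }
    throw new Error(
      `${name} failed after ${maxAttempts} attempts: ${String(lastError)}`,
    );
  };
}
```

Because each step goes through the same wrapper, a failure message names the step that broke, and a cached classify result is reused when only a later step needs another attempt.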

Week 3 — The review UI nobody dreaded using

The operator sees the PDF on the left, the extracted fields on the right, confidence-coded. Green fields need a glance. Orange fields need a click. Red fields are pre-focused for editing. The whole thing is keyboard-driven — Tab, Enter, done.
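
The colour coding above boils down to a threshold function over each field's confidence score. A minimal sketch is shown here; the threshold values (0.95 and 0.8) and all names are illustrative assumptions, since the source does not state the cut-offs that were used.

```typescript
type ReviewLevel = "green" | "orange" | "red";

interface ExtractedField {
  name: string;
  value: string;
  confidence: number; // assumed to be in the range 0..1
}

// Assumed cut-offs for illustration only; real values would be tuned on the gold set.
const GLANCE_THRESHOLD = 0.95;
const CLICK_THRESHOLD = 0.8;

// green: a glance is enough, orange: needs a click, red: needs an edit.
function reviewLevel(confidence: number): ReviewLevel {
  if (confidence >= GLANCE_THRESHOLD) return "green";
  if (confidence >= CLICK_THRESHOLD) return "orange";
  return "red";
}

// The first red field is the one the UI would pre-focus for editing.
function firstFieldToFocus(fields: ExtractedField[]): string | undefined {
  return fields.find((field) => reviewLevel(field.confidence) === "red")?.name;
}
```

Keeping the mapping in one pure function makes the thresholds easy to adjust and easy to test without rendering the UI.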

Every correction the operator makes writes back into a labelled evaluation set. The agent's accuracy keeps improving without anyone running a training job.

Week 4 — Hardening before the handover

Cost caps per invoice and per day, hard-enforced.
PII redaction on anything stored in logs.
DATEV export in the format the client's accountants already used.
An on-call runbook, a Grafana board, and a Slack alert for anything the agent flagged as ambiguous twice in a row.
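
"Hard-enforced" cost caps can be sketched as a guard that is consulted before every model call and refuses the call once a cap would be exceeded. The version below is an in-memory illustration under assumed names (`createCostGuard`, `CostCaps`) and assumed units (integer cents); a production version would need shared, persistent counters.

```typescript
interface CostCaps {
  perInvoiceCents: number;
  perDayCents: number;
}

// Tracks spend in memory and throws instead of allowing a cap to be exceeded.
function createCostGuard(caps: CostCaps) {
  let spentTodayCents = 0;
  const spentByInvoice = new Map<string, number>();

  return {
    // Call with the estimated cost before each model request.
    charge(invoiceId: string, costCents: number): void {
      const invoiceSpend = (spentByInvoice.get(invoiceId) ?? 0) + costCents;
      if (invoiceSpend > caps.perInvoiceCents) {
        throw new Error(`per-invoice cap exceeded for ${invoiceId}`);
      }
      if (spentTodayCents + costCents > caps.perDayCents) {
        throw new Error("daily cap exceeded");
      }
      // Record the spend only after both checks pass.
      spentByInvoice.set(invoiceId, invoiceSpend);
      spentTodayCents += costCents;
    },
    // Assumed to be triggered once a day by a scheduler.
    resetDay(): void {
      spentTodayCents = 0;
    },
    spentToday(): number {
      return spentTodayCents;
    },
  };
}
```

Throwing rather than logging is what makes the cap hard: an invoice that hits the limit stops being processed automatically and can be handed to a human.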

We went live on a Thursday. The ops lead DM'd us on Friday evening: "First week in two years I'm leaving on time."

The architecture

A readable stack is a maintainable stack. Nothing exotic.

Ingestion: Microsoft 365 connector + a watched inbox → S3-compatible bucket.
Agent runtime: TypeScript, Anthropic Claude Sonnet for reasoning, structured outputs against a Zod schema, strict JSON validation.
Orchestration: Temporal for retries, idempotency, and human-in-the-loop waits.
Validation: EU VIES for VAT IDs, the client's vendor master as a read-only source of truth.
Review UI: Astro + HTMX on Cloudflare, deployed in under a minute per change.
Export: DATEV CSV, posted directly into the client's existing ERP import queue.
Observability: OpenTelemetry traces, token-spend dashboards, eval regressions wired into CI.

No lock-in. The client owns the repo, the infra, and the evaluation set on day 1.

The results

We went live four weeks after kickoff. Measured over the first 90 days in production:

Average handling time dropped from 14 minutes to 3 minutes per invoice.
99.2% field-level extraction accuracy on the gold set — and rising as corrections feed back.
180 operator hours freed per month — redeployed into vendor-relationship work the team had been postponing for a year.
Zero incidents in the first 90 days. The agent refused to guess on 2.1% of invoices, which humans handled in the review UI without friction.
EU AI Act ready — classified as limited risk, documented accordingly, with a clean audit trail for every automated decision.

What's next

The same agent pattern is now being extended to expense reports and customer onboarding KYC — reusing 80% of the infrastructure we shipped in month one. That's the compounding return of craft-first engineering: the second agent costs a fraction of the first.

If you're staring at a repetitive back-office workflow and wondering whether an agent is worth the risk — talk to us. We'll tell you honestly if it's the wrong fit, and if it's the right fit we'll be in production before most vendors have scheduled a second meeting.
