← portfolio

Data Pipeline

4,900 records, zero manual entry: AI-powered data extraction from field operations

Field operations generate enormous data in formats no one planned for — spreadsheets with merged cells, multi-tab Excel files, scanned forms, embedded images. Here's how an AI pipeline extracted and validated thousands of records automatically.

5 min read Data Pipeline · LLM Extraction · Validation · Computer Vision

Here is a problem that every large field operation has, and almost none discuss openly: the data exists, but it’s trapped.

Thousands of records — maintenance visits, inspection reports, breakdown logs, repair invoices — generated by field staff and service providers across different regions, in whatever format was convenient at the time. A spreadsheet here. A scanned form there. Photos of handwritten notes. Excel files with merged cells, multiple tabs, images embedded in cells, and column headers that mean slightly different things across each provider’s template.

The data is there. Aggregating it is a different matter entirely.

The specific problem

In operations-intensive sectors — infrastructure maintenance, water networks, logistics, humanitarian programmes — data flows in from multiple providers and field teams in completely inconsistent formats. Provider A sends a clean CSV. Provider B sends a multi-tab Excel file with images of receipts embedded in cells. Provider C sends scanned PDF reports from the field. Provider D sends a spreadsheet where the asset identifier is hidden inside a compound string buried in a column called “Location Notes.”

To answer a basic question — “what percentage of our assets had at least one maintenance visit this quarter?” — an analyst would normally need days to clean, normalise, and reconcile all of these into a common structure.

I built an AI pipeline to do it instead.

The test case: a network of operational assets across multiple regions, receiving maintenance records from several service providers. The goal was to extract structured records from raw spreadsheets and validate them against a master registry. The result: 4,904 records extracted from four data sources, with 825 confirmed against the master register — work that would have taken days of manual effort done in a single pipeline run.

The four-stage pipeline

Stage 1: Deep parse. Before any AI touches the files, the system does a structural analysis of each spreadsheet: how many tables are there per sheet? Are there embedded images? Charts with data encoded visually? Merged cells that obscure meaningful information? This step builds a precise map of what each file actually contains — not what its filename suggests.

Stage 2: AI understanding. Each sheet’s structural map is sent to a vision language model, along with the master registry of valid asset IDs. The model’s job at this stage is not to extract data — it’s to produce a game plan: “this sheet contains maintenance records; the asset ID is in column C but formatted inconsistently; dates use a non-standard format; here’s how to map each column to the target schema.” This game plan is stored and used in the next stage.

Stage 3: Batch extraction. Using the game plan from Stage 2, the system sends the actual data — in batches of 20 rows — back to the model for structured extraction. Output is clean JSON records: visit date, breakdown duration, inferred asset ID, repair type, service provider. Crucially, each record includes an _explanation field documenting exactly which source columns were used and why. Every extraction decision is auditable.

Stage 4: Validation and cleaning. Extracted records are validated against the master registry. Asset IDs are checked. Dates are normalised — Excel has a date arithmetic bug from 1900 that still corrupts files in active use today; the pipeline corrects for it automatically. Records that fail validation are flagged separately rather than silently dropped.

Why splitting “understand” and “extract” matters

The design choice that made this work was separating the understanding phase from the extraction phase. Most extraction pipelines try to do both at once — send the raw data, get structured output back. The problem is that a model asked to simultaneously understand a messy spreadsheet and produce clean records from it does both worse.

By splitting the pipeline — first produce a documented game plan, then execute it — the system handles genuinely unusual formats reliably. When an asset ID is embedded inside a compound string like Region_AEP_00284_visit_2025, the understanding stage figures this out and documents the extraction logic explicitly. The extraction stage just follows the plan.

This also means the pipeline is improvable. If the understanding phase produces an incorrect game plan, you can correct it before extraction runs. The reasoning is inspectable rather than buried inside a single opaque call.

The pattern generalises

The field operations example is specific. The pattern is not. Any organisation that:

  • Receives data from multiple providers or field teams in inconsistent formats
  • Needs to reconcile incoming records against a master register or reference dataset
  • Spends significant analyst time on “data cleaning” before any actual analysis can happen
  • Has to report aggregated metrics to leadership, donors, or regulators from fragmented sources

…has the same problem. The specific formats change. The entities tracked change. The validators change. But the pipeline — deep parse, AI understanding, batch extraction, validation — applies to logistics tracking, maintenance networks, supply chain reporting, compliance submissions, field survey aggregation, and dozens of similar contexts.

What this changes in practice

The economic argument for automating extraction like this is straightforward: manual data entry at scale is slow, expensive, and error-prone in ways that compound. A data entry error in record 3,000 of 4,900 won’t be caught until someone notices an anomaly in the analysis — if they notice it at all.

An AI extraction pipeline fails differently. When it can’t confidently extract a value, it flags it explicitly rather than guessing silently. The output includes its reasoning. Validation against a master registry catches systemic mismatches early. The errors are visible, not hidden.

The harder argument is organisational: data that arrives in the wrong format doesn’t have to stay that way. The work of normalising it is automatable — and when it’s automated correctly, it becomes a solved problem rather than a recurring tax on analyst time.

If your operations generate data in formats you can’t easily consolidate, that’s not a data quality problem. It’s a pipeline problem. And pipeline problems have precise solutions.


Want something like this built for your business?

Every system on this page was built to solve a specific operational problem. If you have a similar one, let's talk about what a solution looks like for you.

Get in touch