Conversational Analytics

Asking questions of your files: a natural language interface for business data

Non-technical teams have more data than ever and less ability to query it. This is a conversational AI system that lets business users ask questions of spreadsheets, PDFs, and scanned documents in plain English — and get sourced answers.

4 min read RAG · NLP · Data Analysis · Multi-format

There’s a gap in most organisations between the people who have business questions and the people who can answer them from data.

A finance manager wants to know: “what’s our total spend on vendor X across all invoices this quarter?” The data is in an Excel file — maybe two. Getting the answer requires writing a formula, asking an analyst, or spending twenty minutes doing it manually.

A customer success lead wants to cross-reference their CRM export against the service terms in a set of PDF contracts. Not a hard question conceptually. Practically, it requires someone who can work across file formats programmatically.

A compliance officer needs to check whether a specific clause appears in any of the 200 contracts uploaded last month. The documents are there. The keyword search doesn’t understand what they’re looking for.

This is a solvable problem. I built a system to solve it.

What the system does

The Data Q&A app is a conversational interface for business files. Upload your data. Ask questions in plain English. Get answers with sources — the matched rows, the PDF excerpts, the page numbers, exactly where the information came from.

It handles:

  • Spreadsheets (CSV, Excel): filtering, aggregation, cross-sheet calculations, deduplication
  • PDFs (digital and scanned): extracting clauses and cross-referencing text against structured data
  • Images and scanned documents: reading tables, labels, and handwritten content via vision AI

The input is a natural language question. The output is an answer with citations — not a guess, but a traceable response showing its work.

How it works

Under the hood this is a RAG (retrieval-augmented generation) pipeline, but with a design feature that separates it from most implementations: it plans before it retrieves.

Query planning. When a question comes in, the system doesn’t immediately search for data. It first builds a plan: which files contain relevant information? What filters or calculations are needed? Does this question require joining data from multiple sources? The plan is generated by an LLM with access to a catalogue of all loaded files — their column names, sample values, data types, and structure — so the plan is grounded in what’s actually available.
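To make the catalogue idea concrete, here is a minimal sketch of how one CSV might be summarised into column names, inferred types, and sample values before being shown to the planner. The names `CatalogueEntry`, `describe_csv`, `infer_dtype`, and `render_for_prompt` are hypothetical, and the two-way number/text typing is a simplifying assumption rather than a description of the actual system.

```python
import csv
import io
from dataclasses import dataclass


@dataclass
class CatalogueEntry:
    """Summary of one loaded file, as handed to the planning LLM (assumed shape)."""
    name: str
    columns: list
    dtypes: dict
    samples: dict
    row_count: int


def infer_dtype(values):
    """Return 'number' if every non-empty value parses as a float, else 'text'."""
    non_empty = [v for v in values if v.strip()]
    if not non_empty:
        return "text"
    for value in non_empty:
        try:
            float(value)
        except ValueError:
            return "text"
    return "number"


def describe_csv(name, text, n_samples=3):
    """Build a catalogue entry from raw CSV text."""
    reader = csv.DictReader(io.StringIO(text))
    rows = list(reader)
    columns = list(reader.fieldnames or [])
    dtypes, samples = {}, {}
    for column in columns:
        # Short rows yield None for missing cells; treat those as empty.
        values = [(row[column] or "") for row in rows]
        dtypes[column] = infer_dtype(values)
        # Keep the first few distinct non-empty values, in file order.
        distinct = list(dict.fromkeys(v for v in values if v.strip()))
        samples[column] = distinct[:n_samples]
    return CatalogueEntry(name, columns, dtypes, samples, len(rows))


def render_for_prompt(entry):
    """Format an entry as plain text to include in the planner's prompt."""
    lines = [f"{entry.name} ({entry.row_count} rows)"]
    for column in entry.columns:
        examples = ", ".join(entry.samples[column])
        lines.append(f"  {column} [{entry.dtypes[column]}] e.g. {examples}")
    return "\n".join(lines)


entry = describe_csv("invoices.csv", "vendor,amount\nAcme,120.50\nGlobex,80\nAcme,42\n")
print(render_for_prompt(entry))
```

Handing the planner a compact summary like this, rather than the file itself, is what lets it refer only to columns that really exist.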

Data retrieval. The plan is executed. For spreadsheets, this means running the appropriate filter or aggregation on the full dataset, not a sample. For PDFs, it means ranking sections against the query by a keyword-weighted relevance score and keeping the best matches. For scanned documents and images, it means extracting the text with a vision model first, then retrieving from the extracted text.
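The exact scoring formula isn't spelled out here, so the following is only one plausible reading of "keyword-weighted": a TF-IDF-style score in which query terms that appear on fewer pages count for more. The function names `tokenize` and `rank_sections` and the sample contract text are illustrative assumptions.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def rank_sections(query, sections, top_k=3):
    """Score (page_number, text) sections against a query, best first.

    Assumed scheme: each query term is weighted by a smoothed inverse
    document frequency, multiplied by its length-normalised frequency
    in the section.
    """
    tokenized = [(page, text, Counter(tokenize(text))) for page, text in sections]
    n_sections = len(tokenized)
    query_terms = set(tokenize(query))

    # Terms that occur in fewer sections receive a larger weight.
    weights = {}
    for term in query_terms:
        doc_freq = sum(1 for _, _, counts in tokenized if term in counts)
        weights[term] = math.log((n_sections + 1) / (doc_freq + 1)) + 1.0

    scored = []
    for page, text, counts in tokenized:
        length = sum(counts.values()) or 1
        score = sum(weights[t] * counts[t] / length for t in query_terms)
        if score > 0:
            scored.append((score, page, text))
    scored.sort(key=lambda item: item[0], reverse=True)
    return scored[:top_k]


sections = [
    (1, "Payment terms are net 30 days."),
    (2, "Either party may terminate with 60 days notice."),
    (3, "Termination for breach requires written notice."),
]
hits = rank_sections("terminate notice", sections)
for score, page, text in hits:
    print(f"page {page}: {score:.3f} {text}")
```

Because the page number travels with every hit, the retrieval step already carries what is needed to cite a source later.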

Answer generation with tools. Retrieved data is passed to an answer-generating LLM that has access to two tools: a sandboxed Python environment for arithmetic and data operations, and the ability to search PDFs for additional facts mid-answer. This enables multi-step reasoning: “calculate the total from the spreadsheet, then check whether the policy PDF sets a limit on that number.”
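As a rough sketch of the two-tool setup, the example below dispatches tool requests by name. It makes several assumptions: the request shape `{"tool": ..., "args": ...}` is hypothetical, `calculate` is a restricted arithmetic evaluator standing in for a real sandboxed Python environment (it is not a sandbox), and `search_pdf` is a plain substring search over pages held in memory.

```python
import ast
import operator

_BINARY = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}
_UNARY = {ast.USub: operator.neg, ast.UAdd: operator.pos}


def calculate(expression):
    """Evaluate +, -, *, / over numeric literals and reject anything else."""
    def walk(node):
        if isinstance(node, ast.Expression):
            return walk(node.body)
        if (isinstance(node, ast.Constant)
                and isinstance(node.value, (int, float))
                and not isinstance(node.value, bool)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _BINARY:
            return _BINARY[type(node.op)](walk(node.left), walk(node.right))
        if isinstance(node, ast.UnaryOp) and type(node.op) in _UNARY:
            return _UNARY[type(node.op)](walk(node.operand))
        raise ValueError(f"unsupported expression: {expression!r}")

    return walk(ast.parse(expression, mode="eval"))


def search_pdf(pages, phrase):
    """Return the page numbers whose text contains the phrase (case-insensitive)."""
    needle = phrase.lower()
    return [number for number, text in pages if needle in text.lower()]


TOOLS = {"calculate": calculate, "search_pdf": search_pdf}


def run_tool_call(call):
    """Dispatch one tool request of the form {'tool': name, 'args': {...}}."""
    tool = TOOLS.get(call["tool"])
    if tool is None:
        raise ValueError(f"unknown tool: {call['tool']}")
    return tool(**call["args"])


# Two-step flow: total the spreadsheet figures, then look up the policy limit.
policy_pages = [
    (1, "Scope and definitions."),
    (4, "Vendor spend above 200 requires director approval."),
]
total = run_tool_call({"tool": "calculate", "args": {"expression": "120.50 + 80 + 42"}})
pages = run_tool_call({"tool": "search_pdf", "args": {"pages": policy_pages, "phrase": "vendor spend"}})
print(total, pages)
```

Routing arithmetic through a tool rather than asking the model to do it in prose is the design choice that matters here: the number in the answer comes from executed code.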

The answer includes sources. Every claim traces back to a specific file, row, or page. You can verify exactly where the information came from.
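One way to keep answers traceable is to have the aggregation step return the rows it used alongside the figure. The sketch below does this for the vendor-spend question from earlier. The `Citation` shape, the `total_with_sources` function, and the sample invoice data are assumptions made for illustration.

```python
import csv
import io
from dataclasses import dataclass


@dataclass
class Citation:
    """Where a figure came from: the file and the spreadsheet row."""
    file: str
    row: int  # 1-based, counting the header as row 1


def total_with_sources(file_name, text, filter_column, filter_value, sum_column):
    """Sum one column over the rows matching a filter, recording each row used."""
    reader = csv.DictReader(io.StringIO(text))
    total = 0.0
    citations = []
    # Data rows start at spreadsheet row 2 because row 1 is the header.
    for row_number, row in enumerate(reader, start=2):
        if row[filter_column].strip().lower() == filter_value.lower():
            total += float(row[sum_column])
            citations.append(Citation(file_name, row_number))
    return total, citations


invoice_csv = "vendor,amount\nAcme,120.50\nGlobex,80\nAcme,42\n"
total, citations = total_with_sources("invoices.csv", invoice_csv, "vendor", "acme", "amount")
print(total)
for citation in citations:
    print(f"  source: {citation.file}, row {citation.row}")
```

Every row in the file is scanned, so the total reflects the full dataset, and each contributing row can be checked by hand.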

Where this changes how teams work

The obvious application is giving non-technical team members self-service access to their own data. But the more valuable application, in practice, is eliminating the category of questions that currently require an analyst but don’t justify one.

A few concrete examples:

Finance and accounting: Cross-reference expense data against budget line items. Check whether invoices match vendor contract terms. Answer “what’s our total exposure to currency X across all open positions?” without building a spreadsheet from scratch.

Operations and logistics: Query maintenance records, inventory data, or scheduling files. Ask “which sites had more than two incidents in Q3?” and get an answer with the source rows — not a task to hand to someone else.

Sales and revenue operations: Interrogate CRM exports, pipeline data, and commission calculations. The data is almost always in Excel. The questions are almost always ones that don’t require a data scientist — just the ability to ask them without writing a formula.

Legal and compliance: Search contract databases for specific clauses or party names. Cross-reference structured records against policy documents. Flag gaps between what the data shows and what the policy requires.

Research and analysis: Ask questions that span multiple datasets — a survey CSV, a reference PDF, an Excel lookup table — in a single query, with the system handling the join logic automatically.

A note on where it fits

A conversational data interface is not a replacement for a proper analytics platform at large scale. If you’re running hundreds of millions of rows through complex transformations daily, you need a different layer.

What it is right for is the enormous volume of mid-complexity analysis that currently sits between “something an analyst could answer quickly” and “something worth building a dashboard for.” Most organisations have far more of this than they realise, and it consumes disproportionate time because every question requires someone technical to answer it.

The pattern is also extensible. A version tuned for a specific domain — contract review, financial reconciliation, regulatory compliance, inventory management — with the right reference documents and validation logic built in can handle very specific, high-value questions with consistently high accuracy.

The underlying question

The point isn’t the technology. It’s what the technology makes possible: your team spending time on decisions instead of on data retrieval.

The data question for most organisations isn’t “do we have the data?” They usually do. It’s “can the people who need to understand it get to it without a bottleneck?”

If the answer to that is no, the bottleneck is a pipeline problem. And pipeline problems have precise solutions.

