Document Search
How I made 2,000 documents searchable in under an hour
Most document archives are effectively unsearchable: scanned PDFs and images with no text layer sit alongside spreadsheets, Word documents, and plain text files. Searchzilla is a search engine built for that reality, using AI vision and modern retrieval to find anything, across every format.
Every organisation I’ve worked with has a version of the same problem, even if they wouldn’t describe it this way: somewhere in the system, there’s a document that would answer the question — but no one can find it.
It might be a scanned invoice in a shared drive. A customer record saved as a photo. A compliance document exported from a legacy system with no readable text layer. The information exists. The search bar just doesn’t know it’s there.
I built Searchzilla to solve this.
Why document archives are effectively unsearchable
Standard search tools — SharePoint, Google Drive, Windows Search — work well when files are consistent, text-based, and modern. Real-world archives are rarely any of those things.
In a compliance or audit workflow, the folder you need to search might contain:
- Scanned paper forms saved as JPEGs
- PDF contracts with no extractable text (they were printed, signed, and re-scanned)
- Excel sheets with data hidden in merged cells or buried across multiple tabs
- Word documents, plain text exports, and photographs of handwritten notes — all in the same directory
When someone asks “show me everything related to account 12345,” a standard search returns a partial answer at best. Anything that can’t be read as plain text is invisible.
In practice, this means people do manual searches. They open files one by one. Someone gets assigned the task of “going through the folder.” Compliance deadlines slip because a record existed but wasn’t findable. Audit prep takes three times as long as it should.
What Searchzilla does differently
Searchzilla is a document search engine built from scratch for mixed archives. Point it at a folder and it indexes everything. A single query then returns results from every file type.
Three things make it work differently from standard tools:
It reads what can’t normally be read. When Searchzilla encounters a scanned PDF or a photograph, it doesn’t skip it. It passes the image to a vision AI model (the same class of model that reads text in photos) and extracts the content automatically. No manual OCR setup. No separate tooling. It happens at indexing time, for every file that lacks a usable text layer.
It searches by meaning, not just by word. Alongside traditional keyword search, Searchzilla runs a second pipeline using visual embeddings: numerical representations of each page’s meaning and appearance, based on a technique called ColPali. This means “customer account” finds a document titled “client record” — because the meanings are close, even when the exact phrase isn’t.
It merges results across all formats in one query. Point it at a folder with PDFs, Word documents, Excel spreadsheets, CSVs, and images, and search across all of them simultaneously. Results are ranked by relevance, merged into a single list, and shown with a preview of what matched and where.
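To make the "search by meaning" idea more concrete: ColPali-style retrieval represents a query and a page as sets of vectors and compares them with a late-interaction score, often called MaxSim. The sketch below shows only that scoring step. The tiny two-dimensional vectors and the file names are made up for illustration; in a real system the vectors come from the embedding model, and nothing here is taken from Searchzilla's own code.

```python
def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def maxsim(query_vecs, page_vecs):
    """Late-interaction score: for each query vector, take its best match
    among the page vectors, then sum those best matches."""
    return sum(max(dot(q, p) for p in page_vecs) for q in query_vecs)


def rank_pages(query_vecs, pages):
    """Return (page_name, score) pairs, best match first."""
    scored = [(name, maxsim(query_vecs, vecs)) for name, vecs in pages.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Hypothetical toy embeddings (hand-written, not produced by a model).
query = [[1.0, 0.0], [0.0, 1.0]]
pages = {
    "client_record.pdf": [[0.9, 0.1], [0.2, 0.8]],
    "lunch_menu.jpg": [[0.1, 0.1], [0.0, 0.2]],
}

ranked = rank_pages(query, pages)
```

Because each query vector is matched against its closest page vector, a page can score well even when it never contains the literal words of the query.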
How it was built
The architecture has three layers:
The extraction layer handles each file type appropriately. Text files are read directly. PDFs are extracted page by page; pages with too little text are assumed to be scanned and routed to a vision API for OCR. Images follow the same path. Excel and Word files are parsed for all visible content, including tables and headings. This runs once when you point Searchzilla at a folder.
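The routing decision described above can be sketched in a few lines. This is a minimal illustration, not Searchzilla's actual code: the extension sets, the route names, and the 50-character threshold for "too little text" are all assumptions chosen for the example.

```python
from pathlib import Path

# Hypothetical threshold: pages with fewer visible characters are treated as scans.
MIN_CHARS_PER_PAGE = 50

# Assumed extension groups; the real list may differ.
TEXT_EXTS = {".txt", ".csv", ".md"}
IMAGE_EXTS = {".jpg", ".jpeg", ".png", ".tif", ".tiff"}
OFFICE_EXTS = {".docx", ".xlsx"}


def route_file(path: str) -> str:
    """Pick an extraction route for a file based on its extension."""
    ext = Path(path).suffix.lower()
    if ext in TEXT_EXTS:
        return "read_text"
    if ext in IMAGE_EXTS:
        return "vision_ocr"
    if ext in OFFICE_EXTS:
        return "parse_office"
    if ext == ".pdf":
        return "pdf_per_page"
    return "skip"


def route_pdf_page(extracted_text: str, min_chars: int = MIN_CHARS_PER_PAGE) -> str:
    """Keep the embedded text if there is enough of it, otherwise send the page to OCR."""
    visible = "".join(extracted_text.split())  # ignore whitespace when counting
    return "use_text" if len(visible) >= min_chars else "vision_ocr"
```

For example, `route_file("claims/photo.JPG")` takes the OCR route, and a PDF page whose extracted text is only whitespace is treated as a scan.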
The index layer stores extracted content in two places. All text goes into a full-text search index (SQLite FTS5), which handles fast keyword lookups and supports structured field queries like invoice-id = 9473. Each page also gets a set of visual embeddings — compact numerical fingerprints that capture what the page is about — stored for semantic retrieval.
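As a rough sketch of the text half of the index, the example below builds an in-memory FTS5 table with Python's built-in `sqlite3` module and runs both a plain keyword query and a field-restricted one. The table layout, the `invoice_id` column, and the sample rows are assumptions for illustration. FTS5 expresses field filters as `column:term`, so a query like invoice-id = 9473 would need to be translated into that form; how Searchzilla does that is not shown here. It also assumes the local SQLite build has FTS5 enabled, which is the case for most Python distributions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# One row per extracted page. path and page are stored but not searchable.
conn.execute(
    "CREATE VIRTUAL TABLE pages USING fts5("
    "path UNINDEXED, page UNINDEXED, invoice_id, body)"
)

# Hypothetical sample data standing in for extracted text.
rows = [
    ("scans/inv_9473.pdf", 1, "9473", "Invoice for account 12345, total due 480.00"),
    ("contracts/acme.pdf", 3, "", "Liability clause for account 12345"),
    ("notes/visit.jpg", 1, "", "Handwritten note: call the customer on Monday"),
]
conn.executemany(
    "INSERT INTO pages (path, page, invoice_id, body) VALUES (?, ?, ?, ?)", rows
)


def search(conn, match_expr, limit=10):
    """Run an FTS5 MATCH query. bm25() is lower-is-better, so sort ascending."""
    return conn.execute(
        "SELECT path, page, bm25(pages) AS score FROM pages "
        "WHERE pages MATCH ? ORDER BY score LIMIT ?",
        (match_expr, limit),
    ).fetchall()


keyword_hits = search(conn, "12345")          # any indexed column
field_hits = search(conn, "invoice_id:9473")  # restricted to one column
```

The keyword query finds both pages that mention account 12345, while the field query returns only the page whose `invoice_id` column holds 9473.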
The search layer runs both pipelines when a query comes in. FTS5 returns exact and near-exact matches; visual search returns semantically related pages. Both are scored, normalised, merged, and deduplicated. Each result is labelled by which method found it — text match, visual match, or both.
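One way the merge step could look is sketched below. The article only states that results are scored, normalised, merged, deduplicated, and labelled, so the specifics here are assumptions: min-max normalisation, an equal weighting of the two pipelines, and `(path, page)` as the deduplication key. Scores are assumed to be higher-is-better, so raw FTS5 bm25 values would need to be negated before being passed in.

```python
def normalise(hits):
    """Min-max scale a {key: score} dict to [0, 1]; higher is better."""
    if not hits:
        return {}
    lo, hi = min(hits.values()), max(hits.values())
    if hi == lo:
        return {key: 1.0 for key in hits}
    return {key: (score - lo) / (hi - lo) for key, score in hits.items()}


def merge(text_hits, visual_hits, text_weight=0.5):
    """Combine two result sets keyed by (path, page).

    Using the key as the dict key deduplicates pages found by both
    pipelines; each result is labelled "text", "visual", or "both".
    """
    t, v = normalise(text_hits), normalise(visual_hits)
    merged = []
    for key in t.keys() | v.keys():
        if key in t and key in v:
            label = "both"
        elif key in t:
            label = "text"
        else:
            label = "visual"
        score = text_weight * t.get(key, 0.0) + (1 - text_weight) * v.get(key, 0.0)
        merged.append((key, score, label))
    # Best score first; ties broken by key so the order is deterministic.
    merged.sort(key=lambda row: (-row[1], row[0]))
    return merged


# Hypothetical scores from the two pipelines.
text_hits = {("a.pdf", 1): 3.0, ("b.pdf", 2): 1.0}
visual_hits = {("a.pdf", 1): 0.9, ("c.jpg", 1): 0.5}
results = merge(text_hits, visual_hits)
```

A page found by both pipelines appears once in the merged list, carries the "both" label, and benefits from both scores.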
The whole system runs on a standard laptop. No GPU required. No cloud dependency. The data stays where you put it.
Where this has direct business value
I tested Searchzilla on a corpus of 2,000 mixed files and indexed them in under an hour on commodity hardware, which works out to an average of less than two seconds per file. The business contexts with the clearest return:
Compliance and audit: Search an entire document archive — including scanned copies — for a customer ID, a date range, or a contract clause. Find records that a standard search would miss entirely.
Insurance and claims management: Claims files routinely mix digital PDFs with photographs of damage and handwritten assessment forms. A single query surfaces everything relevant, regardless of how it was saved.
Legal and contract work: Search hundreds of contracts for a party name, a liability clause, or a specific provision. The semantic layer catches documents where the wording differs from what you typed.
HR and records management: Employee files in large organisations are rarely stored in a consistent format. Searchzilla indexes the inconsistency and makes it queryable.
Field operations and logistics: Maintenance logs, inspection forms, and equipment records from field teams arrive in whatever format was convenient. All of it can be indexed and searched from one place.
The point
The insight behind Searchzilla isn’t complicated: documents contain more value than most organisations can access, because the search tools available were built for a tidier world than the one we actually operate in.
A search system that works across everything — not just well-structured text files — changes the economics of records management in operations-heavy organisations. The answer was there the whole time. It just couldn’t be found.
If your team spends real hours hunting through document archives for records they know exist, that is a well-defined problem with a well-defined technical solution. The work is in building it correctly for your archive and workflows.