Docortex: An air-gap document engine
Doxwell is the product. Docortex is the engine inside it. Same code, two surfaces.
People who try Doxwell often ask the same follow-up question: “what’s under it?”
The short answer is Docortex. Doxwell is the product — installers, wizard, tray icon, the parts you actually click. Docortex is the engine — the five layers that read documents, build a fact graph, track what changed, and expose every operation as an agent-callable tool. Same code, two surfaces: a polished product for direct users, a deployable engine for partners who want to embed it.
This post walks through the five layers. It is intentionally architectural, not a sales pitch. If you only care that it works, the download page is faster.
The split: product vs engine
The product needs to be opinionated. A first-time user shouldn’t have to choose between BM25 and hybrid retrieval before they can find their lease agreement. So Doxwell ships defaults: a local embedding model, a setup wizard that asks for an LLM provider, and a tray icon that surfaces queue counts.
The engine needs to be configurable. A partner integrating Docortex into their own workflow software cares about which embedding backend ships locally vs. in the cloud, which MCP tools are exposed vs. internal, how the fact graph is sharded by vault. Same code — different surface.
That split is also why “air-gap” is meaningful here and not marketing wallpaper. The product layer assumes a wizard; the engine layer assumes nothing. You can run Docortex with the local bge-m3 embedding backend, an on-laptop Ollama LLM, and no outbound network at all. Doxwell supports that setup too; out of the box, it points at whichever LLM provider you pick in the wizard.
The five layers
Working bottom-up:
- Ingestion + preprocessing — pull files, extract text.
- Vector index — embeddings, hybrid search.
- Entity-fact graph — who, what, when, sourced from where.
- Temporal supersession — every fact has a lifetime.
- MCP surface — every operation is an agent-callable tool.
The next five sections take each in turn.
Layer 1: ingestion + preprocessing
The boring layer, and the one where most “real-world archive” pipelines fail.
Doxwell pulls from email mailboxes, Google Drive, OpenCloud, scanner folders, messenger exports, and call-recording directories. Each source has its own quirks — Outlook email with embedded winmail.dat, scanned PDFs that need OCR fallback, image attachments inside email replies, ZIP-of-ZIPs from quarterly bank statements.
The pipeline is an 11-step DAG with dependency tracking, staleness cascade, lazy evaluation by input/output hash, and parallel workers.[§6.5] When you add a new source, the DAG figures out which downstream steps need to re-run and which can reuse cached output. This matters at 20K+ documents — re-running OCR on a corpus that already had it is the kind of mistake that costs hours.
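The caching behaviour described above can be sketched in a few lines. This is a minimal illustration, not the Docortex pipeline: the four step names, the toy step functions, and the cache layout are all assumptions made for the example. Each step is keyed by a hash of its inputs, so a step re-runs only when something upstream actually produced different output.

```python
import hashlib
from graphlib import TopologicalSorter

# Hypothetical four-step graph: step -> steps it depends on.
DEPS = {
    "extract_text": set(),
    "ocr": {"extract_text"},
    "embed": {"ocr"},
    "entities": {"ocr"},
}

# Toy stand-ins for the real work; each takes a list of input strings.
ACTIONS = {
    "extract_text": lambda inputs: inputs[0].strip(),
    "ocr": lambda inputs: inputs[0].lower(),
    "embed": lambda inputs: f"vec({len(inputs[0])})",
    "entities": lambda inputs: ",".join(sorted(set(inputs[0].split()))),
}


def digest(*parts: str) -> str:
    """Hash a step name together with its inputs."""
    h = hashlib.sha256()
    for part in parts:
        h.update(part.encode("utf-8"))
        h.update(b"\x00")  # separator so ("ab", "c") differs from ("a", "bc")
    return h.hexdigest()


def run(source: str, cache: dict) -> list:
    """Run the DAG for one source; return the steps that actually executed.

    cache maps step -> (input_hash, output) and is updated in place.
    """
    executed = []
    outputs = {}
    for step in TopologicalSorter(DEPS).static_order():
        deps = sorted(DEPS[step])
        inputs = [outputs[d] for d in deps] if deps else [source]
        input_hash = digest(step, *inputs)
        cached = cache.get(step)
        if cached is not None and cached[0] == input_hash:
            outputs[step] = cached[1]  # inputs unchanged: reuse cached output
            continue
        outputs[step] = ACTIONS[step](inputs)
        cache[step] = (input_hash, outputs[step])
        executed.append(step)
    return executed
```

In this sketch, a source that differs only in surrounding whitespace re-runs the first step, but because that step's output is unchanged, everything downstream is served from cache. A real change to the content cascades through every dependent step.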
Pipeline coverage on the live 21K-doc corpus (2026-05-19):[§3.1]
- text extraction: 97.8%
- fragment extraction: 97.8%
- page embedding: 97.8%
- OCR assessment: 96.1%
- embedding: 93.1%
- thread detection: 91.3%
- entity extraction: 91.2%
- document understanding: 70.6%
- entity resolution: 61.5%
- fact extraction: 22.8%
The steps with lower coverage lag because they require LLM calls, not because of bugs. Each step exposes its coverage as a metric so the wizard can tell you “this folder is 96% indexed, the remaining 4% are PDFs with no text layer; run OCR?”
This is the layer where pipeline branching by content class (formal email vs. messenger thread vs. invoice vs. handwritten note) actually pays off. A messenger export does not need OCR; an invoice does not need thread detection. Branching cuts both compute and noise.
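One way to picture the branching is a per-class step plan. The sketch below is illustrative only: the class labels, step names, and the fallback rule are assumptions, and the only facts taken from the text are that a messenger export skips OCR and an invoice skips thread detection.

```python
# Steps every document goes through in this sketch (names are illustrative).
COMMON_STEPS = ["text_extraction", "embedding", "entity_extraction"]

# Hypothetical per-class additions. Only two facts come from the post:
# messenger exports skip OCR, invoices skip thread detection.
CLASS_STEPS = {
    "formal_email": ["thread_detection"],
    "messenger_thread": ["thread_detection"],
    "invoice": ["ocr_assessment"],
    "handwritten_note": ["ocr_assessment"],
}


def plan_steps(content_class: str) -> list:
    """Return the ordered steps to run for a document of the given class."""
    extra = CLASS_STEPS.get(content_class)
    if extra is None:
        # Unknown class: take the conservative path and run every optional step.
        extra = sorted({s for steps in CLASS_STEPS.values() for s in steps})
    return COMMON_STEPS + extra
```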
Layer 2: vector index
The retrieval-baseline layer. This is what most people mean when they say “RAG.”
Docortex ships hybrid retrieval — BM25 + semantic + a knowledge channel that incorporates the entity-fact graph (see Layer 3). On our 21K-doc nightly harness with 210 multilingual questions, the hybrid strategy reaches NDCG@10 = 0.332 against pure BM25 at 0.244 and pure semantic at 0.242 — about a 36% relative lift over the best single-channel baseline.[§1.1] Third-party replication is in progress; the harness, golden set, and methodology are internal but reproducible from the nightly report.
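For readers who want to check the arithmetic, here is a minimal sketch of NDCG@10 and of the relative-lift figure. It assumes graded relevance with linear gain and a log2 discount, which is one common convention; the post does not state which variant the harness uses, and the function names are made up for this example.

```python
import math


def dcg(relevances, k=10):
    """Discounted cumulative gain over the top-k grades (linear gain)."""
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances[:k]))


def ndcg(ranked, judged, k=10):
    """NDCG@k: DCG of the returned ranking divided by the best achievable DCG.

    ranked: relevance grades of the results, in the order they were returned.
    judged: relevance grades of every judged document for the query.
    """
    ideal = dcg(sorted(judged, reverse=True), k)
    return dcg(ranked, k) / ideal if ideal > 0 else 0.0


def relative_lift(new, baseline):
    """Relative improvement of `new` over `baseline`, as a fraction."""
    return (new - baseline) / baseline


# Figures quoted above: hybrid 0.332 versus the best single channel, BM25 at 0.244.
lift = relative_lift(0.332, 0.244)  # (0.332 - 0.244) / 0.244, roughly 0.36
```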
Latency on the same corpus stays under 250 ms at p95 for the hot-path strategies (218 ms for understanding-boost, 216 ms for RRF).[§1.3] The multilingual penalty is essentially zero: the English-vs-native-language NDCG@10 delta is -0.009 for hybrid_full.[§1.4]
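RRF here refers to reciprocal rank fusion, a standard way to merge ranked lists from several channels without calibrating their scores against each other. The sketch below shows the textbook formula; the constant k=60 is the value commonly used in the literature, and whether Docortex uses that constant or weights its channels is not stated in this post.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked lists of document IDs into one list.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    with ranks starting at 1. Ties are broken by document ID for determinism.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Two hypothetical channels that agree on "b" and "c" but not on the rest.
bm25_hits = ["a", "b", "c"]
semantic_hits = ["b", "c", "d"]
fused = reciprocal_rank_fusion([bm25_hits, semantic_hits])
```

Documents that appear in both lists rise above documents that only one channel ranked highly, which is the behaviour a hybrid strategy relies on.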
The default text-embedding backend is BAAI/bge-m3: local, 1024-dimensional, 8192-token context, multilingual.[§6.1] OpenAI’s text-embedding-3-small/large are available as cloud options but are not the default. This matters for the air-gap claim: vectors stay on disk; nothing is shipped to a third-party embedding API unless you opt in.
We also surveyed the alternatives. LangChain, LlamaIndex, Haystack, RAGFlow — useful frameworks, none of which would have improved our NDCG on this corpus. They add abstractions, not retrieval quality.[§4.5] That informed the decision to build retrieval as a layer in Docortex rather than wrap an existing framework.
Layer 3: entity-fact graph
The layer where Docortex stops being a retrieval system and becomes a document intelligence system.
When the entity-extraction step runs, it produces named entities (people, companies, addresses, products, dates). The entity-resolution step then deduplicates them — “Hans Müller”, “H. Müller”, “hmueller@…”, and “Geschäftsführer Müller” collapse into one entity, with the evidence trail kept.
The collapse is not a single fuzzy-string heuristic. It runs as a cost-ordered cascade: a deterministic regex pre-pass extracts cheap structured signals (emails, phone numbers, tax IDs, dates) before any LLM call;[§2.1] embedding similarity scores candidate pairs; a graph walk pulls in transitive evidence from co-occurrences across the corpus. Each merge decision is recorded as a first-class event in the ledger.
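The first two stages of such a cascade can be sketched as follows. Everything here is an assumption made for illustration: the regex covers emails only, the two-dimensional vectors stand in for real embeddings, the 0.9 threshold is arbitrary, and the graph-walk stage is omitted. The point is the ordering: the cheap deterministic check runs first, and each decision is written out as an event.

```python
import math
import re
from dataclasses import dataclass

# Deliberately simple email pattern; a production pre-pass would also cover
# phone numbers, tax IDs, and dates.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


@dataclass(frozen=True)
class Mention:
    text: str
    vector: tuple  # stand-in for an embedding


def cheap_signals(text):
    """Stage 1: structured identifiers extracted without any model call."""
    return {m.lower() for m in EMAIL_RE.findall(text)}


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def decide(a, b, threshold=0.9):
    """Return (merge?, stage that decided), trying the cheapest stage first."""
    if cheap_signals(a.text) & cheap_signals(b.text):
        return True, "regex"
    if cosine(a.vector, b.vector) >= threshold:
        return True, "embedding"
    return False, "none"


def resolve(pairs):
    """Record one event per candidate pair, merged or not."""
    events = []
    for a, b in pairs:
        merged, stage = decide(a, b)
        events.append({"a": a.text, "b": b.text, "merged": merged, "stage": stage})
    return events
```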
At production scale: 38,066 entities resolved, 102,417 facts across 8,932 distinct predicates spanning 13,154 documents.[§3.1] Facts are stored as RDF-like triples — subject, predicate, object. The same triple asserted from two different documents counts as corroboration, not duplication.[§2.3]
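A small sketch of the corroboration rule, assuming an in-memory store keyed by triple. The class and method names are invented for this example and are not the Docortex API.

```python
from collections import defaultdict
from typing import NamedTuple


class Triple(NamedTuple):
    subject: str
    predicate: str
    obj: str


class FactStore:
    """Stores each distinct triple once and tracks which documents assert it."""

    def __init__(self):
        self._sources = defaultdict(set)

    def assert_fact(self, triple, doc_id):
        # A repeated triple adds a source; it never creates a second fact.
        self._sources[triple].add(doc_id)

    def corroboration(self, triple):
        """Number of distinct documents asserting this triple."""
        return len(self._sources.get(triple, ()))

    def __len__(self):
        return len(self._sources)
```

Because sources are kept as a set, ingesting the same document twice does not inflate the corroboration count.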
The graph also powers cross-document queries that pure vector retrieval can’t span. Two documents about a 2021 framework agreement and a 2024 amendment can be linked by entity and predicate even when their text similarity is low. Vectors find the surface; the graph walks the structure.
Layer 4: temporal supersession
Every fact in the graph carries a valid-from and a valid-to. When a 2024 amendment overrides a clause in a 2021 framework, the engine doesn’t overwrite the old fact — it records a supersession event, with both facts kept and the lineage explicit.
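The valid-from / valid-to idea can be sketched as below. This is an illustration under assumptions: the field names, the half-open validity interval, and the example predicate are invented here. The function returns a new view instead of editing anything in place, to mirror the rule that the old fact is kept.

```python
from dataclasses import dataclass, replace
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class Fact:
    fact_id: str
    subject: str
    predicate: str
    value: str
    valid_from: date
    valid_to: Optional[date] = None        # None means still current
    superseded_by: Optional[str] = None    # lineage pointer to the newer fact


def supersede(facts, old_id, new):
    """Return a new fact list: the old fact is closed and linked, never removed."""
    if not any(f.fact_id == old_id for f in facts):
        raise KeyError(old_id)
    closed = [
        replace(f, valid_to=new.valid_from, superseded_by=new.fact_id)
        if f.fact_id == old_id else f
        for f in facts
    ]
    return closed + [new]


def value_at(facts, subject, predicate, on):
    """Value of (subject, predicate) on a given day, or None if none applied."""
    for f in facts:
        if (f.subject == subject and f.predicate == predicate
                and f.valid_from <= on
                and (f.valid_to is None or on < f.valid_to)):
            return f.value
    return None
```

With this shape, "what did the contract say in 2022?" and "what does it say now?" are the same query with a different date.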
This is implemented by RIGOR, the append-only event ledger that sits underneath the fact store. RIGOR — Replayable Inspectable Governed Orchestration Records[§5.1] — defines 22 structural invariants (R0–R21) enforced both when events are appended and when a ledger is loaded back from disk via rigor verify.[§5.3]
Three invariants matter for supersession in particular:
- R12 — verdict-value coupling: a ruling that overrides a previous fact must carry the replacement value; a ruling that rejects must not.
- R13 — supersession ordering: superseded events stay in the ledger, they don’t disappear.
- R15 — sticky ground truth: once an event is marked as ground-truth-certainty, later rulings can comment on it but can’t silently overwrite it.
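As an illustration of what an invariant like R12 looks like in code, here is a minimal sketch of an append-only ledger that checks verdict-value coupling both on append and on a later verification pass. The verdict vocabulary ("override", "reject"), the field names, and the class shapes are assumptions for the example and are not the RIGOR schema.

```python
from dataclasses import dataclass
from typing import Optional


class InvariantError(ValueError):
    """Raised when an event would violate a ledger invariant."""


@dataclass(frozen=True)
class Ruling:
    target_event: str
    verdict: str                              # "override" or "reject" in this sketch
    replacement_value: Optional[str] = None


def check_verdict_value_coupling(ruling):
    """R12-style check: override needs a value, reject must not carry one."""
    if ruling.verdict == "override" and ruling.replacement_value is None:
        raise InvariantError("override ruling must carry a replacement value")
    if ruling.verdict == "reject" and ruling.replacement_value is not None:
        raise InvariantError("reject ruling must not carry a replacement value")


class Ledger:
    """Append-only: events can be added and re-checked, never edited or removed."""

    def __init__(self):
        self._events = []

    def append(self, ruling):
        check_verdict_value_coupling(ruling)  # enforced at write time
        self._events.append(ruling)

    def verify(self):
        for ruling in self._events:           # enforced again on a loaded ledger
            check_verdict_value_coupling(ruling)

    def __len__(self):
        return len(self._events)
```

Running the same check at append time and at load time means a ledger file edited outside the engine fails verification instead of being trusted silently.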
Replay works from disk alone. Given a run_id, rigor verify re-derives every deterministic output without making a live LLM call or hitting the network[§5.4] — the ledger plus pinned model and prompt versions are sufficient. This is what turns “we have an audit log” into “we have an audit log that an auditor can actually run.”
Doxwell exposes the substrate; the user-facing surfaces that read it (history views, version chains, change explorers) are added on top. The data model is the load-bearing piece.
Layer 5: MCP surface
Doxwell has been MCP-native from the beginning (MCP is the Model Context Protocol). It is not a chat interface with an MCP plugin bolted on later.
The engine exposes over 600 MCP tools[§6 catalogue], organized by scenario: ingestion, retrieval, entity-graph operations, fact queries, ledger replay, scope management. Any MCP-compatible client — Claude Desktop, Claude Code, Codex, n8n, or something you wrote — connects as a first-class consumer. No new SDK to learn, no proprietary API to glue against.
The reason for going MCP-native rather than chat-native is simple: agents read documents differently than humans do. A human asks one question at a time; an agent issues a broad search, then five targeted follow-ups, then a fact-graph traversal, then a ledger query for the supersession chain. Building chat first and exposing tools later forces the agent into the human-shaped flow. Building tools first and letting humans drive them through a CLI or a chat surface keeps the agent path uncramped.
Pinecone made a similar argument in their April 2025 post on knowledge infrastructure for agents — that the missing layer is not retrieval primitives but the structured, agent-callable abstraction above them. We agree with the direction. The bet we made early was that the structured layer needed to ship as MCP tools, not as a proprietary API.
Why air-gap is the design center
Most “local-first” AI products are local-friendly. They run locally if you go out of your way to configure them, but the defaults push you toward a cloud account.
Docortex flips the order. The default text-embedding backend is local. The default storage is local. The release artifacts are SHA-256 hashed and signed with minisign,[§6.4] a CycloneDX SBOM ships with every release,[§6.4] and the verify-before-run instructions are on the verify page in three platform variants.
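The hash half of verify-before-run can be done with the Python standard library alone. The sketch below assumes you have the published SHA-256 hex digest for the artifact you downloaded; the function names are made up for this example. It checks integrity only: confirming who produced the artifact still requires checking the minisign signature with the minisign tool, following the instructions on the verify page.

```python
import hashlib
import hmac


def sha256_file(path, chunk_size=1 << 20):
    """Stream a file through SHA-256 so large artifacts need not fit in memory."""
    h = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def matches_published_hash(path, expected_hex):
    """Compare the file's digest with a published hex digest, ignoring case and whitespace."""
    return hmac.compare_digest(sha256_file(path), expected_hex.strip().lower())
```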
Why air-gap-first? Because the customers who care most about document intelligence — German Kanzleien, M&A boutiques, family offices — operate under privacy regimes that don’t allow client data to leave the building. We could have built for the cloud-first customer and added “on-prem” as an afterthought; instead we built for the air-gap customer and added cloud-LLM as a routing option.
The result is also legible to less-regulated customers. You can switch to a cloud LLM provider by flipping a setting, but if you keep the local Ollama backend, nothing about your documents ever leaves your laptop. The honest framing: Doxwell is designed to run fully offline, and you choose your LLM provider, including local Ollama.
How partners integrate
Three shapes, mirroring the engine page:
- MCP server — point Claude Desktop, Claude Code, or any MCP client at the Docortex server. One config block, no new client to build. About 30 minutes to first answer.
- Embedded library — Python or TypeScript bindings for partners building their own product surface. The fact graph, ledger, and retrieval APIs are callable from your code.
- Co-branded surface — full white-label of the Doxwell desktop product with a partner’s branding and module configuration.
Licensing is module-gated: a 3-tier classification (FREE / GATED / ENCRYPTED), Ed25519 license tokens, and hardware fingerprint binding.[§6.6] Partners enable the modules they need; the rest stay encrypted on disk.
What to do next
If you want to try the engine, the download page ships the Linux AppImage today, with signed macOS and Windows packages landing inside the 1.0 line.
If you want to embed it, the engine page describes the three integration shapes and the design-partner process.
If you want to see the numbers before committing, every claim in this post traces to a row in our internal evidence catalogue; the most-cited benchmark numbers come from the nightly harness report benchmarks/nightly/REPORT-2026-05-19.md, run on our 21,702-document corpus. See /methodology for today’s nightly figures and how we publish.
Notes: superscript references like [§1.1] map to sections of our internal evidence catalogue (private). Numeric claims are reproducible from the cited benchmark report on our nightly harness; third-party replication on a shared corpus is in progress, not done.