Methodology

How we measure

We run a nightly harness against our internal 21,702-document corpus. When the numbers move, this page moves.

Last measured: May 19, 2026.

We publish what the nightly harness reports. When it improves, we update. When it regresses, we say so.

Today's numbers

Metric	Value	Note
NDCG@10 (hybrid retrieval)	0.332	+36% over BM25 (0.244) and +37% over semantic (0.242), measured on our 21,702-doc nightly corpus
Latency p50	218 ms	understanding_boost stage
Documents indexed	21,702	nightly corpus
Entities resolved	38,066	from entity-resolution pipeline
Facts extracted	102,417	across 8,932 distinct predicates
RIGOR invariants	22 (R0–R21)	all green on most recent nightly
MCP tools	618	shipped surface
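
For readers who want to check the arithmetic in the NDCG@10 row, here is a minimal sketch of how such a figure and its relative lift can be computed. It is not the harness code: the function names are hypothetical, the gain is linear (some implementations use 2^rel - 1), and the ideal ordering is taken from the retrieved list only, which is a simplification. Only the three scores come from the table.

```python
import math

def dcg_at_k(relevances, k=10):
    """Discounted cumulative gain over the top-k graded relevances, in ranked order."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))

def ndcg_at_k(relevances, k=10):
    """DCG of the ranking divided by the DCG of the ideal (descending) ordering."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0

def relative_lift(new, baseline):
    """Relative improvement of `new` over `baseline`, as a fraction."""
    return (new - baseline) / baseline

# Scores as published in the table above.
hybrid, bm25, semantic = 0.332, 0.244, 0.242
lift_bm25 = relative_lift(hybrid, bm25)
lift_semantic = relative_lift(hybrid, semantic)
print(f"{lift_bm25:+.0%} over BM25, {lift_semantic:+.0%} over semantic")
```

The published percentages are relative lifts, not absolute differences: the absolute gap between 0.332 and 0.244 is 0.088 NDCG points.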

How we read these

The corpus is real production data: the Condor Immoprojekt design-partner deployment plus the author's personal archive, disguised. It is not a synthetic golden set; these are the same documents Doxwell runs on every day. The harness rebuilds the same set of 21,702 documents after every meaningful index change, runs the full retrieval matrix, and writes a dated report.

The numbers in the table above are what the harness emitted on May 19, 2026. When a new nightly produces meaningfully different numbers, the measuredAt date on this page moves with them.
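
The update rule above can be sketched as follows. This is an illustration under stated assumptions, not the page's actual publishing code: `refresh_page`, the dictionary layout, and the 1% relative tolerance standing in for "meaningfully different" are all hypothetical. Only `measuredAt` and the May 19 values come from this page.

```python
from datetime import date

def refresh_page(published, nightly, run_date, rel_tol=0.01):
    """Return the page state after a nightly run.

    published: {"measuredAt": date, "metrics": {name: value}}
    nightly:   {name: value} emitted by the latest run
    rel_tol:   hypothetical threshold for "meaningfully different"
    """
    old = published["metrics"]
    moved = any(
        name not in old or abs(value - old[name]) > rel_tol * abs(old[name])
        for name, value in nightly.items()
    )
    if not moved:
        return published  # within tolerance: numbers and date stay as published
    return {"measuredAt": run_date, "metrics": dict(nightly)}

page = {"measuredAt": date(2026, 5, 19),
        "metrics": {"ndcg@10": 0.332, "latency_p50_ms": 218}}

# The nightly values below are invented for illustration, not measured results.
same = refresh_page(page, {"ndcg@10": 0.332, "latency_p50_ms": 219}, date(2026, 5, 20))
moved = refresh_page(page, {"ndcg@10": 0.341, "latency_p50_ms": 218}, date(2026, 5, 21))
print(same["measuredAt"], moved["measuredAt"])
```

The point of the design is that the date and the numbers travel together: the date never advances without a new set of numbers behind it.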

Where we know there's room

We publish the open problems below because they are the design space we're actively working in. Every nightly run measures whether we've closed the gap.

Entity coreference at long range (cross-document, >5 sources)
Multilingual delta on de-DE corpora vs en-US
Thread-splitting recall — current precision/recall trade-off documented

The tooling we built to ship this

Shipping a 21,702-document engine with one developer required tooling that didn't exist. We open-sourced the parts that turned out generally useful and kept domain-specific internals private. Both lower the cost of moving fast without losing track.

Open source

Permissively licensed, on github.com/faxik.

codebugs
faxik/codebugs ↗

AI-native finding tracker. SQLite + MCP server. Where we record every defect, refactor question, and architectural decision for triage.

duphunter
faxik/duphunter ↗

AST-driven near-duplicate detector for Python. Keeps the codebase from accumulating subtle copy-paste drift.

smart-approve
faxik/smart-approve ↗

Rule-first, AST-aware permission hook for Claude Code's Bash tool. Lets a human stay out of the loop for safe operations.

git-split
faxik/git-split ↗

Split files while preserving git-blame history. Used during refactors that lift shared modules out of monoliths.

Private tooling

Internal tools coordinate parallel worktrees, route MCP servers per project, and orchestrate multi-agent codebase work. Domain-specific; less generally useful; not (yet) open-sourced.

Tests

Over 27,000 tests across the engine repositories (27,995 collected on the most recent sweep). The nightly harness runs them all before publishing the numbers above. If they don't pass, this page doesn't update.
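
A gate of this kind can be sketched in a few lines. The names `SweepResult` and `may_publish` are hypothetical, and treating "pass" as zero failures and zero errors (with skipped tests ignored) is an assumption; the real harness may define it differently.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SweepResult:
    collected: int  # tests discovered by the sweep
    failed: int     # tests that ran and failed
    errors: int     # tests that could not run (setup or collection errors)

def may_publish(sweep):
    """Allow a page update only for a non-empty sweep with no failures or errors."""
    return sweep.collected > 0 and sweep.failed == 0 and sweep.errors == 0

green = SweepResult(collected=27995, failed=0, errors=0)
red = SweepResult(collected=27995, failed=1, errors=0)  # hypothetical failing run
print(may_publish(green), may_publish(red))
```

Requiring `collected > 0` keeps an empty or broken collection step from counting as a pass.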

Replication and verification

A public BEIR / FinanceBench replication is on our roadmap.
The corpus is private (design-partner NDA).
The harness scripts live in our autosorter repo.
We can run the harness against a customer-supplied corpus under pilot agreement — see /pricing.

Versions

Doxwell 1.0 "Happy Birthday"

Released April 25, 2026

First public release. Installers for Linux / macOS / Windows. Signed binaries with minisign.

Doxwell 1.1

Nightly testing

Improvements to entity coreference, multilingual delta, and thread-splitting recall. No release date committed.

Doxwell 1.2

Design phase

Scope not yet committed.

We don't commit to deltas from upcoming versions. We commit to publishing the numbers when they land.