# Testing
BlindProof uses red/green TDD throughout. Tests are the primary tool, not a safety net bolted on at the end.
## Why TDD matters more than usual here
Cryptographic and timestamping paths have silent-failure modes — wrong key derivation, wrong HMAC inputs, wrong Merkle ordering, wrong endianness on a digest. Code that is wrong in these ways runs fine and produces convincing-looking output that is simply not what it claims to be. Inspection does not catch these bugs. Tests do.
Concrete examples from the project's history:
- The `verify.py` script was passing the target file to `ots verify` instead of the `.ots` sidecar. It ran without errors for months, against fake receipts that short-circuit before reaching the real CLI. Only a first-ever real-receipt test caught `Error! 'root.bin' is not a timestamp file.`
- The Merkle tree's odd-level duplication rule differs by one hash iteration from other implementations. Getting the rule wrong produces a plausible-looking root that never matches the anchor.
If a test is hard to write for a given change, that's usually a signal that the change has a subtle shape — not that the test is being difficult.
## Workflow
For every unit of behaviour:
- Red — write the failing test first. Run it. Confirm it fails for the right reason: an `AssertionError`, not an `ImportError` or a typo.
- Green — write the minimum implementation that makes the test pass.
- Refactor — tidy once green, with the tests as a safety net.
Do not write implementation code without a failing test already in place. Don't batch several behaviours into one commit. Small, focused cycles — one behaviour, one test, one pass. This discipline is load-bearing for this codebase; the cost of skipping it is the kind of silent-failure bug described above.
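One cycle in miniature — the function and test names here are hypothetical, and the comments mark where each phase of the cycle sits:

```python
# Green phase: the smallest implementation that satisfies the test below.
def sidecar_path(target: str) -> str:
    return target + ".ots"

# Red phase came first: with sidecar_path stubbed to return "", this test
# failed with an AssertionError — the right reason — before any real code
# existed. A missing function would have failed with a NameError instead,
# which proves nothing about the behaviour under test.
def test_sidecar_path_appends_ots_suffix():
    assert sidecar_path("root.bin") == "root.bin.ots"
```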
## Test commands
### Client
```shell
uv run --with pytest --with watchdog --with cryptography --with argon2-cffi \
    pytest client/tests/ -v
```
The client's runtime deps are declared in `client/blindproof.py`'s PEP 723 header, which `uv run` uses when the script itself is invoked. Running pytest against the tests is a different entry point — those deps have to be re-stated via `--with`. When the runtime dep list grows further, add matching `--with` flags. If that becomes unwieldy, promote to `client/pyproject.toml` with dev-dep management.
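For reference, a PEP 723 inline-metadata header has this shape — the dependency list below is illustrative, not the authoritative one in `client/blindproof.py`:

```python
# /// script
# requires-python = ">=3.11"
# dependencies = [
#     "watchdog",
#     "cryptography",
#     "argon2-cffi",
# ]
# ///
```

`uv run client/blindproof.py` reads this block and resolves the listed deps; `uv run pytest …` does not, which is why the `--with` flags above repeat them.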
### Backend
The backend has its own `backend/pyproject.toml` because Django needs a proper project layout and the dev-dep story (`pytest-django`, etc.) is cleaner through uv's dependency groups. Runtime deps live in `[project.dependencies]`, dev deps in `[dependency-groups].dev`. pytest discovers `DJANGO_SETTINGS_MODULE` from `[tool.pytest.ini_options]` in that same file.
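The relevant sections of that file look roughly like this — package names and the settings-module path are placeholders, not the project's actual values:

```toml
[project]
name = "blindproof-backend"
version = "0.1.0"
dependencies = ["django"]

[dependency-groups]
dev = ["pytest", "pytest-django"]

[tool.pytest.ini_options]
DJANGO_SETTINGS_MODULE = "backend.settings"
```

With this layout, `uv run --group dev pytest backend/` pulls in the dev group and pytest-django picks up the settings module without any extra environment setup.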
### Desktop
PyObjC-dependent tests only run on macOS; CI skips them for now because PyObjC won't install on Linux runners. Wiring these into CI (gated on `macos-latest`) is a documented follow-up.
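A common way to express that gate at the top of a PyObjC-backed test module is a module-level `skipif` marker — a sketch, not necessarily how the desktop suite does it today:

```python
import sys

import pytest

# Skip every test in this module when not on macOS, since PyObjC
# (and the frameworks it binds) are unavailable elsewhere.
pytestmark = pytest.mark.skipif(
    sys.platform != "darwin",
    reason="PyObjC only installs on macOS",
)
```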
## CI
`.github/workflows/tests.yml` runs the client and backend suites on every push to `main` and on pull requests. Failing CI blocks merge.
Desktop tests are intentionally not yet in CI — the `macos-latest` runner + Homebrew framework-Python setup adds a few minutes per run, and the desktop suite is small.
## What the tests don't do
- No real OTS submissions in the test suite. `FakeOTSSubmitter` stands in for `OpenTimestampsSubmitter`; a duck-typed `FakeCalendar` returns library-shaped `Timestamp` + `PendingAttestation` objects, so production code is exercised against the same object model it sees in production, without touching the public OTS network.
- No real Bitcoin anchoring. Same reason — no network, no waiting, no flake.
- No live backend round-trips from client tests. `BackendClient` takes an injectable `transport` parameter; tests use a fake that returns canned responses. The real client runs against a real backend in the smoke-test step of Running locally.
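The injectable-transport shape is worth seeing concretely. A minimal sketch — the class bodies, method names, and endpoint path are illustrative, not the project's real `BackendClient` API:

```python
from dataclasses import dataclass, field

@dataclass
class FakeTransport:
    """Canned-response stand-in for the real HTTP transport."""
    canned: dict[str, dict]
    requests: list[tuple[str, dict]] = field(default_factory=list)

    def post(self, path: str, payload: dict) -> dict:
        self.requests.append((path, payload))  # record for assertions
        return self.canned[path]

class BackendClient:
    """Sketch of the injectable-transport pattern, not the real class."""
    def __init__(self, transport):
        self.transport = transport

    def submit_digest(self, digest_hex: str) -> dict:
        return self.transport.post("/api/submit", {"digest": digest_hex})

def test_submit_digest_posts_to_backend():
    transport = FakeTransport(canned={"/api/submit": {"status": "queued"}})
    client = BackendClient(transport)
    assert client.submit_digest("ab" * 32) == {"status": "queued"}
    assert transport.requests == [("/api/submit", {"digest": "ab" * 32})]
```

Because the fake records every call, tests can assert on what was sent as well as on how the response was handled — without a network in sight.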
The end-to-end exercise happens in production, on the daily `ots-daily` GitHub Actions schedule, and is verified against the shipped `verify.py` output on a freshly produced bundle. That's the closest thing to an integration test in the system — and it runs daily.
## Test counts (snapshot)
At the time of writing, roughly: 79 client tests + 100 backend tests + 17 desktop tests = 196 tests passing. Numbers drift as features land; check `status.md` for the current figure.
## Guidelines
- Keep tests fast. No sleeps, no real network, no real file locks. Fast tests are ones you'll actually re-run.
- Keep tests hermetic. Each test should set up and tear down its own state. A test that depends on a previous test's side effects is a flake waiting to happen.
- Prefer real objects to mocks where you can. `SnapshotStore` against an in-memory SQLite is simpler to read than a mocked `SnapshotStore`; duck-typed fakes for the OTS calendar clarify the contract better than `MagicMock`.
- One behaviour per test. Name tests as sentences — `test_capture_path_writes_ciphertext_under_uuid_filename`, not `test_capture_1`.
- Fail loudly. An assertion that silently passes when the code is wrong is worse than no test at all.