# Testing
BlindProof uses red/green TDD throughout. Tests are the primary tool, not a safety net bolted on at the end.
## Why TDD matters more than usual here
Cryptographic and timestamping paths have silent-failure modes — wrong key derivation, wrong HMAC inputs, wrong Merkle ordering, wrong endianness on a digest. Code that is wrong in these ways runs fine and produces convincing-looking output that is simply not what it claims to be. Inspection does not catch these bugs. Tests do.
Concrete examples from the project's history:
- The `verify.py` script was passing the target file to `ots verify` instead of the `.ots` sidecar. It ran without errors for months, against fake receipts that short-circuit before reaching the real CLI. Only a first-ever real-receipt test caught `Error! 'root.bin' is not a timestamp file.`
- The Merkle tree's odd-level duplication rule differs by one hash iteration from other implementations. Getting the rule wrong produces a plausible-looking root that never matches the anchor.
If a test is hard to write for a given change, that's usually a signal that the change has a subtle shape — not that the test is being difficult.
## Workflow
For every unit of behaviour:
- Red — write the failing test first. Run it. Confirm it fails for the right reason: an `AssertionError`, not an `ImportError` or a typo.
- Green — write the minimum implementation that makes the test pass.
- Refactor — tidy once green, with the tests as a safety net.
Do not write implementation code without a failing test already in place. Don't batch several behaviours into one commit. Small, focused cycles — one behaviour, one test, one pass. This discipline is load-bearing for this codebase; the cost of skipping it is the kind of silent-failure bug described above.
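One cycle in miniature — the function and test names here are hypothetical, and the comments mark where each phase of the cycle sits:

```python
# Green phase: the smallest implementation that satisfies the test below.
def sidecar_path(target: str) -> str:
    return target + ".ots"

# Red phase came first: with sidecar_path stubbed to return "", this test
# failed with an AssertionError — the right reason — before any real code
# existed. A missing function would have failed with a NameError instead,
# which proves nothing about the behaviour under test.
def test_sidecar_path_appends_ots_suffix():
    assert sidecar_path("root.bin") == "root.bin.ots"
```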
## Test commands
### Client
```shell
uv run --with pytest --with watchdog --with cryptography --with argon2-cffi \
    pytest client/tests/ -v
```
The client's runtime deps are declared in `client/blindproof.py`'s PEP 723 header, which `uv run` uses when the script itself is invoked. Running pytest against the tests is a different entry point — those deps have to be re-stated via `--with`. When the runtime dep list grows further, add matching `--with` flags. If that becomes unwieldy, promote to `client/pyproject.toml` with dev-dep management.
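For reference, a PEP 723 inline-metadata header has this shape — the dependency list below is illustrative, not the authoritative one in `client/blindproof.py`:

```python
# /// script
# requires-python = ">=3.11"
# dependencies = [
#     "watchdog",
#     "cryptography",
#     "argon2-cffi",
# ]
# ///
```

`uv run client/blindproof.py` reads this block and resolves the listed deps; `uv run pytest …` does not, which is why the `--with` flags above repeat them.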
### Backend
The backend has its own `backend/pyproject.toml` because Django needs a proper project layout and the dev-dep story (`pytest-django`, etc.) is cleaner through uv's dependency groups. Runtime deps live in `[project.dependencies]`, dev deps in `[dependency-groups].dev`. pytest discovers `DJANGO_SETTINGS_MODULE` from `[tool.pytest.ini_options]` in that same file.
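The relevant sections of that file look roughly like this — package names and the settings-module path are placeholders, not the project's actual values:

```toml
[project]
name = "blindproof-backend"
version = "0.1.0"
dependencies = ["django"]

[dependency-groups]
dev = ["pytest", "pytest-django"]

[tool.pytest.ini_options]
DJANGO_SETTINGS_MODULE = "backend.settings"
```

With this layout, `uv run --group dev pytest backend/` pulls in the dev group and pytest-django picks up the settings module without any extra environment setup.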
### Desktop
PyObjC-dependent tests only run on macOS; CI skips them for now because PyObjC won't install on Linux runners. Wiring these into CI (gated on `macos-latest`) is a documented follow-up.
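A common way to express that gate at the top of a PyObjC-backed test module is a module-level `skipif` marker — a sketch, not necessarily how the desktop suite does it today:

```python
import sys

import pytest

# Skip every test in this module when not on macOS, since PyObjC
# (and the frameworks it binds) are unavailable elsewhere.
pytestmark = pytest.mark.skipif(
    sys.platform != "darwin",
    reason="PyObjC only installs on macOS",
)
```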
## CI
`.github/workflows/tests.yml` runs the client and backend suites on every push to `main` and on pull requests. Failing CI blocks merge.
Desktop tests are intentionally not yet in CI — the `macos-latest` runner + Homebrew framework-Python setup adds a few minutes per run, and the desktop suite is small.
## What the tests don't do
- No real OTS submissions in the test suite. `FakeOTSSubmitter` stands in for `OpenTimestampsSubmitter`; a duck-typed `FakeCalendar` returns library-shaped `Timestamp` + `PendingAttestation` objects, so production code is exercised against the same object model it sees in production, without touching the public OTS network.
- No real Bitcoin anchoring. Same reason — no network, no waiting, no flake.
- No live backend round-trips from client tests. `BackendClient` takes an injectable `transport` parameter; tests use a fake that returns canned responses. The real client runs against a real backend in the smoke-test step of Running locally.
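The injectable-transport shape is worth seeing concretely. A minimal sketch — the class bodies, method names, and endpoint path are illustrative, not the project's real `BackendClient` API:

```python
from dataclasses import dataclass, field

@dataclass
class FakeTransport:
    """Canned-response stand-in for the real HTTP transport."""
    canned: dict[str, dict]
    requests: list[tuple[str, dict]] = field(default_factory=list)

    def post(self, path: str, payload: dict) -> dict:
        self.requests.append((path, payload))  # record for assertions
        return self.canned[path]

class BackendClient:
    """Sketch of the injectable-transport pattern, not the real class."""
    def __init__(self, transport):
        self.transport = transport

    def submit_digest(self, digest_hex: str) -> dict:
        return self.transport.post("/api/submit", {"digest": digest_hex})

def test_submit_digest_posts_to_backend():
    transport = FakeTransport(canned={"/api/submit": {"status": "queued"}})
    client = BackendClient(transport)
    assert client.submit_digest("ab" * 32) == {"status": "queued"}
    assert transport.requests == [("/api/submit", {"digest": "ab" * 32})]
```

Because the fake records every call, tests can assert on what was sent as well as on how the response was handled — without a network in sight.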
The end-to-end exercise happens in production, on the daily `ots-daily` GitHub Actions schedule, and is verified against the shipped `verify.py` output on a freshly produced bundle. That's the closest thing to an integration test in the system — and it runs daily.
## Test counts (snapshot)
At the time of writing, roughly: 79 client tests + 100 backend tests + 17 desktop tests = 196 tests passing. Numbers drift as features land; check `status.md` for the current figure.
## Guidelines
- Keep tests fast. No sleeps, no real network, no real file locks. Fast tests are ones you'll actually re-run.
- Keep tests hermetic. Each test should set up and tear down its own state. A test that depends on a previous test's side effects is a flake waiting to happen.
- Prefer real objects to mocks where you can. `SnapshotStore` against an in-memory SQLite is simpler to read than a mocked `SnapshotStore`; duck-typed fakes for the OTS calendar clarify the contract better than `MagicMock`.
- One behaviour per test. Name tests as sentences — `test_capture_path_writes_ciphertext_under_uuid_filename`, not `test_capture_1`.
- Fail loudly. An assertion that silently passes when the code is wrong is worse than no test at all.