MathLedger Cold-Start Audit Report
Date: January 3, 2026
Auditor Perspective: External safety lead/auditor/potential acquirer with no prior context

Part 1 — First Contact (10–15 seconds)
What I See Above the Fold
Headings:
- "MathLedger — Version v0.2.1"
- "Version v0.2.1 Archive" with "ARCHIVE" badge
- "Hosted Interactive Demo"
- Green-highlighted box at top showing "Status: CURRENT"
- Tag: v0.2.1-cohesion | Commit: 27a94c8a5813 | Locked: 2026-01-03
- "Tier A (enforced): 10"
- "Tier B (logged): 1"
- "Tier C (aspirational): 3"
- "What this version cannot enforce" list:
- No Lean/Z3 verifier: FV claims always return ABSTAINED
- Single template partitioner: no multi-model consensus
- No learning loop: RFL not active
- MV edge cases: overflow, float precision not fully covered
- Navigation tabs: Scope, Explanation, Invariants, Fixtures, Evidence, All Versions
- Large green button: "Open Interactive Demo"
- Link: "5-minute auditor checklist" (described as "New to MathLedger? Start with...")
- Statement: "Interactive demo is hosted; archive remains immutable."
- Statement: "This is the archive for MathLedger version v0.2.1. All artifacts below are static, verifiable, and immutable."
After ~10 Seconds: What I Believe This Project Is
What I believe this project is: MathLedger appears to be a versioned epistemic archive system that demonstrates some form of mathematical or logical claim verification with explicit enforcement tiers. It presents itself as an immutable, verifiable artifact with a hosted interactive demo. The emphasis on "what this version cannot enforce" and the tier system (A/B/C) suggests it's a rigorous demonstration of claim verification with explicit limitations.

What I believe it is explicitly NOT:
- NOT a production-ready verification system (given the prominent warnings about missing Lean/Z3 verifier, no multi-model consensus, no learning loop)
- NOT claiming to verify all mathematical edge cases (explicitly calls out overflow and float precision gaps)
- NOT a general-purpose AI demo (the language is unusually technical and limitation-focused)
- NOT making implicit promises (the "What this version cannot enforce" section is more prominent than typical feature lists)
- "What this version cannot enforce" (unusual negative framing)
- "static, verifiable, and immutable"
- "Tier A (enforced): 10 / Tier B (logged): 1 / Tier C (aspirational): 3"
- "archive remains immutable"
Part 2 — Archive Understanding
Scope Lock Exploration
What v0 Demonstrates: The Scope Lock document states that "v0 is a governance demo" that demonstrates:
- UVIL (User-Verified Input Loop): Human binding mechanism for authority
- Trust Classes: FV, MV, PA (authority-bearing) vs ADV (exploration-only)
- Dual Attestation Roots: U_t (UI), R_t (reasoning), H_t (composite)
- Determinism: Same inputs produce same outputs, replayable
- Exploration/Authority Boundary: DraftProposal never enters hash-committed paths
What v0 explicitly excludes:
- RFL learning loop: No curriculum, no policy, no uplift
- Multi-model arena: Single template partitioner, no LiteLLM, no model competition
- Agent tools: No code execution, no sandbox, no E2B
- Real verifier: No Lean, no Z3, no mechanical proof checking
- Production auth: No user accounts, no API keys, no rate limiting
- Persistence: In-memory only, restart-loss accepted
- Long-running agents: Synchronous request/response only
The document explicitly states what v0 IS and what it does NOT show:
- It shows: The boundary between exploration and authority is real, testable, and replayable; the system stops when it cannot verify
- It does NOT show: "That the system is intelligent / That the system is aligned / That the system is safe / That verification works (v0 has no verifier)"
MV outcome semantics:
- VERIFIED: MV claim parsed AND arithmetic confirmed (e.g., "2 + 2 = 4")
- REFUTED: MV claim parsed AND arithmetic failed (e.g., "2 + 2 = 5")
- ABSTAINED: Cannot mechanically verify (PA, FV, unparseable MV, ADV-only)
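To make these semantics concrete, here is a minimal sketch of a validator with exactly these three outcomes, assuming a toy claim grammar of the form "a <op> b = c". The function and type names (check_mv_claim, Outcome) are illustrative, not taken from the MathLedger codebase.

```python
import re
from enum import Enum

class Outcome(Enum):
    VERIFIED = "VERIFIED"
    REFUTED = "REFUTED"
    ABSTAINED = "ABSTAINED"

# Hypothetical arithmetic claim grammar: "a <op> b = c" over integers.
_CLAIM = re.compile(r"^\s*(-?\d+)\s*([+\-*])\s*(-?\d+)\s*=\s*(-?\d+)\s*$")

def check_mv_claim(claim: str) -> Outcome:
    m = _CLAIM.match(claim)
    if m is None:
        # Unparseable MV (and PA, FV, ADV-only claims) must ABSTAIN,
        # never be silently converted to VERIFIED.
        return Outcome.ABSTAINED
    a, op, b, c = int(m.group(1)), m.group(2), int(m.group(3)), int(m.group(4))
    result = {"+": a + b, "-": a - b, "*": a * b}[op]
    return Outcome.VERIFIED if result == c else Outcome.REFUTED

assert check_mv_claim("2 + 2 = 4") is Outcome.VERIFIED
assert check_mv_claim("2 + 2 = 5") is Outcome.REFUTED
assert check_mv_claim("every even number > 2 is a sum of two primes") is Outcome.ABSTAINED
```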
Explanation Page Exploration
Core Mechanism: The Explanation document describes how the demo separates exploration from authority:
- Exploration phase: System generates a DraftProposal with random identifier; nothing is committed; editing is free
- Authority phase: When user clicks "Commit," system creates CommittedPartitionSnapshot with content-derived identifier (hash); claims become immutable
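For concreteness, a minimal sketch of this two-phase identifier scheme. The class names DraftProposal and CommittedPartitionSnapshot appear in the archive; the field names, canonicalization, and hashing details below are assumptions.

```python
import hashlib
import json
import uuid
from dataclasses import dataclass

@dataclass
class DraftProposal:
    claims: list[str]
    draft_id: str  # random identifier: carries no authority, free to edit or discard

def new_draft(claims: list[str]) -> DraftProposal:
    return DraftProposal(claims=claims, draft_id=uuid.uuid4().hex)

@dataclass(frozen=True)  # immutable once committed
class CommittedPartitionSnapshot:
    claims: tuple[str, ...]
    snapshot_id: str  # content-derived: identical claims always yield the same id

def commit(draft: DraftProposal) -> CommittedPartitionSnapshot:
    # The id is derived from canonicalized content, not from draft_id, so
    # exploration identifiers never enter the hash-committed path.
    canonical = json.dumps(sorted(draft.claims), separators=(",", ":"))
    digest = hashlib.sha256(canonical.encode("utf-8")).hexdigest()
    return CommittedPartitionSnapshot(claims=tuple(draft.claims), snapshot_id=digest)
```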
Three outcomes:
- VERIFIED: machine-checkable proof that claim holds
- REFUTED: machine-checkable proof that claim does not hold
- ABSTAINED: system did not find either (described as "not a failure" but "the correct output")
- U_t: UI Merkle root (commits to human actions)
- R_t: Reasoning Merkle root (commits to system's established claims)
- H_t: Composite root (binds U_t and R_t together)
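A sketch of the composite-root binding as published. Only the formula H_t = SHA256(R_t || U_t) comes from the archive; the toy Merkle construction and the example leaves are assumptions.

```python
import hashlib

def merkle_root(leaves: list[bytes]) -> bytes:
    """Toy Merkle root: hash leaves, then pairwise-hash until one node remains."""
    nodes = [hashlib.sha256(leaf).digest() for leaf in leaves] or [hashlib.sha256(b"").digest()]
    while len(nodes) > 1:
        if len(nodes) % 2:
            nodes.append(nodes[-1])  # duplicate last node on odd-sized levels
        nodes = [hashlib.sha256(a + b).digest() for a, b in zip(nodes[::2], nodes[1::2])]
    return nodes[0]

u_t = merkle_root([b"user:commit:snapshot-1"])       # U_t over human actions (example leaf)
r_t = merkle_root([b'claim:"2 + 2 = 4":VERIFIED'])   # R_t over established claims (example leaf)
h_t = hashlib.sha256(r_t + u_t).hexdigest()          # H_t = SHA256(R_t || U_t)
```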
What the attestation structure does NOT prove:
- System is aligned with human values
- System is intelligent
- System is safe
- System will behave well in novel situations
Invariants Page Exploration
Document Description: "This document provides a brutally honest classification of governance invariants."

Tier Classification System:
- Tier A: Cryptographically or structurally enforced. Violation is impossible without detection. (10 invariants)
- Tier B: Logged and replay-visible. Violation is detectable but not prevented. (1 invariant)
- Tier C: Documented but not enforced in v0. Aspirational. (3 invariants)
The 10 Tier A invariants:
- Canonicalization Determinism
- H_t = SHA256(R_t || U_t)
- ADV Excluded from R_t
- Content-Derived IDs
- Replay Uses Same Code Paths
- Double-Commit Returns 409
- No Silent Authority
- Trust-Class Monotonicity
- Abstention Preservation
- Audit Surface Version Field
The Tier B invariant:
- MV Validator Correctness (edge cases) - logged but not hard-gated
The 3 Tier C invariants:
- FV Mechanical Verification (no Lean/Z3 verifier)
- Multi-Model Consensus (single template partitioner)
- RFL Integration (no learning loop)
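For concreteness, a sketch of what the Canonicalization Determinism invariant requires: logically identical payloads must serialize to identical bytes before hashing. Sorted-key compact JSON stands in here for the RFC 8785-style canonicalization the site mentions; this is not MathLedger's actual code.

```python
import hashlib
import json

def canonical_hash(payload: dict) -> str:
    # Deterministic serialization: sorted keys, no whitespace, UTF-8.
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"), ensure_ascii=False)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

a = {"claim": "2 + 2 = 4", "trust_class": "MV"}
b = {"trust_class": "MV", "claim": "2 + 2 = 4"}  # same content, different key order
assert canonical_hash(a) == canonical_hash(b)    # determinism: one content, one hash
```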
Fixtures Page Exploration
Description: Regression test fixtures for version v0.2.1. Each fixture contains input and expected output JSON files.

9 Test Fixtures:
- adv_only
- mixed_mv_adv
- mv_arithmetic_refuted
- mv_arithmetic_verified
- mv_only
- pa_only
- same_claim_as_adv
- same_claim_as_pa
- underdetermined_navier_stokes
- Checksum verification available via index.json with SHA256 checksums
- Regression harness can be run locally with: uv run python tools/run_demo_cases.py
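A sketch of what local checksum verification could look like, assuming index.json is a flat mapping of relative file paths to SHA256 hex digests; the archive states that checksums exist but the excerpt does not show the schema.

```python
import hashlib
import json
from pathlib import Path

def verify_fixtures(fixture_dir: Path) -> bool:
    """Recompute each fixture file's SHA256 and compare against index.json."""
    index = json.loads((fixture_dir / "index.json").read_text())
    ok = True
    for rel_path, expected in index.items():  # assumed layout: {"path": "<hex digest>"}
        actual = hashlib.sha256((fixture_dir / rel_path).read_bytes()).hexdigest()
        if actual != expected:
            print(f"MISMATCH {rel_path}: expected {expected}, got {actual}")
            ok = False
    return ok
```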
Evidence Pack Page Exploration
Purpose: "The evidence pack enables independent replay verification. An auditor can recompute attestation hashes without running the demo." What Replay Verification Proves:- The recorded hashes match what the inputs produce
- The attestation trail has not been tampered with
- Determinism: same inputs produce same outputs
What Replay Verification Does NOT Prove:
- That the claims are true
- That the verification was sound
- That the system behaved safely
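A minimal replay sketch within those stated limits. The evidence-pack field names used here ("claims", "ui_events", "h_t") are assumptions; only the composite formula H_t = SHA256(R_t || U_t) comes from the archive.

```python
import hashlib
import json

def stream_root(items: list[str]) -> bytes:
    """Toy stand-in for a Merkle root: hash of the canonicalized item list."""
    canonical = json.dumps(items, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).digest()

def replay_verify(pack: dict) -> str:
    r_t = stream_root(pack["claims"])     # reasoning root from recorded claims
    u_t = stream_root(pack["ui_events"])  # UI root from recorded human actions
    h_t = hashlib.sha256(r_t + u_t).hexdigest()  # H_t = SHA256(R_t || U_t)
    # PASS asserts structural integrity only: the trail matches its inputs.
    # It does not assert the claims are true or the verification was sound.
    return "PASS" if h_t == pack["h_t"] else "FAIL"
```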
Part 2 Summary: What MathLedger Claims and Refuses to Claim
What MathLedger Claims:- It is a governance substrate demo, not a capability demo
- It demonstrates a structural boundary between exploration and authority that is cryptographically enforced
- It provides deterministic, replayable attestations with content-derived identifiers
- It has 10 Tier A invariants that are cryptographically or structurally enforced
- The system's governance is legible - you can see what the human committed and what the system established
- Replay verification proves structural integrity (that the audit trail is intact)
What MathLedger Refuses to Claim:
- That the system is intelligent, aligned, or safe
- That verification works (v0 has no verifier - all FV claims return ABSTAINED)
- That it demonstrates capability (explicitly: "not a capability demo")
- That replay verification proves truth or soundness (only structural integrity)
- That it generalizes to production or handles all edge cases
- That it has multi-model consensus, learning loops, or real mechanical verification
The most striking feature is the prominence of non-claims and limitations. The homepage displays "What this version cannot enforce" in red warning text before any feature descriptions. Every major document includes explicit sections on what is NOT claimed or NOT proven. The Invariants document is titled "brutally honest classification" and includes a column for "How It Can Be Violated Today." The Evidence Pack page explicitly states what replay does NOT prove. This negative framing is far more prominent than in typical AI demos, which usually emphasize capabilities.
The tier system (A/B/C) provides unusual transparency about enforcement levels - most systems would claim everything is "secure" without distinguishing between cryptographically enforced, logged-but-not-prevented, and aspirational invariants.
The terminology discipline is rigorous: ABSTAINED is treated as a "first-class outcome" rather than a failure, and the system explicitly refuses to return VERIFIED when it cannot mechanically verify.
What Feels Confusing or Underspecified:
- Target audience ambiguity: The site assumes significant technical sophistication (terms like "Merkle root," "RFC 8785-style canonicalization," "content-derived IDs") without clear onboarding for different audience levels.
- "Governance demo" framing: While the site repeatedly states this is a "governance substrate demo," it's not immediately clear what problem this solves or why governance without capability matters. The value proposition requires significant inference.
- Missing context for "FM": The Invariants page references "FM Section" repeatedly (§1.5, §4, etc.) but doesn't explain what FM is or link to it.
- UVIL acronym: "User-Verified Input Loop" is mentioned but not explained in depth on the homepage - requires clicking through to understand.
- Transition to demo: While the green "Open Interactive Demo" button is visible, there's no clear narrative bridge explaining "now that you understand the governance claims, here's how to see them in action."
What took time to understand:
- Understanding the distinction between "authority-bearing" and "exploration-only" required reading multiple documents
- Grasping why a demo with no real verifier is valuable took time - the point is governance infrastructure, not verification capability
- The relationship between the archive (static documentation) and the demo (interactive) wasn't immediately clear
- Understanding what "epistemic archive" means in practice
- Connecting the tier system to actual enforcement mechanisms required reading code examples
Part 3 — Transition to the Demo
How the Site Points to the Interactive Demo
Link Visibility: The link is obvious. There is a large green button labeled "Open Interactive Demo" prominently displayed in a light green box titled "Hosted Interactive Demo" near the top of the archive page. The button uses high contrast (dark green on light green background) and is the only call-to-action button on the page.

Framing: The framing is clear but minimal. The box states: "Interactive demo is hosted; archive remains immutable." This establishes the relationship between the two artifacts (demo is live, archive is static) but doesn't explain what the demo will show or why you should use it.

Below the button, there's a secondary link: "New to MathLedger? Start with the 5-minute auditor checklist" - this provides an alternative entry point for newcomers.
Continuation vs. Separate Thing: It feels like intentionally separated but related artifacts. The framing "Interactive demo is hosted; archive remains immutable" explicitly distinguishes them. The archive is presented as the authoritative, immutable documentation, while the demo is positioned as a separate hosted application. This separation feels deliberate - the archive documents what the demo does, and the demo demonstrates what the archive describes.

Transition Feel
The transition feels intentionally gated rather than seamless. The archive requires you to understand the governance claims (Scope Lock, Explanation, Invariants) before you interact with the demo. There's no narrative flow like "Now let's see this in action!" - instead, the demo is presented as a parallel artifact that you can access when ready.
The transition is not confusing - the button is clear and the relationship is stated. However, it is somewhat technical in that it assumes you understand what "hosted" vs "immutable archive" means and why that distinction matters.
Part 4 — Interactive Demo Evaluation
Initial Demo Interface Observations
Header Banner: Black banner at top: "GOVERNANCE DEMO (not capability)" with version info: v0.2.0 | v0.2.0-demo-lock | 27a94c8a5813

Framing Text (above the fold): Three key statements displayed prominently:
- "The system does not decide what is true. It decides what is justified under a declared verification route."
- "This demo will stop more often than you expect. It reports what it cannot verify."
- "If you are looking for a system that always has an answer, this demo is not it." (in italics)
- Title: "Same Claim, Different Authority"
- Button: "Run 90-Second Proof" (white button with red background)
- Scenario dropdown with 6 options (enumerated under Additional Demo Interface Observations below)
- Left panel: "EXPLORATION STREAM (NOT AUTHORITY)" - currently shows "Select a scenario or enter custom input."
- Right panel: "AUTHORITY STREAM (BOUND)" - currently shows "Nothing committed yet. Authority stream is empty."
Trust class legend:
- FV: Formal proof (ABSTAINED in v0)
- MV: Mechanical validation (arithmetic only)
- PA: User attestation (ABSTAINED)
- ADV: Advisory (excluded from R_t)
"Same Claim, Different Authority" Demo Observations
Timing: The demo completed in approximately 5-8 seconds (not 90 seconds as the button name suggests). The animation showed results appearing sequentially with brief pauses between each item.

Animation Sequence:
- Button changed to "Running..." state
- Four claims appeared sequentially in a dark box:
- Button changed to "Run Again" when complete
Each claim line showed:
- Trust class label (ADV, PA, MV)
- Claim text in quotes
- Arrow separator
- Outcome in colored text (ABSTAINED in orange, VERIFIED in green, REFUTED in red)
- Explanation text below (e.g., "Excluded from authority stream", "Arithmetic validator confirmed")
Outcome color coding:
- ABSTAINED: Orange text
- VERIFIED: Green text
- REFUTED: Red text
Clarity assessment:
- The color coding (green = good, red = bad, orange = neutral/uncertain) is intuitive
- The explanation text for each outcome provides context
- The summary statement clarifies the point being demonstrated
- However, understanding WHY ADV is excluded or WHY PA returns ABSTAINED requires reading the documentation
Additional Demo Interface Observations
Two-Stream Display: The demo interface clearly separates:
- EXPLORATION STREAM (NOT AUTHORITY): Left panel showing "Select a scenario or enter custom input."
- AUTHORITY STREAM (BOUND): Right panel showing "Nothing committed yet. Authority stream is empty."
The 6 scenario dropdown options:
- MV Only (Mechanically Validated)
- Mixed MV + ADV
- PA Only (User Attestation)
- ADV Only (Exploration)
- Underdetermined (Open Problem)
- Custom Input
Observed behavior:
- Animation timing: ~5-8 seconds (not 90 seconds as button name suggests)
- Visual clarity: High - color coding, clear labels, explanation text
- Understandability: Moderate - basic outcomes are clear, but deeper understanding of WHY requires documentation
- Errors: None observed
- Delays: None observed
- Unexpected behavior: Button name "Run 90-Second Proof" is misleading about timing
Part 5 — Coherence Check
Do mathledger.ai (archive) and mathledger.ai/demo Feel Like One Coherent System?
Answer: One coherent epistemic system. The archive and demo feel like intentionally separated but tightly coupled artifacts that form a coherent whole. They are not "stitched together" - they appear designed from the start to work as a documentation-demonstration pair.
Evidence of Coherence:
- Consistent Terminology
- Matching Non-Claims
- Demo Behavior Matches Archive's Non-Claims
- Cross-Linking
- Shared Framing
Evidence of Seams (minor):
- Version number discrepancy: The demo shows "v0.2.0" while the archive shows "v0.2.1" - this suggests the demo may not have been updated to the latest archive version, though both reference the same commit hash (27a94c8a5813).
- "90-Second Proof" naming: The button name doesn't match the actual timing (~5-8 seconds), which could indicate the demo was updated but the button label wasn't changed, or the name refers to something other than wall-clock time.
Part 6 — Acquirer / Safety Lead Lens
What Feels Unusually Rigorous or Novel?
1. Negative Capability Framing
The prominence of non-claims and limitations is unprecedented in AI demos. Most systems lead with capabilities; MathLedger leads with "What this version cannot enforce" in red warning text on the homepage. This inverted framing is a genuine novelty in the field.

2. Tiered Enforcement Transparency
The Tier A/B/C classification system provides unusual granularity about what is actually enforced versus aspirational. The "How It Can Be Violated Today" column in the Invariants table is something I have never seen in production systems, let alone demos. This level of attack surface transparency is typically reserved for internal security documentation.

3. Abstention as First-Class Outcome
Treating ABSTAINED as a legitimate, non-failure outcome - and structurally enforcing that it cannot be silently converted to a claim - is a rigorous design choice. Most AI systems are incentivized to always produce an answer; MathLedger's architecture makes "I don't know" a core feature.

4. Exploration/Authority Boundary Enforcement
The structural separation between DraftProposal (with random IDs) and CommittedPartitionSnapshot (with content-derived IDs) is more than documentation - it's enforced in the code with ValueError exceptions. The claim that "exploration identifiers never appear in committed data" is verifiable.

5. Replay Verification with Explicit Non-Claims
The evidence pack system provides cryptographic replay verification while explicitly stating it proves "structural integrity, not truth." This distinction between "the audit trail is intact" and "the claims are correct" is philosophically sophisticated and rarely articulated this clearly.

6. Immutable Versioned Archives
The epistemic archive concept - where each version is locked with commit hashes, checksums, and explicit scope locks - creates a verifiable historical record. The "Date Locked" timestamps and "This is an epistemic archive. Content is immutable once published" footer establish accountability.

7. Governance Without Capability
The framing as a "governance substrate demo, not capability demo" is conceptually novel. Most AI safety work focuses on making capable systems safe; MathLedger demonstrates governance infrastructure before adding capability. This is architecturally backwards from typical AI development, and that's the point.

What Feels Unfinished, Underspecified, or Missing?
1. No Failed Verification Examples
The site shows VERIFIED, REFUTED, and ABSTAINED outcomes, but doesn't demonstrate what happens when verification infrastructure itself fails (e.g., validator crashes, hash computation errors, Byzantine failures). A "failure modes" page would strengthen credibility.

2. Threat Model Absence
There is no explicit threat model. Who is the adversary? What attacks is the system designed to resist? What attacks is it explicitly NOT designed to resist? The Invariants page shows "How It Can Be Violated Today" but doesn't frame this as adversarial threat modeling.

3. Missing "For Auditors" Entry Point Clarity
While there's a "5-minute auditor checklist" link, it's not prominent enough. An auditor landing on the homepage would need to infer that this is the right starting point. A clearer "If you're auditing this system, start here" banner would help.

4. No Comparison to Existing Standards
The site doesn't position MathLedger relative to existing audit standards (SOC 2, ISO 27001), AI governance frameworks (NIST AI RMF), or formal verification approaches (Coq, Isabelle). This makes it harder to assess whether MathLedger is complementary, competitive, or orthogonal to existing approaches.

5. Scalability and Performance Claims Absent
There are no claims about performance, throughput, or scalability. Can this handle 1000 claims? 1 million? Is there a performance model? While this is consistent with "governance substrate only," it leaves open questions about production viability.

6. Multi-Party Scenarios Underexplored
The demo shows single-user flows. What happens when multiple parties commit conflicting claims? How are disputes resolved? The UVIL (User-Verified Input Loop) suggests human-in-the-loop, but multi-stakeholder scenarios aren't demonstrated.

7. Integration Guidance Missing
There's no "How to Integrate MathLedger" guide. If I wanted to use this in my organization, what would that look like? Is it a library, a service, a protocol? The local execution instructions exist but aren't framed as integration guidance.

8. Economic and Incentive Model Absent
There's no discussion of incentives. Why would users commit claims? Why would validators participate? In production, governance systems need incentive alignment, but this is not addressed (which is fine for v0, but should be acknowledged as future work).

What Would I Want to See Next? (Concrete Requests)
1. A Live Example of a Failed Verification
Show a case where the verification infrastructure itself fails (not just ABSTAINED, but an actual system error). Demonstrate how the system handles and reports infrastructure failures. Include this in the fixtures with expected error states.

2. A Threat Model Page
Create a document titled "Threat Model and Attack Surface" that explicitly lists:
- Adversaries the system is designed to resist (e.g., malicious validators, data tampering)
- Adversaries the system is NOT designed to resist (e.g., compromised hardware, social engineering)
- Attack vectors and mitigations for each Tier A invariant
- Explicitly out-of-scope threats
3. A Dedicated Auditor Entry Point
A page at /for-auditors that provides:
- 30-second summary of what to audit
- Links to the 5 most critical documents in priority order
- Checklist of verification steps with estimated time for each
- Expected outputs for each verification step
- Contact information for questions
4. A Positioning Document
A document comparing MathLedger to:
- Traditional audit frameworks (SOC 2, ISO 27001)
- AI governance frameworks (NIST AI RMF, EU AI Act)
- Formal verification systems (Lean, Coq, Z3)
- Blockchain/distributed ledger approaches
5. A Performance and Failure-Mode Characterization
Documentation covering:
- Known failure modes (with examples)
- Edge cases not covered by current validators
- Scalability limits (if any)
- Performance characteristics (latency, throughput)
- Degradation behavior under load or attack
6. A Tamper-Detection Walkthrough
A walkthrough that:
- Shows a valid evidence pack with PASS verification
- Allows the user to modify a field
- Re-runs verification and shows FAIL with specific diff
7. A Multi-Party Scenario Demo
A demonstration of:
- Two users committing conflicting claims about the same fact
- How the system records both without choosing a winner
- How the attestation structure preserves both perspectives
8. An Integration Guide
Guidance including:
- Sample code for a Python application
- API documentation (if applicable)
- Deployment guide
- Monitoring and observability recommendations
Part 7 — Verdict
Core Value Proposition (One Sentence)
MathLedger provides a cryptographically enforced governance substrate that separates AI system exploration from authority-bearing claims, making the boundary between "what the system suggested" and "what was committed as justified" structurally verifiable and replayable, with abstention as a first-class outcome when verification cannot be established.
Single Biggest Improvement for Credibility (One Sentence)
Add a prominent threat model page that explicitly names the adversaries the system is designed to resist and those it is not, with concrete attack scenarios and corresponding Tier A invariant protections, because the current documentation demonstrates unusual rigor in showing limitations but stops short of framing those limitations as adversarial threat modeling, which is what auditors and acquirers need to assess whether the governance substrate is fit for their threat environment.
Summary of Key Findings
Strengths:
- Unprecedented transparency about limitations and non-claims
- Rigorous tier system (A/B/C) with explicit enforcement levels
- Structural enforcement of exploration/authority boundary
- Abstention treated as first-class outcome, not failure
- Immutable versioned archives with cryptographic verification
- Terminology consistency between documentation and demo
- Evidence pack replay verification with explicit scope
Weaknesses:
- No explicit threat model or adversary framing
- Missing failed verification examples (infrastructure failures, not just ABSTAINED)
- Target audience ambiguity (assumes high technical sophistication)
- "90-Second Proof" button name misleading (actually ~5-8 seconds)
- No comparison to existing audit/governance frameworks
- Integration guidance absent
- Multi-stakeholder scenarios underexplored