ZMem

Benchmarks

Run synthetic, LongMemEval-style, and LoCoMo-style benchmarks and produce proof-backed evidence from the results.

ZMem should compete on benchmark accuracy, latency, token efficiency, and proof-backed reproducibility. The current benchmark harness is designed for internal development and honest public evidence, not unsupported leaderboard claims.

Synthetic Matrix

zmem bench matrix synthetic \
  --out .zerker/bench \
  --seed 0 \
  --run-id synthetic-local

Generate the engineering and public evidence surfaces:

zmem bench dashboard .zerker/bench/synthetic-local
zmem bench public-page .zerker/bench/synthetic-local
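
The matrix, dashboard, and public-page steps above can be chained so that a failed step stops the run. This is a minimal sketch rather than part of ZMem: the function name `run_synthetic_evidence` is hypothetical, and it assumes that each `zmem` command exits with a non-zero status on failure and that the run directory is `<out>/<run-id>`, as the paths above suggest.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper around the three commands shown above.
# Assumption: each zmem command exits non-zero when it fails.
# Assumption: the run directory is "<out>/<run-id>".
run_synthetic_evidence() {
  local out_dir run_id
  out_dir="$1"
  run_id="$2"

  # Step 1: run the synthetic matrix with a fixed seed.
  zmem bench matrix synthetic \
    --out "$out_dir" \
    --seed 0 \
    --run-id "$run_id" || return 1

  # Steps 2 and 3: build the engineering and public evidence surfaces.
  zmem bench dashboard "$out_dir/$run_id" || return 1
  zmem bench public-page "$out_dir/$run_id" || return 1
}
```

For example, `run_synthetic_evidence .zerker/bench synthetic-local` would issue the same three commands as above, in order.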

Verify Evidence

zmem bench verify .zerker/bench/synthetic-local/fts/benchmark-result.json
zmem bench verify .zerker/bench/synthetic-local/fts-multihop/benchmark-result.json
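
The two verify commands above can be generalized to every result file in a run directory. The following is a minimal sketch, not part of ZMem: the function name `verify_run` is hypothetical, and it assumes that `zmem bench verify` exits with a non-zero status when verification fails and that each variant writes its `benchmark-result.json` one directory below the run directory, as in the paths above.

```shell
#!/usr/bin/env bash
# Hypothetical helper: verify every benchmark-result.json in a run directory.
# Assumption: `zmem bench verify` returns non-zero when verification fails.
verify_run() {
  local run_dir failed found result
  run_dir="$1"
  failed=0
  found=0

  for result in "$run_dir"/*/benchmark-result.json; do
    # An unmatched glob stays literal, so skip paths that do not exist.
    [ -e "$result" ] || continue
    found=$((found + 1))
    if zmem bench verify "$result"; then
      echo "verified: $result"
    else
      echo "FAILED:   $result" >&2
      failed=$((failed + 1))
    fi
  done

  if [ "$found" -eq 0 ]; then
    echo "no benchmark-result.json files under $run_dir" >&2
    return 2
  fi

  # Succeed only when every result file verified.
  [ "$failed" -eq 0 ]
}
```

Calling `verify_run .zerker/bench/synthetic-local` would then check each variant in turn and return a failing status if any single file does not verify.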

LongMemEval-Style Fixtures

zmem bench matrix longmemeval \
  --dataset /path/to/local-longmemeval.jsonl \
  --split dev \
  --out .zerker/bench \
  --seed 0 \
  --run-id longmemeval-dev-local

LoCoMo-Style Fixtures

zmem bench matrix locomo \
  --dataset /path/to/local-locomo.jsonl \
  --split dev \
  --out .zerker/bench \
  --seed 0 \
  --run-id locomo-dev-local

Public Claim Rules

Allowed before official submissions:

  • "ZMem publishes proof-backed local benchmark evidence."
  • "This matrix is reproducible from the attached artifact hashes and receipts."
  • "This local scaffold tracks retrieval accuracy, latency, tokens, and proof verification."

Do not claim an official LongMemEval or LoCoMo ranking, vendor superiority, or a canonical leaderboard score based on local scaffold output.

What To Track

  • accuracy and category accuracy,
  • recall and precision evidence where ground-truth support exists,
  • p50, p95, and p99 retrieval latency,
  • total tokens and injected context tokens,
  • retrieved, injected, withheld, and budget-dropped memory counts,
  • verification status,
  • matrix hash and comparison hash,
  • optional Treeship proof URL when public proof publishing is enabled.
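
To make the latency percentiles in the list above concrete, here is a minimal sketch of a nearest-rank percentile over raw samples. It is illustrative only: the source does not describe how ZMem computes or stores latency, so the `percentile` function and the input format (a plain text file with one latency value per line) are assumptions, and nearest-rank is only one of several percentile definitions.

```shell
#!/usr/bin/env bash
# Hypothetical helper: nearest-rank percentile over one latency sample per line.
# Usage: percentile <integer p, 1-100> <file>
# Assumption: the file format is illustrative, not a ZMem output format.
percentile() {
  sort -n "$2" | awk -v p="$1" '
    { v[NR] = $1 }
    END {
      if (NR == 0) exit 1
      # Nearest-rank: smallest rank r with r >= p/100 * N,
      # computed as an integer ceiling of p * N / 100.
      rank = int((p * NR + 99) / 100)
      if (rank < 1) rank = 1
      print v[rank]
    }'
}
```

With a samples file named, say, `latencies.txt`, the three tracked values would come from `percentile 50 latencies.txt`, `percentile 95 latencies.txt`, and `percentile 99 latencies.txt`. With few samples, p95 and p99 collapse onto the slowest observations, so tail percentiles are only meaningful for reasonably large runs.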
