
evidence(hw_gates): memtester 4G/32G/64G CLEAN on Spark (2026-05-14)#14

Merged
Natfii merged 1 commit into main from evidence/memtester-32g-64g-clean on May 16, 2026

Natfii commented May 16, 2026

Summary

HW evidence dir for the 2026-05-14 memtester sweep that gated the β-coop sustained-load collapse diagnosis arc. All three bands CLEAN.

  • sanity_4G: rc=0, 0 FAILURE, 436s (7m 16s)
  • band_32G: rc=0, 0 FAILURE, 3391s (56m 31s)
  • band_64G: rc=0, 0 FAILURE, 6764s (1h 52m 44s)

Total wall time: ~3 h (10591 s). Pre-run and post-run free -h snapshots show no leak.
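
The per-band durations and the ~3 h total can be cross-checked with a few lines of arithmetic. This is a minimal sketch that uses only the second counts reported above; the helper name `fmt_hms` is illustrative and not part of the committed evidence.

```python
# Wall-clock seconds per band, as reported in the summary above.
durations = {"sanity_4G": 436, "band_32G": 3391, "band_64G": 6764}


def fmt_hms(seconds: int) -> str:
    """Render a second count as 'Hh MMm SSs' (hypothetical helper)."""
    hours, rem = divmod(seconds, 3600)
    minutes, secs = divmod(rem, 60)
    return f"{hours}h {minutes:02d}m {secs:02d}s"


for band, secs in durations.items():
    print(f"{band}: {fmt_hms(secs)}")

total = sum(durations.values())
print(f"total: {total}s = {fmt_hms(total)}")
```

The sum is 10591 s, i.e. 2h 56m 31s, which is consistent with the ~3 h figure.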

The 32-64 GB band is the suspect one per NVIDIA forum reports of reproducible memtester failures on some Spark units. This unit passed.

Why this gate exists

Spark fleet caveats that change priors for any host-state debugging:

  • LPDDR5x on Spark has no ECC — silent bit-flip corruption is possible. (NVIDIA forum)
  • Memtester 32-64 GB failures reported on other Spark units; unresolved upstream as of 2026-05-14. (NVIDIA forum)
  • Thermal-sensor errata on at least one Spark unit means nvidia-smi throttle-reasons is not a reliable HW-health gate. (NVIDIA forum)

The collapse arc was eventually resolved by PR #13 (SSM zero-on-realloc); D2.7 (the next planned bisection leg that would have triggered this HW gate's follow-up) was skipped. Committing this evidence so the durable Spark HW caveats and the CLEAN baseline are preserved for future debugging.

What's committed

  • summary.md — full host/manifest, what this does/doesn't prove, reproduce commands, source links
  • memtester_sanity_4G.log (1.1 MB)
  • memtester_band_32G.log (8.6 MB)
  • memtester_band_64G.log (17.2 MB)
  • runner.log (2.1 MB) — orchestration trace
  • (Drops runner.log.raw — duplicative of runner.log)

Force-added per memory:feedback_evidence_force_add since *.log is gitignored.

Test plan

  • All three memtester_*.log files end with Done. and contain 0 FAILURE lines (grep -c FAILURE).
  • runner.log shows rc=0 for every band.
  • Pre-run and post-run free -h snapshots show no leak.
  • Not exercised: 64-120 GB band (would approach OOM), thermal-stress band (orthogonal to memtester), CPU-GPU coherence in unified memory (memtester is CPU-only).
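
The first check in the test plan can be scripted. The sketch below assumes each log is plain text, that a failure shows up as a line containing the string FAILURE (the same thing `grep -c FAILURE` counts), and that a completed run leaves `Done.` as the last non-empty line. The function name `check_memtester_log` is hypothetical; the file names are the ones listed under "What's committed", and the script expects to be run from the evidence directory.

```python
from pathlib import Path


def check_memtester_log(path: Path) -> tuple[int, bool]:
    """Return (FAILURE line count, whether the last non-empty line is 'Done.')."""
    failure_lines = 0
    last_nonempty = ""
    # errors="replace" tolerates stray bytes from terminal progress output.
    with path.open("r", encoding="utf-8", errors="replace") as fh:
        for line in fh:
            if "FAILURE" in line:
                failure_lines += 1
            if line.strip():
                last_nonempty = line.strip()
    return failure_lines, last_nonempty == "Done."


if __name__ == "__main__":
    names = (
        "memtester_sanity_4G.log",
        "memtester_band_32G.log",
        "memtester_band_64G.log",
    )
    for name in names:
        log = Path(name)
        if not log.exists():
            print(f"{name}: missing")
            continue
        failures, done = check_memtester_log(log)
        print(f"{name}: FAILURE lines={failures}, ends with Done.={done}")
```

A CLEAN band corresponds to a result of `(0, True)`. The rc=0 check against runner.log is left out because the runner's log format is not described here.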

🤖 Generated with Claude Code

All three memtester runs returned rc=0 with 0 FAILURE lines:
- sanity_4G:  436s,  rc=0, FAILURE count=0
- band_32G:   3391s, rc=0, FAILURE count=0
- band_64G:   6764s, rc=0, FAILURE count=0

This gate ruled out RAM in the 4G / 32G / 64G bands as a hardware
contributor to the β-coop sustained-load collapse arc, prior to running
D2.7. D2.7 was subsequently skipped because the SSM zero-on-realloc fix
(PR #13) closed the diagnosis arc.

The 32-64 GB band is the suspect one per NVIDIA forum reports of
reproducible memtester failures on some Spark units. This unit passed.

Context for why this gate exists (Spark fleet caveats):
- LPDDR5x on Spark has no ECC (NVIDIA-confirmed); silent bit-flip
  corruption is possible.
- Memtester 32-64 GB failures reported by other Spark owners; unresolved
  upstream as of 2026-05-14.
- Thermal-sensor errata on at least one unit means nvidia-smi
  throttle-reasons is not a reliable HW-health gate.

See benchmarks/nvllm/traces/hw_gates/2026-05-14-memtester-32G-64G-clean/
summary.md for the full host/manifest, what this does and does not
prove, and reproduction commands. Sources linked there.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Natfii merged commit 56f881f into main on May 16, 2026
Natfii deleted the evidence/memtester-32g-64g-clean branch on May 16, 2026 12:12