evidence(hw_gates): memtester 4G/32G/64G CLEAN on Spark (2026-05-14) #14
Merged
All three memtester runs returned rc=0 with 0 FAILURE lines:
- sanity_4G: 436s, rc=0, FAILURE count=0
- band_32G: 3391s, rc=0, FAILURE count=0
- band_64G: 6764s, rc=0, FAILURE count=0

This gate ruled out RAM in the 4G / 32G / 64G bands as a hardware contributor to the β-coop sustained-load collapse arc, prior to running D2.7. D2.7 was subsequently skipped because the SSM zero-on-realloc fix (PR #13) closed the diagnosis arc.

The 32-64 GB band is the suspect one per NVIDIA forum reports of reproducible memtester failures on some Spark units. This unit passed.

Context for why this gate exists (Spark fleet caveats):
- LPDDR5x on Spark has no ECC (NVIDIA-confirmed); silent bit-flip corruption is possible.
- Memtester 32-64 GB failures reported by other Spark owners; unresolved upstream as of 2026-05-14.
- Thermal-sensor errata on at least one unit means nvidia-smi throttle-reasons is not a reliable HW-health gate.

See benchmarks/nvllm/traces/hw_gates/2026-05-14-memtester-32G-64G-clean/summary.md for the full host/manifest, what this does and does not prove, and reproduction commands. Sources linked there.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
HW evidence dir for the 2026-05-14 memtester sweep that gated the β-coop sustained-load collapse diagnosis arc. All three bands CLEAN.
Total wall time: ~3 h; host unchanged pre→post (no leak).
The 32-64 GB band is the suspect one per NVIDIA forum reports of reproducible memtester failures on some Spark units. This unit passed.
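
The ~3 h total can be checked against the per-band durations reported in the commit message. A minimal sketch: the dictionary name and layout are illustrative, and only the three durations come from this PR.

```python
# Per-band wall-clock durations in seconds, as reported in the commit message.
# The dict name and structure are illustrative, not part of the repo.
band_seconds = {"sanity_4G": 436, "band_32G": 3391, "band_64G": 6764}

total_seconds = sum(band_seconds.values())
total_hours = total_seconds / 3600

print(f"total: {total_seconds} s = {total_hours:.2f} h")  # total: 10591 s = 2.94 h
```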
Why this gate exists
Spark fleet caveats that change priors for any host-state debugging:
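- LPDDR5x on Spark has no ECC (NVIDIA-confirmed); silent bit-flip corruption is possible.
- Memtester failures in the 32-64 GB band have been reported by other Spark owners; unresolved upstream as of 2026-05-14.
- Thermal-sensor errata on at least one unit mean `nvidia-smi` throttle-reasons is not a reliable HW-health gate. (NVIDIA forum)

The collapse arc was eventually resolved by PR #13 (SSM zero-on-realloc); D2.7 (the next planned bisection leg, which would have triggered this HW gate's follow-up) was skipped. Committing this evidence so that the durable Spark HW caveats and the CLEAN baseline are preserved for future debugging.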
What's committed
- `summary.md`: full host/manifest, what this does/doesn't prove, reproduce commands, source links
- `memtester_sanity_4G.log` (1.1 MB)
- `memtester_band_32G.log` (8.6 MB)
- `memtester_band_64G.log` (17.2 MB)
- `runner.log` (2.1 MB): orchestration trace
- (`runner.log.raw`: duplicative of `runner.log`)

Force-added per `memory:feedback_evidence_force_add`, since `*.log` is gitignored.

Test plan
- `memtester_*.log` files end with `Done.` and contain `0` FAILURE lines (`grep -c FAILURE`).
- `runner.log` shows `rc=0` for every band.
- `free -h` snapshots show no leak.

🤖 Generated with Claude Code
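
The log checks in the test plan could be automated along these lines. This is a sketch, not the script used for this PR: the function names are hypothetical, and it assumes `runner.log` records each exit code as an `rc=<integer>` token, which the PR text implies but does not specify.

```python
import re
from pathlib import Path


def memtester_log_is_clean(text: str) -> bool:
    """True if the log ends with 'Done.' and no line contains 'FAILURE'."""
    # Same count that `grep -c FAILURE` would report.
    failure_lines = sum(1 for line in text.splitlines() if "FAILURE" in line)
    return text.rstrip().endswith("Done.") and failure_lines == 0


def runner_return_codes(text: str) -> list[int]:
    """Extract every rc=<int> token from runner log text (assumed format)."""
    return [int(m) for m in re.findall(r"\brc=(-?\d+)", text)]


def evidence_dir_is_clean(evidence_dir: Path) -> bool:
    """Apply both checks to a directory holding memtester_*.log and runner.log."""
    logs = sorted(evidence_dir.glob("memtester_*.log"))
    runner_text = (evidence_dir / "runner.log").read_text(errors="replace")
    codes = runner_return_codes(runner_text)
    return (
        bool(logs)  # an empty directory must not count as a pass
        and all(memtester_log_is_clean(p.read_text(errors="replace")) for p in logs)
        and bool(codes)
        and all(rc == 0 for rc in codes)
    )
```

Requiring at least one log and at least one `rc=` token keeps a missing or truncated evidence directory from passing silently. The `free -h` leak check is left out because the snapshot format is not shown in this PR.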