Problem
Optimizer validation/profiling can use nearly all host RAM and significant swap when an op has large captured .pt entries, especially torch.nn.functional.embedding.
Observed on 2026-04-24 while optimizing project gemma4-e2b-gb10, op torch_nn_functional_embedding, 5-iteration optimize run.
Evidence
The embedding capture directory was very large:
kernels/projects/gemma4-e2b-gb10/io/individual_ops/torch_nn_functional_embedding: about 52 GiB
- 20 entry_*.pt files
- largest entries: about 4.4 GiB each
- smaller entries: about 769 MiB each
zipinfo showed that the large entries contain a raw tensor storage of about 4.7 GB, consistent with repeated capture of the embedding weight.
During the profiling window:
- The persistent validation worker, PID 1554037, stayed around 53,938,168 kB RSS with a 57,164,344 kB high-water RSS.
- The profiler/pipeline process, PID 1553583, simultaneously increased GPU memory usage from about 4,654 MiB to 53,294 MiB according to nvidia-smi.
- System available RAM dropped as low as about 0.58 GiB.
- System swap stayed heavily used, around 12.7 GiB, during the profiling window.
- After profiling completed and the worker exited, memory recovered quickly; the final snapshot showed about 114 GiB of available RAM and no Forge compute process in nvidia-smi.
This indicates that validation and profiling memory footprints overlap in time; the model size itself is not the main issue.
Suspected Root Causes
- src/optimizer/backends/cuda/verifier.py loads every entry_*.pt into an entries list before validating:
entries = []
for f in entry_files:
    e = torch.load(f)
    entries.append(e)
For embedding, this can hold most of the 52 GiB capture set in host RAM.
- The verifier worker is persistent: its allocator/RSS can remain high after validation, while profiling starts in the parent pipeline process.
- src/optimizer/backends/cuda/profiler.py uses settings.batch_size, which defaults to 50. For embedding there are only 20 entries, so profiling loads the whole 52 GiB capture set as one batch.
- src/optimizer/benchmarking/profile_project.py serializes full tensors with _serialize(v) -> v.detach().cpu(), so embedding weights are duplicated into many capture entries.
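The list-accumulation pattern in the verifier could instead stream entries. A minimal sketch, assuming entries can be validated independently; the `load` hook is illustrative (the real code would call torch.load directly):

```python
import gc

def iter_entries(entry_files, load=None):
    """Yield one deserialized entry_*.pt at a time instead of
    accumulating every entry in a list, as the current loop does."""
    if load is None:
        import torch  # deferred import so the iterator is testable without torch
        load = lambda f: torch.load(f, map_location="cpu")
    for f in entry_files:
        entry = load(f)
        yield entry
        # Drop the reference before the next load so multi-GiB
        # tensors can be reclaimed between entries.
        del entry
        gc.collect()
```

With this shape, peak host memory tracks the largest single entry (about 4.4 GiB here) rather than the whole 52 GiB capture set.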
Desired Fix
Preserve benchmark integrity while reducing memory pressure:
- Stream validation entries one at a time instead of retaining all entries in memory.
- Restart or explicitly recycle the verifier worker after large validation jobs, or before profiling starts, so memory is returned to the OS.
- Make profiler batching byte-aware rather than count-aware. For example, cap each batch by total .pt file size and force batch size 1 for multi-GB entries.
- Add gc.collect() and device cache cleanup after validation/profiling batches where appropriate.
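A hedged sketch of the byte-aware batch planning: plan_batches, max_batch_bytes, and the sizeof hook are illustrative names, not the existing profiler API.

```python
import os

def plan_batches(entry_files, max_batch_bytes=2 * 1024**3, sizeof=os.path.getsize):
    """Group entry_*.pt files into batches capped by on-disk size.
    Any file at or above the cap becomes its own batch of one."""
    batches, current, current_bytes = [], [], 0
    for f in entry_files:
        size = sizeof(f)
        if size >= max_batch_bytes:
            if current:  # flush the in-progress batch first
                batches.append(current)
                current, current_bytes = [], 0
            batches.append([f])  # multi-GB entry: forced batch size 1
            continue
        if current and current_bytes + size > max_batch_bytes:
            batches.append(current)
            current, current_bytes = [], 0
        current.append(f)
        current_bytes += size
    if current:
        batches.append(current)
    return batches
```

Using the on-disk .pt size is a conservative proxy for in-memory footprint; zipinfo showed the two track closely for these captures.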
Longer-term optional improvement:
- Deduplicate repeated constant tensors in captured IO, especially embedding weights. This should be treated as a capture-format/provenance change and must reconstruct byte-identical inputs before benchmark execution.
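One way the dedup could be structured, sketched at a high level: entries store a content hash in place of repeated payloads. DedupStore is hypothetical, and the payload here stands in for serialized tensor bytes; reconstruction would look up the hash and restore byte-identical tensors before the benchmark runs.

```python
import hashlib

class DedupStore:
    """Store each distinct payload once, keyed by a SHA-256 content
    hash; capture entries keep only the key."""

    def __init__(self):
        self._blobs = {}

    def put(self, payload: bytes) -> str:
        key = hashlib.sha256(payload).hexdigest()
        # First writer wins; identical embedding weights across
        # entries collapse to one stored copy.
        self._blobs.setdefault(key, payload)
        return key

    def get(self, key: str) -> bytes:
        return self._blobs[key]
```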
Benchmark Integrity Notes
The first three fixes should not change benchmark semantics if implemented correctly:
- Same .pt entries
- Same inputs and outputs
- Same correctness comparisons
- Same per-entry timing protocol
They only change residency/lifetime of tensors in memory. This likely improves timing quality because current RAM exhaustion and swap pressure can distort profiling measurements.
Acceptance Criteria
- Embedding validation no longer holds all captured entries at once.
- Profiling batch selection respects a byte cap and does not load a 52 GiB op directory in one batch.
- Validation memory is released before profiling starts, or the validation worker is recycled before profiling.
- Peak RAM during torch_nn_functional_embedding optimization stays comfortably below physical memory, with minimal or no swap growth.
- Behavior and benchmark provenance remain unchanged except for explicitly logged batching/memory-management policy.
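The peak-RAM criterion can be checked from inside the run with the stdlib resource module; note the platform-dependent units (Linux reports ru_maxrss in kB, macOS in bytes):

```python
import resource

def peak_rss_kb() -> int:
    """High-water RSS of this process (kB on Linux), comparable to
    the per-PID high-water figures captured in the evidence above."""
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
```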