Fix optimizer validation/profiling memory spikes for large captured op IO #76

@Dhravidk

Description

Problem

Optimizer validation/profiling can use nearly all host RAM and significant swap when an op has large captured .pt entries, especially torch.nn.functional.embedding.

Observed on 2026-04-24 while optimizing project gemma4-e2b-gb10, op torch_nn_functional_embedding, 5-iteration optimize run.

Evidence

The embedding capture directory was very large:

  • kernels/projects/gemma4-e2b-gb10/io/individual_ops/torch_nn_functional_embedding: about 52 GiB
  • 20 entry_*.pt files
  • largest entries were about 4.4 GiB each
  • smaller entries were about 769 MiB each
  • zipinfo showed the large entries each contain a raw tensor storage of about 4.7 GB, consistent with repeated capture of the embedding weight

During the profiling window:

  • The persistent validation worker PID 1554037 stayed around 53,938,168 kB RSS (about 51 GiB), with a high-water RSS of 57,164,344 kB (about 55 GiB).
  • The profiler/pipeline PID 1553583 simultaneously increased GPU memory usage from about 4,654 MiB to 53,294 MiB according to nvidia-smi.
  • System available RAM dropped as low as about 0.58 GiB.
  • System swap stayed heavily used, around 12.7 GiB during the profiling window.
  • After profiling completed and the worker exited, memory recovered quickly. Final snapshot showed about 114 GiB available RAM and no Forge compute process in nvidia-smi.

This shows that validation memory and profiling memory are resident at the same time. The model size itself is not the main issue.

Suspected Root Causes

  1. src/optimizer/backends/cuda/verifier.py loads every entry_*.pt into an entries list before validating:
entries = []
for f in entry_files:
    e = torch.load(f)    # materializes the full entry in host RAM
    entries.append(e)    # every entry stays referenced until validation finishes

For embedding, this can hold most of the 52 GiB capture set in host RAM.

  2. The verifier worker is persistent. Its allocator/RSS can remain high after validation, while profiling starts in the parent pipeline process.

  3. src/optimizer/backends/cuda/profiler.py uses settings.batch_size, defaulting to 50. For embedding there are only 20 entries, so profiling loads the whole 52 GiB capture set as one batch.

  4. src/optimizer/benchmarking/profile_project.py serializes full tensors with _serialize(v) -> v.detach().cpu(), so embedding weights are duplicated into many capture entries.

Desired Fix

Preserve benchmark integrity while reducing memory pressure:

  • Stream validation entries one at a time instead of retaining all entries in memory (see the streaming sketch after this list).
  • Restart or explicitly recycle the verifier worker after large validation jobs, or before profiling starts, so memory is returned to the OS (see the short-lived worker sketch below).
  • Make profiler batching byte-aware rather than count-aware, for example by capping each batch by total .pt file size and forcing batch size 1 for multi-GB entries (see the batch-planning sketch below).
  • Add gc.collect() and device cache cleanup after validation/profiling batches where appropriate.
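
A minimal sketch of the streaming approach, which also folds in the gc/cache-cleanup bullet. validate_entry is a hypothetical stand-in for the existing per-entry correctness check in verifier.py:

import gc
import torch

def validate_entries_streaming(entry_files, validate_entry):
    results = []
    for path in sorted(entry_files):
        entry = torch.load(path, map_location="cpu")  # only one entry resident at a time
        results.append(validate_entry(entry))
        del entry                                     # drop the only reference to the entry
        gc.collect()                                  # encourage prompt host-RAM release
        if torch.cuda.is_available():
            torch.cuda.empty_cache()                  # return cached device blocks to the driver
    return results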
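
For worker recycling, one option (a sketch, not the existing worker implementation) is to run validation in short-lived child processes via the standard library pool; validate_file is a hypothetical module-level per-file check:

import multiprocessing as mp

def run_validation_recycled(entry_files, validate_file):
    # maxtasksperchild=1 makes the pool replace the worker after every task,
    # so no single process accumulates the whole capture set and all RSS is
    # returned to the OS before profiling starts in the parent.
    ctx = mp.get_context("spawn")
    with ctx.Pool(processes=1, maxtasksperchild=1) as pool:
        return pool.map(validate_file, entry_files)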
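
And a sketch of byte-aware batch planning, using on-disk .pt size as a proxy for resident size; max_batch_bytes is an illustrative cap, not an existing setting:

import os

def plan_batches(entry_files, max_batch_bytes=2 * 1024**3):
    # Group entry_*.pt files into batches capped by total file size instead of
    # a fixed count. An entry larger than the cap gets a batch of its own, so a
    # multi-GB embedding capture is never co-resident with other entries.
    batches, current, current_bytes = [], [], 0
    for path in sorted(entry_files):
        size = os.path.getsize(path)
        if current and current_bytes + size > max_batch_bytes:
            batches.append(current)
            current, current_bytes = [], 0
        current.append(path)
        current_bytes += size
    if current:
        batches.append(current)
    return batches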

Longer-term optional improvement:

  • Deduplicate repeated constant tensors in captured IO, especially embedding weights. This should be treated as a capture-format/provenance change and must reconstruct byte-identical inputs before benchmark execution (see the fingerprinting sketch below).
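
A possible fingerprinting step for such deduplication, sketched under the assumption that capture-side code can hash tensors before serialization:

import hashlib
import torch

def tensor_fingerprint(t: torch.Tensor) -> str:
    # Content hash over a tensor's dtype, shape, and raw bytes. Repeated
    # captures of the same embedding weight share one fingerprint, so the
    # weight can be stored once and referenced from every entry_*.pt.
    flat = t.detach().cpu().contiguous().flatten()
    raw = flat.view(torch.uint8).numpy().tobytes()  # reinterpret payload as bytes
    meta = f"{t.dtype}:{tuple(t.shape)}".encode()
    return hashlib.sha256(meta + raw).hexdigest()

At load time, each reference would be resolved back to the stored blob before execution, so the reconstructed inputs stay byte-identical to the original capture.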

Benchmark Integrity Notes

The first three fixes should not change benchmark semantics if implemented correctly:

  • Same .pt entries
  • Same inputs and outputs
  • Same correctness comparisons
  • Same per-entry timing protocol

They only change the residency and lifetime of tensors in memory. If anything, this should improve timing quality, because the current RAM exhaustion and swap pressure can distort profiling measurements.

Acceptance Criteria

  • Embedding validation no longer holds all captured entries at once.
  • Profiling batch selection respects a byte cap and does not load a 52 GiB op directory in one batch.
  • Validation memory is released before profiling starts, or the validation worker is recycled before profiling.
  • Peak RAM during torch_nn_functional_embedding optimization stays comfortably below physical memory with minimal/no swap growth.
  • Behavior and benchmark provenance remain unchanged except for explicitly logged batching/memory-management policy.
