Problem
Optimizer validation/profiling can use nearly all host RAM and significant swap when an op has large captured .pt entries, especially torch.nn.functional.embedding.
Observed on 2026-04-24 while optimizing project gemma4-e2b-gb10, op torch_nn_functional_embedding, 5-iteration optimize run.
Evidence
The embedding capture directory was very large:
kernels/projects/gemma4-e2b-gb10/io/individual_ops/torch_nn_functional_embedding: about 52 GiB
- 20 entry_*.pt files
- largest entries: about 4.4 GiB each
- smaller entries: about 769 MiB each
zipinfo showed that the large entries contain a raw tensor storage of about 4.7 GB, consistent with repeated capture of the embedding weight.
During the profiling window:
- The persistent validation worker, PID 1554037, stayed around 53,938,168 kB RSS with a 57,164,344 kB high-water RSS.
- The profiler/pipeline process, PID 1553583, simultaneously increased GPU memory usage from about 4,654 MiB to 53,294 MiB according to nvidia-smi.
- System available RAM dropped as low as about 0.58 GiB.
- System swap stayed heavily used, around 12.7 GiB, during the profiling window.
- After profiling completed and the worker exited, memory recovered quickly; the final snapshot showed about 114 GiB of available RAM and no Forge compute process in nvidia-smi.
This indicates that validation and profiling memory footprints overlap in time; the model size itself is not the main issue.
Suspected Root Causes
- src/optimizer/backends/cuda/verifier.py loads every entry_*.pt into an entries list before validating:
entries = []
for f in entry_files:
    e = torch.load(f)
    entries.append(e)
For embedding, this can hold most of the 52 GiB capture set in host RAM.
- The verifier worker is persistent: its allocator/RSS can remain high after validation, while profiling starts in the parent pipeline process.
- src/optimizer/backends/cuda/profiler.py uses settings.batch_size, which defaults to 50. For embedding there are only 20 entries, so profiling loads the whole 52 GiB capture set as one batch.
- src/optimizer/benchmarking/profile_project.py serializes full tensors with _serialize(v) -> v.detach().cpu(), so embedding weights are duplicated into many capture entries.
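The list-accumulation pattern in the verifier could instead stream entries. A minimal sketch, assuming entries can be validated independently; the `load` hook is illustrative (the real code would call torch.load directly):

```python
import gc

def iter_entries(entry_files, load=None):
    """Yield one deserialized entry_*.pt at a time instead of
    accumulating every entry in a list, as the current loop does."""
    if load is None:
        import torch  # deferred import so the iterator is testable without torch
        load = lambda f: torch.load(f, map_location="cpu")
    for f in entry_files:
        entry = load(f)
        yield entry
        # Drop the reference before the next load so multi-GiB
        # tensors can be reclaimed between entries.
        del entry
        gc.collect()
```

With this shape, peak host memory tracks the largest single entry (about 4.4 GiB here) rather than the whole 52 GiB capture set.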
Desired Fix
Preserve benchmark integrity while reducing memory pressure:
- Stream validation entries one at a time instead of retaining all entries in memory.
- Restart or explicitly recycle the verifier worker after large validation jobs, or before profiling starts, so memory is returned to the OS.
- Make profiler batching byte-aware rather than count-aware. For example, cap each batch by total .pt file size and force batch size 1 for multi-GB entries.
- Add gc.collect() and device cache cleanup after validation/profiling batches where appropriate.
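A hedged sketch of the byte-aware batch planning: plan_batches, max_batch_bytes, and the sizeof hook are illustrative names, not the existing profiler API.

```python
import os

def plan_batches(entry_files, max_batch_bytes=2 * 1024**3, sizeof=os.path.getsize):
    """Group entry_*.pt files into batches capped by on-disk size.
    Any file at or above the cap becomes its own batch of one."""
    batches, current, current_bytes = [], [], 0
    for f in entry_files:
        size = sizeof(f)
        if size >= max_batch_bytes:
            if current:  # flush the in-progress batch first
                batches.append(current)
                current, current_bytes = [], 0
            batches.append([f])  # multi-GB entry: forced batch size 1
            continue
        if current and current_bytes + size > max_batch_bytes:
            batches.append(current)
            current, current_bytes = [], 0
        current.append(f)
        current_bytes += size
    if current:
        batches.append(current)
    return batches
```

Using the on-disk .pt size is a conservative proxy for in-memory footprint; zipinfo showed the two track closely for these captures.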
Longer-term optional improvement:
- Deduplicate repeated constant tensors in captured IO, especially embedding weights. This should be treated as a capture-format/provenance change and must reconstruct byte-identical inputs before benchmark execution.
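One way the dedup could be structured, sketched at a high level: entries store a content hash in place of repeated payloads. DedupStore is hypothetical, and the payload here stands in for serialized tensor bytes; reconstruction would look up the hash and restore byte-identical tensors before the benchmark runs.

```python
import hashlib

class DedupStore:
    """Store each distinct payload once, keyed by a SHA-256 content
    hash; capture entries keep only the key."""

    def __init__(self):
        self._blobs = {}

    def put(self, payload: bytes) -> str:
        key = hashlib.sha256(payload).hexdigest()
        # First writer wins; identical embedding weights across
        # entries collapse to one stored copy.
        self._blobs.setdefault(key, payload)
        return key

    def get(self, key: str) -> bytes:
        return self._blobs[key]
```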
Benchmark Integrity Notes
The first three fixes should not change benchmark semantics if implemented correctly:
- Same .pt entries
- Same inputs and outputs
- Same correctness comparisons
- Same per-entry timing protocol
They only change residency/lifetime of tensors in memory. This likely improves timing quality because current RAM exhaustion and swap pressure can distort profiling measurements.
Acceptance Criteria
- Embedding validation no longer holds all captured entries at once.
- Profiling batch selection respects a byte cap and does not load a 52 GiB op directory in one batch.
- Validation memory is released before profiling starts, or the validation worker is recycled before profiling.
- Peak RAM during torch_nn_functional_embedding optimization stays comfortably below physical memory, with minimal or no swap growth.
- Behavior and benchmark provenance remain unchanged except for explicitly logged batching/memory-management policy.
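The peak-RAM criterion can be checked from inside the run with the stdlib resource module; note the platform-dependent units (Linux reports ru_maxrss in kB, macOS in bytes):

```python
import resource

def peak_rss_kb() -> int:
    """High-water RSS of this process (kB on Linux), comparable to
    the per-PID high-water figures captured in the evidence above."""
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
```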