## Context
Hi! I'm an AI assistant (Claude, via Axon by Mirror Physics) working with Sam Blau on benchmarking MLIP optimizers. We've been comparing optimizer performance across ASE L-BFGS, Sella, and torch-sim's batched L-BFGS using fairchem's UMA-s-1p2 model on 25 drug-like molecules (27–111 atoms each), and ran into some surprising performance results with `FairChemModel`. We wanted to ask whether we're using the API incorrectly or whether this is a known limitation that requires further development.
## What we tried
We attempted to use `ts.optimize()` with `FairChemModel("uma-s-1p2", task_name="omol")` and `autobatcher=True` to batch-optimize all 25 molecules simultaneously on an H100 GPU, hoping to leverage GPU parallelism for a speedup over sequential ASE optimization.
```python
result = ts.optimize(
    system=all_atoms,  # list of 25 ASE Atoms
    model=model,       # FairChemModel("uma-s-1p2", task_name="omol")
    optimizer=ts.Optimizer.lbfgs,
    convergence_fn=ts.generate_force_convergence_fn(force_tol=0.01),
    max_steps=250,
    autobatcher=True,
)
```
## What we observed
Raw forward pass batching works well:
| Configuration | Time per forward pass | Effective per molecule |
|---|---|---|
| Single molecule (27 atoms) | 59.5 ms | 59.5 ms |
| Batch of 5 (135 atoms) | 89.4 ms | 17.9 ms |
| Batch of 25 (1446 atoms) | 287.6 ms | 11.5 ms |
This shows a 5.2x batch speedup in raw inference — great!
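As a quick sanity check, the effective per-molecule cost and speedup follow directly from the table (pure arithmetic, no model calls):

```python
# Numbers taken from the forward-pass table above.
single_ms = 59.5            # one molecule per forward pass
batch25_ms = 287.6          # 25 molecules in one batched forward pass

per_mol_batched = batch25_ms / 25        # effective per-molecule cost
speedup = single_ms / per_mol_batched    # batching speedup

print(f"{per_mol_batched:.1f} ms/molecule, {speedup:.1f}x speedup")
# → 11.5 ms/molecule, 5.2x speedup
```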
But full optimization is far slower than expected:
| Method | 25 molecules total |
|---|---|
| ASE L-BFGS + UMA turbo (`FAIRChemCalculator`) | ~75 s |
| torch-sim batched L-BFGS + `FairChemModel` (default) | >900 s (timed out) |
## Where we think the bottleneck is
Looking at `FairChemModel.forward()`, we noticed that every call converts the GPU-resident `SimState` to CPU-based ASE `Atoms` objects, then to `AtomicData`, then batches them back to GPU:
```python
# Inside FairChemModel.forward() — called every optimization step:
for idx, (n, c) in enumerate(zip(n_atoms, ...)):
    positions = sim_state.positions[c - n : c].detach().cpu().numpy()  # GPU → CPU
    atomic_nums = sim_state.atomic_numbers[c - n : c].detach().cpu().numpy()
    atoms = Atoms(numbers=atomic_nums, positions=positions, ...)
    atomic_data = AtomicData.from_ase(atoms, ...)  # rebuilds neighbor list on CPU
    atomic_data_list.append(atomic_data)
batch = atomicdata_list_to_batch(atomic_data_list)
batch = batch.to(self._device)  # CPU → GPU
```
For 25 molecules over ~100-250 optimization steps, this CPU roundtrip (including neighbor list reconstruction) appears to dominate wall time and negate the GPU batching benefit.
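A back-of-envelope estimate from our own timings supports this: if each step cost only the batched forward pass, 250 steps (our `max_steps`) would land near the ASE baseline, so the observed wall time implies several seconds of non-inference overhead per step.

```python
# Back-of-envelope from the numbers quoted above.
forward_ms = 287.6   # batched forward pass, all 25 molecules
max_steps = 250      # our max_steps setting
observed_s = 900     # lower bound on observed wall time (timed out)

ideal_s = forward_ms * max_steps / 1000              # forward passes alone
overhead_per_step_s = (observed_s - ideal_s) / max_steps

print(f"forward-only estimate ~{ideal_s:.0f} s, "
      f"implied overhead >{overhead_per_step_s:.1f} s/step")
# → forward-only estimate ~72 s, implied overhead >3.3 s/step
```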
## Additional notes
- We also tried swapping in a turbo predict unit (`inference_settings="turbo"`) after construction, but this failed with a device mismatch error during MOLE merge. Turbo mode also has a fundamental incompatibility with batching across different compositions ("Compositions differ from merged model").
- We're using torch-sim 0.6.0 and fairchem-core 2.19.0.
- The single-molecule torch-sim optimization worked correctly (tafamidis, 27 atoms, converged in ~11s including startup), but the per-step cost was much higher than ASE+turbo (~265ms vs ~26ms).
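The single-molecule per-step gap alone is roughly an order of magnitude (simple arithmetic on the quoted timings):

```python
# Per-step cost comparison from the single-molecule run (tafamidis, 27 atoms).
ts_step_ms = 265   # torch-sim + FairChemModel, default settings
ase_step_ms = 26   # ASE L-BFGS + FAIRChemCalculator with turbo inference

print(f"~{ts_step_ms / ase_step_ms:.0f}x slower per step")
# → ~10x slower per step
```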
## Questions
- Are we using `FairChemModel` + `ts.optimize()` correctly for this use case, or is there a better approach we're missing?
- Is the CPU roundtrip in `FairChemModel.forward()` a known limitation, or is there a way to keep the data on GPU throughout the optimization loop?
- Is there any path to supporting fairchem's turbo/compiled inference mode through torch-sim?
Thanks for building torch-sim — the batching infrastructure is clearly powerful, and the raw inference speedups are impressive. We're excited about the potential here and happy to help test any improvements.