
@daphne-cornelisse daphne-cornelisse commented Nov 3, 2025

The neural network in pufferdrive is currently one of the main speed bottlenecks.

The linear + max operations are particularly expensive. This PR speeds up training by replacing the torch linear + max operations in the road and partner encoders with a single CUDA kernel that fuses the two.
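To make the fused operation concrete, here is a minimal plain-Python sketch of the semantics. It assumes the max is taken over the points dimension after a per-point linear layer, and that the weight is laid out as `(hidden, in_features)`. The function names and shapes are hypothetical and only illustrate the idea; this is not the CUDA kernel from the PR.

```python
# Reference semantics for "linear + max" over a set of points, in plain Python.
# Names, shapes, and layout are illustrative assumptions, not PufferDrive code.

def linear_then_max(points, weight, bias):
    """Unfused: build the full (num_points x hidden) activation, then reduce."""
    hidden = [
        [sum(w * x for w, x in zip(row, p)) + b for row, b in zip(weight, bias)]
        for p in points
    ]
    # Max over the points dimension, one value per hidden feature.
    return [max(col) for col in zip(*hidden)]


def fused_linear_max(points, weight, bias):
    """Fused: keep one running max per hidden feature; no intermediate matrix."""
    out = [float("-inf")] * len(weight)
    for p in points:
        for j, (row, b) in enumerate(zip(weight, bias)):
            y = sum(w * x for w, x in zip(row, p)) + b
            if y > out[j]:
                out[j] = y
    return out
```

Both functions return the same result. The fused version never materializes the per-point activations, which is the general reason a fused kernel can save memory traffic compared with running the two operations back to back.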


The result is a ~3.5x speedup in steps per second (SPS). On an RTX 4080: 200K SPS (main) -> 700K SPS (new).

Although the new network needs more steps to reach the same performance, it is still an improvement over the network in main because it gets there in less wall-clock time.
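The trade-off between steps and wall-clock time can be sketched with simple arithmetic, assuming wall-clock time is roughly steps divided by SPS. The SPS values are the ones reported in this PR; the helper names and any step counts passed in are hypothetical and not measurements.

```python
# Wall-clock comparison under the assumption: time = steps / SPS.
SPS_MAIN = 200_000  # reported in this PR (RTX 4080, main)
SPS_NEW = 700_000   # reported in this PR (RTX 4080, new)

# Throughput ratio between the two networks.
speedup = SPS_NEW / SPS_MAIN


def wall_clock_seconds(steps, sps):
    """Approximate training time in seconds for a given step budget."""
    return steps / sps


def new_is_faster(steps_main, steps_new):
    """True if the new network reaches the target sooner in wall-clock time."""
    return wall_clock_seconds(steps_new, SPS_NEW) < wall_clock_seconds(steps_main, SPS_MAIN)
```

Under this assumption, the new network stays ahead in wall-clock time as long as it needs fewer than `speedup` times as many steps as the network in main.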


I also reduced the number of road points from 200 to 128 and verified empirically that this is enough to get an off-road rate near zero.

@daphne-cornelisse daphne-cornelisse changed the title from "Obscfg" to "Faster training" Nov 8, 2025

@daphne-cornelisse daphne-cornelisse changed the title from "Faster training" to "Replace neural network encoder with fused LinearMax cuda kernel" Nov 17, 2025
@Emerge-Lab Emerge-Lab deleted a comment from greptile-apps bot Nov 17, 2025
@daphne-cornelisse daphne-cornelisse marked this pull request as ready for review November 17, 2025 14:52

@daphne-cornelisse daphne-cornelisse changed the title from "Replace neural network encoder with fused LinearMax cuda kernel" to "Replace neural network encoder layer with fused LinearMax cuda kernel" Nov 17, 2025
@daphne-cornelisse daphne-cornelisse changed the base branch from main to gsp_dev November 25, 2025 14:26
@daphne-cornelisse daphne-cornelisse merged commit f1bf6aa into gsp_dev Nov 25, 2025
10 checks passed