
feat: swiglu forward optimizations#63

Open
aghilann wants to merge 8 commits into NVIDIA:main from aghilann:swiglu-optimizations

Conversation


@aghilann aghilann commented Feb 23, 2026

Description

  • Implements a minimal, forward-only SwiGLU optimization in `src/tilegym/ops/cutile/swiglu.py`.
  • Uses fast sigmoid math (`flush_to_zero=True`) plus an approximate reciprocal via `rounding_mode=RMd.APPROX` to reduce scalar math cost.
  • Uses gather/scatter instead of load/store.
  • Preserves backward behavior while improving forward throughput.
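For context, the forward computation being optimized is SwiGLU, i.e. `silu(a) * b` with the sigmoid evaluated in float32 (as in the kernel snippets reviewed below). A minimal NumPy reference of the math, not the cuTile kernel itself:

```python
import numpy as np

def swiglu_forward_ref(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Reference SwiGLU forward: silu(a) * b, sigmoid computed in float32."""
    a32 = a.astype(np.float32)
    sigmoid_a = 1.0 / (1.0 + np.exp(-a32))   # sigmoid(a)
    silu_a = a32 * sigmoid_a                 # silu(a) = a * sigmoid(a)
    return silu_a.astype(a.dtype) * b

a = np.array([-1.0, 0.0, 2.0], dtype=np.float16)
b = np.array([3.0, 3.0, 3.0], dtype=np.float16)
print(swiglu_forward_ref(a, b))
```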

Benchmark Results (Added bfloat16 + float32 in addition to float16)

| Suite | main CuTile (GB/s) | swiglu-optimizations CuTile (GB/s) | Speedup |
|---|---:|---:|---:|
| swiglu-batch1-M128-bfloat16-GBps | 1083.29 | 1723.81 | 1.591x |
| swiglu-batch1-M128-float16-GBps | 1206.48 | 1741.01 | 1.443x |
| swiglu-batch1-M128-float32-GBps | 1767.60 | 2330.74 | 1.319x |
| swiglu-batch1-M4096-bfloat16-GBps | 1685.89 | 1877.48 | 1.114x |
| swiglu-batch1-M4096-float16-GBps | 1593.83 | 1742.88 | 1.094x |
| swiglu-batch1-M4096-float32-GBps | 1236.80 | 1252.18 | 1.012x |
| swiglu-batch4-M128-bfloat16-GBps | 1919.65 | 2634.60 | 1.372x |
| swiglu-batch4-M128-float16-GBps | 1987.82 | 2639.57 | 1.328x |
| swiglu-batch4-M128-float32-GBps | 2008.17 | 2471.87 | 1.231x |
| swiglu-batch4-M4096-bfloat16-GBps | 787.96 | 790.20 | 1.003x |
| swiglu-batch4-M4096-float16-GBps | 787.30 | 791.15 | 1.005x |
| swiglu-batch4-M4096-float32-GBps | 775.18 | 774.31 | 0.999x |
| swiglu-batch8-M128-bfloat16-GBps | 2063.11 | 2593.24 | 1.257x |
| swiglu-batch8-M128-float16-GBps | 2088.77 | 2598.10 | 1.244x |
| swiglu-batch8-M128-float32-GBps | 1994.69 | 2197.19 | 1.102x |
| swiglu-batch8-M4096-bfloat16-GBps | 774.35 | 776.40 | 1.003x |
| swiglu-batch8-M4096-float16-GBps | 773.72 | 773.67 | 1.000x |
| swiglu-batch8-M4096-float32-GBps | 773.00 | 773.09 | 1.000x |
| Overall (mean of suites) | 1405.98 | 1693.42 | 1.204x |
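As a sanity check on the aggregation (an assumption on my part: the overall row looks like the arithmetic mean of each throughput column, with the speedup taken as the ratio of those means):

```python
main = [1083.29, 1206.48, 1767.60, 1685.89, 1593.83, 1236.80,
        1919.65, 1987.82, 2008.17, 787.96, 787.30, 775.18,
        2063.11, 2088.77, 1994.69, 774.35, 773.72, 773.00]
opt = [1723.81, 1741.01, 2330.74, 1877.48, 1742.88, 1252.18,
       2634.60, 2639.57, 2471.87, 790.20, 791.15, 774.31,
       2593.24, 2598.10, 2197.19, 776.40, 773.67, 773.09]

mean_main = sum(main) / len(main)  # arithmetic mean over the 18 suites
mean_opt = sum(opt) / len(opt)
print(round(mean_main, 2), round(mean_opt, 2), round(mean_opt / mean_main, 3))
# → 1405.98 1693.42 1.204
```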

Notes for PR:

CI Configuration

```yaml
config:
  build: true
  # valid options are "ops" and "benchmark"
  test: ["ops", "benchmark"]
```

Checklist

  • Code formatted and imports sorted via repo specifications (./format.sh)
  • Documentation updated (if needed)
  • CI configuration reviewed


copy-pr-bot bot commented Feb 23, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.


```diff
  # Compute sigmoid(a) and silu(a)
- sigmoid_a = sigmoid(a_tile_f32)
+ sigmoid_a = 1.0 / (1.0 + ct.exp(-a_tile_f32))
```

@aghilann aghilann Feb 23, 2026


Inlined this for now because I didn’t want to modify the backward kernel in this PR - that would require re-benchmarking it as well. I have additional optimizations planned that I’ll include in a separate PR, which will also make use of the new sigmoid implementation I added.

@aghilann force-pushed the swiglu-optimizations branch from 381e8dc to 0461595 on February 23, 2026 at 06:27
```diff
 def sigmoid(x):
-    return 1.0 / (1.0 + ct.exp(-x))
+    denom = ct.add(1.0, ct.exp(-x), flush_to_zero=True)
+    return ct.truediv(1.0, denom, flush_to_zero=True, rounding_mode=RMd.APPROX)
```
@aghilann aghilann Feb 23, 2026


A good chunk of the savings came from `RMd.APPROX`, without losing precision (verified via tests).
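One way to see why an approximate reciprocal barely affects sigmoid: sigmoid(x) = 1/(1+exp(-x)) ≤ 1, so a relative error ε in the reciprocal moves the output by at most ε in absolute terms. A small sketch, where the 1e-6 bound is an assumption for illustration, not the documented bound of the hardware instruction:

```python
import numpy as np

REL_ERR = 1e-6  # assumed relative error of the approximate reciprocal

x = np.linspace(-20.0, 20.0, 100_001)
denom = 1.0 + np.exp(-x)
exact = 1.0 / denom  # exact sigmoid

# Worst-case perturbations of the reciprocal by +/- REL_ERR
worst = np.maximum(np.abs(exact * (1.0 + REL_ERR) - exact),
                   np.abs(exact * (1.0 - REL_ERR) - exact))

# Since sigmoid <= 1, the absolute output error is bounded by REL_ERR
print(float(worst.max()) <= REL_ERR * 1.0000001)  # → True
```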

```python
a_tile = ct.gather(a, (row, offsets), check_bounds=True, padding_value=0.0)
# Sigmoid requires type float32
c_tile = silu(a_tile.astype(ct.float32)).astype(a.dtype) * b_tile
ct.store(c, index=(row, col), tile=c_tile)
```

A good chunk of the perf improvements came from using gather/scatter instead of load/store.
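For readers unfamiliar with the gather semantics used here: `ct.gather(..., check_bounds=True, padding_value=0.0)` returns the padding value for out-of-range offsets instead of faulting. A NumPy emulation of that behavior on one row (my own sketch, not cuTile's implementation):

```python
import numpy as np

def gather_1d(src: np.ndarray, offsets: np.ndarray, padding_value=0.0) -> np.ndarray:
    """Bounds-checked gather: out-of-range offsets yield padding_value."""
    in_bounds = (offsets >= 0) & (offsets < src.shape[0])
    safe = np.where(in_bounds, offsets, 0)          # clamp to a valid index
    out = src[safe]                                 # gather
    return np.where(in_bounds, out, padding_value)  # apply padding

row = np.array([10.0, 20.0, 30.0])
print(gather_1d(row, np.array([0, 2, 5])))  # offset 5 is out of range -> 0.0
```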

```python
create_benchmark_config(batch_size, M, dtype)
for batch_size in [1, 4, 8]  # Different batch sizes
for M in [128, 4096]  # Different rows
for dtype in [torch.float16, torch.bfloat16, torch.float32]
```

Most benchmarks test across various dtypes, so I thought this one should too.
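The comprehension above expands to 3 × 2 × 3 = 18 suites, matching the benchmark table. A standalone sketch, with a hypothetical stub for `create_benchmark_config` and dtype names standing in for the torch dtypes:

```python
# Hypothetical stand-in for the repo's create_benchmark_config helper:
# it just records the parameters of one benchmark case.
def create_benchmark_config(batch_size, M, dtype):
    return {"batch_size": batch_size, "M": M, "dtype": dtype}

configs = [
    create_benchmark_config(batch_size, M, dtype)
    for batch_size in [1, 4, 8]                      # different batch sizes
    for M in [128, 4096]                             # different row counts
    for dtype in ["float16", "bfloat16", "float32"]  # torch dtype stand-ins
]
print(len(configs))  # → 18
```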

@aghilann

Hey @hannahli-nv, another day - another cuTILE perf upgrade!

@aghilann

@xjmxyt Any chance I could get a review :)
