Llama reimplementation with fused transaction sequence by andrej · Pull Request #71 · amd/IRON

andrej · 2026-01-26T21:35:38Z

Added

Python code that automatically generates a fused MLIR operator from an "invocation plan" of smaller operators using aiex.configure and aiex.run
llama_npu.py implementation that makes use of that fusion. Goal: all operators run end-to-end on the NPU, with no CPU involvement whatsoever
A couple of new operators needed to run everything on the NPU: strided_copy and repeat -- not yet very optimized
New, simple baseline CPU llama implementation
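
For readers unfamiliar with the two new operators, here is a minimal pure-Python sketch of the semantics they are assumed to have. This is not the NPU implementation in this PR. The function names, the flat-list buffer model, and the `sizes`/`strides`/`offset` parameters are hypothetical. The `repeat` behaviour is assumed to match `repeat_interleave` along the first dimension, based on the commit titles.

```python
from itertools import product


def strided_copy_ref(src, sizes, strides, offset=0):
    """Reference gather: walk flat list `src` with an N-d strided pattern.

    Assumed semantics (hypothetical): element (i0, i1, ...) of the output
    is read from src[offset + i0*strides[0] + i1*strides[1] + ...].
    """
    out = []
    for idx in product(*(range(n) for n in sizes)):
        out.append(src[offset + sum(i * s for i, s in zip(idx, strides))])
    return out


def repeat_interleave_ref(rows, repeats):
    """Reference repeat: duplicate each row `repeats` times, in place order."""
    out = []
    for row in rows:
        out.extend(list(row) for _ in range(repeats))
    return out


# Transpose a row-major 2x3 matrix by reading it with swapped strides.
transposed = strided_copy_ref([0, 1, 2, 3, 4, 5], sizes=(3, 2), strides=(1, 3))

# Expand 2 KV heads to 4, as grouped-query attention needs.
expanded = repeat_interleave_ref([[1, 2], [3, 4]], repeats=2)
```

A reference like this is mainly useful for checking NPU operator output against a known-good result.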

Changed

streamlined parts of the compilation infrastructure where I ran into limitations
added a masking option to softmax, controlled by a run-time parameter
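
As an illustration of the masking option, the sketch below shows one plausible reading: a run-time `valid_len` parameter marks how many leading scores are real, and everything after it is treated as -inf. The function name and the meaning of the parameter are assumptions, not the operator's actual interface.

```python
import math


def masked_softmax_ref(scores, valid_len):
    """Softmax over the first `valid_len` entries; the rest behave as -inf.

    `valid_len` stands in for the run-time parameter (assumed meaning).
    Masked positions get probability 0.0.
    """
    if not 0 < valid_len <= len(scores):
        raise ValueError("valid_len must be in 1..len(scores)")
    valid = scores[:valid_len]
    peak = max(valid)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in valid]
    total = sum(exps)
    return [e / total for e in exps] + [0.0] * (len(scores) - valid_len)


# Only the first two positions of the score row are valid during decode.
probs = masked_softmax_ref([0.0, 0.0, 5.0, 7.0], valid_len=2)
```

Making the length a run-time value means one compiled operator can serve every decode step, rather than needing a recompile as the sequence grows.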

Removed

old llama implementation

To-Do

offload everything in prefill too -- there are still some CPU operations there
fuse prefill together too
upgrade the Actions runner to use an XRT version from after Xilinx/XRT#9560 ("Add Python binding for ELF-initialized hardware context"), which this PR relies on
upgrade this repository to use an MLIR-AIE version from after Xilinx/mlir-aie#2831 ("Also inline locks in aiex.run lowering + codegen performance improvements"), which this PR relies on
find a better approach to patching: currently, patching the fused transaction sequence for decode slows everything down a lot because we re-instantiate a new xrt::elf for every token. The proper solution would be to use the new scratchpad memory firmware feature to read run-time parameters from DDR
re-test all the other operators after the compilation infrastructure refactoring, which might have broken some of them
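
For context on the patching item, one of the commits in this PR discovers patch locations "by use of magic values". Here is a minimal sketch of that general idea on a plain byte buffer. It assumes 32-bit little-endian words and word-aligned placeholders. The function names and the magic constant are hypothetical, and it does not touch XRT or the real ELF format.

```python
import struct

MAGIC = 0xDEADBEEF  # hypothetical placeholder value written at build time


def find_patch_offsets(blob, magic=MAGIC):
    """Return byte offsets of every word-aligned occurrence of `magic`."""
    needle = struct.pack("<I", magic)
    offsets = []
    pos = blob.find(needle)
    while pos != -1:
        if pos % 4 == 0:  # ignore matches that straddle word boundaries
            offsets.append(pos)
        pos = blob.find(needle, pos + 1)
    return offsets


def apply_patch(blob, offsets, value):
    """Return a copy of `blob` with `value` written at each offset."""
    patched = bytearray(blob)
    for off in offsets:
        struct.pack_into("<I", patched, off, value)
    return bytes(patched)


# A stand-in "transaction sequence" with two placeholder words.
blob = struct.pack("<4I", 1, MAGIC, 2, MAGIC)
offsets = find_patch_offsets(blob)       # discovered once, up front
patched = apply_patch(blob, offsets, 7)  # repeated per token
```

The offsets only need to be discovered once, so the per-token cost of rewriting the bytes is small. The slowdown described above comes from re-instantiating the xrt::elf afterwards, which this sketch does not address.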

PR Merge Checklist

The PR is rebased on the latest devel commit and pointing to devel.
Your PR has been reviewed and approved.
All checks are passing.


andrej added 30 commits January 14, 2026 14:46

rework profiling

cf6485d

vibe-coded flame graph visualization

af46d81

plot updates

1f184eb

simplified implementation (no KV cache yet)

eedc527

add KV cache

5372fca

start with simplified llama for NPU

f0d3289

offload last layer GEMM/GEMV

fd6f759

add profile path analyzer

ec0d372

change profiling to allow annotation using contexts; remove nn.Module

280917a

refactoring started; RMSNorm offloaded, last layer GEMM started

7b56ef8

last layer GEMM offloaded

1daaa39

fixes:

83380c2

cleanup

3aa1e02

RMSNorm offloaded everywhere, cleanup

002d487

less buffer copying

db9c4b8

offload first residual

b0942e8

simplify

02e5307

offload second residual

e815c50

SwiGLU offloaded

ebe8f32

offload RoPE

7515199

offload attention query projection linear layer

7ef6b03

offload attention key projection linear layer -- slight decrease in output quality

d43eeea

add batching to GEMV, fix issue when K<vector_size, offload attention score calculation in llama

de2995c

add strided_copy operator

74300bd

add patchable callable

b1eab7c

simplify llama_npu.py, make GEMV operator input shapes simpler

119bb7e

fix strided copy; offload KV cache concat + transpose to NPU

7fee60d

offload repeat_interleave

b72432b

offload normalization/scaling + softmax (with -inf masking on CPU for now)

8b4b2b3

make softmax run-time parametrizable

5575d4f

andrej added 30 commits January 21, 2026 13:01

autofuse update

11d5802

refactor compilation

f1a2ab9

towards full fused ELF + some more refactoring

d80b726

fixes; requires XRT PR 9560 to be merged

f0cda24

fixes

d715e48

finally working

684a725

fixes

447983c

optimize out reconfiguration

e025ac7

fix some compilation issues

e3c0e64

make all llama operators take a kernel archive and func prefix arg

34fc4c5

txn-fused swiglu

a197f2d

bring up to speed after host runtime refactor

e35ed7b

refactor symbol renaming to not clash with externally defined library functions; fused-txn of more of the transformer block

2994ba2

fuse first part of attention

25605be

make it possible to slice buffers in fused txn specification; fused-txn transpose - 3 TPS

e01a6f0

discover patching locations automatically by use of magic values

e9cbc00

make ELFs patchable; offload strided-copy for KV cache

b7d2834

fuse repeat_interleave and post attention residual onto other operators - 2.4 TPS

687cb2a

fused attn score and scaling onto end - 2.5 TPS

8b0aaeb

fuse on softmax as well

361f10e

transpose fused onto the end; 2.6 TPS

834f33b

fuse attention context gemv - 2.7 TPS

f345e8d

fuse attn output onto end - 2.7 TPS

b46838f

fuse GQA + post attention - 2.5 TPS

99ef9fa

fuse rms norm onto beginning of transformer block -- full transformer block fused -- 2.5 TPS

8eaa9bc

[WRONG RESULTS] 16x-fused transformer block

165a93b

remove unnecessary syncs, remove unused ops -- 4.4 TPS

86d7de8

[decode end-to-end fused] offload last rms norm and last linear layer -- 4.4 TPS

77bac5a

cleanup

6211124

remove old llama implementation

2da438d

andrej wants to merge 67 commits into devel from llama-rework
