[Perf] Tiles 4: add SharedArray slice support by hughperkins · Pull Request #482 · Genesis-Embodied-AI/quadrants

hughperkins · 2026-04-14T18:55:22Z

Issue: #

Brief Summary

Summary

Adds Tile16x16 ↔ SharedArray interop: tiles can now load from and store to qd.simt.block.SharedArray using the same slice syntax as device arrays.

What's in this PR

6 new SharedArray tests

test_tile16_shared_array_roundtrip — field → tile → SharedArray → tile → field, verify data survives
test_tile16_shared_array_partial_cols — partial-column load/store through SharedArray, parametrized over (partial_store, partial_load) combinations
test_tile16_shared_array_cholesky — Cholesky factorization with L stored in SharedArray, verify reconstruction
test_tile16_shared_array_clamp_store — store tile to SharedArray narrower than 16 cols, verify auto-clamping
test_tile16_shared_array_clamp_load — load tile from SharedArray narrower than 16 cols, verify extra registers are zero
test_tile16_vec_proxy_shared_array — symmetric rank-1 subtract via vec proxy loaded from SharedArray at non-zero row offset
Docs

Added a "SharedArray support" section to tile16.md, with a code example and notes on clamping behavior
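
The clamping behavior covered by the two clamp tests can be modeled in plain Python. This is only a sketch of the semantics described above, not Quadrants code: `store_tile` and `load_tile` are hypothetical helpers, and a list of lists stands in for a SharedArray that is narrower than 16 columns.

```python
TILE = 16  # tile edge length, matching Tile16x16


def store_tile(tile, arr, r0, c0, c1):
    """Write tile columns into arr[r0:r0+TILE, c0:c1], clamping c1 to the array width."""
    c1 = min(c1, len(arr[0]))  # auto-clamp: never write past the last column
    for i in range(TILE):
        for j in range(c1 - c0):
            arr[r0 + i][c0 + j] = tile[i][j]


def load_tile(arr, r0, c0, c1):
    """Read arr[r0:r0+TILE, c0:c1] into a 16x16 tile; columns past the array stay zero."""
    c1 = min(c1, len(arr[0]))
    tile = [[0.0] * TILE for _ in range(TILE)]
    for i in range(TILE):
        for j in range(c1 - c0):
            tile[i][j] = arr[r0 + i][c0 + j]
    return tile


# A 16x10 stand-in for a SharedArray narrower than one tile.
narrow = [[-1.0] * 10 for _ in range(TILE)]
src = [[float(i * TILE + j) for j in range(TILE)] for i in range(TILE)]

store_tile(src, narrow, 0, 0, TILE)   # only columns 0..9 are written
back = load_tile(narrow, 0, 0, TILE)  # columns 10..15 come back as zero
```

In this model a full-width slice request against a 10-column array silently becomes a 10-column transfer, which is the property the clamp tests check from both the store and the load side.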

Good

Enables a key blocked-algorithm pattern: factorize in registers, park in shared memory, reload in another tile — all with the familiar slice syntax
Column clamping works identically to device arrays, so users don't need to learn new rules
Vec proxy slicing (v = sh[K0:K1, col]) also works with SharedArray, enabling lookback patterns in blocked Cholesky
Tests cover the important edge cases (partial columns, clamping on both load and store sides)
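
The "factorize, park, reload" pattern and the reconstruction check used by the Cholesky test can be illustrated without any GPU code. The sketch below is a pure-Python stand-in, assuming a textbook Cholesky loop; `cholesky`, `reconstruct`, and the `shared` list are illustrative names, not the tile implementation.

```python
import math


def cholesky(a):
    """Return lower-triangular L with L @ L^T == a (a must be symmetric positive definite)."""
    n = len(a)
    l = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(l[i][k] * l[j][k] for k in range(j))
            if i == j:
                l[i][j] = math.sqrt(a[i][i] - s)
            else:
                l[i][j] = (a[i][j] - s) / l[j][j]
    return l


def reconstruct(l):
    """Compute L @ L^T."""
    n = len(l)
    return [[sum(l[i][k] * l[j][k] for k in range(n)) for j in range(n)] for i in range(n)]


n = 16
# Build a symmetric positive definite matrix as M @ M^T + n * I.
m = [[((i * 7 + j * 3) % 5) / 5.0 for j in range(n)] for i in range(n)]
a = [[sum(m[i][k] * m[j][k] for k in range(n)) + (n if i == j else 0.0)
      for j in range(n)] for i in range(n)]

l = cholesky(a)                          # "factorize in registers"
shared = [row[:] for row in l]           # "park in shared memory" (modeled as a copy)
reloaded = [row[:] for row in shared]    # "reload in another tile"

a_rec = reconstruct(reloaded)
max_err = max(abs(a_rec[i][j] - a[i][j]) for i in range(n) for j in range(n))
```

The final comparison of `L @ L^T` against the input is the same kind of "verify reconstruction" check the test description mentions.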

Bad

SharedArray tests are f32-only — no f64 coverage
No test for 3D SharedArray indexing (batch dimension), though this may not be a meaningful use case for shared memory
test_tile16_vec_proxy_shared_array copies data into the SharedArray element by element in a loop; this is verbose but necessary because there is no bulk copy

API

copilot:summary

Walkthrough

copilot:walkthrough

TypeCheck has allow_undefined_visitor=true, so removing this override is a no-op. The original code was actively wrong (overwriting correct ret_type with i32), but nothing downstream relied on ret_type from the type_check pass for InternalFuncStmt, so the bug was latent. Removing the override eliminates the misleading TODO and prevents a future pass from accidentally depending on the wrong type.

…tests
- Parametrize ger_sub and cholesky tests over f32/f64 dtypes
- Use tighter tolerance (1e-10) for f64, 1e-5 for f32
- Parametrize cholesky over src_offset (0, 5, 32) and dst_delta (0, 3, 16)
- Verify untouched regions of dst array remain at sentinel value

Fix scipy reference computation: solve X @ L^T = B requires solve_triangular(L, B.T) not solve_triangular(L, B, trans='T'). Add type: ignore[reportOperatorIssue] for DSL operator expressions in _trsm that pyright can't verify.
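
The reasoning behind the reference fix: X @ L^T = B is a right-hand solve, and transposing both sides gives L @ X^T = B^T, an ordinary lower-triangular solve with B^T as the right-hand side. The sketch below works this through in pure Python with a hand-written forward substitution; `forward_substitute` and `solve_x_lt_eq_b` are illustrative names. In SciPy terms this would correspond to something like `solve_triangular(L, B.T, lower=True).T`, where `lower=True` and the final transpose are my reading and are not quoted from the commit.

```python
def forward_substitute(l, b):
    """Solve L @ y = b for a lower-triangular L and a single right-hand-side vector b."""
    n = len(l)
    y = [0.0] * n
    for i in range(n):
        s = sum(l[i][k] * y[k] for k in range(i))
        y[i] = (b[i] - s) / l[i][i]
    return y


def solve_x_lt_eq_b(l, b):
    """Solve X @ L^T = B row by row: each row x of X satisfies L @ x^T = (row of B)^T."""
    return [forward_substitute(l, row) for row in b]


def matmul_lt(x, l):
    """Compute X @ L^T, used only to check the solution."""
    n = len(l)
    return [[sum(row[k] * l[j][k] for k in range(n)) for j in range(n)] for row in x]


l = [[2.0, 0.0, 0.0],
     [1.0, 3.0, 0.0],
     [4.0, 5.0, 6.0]]
b = [[2.0, 7.0, 32.0],
     [4.0, 5.0, 19.0]]

x = solve_x_lt_eq_b(l, b)
residual = max(abs(p - q) for pr, br in zip(matmul_lt(x, l), b) for p, q in zip(pr, br))
```

By contrast, `solve_triangular(L, B, trans='T')` solves L^T @ X = B, which multiplies by the inverse on the left rather than the right and so answers a different question.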

…_eye_
Replace 16-way explicit register access with loop-based _get_col/_set_col helpers, eliminating ~145 lines of boilerplate. Also fix the trsm test reference computation (it was computing B^T @ L^{-1} instead of B @ L^{-T}).

…e16x16
- Slice-based load/store: arr[r0:r1, c0:c1] for 2D and arr[b, r0:r1, c0:c1] for 3D
- qd.outer(a, b) deferred proxy for augmented assignment (t -= qd.outer(a, b))
- _TileSliceProxy, _VecSliceProxy, _TileRefProxy for deferred subscript evaluation
- AST purity exemptions for quadrants-internal code
- _quadrants_internal flag on Tile struct

Add tile16.md covering tile creation, slice load/store, rank-1 updates, Cholesky, and triangular solve. Remove the unused _quadrants_internal getattr escape hatch from the purity checker (file-level check suffices). Add test for Tile.SIZE access inside a kernel.

…top in slices
Update tile16 doc examples to use Tile.SIZE. Enforce that both start and stop indices are provided in tile/vec slice syntax (previously stop defaulted to start+16 when omitted). Reorder _TileSliceProxy.__init__ args to match (row_start, row_stop, col_start, col_stop).
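
A rule like "both start and stop must be given" can be enforced directly on Python `slice` objects. The following is a minimal sketch under my own assumptions: `require_full_slice` and `SliceProbe` are hypothetical, and the exception type and message are illustrative rather than those used in Quadrants.

```python
def require_full_slice(s, axis):
    """Return (start, stop) for a slice, rejecting slices that omit either bound."""
    if not isinstance(s, slice):
        raise TypeError(f"{axis} index must be a slice, got {type(s).__name__}")
    if s.start is None or s.stop is None:
        raise ValueError(f"{axis} slice must specify both start and stop")
    return s.start, s.stop


class SliceProbe:
    """Toy container whose subscript returns (row_start, row_stop, col_start, col_stop)."""

    def __getitem__(self, key):
        rows, cols = key
        return (*require_full_slice(rows, "row"), *require_full_slice(cols, "column"))


probe = SliceProbe()
ok = probe[0:16, 4:20]  # all four bounds are explicit, so this is accepted
```

Here `probe[0:16, :16]` raises, because the column slice has no start. The returned tuple follows the same (row_start, row_stop, col_start, col_stop) order that the commit message gives for the reordered constructor arguments.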

- test_tile16_outer_symmetric_same_variable: passes same variable for both args to qd.outer(v, v)
- test_tile16_vec_proxy_ger_sub_3d: column vector load from 3D array via arr[batch, r0:r1, col]
- test_outer_composition_raises: verifies qd.outer(a,b) + qd.outer(c,d) raises TypeError

…errors
Only intercept 2D/1D array slices for tile proxy creation when _tile16_cache is non-empty (i.e., _make_tile16x16 has been called). Non-tile programs now get the original "does not support slice" error. Add _DeferredProxyMixin with __add__, __sub__, __mul__, __getitem__, and __repr__ that raise clear TypeErrors explaining the proxy is only valid in tile operations.
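
The idea of a mixin that turns accidental proxy arithmetic into a clear error can be sketched as follows. This is a hypothetical illustration, not the actual `_DeferredProxyMixin`: the class names, the message text, and the choice to have `__repr__` return a descriptive string rather than raise are all my assumptions.

```python
class DeferredProxyMixin:
    """Makes a deferred proxy fail loudly when used outside its intended context."""

    _what = "deferred proxy"

    def _misuse(self, op):
        raise TypeError(
            f"{self._what} does not support {op}; it is only valid inside tile operations"
        )

    def __add__(self, other):
        self._misuse("+")

    def __sub__(self, other):
        self._misuse("-")

    def __mul__(self, other):
        self._misuse("*")

    def __getitem__(self, key):
        self._misuse("indexing")

    def __repr__(self):
        return f"<{self._what} (only valid in tile operations)>"


class OuterProxy(DeferredProxyMixin):
    """Stand-in for the object produced by an outer(a, b) call before it is consumed."""

    _what = "outer(a, b) proxy"

    def __init__(self, a, b):
        self.a, self.b = a, b


p = OuterProxy([1.0], [2.0])
q = OuterProxy([3.0], [4.0])
```

With this in place, `p + q` raises a TypeError whose message names the proxy, which is the behavior test_outer_composition_raises is described as checking for the real API.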

hughperkins · 2026-04-17T18:02:40Z

Opus:

Done. All 6 SharedArray tests are now parametrized over qd_dtype with _QD_DTYPES (f32, f64):
• test_tile16_shared_array_roundtrip
• test_tile16_shared_array_partial_cols
• test_tile16_shared_array_cholesky
• test_tile16_shared_array_clamp_store
• test_tile16_shared_array_clamp_load
• test_tile16_vec_proxy_shared_array

Each test now includes skip_if_f64_unsupported, uses _NP_DTYPES[qd_dtype] for numpy dtypes, _ATOLS/_EPS_VALS where applicable, and passes qd_dtype: qd.Template as a kernel parameter. Pre-commit passes cleanly.

erizmr · 2026-04-17T23:52:00Z

        ratio = a[i] / a[0]
        expected = i + 1
-        assert abs(ratio - expected) / expected < 0.1  # 10% tolerance
+        assert abs(ratio - expected) / expected < 0.2  # 20% tolerance


How do we determine this tolerance? Why did we change it from 10% to 20% here?

This is not part of this PR.

erizmr

Approving on the basis that I have reviewed the design and the public facing API, tests, and they look reasonable to me.

hughperkins added 30 commits April 12, 2026 10:29

feat: add rank-1 subtract and Cholesky factorization to Tile16x16

af9aa0c

refactor: extract _get_col/_set_col helpers, simplify _ger_sub and ch…

c27fbd3

…olesky_ _ger_sub: 34 lines → 4 lines cholesky_: 224 lines → 22 lines

Merge branch 'main' into hp/tiles-2a

88829f9

fix: suppress pyright reportOperatorIssue in cholesky_ DSL code

5302270

Quadrants DSL types don't carry operator overloads in their stubs, so pyright can't verify +=, *, /, > on shuffled values.

feat: add triangular solve to Tile16x16

a29b0b5

merge origin/main into hp/tiles-2b

600b71d

test: verify solve_triangular_ raises on lower=False

c1d2414

Merge branch 'hp/tiles-2b' into hp/tiles-2c

7264960

refactor: remove unnecessary _quadrants_internal flag from Tile16x16

109e39b

The purity checker only flags int/float/Field attribute accesses inside kernels. Tile() and Tile.zeros() don't trigger this, and Tile.SIZE is only accessed in plain Python, not inside kernels.

Merge branch 'main' into hp/tiles-2c

02fff68

docs: use intermediate variable for vec slice in tile16 examples

d1462b0

docs: document allowable slice values and ranges for tile16 load/store

c3fff86

docs: separate block size and subgroup size into distinct sections

7d776c7

fix: clamp vec slice row range to array bounds in _resolve_vec2d/3d

7e59acf

Merge branch 'hp/tiles-2c' into hp/tiles-3

9f652bd

fix: black formatting for long if-condition and update test error match

45d40ee

The if-condition checking all four slice indices was too long for black. The test_tile16_load_missing_start_raises regex didn't match the updated error message after requiring both start and stop indices.

test: add missing-stop-index and omitted-[:]-rebind error tests

0d9cb95

test: add missing-stop-index test for store path

1004fd1

test: use qd.outer public API path instead of private import

694fb45

test: verify proxy misuse gives clear TypeError messages

e787158

fix: only create _TileRefProxy for actual tile structs, not all structs

d1aa338

Check isinstance(value, t) against cached tile types before creating a _TileRefProxy for struct[:] subscripts. Non-tile structs now fall through to normal subscript handling.

hughperkins added 3 commits April 17, 2026 10:38

move 'import types' to top-level in util.py

b902d1f

consolidate slice error tests into parametrized tests

f8c7b02

Merge branch 'hp/tiles-4d' into hp/tiles-4

fdc6b8d

# Conflicts: # tests/python/test_tile16.py

hughperkins changed the base branch from hp/tiles-4c to hp/tiles-4d April 17, 2026 17:41

hughperkins added 5 commits April 17, 2026 10:43

update tile16 docs to use public qd.simt.Tile16x16 API

023213b

Merge branch 'hp/tiles-4c' into hp/tiles-4d

c1b0a56

Merge branch 'hp/tiles-4d' into hp/tiles-4

2817965

# Conflicts: # docs/source/user_guide/tile16.md

Revert "Widen test_clock_accuracy tolerance from ±1 to ±2"

a80d39d

This reverts commit 876a666.

Merge remote-tracking branch 'origin/main' into hp/tiles-4a

4a90dca

hughperkins changed the title ~~[Perf] Tiles 4: add Tile16x16Proxy for dtype-at-point-of-use + SharedArray support~~ [Perf] Tiles 4: add SharedArray slice support Apr 17, 2026

hughperkins added 2 commits April 17, 2026 13:47

Merge branch 'hp/tiles-4a' into hp/tiles-4b

ee78f67

fix test_tile16_syr_sub tolerance to use parametrized dtype

92ec118

Was hardcoded to _ATOLS[qd.f32] despite being parametrized over qd_dtype, making the f64 path 100,000x too lenient.
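
The "100,000x" figure follows from the two tolerances quoted in an earlier commit message (1e-5 for f32, 1e-10 for f64). The sketch below reproduces the arithmetic and shows how an error can slip past the f32 tolerance while failing the f64 one. The `ATOLS` dictionary with string keys is a stand-in for the real `_ATOLS` table, whose exact contents are assumed here.

```python
# Stand-in for the test module's per-dtype tolerance table (values assumed from the
# commit message: 1e-5 for f32, 1e-10 for f64).
ATOLS = {"f32": 1e-5, "f64": 1e-10}

# How much looser the f32 tolerance is than the f64 one.
leniency = ATOLS["f32"] / ATOLS["f64"]

# An f64 result that is wrong in the 7th decimal place.
observed_error = 1e-7

passes_with_hardcoded_f32 = observed_error < ATOLS["f32"]   # the bug: always uses f32
passes_with_dtype_lookup = observed_error < ATOLS["f64"]    # the fix: index by dtype
```

An error of 1e-7 is accepted under the hardcoded f32 tolerance but rejected once the tolerance is looked up by the parametrized dtype.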

hughperkins added 7 commits April 17, 2026 14:29

Merge branch 'hp/tiles-4b' into hp/tiles-4c

5486ab5

Merge branch 'hp/tiles-4c' into hp/tiles-4d

5d8f248

Merge branch 'hp/tiles-4d' into hp/tiles-4

1b69159

fix(test): widen clock accuracy tolerance to 20% to match main

7ac9376

Syncs with main branch commit 62af01b ("[CI] Increase clock test tolerance to 20%"). The 10% threshold was flaky on CI GPU runners. Made-with: Cursor

ci: retrigger AMD GPU test (runner timeout on previous run)

67cb50d

Made-with: Cursor

ci: retrigger CI (AMD GPU job was manually cancelled, not a code fail…

346843b

…ure) Made-with: Cursor

Merge remote-tracking branch 'origin/main' into hp/tiles-4c

c2525e2

Made-with: Cursor # Conflicts: # tests/python/test_tile16.py

erizmr reviewed Apr 17, 2026

hughperkins added 3 commits April 17, 2026 20:52

ci: retrigger after flaky 3.12 x86 test timeout

2c2c52b

The Manylinux wheel Test (3.12, ubuntu-22.04) hung for 6h and was cancelled. Previous run of the same commit range passed in ~10 min. Made-with: Cursor

Merge branch 'hp/tiles-4c' into hp/tiles-4d

d7df3ee

Merge branch 'hp/tiles-4d' into hp/tiles-4

92992c9

erizmr approved these changes Apr 19, 2026

Base automatically changed from hp/tiles-4d to main April 20, 2026 08:06

Merge remote-tracking branch 'origin/main' into hp/tiles-4

df035d6

# Conflicts: # docs/source/user_guide/tile16.md # tests/python/test_tile16.py

hughperkins enabled auto-merge (squash) April 20, 2026 08:12

hughperkins merged commit 5e84362 into main Apr 20, 2026
47 checks passed

hughperkins deleted the hp/tiles-4 branch April 20, 2026 09:31

hughperkins merged 157 commits into main from hp/tiles-4
