[Perf] Tiles 4: add SharedArray slice support #482

Merged
hughperkins merged 157 commits into main from hp/tiles-4
Apr 20, 2026

@hughperkins
Collaborator

@hughperkins hughperkins commented Apr 14, 2026

Issue: #

Brief Summary

Summary

Adds Tile16x16 ↔ SharedArray interop: tiles can now load from and store to qd.simt.block.SharedArray using the same slice syntax as device arrays.

What's in this PR

6 new SharedArray tests

  • test_tile16_shared_array_roundtrip — field → tile → SharedArray → tile → field, verify data survives
  • test_tile16_shared_array_partial_cols — partial-column load/store through SharedArray, parametrized over (partial_store, partial_load) combinations
  • test_tile16_shared_array_cholesky — Cholesky factorization with L stored in SharedArray, verify reconstruction
  • test_tile16_shared_array_clamp_store — store tile to SharedArray narrower than 16 cols, verify auto-clamping
  • test_tile16_shared_array_clamp_load — load tile from SharedArray narrower than 16 cols, verify extra registers are zero
  • test_tile16_vec_proxy_shared_array — symmetric rank-1 subtract via vec proxy loaded from SharedArray at non-zero row offset
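
The clamping behaviour exercised by the two clamp tests can be modelled in plain Python. This is only a sketch under assumed semantics taken from the test descriptions above: a store writes only the columns that fit in the destination, and a load zero-fills the registers that fall outside the source. `TILE`, `store_clamped`, and `load_clamped` are hypothetical names, not Quadrants APIs, and nested lists stand in for both the tile and the SharedArray.

```python
TILE = 16  # tile edge length, matching the 16x16 tiles this PR is about


def store_clamped(tile, dst, r0, c0):
    """Copy a TILE x TILE tile into dst at (r0, c0), dropping columns that do not fit.

    Assumes dst has at least r0 + TILE rows; only columns are clamped.
    """
    width = max(0, min(TILE, len(dst[0]) - c0))
    for i in range(TILE):
        for j in range(width):
            dst[r0 + i][c0 + j] = tile[i][j]


def load_clamped(src, r0, c0):
    """Read a TILE x TILE tile from src at (r0, c0); columns past the edge become 0.0."""
    width = max(0, min(TILE, len(src[0]) - c0))
    return [
        [src[r0 + i][c0 + j] if j < width else 0.0 for j in range(TILE)]
        for i in range(TILE)
    ]


# A "shared array" with only 10 columns, pre-filled with a sentinel value.
shared = [[-1.0] * 10 for _ in range(TILE)]
tile = [[float(i * TILE + j) for j in range(TILE)] for i in range(TILE)]

store_clamped(tile, shared, 0, 0)     # columns 10..15 of the tile are dropped
reloaded = load_clamped(shared, 0, 0)  # columns 10..15 come back as zeros
```

After the round trip, the first 10 columns of `reloaded` match `tile` and the remaining 6 are zero, which is the property the clamp tests are described as checking.
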

Docs

  • Added "SharedArray support" section to tile16.md with code example and clamping behavior notes

Good

  • Enables a key blocked-algorithm pattern: factorize in registers, park in shared memory, reload in another tile — all with the familiar slice syntax
  • Column clamping works identically to device arrays, so users don't need to learn new rules
  • Vec proxy slicing (v = sh[K0:K1, col]) also works with SharedArray, enabling lookback patterns in blocked Cholesky
  • Tests cover the important edge cases (partial columns, clamping on both load and store sides)
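
To make the lookback pattern concrete, here is an unblocked, scalar analogue in plain Python. It is a sketch, not the tile implementation: nested lists stand in for the SharedArray that holds L, and the function name `cholesky_lookback` is hypothetical. Each new column is corrected by subtracting the contribution of every previously factored column, which is the role the `sh[K0:K1, col]` vector loads play in the blocked version.

```python
import math


def cholesky_lookback(a):
    """Return lower-triangular l with l @ l^T == a (a symmetric positive definite)."""
    n = len(a)
    l = [[0.0] * n for _ in range(n)]
    for col in range(n):
        # Start from rows col..n-1 of column `col` of the input matrix.
        v = [a[r][col] for r in range(col, n)]
        # Lookback: subtract the rank-1 contribution of each earlier column k.
        for k in range(col):
            prev = [l[r][k] for r in range(col, n)]  # analogue of v = sh[K0:K1, k]
            pivot = prev[0]
            v = [vi - pi * pivot for vi, pi in zip(v, prev)]
        diag = math.sqrt(v[0])
        for offset, vi in enumerate(v):
            l[col + offset][col] = vi / diag
    return l


a = [[4.0, 2.0, 2.0],
     [2.0, 5.0, 3.0],
     [2.0, 3.0, 6.0]]
l = cholesky_lookback(a)
```

Multiplying `l` by its transpose reproduces `a`, the same reconstruction check the Cholesky test is described as performing.
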

Bad

  • SharedArray tests are f32-only — no f64 coverage
  • No test for 3D SharedArray indexing (batch dimension), though this may not be a meaningful use case for shared memory
  • The test_tile16_vec_proxy_shared_array test manually copies data into SharedArray element-by-element with a loop, which is verbose but necessary since there's no bulk copy

API

copilot:summary

Walkthrough

copilot:walkthrough

TypeCheck has allow_undefined_visitor=true, so removing this override
is a no-op. The original code was actively wrong (overwriting correct
ret_type with i32), but nothing downstream relied on ret_type from
the type_check pass for InternalFuncStmt, so the bug was latent.
Removing the override eliminates the misleading TODO and prevents
a future pass from accidentally depending on the wrong type.
…tests

- Parametrize ger_sub and cholesky tests over f32/f64 dtypes
- Use tighter tolerance (1e-10) for f64, 1e-5 for f32
- Parametrize cholesky over src_offset (0, 5, 32) and dst_delta (0, 3, 16)
- Verify untouched regions of dst array remain at sentinel value
…olesky_

_ger_sub: 34 lines → 4 lines
cholesky_: 224 lines → 22 lines
Quadrants DSL types don't carry operator overloads in their stubs,
so pyright can't verify +=, *, /, > on shuffled values.
Fix scipy reference computation: solve X @ L^T = B requires
solve_triangular(L, B.T) not solve_triangular(L, B, trans='T').
Add type: ignore[reportOperatorIssue] for DSL operator expressions
in _trsm that pyright can't verify.
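
The reference fix described above follows from transposing the system: X @ L^T = B is equivalent to L @ X^T = B^T, so a lower-triangular solve against B^T yields X^T. The sketch below shows this with a hand-written forward substitution in plain Python; all helper names are hypothetical. With SciPy, the equivalent would presumably be `solve_triangular(L, B.T, lower=True).T`.

```python
def transpose(m):
    return [list(row) for row in zip(*m)]


def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def forward_substitute(l, rhs):
    """Solve l @ y = rhs for y, where l is lower triangular and rhs is a matrix."""
    n, m = len(l), len(rhs[0])
    y = [[0.0] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            acc = rhs[i][j] - sum(l[i][k] * y[k][j] for k in range(i))
            y[i][j] = acc / l[i][i]
    return y


def solve_right_lt(l, b):
    """Solve x @ l^T = b by solving l @ x^T = b^T and transposing the result."""
    return transpose(forward_substitute(l, transpose(b)))


l = [[2.0, 0.0],
     [1.0, 4.0]]
b = [[2.0, 9.0],
     [6.0, 7.0]]
x = solve_right_lt(l, b)
residual_check = matmul(x, transpose(l))  # should reproduce b
```
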
…_eye_

Replace 16-way explicit register access with loop-based _get_col/_set_col
helpers, eliminating ~145 lines of boilerplate. Also fix trsm test reference
computation (was computing B^T @ L^{-1} instead of B @ L^{-T}).
…e16x16

- Slice-based load/store: arr[r0:r1, c0:c1] for 2D and arr[b, r0:r1, c0:c1] for 3D
- qd.outer(a, b) deferred proxy for augmented assignment (t -= qd.outer(a, b))
- _TileSliceProxy, _VecSliceProxy, _TileRefProxy for deferred subscript evaluation
- AST purity exemptions for quadrants-internal code
- _quadrants_internal flag on Tile struct
The purity checker only flags int/float/Field attribute accesses
inside kernels. Tile() and Tile.zeros() don't trigger this, and
Tile.SIZE is only accessed in plain Python, not inside kernels.
Add tile16.md covering tile creation, slice load/store, rank-1 updates,
Cholesky, and triangular solve. Remove the unused _quadrants_internal
getattr escape hatch from the purity checker (file-level check suffices).
Add test for Tile.SIZE access inside a kernel.
…top in slices

Update tile16 doc examples to use Tile.SIZE. Enforce that both start
and stop indices are provided in tile/vec slice syntax (previously stop
defaulted to start+16 when omitted). Reorder _TileSliceProxy.__init__
args to match (row_start, row_stop, col_start, col_stop).
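
A minimal sketch of what enforcing explicit start and stop indices could look like, written in plain Python. `TileSliceChecker` is a hypothetical stand-in, and the exception types and messages are assumptions; the real check and its error text live in the Quadrants source.

```python
class TileSliceChecker:
    """Hypothetical stand-in for an array indexed with 2D tile slices."""

    def __getitem__(self, key):
        rows, cols = key
        for axis, s in (("row", rows), ("column", cols)):
            if not isinstance(s, slice):
                raise TypeError(f"{axis} index must be a slice, got {type(s).__name__}")
            # No defaulting: an omitted start or stop is an error, not start + 16.
            if s.start is None or s.stop is None:
                raise ValueError(f"{axis} slice needs both start and stop, got {s}")
        # Same argument order as the commit message: rows first, then columns.
        return (rows.start, rows.stop, cols.start, cols.stop)


arr = TileSliceChecker()
bounds = arr[0:16, 4:20]  # both bounds given on both axes: accepted
```

An index such as `arr[0:, 4:20]` raises in this sketch instead of silently assuming a 16-row extent.
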
The if-condition checking all four slice indices was too long for black.
The test_tile16_load_missing_start_raises regex didn't match the updated
error message after requiring both start and stop indices.
- test_tile16_outer_symmetric_same_variable: passes same variable for
  both args to qd.outer(v, v)
- test_tile16_vec_proxy_ger_sub_3d: column vector load from 3D array
  via arr[batch, r0:r1, col]
- test_outer_composition_raises: verifies qd.outer(a,b) + qd.outer(c,d)
  raises TypeError
…errors

Only intercept 2D/1D array slices for tile proxy creation when
_tile16_cache is non-empty (i.e., _make_tile16x16 has been called).
Non-tile programs now get the original "does not support slice" error.

Add _DeferredProxyMixin with __add__, __sub__, __mul__, __getitem__,
and __repr__ that raise clear TypeErrors explaining the proxy is only
valid in tile operations.
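
The idea behind such a mixin can be sketched in a few lines of plain Python. The class names and error wording here are hypothetical, and in this sketch `__repr__` returns a descriptive string rather than raising, which is a choice made for the example and not something the commit message confirms.

```python
class DeferredProxyMixin:
    """Rejects ordinary use of a proxy that only makes sense inside a tile operation."""

    def _reject(self, op):
        raise TypeError(
            f"{type(self).__name__} does not support {op}; "
            "it is only valid as part of a tile operation"
        )

    def __add__(self, other):
        self._reject("+")

    def __sub__(self, other):
        self._reject("-")

    def __mul__(self, other):
        self._reject("*")

    def __getitem__(self, key):
        self._reject("indexing")

    def __repr__(self):
        return f"<{type(self).__name__}: deferred proxy, only valid in tile operations>"


class OuterProxy(DeferredProxyMixin):
    """Hypothetical deferred outer product; stores operands without computing."""

    def __init__(self, a, b):
        self.a, self.b = a, b


try:
    OuterProxy([1.0], [2.0]) + OuterProxy([3.0], [4.0])
    message = None
except TypeError as exc:
    message = str(exc)
```

Composing two proxies with `+` fails immediately with a readable message, in the spirit of the `test_outer_composition_raises` test listed earlier.
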
Check isinstance(value, t) against cached tile types before creating
a _TileRefProxy for struct[:] subscripts. Non-tile structs now fall
through to normal subscript handling.
@hughperkins hughperkins changed the base branch from hp/tiles-4c to hp/tiles-4d April 17, 2026 17:41
@hughperkins hughperkins changed the title [Perf] Tiles 4: add Tile16x16Proxy for dtype-at-point-of-use + SharedArray support [Perf] Tiles 4: add SharedArray slice support Apr 17, 2026
Was hardcoded to _ATOLS[qd.f32] despite being parametrized over
qd_dtype, making the f64 path 100,000x too lenient.
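
The size of that leniency gap follows directly from the two tolerances quoted in the earlier commit message (1e-5 for f32, 1e-10 for f64). The sketch below uses a hypothetical string-keyed `ATOLS` table and `within_tolerance` helper in place of the real `_ATOLS[qd_dtype]` lookup.

```python
# Hypothetical tolerance table; the values are the ones quoted in the commit messages.
ATOLS = {"f32": 1e-5, "f64": 1e-10}


def within_tolerance(actual, expected, dtype):
    """Compare element-wise using the tolerance that belongs to `dtype`."""
    atol = ATOLS[dtype]
    return all(abs(a - e) <= atol for a, e in zip(actual, expected))


leniency_ratio = ATOLS["f32"] / ATOLS["f64"]  # roughly 100,000

# An error of about 1e-7 is acceptable for f32 but should fail an f64 test.
actual, expected = [1.0 + 1e-7], [1.0]
passes_f32 = within_tolerance(actual, expected, "f32")
passes_f64 = within_tolerance(actual, expected, "f64")
```

Looking the tolerance up by the parametrized dtype, rather than hardcoding the f32 entry, is what lets the f64 case reject an error of this size.
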
@hughperkins
Collaborator Author

Opus:

Done. All 6 SharedArray tests are now parametrized over qd_dtype with _QD_DTYPES (f32, f64):
• test_tile16_shared_array_roundtrip
• test_tile16_shared_array_partial_cols
• test_tile16_shared_array_cholesky
• test_tile16_shared_array_clamp_store
• test_tile16_shared_array_clamp_load
• test_tile16_vec_proxy_shared_array

Each test now includes skip_if_f64_unsupported, uses _NP_DTYPES[qd_dtype] for numpy dtypes, _ATOLS/_EPS_VALS where applicable, and passes qd_dtype: qd.Template as a kernel parameter. Pre-commit passes cleanly.

  ratio = a[i] / a[0]
  expected = i + 1
- assert abs(ratio - expected) / expected < 0.1 # 10% tolerance
+ assert abs(ratio - expected) / expected < 0.2 # 20% tolerance
Collaborator

@erizmr erizmr Apr 17, 2026

How do we determine this tolerance? Why did we change it from 10% to 20% here?

Collaborator Author

This is not part of this PR.

The Manylinux wheel Test (3.12, ubuntu-22.04) hung for 6h and was
cancelled. Previous run of the same commit range passed in ~10 min.

Made-with: Cursor
Collaborator

@erizmr erizmr left a comment

Approving on the basis that I have reviewed the design, the public-facing API, and the tests, and they look reasonable to me.

Base automatically changed from hp/tiles-4d to main April 20, 2026 08:06
# Conflicts:
#	docs/source/user_guide/tile16.md
#	tests/python/test_tile16.py
@hughperkins hughperkins enabled auto-merge (squash) April 20, 2026 08:12
@hughperkins hughperkins merged commit 5e84362 into main Apr 20, 2026
47 checks passed
@hughperkins hughperkins deleted the hp/tiles-4 branch April 20, 2026 09:31
2 participants