[Perf] Tiles 4: add SharedArray slice support #482
TypeCheck has allow_undefined_visitor=true, so removing this override is a no-op. The original code was actively wrong (overwriting correct ret_type with i32), but nothing downstream relied on ret_type from the type_check pass for InternalFuncStmt, so the bug was latent. Removing the override eliminates the misleading TODO and prevents a future pass from accidentally depending on the wrong type.
…tests
- Parametrize ger_sub and cholesky tests over f32/f64 dtypes
- Use tighter tolerance (1e-10) for f64, 1e-5 for f32
- Parametrize cholesky over src_offset (0, 5, 32) and dst_delta (0, 3, 16)
- Verify untouched regions of dst array remain at sentinel value
…olesky_
_ger_sub: 34 lines → 4 lines
cholesky_: 224 lines → 22 lines
Quadrants DSL types don't carry operator overloads in their stubs, so pyright can't verify +=, *, /, > on shuffled values.
Fix scipy reference computation: solve X @ L^T = B requires solve_triangular(L, B.T) not solve_triangular(L, B, trans='T'). Add type: ignore[reportOperatorIssue] for DSL operator expressions in _trsm that pyright can't verify.
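The identity behind this fix can be checked without scipy. Transposing X @ L^T = B gives L @ X^T = B^T, so a lower-triangular forward substitution applied to B^T yields X^T. The following is a minimal pure-Python sketch of that reasoning; forward_substitute, transpose, matmul and solve_x_lt_eq_b are hypothetical helper names chosen for illustration and are not part of the test suite.

```python
def forward_substitute(L, rhs):
    """Solve L @ Y = rhs for Y, where L is lower triangular (lists of lists)."""
    n = len(L)
    m = len(rhs[0])
    Y = [[0.0] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            acc = rhs[i][j]
            # Subtract contributions of the rows already solved.
            for k in range(i):
                acc -= L[i][k] * Y[k][j]
            Y[i][j] = acc / L[i][i]
    return Y


def transpose(A):
    return [list(row) for row in zip(*A)]


def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def solve_x_lt_eq_b(L, B):
    # X @ L^T = B  <=>  L @ X^T = B^T, so solve against B^T and transpose back.
    return transpose(forward_substitute(L, transpose(B)))


L = [[2.0, 0.0], [1.0, 3.0]]
B = [[4.0, 11.0], [2.0, 7.0]]
X = solve_x_lt_eq_b(L, B)
residual = matmul(X, transpose(L))  # should reproduce B
```

In scipy terms this corresponds to transposing the result of solve_triangular(L, B.T, lower=True), whereas trans='T' solves L^T @ X = B, which is a different system.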
…_eye_
Replace 16-way explicit register access with loop-based _get_col/_set_col
helpers, eliminating ~145 lines of boilerplate. Also fix trsm test reference
computation (was computing B^T @ L^{-1} instead of B @ L^{-T}).
…e16x16
- Slice-based load/store: arr[r0:r1, c0:c1] for 2D and arr[b, r0:r1, c0:c1] for 3D
- qd.outer(a, b) deferred proxy for augmented assignment (t -= qd.outer(a, b))
- _TileSliceProxy, _VecSliceProxy, _TileRefProxy for deferred subscript evaluation
- AST purity exemptions for quadrants-internal code
- _quadrants_internal flag on Tile struct
The purity checker only flags int/float/Field attribute accesses inside kernels. Tile() and Tile.zeros() don't trigger this, and Tile.SIZE is only accessed in plain Python, not inside kernels.
Add tile16.md covering tile creation, slice load/store, rank-1 updates, Cholesky, and triangular solve. Remove the unused _quadrants_internal getattr escape hatch from the purity checker (file-level check suffices). Add test for Tile.SIZE access inside a kernel.
…top in slices
Update tile16 doc examples to use Tile.SIZE. Enforce that both start and stop indices are provided in tile/vec slice syntax (previously stop defaulted to start+16 when omitted). Reorder _TileSliceProxy.__init__ args to match (row_start, row_stop, col_start, col_stop).
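The stricter slice rule can be illustrated with a small stand-alone sketch. Everything here is hypothetical: require_full_slice and SliceProbe are illustrative names, and the exception type and message are assumptions rather than what the PR actually raises. The only details taken from the commit message are that both bounds must be present and that the result is ordered (row_start, row_stop, col_start, col_stop).

```python
def require_full_slice(s, axis_name):
    """Return (start, stop) for a slice, rejecting omitted bounds."""
    if not isinstance(s, slice):
        raise TypeError(f"{axis_name} index must be a slice, got {type(s).__name__}")
    if s.start is None or s.stop is None:
        # Assumed behavior: no implicit stop = start + 16 default any more.
        raise ValueError(
            f"{axis_name} slice must provide both start and stop, got {s.start}:{s.stop}"
        )
    return s.start, s.stop


class SliceProbe:
    """Stand-in for an array that only records the requested tile bounds."""

    def __getitem__(self, key):
        rows, cols = key
        row_start, row_stop = require_full_slice(rows, "row")
        col_start, col_stop = require_full_slice(cols, "col")
        return (row_start, row_stop, col_start, col_stop)


probe = SliceProbe()
bounds = probe[0:16, 32:48]
```

With this sketch, probe[0:, 0:16] raises instead of silently assuming a 16-wide window.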
The if-condition checking all four slice indices was too long for black. The test_tile16_load_missing_start_raises regex didn't match the updated error message after requiring both start and stop indices.
- test_tile16_outer_symmetric_same_variable: passes same variable for both args to qd.outer(v, v)
- test_tile16_vec_proxy_ger_sub_3d: column vector load from 3D array via arr[batch, r0:r1, col]
- test_outer_composition_raises: verifies qd.outer(a,b) + qd.outer(c,d) raises TypeError
…errors
Only intercept 2D/1D array slices for tile proxy creation when _tile16_cache is non-empty (i.e., _make_tile16x16 has been called). Non-tile programs now get the original "does not support slice" error. Add _DeferredProxyMixin with __add__, __sub__, __mul__, __getitem__, and __repr__ that raise clear TypeErrors explaining the proxy is only valid in tile operations.
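A mixin of this kind might look roughly like the sketch below. It is an illustration under assumptions, not the code from this PR: the class names DeferredProxyMixin and OuterProxy, the _reject helper, and the error wording are all hypothetical. The commit message is ambiguous about whether __repr__ raises or returns a description; this sketch returns a descriptive string.

```python
class DeferredProxyMixin:
    """Rejects ordinary use of a deferred proxy with an explanatory TypeError."""

    def _reject(self, what):
        raise TypeError(
            f"{type(self).__name__} is a deferred proxy and is only valid "
            f"inside tile operations; {what} is not supported"
        )

    def __add__(self, other):
        self._reject("'+'")

    def __sub__(self, other):
        self._reject("'-'")

    def __mul__(self, other):
        self._reject("'*'")

    def __getitem__(self, key):
        self._reject("subscripting")

    def __repr__(self):
        # Assumption: describe the proxy rather than raise.
        return f"<{type(self).__name__}: deferred, only valid in tile operations>"


class OuterProxy(DeferredProxyMixin):
    """Hypothetical stand-in for the object returned by qd.outer(a, b)."""

    def __init__(self, a, b):
        self.a, self.b = a, b


try:
    OuterProxy(1, 2) + OuterProxy(3, 4)
except TypeError as exc:
    message = str(exc)
```

Composing two proxies fails immediately with a message naming the proxy type, which is the behavior test_outer_composition_raises checks for qd.outer.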
Check isinstance(value, t) against cached tile types before creating a _TileRefProxy for struct[:] subscripts. Non-tile structs now fall through to normal subscript handling.
# Conflicts: # tests/python/test_tile16.py
# Conflicts: # docs/source/user_guide/tile16.md
This reverts commit 876a666.
Was hardcoded to _ATOLS[qd.f32] despite being parametrized over qd_dtype, making the f64 path 100,000x too lenient.
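The effect of the hardcoded tolerance can be seen with plain numbers. The sketch below assumes _ATOLS holds the values quoted in an earlier commit message (1e-5 for f32, 1e-10 for f64); the string keys, the ATOLS name and the allclose helper are illustrative stand-ins, not the test module's actual definitions.

```python
# Assumed tolerances, keyed by dtype name for illustration only.
ATOLS = {"f32": 1e-5, "f64": 1e-10}


def allclose(actual, expected, atol):
    """Element-wise absolute-tolerance comparison."""
    return all(abs(a - e) <= atol for a, e in zip(actual, expected))


# An f64 result that is off by about 1e-7: far too inaccurate for f64.
actual = [1.0 + 1e-7]
expected = [1.0]

passes_with_f32_atol = allclose(actual, expected, ATOLS["f32"])  # the bug
passes_with_f64_atol = allclose(actual, expected, ATOLS["f64"])  # the fix

# How much looser the f32 tolerance is than the f64 one.
leniency = ATOLS["f32"] / ATOLS["f64"]
```

With the f32 tolerance the inaccurate f64 result is accepted, while looking the tolerance up by the parametrized dtype rejects it.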
Opus: Done. All 6 SharedArray tests are now parametrized over qd_dtype with _QD_DTYPES (f32, f64). Each test now includes skip_if_f64_unsupported, uses _NP_DTYPES[qd_dtype] for numpy dtypes, _ATOLS/_EPS_VALS where applicable, and passes qd_dtype: qd.Template as a kernel
Syncs with main branch commit 62af01b ("[CI] Increase clock test tolerance to 20%"). The 10% threshold was flaky on CI GPU runners. Made-with: Cursor
Made-with: Cursor
…ure) Made-with: Cursor
Made-with: Cursor # Conflicts: # tests/python/test_tile16.py
  ratio = a[i] / a[0]
  expected = i + 1
- assert abs(ratio - expected) / expected < 0.1 # 10% tolerance
+ assert abs(ratio - expected) / expected < 0.2 # 20% tolerance
How do we determine this tolerance? Why are we changing it from 10% to 20% here?
This is not part of this PR.
The Manylinux wheel Test (3.12, ubuntu-22.04) hung for 6h and was cancelled. Previous run of the same commit range passed in ~10 min. Made-with: Cursor
erizmr left a comment
Approving on the basis that I have reviewed the design, the public-facing API, and the tests, and they look reasonable to me.
# Conflicts: # docs/source/user_guide/tile16.md # tests/python/test_tile16.py
Issue: #
Brief Summary
Summary
Adds Tile16x16 ↔ SharedArray interop: tiles can now load from and store to
qd.simt.block.SharedArray using the same slice syntax as device arrays.
What's in this PR
6 new SharedArray tests
- test_tile16_shared_array_roundtrip: field → tile → SharedArray → tile → field, verify data survives
- test_tile16_shared_array_partial_cols: partial-column load/store through SharedArray, parametrized over (partial_store, partial_load) combinations
- test_tile16_shared_array_cholesky: Cholesky factorization with L stored in SharedArray, verify reconstruction
- test_tile16_shared_array_clamp_store: store tile to SharedArray narrower than 16 cols, verify auto-clamping
- test_tile16_shared_array_clamp_load: load tile from SharedArray narrower than 16 cols, verify extra registers are zero
- test_tile16_vec_proxy_shared_array: symmetric rank-1 subtract via vec proxy loaded from SharedArray at non-zero row offset
Docs
- tile16.md with code example and clamping behavior notes
Good
- v = sh[K0:K1, col]) also works with SharedArray, enabling lookback patterns in blocked Cholesky
Bad
- test_tile16_vec_proxy_shared_array test manually copies data into SharedArray element-by-element with a loop, which is verbose but necessary since there's no bulk copy
API
copilot:summary
Walkthrough
copilot:walkthrough
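The clamping behavior exercised by the clamp_store and clamp_load tests listed above can be modeled in plain Python. This is only a model of the semantics described in the test list (columns beyond the SharedArray width are skipped on store and read back as zero on load); it is not the Quadrants implementation, and store_clamped, load_clamped and TILE are hypothetical names. The sketch assumes the shared buffer has 16 rows and clamps columns only.

```python
TILE = 16  # tile edge length, matching the 16x16 tile in this PR


def store_clamped(tile, shared):
    """Copy tile into shared, skipping columns that shared does not have."""
    n_cols = min(TILE, len(shared[0]))
    for r in range(TILE):
        for c in range(n_cols):
            shared[r][c] = tile[r][c]


def load_clamped(shared):
    """Build a 16x16 tile from shared; columns past its width stay zero."""
    n_cols = min(TILE, len(shared[0]))
    tile = [[0.0] * TILE for _ in range(TILE)]
    for r in range(TILE):
        for c in range(n_cols):
            tile[r][c] = shared[r][c]
    return tile


# A shared buffer that is only 10 columns wide.
narrow = [[-1.0] * 10 for _ in range(TILE)]
source = [[float(r * TILE + c) for c in range(TILE)] for r in range(TILE)]

store_clamped(source, narrow)
loaded = load_clamped(narrow)
```

After the round trip, the first 10 columns of loaded match source and the remaining 6 columns are zero.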