[lang] Add SNode.snode_tree_id #8697

Merged

feisuzhu merged 12 commits into taichi-dev:master from hughperkins:hp/add-snode-tree-id, Apr 30, 2025

Conversation

@hughperkins Contributor

Issue: #

Brief Summary

Add SNode.snode_tree_id

copilot:summary

Walkthrough

Add SNode.snode_tree_id

This will be useful for diagnosing crash bugs in debug mode.

copilot:walkthrough

@hughperkins Contributor Author

(note: open to calling this tree_id instead of snode_tree_id. No strong opinion either way, though I feel that tree_id might be mildly more pythonic?)

@feisuzhu feisuzhu requested a review from bobcao3 April 29, 2025 13:38
@feisuzhu Contributor

I'm not sure if it's appropriate to expose it as a public API....

@hughperkins Contributor Author

Ok. So, you mean, rename it with a _ prefix?

@hughperkins Contributor Author

(or perhaps, give it an _unsupported_ prefix?)

@feisuzhu Contributor

> Ok. So, you mean, rename it with a _ prefix?

I don't know too much about the internals; maybe SNode itself is considered an impl detail. If that's the case, adding a property without _ is OK.

Summoning the correct person @bobcao3

@hughperkins Contributor Author

Ok, prefixed with a _, and removed from test_api

@feisuzhu feisuzhu left a comment

LGTM!

@feisuzhu feisuzhu changed the title [Lang] Add SNode.snode_tree_id [lang] Add SNode.snode_tree_id Apr 30, 2025
@feisuzhu feisuzhu merged commit 562e05f into taichi-dev:master Apr 30, 2025
16 checks passed
@hughperkins Contributor Author

Awesome, thank you 🙌

@hughperkins hughperkins deleted the hp/add-snode-tree-id branch April 30, 2025 10:38
bobcao3 pushed a commit that referenced this pull request May 7, 2025
…llocations as dupes (#8705)

Issue: #

### Brief Summary

The unified allocator always gave the first two allocations duplicate memory addresses, so they clobbered each other.

copilot:summary

# New walkthrough

Looking at the unified allocator code, we can see that when we first allocate a memory chunk, we do not add `size` to `head`, so the next allocation receives exactly the same address. Two structs (or similar) then occupy the same memory and clobber each other, plausibly leading to a plethora of hard-to-debug crashes.

## High level overview of how allocator works

The allocator can work with two types of request:
- `exclusive`
- `not exclusive`

For exclusive requests:
- a buffer is allocated from the system:
-
https://github.com/taichi-dev/taichi/blob/562e05fa70ba196b200f03e4006c51bd0962341b/taichi/rhi/common/unified_allocator.cpp#L74-L75
- the size of the buffer matches the requested bytes (to within
alignment bytes)
- a new chunk is created
    - `chunk.data` is set to the start of this buffer
    - `chunk.head` is too, but it's not really used for exclusive access
- `chunk.tail` is set to the end of the buffer, but again not really
used for exclusive access

Exclusive access requests are thus fairly straightforward.

For non-exclusive requests, it is slightly more complex:
- for the first request, we allocate a much larger buffer than the request
    - by default 1GB
      https://github.com/taichi-dev/taichi/blob/562e05fa70ba196b200f03e4006c51bd0962341b/taichi/rhi/common/unified_allocator.cpp#L9-L10
- we create a new chunk
    - `chunk.data` is set to the start of this buffer
    - `chunk.head` is set to the start of the unused space in this buffer
        - it should be set to `chunk.data + size`
        - prior to this PR, it was incorrectly left pointing at `chunk.data`
        - meaning that the next request would incorrectly return the start of this chunk, again
    - then we return `chunk.head`
- for subsequent requests, we look for a chunk that has available space (`head + requested size <= tail`)
    - when we find such a chunk:
        - we add `size` to `head` (to within alignment)
        - we return the old `head` (to within alignment)
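As a minimal sketch of the behavior just described (in Python rather than the actual C++; `Chunk` and `allocate` are hypothetical names, not the real API), the bug and the fix differ by a single head bump on the new-chunk path:

```python
# Illustrative model of the non-exclusive allocation paths described above.
# Addresses are modeled as integer offsets from 0.

class Chunk:
    def __init__(self, size):
        self.data = 0
        self.head = 0            # start of unused space
        self.tail = size         # end of the buffer

def allocate(chunks, size, bug=False):
    # Reuse path: look for an existing chunk with room. This path always
    # advanced head correctly, before and after the fix.
    for c in chunks:
        ret = c.head
        if ret + size <= c.tail:
            c.head = ret + size
            return c, ret
    # New-chunk path: allocate a fresh 1 GB chunk.
    c = Chunk(1 << 30)
    chunks.append(c)
    ret = c.head
    if not bug:
        c.head += size           # the fix: reserve the space just handed out
    return c, ret

# With the bug, the first two allocations get the same address:
chunks = []
_, a = allocate(chunks, 64, bug=True)
_, b = allocate(chunks, 64, bug=True)
assert a == b == 0               # the two allocations clobber each other

# With the fix, addresses are distinct:
chunks = []
_, a = allocate(chunks, 64)
_, b = allocate(chunks, 64)
assert (a, b) == (0, 64)
```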

## Proposed fix

The proposed fix is to set `head` to `data + size` for newly allocated
chunks
- thinking about it, an alternative fix is to split the function into
two parts:
    - first part searches for an existing chunk, or makes a new one
         - does not return the allocated address
         - does not update head etc
    - second part is always executed
        - updates head
        - returns old head

I don't really have a strong opinion on which fix we prefer. The second approach seems mildly cleaner, perhaps, since it decouples 'finding/creating a chunk' from 'updating the chunk and returning the requested memory pointer'.
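A sketch of what that split could look like (illustrative Python with hypothetical names, not the actual C++):

```python
# Alternative fix sketch: part 1 finds or creates a chunk without touching
# head; part 2 always bumps head and returns the old head.

class Chunk:
    def __init__(self, size):
        self.data = 0
        self.head = 0
        self.tail = size

def find_or_create_chunk(chunks, size):
    # Part 1: locate a chunk with room, or make a new one.
    # Does not update head and does not return an address.
    for c in chunks:
        if c.head + size <= c.tail:
            return c
    c = Chunk(1 << 30)
    chunks.append(c)
    return c

def allocate(chunks, size):
    # Part 2: always executed; bumps head and returns the old head.
    c = find_or_create_chunk(chunks, size)
    ret = c.head
    c.head += size
    return ret

chunks = []
assert allocate(chunks, 64) == 0
assert allocate(chunks, 64) == 64    # never a duplicate, even for the first chunk
```

Because the head bump lives in exactly one place, the new-chunk path cannot forget it, which is how the original bug arose.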

## Low level details

In more detail, and assuming non-exclusive mode:
- let's say the client requests `size` bytes
- we allocate a chunk much larger than that, `default_allocator_size` bytes
    - https://github.com/hughperkins/taichi/blob/0c41277f7b4a597247ea23760336dcba5c7f7efc/taichi/rhi/common/unified_allocator.cpp#L65-L75
    - the address of this chunk is stored in `ptr`
- we create a `chunk` structure to store information about the chunk we just allocated
    - https://github.com/hughperkins/taichi/blob/0c41277f7b4a597247ea23760336dcba5c7f7efc/taichi/rhi/common/unified_allocator.cpp#L63
    - https://github.com/hughperkins/taichi/blob/0c41277f7b4a597247ea23760336dcba5c7f7efc/taichi/rhi/common/unified_allocator.cpp#L76-L79
    - `ptr` is stored in `chunk.data`
    - `head` is set to `ptr` too, via `chunk.data`
    - `tail` is set to `ptr` + allocation size, via `chunk.data`
- we return `ptr`
    - https://github.com/hughperkins/taichi/blob/0c41277f7b4a597247ea23760336dcba5c7f7efc/taichi/rhi/common/unified_allocator.cpp#L85
- we should have added the allocation size to `chunk.head`
- we can look at what happens when we re-use this chunk later, to confirm this:

When we re-use a chunk:
- we loop over all allocated chunks, looking for non-exclusive chunks
    - https://github.com/hughperkins/taichi/blob/0c41277f7b4a597247ea23760336dcba5c7f7efc/taichi/rhi/common/unified_allocator.cpp#L39-L45
- we add the allocation size to `head`, adjusting for alignment, store that in `ret`, and compare `ret` against `tail`
    - https://github.com/hughperkins/taichi/blob/0c41277f7b4a597247ea23760336dcba5c7f7efc/taichi/rhi/common/unified_allocator.cpp#L49-L53
- if `ret` is less than or equal to `tail`, then we:
    - update `head` to be equal to `ret` (so we've updated it to be old head + allocation size, adjusted for alignment)
    - return `ret`
    - (and break out of the loop, by virtue of the return)
- otherwise, we ignore this chunk and keep looping over the available chunks
- (if no suitable chunk is found, we allocate a brand new chunk)
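The re-use check just described can be sketched as follows (illustrative Python; the real code works on raw pointers in C++, and `try_reuse`/`align_up` are hypothetical names):

```python
# Model of the re-use path: bump head by the aligned request size, but only
# if the result still fits within the chunk's tail.

class Chunk:
    def __init__(self, size):
        self.head = 0
        self.tail = size

def align_up(x, alignment):
    # Round x up to the next multiple of alignment.
    return (x + alignment - 1) // alignment * alignment

def try_reuse(chunk, size, alignment):
    start = align_up(chunk.head, alignment)
    new_head = start + size
    if new_head <= chunk.tail:   # enough room left in this chunk
        chunk.head = new_head    # head becomes old head + size (aligned)
        return start             # return the aligned old head
    return None                  # caller keeps looping / allocates a new chunk

c = Chunk(256)
assert try_reuse(c, 100, 16) == 0      # first fit: [0, 100)
assert try_reuse(c, 100, 16) == 112    # 100 aligned up to 112: [112, 212)
assert try_reuse(c, 100, 16) is None   # 224 + 100 > 256, chunk is exhausted
```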

# Original Walkthrough

High level summary:
===============

- both the LLVMRuntime and the result_buffer are allocated at the same memory address
- this results in the error return code from running a kernel overwriting the address of snode tree 19
- this results in subsequent access to any field on snode tree 19 crashing Taichi

Reproducing the bug
===================

This bug was initially reproduced in
#8569 , but knowing what the
bug is, we can reproduce it using the following much simpler code:

```python
import taichi as ti

ti.init(arch=ti.arm64, debug=True)

fields = []
for i in range(20):
    fields.append(ti.field(float, shape=()))
    ti.sync()

@ti.kernel
def foo():
    fields[19][None] = 1

foo()
foo()
```

What this code does:
- allocates snode trees 0 through 19, by creating fields indexed 0 through 19, and immediately calling ti.sync, to materialize each snode tree
    - you can optionally print out the snode tree ids, as long as you have a version of master that includes the PR at #8697, to verify this assertion
- following the creation of snode trees 0 through 19, we call a kernel twice
    - the first kernel call runs without issue
    - however, the address of snode tree 19 is set to 0 following this call, because it is overwritten by the call's return code
    - the second kernel call then uses the address of snode tree 19, which is now 0, to access values from snode tree 19, causing a segmentation fault:

[E 04/30/25 19:00:30.022 3136495] Received signal 11 (Segmentation fault: 11)

Detailed walkthrough
====================

1. LLVMRuntime and result_buffer are allocated the same memory address

- When we first initialize the LLVMRuntime, we:
    - allocate a result_buffer from the unified allocator, via the host allocator
        - result_buffer allocated here https://github.com/taichi-dev/taichi/blob/562e05fa70ba196b200f03e4006c51bd0962341b/taichi/runtime/llvm/llvm_runtime_executor.cpp#L699-L700
    - call runtime_initialize
        - here https://github.com/taichi-dev/taichi/blob/562e05fa70ba196b200f03e4006c51bd0962341b/taichi/runtime/llvm/llvm_runtime_executor.cpp#L706-L711
        - passing in the result_buffer
        - and the host allocator
    - inside runtime_initialize, we:
        - allocate the LLVMRuntime, using the same allocator
            - here https://github.com/taichi-dev/taichi/blob/562e05fa70ba196b200f03e4006c51bd0962341b/taichi/runtime/llvm/runtime_module/runtime.cpp#L932-L933
    - interestingly, the address allocated for the LLVMRuntime memory is identical to the address of the result_buffer memory
        - verifiable by printing out the two addresses; over multiple runs, they consistently have the same address as each other (though the exact addresses vary between runs)
    - these are both allocated from the exact same allocator
        - if you print out the address of the allocator in each location, they are identical
        - and no deallocations take place between the allocations
    - so, how is this possible?
    - looking at the unified allocator, there is a concept of 'exclusive'
        - here https://github.com/taichi-dev/taichi/blob/562e05fa70ba196b200f03e4006c51bd0962341b/taichi/rhi/common/unified_allocator.cpp#L32
    - if a request for memory is not marked as exclusive, previously allocated buffers can be re-used, and allocated to new requests
        - here https://github.com/taichi-dev/taichi/blob/562e05fa70ba196b200f03e4006c51bd0962341b/taichi/rhi/common/unified_allocator.cpp#L57
    - the default is exclusive = false
        - here https://github.com/taichi-dev/taichi/blob/562e05fa70ba196b200f03e4006c51bd0962341b/taichi/rhi/common/unified_allocator.h#L31
    - therefore, by default, memory chunks allocated can be re-used/returned/allocated across multiple requests

Let's first walk through the effects of LLVMRuntime and result_buffer
occupying the same space.

2. The return code of a kernel overwrites snode tree address 19

- following a kernel launch, the method runtime_retrieve_and_reset_error_code is run in runtime.cpp
    - here https://github.com/taichi-dev/taichi/blob/562e05fa70ba196b200f03e4006c51bd0962341b/taichi/runtime/llvm/runtime_module/runtime.cpp#L727-L730
- this method calls `runtime->set_result(taichi_result_buffer_error_id, runtime->error_code);`
    - the first parameter is a constant
        - defined here https://github.com/taichi-dev/taichi/blob/562e05fa70ba196b200f03e4006c51bd0962341b/taichi/inc/constants.h#L21
        - `constexpr std::size_t taichi_max_num_ret_value = 30;`
- `set_result`:
    - is here https://github.com/taichi-dev/taichi/blob/562e05fa70ba196b200f03e4006c51bd0962341b/taichi/runtime/llvm/runtime_module/runtime.cpp#L600-L604
    - sets result_buffer[i] to t
        - here https://github.com/taichi-dev/taichi/blob/562e05fa70ba196b200f03e4006c51bd0962341b/taichi/runtime/llvm/runtime_module/runtime.cpp#L602
    - in this case, i is taichi_max_num_ret_value
        - which is 30
    - t is the return code
        - empirically this has a value of 0, in the test cases described above
    - i is used to index into an array of i64
        - here https://github.com/taichi-dev/taichi/blob/562e05fa70ba196b200f03e4006c51bd0962341b/taichi/runtime/llvm/runtime_module/runtime.cpp#L602
    - therefore each element of the array is 8 bytes
        - and therefore, to get the address of the element which will be set to 0, we should multiply the index, which is 30, by 8
        - thus, we will zero out 8 bytes at byte offset 30 * 8 = 240
    - the base address for this offset is result_buffer
        - however, result_buffer has the same address as LLVMRuntime
        - (as discussed in the first section)
    - so we are going to clobber 8 bytes in LLVMRuntime with zeros, at offset 240
    - let's now look at where byte offset 240 is in LLVMRuntime
    - LLVMRuntime struct:
        - is here https://github.com/taichi-dev/taichi/blob/562e05fa70ba196b200f03e4006c51bd0962341b/taichi/runtime/llvm/runtime_module/runtime.cpp#L552-L562
        - there are two PreallocatedMemoryChunks, each of which contains two pointers and a size_t
            - https://github.com/taichi-dev/taichi/blob/562e05fa70ba196b200f03e4006c51bd0962341b/taichi/runtime/llvm/runtime_module/runtime.cpp#L546-L550
            - each pointer is 8 bytes, and size_t is likely 8 bytes, for 24 bytes each
            - 48 bytes total
        - host_allocator_type is a pointer to function -> 8 more bytes
        - assert_failed_type, host_printf_type, host_vsnprintf_type, and Ptr are also all pointers, so 8 bytes each, for a total of 32 bytes
        - now we arrive at roots, which is the snode tree roots address array
            - at this point, we are at an offset of 48 + 8 + 32 = 88
            - so our offset into roots will be 240 - 88 = 152
            - each element of roots is also a pointer
            - so size 8 bytes
            - 152 bytes / 8 bytes = 19
- thus, when we write the return code of 0 to result_buffer[30], we clobber the address of snode tree 19 with 0
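The arithmetic above can be double-checked in a few lines (the sizes are the 64-bit assumptions stated in the walkthrough, not values measured from the real struct):

```python
# Verify the offset arithmetic from the walkthrough, using the assumed
# 64-bit sizes (8-byte pointers and size_t), not the actual LLVMRuntime layout.

PTR = 8                                   # assumed bytes per pointer
SIZE_T = 8                                # assumed size of size_t

chunk = 2 * PTR + SIZE_T                  # PreallocatedMemoryChunk: 2 ptrs + size_t
prefix = 2 * chunk                        # two chunks -> 48 bytes
prefix += PTR                             # host_allocator_type (function pointer)
prefix += 4 * PTR                         # assert_failed/host_printf/host_vsnprintf/Ptr
assert prefix == 88                       # offset of the roots array

write_offset = 30 * 8                     # result_buffer[30], 8 bytes per i64
assert write_offset == 240

clobbered_root = (write_offset - prefix) // PTR
assert clobbered_root == 19               # snode tree 19's root pointer is zeroed
```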

3. kernel access of snode tree 19

- when a kernel is initialized, and that kernel uses a field that is allocated on snode tree 19:
    - the lowered kernel calls `%10 = call ptr @LLVMRuntime_get_roots(ptr %9, i32 19)` (the exact statement index varies depending on the kernel, of course)
        - this will return 0
    - then, when we access a field based on offset 0, we crash.

~~Proposed fix~~
============

~~We need to ensure that the allocator does not allocate the same memory block twice, both to the LLVMRuntime and to the result_buffer~~
- ~~my proposed fix is to expose the `exclusive` parameter to the LLVMRuntime~~
- ~~and to set this parameter to `true`, when used from the runtime~~

~~Questions~~
=========

~~A question in my mind is why we would ever want exclusive to not be true. And by default, it is in fact set to false. I feel like there is some knowledge or insight that is missing to me.~~

copilot:walkthrough