From 876a553f310a47979cc1ea785b5861286626346e Mon Sep 17 00:00:00 2001 From: Youwei Xiao Date: Thu, 2 Apr 2026 03:06:20 +0800 Subject: [PATCH 01/35] Add: document manual dependency scope design - Capture the hybrid scoped model for tensormap_and_ringbuffer - Define same-scope explicit edges versus cross-scope TensorMap behavior - Record ownership, scope, nesting, and testing constraints before implementation --- docs/manual-dep-for-tensormap-design.md | 435 ++++++++++++++++++++++++ 1 file changed, 435 insertions(+) create mode 100644 docs/manual-dep-for-tensormap-design.md diff --git a/docs/manual-dep-for-tensormap-design.md b/docs/manual-dep-for-tensormap-design.md new file mode 100644 index 000000000..4ef98d214 --- /dev/null +++ b/docs/manual-dep-for-tensormap-design.md @@ -0,0 +1,435 @@ +# Manual Dependency For TensorMap Runtime + +## Goal + +Bring the human-created dependency workflow from `aicpu_build_graph` into `tensormap_and_ringbuffer` in a scoped way: + +- `PTO_SCOPE(manual_dep=1) { ... }` +- Tensors crossing scope boundaries use TensorMap semantics +- Tensors used entirely inside the manual scope use explicit `add_dependency` + +This is not a port of `aicpu_build_graph`'s fully-explicit runtime model. The target is a hybrid model inside `tensormap_and_ringbuffer`: + +- same-scope dependency tracking: explicit +- cross-scope dependency tracking: TensorMap +- scope-local lifetime management: unchanged ring/scope ownership model + +## Confirmed Decisions + +These decisions are already aligned with the requested direction: + +1. `tensormap` scope may contain a manual scope. +2. Manual scope may not contain another manual scope. +3. The design must not simplify away multi-write cases. +4. For an outer-scope tensor written inside a manual scope, readiness is the writer task completion time, not `scope_end`. +5. Therefore, a task inside a manual scope that writes an outer-scope tensor must still publish that tensor to TensorMap. 
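The decisions above can be illustrated with a short pseudocode usage sketch. `PTO_SCOPE` and `add_dependency` follow the names used in this document, but the `submit` call, the kernel names, and the tensor names are assumptions for illustration, not the real orchestration API:

```cpp
// Pseudocode only -- illustrative names, not the real orchestration API.
Tensor c = /* outer-scope tensor, created before the manual scope */;

PTO_SCOPE(manual_dep = 1) {
    auto t1 = submit(kernel_a, /*output=*/tmp);             // tmp: scope-local temp
    auto t2 = submit(kernel_b, /*input=*/tmp, /*inout=*/c); // writes outer-scope c
    add_dependency(t1, t2);  // same-scope edge: explicit, per the core rule
    // t2 writes outer-scope `c`, so it still publishes `c` to TensorMap
    // (decision 5); outside readers become ready when t2 completes,
    // not at scope_end (decision 4).
}

submit(kernel_c, /*input=*/c);  // depends on t2 via TensorMap, not on scope closure
```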
+ +## Non-Goals + +- Do not replace `tensormap_and_ringbuffer` with a fully explicit runtime. +- Do not require explicit export/import APIs at scope boundaries. +- Do not constrain v1 to single-writer exported tensors. +- Do not change the existing rule that inner-scope temporary tensors do not outlive their owning scope unless already represented by an outer-scope tensor. + +## Current Runtime Behavior Relevant To This Design + +## Scope lifetime + +In `tensormap_and_ringbuffer`, each submitted task starts with one scope-held fanout reference. On `scope_end`, the scheduler releases that reference. When fanout is otherwise exhausted, the task can become `CONSUMED` and its slot/buffer can be reclaimed. + +This means: + +- outer-scope tensors may flow into inner scopes +- inner-scope temporaries are scope-local by default +- `scope_end` affects lifetime ownership, not semantic readiness of a cross-scope tensor write + +## Current dependency model + +Today the runtime derives dependencies in `pto_orchestrator.cpp` using: + +- creator retention through `owner_task_id` +- modifier lookup through TensorMap overlap search +- TensorMap insert for `INOUT` and `OUTPUT_EXISTING` + +There is already a `Tensor::manual_dep` bit, but in current code it is effectively a per-tensor escape hatch that skips TensorMap lookup/insert. That is not sufficient for scoped hybrid semantics because the scope, not the tensor alone, decides whether a use is same-scope or cross-scope. + +## Problem Statement + +If we simply copy `aicpu_build_graph` semantics into `tensormap_and_ringbuffer`, we get a wrong boundary model: + +- suppressing TensorMap for all tensors inside `PTO_SCOPE(manual_dep=1)` is incorrect +- delaying publication of an outer tensor until `scope_end` is incorrect + +The reason is that cross-scope tensors must become visible at the actual writer frontier. Outside consumers should depend on the task that really produced the latest visible state, not on scope closure. 
+ +So the correct split is: + +- same-scope tensor relations inside the manual scope: explicit edges only +- cross-scope tensor relations: preserve TensorMap behavior + +## Required Semantics + +## Core rule + +`PTO_SCOPE(manual_dep=1)` means: + +- if both producer and consumer are inside this manual scope, the dependency must be established by explicit `add_dependency` +- if a tensor use crosses the scope boundary, dependency tracking still uses TensorMap/owner metadata + +This rule applies per tensor use site, not as a global on/off switch for the whole submit. + +## Tensor categories + +For a task submitted inside a manual scope, every tensor argument falls into one of these categories: + +1. Outer-scope tensor, read only +2. Outer-scope tensor, written in place +3. Tensor created inside this manual scope, used again inside this manual scope +4. Tensor created inside this manual scope, then used through an outer-scope tensor alias/view +5. External tensor with no owner task + +The runtime must classify behavior from ownership and current scope, not only from argument tag. + +## Expected behavior by category + +### 1. Outer-scope tensor, read only + +- The first internal consumer still needs dependency seeding from existing producer state. +- This must still use creator retention and TensorMap lookup as appropriate. +- Manual scope does not remove the need to wait for the outer producer frontier. + +### 2. Outer-scope tensor, written in place + +- The internal writer must still publish to TensorMap. +- Readiness of the written tensor is the completion of that writer task. +- Multiple writes inside the same manual scope are allowed. +- TensorMap should continue tracking the latest producer frontier exactly as in normal scope. + +### 3. Tensor created inside this manual scope and reused only inside this manual scope + +- No TensorMap lookup/insert. +- No automatic same-scope dependency derivation. 
+- Orchestration must call `add_dependency` explicitly for correctness. + +### 4. Tensor created inside this manual scope, then used through an outer-scope alias/view + +This case must be handled by ownership classification, not by raw pointer equality. + +If the tensor instance still belongs to the manual scope, it remains same-scope and should stay explicit. + +If orchestration is mutating an outer-scope tensor through a view that inherits the outer owner/scope identity, that is cross-scope and should keep TensorMap behavior. + +### 5. External tensor with no owner task + +- There is no creator dependency. +- Reads need no dependency unless TensorMap contains a producer entry. +- Writes to such a tensor should still publish to TensorMap if the tensor is cross-scope visible. + +## Recommended API Shape + +## Orchestration API + +Add explicit edge wiring to `tensormap_and_ringbuffer` orchestration API, mirroring `aicpu_build_graph`: + +```cpp +void pto2_rt_add_dependency(PTO2TaskId producer, PTO2TaskId consumer); +``` + +Add scoped manual mode: + +```cpp +PTO_SCOPE(manual_dep = 1) { + ... +} +``` + +For C++ implementation, this should compile down to a guard with scope metadata, not a dynamic stringly API. + +## Runtime API + +Add runtime ops support: + +```cpp +void (*add_dependency)(PTO2Runtime* rt, PTO2TaskId producer, PTO2TaskId consumer); +``` + +Add manual-scope entry/exit plumbing by extending the existing scope API with a mode flag: + +```cpp +void pto2_rt_scope_begin(PTO2Runtime* rt, bool manual_dep); +``` + +Recommendation: extend scope state with a mode flag and keep one scope stack. Avoid separate manual/non-manual stacks and avoid introducing a second scope API family. 
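A minimal sketch of the recommended shape: one scope stack whose frames carry a mode flag and a unique id, with a fail-fast rejection of manual-in-manual nesting. All names here are illustrative model names, not the real runtime state:

```cpp
#include <cstdint>
#include <stdexcept>
#include <vector>

// Toy model of the recommendation: a single scope stack, a per-frame
// manual flag, and a unique scope id assigned on every scope_begin.
struct ScopeFrame {
    uint64_t scope_id;
    bool manual_dep;
};

class ScopeStack {
public:
    uint64_t begin(bool manual_dep) {
        // Fail fast if a manual scope is opened under any manual ancestor.
        if (manual_dep && manual_depth_ > 0) {
            throw std::runtime_error(
                "manual scope inside manual scope is not supported");
        }
        manual_depth_ += manual_dep ? 1 : 0;
        frames_.push_back({next_scope_id_++, manual_dep});
        return frames_.back().scope_id;
    }
    void end() {
        manual_depth_ -= frames_.back().manual_dep ? 1 : 0;
        frames_.pop_back();
    }
    bool in_manual_scope() const {
        return !frames_.empty() && frames_.back().manual_dep;
    }
    uint64_t current_scope_id() const { return frames_.back().scope_id; }

private:
    std::vector<ScopeFrame> frames_;
    uint64_t next_scope_id_ = 1;
    int manual_depth_ = 0;
};
```

Keeping one stack means the mode check and the current scope id are a single lookup at submit time, which is why a second scope API family buys nothing.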
+ +## Internal Design + +## Scope state + +Each scope frame needs: + +- `begin_index` into `scope_tasks` +- scope mode: normal or manual +- unique scope id + +The unique scope id is required because same-scope vs cross-scope classification must be relative to the current manual scope, not only to nested depth. + +Recommendation: + +- assign `scope_id` on every `scope_begin` +- store current producing `scope_id` on runtime-created tensors +- views inherit the source tensor's producing `scope_id` + +## Tensor metadata + +Current `Tensor` already stores: + +- `owner_task_id` +- `manual_dep` + +For this design, the critical missing concept is producing scope identity. We need enough metadata to answer: + +- was this tensor produced in the current manual scope? +- or is it owned by an outer scope and therefore boundary-visible? + +Recommendation: + +- add `owner_scope_id` to `Tensor` +- initialize runtime-created outputs with the current scope id +- inherit `owner_scope_id` through `view`, `reshape`, and `transpose` + +`manual_dep` should no longer be the primary mechanism for scope semantics. It may remain as a per-tensor override, but the scoped design should be driven by: + +- current scope mode +- tensor owner scope id +- tensor owner task id + +## Submit-time classification + +In `pto_orchestrator.cpp`, dependency behavior should be classified per tensor argument. 
+ +Pseudo-rule for a task submitted inside a manual scope: + +```cpp +same_scope_tensor = + tensor.owner_task_id.is_valid() && + tensor.owner_scope_id == current_manual_scope_id; + +if (!in_manual_scope) { + use existing tensormap behavior; +} else if (same_scope_tensor) { + skip TensorMap lookup/insert; + rely on explicit add_dependency; +} else { + use cross-scope TensorMap/owner behavior; +} +``` + +This should be applied separately to: + +- creator retention +- modifier lookup +- TensorMap insertion for writes + +Important nuance: + +- same-scope tensors should still retain creator lifetime through explicit dependencies, not through automatic creator retention +- cross-scope tensors should still retain creator lifetime automatically + +## Explicit edge wiring + +`pto2_add_dependency` from `aicpu_build_graph` can be reused conceptually, but the implementation must match `tensormap_and_ringbuffer` scheduler semantics: + +- increment consumer `fanin_count` +- record producer in consumer payload for release traversal +- wire producer fanout list under `fanout_lock` +- handle early-completed producer case + +No scope-end batch publish behavior should be imported. `tensormap_and_ringbuffer` tasks are already submit-visible before scope end, and changing that would be a separate design. + +## Scope-end behavior + +Manual scope does not change lifetime release semantics: + +- `scope_end` still releases the owning-scope fanout reference +- `scope_end` is not a publication barrier for cross-scope tensors +- cross-scope visibility must already reflect task completion frontier + +This is the main semantic difference from `aicpu_build_graph`. + +## Multiple Writes To Outer Tensors + +This case must be supported in v1. 
+ +Example: + +```cpp +PTO_SCOPE(manual_dep=1) { + t1 writes outer C + t2 writes outer C + add_dependency(t1, t2) +} +outside task reads C +``` + +Correct behavior: + +- `t1` publishes `C` to TensorMap +- `t2` publishes `C` again to TensorMap +- outside reader should see `t2` as the latest producer frontier +- because `t1 -> t2` is explicit, `t2` completion is a valid readiness frontier for the final visible state + +Potential invalid user pattern: + +```cpp +PTO_SCOPE(manual_dep=1) { + t1 writes outer C + t2 also writes outer C + // missing add_dependency(t1, t2) +} +``` + +This is a user error. The runtime should not try to reconstruct same-scope writer ordering automatically in manual mode. + +## Reads Of Outer Tensors Inside Manual Scope + +Outer tensors read inside manual scope must still seed internal dependencies from existing producer state. + +Otherwise: + +- a task inside manual scope may run before the outer producer of its input +- explicit edges inside the scope are insufficient to protect the outer-to-inner boundary + +So manual mode disables only same-scope auto-derivation, not boundary seeding. + +## Nesting Rules + +Supported: + +- normal scope contains manual scope +- normal scope contains normal scope + +Not supported in v1: + +- manual scope contains manual scope + +Reason: + +- the same-scope vs cross-scope rule is already relative to the current manual frame +- nested manual scopes add little value and complicate classification and diagnostics + +Recommendation: + +- detect this at `scope_begin` +- fail fast with a clear orchestrator error + +## Diagnostics + +The runtime should detect and report: + +1. nested manual scope not supported +2. `add_dependency` used with invalid task ids +3. dependency overflow from explicit wiring +4. 
obvious cross-scope/manual mismatch where possible + +Nice-to-have diagnostics: + +- count of explicit edges added in manual scope +- count of cross-scope TensorMap lookups/inserts preserved inside manual scope + +These are not required for correctness, but will make profiling and debugging practical. + +## Testing Strategy + +Add focused coverage before broad workload migration. + +### Unit-style runtime cases + +1. Manual scope diamond on scope-local outputs +- all same-scope edges explicit +- no TensorMap dependence required + +2. Manual scope reads outer tensor +- internal first task waits on outer producer frontier + +3. Manual scope writes outer tensor once +- outside consumer waits on inner writer completion, not `scope_end` + +4. Manual scope writes outer tensor multiple times +- latest writer becomes TensorMap frontier +- correctness depends on explicit same-scope edge wiring + +5. Normal scope containing manual scope +- outer to inner and inner to outer boundary cases both work + +6. Nested manual scope +- rejected with deterministic error + +### Example-level migration + +Use a small example first, such as vector-style or BGEMM-style, to validate: + +- scope-local temp tensors use explicit edges +- outer tensors still behave through TensorMap + +Only then move to more complex orchestration such as paged attention. + +## Main Risks + +1. Treating manual scope as a global TensorMap disable switch. +- This breaks cross-scope correctness. + +2. Using `Tensor::manual_dep` as the only signal. +- Scope semantics are relational and need owner scope identity. + +3. Letting cross-scope writes publish only at `scope_end`. +- This delays readiness incorrectly. + +4. Accidentally preserving creator retention for same-scope tensors in manual mode. +- This reintroduces hidden dependencies and weakens the mental model. + +5. Missing alias/view inheritance of scope ownership. +- This causes wrong same-scope vs cross-scope classification. 
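Risks 2 and 5 both come down to classification, so a toy sketch of the intended rule is useful: tensors carry owner task and owner scope provenance, views inherit both, and a use is same-scope only relative to the current manual scope. The names below are illustrative; the real `Tensor` layout differs:

```cpp
#include <cstdint>

// Toy model of submit-time classification. A tensor is same-scope only if
// it has an owner task and was produced in the current manual scope.
struct ToyTensor {
    int64_t owner_task_id;    // -1 => external tensor, no owner task
    uint64_t owner_scope_id;  // scope that produced this tensor
};

// Views/reshapes inherit provenance; raw pointer identity is irrelevant.
ToyTensor make_view(const ToyTensor& src) {
    return ToyTensor{src.owner_task_id, src.owner_scope_id};
}

enum class DepPath { ExplicitOnly, TensorMap };

DepPath classify(const ToyTensor& t, bool in_manual_scope,
                 uint64_t current_scope_id) {
    const bool same_scope =
        t.owner_task_id >= 0 && t.owner_scope_id == current_scope_id;
    if (in_manual_scope && same_scope) {
        return DepPath::ExplicitOnly;  // rely on explicit add_dependency
    }
    return DepPath::TensorMap;  // cross-scope, external, or normal scope
}
```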
+ +## Recommended Implementation Order + +1. Add API surface for `add_dependency` and manual scope mode. +2. Add scope-frame mode and `scope_id`. +3. Add tensor ownership metadata needed for classification. +4. Implement explicit edge wiring in tensormap runtime. +5. Refactor submit-time dependency logic to branch on: + - current scope mode + - tensor owner scope id + - tensor owner task id +6. Add fail-fast nested-manual-scope check. +7. Add targeted tests for boundary semantics. +8. Migrate one example and validate. + +## Open Question Resolved + +This design intentionally resolves the central ambiguity: + +- `scope_end` controls lifetime release +- task completion controls semantic readiness + +For outer tensors written inside manual scope, TensorMap publication must stay aligned with task completion frontier, not with scope closure. + +## File Areas Expected To Change + +- `src/a2a3/runtime/tensormap_and_ringbuffer/orchestration/pto_orchestration_api.h` +- `src/a2a3/runtime/tensormap_and_ringbuffer/runtime/pto_runtime2.h` +- `src/a2a3/runtime/tensormap_and_ringbuffer/runtime/pto_runtime2.cpp` +- `src/a2a3/runtime/tensormap_and_ringbuffer/runtime/pto_orchestrator.cpp` +- `src/a2a3/runtime/tensormap_and_ringbuffer/runtime/tensor.h` +- `src/a2a3/runtime/tensormap_and_ringbuffer/runtime/pto_runtime2_types.h` +- docs and examples/tests needed to demonstrate the new scoped behavior + +## Recommendation Summary + +Implement manual dependency as a scope-local override inside `tensormap_and_ringbuffer`, not as a runtime-wide replacement of TensorMap: + +- same manual scope: explicit `add_dependency` +- crossing the manual scope boundary: TensorMap +- write visibility: writer completion +- lifetime release: `scope_end` + +That is the smallest design that satisfies the requested model without breaking the core tensormap runtime semantics. 
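The explicit edge wiring rules above (increment consumer fanin, record the edge on the producer, handle the early-completed producer) can be sketched as a toy fanin/fanout model. `ToyTask`, `add_dependency`, and `complete` are assumed names for illustration, not the real scheduler code:

```cpp
#include <vector>

// Toy model of explicit edge wiring against a fanin-counting scheduler.
struct ToyTask {
    int fanin_count = 0;           // pending producers; ready at zero
    bool completed = false;
    std::vector<ToyTask*> fanout;  // consumers to release on completion
};

void add_dependency(ToyTask& producer, ToyTask& consumer) {
    if (producer.completed) {
        return;  // early-completed producer: nothing left to wait for
    }
    consumer.fanin_count += 1;
    producer.fanout.push_back(&consumer);
}

void complete(ToyTask& task) {
    task.completed = true;
    for (ToyTask* c : task.fanout) {
        c->fanin_count -= 1;  // consumer becomes ready at fanin_count == 0
    }
}
```

Once wired this way, explicit edges and TensorMap-derived edges look identical to the scheduler, which is what keeps the hybrid model small.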
From 2e0ee411ec38871324e50cacdfbb124562caa0a3 Mon Sep 17 00:00:00 2001 From: Youwei Xiao Date: Thu, 2 Apr 2026 18:06:46 +0800 Subject: [PATCH 02/35] Update: tighten manual dependency design constraints - Force outer-scope reads in manual scope through TensorMap boundary seeding - Remove the invalid inner-created outer-alias case and keep Tensor layout unchanged - Add explicit scope, tooling, and narrow-change requirements for the implementation PR --- docs/manual-dep-for-tensormap-design.md | 108 +++++++++++++++++++----- 1 file changed, 86 insertions(+), 22 deletions(-) diff --git a/docs/manual-dep-for-tensormap-design.md b/docs/manual-dep-for-tensormap-design.md index 4ef98d214..759a0847b 100644 --- a/docs/manual-dep-for-tensormap-design.md +++ b/docs/manual-dep-for-tensormap-design.md @@ -23,6 +23,44 @@ These decisions are already aligned with the requested direction: 3. The design must not simplify away multi-write cases. 4. For an outer-scope tensor written inside a manual scope, readiness is the writer task completion time, not `scope_end`. 5. Therefore, a task inside a manual scope that writes an outer-scope tensor must still publish that tensor to TensorMap. +6. For an outer-scope tensor read inside a manual scope, the dependency must still be forced by TensorMap/owner-based boundary seeding. + +## Change Control Requirements + +The implementation PR must follow these rules: + +- Keep the change strictly scoped to manual dependency support in `tensormap_and_ringbuffer`. +- Do not refactor unrelated runtime behavior while doing this work. +- Do not change existing normal-scope TensorMap semantics. +- Do not change scope lifetime semantics. +- Prefer the smallest invasive write set that cleanly supports the feature. +- Preserve existing examples/tests unless a targeted update is required to cover the new feature. +- Any behavior change outside manual-scope execution must be treated as a regression. 
+ +## Repository Rule Requirements + +The implementation must carefully follow the repository's coding rules and conventions: + +- obey `CLAUDE.md` directory ownership and workflow rules +- obey `.claude/rules/architecture.md` +- obey `.claude/rules/codestyle.md` +- keep platform-isolation preprocessor ordering consistent with repo rules +- avoid comment styles that encode plan phases or temporary implementation notes +- preserve current behavior unless this spec explicitly requires otherwise +- avoid adding new tensor metadata unless it is strictly necessary for correctness +- prefer provenance on task-side state over changing hot-path `Tensor` layout + +## Tooling Requirements + +The implementation and follow-up PRs must also respect the current repository tooling state: + +- PR #424 has already aligned C and C++ sources with `clang-format`. +- Local development should use `clang-format` `v21.1.0`, matching `.pre-commit-config.yaml`. +- Developers should configure local save-time auto-formatting with that exact `clang-format` version to avoid unnecessary AI-driven formatting churn. +- The feature PR should not include unrelated bulk reformatting. +- `.clang-tidy` is now part of the repository toolchain, but many checks are still intentionally disabled in the config file. +- This feature PR must satisfy the currently active `clang-tidy` expectations for touched code. +- Gradually enabling additional `clang-tidy` checks and fixing old violations is a separate ongoing stream of work, not something this feature should broaden into unless directly required for touched code. ## Non-Goals @@ -85,7 +123,7 @@ For a task submitted inside a manual scope, every tensor argument falls into one 1. Outer-scope tensor, read only 2. Outer-scope tensor, written in place 3. Tensor created inside this manual scope, used again inside this manual scope -4. Tensor created inside this manual scope, then used through an outer-scope tensor alias/view +4. 
Outer-scope tensor accessed through a derived view/reshape/transpose inside the manual scope 5. External tensor with no owner task The runtime must classify behavior from ownership and current scope, not only from argument tag. @@ -94,9 +132,10 @@ The runtime must classify behavior from ownership and current scope, not only fr ### 1. Outer-scope tensor, read only -- The first internal consumer still needs dependency seeding from existing producer state. -- This must still use creator retention and TensorMap lookup as appropriate. +- The first internal consumer must still get its dependency from TensorMap/owner-based boundary seeding. +- This is not optional and must not be delegated to explicit manual edges inside the scope. - Manual scope does not remove the need to wait for the outer producer frontier. +- In other words, outer-read boundary correctness is still forced by TensorMap-side logic. ### 2. Outer-scope tensor, written in place @@ -111,13 +150,15 @@ The runtime must classify behavior from ownership and current scope, not only fr - No automatic same-scope dependency derivation. - Orchestration must call `add_dependency` explicitly for correctness. -### 4. Tensor created inside this manual scope, then used through an outer-scope alias/view +### 4. Outer-scope tensor accessed through a derived view/reshape/transpose inside the manual scope + +This is the real aliasing case that matters for the design. It must be handled by ownership classification, not by raw pointer equality. -This case must be handled by ownership classification, not by raw pointer equality. +An outer-scope tensor may be sliced or reshaped inside the manual scope, but it is still outer-scope. -If the tensor instance still belongs to the manual scope, it remains same-scope and should stay explicit. +If orchestration is reading or mutating an outer-scope tensor through a derived view that inherits the outer owner/scope identity, that is still cross-scope and should keep TensorMap behavior. 
-If orchestration is mutating an outer-scope tensor through a view that inherits the outer owner/scope identity, that is cross-scope and should keep TensorMap behavior. +A tensor created inside the manual scope should not later become an outer-scope alias. That would violate the existing scope lifetime model rather than define a supported boundary case. ### 5. External tensor with no owner task @@ -176,8 +217,8 @@ The unique scope id is required because same-scope vs cross-scope classification Recommendation: - assign `scope_id` on every `scope_begin` -- store current producing `scope_id` on runtime-created tensors -- views inherit the source tensor's producing `scope_id` +- store current producing `scope_id` on task-side provenance +- use `owner_task_id` on `Tensor` to reach producing task provenance ## Tensor metadata @@ -186,22 +227,27 @@ Current `Tensor` already stores: - `owner_task_id` - `manual_dep` -For this design, the critical missing concept is producing scope identity. We need enough metadata to answer: +Recommendation: do not add new tensor metadata in v1. + +The critical missing concept is producing scope identity, but that provenance should live on the producer task side if possible, not on `Tensor`. + +We need enough information to answer: - was this tensor produced in the current manual scope? - or is it owned by an outer scope and therefore boundary-visible? -Recommendation: +Preferred approach: -- add `owner_scope_id` to `Tensor` -- initialize runtime-created outputs with the current scope id -- inherit `owner_scope_id` through `view`, `reshape`, and `transpose` +- keep `Tensor` layout unchanged +- use `tensor.owner_task_id` as the provenance pointer +- record `owner_scope_id` on producer task-side metadata such as task descriptor or scheduler/orchestrator slot state +- classify same-scope versus cross-scope through `owner_task_id -> producer provenance -> scope_id` `manual_dep` should no longer be the primary mechanism for scope semantics. 
It may remain as a per-tensor override, but the scoped design should be driven by: - current scope mode -- tensor owner scope id - tensor owner task id +- producer task scope provenance ## Submit-time classification @@ -212,7 +258,7 @@ Pseudo-rule for a task submitted inside a manual scope: ```cpp same_scope_tensor = tensor.owner_task_id.is_valid() && - tensor.owner_scope_id == current_manual_scope_id; + producer_scope_id(tensor.owner_task_id) == current_manual_scope_id; if (!in_manual_scope) { use existing tensormap behavior; @@ -234,6 +280,7 @@ Important nuance: - same-scope tensors should still retain creator lifetime through explicit dependencies, not through automatic creator retention - cross-scope tensors should still retain creator lifetime automatically +- cross-scope outer reads must still execute the existing TensorMap/owner dependency path even when the current scope is manual ## Explicit edge wiring @@ -292,7 +339,7 @@ This is a user error. The runtime should not try to reconstruct same-scope write ## Reads Of Outer Tensors Inside Manual Scope -Outer tensors read inside manual scope must still seed internal dependencies from existing producer state. +Outer tensors read inside manual scope must still seed internal dependencies from existing producer state through TensorMap/owner logic. Otherwise: @@ -301,6 +348,11 @@ Otherwise: So manual mode disables only same-scope auto-derivation, not boundary seeding. 
+This is a strict requirement: + +- outer read boundary dependency is forced by TensorMap/owner metadata +- orchestration code inside the manual scope must not be required to recreate that outer dependency manually + ## Nesting Rules Supported: @@ -322,11 +374,17 @@ Recommendation: - detect this at `scope_begin` - fail fast with a clear orchestrator error +Required error text quality: + +- the message must explicitly say that `manual scope inside manual scope is not supported` +- the message must identify the offending operation as nested `PTO_SCOPE(manual_dep=1)` +- the message must not use vague wording such as only `invalid scope state` + ## Diagnostics The runtime should detect and report: -1. nested manual scope not supported +1. nested manual scope not supported, with an explicit error message 2. `add_dependency` used with invalid task ids 3. dependency overflow from explicit wiring 4. obvious cross-scope/manual mismatch where possible @@ -381,15 +439,21 @@ Only then move to more complex orchestration such as paged attention. 2. Using `Tensor::manual_dep` as the only signal. - Scope semantics are relational and need owner scope identity. -3. Letting cross-scope writes publish only at `scope_end`. +3. Failing to force outer-scope reads through TensorMap/owner dependency seeding. +- This allows manual-scope tasks to read before the outer producer frontier is ready. + +4. Letting cross-scope writes publish only at `scope_end`. - This delays readiness incorrectly. -4. Accidentally preserving creator retention for same-scope tensors in manual mode. +5. Accidentally preserving creator retention for same-scope tensors in manual mode. - This reintroduces hidden dependencies and weakens the mental model. -5. Missing alias/view inheritance of scope ownership. +6. Missing alias/view inheritance of scope ownership. - This causes wrong same-scope vs cross-scope classification. +7. Turning this feature into a broad runtime refactor. 
+- This increases regression risk and violates the required change scope. + ## Recommended Implementation Order 1. Add API surface for `add_dependency` and manual scope mode. @@ -398,8 +462,8 @@ Only then move to more complex orchestration such as paged attention. 4. Implement explicit edge wiring in tensormap runtime. 5. Refactor submit-time dependency logic to branch on: - current scope mode - - tensor owner scope id - tensor owner task id + - producer task scope provenance 6. Add fail-fast nested-manual-scope check. 7. Add targeted tests for boundary semantics. 8. Migrate one example and validate. From 23a1fe2e1997e05a46bbe510f6314bdf774be77c Mon Sep 17 00:00:00 2001 From: Youwei Xiao Date: Fri, 3 Apr 2026 15:21:46 +0800 Subject: [PATCH 03/35] docs: refine manual tensormap dependency design --- docs/manual-dep-for-tensormap-design.md | 726 +++++++++++++++++++++--- 1 file changed, 634 insertions(+), 92 deletions(-) diff --git a/docs/manual-dep-for-tensormap-design.md b/docs/manual-dep-for-tensormap-design.md index 759a0847b..ae42f6f4c 100644 --- a/docs/manual-dep-for-tensormap-design.md +++ b/docs/manual-dep-for-tensormap-design.md @@ -4,7 +4,7 @@ Bring the human-created dependency workflow from `aicpu_build_graph` into `tensormap_and_ringbuffer` in a scoped way: -- `PTO_SCOPE(manual_dep=1) { ... }` +- `PTO2_SCOPE(true) { ... }` - Tensors crossing scope boundaries use TensorMap semantics - Tensors used entirely inside the manual scope use explicit `add_dependency` @@ -14,6 +14,24 @@ This is not a port of `aicpu_build_graph`'s fully-explicit runtime model. 
The ta - cross-scope dependency tracking: TensorMap - scope-local lifetime management: unchanged ring/scope ownership model +## Code-Checked Baseline + +This draft is reviewed against the current implementations in: + +- `src/a2a3/runtime/aicpu_build_graph/runtime/pto_orchestrator.{h,cpp}` +- `src/a2a3/runtime/aicpu_build_graph/orchestration/pto_orchestration_api.h` +- `src/a2a3/runtime/tensormap_and_ringbuffer/runtime/pto_orchestrator.{h,cpp}` +- `src/a2a3/runtime/tensormap_and_ringbuffer/runtime/pto_runtime2.{h,cpp}` +- `src/a2a3/runtime/tensormap_and_ringbuffer/orchestration/pto_orchestration_api.h` +- `src/a2a3/runtime/tensormap_and_ringbuffer/runtime/tensor.h` + +The important current-code facts are: + +- `aicpu_build_graph` already has explicit `add_dependency`, returns `SubmitResult { task_id, outputs }`, and batch-publishes tasks at `scope_end`. +- `tensormap_and_ringbuffer` currently derives dependencies during submit, returns only `TaskOutputTensors`, and uses `scope_end` only for lifetime release. +- `tensormap_and_ringbuffer` orchestration is TLS-based today: `PTO2_SCOPE()` and `pto2_rt_submit_*()` do not take an explicit `PTO2Runtime*`. +- In `tensormap_and_ringbuffer`, `Tensor::manual_dep` is creator-retention-only mode: it skips OverlapMap lookup/insert, but `owner_task_id` retention still applies. + ## Confirmed Decisions These decisions are already aligned with the requested direction: @@ -22,8 +40,9 @@ These decisions are already aligned with the requested direction: 2. Manual scope may not contain another manual scope. 3. The design must not simplify away multi-write cases. 4. For an outer-scope tensor written inside a manual scope, readiness is the writer task completion time, not `scope_end`. -5. Therefore, a task inside a manual scope that writes an outer-scope tensor must still publish that tensor to TensorMap. -6. 
For an outer-scope tensor read inside a manual scope, the dependency must still be forced by TensorMap/owner-based boundary seeding. +5. Therefore, a task inside a manual scope that writes an outer-scope tensor must still publish that tensor to TensorMap by manual `scope_end`. +6. For an outer-scope tensor read inside a manual scope, the dependency must still be forced by TensorMap/owner-based boundary seeding during manual `scope_end`. +7. Tasks created inside a manual scope should be batch-published to the scheduler at `scope_end`, matching `aicpu_build_graph` semantics for explicit dependency closure inside the scope. ## Change Control Requirements @@ -89,13 +108,67 @@ Today the runtime derives dependencies in `pto_orchestrator.cpp` using: - modifier lookup through TensorMap overlap search - TensorMap insert for `INOUT` and `OUTPUT_EXISTING` -There is already a `Tensor::manual_dep` bit, but in current code it is effectively a per-tensor escape hatch that skips TensorMap lookup/insert. That is not sufficient for scoped hybrid semantics because the scope, not the tensor alone, decides whether a use is same-scope or cross-scope. +There is already a `Tensor::manual_dep` bit, but in current code it is only a per-tensor creator-retention mode: it skips TensorMap overlap lookup/insert while still keeping `owner_task_id` retention. That is not sufficient for scoped hybrid semantics because the scope, not the tensor alone, decides whether a use is same-scope or cross-scope. + +## Discovery vs Execution Separation + +This distinction is central to the frozen design. + +TensorMap is not the execution-time dependency engine. It is only a producer-discovery mechanism. + +The scheduler's fanin/fanout graph is the execution-time dependency engine. + +In current `tensormap_and_ringbuffer`, submit does two different things: + +1. Discover producers from tensors. 
+- creator retention from `owner_task_id` in `src/a2a3/runtime/tensormap_and_ringbuffer/runtime/pto_orchestrator.cpp` +- overlap lookup from TensorMap in `src/a2a3/runtime/tensormap_and_ringbuffer/runtime/pto_orchestrator.cpp` + +2. Convert discovered producers into scheduler edges. +- accumulate unique producer slot states in local `fanin_states[]` +- wire producer `fanout_head`, consumer `fanin_count`, and consumer `fanin_refcount` + +After that conversion, execution no longer cares how the dependency was found. + +The scheduler only sees: + +- producer fanout list +- consumer fanin counters + +This is why manual dependency integration should work as follows: + +- do not put manual dependencies into TensorMap +- do not bind manual dependencies to tensors +- at manual `scope_end`, realize manual dependencies directly as normal producer-consumer scheduler edges + +So at manual `scope_end`, for each manual task: + +1. Start a local dedup buffer such as `fanin_states[]`. +2. Add producers from recorded manual edges. +3. Add producers from outer-tensor creator retention and TensorMap lookup. +4. Dedup all of them together. +5. 
Run the normal wiring path into: + - `payload->fanin_slot_states[]` + - `fanin_count` + - producer `fanout_head` + +Then after publish: + +- manual deps and TensorMap-derived deps are indistinguishable +- both are handled by the existing scheduler readiness and completion fanout path + +Concise conclusion: + +- TensorMap discovers tensor-related dependencies +- manual deps bypass discovery +- both become the same scheduler edges before publish +- execution uses only the scheduler edge machinery, not TensorMap ## Problem Statement If we simply copy `aicpu_build_graph` semantics into `tensormap_and_ringbuffer`, we get a wrong boundary model: -- suppressing TensorMap for all tensors inside `PTO_SCOPE(manual_dep=1)` is incorrect +- suppressing TensorMap for all tensors inside `PTO2_SCOPE(true)` is incorrect - delaying publication of an outer tensor until `scope_end` is incorrect The reason is that cross-scope tensors must become visible at the actual writer frontier. Outside consumers should depend on the task that really produced the latest visible state, not on scope closure. @@ -109,13 +182,73 @@ So the correct split is: ## Core rule -`PTO_SCOPE(manual_dep=1)` means: +`PTO2_SCOPE(true)` means: -- if both producer and consumer are inside this manual scope, the dependency must be established by explicit `add_dependency` -- if a tensor use crosses the scope boundary, dependency tracking still uses TensorMap/owner metadata +- if a tensor was created inside this manual scope and is reused inside this manual scope, the dependency must be established by explicit `add_dependency` +- all outer-scope tensors still use existing TensorMap/owner metadata +- tasks submitted inside the manual scope remain invisible to the scheduler until `scope_end` This rule applies per tensor use site, not as a global on/off switch for the whole submit. +## Two Different Publication Semantics + +The design must distinguish two different kinds of publication: + +1. Scheduler publication +2. 
TensorMap boundary publication + +### Scheduler publication + +For tasks inside `PTO2_SCOPE(true)`: + +- submit builds the internal task records and records explicit dependencies +- those tasks are not yet published as executable scheduler work +- `scope_end` batch-publishes them to the scheduler + +This is required so all same-scope explicit edges are fully wired before any task in the manual scope can start execution. + +### TensorMap boundary publication + +For cross-scope tensors touched by tasks inside `PTO2_SCOPE(true)`: + +- outside tasks submitted after the manual scope ends must still be able to discover the internal writer frontier +- therefore the producer frontier for an external tensor written inside the manual scope must become visible to later TensorMap lookups at manual `scope_end` +- however the tensor is still not semantically ready until that producer task actually completes + +So: + +- scheduler visibility of the task is controlled by manual `scope_end` +- dependency readiness for later consumers is still enforced by waiting on producer task completion + +The document must not conflate these two mechanisms. + +More precisely: + +- before manual `scope_end`, the task record already exists but TensorMap boundary wiring may still be deferred +- after manual `scope_end`, the task becomes part of the executable published graph +- once published, the task may enter `READY` immediately or remain `PENDING` depending on whether its dependencies are already satisfied + +## Discussion Guardrails + +The following clarifications are recorded to reduce implementation drift and hallucination risk: + +1. Deferred publish does not mean deferred task allocation. +- Manual tasks still allocate task ids, slot state, and payload at submit time. +- What is deferred is dependency realization and ready-queue publication. + +2. Manual dependencies are not tracked by TensorMap. +- TensorMap is only used for tensor-related producer discovery. 
+- Manual dependencies are explicit producer-consumer edges recorded by orchestration. +- At manual `scope_end`, both kinds of dependencies are converted into the same scheduler fanin/fanout structures. + +3. After manual `scope_end`, there is no special execution-time manual mechanism. +- The runtime should not keep a second dependency engine alive after publish. +- Once the scope is finalized, all dependencies are handled only by the existing scheduler fanin/fanout path. + +4. Deferred boundary wiring does not change tensor readiness semantics. +- Outer writes become TensorMap-visible at manual `scope_end`. +- Their semantic readiness is still producer-task completion. + ## Tensor categories For a task submitted inside a manual scope, every tensor argument falls into one of these categories: @@ -126,7 +259,11 @@ For a task submitted inside a manual scope, every tensor argument falls into one 4. Outer-scope tensor accessed through a derived view/reshape/transpose inside the manual scope 5. External tensor with no owner task -The runtime must classify behavior from ownership and current scope, not only from argument tag. +The runtime only needs one special classification for v1: + +- tensor created in the current manual scope + +Everything else stays on the existing TensorMap path. ## Expected behavior by category @@ -139,10 +276,11 @@ The runtime must classify behavior from ownership and current scope, not only fr ### 2. Outer-scope tensor, written in place -- The internal writer must still publish to TensorMap. +- The internal writer must still publish its producer frontier for TensorMap boundary tracking. +- That boundary frontier becomes visible at manual `scope_end`, so later outside submissions can attach to the correct writer task. - Readiness of the written tensor is the completion of that writer task. - Multiple writes inside the same manual scope are allowed. -- TensorMap should continue tracking the latest producer frontier exactly as in normal scope. 
+- TensorMap should continue tracking the latest producer frontier exactly as in normal scope once the manual scope is finalized. ### 3. Tensor created inside this manual scope and reused only inside this manual scope @@ -170,21 +308,65 @@ A tensor created inside the manual scope should not later become an outer-scope ## Orchestration API -Add explicit edge wiring to `tensormap_and_ringbuffer` orchestration API, mirroring `aicpu_build_graph`: +Keep the existing `tensormap_and_ringbuffer` orchestration style: TLS-based helpers with no explicit runtime argument in user orchestration code. Do not make the public surface look like `aicpu_build_graph`'s `PTO2_SCOPE(rt)` family just to add manual mode. + +Add explicit edge wiring to `tensormap_and_ringbuffer` orchestration API: ```cpp void pto2_rt_add_dependency(PTO2TaskId producer, PTO2TaskId consumer); ``` -Add scoped manual mode: +Extend scope syntax to accept an optional manual flag: ```cpp -PTO_SCOPE(manual_dep = 1) { +PTO2_SCOPE(true) { ... } ``` -For C++ implementation, this should compile down to a guard with scope metadata, not a dynamic stringly API. +`PTO2_SCOPE()` remains the normal-scope form. `PTO2_SCOPE(true)` enters manual mode. + +Do not change `TaskOutputTensors`. + +Add new manual-submit APIs with `_manual` suffix so orchestration code can get task ids without changing existing normal submit call sites. 
This mirrors the role of `aicpu_build_graph`'s `SubmitResult`, but keeps the existing `tensormap_and_ringbuffer` submit APIs intact: + +```cpp +struct PTO2ManualSubmitResult { + PTO2TaskId task_id; + TaskOutputTensors outputs; +}; + +PTO2ManualSubmitResult pto2_rt_submit_task_manual(const MixedKernels& mixed_kernels, const Arg& args); +PTO2ManualSubmitResult pto2_rt_submit_aic_task_manual(int32_t kernel_id, const Arg& args); +PTO2ManualSubmitResult pto2_rt_submit_aiv_task_manual(int32_t kernel_id, const Arg& args); +``` + +These APIs are intended for use inside `PTO2_SCOPE(true)` where explicit dependency wiring is required. + +This design intentionally splits task APIs, not tensor storage APIs: + +- normal scope uses existing `pto2_rt_submit_*` task APIs +- manual scope uses `pto2_rt_submit_*_manual` task APIs +- both modes continue using the same `Tensor`, `Arg`, and `TensorArgType` model + +So manual mode changes how tasks are recorded and finalized, not how tensors are represented. + +## Rejected API Alternatives + +The following alternatives were considered and rejected for v1: + +1. Add a new user-facing “external tensor” API for manual scope. +- Rejected because manual mode only needs to identify manual-local tensors. +- Everything else can be treated as outer/external and handled by the existing TensorMap path. +- Adding a second tensor annotation API would increase surface area without adding necessary information. + +2. Change `TaskOutputTensors` to carry task ids. +- Rejected to avoid broad churn in existing orchestration code. +- Manual mode gets separate `_manual` submit APIs instead. + +3. Create a second tensor representation for manual mode. +- Rejected because payload already stores the copied tensor/scalar data needed for deferred finalize. +- The task API split is enough; tensor storage stays unified. 
## Runtime API @@ -192,15 +374,18 @@ Add runtime ops support: ```cpp void (*add_dependency)(PTO2Runtime* rt, PTO2TaskId producer, PTO2TaskId consumer); +void (*scope_begin)(PTO2Runtime* rt, bool manual_dep); ``` -Add manual-scope entry/exit plumbing by extending the existing scope API with a mode flag: +The orchestration-facing helper can stay TLS-style and hide the runtime pointer, for example by plumbing the flag through the existing `pto2_rt_scope_begin()` / `PTO2ScopeGuard` path. + +Add manual-scope entry/exit plumbing by extending the existing runtime entry point with a mode flag: ```cpp void pto2_rt_scope_begin(PTO2Runtime* rt, bool manual_dep); ``` -Recommendation: extend scope state with a mode flag and keep one scope stack. Avoid separate manual/non-manual stacks and avoid introducing a second scope API family. +Recommendation: extend scope state with a mode flag and keep one scope stack. Avoid separate manual/non-manual stacks. ## Internal Design @@ -210,15 +395,12 @@ Each scope frame needs: - `begin_index` into `scope_tasks` - scope mode: normal or manual -- unique scope id +- `begin_index` into a manual-edge buffer when the scope is manual +- `begin_index` into a manual-task-meta buffer when the scope is manual -The unique scope id is required because same-scope vs cross-scope classification must be relative to the current manual scope, not only to nested depth. +Manual scope needs a local edge registry because `add_dependency` should record edges during orchestration but should not mutate scheduler fanin/fanout state until manual `scope_end`. -Recommendation: - -- assign `scope_id` on every `scope_begin` -- store current producing `scope_id` on task-side provenance -- use `owner_task_id` on `Tensor` to reach producing task provenance +Manual scope also needs a compact per-task metadata stream so `scope_end` can replay the deferred dependency logic without copying full `Arg` objects. 
## Tensor metadata @@ -229,79 +411,407 @@ Current `Tensor` already stores: Recommendation: do not add new tensor metadata in v1. -The critical missing concept is producing scope identity, but that provenance should live on the producer task side if possible, not on `Tensor`. +The narrowed v1 rule only needs to identify tensors created in the current manual scope. That can be derived from: -We need enough information to answer: +- `tensor.owner_task_id` +- the set of task ids created in the current manual scope -- was this tensor produced in the current manual scope? -- or is it owned by an outer scope and therefore boundary-visible? - -Preferred approach: +So the preferred approach is: - keep `Tensor` layout unchanged -- use `tensor.owner_task_id` as the provenance pointer -- record `owner_scope_id` on producer task-side metadata such as task descriptor or scheduler/orchestrator slot state -- classify same-scope versus cross-scope through `owner_task_id -> producer provenance -> scope_id` +- keep `owner_task_id` as the provenance pointer +- track the current manual scope's owned task membership in scope-local orchestrator state -`manual_dep` should no longer be the primary mechanism for scope semantics. It may remain as a per-tensor override, but the scoped design should be driven by: +`manual_dep` should not become the primary mechanism for scoped semantics. 
It may remain as a per-tensor escape hatch for existing behavior, but the manual-scope design should be driven by: - current scope mode - tensor owner task id -- producer task scope provenance +- whether that owner belongs to the current manual scope + +## Shared Tensor Path, Split Task APIs + +The design should keep one shared tensor recording path across normal and manual scope: + +- `Arg` remains the user-facing container for tensor refs, tensor create-info, scalars, and `TensorArgType` +- `PTO2TaskPayload` remains the destination for copied tensors and scalars +- runtime-created outputs still receive `owner_task_id` during submit + +What changes in manual scope is only the task API and the time when dependency logic runs: -## Submit-time classification +- normal submit APIs perform dependency lookup and TensorMap insert immediately +- manual submit APIs only allocate the task, copy payload data, and record compact finalize metadata +- manual `scope_end` replays dependency lookup and TensorMap insert from the recorded payload -In `pto_orchestrator.cpp`, dependency behavior should be classified per tensor argument. +This keeps normal-mode APIs unchanged while avoiding a second tensor representation for manual mode. 
-Pseudo-rule for a task submitted inside a manual scope: +## Classification rule + +In manual scope, the runtime only needs one special classification: ```cpp -same_scope_tensor = +manual_local_tensor = tensor.owner_task_id.is_valid() && - producer_scope_id(tensor.owner_task_id) == current_manual_scope_id; + current_manual_scope_owns(tensor.owner_task_id); +``` + +Then: +```cpp if (!in_manual_scope) { use existing tensormap behavior; -} else if (same_scope_tensor) { - skip TensorMap lookup/insert; - rely on explicit add_dependency; +} else if (manual_local_tensor) { + use explicit add_dependency only; } else { - use cross-scope TensorMap/owner behavior; + use existing TensorMap/owner behavior; } ``` -This should be applied separately to: +Important nuance: -- creator retention -- modifier lookup -- TensorMap insertion for writes +- tensors created in the current manual scope use explicit same-scope dependencies +- all outer tensors stay on the existing TensorMap path, even if two tasks inside the manual scope both access them +- this means outer tensors may still create implicit same-scope edges through TensorMap inside a manual scope +- this is an accepted v1 decision and should be documented in the PR description as a deliberate tradeoff -Important nuance: +This is why a separate user-facing “external tensor” API is not required for v1: + +- manual mode only needs to identify manual-local tensors +- everything else is treated as outer/external and goes through the existing TensorMap path +- that decision can be derived from `owner_task_id` plus the current manual scope's owned-task membership check + +## Scheduler-Safe Hybrid Design + +The scheduler changes should be localized and should not disturb existing normal-scope behavior. 
+ +### Design principle + +Keep two execution paths: + +- normal scope path: existing `tensormap_and_ringbuffer` behavior +- manual scope path: deferred dependency realization and deferred scheduler publication + +The normal path should remain unchanged as much as possible. + +### What a manual-scope task must count as dependencies + +For a task inside `PTO2_SCOPE(true)`, total fanin is: + +- explicit manual dependencies added by `add_dependency` +- external dependencies derived from TensorMap/owner logic for outer-scope reads +- one extra publish barrier released only at manual `scope_end` + +In other words: + +```cpp +fanin_count = + manual_dep_edges + + external_tensor_deps + + 1; // publish barrier +``` + +This is the key mechanism that lets the scheduler ignore manual-local TensorMap lookup while still respecting out-of-scope dependencies. + +### What submit should do in manual scope + +For a task submitted inside manual scope: + +1. Allocate slot and payload exactly as today. +2. Initialize `task_state = PENDING`. +3. Initialize `fanin_count = 1` and `fanin_refcount = 0` for deferred publication. +4. Return a stable `task_id` immediately so orchestration can call `add_dependency`. +5. Do not realize explicit manual edges into scheduler fanin/fanout yet. +6. Do not realize external TensorMap-derived dependencies yet. +7. Do not publish outer writes into TensorMap yet. +8. Do not push the task into ready queues during submit. +9. Preserve enough scope-local information so manual `scope_end` can realize all dependencies before publish. 
+ +Submit-time task records are still required even though execution is deferred: + +- manual submit APIs must return stable task ids immediately +- runtime-created outputs need `owner_task_id` immediately so later scope-local tensors and their derived views can be recognized +- the scheduler only sees these tasks after manual `scope_end` + +Manual mode should also record a compact per-task finalize descriptor rather than a second full copy of `Arg`. + +Recommended shape: + +```cpp +struct PTO2ManualTaskMeta { + uint64_t packed_tags; // compact encoding of TensorArgType for this task + uint16_t tensor_count; + uint16_t edge_begin; // range in manual_edges[] + uint16_t edge_count; + uint16_t _pad; +}; + +struct PTO2ManualEdge { + uint16_t producer_idx; // index in current manual scope's task slice + uint16_t consumer_idx; +}; +``` -- same-scope tensors should still retain creator lifetime through explicit dependencies, not through automatic creator retention -- cross-scope tensors should still retain creator lifetime automatically -- cross-scope outer reads must still execute the existing TensorMap/owner dependency path even when the current scope is manual +Why this is low-overhead: + +- tensor values are already copied into `PTO2TaskPayload` +- scalars are already copied into `PTO2TaskPayload` +- tags are much smaller than copying `Arg` again +- the edge list is dense, append-only, and scope-local + +The design should prefer a packed tag stream plus a dense edge stream over storing duplicated tensor refs or explicit user-marked external tensors. 
+ +That gives a manual pre-publish state: + +- task records and task ids already exist +- explicit edges are only recorded, not yet wired into scheduler fanin/fanout +- external TensorMap-derived edges are also deferred until `scope_end` +- the task is still unpublished as executable scheduler work because the publish barrier is not yet released + +### What scope_end should do in manual scope + +Manual `scope_end` needs one additional finalize-and-publish step before the existing lifetime-release step completes. + +Recommended sequence: + +1. For every task directly owned by this manual scope: + - realize recorded explicit `add_dependency` edges into scheduler fanin/fanout state + - inspect each tensor arg + - if the tensor is manual-local, skip TensorMap logic + - otherwise run the existing TensorMap/owner dependency logic + - if the task writes an outer tensor, insert its producer frontier into TensorMap +2. After all dependency realization is complete for the scope: + - release the publish barrier by incrementing `fanin_refcount` + - if `fanin_refcount == fanin_count`, transition to `READY` and push to ready queue + - otherwise keep the task in published `PENDING` state so later producer completion can resolve it +3. Release the scope lifetime reference exactly as current `on_scope_end` does + +This can be implemented as a manual-scope finalize path in the orchestrator plus a small scheduler helper for the publish-barrier release. + +Example helper shape: + +```cpp +void publish_manual_scope_tasks(PTO2TaskSlotState** task_slot_states, int32_t count); +``` + +This helper should reuse the existing ready-transition logic as much as possible. + +### How external dependency replay works + +Manual `scope_end` should replay tasks in original submit order, using: + +- `scope_tasks[]` for task order +- `manual_task_meta[]` for packed tags and edge ranges +- `PTO2TaskPayload::tensors[]` for actual tensor values + +For each task in that order: + +1. 
Realize explicit manual edges whose consumer is this task. +2. Decode tensor tags from `packed_tags`. +3. For each tensor arg: + - if `owner_task_id` belongs to the current manual scope's owned task set, treat it as manual-local and skip TensorMap logic + - otherwise treat it as outer/external +4. For outer/external tensors: + - apply creator-retention logic from `owner_task_id` + - apply existing TensorMap overlap lookup for `INPUT` / `INOUT` +5. After lookup for this task: + - apply normal TensorMap insertion for outer writes (`INOUT` / `OUTPUT_EXISTING`) + +This replay order matters: + +- it preserves current tensormap behavior for multiple writes to outer tensors +- earlier outer writes from the same manual scope become visible to later tasks in the same manual scope during replay +- that matches the accepted v1 tradeoff that outer tensors may still induce implicit same-scope TensorMap edges + +The replay must not be implemented as: + +- all lookups for the whole scope first, then all inserts +- all explicit manual edges first, then a second undeduped TensorMap pass +- per-dependency immediate scheduler mutation without first building a deduped producer set for the consumer + +Those variants would diverge from current tensormap semantics and are considered incorrect for this design. 
+ +### Important case: external dependency already produced before manual publish + +For a manual-scope task that reads an outer-scope tensor: + +- if the external producer task has already completed when dependency realization happens at manual `scope_end`, that edge should immediately contribute to `fanin_refcount` +- then manual `scope_end` releases only the publish barrier, and the task may become `READY` immediately + +If the external producer has only published its TensorMap frontier but not yet completed: + +- the manual-scope consumer is published at manual `scope_end` +- but it remains in published `PENDING` +- later producer completion notifies fanout and increments `fanin_refcount` +- once `fanin_refcount == fanin_count`, the consumer transitions to `READY` + +This is the desired hybrid behavior: + +- dependency construction happens at manual `scope_end`, before publish +- dependency satisfaction is still handled by the normal runtime execution path after publish + +### Why this is low-risk + +- no change to ready queue implementation +- no change to worker dispatch loop +- no change to normal TensorMap scope behavior +- no need for a new scheduler task state +- reuse the existing `fanin_count` / `fanin_refcount` / `PENDING -> READY` transition model + +The main new behavior is deferred dependency realization plus deferred release of the publish barrier for manual-scope tasks. + +## Current-Manual-Scope Ownership Without Tensor Changes + +To decide whether a tensor is manual-local or outer-visible, the orchestrator only needs to know whether its `owner_task_id` belongs to the current manual scope. + +Recommended minimal design: + +- keep `Tensor` unchanged +- use `Tensor.owner_task_id` as the provenance link +- keep a scope-local registry of task ids created in the current manual scope + +A good low-risk implementation is to reuse the existing flat `scope_tasks` buffer plus a parallel manual-edge buffer, rather than widening hot structs unnecessarily. 
+ +Classification then becomes: + +```cpp +if (!tensor.owner_task_id.is_valid()) { + // external tensor with no producer task +} else { + manual_local = current_manual_scope_owns(tensor.owner_task_id); +} +``` + +## Lifecycle Clarification + +This design needs precise task-lifecycle terms: + +- `COMPLETED`: task execution has finished; produced tensor data is semantically ready +- `CONSUMED` / reclaimed: all fanout references and the owning-scope reference have been released, so the task slot may be reused and `last_task_alive` may advance +- tensor readiness: data-level concept, typically tied to producer task completion + +This matters for deferred manual `scope_end` wiring: + +- an outer-scope producer task may already be `COMPLETED` before the inner manual scope ends +- that is fine, and the manual finalize path should treat it as an already-satisfied dependency +- what must remain true is that the producer task has not yet reached the reclaimed / slot-reusable state before the inner manual `scope_end` + +Why this is expected to hold: + +- tasks created in the current manual scope are still protected by the current manual scope reference until manual `scope_end` +- tasks created in an outer still-active scope may complete early, but the outer scope still holds their scope reference until that outer scope ends +- therefore an inner manual scope can still discover those outer producers through `owner_task_id` or TensorMap when it finalizes + +This does not mean the producer task is still runnable or incomplete. +It may already be `COMPLETED`; the manual finalize path should then treat it as an already-satisfied dependency. + +So the safety argument is not "outer producers cannot complete early". 
The correct statement is: + +- outer producers may complete before inner manual `scope_end` +- they should still remain alive enough to be discoverable until the deferred boundary wiring for that inner manual scope has finished + +## External Dependency Publication In Manual Scope + +The spec needs two explicit rules here. + +### External reads + +A task inside manual scope that reads an outer-scope tensor: + +- must still collect the external producer through TensorMap/owner logic +- must include that dependency in its fanin during manual `scope_end`, before manual batch publish +- must not require the user to restate that outer dependency manually + +### External writes + +A task inside manual scope that writes an outer-scope tensor: + +- must publish its producer frontier to TensorMap during manual `scope_end` +- must not publish same-scope temporary tensors into TensorMap +- may still be `PENDING` and unpublished to the scheduler until manual `scope_end` + +This is safe because later outside submissions only need to identify the producer task and wire dependency to it. Actual execution readiness is still controlled by task completion and the scheduler's normal completion path. + +## Manual Dependencies And External Dependencies On The Same Task + +A single task inside manual scope may simultaneously depend on: + +- explicit same-scope manual producers +- external TensorMap-derived producers + +This is supported by the same fanin accounting model. 
+ +Example: + +```cpp +PTO2_SCOPE(true) { + t0 = in-scope producer of tmp + t1 = consumer of tmp and outer tensor X + add_dependency(t0, t1) +} +``` + +At manual `scope_end`, for `t1`: + +- `t0 -> t1` contributes one explicit manual fanin edge +- outer tensor `X` contributes boundary-derived external fanin edges +- publish barrier contributes one extra deferred fanin unit + +`t1` becomes READY only after: + +- explicit in-scope producers complete +- external producers complete +- manual `scope_end` releases the publish barrier + +That is the intended scheduler behavior. ## Explicit edge wiring -`pto2_add_dependency` from `aicpu_build_graph` can be reused conceptually, but the implementation must match `tensormap_and_ringbuffer` scheduler semantics: +`pto2_add_dependency` from `aicpu_build_graph` can be reused conceptually, but manual scope should not wire scheduler fanin/fanout immediately. + +Recommended behavior inside manual scope: + +- validate that both task ids belong to the current manual scope +- record the edge in a scope-local manual-edge buffer +- do not increment `fanin_count` yet +- do not mutate producer `fanout_head` yet -- increment consumer `fanin_count` +Then at manual `scope_end`: + +- realize each recorded edge into the scheduler's existing fanin/fanout structures +- increment `fanin_count` - record producer in consumer payload for release traversal -- wire producer fanout list under `fanout_lock` -- handle early-completed producer case +- handle the already-completed producer case exactly once, during realization + +This avoids racing live external completion against partially built manual dependency state. + +Important discussion note: -No scope-end batch publish behavior should be imported. `tensormap_and_ringbuffer` tasks are already submit-visible before scope end, and changing that would be a separate design. 
+- the deduped producer set for one consumer must include all sources together: + - explicit manual edges + - creator retention from `owner_task_id` + - TensorMap overlap lookup + +The implementation must not count these sources independently and then wire fanout multiple times for the same producer-consumer pair. ## Scope-end behavior +Manual scope changes scheduler publication semantics for tasks inside that scope: + +- tasks in manual scope are batch-published to the scheduler at `scope_end` +- same-scope explicit edges must be fully wired before that publish happens + Manual scope does not change lifetime release semantics: - `scope_end` still releases the owning-scope fanout reference -- `scope_end` is not a publication barrier for cross-scope tensors -- cross-scope visibility must already reflect task completion frontier -This is the main semantic difference from `aicpu_build_graph`. +Manual scope also does not change cross-scope readiness semantics: + +- external tensor readiness is still producer-task completion, not `scope_end` +- but external-writer frontier information must be visible to later TensorMap lookups no later than manual `scope_end` + +This manual-scope behavior intentionally combines: + +- `aicpu_build_graph`-style scope-end batch publish for explicit same-scope dependencies +- `tensormap_and_ringbuffer`-style TensorMap boundary tracking for cross-scope tensors ## Multiple Writes To Outer Tensors @@ -310,7 +820,7 @@ This case must be supported in v1. 
Example: ```cpp -PTO_SCOPE(manual_dep=1) { +PTO2_SCOPE(true) { t1 writes outer C t2 writes outer C add_dependency(t1, t2) @@ -320,15 +830,16 @@ outside task reads C Correct behavior: -- `t1` publishes `C` to TensorMap -- `t2` publishes `C` again to TensorMap +- at manual `scope_end`, `t1` publishes `C` to TensorMap +- at manual `scope_end`, `t2` publishes `C` again to TensorMap - outside reader should see `t2` as the latest producer frontier - because `t1 -> t2` is explicit, `t2` completion is a valid readiness frontier for the final visible state +- outer tensors may still create implicit same-scope TensorMap edges inside the manual scope; this is an accepted v1 tradeoff and should be called out in the PR description Potential invalid user pattern: ```cpp -PTO_SCOPE(manual_dep=1) { +PTO2_SCOPE(true) { t1 writes outer C t2 also writes outer C // missing add_dependency(t1, t2) @@ -352,6 +863,7 @@ This is a strict requirement: - outer read boundary dependency is forced by TensorMap/owner metadata - orchestration code inside the manual scope must not be required to recreate that outer dependency manually +- even though the consumer task itself is only batch-published to the scheduler at manual `scope_end`, its fanin accounting must include the external TensorMap-derived dependency before publication ## Nesting Rules @@ -363,11 +875,14 @@ Supported: Not supported in v1: - manual scope contains manual scope +- manual scope contains any nested scope with its own publish boundary Reason: -- the same-scope vs cross-scope rule is already relative to the current manual frame -- nested manual scopes add little value and complicate classification and diagnostics +- current ring selection depends on scope depth +- the top scope frame is also the publication and lifetime-release boundary +- allowing a child scope inside `PTO2_SCOPE(true)` would split one manual region across multiple scope/ring boundaries unless extra machinery is added +- rejecting nested scopes inside manual 
mode keeps `current_manual_scope_owns(...)` a simple membership check over one manual frame Recommendation: @@ -376,18 +891,35 @@ Recommendation: Required error text quality: +- the message must explicitly say that nested scope inside `PTO2_SCOPE(true)` is not supported in v1 - the message must explicitly say that `manual scope inside manual scope is not supported` -- the message must identify the offending operation as nested `PTO_SCOPE(manual_dep=1)` +- the message must identify the offending operation as nested `PTO2_SCOPE(true)` - the message must not use vague wording such as only `invalid scope state` +## Blocking Cross-Layer Tensor Access + +`get_tensor_data` and `set_tensor_data` are blocking cross-layer access APIs. Their current contract assumes producer state is already published through TensorMap/owner metadata. + +That assumption does not hold inside manual scope because tasks remain unpublished until manual `scope_end`. + +So v1 should fail fast: + +- `get_tensor_data` inside `PTO2_SCOPE(true)` is an error +- `set_tensor_data` inside `PTO2_SCOPE(true)` is an error + +Required error text quality: + +- the message must explicitly say that blocking tensor data access is not supported inside `PTO2_SCOPE(true)` +- the message should tell the user to exit the manual scope first + ## Diagnostics The runtime should detect and report: -1. nested manual scope not supported, with an explicit error message +1. nested scope inside manual mode not supported, with an explicit error message 2. `add_dependency` used with invalid task ids 3. dependency overflow from explicit wiring -4. obvious cross-scope/manual mismatch where possible +4. `get_tensor_data` or `set_tensor_data` called inside manual scope Nice-to-have diagnostics: @@ -415,11 +947,12 @@ Add focused coverage before broad workload migration. 4. 
Manual scope writes outer tensor multiple times - latest writer becomes TensorMap frontier - correctness depends on explicit same-scope edge wiring +- accepted implicit TensorMap edges on outer tensors are documented 5. Normal scope containing manual scope - outer to inner and inner to outer boundary cases both work -6. Nested manual scope +6. Nested scope inside manual mode - rejected with deterministic error ### Example-level migration @@ -437,36 +970,44 @@ Only then move to more complex orchestration such as paged attention. - This breaks cross-scope correctness. 2. Using `Tensor::manual_dep` as the only signal. -- Scope semantics are relational and need owner scope identity. +- Scoped semantics should be driven by current manual-scope ownership, not by the tensor flag alone. 3. Failing to force outer-scope reads through TensorMap/owner dependency seeding. - This allows manual-scope tasks to read before the outer producer frontier is ready. -4. Letting cross-scope writes publish only at `scope_end`. -- This delays readiness incorrectly. +4. Confusing scheduler batch publication with tensor readiness semantics. +- Manual-scope tasks should be scheduler-visible at `scope_end`, but external tensor readiness is still producer completion. + +5. Letting cross-scope writer frontier become visible only after producer completion. +- This is too late for later outside submissions made after manual `scope_end`. -5. Accidentally preserving creator retention for same-scope tensors in manual mode. -- This reintroduces hidden dependencies and weakens the mental model. +6. Realizing manual edges incrementally before `scope_end`. +- This can race with already-live external producers and partially built fanin state. -6. Missing alias/view inheritance of scope ownership. +7. Missing alias/view inheritance of scope ownership. - This causes wrong same-scope vs cross-scope classification. -7. Turning this feature into a broad runtime refactor. +8. 
Turning this feature into a broad runtime refactor. - This increases regression risk and violates the required change scope. +9. Allowing blocking cross-layer tensor access inside manual scope. +- `get_tensor_data` and `set_tensor_data` assume published producer state and should fail fast in manual scope. + +10. Replacing the existing scheduler edge machinery with a separate manual execution path. +- This would duplicate fanin/fanout handling, completion notification, and release traversal. +- The design requires one unified post-publish scheduler mechanism. + ## Recommended Implementation Order 1. Add API surface for `add_dependency` and manual scope mode. -2. Add scope-frame mode and `scope_id`. -3. Add tensor ownership metadata needed for classification. -4. Implement explicit edge wiring in tensormap runtime. -5. Refactor submit-time dependency logic to branch on: - - current scope mode - - tensor owner task id - - producer task scope provenance -6. Add fail-fast nested-manual-scope check. -7. Add targeted tests for boundary semantics. -8. Migrate one example and validate. +2. Add manual-submit APIs with `_manual` suffix returning task ids plus outputs. +3. Add scope-frame mode plus scope-local manual-edge storage. +4. Implement deferred explicit edge realization at manual `scope_end`. +5. Implement manual-local tensor classification from `owner_task_id` plus current manual-scope ownership. +6. Realize outer-tensor TensorMap lookup/insert during manual `scope_end`. +7. Add fail-fast nested-scope-in-manual check and block `get_tensor_data` / `set_tensor_data` in manual scope. +8. Add targeted tests for boundary semantics. +9. Migrate one example and validate. 
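Steps 3 and 4 of the implementation order above (scope-local manual-edge storage plus deferred realization at manual `scope_end`) can be sketched as a small self-contained model. The names here (`ManualScopeModel`, `realize_at_scope_end`) are invented for illustration; the real runtime wires fanin through its builder/spill-pool machinery rather than returning counts:

```cpp
#include <cassert>
#include <set>
#include <utility>
#include <vector>

// Hypothetical model of scope-local manual-edge storage. Edges recorded by
// add_dependency stay pending; they are deduplicated and turned into fanin
// counts only when the manual scope ends (deferred realization).
struct ManualScopeModel {
    // (producer_idx, consumer_idx) pairs in submission order.
    std::vector<std::pair<int, int>> pending_edges;

    void add_dependency(int producer_idx, int consumer_idx) {
        pending_edges.emplace_back(producer_idx, consumer_idx);
    }

    // scope_end replay: collapse duplicate producer-consumer pairs so one
    // producer contributes at most one fanin to a given consumer, then
    // return the per-task fanin counts that would be wired.
    std::vector<int> realize_at_scope_end(int task_count) {
        std::set<std::pair<int, int>> unique(pending_edges.begin(), pending_edges.end());
        std::vector<int> fanin(task_count, 0);
        for (const std::pair<int, int> &edge : unique) {
            fanin[edge.second] += 1;
        }
        pending_edges.clear();  // scope-local storage resets with the frame
        return fanin;
    }
};
```

With tasks t0..t2, edge t0 to t1 recorded twice, and edge t1 to t2 recorded once, realization yields fanin counts {0, 1, 1}: the duplicate pair is collapsed, matching the rule that the deduped producer set must never wire fanout twice for the same producer-consumer pair.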
## Open Question Resolved @@ -475,7 +1016,7 @@ This design intentionally resolves the central ambiguity: - `scope_end` controls lifetime release - task completion controls semantic readiness -For outer tensors written inside manual scope, TensorMap publication must stay aligned with task completion frontier, not with scope closure. +For outer tensors written inside manual scope, TensorMap frontier publication happens at manual `scope_end`, while semantic readiness is still producer-task completion. ## File Areas Expected To Change @@ -483,17 +1024,18 @@ For outer tensors written inside manual scope, TensorMap publication must stay a - `src/a2a3/runtime/tensormap_and_ringbuffer/runtime/pto_runtime2.h` - `src/a2a3/runtime/tensormap_and_ringbuffer/runtime/pto_runtime2.cpp` - `src/a2a3/runtime/tensormap_and_ringbuffer/runtime/pto_orchestrator.cpp` -- `src/a2a3/runtime/tensormap_and_ringbuffer/runtime/tensor.h` - `src/a2a3/runtime/tensormap_and_ringbuffer/runtime/pto_runtime2_types.h` +- `src/a2a3/runtime/tensormap_and_ringbuffer/runtime/pto_scheduler.h` - docs and examples/tests needed to demonstrate the new scoped behavior ## Recommendation Summary Implement manual dependency as a scope-local override inside `tensormap_and_ringbuffer`, not as a runtime-wide replacement of TensorMap: -- same manual scope: explicit `add_dependency` -- crossing the manual scope boundary: TensorMap -- write visibility: writer completion +- tensors created in the current manual scope: explicit `add_dependency` +- outer tensors: existing TensorMap path +- TensorMap boundary realization for manual scopes: manual `scope_end` +- semantic readiness of outer writes: writer completion - lifetime release: `scope_end` That is the smallest design that satisfies the requested model without breaking the core tensormap runtime semantics. 
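As a closing illustration of the multi-write rule, the frontier behavior can be modeled as a map from tensor to latest writer. The names (`TensorMapModel`, `publish`, `lookup`) are invented for this sketch, and the real TensorMap performs overlap search over memory regions rather than exact-key lookup:

```cpp
#include <cassert>
#include <map>

using TensorId = int;
using TaskId = int;

// Toy frontier model: outer-tensor writes inside a manual scope are replayed
// into the map in submission order at manual scope_end, so the last writer
// becomes the visible producer frontier for later outside consumers.
struct TensorMapModel {
    std::map<TensorId, TaskId> latest_writer;

    // Called once per outer-tensor write, replayed in submission order.
    void publish(TensorId tensor, TaskId writer) { latest_writer[tensor] = writer; }

    // Dependency seed for a consumer submitted after scope_end.
    TaskId lookup(TensorId tensor) const { return latest_writer.at(tensor); }
};
```

For the design's example, t1 and then t2 both write outer `C` with an explicit t1 to t2 edge inside the scope; replaying both publishes leaves t2 as the frontier, so an outside reader of `C` depends on t2, and t2's completion (not scope closure) is the readiness point.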
From 37a0faee598ea72be6e7e96d6ce9e3043b276f3d Mon Sep 17 00:00:00 2001
From: Youwei Xiao
Date: Sat, 4 Apr 2026 01:06:55 +0800
Subject: [PATCH 04/35] Update: add partial manual TensorMap scope support

- add PTO2ScopeMode::MANUAL, manual submit APIs, and deferred
  scope_end replay for tensormap_and_ringbuffer
- add paged_attention_partial_manual plus paged_attention*_partial_manual
  ST coverage for nested outer-normal and inner-manual scopes
- repoint AGENTS.md/CLAUDE.md toward the .agents layout and add a
  placeholder so the directory is tracked
---
 .agents                                      |   1 +
 AGENTS.md                                    |  18 +-
 CLAUDE.md                                    |   2 +-
 .../paged_attention_partial_manual/golden.py |   7 +
 .../kernels/kernel_config.py                 |  72 +++
 .../orchestration/paged_attention_orch.cpp   | 170 ++++++
 .../orchestration/pto_orchestration_api.h    |  53 +-
 .../runtime/pto_orchestrator.cpp             | 535 ++++++++++++++----
 .../runtime/pto_orchestrator.h               |  33 +-
 .../runtime/pto_runtime2.cpp                 |  36 +-
 .../runtime/pto_runtime2.h                   |  10 +-
 .../runtime/pto_runtime2_types.h             |  14 +
 .../runtime/pto_scheduler.h                  |  11 +
 .../paged_attention_partial_manual/golden.py |   7 +
 .../kernels/kernel_config.py                 |  71 +++
 .../orchestration/paged_attention_orch.cpp   | 171 ++++++
 .../golden.py                                |   8 +
 .../kernels/kernel_config.py                 |  72 +++
 .../orchestration/paged_attention_orch.cpp   | 181 ++++++
 19 files changed, 1330 insertions(+), 142 deletions(-)
 create mode 120000 .agents
 mode change 100644 => 120000 AGENTS.md
 create mode 100644 examples/a2a3/tensormap_and_ringbuffer/paged_attention_partial_manual/golden.py
 create mode 100644 examples/a2a3/tensormap_and_ringbuffer/paged_attention_partial_manual/kernels/kernel_config.py
 create mode 100644 examples/a2a3/tensormap_and_ringbuffer/paged_attention_partial_manual/kernels/orchestration/paged_attention_orch.cpp
 create mode 100644 tests/st/a2a3/tensormap_and_ringbuffer/paged_attention_partial_manual/golden.py
 create mode 100644 tests/st/a2a3/tensormap_and_ringbuffer/paged_attention_partial_manual/kernels/kernel_config.py
 create
mode 100644 tests/st/a2a3/tensormap_and_ringbuffer/paged_attention_partial_manual/kernels/orchestration/paged_attention_orch.cpp create mode 100644 tests/st/a2a3/tensormap_and_ringbuffer/paged_attention_unroll_partial_manual/golden.py create mode 100644 tests/st/a2a3/tensormap_and_ringbuffer/paged_attention_unroll_partial_manual/kernels/kernel_config.py create mode 100644 tests/st/a2a3/tensormap_and_ringbuffer/paged_attention_unroll_partial_manual/kernels/orchestration/paged_attention_orch.cpp diff --git a/.agents b/.agents new file mode 120000 index 000000000..c8161850a --- /dev/null +++ b/.agents @@ -0,0 +1 @@ +.claude \ No newline at end of file diff --git a/AGENTS.md b/AGENTS.md deleted file mode 100644 index 982e706a1..000000000 --- a/AGENTS.md +++ /dev/null @@ -1,17 +0,0 @@ -# AGENTS Guide - -**EVERY AI AGENT MUST FOLLOW THIS GUIDE BEFORE ANY WORK.** - -## Required startup sequence - -1. Read `CLAUDE.md` before running commands, analyzing code, or editing files. -2. Treat `CLAUDE.md` as the source of truth for role boundaries, architecture context, and repository workflow. -3. Load always-on conventions from `.claude/rules/` (for example: architecture, codestyle, device constraints). -4. Load only task-relevant workflows from `.claude/skills/` and `.claude/commands/`. - -## Additional rules - -- If `CLAUDE.md` changes, read it again before continuing. -- If relevant files under `.claude/rules/`, `.claude/skills/`, or `.claude/commands/` change, refresh your context before proceeding. -- If user instructions conflict with repository conventions, prioritize user intent for that task. -- Higher-priority system/developer/user instructions override this guide. 
diff --git a/AGENTS.md b/AGENTS.md new file mode 120000 index 000000000..681311eb9 --- /dev/null +++ b/AGENTS.md @@ -0,0 +1 @@ +CLAUDE.md \ No newline at end of file diff --git a/CLAUDE.md b/CLAUDE.md index 046b9fdf3..269255928 100644 --- a/CLAUDE.md +++ b/CLAUDE.md @@ -32,7 +32,7 @@ clang-format -i ## Important Rules -1. **Consult `.claude/rules/` for coding conventions** (architecture, codestyle, terminology) — these are always-loaded guidelines. **Consult `.claude/skills/` for task-specific workflows** (e.g., `git-commit/` when committing, `testing/` when running tests) +1. **Consult `.agents/rules/` for coding conventions** (architecture, codestyle, terminology) — these are always-loaded guidelines. **Consult `.agents/skills/` for task-specific workflows** (e.g., `git-commit/` when committing, `testing/` when running tests) 2. **Do not modify directories outside your assigned area** unless the user explicitly requests it 3. Create new subdirectories under your assigned directory as needed 4. 
When in doubt, ask the user before making changes to other areas diff --git a/examples/a2a3/tensormap_and_ringbuffer/paged_attention_partial_manual/golden.py b/examples/a2a3/tensormap_and_ringbuffer/paged_attention_partial_manual/golden.py new file mode 100644 index 000000000..b03743d37 --- /dev/null +++ b/examples/a2a3/tensormap_and_ringbuffer/paged_attention_partial_manual/golden.py @@ -0,0 +1,7 @@ +from pathlib import Path +import sys + +_BASE = Path(__file__).resolve().parents[1] / "paged_attention" +sys.path.insert(0, str(_BASE)) + +from golden import compute_golden, generate_inputs # noqa: E402,F401 diff --git a/examples/a2a3/tensormap_and_ringbuffer/paged_attention_partial_manual/kernels/kernel_config.py b/examples/a2a3/tensormap_and_ringbuffer/paged_attention_partial_manual/kernels/kernel_config.py new file mode 100644 index 000000000..534a84f19 --- /dev/null +++ b/examples/a2a3/tensormap_and_ringbuffer/paged_attention_partial_manual/kernels/kernel_config.py @@ -0,0 +1,72 @@ +# Copyright (c) PyPTO Contributors. +# This program is free software, you can redistribute it and/or modify it under the terms and conditions of +# CANN Open Software License Agreement Version 2.0 (the "License"). +# Please refer to the License for details. You may not use this file except in compliance with the License. +# THIS SOFTWARE IS PROVIDED ON AN "AS IS" BASIS, WITHOUT WARRANTIES OF ANY KIND, EITHER EXPRESS OR IMPLIED, +# INCLUDING BUT NOT LIMITED TO NON-INFRINGEMENT, MERCHANTABILITY, OR FITNESS FOR A PARTICULAR PURPOSE. +# See LICENSE in the root of the software repository for the full text of the License. 
+# ----------------------------------------------------------------------------------------------------------- +from pathlib import Path + +from task_interface import ArgDirection as D # pyright: ignore[reportAttributeAccessIssue] + +_ROOT = Path(__file__).parent +_PA_KERNELS = _ROOT.parent.parent / "paged_attention" / "kernels" + +ORCHESTRATION = { + "source": str(_ROOT / "orchestration" / "paged_attention_orch.cpp"), + "function_name": "aicpu_orchestration_entry", + "signature": [D.IN, D.IN, D.IN, D.IN, D.IN, D.OUT], +} + +KERNELS = [ + { + "func_id": 0, + "name": "QK", + "source": str(_PA_KERNELS / "aic" / "aic_qk_matmul.cpp"), + "core_type": "aic", + "signature": [D.IN, D.IN, D.OUT], + }, + { + "func_id": 2, + "name": "PV", + "source": str(_PA_KERNELS / "aic" / "aic_pv_matmul.cpp"), + "core_type": "aic", + "signature": [D.IN, D.IN, D.OUT], + }, + { + "func_id": 4, + "name": "AIC_HUB", + "source": str(_PA_KERNELS / "aic" / "aic_hub.cpp"), + "core_type": "aic", + "signature": [], + }, + { + "func_id": 1, + "name": "SF", + "source": str(_PA_KERNELS / "aiv" / "aiv_softmax_prepare.cpp"), + "core_type": "aiv", + "signature": [D.IN, D.OUT, D.OUT, D.OUT], + }, + { + "func_id": 3, + "name": "UP", + "source": str(_PA_KERNELS / "aiv" / "aiv_online_update.cpp"), + "core_type": "aiv", + "signature": [D.IN, D.IN, D.IN, D.INOUT, D.INOUT, D.INOUT, D.INOUT], + }, + { + "func_id": 5, + "name": "AIV_HUB", + "source": str(_PA_KERNELS / "aiv" / "aiv_hub.cpp"), + "core_type": "aiv", + "signature": [], + }, +] + +RUNTIME_CONFIG = { + "runtime": "tensormap_and_ringbuffer", + "aicpu_thread_num": 4, + "orch_thread_num": 1, + "block_dim": 24, +} diff --git a/examples/a2a3/tensormap_and_ringbuffer/paged_attention_partial_manual/kernels/orchestration/paged_attention_orch.cpp b/examples/a2a3/tensormap_and_ringbuffer/paged_attention_partial_manual/kernels/orchestration/paged_attention_orch.cpp new file mode 100644 index 000000000..8a7476953 --- /dev/null +++ 
b/examples/a2a3/tensormap_and_ringbuffer/paged_attention_partial_manual/kernels/orchestration/paged_attention_orch.cpp @@ -0,0 +1,170 @@ +/* + * Copyright (c) PyPTO Contributors. + * This program is free software, you can redistribute it and/or modify it under the terms and conditions of + * CANN Open Software License Agreement Version 2.0 (the "License"). + * Please refer to the License for details. You may not use this file except in compliance with the License. + * THIS SOFTWARE IS PROVIDED ON AN "AS IS" BASIS, WITHOUT WARRANTIES OF ANY KIND, EITHER EXPRESS OR IMPLIED, + * INCLUDING BUT NOT LIMITED TO NON-INFRINGEMENT, MERCHANTABILITY, OR FITNESS FOR A PARTICULAR PURPOSE. + * See LICENSE in the root of the software repository for the full text of the License. + * ----------------------------------------------------------------------------------------------------------- + */ + +#include +#include + +#include +#include + +#include "pto_orchestration_api.h" // NOLINT(build/include_subdir) + +#define FUNC_QK_MATMUL 0 +#define FUNC_SOFTMAX_PREPARE 1 +#define FUNC_PV_MATMUL 2 +#define FUNC_ONLINE_UPDATE 3 +#define FUNC_AIC_HUB 4 +#define FUNC_AIV_HUB 5 + +extern "C" { + +__attribute__((visibility("default"))) PTO2OrchestrationConfig +aicpu_orchestration_config(const ChipStorageTaskArgs &orch_args) { + (void)orch_args; // NOLINT(readability/casting) + return PTO2OrchestrationConfig{ + .expected_arg_count = 7, + }; +} + +__attribute__((visibility("default"))) void +aicpu_orchestration_entry(const ChipStorageTaskArgs &orch_args, int orch_thread_num, int orch_thread_index) { + uint64_t batch = orch_args.tensor(0).shapes[0]; + uint64_t num_heads = orch_args.tensor(0).shapes[1]; + uint64_t head_dim = orch_args.tensor(0).shapes[2]; + DataType data_type = orch_args.tensor(0).dtype; + uint64_t block_size = orch_args.tensor(1).shapes[1]; + uint64_t block_num = orch_args.tensor(3).shapes[1]; + uint64_t scale_value = orch_args.scalar(0); + + uint64_t q_head_num = num_heads; + 
uint64_t q_tile = 16;
+    uint64_t q_loop = (q_head_num + q_tile - 1) / q_tile;
+
+    uint64_t b_start = batch * orch_thread_index / orch_thread_num;
+    uint64_t b_end = batch * (orch_thread_index + 1) / orch_thread_num;
+
+    void *query_ptr = orch_args.tensor(0).data_as();
+    void *kc_ptr = orch_args.tensor(1).data_as();
+    void *vc_ptr = orch_args.tensor(2).data_as();
+    void *out_ptr = orch_args.tensor(5).data_as();
+
+    uint64_t total_blocks_count = orch_args.tensor(1).shapes[0];
+    uint64_t kv_total_rows = total_blocks_count * block_size;
+
+    uint32_t query_shapes[2] = {static_cast<uint32_t>(batch * num_heads), static_cast<uint32_t>(head_dim)};
+    uint32_t key_cache_shapes[2] = {static_cast<uint32_t>(kv_total_rows), static_cast<uint32_t>(head_dim)};
+    uint32_t value_cache_shapes[2] = {static_cast<uint32_t>(kv_total_rows), static_cast<uint32_t>(head_dim)};
+    uint32_t out_shapes[2] = {static_cast<uint32_t>(batch * num_heads), static_cast<uint32_t>(head_dim)};
+    Tensor query = make_tensor_external(query_ptr, query_shapes, 2, data_type);
+    Tensor key_cache = make_tensor_external(kc_ptr, key_cache_shapes, 2, data_type);
+    Tensor value_cache = make_tensor_external(vc_ptr, value_cache_shapes, 2, data_type);
+    Tensor out = make_tensor_external(out_ptr, out_shapes, 2, DataType::FLOAT32);
+
+    int *host_block_table = orch_args.tensor(3).data_as();
+    int *host_context_lens = orch_args.tensor(4).data_as();
+
+    uint32_t tile2d_shapes[2] = {static_cast<uint32_t>(q_tile), static_cast<uint32_t>(head_dim)};
+    uint32_t scalar_shapes[1] = {static_cast<uint32_t>(q_tile)};
+    uint32_t sij_shapes[2] = {static_cast<uint32_t>(q_tile), static_cast<uint32_t>(block_size)};
+    TensorCreateInfo tile2d_ci(tile2d_shapes, 2, DataType::FLOAT32);
+    TensorCreateInfo scalar_ci(scalar_shapes, 1, DataType::FLOAT32);
+    TensorCreateInfo sij_ci(sij_shapes, 2, DataType::FLOAT32);
+    TensorCreateInfo pij_f16_ci(sij_shapes, 2, data_type);
+
+    for (uint64_t b_idx = b_start; b_idx < b_end; b_idx++) {
+        uint64_t cur_seq = host_context_lens[b_idx];
+        uint64_t bn_this_batch = (cur_seq + block_size - 1) / block_size;
+        for (uint64_t q_idx = 0; q_idx < q_loop; q_idx++) {
+            PTO2_SCOPE() {
+                uint32_t cur_offset = static_cast<uint32_t>(b_idx * q_head_num + q_idx * q_tile);
+                uint32_t qi_offsets[2] = {cur_offset, 0};
+                uint32_t out_view_offsets[2] = {cur_offset, 0};
+                Tensor qi = query.view(tile2d_shapes, qi_offsets);
+                Tensor out_view = out.view(tile2d_shapes, out_view_offsets);
+
+                Arg params_inplace;
+                params_inplace.add_output(tile2d_ci);
+                params_inplace.add_output(scalar_ci);
+                params_inplace.add_output(scalar_ci);
+                TaskOutputTensors hub_outs = pto2_rt_submit_aiv_task(FUNC_AIV_HUB, params_inplace);
+                const Tensor &oi = hub_outs.get_ref(0);
+                const Tensor &li_update = hub_outs.get_ref(1);
+                const Tensor &mi_update = hub_outs.get_ref(2);
+
+                for (uint64_t bn = 0; bn < bn_this_batch; bn++) {
+                    uint64_t cur_block_idx = host_block_table[b_idx * block_num + bn];
+                    uint64_t valid_len =
+                        block_size < (cur_seq - bn * block_size) ? block_size : (cur_seq - bn * block_size);
+                    uint32_t kv_shapes[2] = {static_cast<uint32_t>(block_size), static_cast<uint32_t>(head_dim)};
+                    uint32_t kv_offsets[2] = {static_cast<uint32_t>(cur_block_idx * block_size), 0};
+                    Tensor kj = key_cache.view(kv_shapes, kv_offsets);
+                    Tensor vj = value_cache.view(kv_shapes, kv_offsets);
+
+                    PTO2_SCOPE(PTO2ScopeMode::MANUAL) {
+                        Arg params_qk;
+                        params_qk.add_input(qi);
+                        params_qk.add_input(kj);
+                        params_qk.add_output(sij_ci);
+                        PTO2ManualSubmitResult qk_outs = pto2_rt_submit_aic_task_manual(FUNC_QK_MATMUL, params_qk);
+                        const Tensor &sij = qk_outs.outputs.get_ref(0);
+
+                        uint32_t sij_valid_shapes[2] = {
+                            static_cast<uint32_t>(q_tile), static_cast<uint32_t>(valid_len)
+                        };
+                        uint32_t sij_valid_offsets[2] = {0, 0};
+                        Tensor sij_valid = sij.view(sij_valid_shapes, sij_valid_offsets);
+
+                        Arg params_sf;
+                        params_sf.add_input(sij_valid);
+                        params_sf.add_output(pij_f16_ci);
+                        params_sf.add_output(scalar_ci);
+                        params_sf.add_output(scalar_ci);
+                        params_sf.add_scalar(scale_value);
+                        PTO2ManualSubmitResult sf_outs =
+                            pto2_rt_submit_aiv_task_manual(FUNC_SOFTMAX_PREPARE, params_sf);
+                        const Tensor &pij_f16 =
sf_outs.outputs.get_ref(0); + const Tensor &mi = sf_outs.outputs.get_ref(1); + const Tensor &li = sf_outs.outputs.get_ref(2); + + Arg params_pv; + params_pv.add_input(pij_f16); + params_pv.add_input(vj); + params_pv.add_output(tile2d_ci); + PTO2ManualSubmitResult pv_outs = pto2_rt_submit_aic_task_manual(FUNC_PV_MATMUL, params_pv); + const Tensor &oi_tmp = pv_outs.outputs.get_ref(0); + + uint64_t is_first = (bn == 0) ? 1 : 0; + uint64_t is_last = (bn == bn_this_batch - 1) ? 1 : 0; + + Arg params_up; + params_up.add_input(mi); + params_up.add_input(li); + params_up.add_input(oi_tmp); + params_up.add_inout(mi_update); + params_up.add_inout(li_update); + params_up.add_inout(oi); + params_up.add_inout(out_view); + params_up.add_scalar(is_first); + params_up.add_scalar(is_last); + PTO2ManualSubmitResult up_outs = pto2_rt_submit_aiv_task_manual(FUNC_ONLINE_UPDATE, params_up); + + pto2_rt_add_dependency(qk_outs.task_id, sf_outs.task_id); + pto2_rt_add_dependency(sf_outs.task_id, pv_outs.task_id); + pto2_rt_add_dependency(sf_outs.task_id, up_outs.task_id); + pto2_rt_add_dependency(pv_outs.task_id, up_outs.task_id); + } + } + } + } + } +} + +} // extern "C" diff --git a/src/a2a3/runtime/tensormap_and_ringbuffer/orchestration/pto_orchestration_api.h b/src/a2a3/runtime/tensormap_and_ringbuffer/orchestration/pto_orchestration_api.h index cf752ef2d..c54c80603 100644 --- a/src/a2a3/runtime/tensormap_and_ringbuffer/orchestration/pto_orchestration_api.h +++ b/src/a2a3/runtime/tensormap_and_ringbuffer/orchestration/pto_orchestration_api.h @@ -34,6 +34,7 @@ #include // Type headers needed by orchestration +#include "pto_task_id.h" // PTO2TaskId // NOLINT(build/include_subdir) #include "pto_submit_types.h" // MixedKernels, INVALID_KERNEL_ID, subtask slots // NOLINT(build/include_subdir) #include "pto_types.h" // Arg, TaskOutputTensors, TensorArgType // NOLINT(build/include_subdir) #include "task_args.h" // ChipStorageTaskArgs, ContinuousTensor // NOLINT(build/include_subdir) @@ -84,6 
+85,16 @@ inline Tensor from_tensor_arg(const ContinuousTensor &t, bool manual_dep = false // Ops Table and Opaque Runtime // ============================================================================= +enum class PTO2ScopeMode : uint8_t { + AUTO = 0, + MANUAL = 1, +}; + +struct PTO2ManualSubmitResult { + PTO2TaskId task_id; + TaskOutputTensors outputs; +}; + /** * Forward declaration — the orchestration sees PTO2Runtime as a partial * struct whose first field is the ops pointer. The full definition @@ -115,7 +126,9 @@ void pto2_framework_bind_runtime(PTO2Runtime *rt); */ typedef struct PTO2RuntimeOps { TaskOutputTensors (*submit_task)(PTO2Runtime *rt, const MixedKernels &mixed_kernels, const Arg &args); - void (*scope_begin)(PTO2Runtime *rt); + PTO2ManualSubmitResult (*submit_task_manual)(PTO2Runtime *rt, const MixedKernels &mixed_kernels, const Arg &args); + void (*add_dependency)(PTO2Runtime *rt, PTO2TaskId producer, PTO2TaskId consumer); + void (*scope_begin)(PTO2Runtime *rt, PTO2ScopeMode mode); void (*scope_end)(PTO2Runtime *rt); void (*orchestration_done)(PTO2Runtime *rt); bool (*is_fatal)(PTO2Runtime *rt); @@ -179,12 +192,16 @@ static inline TaskOutputTensors alloc_tensors(const CIs &...cis) { always_assert(!args.has_error && "alloc_tensors failed to construct output-only Arg"); return alloc_tensors(args); } - static inline TaskOutputTensors pto2_rt_submit_task(const MixedKernels &mixed_kernels, const Arg &args) { PTO2Runtime *rt = pto2_current_runtime(); return rt->ops->submit_task(rt, mixed_kernels, args); } +static inline PTO2ManualSubmitResult pto2_rt_submit_task_manual(const MixedKernels &mixed_kernels, const Arg &args) { + PTO2Runtime *rt = pto2_current_runtime(); + return rt->ops->submit_task_manual(rt, mixed_kernels, args); +} + /** * Convenience wrapper: submit an AIC-only task. 
*/ @@ -205,9 +222,28 @@ static inline TaskOutputTensors pto2_rt_submit_aiv_task(int32_t kernel_id, const return rt->ops->submit_task(rt, mk, args); } -static inline void pto2_rt_scope_begin() { +static inline PTO2ManualSubmitResult pto2_rt_submit_aic_task_manual(int32_t kernel_id, const Arg &args) { + PTO2Runtime *rt = pto2_current_runtime(); + MixedKernels mk; + mk.aic_kernel_id = kernel_id; + return rt->ops->submit_task_manual(rt, mk, args); +} + +static inline PTO2ManualSubmitResult pto2_rt_submit_aiv_task_manual(int32_t kernel_id, const Arg &args) { + PTO2Runtime *rt = pto2_current_runtime(); + MixedKernels mk; + mk.aiv0_kernel_id = kernel_id; + return rt->ops->submit_task_manual(rt, mk, args); +} + +static inline void pto2_rt_add_dependency(PTO2TaskId producer, PTO2TaskId consumer) { + PTO2Runtime *rt = pto2_current_runtime(); + rt->ops->add_dependency(rt, producer, consumer); +} + +static inline void pto2_rt_scope_begin(PTO2ScopeMode mode = PTO2ScopeMode::AUTO) { PTO2Runtime *rt = pto2_current_runtime(); - rt->ops->scope_begin(rt); + rt->ops->scope_begin(rt, mode); } static inline void pto2_rt_scope_end() { @@ -300,9 +336,9 @@ static inline void set_tensor_data(const Tensor &tensor, uint32_t ndims, const u */ class PTO2ScopeGuard { public: // NOLINT(whitespace/indent) - PTO2ScopeGuard() : + explicit PTO2ScopeGuard(PTO2ScopeMode mode = PTO2ScopeMode::AUTO) : rt_(pto2_current_runtime()) { - rt_->ops->scope_begin(rt_); + rt_->ops->scope_begin(rt_, mode); } ~PTO2ScopeGuard() { rt_->ops->scope_end(rt_); } @@ -313,7 +349,8 @@ class PTO2ScopeGuard { #define _PTO2_CONCATENATE_IMPL(x, y) x##y #define _PTO2_CONCATENATE(x, y) _PTO2_CONCATENATE_IMPL(x, y) -#define PTO2_SCOPE_GUARD() [[maybe_unused]] PTO2ScopeGuard _PTO2_CONCATENATE(scope_guard_, __COUNTER__) +#define PTO2_SCOPE_GUARD(...) 
\
+    [[maybe_unused]] PTO2ScopeGuard _PTO2_CONCATENATE(scope_guard_, __COUNTER__) { __VA_ARGS__ }
 
 /**
  * Scoped block macro:
@@ -321,7 +358,7 @@ class PTO2ScopeGuard {
  *     pto2_rt_submit_task(...);
  *   }
  */
-#define PTO2_SCOPE() if (PTO2_SCOPE_GUARD(); true)
+#define PTO2_SCOPE(...) if (PTO2_SCOPE_GUARD(__VA_ARGS__); true)
 
 // =============================================================================
 // Orchestration Config
diff --git a/src/a2a3/runtime/tensormap_and_ringbuffer/runtime/pto_orchestrator.cpp b/src/a2a3/runtime/tensormap_and_ringbuffer/runtime/pto_orchestrator.cpp
index defc1ec49..1edd9d047 100644
--- a/src/a2a3/runtime/tensormap_and_ringbuffer/runtime/pto_orchestrator.cpp
+++ b/src/a2a3/runtime/tensormap_and_ringbuffer/runtime/pto_orchestrator.cpp
@@ -345,7 +345,6 @@ bool pto2_orchestrator_init(
         sm_handle->task_descriptors[r], sm_handle->header->rings[r].task_window_size, &fc.current_task_index,
         &fc.last_task_alive, ring_heap_base, heap_size, &sm_handle->header->orch_error_code
     );
-
     size_t fanin_pool_bytes =
         PTO2_ALIGN_UP(static_cast<size_t>(dep_pool_capacity) * sizeof(PTO2FaninSpillEntry), PTO2_ALIGN_SIZE);
     PTO2FaninSpillEntry *fanin_entries =
@@ -393,9 +392,20 @@ bool pto2_orchestrator_init(
     int32_t init_cap = PTO2_SCOPE_TASKS_INIT_CAP;
     orch->scope_tasks = reinterpret_cast<PTO2TaskSlotState **>(malloc(init_cap * sizeof(PTO2TaskSlotState *)));
     orch->scope_begins = reinterpret_cast<int32_t *>(malloc(max_depth * sizeof(int32_t)));
-    if (!orch->scope_tasks || !orch->scope_begins) {
+    orch->scope_modes = reinterpret_cast<PTO2ScopeMode *>(malloc(max_depth * sizeof(PTO2ScopeMode)));
+    orch->manual_task_meta_begins = reinterpret_cast<int32_t *>(malloc(max_depth * sizeof(int32_t)));
+    orch->manual_edge_begins = reinterpret_cast<int32_t *>(malloc(max_depth * sizeof(int32_t)));
+    orch->manual_task_meta = reinterpret_cast<PTO2ManualTaskMeta *>(malloc(init_cap * sizeof(PTO2ManualTaskMeta)));
+    orch->manual_edges = reinterpret_cast<PTO2ManualEdge *>(malloc(init_cap * sizeof(PTO2ManualEdge)));
+    if (!orch->scope_tasks || !orch->scope_begins || !orch->scope_modes ||
!orch->manual_task_meta_begins ||
+        !orch->manual_edge_begins || !orch->manual_task_meta || !orch->manual_edges) {
         free(orch->scope_tasks);
         free(orch->scope_begins);
+        free(orch->scope_modes);
+        free(orch->manual_task_meta_begins);
+        free(orch->manual_edge_begins);
+        free(orch->manual_task_meta);
+        free(orch->manual_edges);
         for (int r = 0; r < PTO2_MAX_RING_DEPTH; r++) {
             free(orch->rings[r].fanin_pool.base);
             free(orch->rings[r].dep_pool.base);
         }
         return false;
     }
     orch->scope_tasks_capacity = init_cap;
     orch->scope_stack_top = -1;
     orch->scope_stack_capacity = max_depth;
+    orch->manual_scope_active = false;
+    orch->manual_task_meta_size = 0;
+    orch->manual_task_meta_capacity = init_cap;
+    orch->manual_edges_size = 0;
+    orch->manual_edges_capacity = init_cap;
     return true;
 }
 
@@ -425,6 +440,16 @@ void pto2_orchestrator_destroy(PTO2OrchestratorState *orch) {
     orch->scope_tasks = NULL;
     free(orch->scope_begins);
     orch->scope_begins = NULL;
+    free(orch->scope_modes);
+    orch->scope_modes = NULL;
+    free(orch->manual_task_meta_begins);
+    orch->manual_task_meta_begins = NULL;
+    free(orch->manual_edge_begins);
+    orch->manual_edge_begins = NULL;
+    free(orch->manual_task_meta);
+    orch->manual_task_meta = NULL;
+    free(orch->manual_edges);
+    orch->manual_edges = NULL;
 }
 
 void pto2_orchestrator_set_scheduler(PTO2OrchestratorState *orch, PTO2SchedulerState *scheduler) {
@@ -447,14 +472,77 @@ static void scope_tasks_push(PTO2OrchestratorState *orch, PTO2TaskSlotState *tas
     orch->scope_tasks[orch->scope_tasks_size++] = task_slot_state;
 }
 
-void pto2_scope_begin(PTO2OrchestratorState *orch) {
+static void manual_task_meta_push(PTO2OrchestratorState *orch, const PTO2ManualTaskMeta &meta) {
+    if (orch->manual_task_meta_size >= orch->manual_task_meta_capacity) {
+        int32_t new_cap = orch->manual_task_meta_capacity * 2;
+        PTO2ManualTaskMeta *new_buf = reinterpret_cast<PTO2ManualTaskMeta *>(
+            realloc(orch->manual_task_meta, new_cap * sizeof(PTO2ManualTaskMeta))
+        );
+        assert(new_buf &&
"Failed to grow manual task meta buffer");
+        orch->manual_task_meta = new_buf;
+        orch->manual_task_meta_capacity = new_cap;
+    }
+    orch->manual_task_meta[orch->manual_task_meta_size++] = meta;
+}
+
+static void manual_edge_push(PTO2OrchestratorState *orch, const PTO2ManualEdge &edge) {
+    if (orch->manual_edges_size >= orch->manual_edges_capacity) {
+        int32_t new_cap = orch->manual_edges_capacity * 2;
+        PTO2ManualEdge *new_buf =
+            reinterpret_cast<PTO2ManualEdge *>(realloc(orch->manual_edges, new_cap * sizeof(PTO2ManualEdge)));
+        assert(new_buf && "Failed to grow manual edge buffer");
+        orch->manual_edges = new_buf;
+        orch->manual_edges_capacity = new_cap;
+    }
+    orch->manual_edges[orch->manual_edges_size++] = edge;
+}
+
+static bool in_manual_scope(const PTO2OrchestratorState *orch) {
+    return orch->scope_stack_top >= 0 && orch->scope_modes[orch->scope_stack_top] == PTO2ScopeMode::MANUAL;
+}
+
+static int32_t current_manual_scope_begin(const PTO2OrchestratorState *orch) {
+    return orch->scope_begins[orch->scope_stack_top];
+}
+
+static int32_t find_current_manual_scope_task_index(const PTO2OrchestratorState *orch, PTO2TaskId task_id) {
+    if (!in_manual_scope(orch) || !task_id.is_valid()) {
+        return -1;
+    }
+
+    int32_t begin = current_manual_scope_begin(orch);
+    for (int32_t i = begin; i < orch->scope_tasks_size; i++) {
+        PTO2TaskSlotState *slot_state = orch->scope_tasks[i];
+        if (slot_state != nullptr && slot_state->task->task_id == task_id) {
+            return i - begin;
+        }
+    }
+    return -1;
+}
+
+static bool task_owned_by_current_manual_scope(const PTO2OrchestratorState *orch, PTO2TaskId task_id) {
+    return find_current_manual_scope_task_index(orch, task_id) >= 0;
+}
+
+void pto2_scope_begin(PTO2OrchestratorState *orch, PTO2ScopeMode mode) {
     if (orch->fatal) {
         return;
     }
     assert(orch->scope_stack_top < static_cast<int32_t>(orch->scope_stack_capacity - 1) && "Scope stack overflow");
+    if (in_manual_scope(orch)) {
+        LOG_ERROR("nested scope inside PTO2_SCOPE(PTO2ScopeMode::MANUAL) is not supported in
v1"); + orch->sm_handle->header->orch_error_code.store(PTO2_ERROR_INVALID_ARGS, std::memory_order_release); + orch->fatal = true; + return; + } + ++orch->scope_stack_top; orch->scope_begins[orch->scope_stack_top] = orch->scope_tasks_size; + orch->scope_modes[orch->scope_stack_top] = mode; + orch->manual_scope_active = (mode == PTO2ScopeMode::MANUAL); + orch->manual_task_meta_begins[orch->scope_stack_top] = orch->manual_task_meta_size; + orch->manual_edge_begins[orch->scope_stack_top] = orch->manual_edges_size; } void pto2_scope_end(PTO2OrchestratorState *orch) { @@ -467,8 +555,168 @@ void pto2_scope_end(PTO2OrchestratorState *orch) { uint64_t _se0 = get_sys_cnt_aicpu(); #endif - int32_t begin = orch->scope_begins[orch->scope_stack_top--]; + int32_t top = orch->scope_stack_top; + int32_t begin = orch->scope_begins[top]; int32_t count = orch->scope_tasks_size - begin; + PTO2ScopeMode mode = orch->scope_modes[top]; + int32_t manual_meta_begin = orch->manual_task_meta_begins[top]; + int32_t manual_edge_begin = orch->manual_edge_begins[top]; + + if (mode == PTO2ScopeMode::MANUAL && orch->scheduler && count > 0) { + int32_t manual_task_count = orch->manual_task_meta_size - manual_meta_begin; + if (manual_task_count != count) { + LOG_ERROR("manual scope requires pto2_rt_submit_*_manual for every submitted task"); + orch->sm_handle->header->orch_error_code.store(PTO2_ERROR_INVALID_ARGS, std::memory_order_release); + orch->fatal = true; + return; + } + + for (int32_t ring = 0; ring < PTO2_MAX_RING_DEPTH; ring++) { + int32_t sm_last_task_alive = + orch->sm_handle->header->rings[ring].fc.last_task_alive.load(std::memory_order_acquire); + orch->tensor_map.sync_tensormap(static_cast(ring), sm_last_task_alive); + orch->rings[ring].dep_pool.reclaim(*orch->scheduler, static_cast(ring), sm_last_task_alive); + } + + for (int32_t task_idx = 0; task_idx < count; task_idx++) { + PTO2ManualTaskMeta &meta = orch->manual_task_meta[manual_meta_begin + task_idx]; + PTO2TaskSlotState 
*slot_state = orch->scope_tasks[begin + task_idx]; + if (meta.slot_state != slot_state || meta.scope_task_index != task_idx) { + LOG_ERROR("manual scope task metadata does not match submit order"); + orch->sm_handle->header->orch_error_code.store(PTO2_ERROR_INVALID_ARGS, std::memory_order_release); + orch->fatal = true; + return; + } + + PTO2TaskPayload *payload = slot_state->payload; + PTO2TaskId task_id = slot_state->task->task_id; + uint8_t ring_id = slot_state->ring_id; + auto &dep_pool = orch->rings[ring_id].dep_pool; + auto &fc = orch->sm_handle->header->rings[ring_id].fc; + PTO2FaninBuilder fanin_builder; + fanin_builder.spill_pool = &orch->rings[ring_id].fanin_pool; + + for (int32_t edge_idx = manual_edge_begin; edge_idx < orch->manual_edges_size; edge_idx++) { + const PTO2ManualEdge &edge = orch->manual_edges[edge_idx]; + if (edge.consumer_idx != task_idx) { + continue; + } + + PTO2TaskSlotState *prod_state = orch->scope_tasks[begin + edge.producer_idx]; + if (!pto2_append_fanin_or_fail( + orch, task_id, edge.consumer_idx, TensorArgType::INPUT, prod_state, &fanin_builder, + orch->scheduler, fc, ring_id, "manual explicit dependency" + )) { + return; + } + } + + for (int32_t t = 0; t < meta.tensor_count; t++) { + TensorArgType tag = static_cast(meta.tags[t]); + if (tag == TensorArgType::OUTPUT) { + continue; + } + + const Tensor &tensor = payload->tensors[t]; + bool manual_local_tensor = task_owned_by_current_manual_scope(orch, tensor.owner_task_id); + if (manual_local_tensor) { + continue; + } + + PTO2TaskId owner = tensor.owner_task_id; + if (owner.is_valid()) { + PTO2TaskSlotState *prod_state = + &orch->scheduler->ring_sched_states[owner.ring()].get_slot_state_by_task_id(owner.local()); + if (!pto2_append_fanin_or_fail( + orch, task_id, t, tag, prod_state, &fanin_builder, orch->scheduler, fc, ring_id, + "creator retention" + )) { + return; + } + } + + if (tag != TensorArgType::INPUT && tag != TensorArgType::INOUT) { + continue; + } + if 
(tensor.manual_dep) { + continue; + } + + PTO2LookupResult lookup_result; + orch->tensor_map.lookup(tensor, lookup_result); + + for (int32_t r = 0; r < lookup_result.count; r++) { + PTO2TensorMapEntry &entry = *lookup_result.entries[r].entry; + auto overlap_status = lookup_result.entries[r].overlap_status; + PTO2TaskId prod_task_id = entry.producer_task_id; + PTO2TaskSlotState *prod_state = + &orch->scheduler->ring_sched_states[prod_task_id.ring()].get_slot_state_by_task_id( + prod_task_id.local() + ); + if (!pto2_append_fanin_or_fail( + orch, task_id, t, tag, prod_state, &fanin_builder, orch->scheduler, fc, ring_id, + "overlap lookup" + )) { + return; + } + if (tag == TensorArgType::INOUT && overlap_status == OverlapStatus::COVERED) { + orch->tensor_map.remove_entry(entry); + } + } + } + + int32_t fanin_count = fanin_builder.count; + int32_t inline_count = std::min(fanin_count, PTO2_FANIN_INLINE_CAP); + int32_t spill_count = fanin_count - inline_count; + dep_pool.ensure_space(*orch->scheduler, fc, ring_id, fanin_count + 1); + + slot_state->task_state.store(PTO2_TASK_PENDING, std::memory_order_relaxed); + slot_state->fanin_count = fanin_count + 1; + payload->fanin_actual_count = fanin_count; + payload->fanin_spill_start = (spill_count > 0) ? fanin_builder.spill_start : 0; + payload->fanin_spill_pool = (spill_count > 0) ? 
fanin_builder.spill_pool : nullptr; + for (int32_t i = 0; i < inline_count; i++) { + payload->fanin_inline_slot_states[i] = fanin_builder.inline_slots[i]; + } + + int32_t early_finished = 0; + pto2_for_each_fanin_slot_state(*payload, [&](PTO2TaskSlotState *producer_slot) { + PTO2TaskSlotState &producer_slot_state = *producer_slot; +#if PTO2_ORCH_PROFILING + pto2_fanout_lock(producer_slot_state, g_orch_fanin_atomic_count, g_orch_fanin_wait_cycle); +#else + pto2_fanout_lock(producer_slot_state); +#endif + producer_slot_state.fanout_count += 1; + int32_t prod_state = producer_slot_state.task_state.load(std::memory_order_acquire); + if (prod_state >= PTO2_TASK_COMPLETED) { + early_finished++; + } else { + producer_slot_state.fanout_head = dep_pool.prepend(producer_slot_state.fanout_head, slot_state); + } + pto2_fanout_unlock(producer_slot_state); + return true; + }); + if (early_finished > 0) { + slot_state->fanin_refcount.fetch_add(early_finished, std::memory_order_acq_rel); + } + slot_state->dep_pool_mark = dep_pool.top; + + for (int32_t t = 0; t < meta.tensor_count; t++) { + TensorArgType tag = static_cast(meta.tags[t]); + if (tag != TensorArgType::INOUT && tag != TensorArgType::OUTPUT_EXISTING) { + continue; + } + const Tensor &tensor = payload->tensors[t]; + if (task_owned_by_current_manual_scope(orch, tensor.owner_task_id) || tensor.manual_dep) { + continue; + } + orch->tensor_map.insert(tensor, task_id); + } + } + + orch->scheduler->publish_manual_scope_tasks(&orch->scope_tasks[begin], count); + } if (orch->scheduler && count > 0) { orch->scheduler->on_scope_end(&orch->scope_tasks[begin], count); @@ -476,6 +724,10 @@ void pto2_scope_end(PTO2OrchestratorState *orch) { // Rewind the task buffer — these entries are no longer needed orch->scope_tasks_size = begin; + orch->manual_task_meta_size = manual_meta_begin; + orch->manual_edges_size = manual_edge_begin; + orch->scope_stack_top--; + orch->manual_scope_active = false; #if PTO2_ORCH_PROFILING uint64_t _se1 = 
get_sys_cnt_aicpu(); @@ -487,8 +739,9 @@ void pto2_scope_end(PTO2OrchestratorState *orch) { // ============================================================================= // Task Submission // ============================================================================= -TaskOutputTensors -pto2_submit_mixed_task(PTO2OrchestratorState *orch, const MixedKernels &mixed_kernels, const Arg &args) { +static TaskOutputTensors pto2_submit_mixed_task_impl( + PTO2OrchestratorState *orch, const MixedKernels &mixed_kernels, const Arg &args, bool manual_submit +) { CYCLE_COUNT_START(); TaskOutputTensors result; @@ -511,29 +764,22 @@ pto2_submit_mixed_task(PTO2OrchestratorState *orch, const MixedKernels &mixed_ke orch->fatal = true; return result; } - - // === Validate submit inputs === - uint8_t active_mask = pto2_mixed_kernels_to_active_mask(mixed_kernels); - always_assert(active_mask != 0 && "MixedKernels must have at least one active slot"); - - int16_t block_num = args.launch_spec.block_num(); - always_assert(block_num >= 1 && "block_num must be >= 1"); - - // Normalize single-AIV tasks: if only aiv1 is set (no aic, no aiv0), move - // it to the aiv0 slot. This guarantees the dispatch path can always use - // PTO2SubtaskSlot::AIV0 for single-AIV shapes without inspecting active_mask. - // Mixed tasks (AIC+AIV) keep their original AIV identity so the correct - // hardware channel (AIV0→AIC vs AIV1→AIC) is used at dispatch time. - MixedKernels normalized = mixed_kernels; - bool has_aic = (active_mask & PTO2_SUBTASK_MASK_AIC) != 0; - bool has_aiv0 = (active_mask & PTO2_SUBTASK_MASK_AIV0) != 0; - bool has_aiv1 = (active_mask & PTO2_SUBTASK_MASK_AIV1) != 0; if (!has_aic && has_aiv1 && !has_aiv0) { normalized.aiv0_kernel_id = normalized.aiv1_kernel_id; normalized.aiv1_kernel_id = INVALID_KERNEL_ID; active_mask = pto2_mixed_kernels_to_active_mask(normalized); } + // Submission without an open scope is illegal. 
+    always_assert(orch->scope_stack_top >= 0 && "Cannot submit task outside a scope");
+
+    if (!manual_submit && in_manual_scope(orch)) {
+        LOG_ERROR("PTO2_SCOPE(PTO2ScopeMode::MANUAL) requires pto2_rt_submit_*_manual task APIs");
+        orch->sm_handle->header->orch_error_code.store(PTO2_ERROR_INVALID_ARGS, std::memory_order_release);
+        orch->fatal = true;
+        return result;
+    }
+
     // Encode require_sync_start into active_mask bit 3 (only meaningful for tasks with block_num > 1)
     if (block_num > 1 && args.launch_spec.require_sync_start()) {
         // Deadlock check: block_num >= total available slots of the required type.
@@ -562,9 +808,8 @@ pto2_submit_mixed_task(PTO2OrchestratorState *orch, const MixedKernels &mixed_ke
     int32_t slot = prepared.alloc_result.slot;
 
     PTO2FaninBuilder fanin_builder;
-    fanin_builder.count = 0;
-    fanin_builder.spill_start = 0;
     fanin_builder.spill_pool = &orch->rings[ring_id].fanin_pool;
+    bool defer_publish = manual_submit;
 
     CYCLE_COUNT_LAP_RECORD(g_orch_alloc_cycle, AicpuPhaseId::ORCH_ALLOC, task_id.raw);
 
@@ -588,51 +833,54 @@ pto2_submit_mixed_task(PTO2OrchestratorState *orch, const MixedKernels &mixed_ke
     CYCLE_COUNT_LAP_RECORD(g_orch_sync_cycle, AicpuPhaseId::ORCH_SYNC, task_id.raw);
 
     // === STEP 3: Lookup inputs + materialize runtime-created outputs ===
-    for (int i = 0; i < args.tensor_count(); i++) {
-        TensorArgType ptype = args.tag(i);
-        if (ptype == TensorArgType::OUTPUT) {
-            // Runtime-created OUTPUT tensors are not looked up in the TensorMap since they have no dependencies.
-            continue;
-        }
-
-        const Tensor *tensor = args.tensor(i).ptr;
-
-        // Step A: creator retention — all existing tensors extend their creator lifetime.
-        PTO2TaskId owner = tensor->owner_task_id;
-        if (owner.is_valid() && sched != nullptr) {
-            PTO2TaskSlotState *prod_state =
-                &sched->ring_sched_states[owner.ring()].get_slot_state_by_task_id(owner.local());
-            if (!pto2_append_fanin_or_fail(
-                    orch, task_id, i, ptype, prod_state, &fanin_builder, sched, fc, ring_id, "creator retention"
-                )) {
-                return result;
+    if (!defer_publish) {
+        for (int i = 0; i < args.tensor_count(); i++) {
+            TensorArgType ptype = args.tag(i);
+            if (ptype == TensorArgType::OUTPUT) {
+                // Runtime-created OUTPUT tensors are not looked up in the TensorMap since they have no dependencies.
+                continue;
             }
-        }
-
-        // Step B: only INPUT/INOUT need modifier dependency lookup.
-        if (ptype != TensorArgType::INPUT && ptype != TensorArgType::INOUT) {
-            continue;
-        }
-        if (tensor->manual_dep) {
-            continue;
-        }
+
+            const Tensor *tensor = args.tensor(i).ptr;
+
+            // Step A: creator retention — all existing tensors extend their creator lifetime.
+            PTO2TaskId owner = tensor->owner_task_id;
+            if (owner.is_valid() && sched != nullptr) {
+                PTO2TaskSlotState *prod_state =
+                    &sched->ring_sched_states[owner.ring()].get_slot_state_by_task_id(owner.local());
+                if (!pto2_append_fanin_or_fail(
+                        orch, task_id, i, ptype, prod_state, &fanin_builder, sched, fc, ring_id, "creator retention"
+                    )) {
+                    return result;
+                }
+            }
 
-        PTO2LookupResult lookup_result;
-        orch->tensor_map.lookup(*tensor, lookup_result);
-
-        for (int r = 0; r < lookup_result.count; r++) {
-            PTO2TensorMapEntry &entry = *lookup_result.entries[r].entry;
-            auto overlap_status = lookup_result.entries[r].overlap_status;
-            auto prod_ring = entry.producer_task_id.ring();
-            auto prod_local = entry.producer_task_id.local();
-            PTO2TaskSlotState *prod_state = &sched->ring_sched_states[prod_ring].get_slot_state_by_task_id(prod_local);
-            if (!pto2_append_fanin_or_fail(
-                    orch, task_id, i, ptype, prod_state, &fanin_builder, sched, fc, ring_id, "overlap lookup"
-                )) {
-                return result;
+            // Step B: only INPUT/INOUT need modifier dependency lookup.
+            if (ptype != TensorArgType::INPUT && ptype != TensorArgType::INOUT) {
+                continue;
             }
-            if (ptype == TensorArgType::INOUT && overlap_status == OverlapStatus::COVERED) {
-                orch->tensor_map.remove_entry(entry);
+            if (tensor->manual_dep) {
+                continue;
+            }
+
+            PTO2LookupResult lookup_result;
+            orch->tensor_map.lookup(*tensor, lookup_result);
+
+            for (int r = 0; r < lookup_result.count; r++) {
+                PTO2TensorMapEntry &entry = *lookup_result.entries[r].entry;
+                auto overlap_status = lookup_result.entries[r].overlap_status;
+                auto prod_ring = entry.producer_task_id.ring();
+                auto prod_local = entry.producer_task_id.local();
+                PTO2TaskSlotState *prod_state =
+                    &sched->ring_sched_states[prod_ring].get_slot_state_by_task_id(prod_local);
+                if (!pto2_append_fanin_or_fail(
+                        orch, task_id, i, ptype, prod_state, &fanin_builder, sched, fc, ring_id, "overlap lookup"
+                    )) {
+                    return result;
+                }
+                if (ptype == TensorArgType::INOUT && overlap_status == OverlapStatus::COVERED) {
+                    orch->tensor_map.remove_entry(entry);
+                }
             }
         }
     }
@@ -644,7 +892,7 @@ pto2_submit_mixed_task(PTO2OrchestratorState *orch, const MixedKernels &mixed_ke
     for (int i = 0; i < args.tensor_count(); i++) {
         TensorArgType ptype = args.tag(i);
         if (ptype == TensorArgType::INOUT || ptype == TensorArgType::OUTPUT_EXISTING) {
-            if (!args.tensor(i).ptr->manual_dep) {
+            if (!defer_publish && !args.tensor(i).ptr->manual_dep) {
                 orch->tensor_map.insert(*args.tensor(i).ptr, task_id);
             }
         }
@@ -705,55 +953,66 @@ pto2_submit_mixed_task(PTO2OrchestratorState *orch, const MixedKernels &mixed_ke
     cur_slot_state.block_num = block_num;
     cur_slot_state.next_block_idx = 0;
 
-    auto &dep_pool = orch->rings[ring_id].dep_pool;
-    int32_t fanin_count = fanin_builder.count;
-    int32_t inline_count = std::min(fanin_count, PTO2_FANIN_INLINE_CAP);
-    int32_t spill_count = fanin_count - inline_count;
-    dep_pool.ensure_space(*sched, fc, ring_id, fanin_count + 1);
-
-    int32_t early_finished = 0;
-    cur_slot_state.fanin_count = fanin_count + 1; // +1 redundance for not being ready too early
-    payload->fanin_actual_count = fanin_count;
-    payload->fanin_spill_start = (spill_count > 0) ? fanin_builder.spill_start : 0;
-    payload->fanin_spill_pool = (spill_count > 0) ? fanin_builder.spill_pool : nullptr;
-    for (int i = 0; i < inline_count; i++) {
-        payload->fanin_inline_slot_states[i] = fanin_builder.inline_slots[i];
-    }
-    pto2_for_each_fanin_slot_state(*payload, [&](PTO2TaskSlotState *producer_slot) {
-        PTO2TaskSlotState &producer_slot_state = *producer_slot;
+    if (defer_publish) {
+        cur_slot_state.fanin_count = 1;
+        payload->fanin_actual_count = 0;
+        payload->fanin_spill_start = 0;
+        payload->fanin_spill_pool = nullptr;
+        cur_slot_state.dep_pool_mark = orch->rings[ring_id].dep_pool.top;
+    } else {
+        auto &dep_pool = orch->rings[ring_id].dep_pool;
+        int32_t fanin_count = fanin_builder.count;
+        int32_t inline_count = std::min(fanin_count, PTO2_FANIN_INLINE_CAP);
+        int32_t spill_count = fanin_count - inline_count;
+        dep_pool.ensure_space(*sched, fc, ring_id, fanin_count + 1);
+
+        int32_t early_finished = 0;
+        cur_slot_state.fanin_count = fanin_count + 1; // +1 redundance for not being ready too early
+        payload->fanin_actual_count = fanin_count;
+        payload->fanin_spill_start = (spill_count > 0) ? fanin_builder.spill_start : 0;
+        payload->fanin_spill_pool = (spill_count > 0) ? fanin_builder.spill_pool : nullptr;
+        for (int i = 0; i < inline_count; i++) {
+            payload->fanin_inline_slot_states[i] = fanin_builder.inline_slots[i];
+        }
+        pto2_for_each_fanin_slot_state(*payload, [&](PTO2TaskSlotState *producer_slot) {
+            PTO2TaskSlotState &producer_slot_state = *producer_slot;
 #if PTO2_ORCH_PROFILING
-        pto2_fanout_lock(producer_slot_state, g_orch_fanin_atomic_count, g_orch_fanin_wait_cycle);
+            pto2_fanout_lock(producer_slot_state, g_orch_fanin_atomic_count, g_orch_fanin_wait_cycle);
 #else
-        pto2_fanout_lock(producer_slot_state);
+            pto2_fanout_lock(producer_slot_state);
 #endif
-        producer_slot_state.fanout_count += 1;
-        int32_t prod_state = producer_slot_state.task_state.load(std::memory_order_acquire);
-        if (prod_state >= PTO2_TASK_COMPLETED) {
-            early_finished++;
-        } else {
-            producer_slot_state.fanout_head = dep_pool.prepend(producer_slot_state.fanout_head, &cur_slot_state);
+            // Normal path: prepend consumer to producer's fanout list
+            producer_slot_state.fanout_count += 1;
+            int32_t prod_state = producer_slot_state.task_state.load(std::memory_order_acquire);
+            if (prod_state >= PTO2_TASK_COMPLETED) {
+                // Early return optimization: if producer already completed, we can skip adding dependency and
+                // directly decrement fanin_count
+                early_finished++;
+            } else {
+                producer_slot_state.fanout_head =
+                    dep_pool.prepend(producer_slot_state.fanout_head, &cur_slot_state);
+            }
+            pto2_fanout_unlock(producer_slot_state);
+            return true;
+        });
+        int32_t initial_refcount = early_finished + 1; // +1 for the init release
+        int32_t new_rc =
+            cur_slot_state.fanin_refcount.fetch_add(initial_refcount, std::memory_order_acq_rel) + initial_refcount;
+        if (new_rc >= fanin_count + 1) {
+            PTO2ResourceShape shape = pto2_active_mask_to_shape(active_mask);
+            sched->ready_queues[static_cast<int>(shape)].push(&cur_slot_state);
         }
-        pto2_fanout_unlock(producer_slot_state);
-    });
-    // Combined release: merge early_finished batch with the +1 init release
-    // into a single atomic fetch_add (saves one acq_rel cache-line bounce per task).
-    int32_t initial_refcount = early_finished + 1; // +1 for the init release
-    int32_t new_rc =
-        cur_slot_state.fanin_refcount.fetch_add(initial_refcount, std::memory_order_acq_rel) + initial_refcount;
-    if (new_rc >= fanin_count + 1) {
-        PTO2ResourceShape shape = pto2_active_mask_to_shape(active_mask);
-        sched->ready_queues[static_cast<int>(shape)].push(&cur_slot_state);
-    }
-    // Record dep pool watermark in local slot state (used by tail reclamation)
-    cur_slot_state.dep_pool_mark = orch->rings[ring_id].dep_pool.top;
+        // Record dep pool watermark in local slot state (used by tail reclamation)
+        cur_slot_state.dep_pool_mark = orch->rings[ring_id].dep_pool.top;
 #if PTO2_ORCH_PROFILING
-    // Per producer: fetch_add(fanout_count) + load(task_state) + store(unlock) = 3 atomics
-    // Lock atomics (loads + CAS) are counted inside pto2_fanout_lock
-    g_orch_fanin_atomic_count += fanin_count * 3;
-    if (early_finished > 0) {
-        g_orch_fanin_atomic_count += 1; // fanin_refcount.fetch_add
-    }
+        // Per producer: fetch_add(fanout_count) + load(task_state) + store(unlock) = 3 atomics
+        // Lock atomics (loads + CAS) are counted inside pto2_fanout_lock
+        g_orch_fanin_atomic_count += fanin_count * 3;
+        if (early_finished > 0) {
+            g_orch_fanin_atomic_count += 1; // fanin_refcount.fetch_add
+        }
 #endif
+    }
     }
 
     CYCLE_COUNT_LAP_RECORD(g_orch_fanin_cycle, AicpuPhaseId::ORCH_FANIN, task_id.raw);
@@ -765,6 +1024,7 @@ pto2_submit_mixed_task(PTO2OrchestratorState *orch, const MixedKernels &mixed_ke
 #endif
     g_orch_submit_idx++;
 #endif
+    orch->last_submitted_task_id = task_id;
 
     return result;
 }
@@ -853,6 +1113,61 @@ TaskOutputTensors pto2_alloc_tensors(PTO2OrchestratorState *orch, const Arg &arg
     return outputs;
 }
 
+TaskOutputTensors
+pto2_submit_mixed_task(PTO2OrchestratorState *orch, const MixedKernels &mixed_kernels, const Arg &args) {
+    return pto2_submit_mixed_task_impl(orch, mixed_kernels, args, false);
+}
+
+PTO2ManualSubmitResult
+pto2_submit_mixed_task_manual(PTO2OrchestratorState *orch, const MixedKernels &mixed_kernels, const Arg &args) {
+    PTO2ManualSubmitResult result{};
+    if (!in_manual_scope(orch)) {
+        LOG_ERROR("manual submit APIs require PTO2_SCOPE(PTO2ScopeMode::MANUAL)");
+        orch->sm_handle->header->orch_error_code.store(PTO2_ERROR_INVALID_ARGS, std::memory_order_release);
+        orch->fatal = true;
+        return result;
+    }
+    TaskOutputTensors outputs = pto2_submit_mixed_task_impl(orch, mixed_kernels, args, true);
+    if (orch->fatal || !orch->last_submitted_task_id.is_valid()) {
+        return result;
+    }
+    result.task_id = orch->last_submitted_task_id;
+    result.outputs = outputs;
+
+    PTO2ManualTaskMeta meta{};
+    meta.slot_state = orch->scope_tasks[orch->scope_tasks_size - 1];
+    meta.scope_task_index = orch->scope_tasks_size - 1 - current_manual_scope_begin(orch);
+    meta.tensor_count = static_cast<uint8_t>(args.tensor_count());
+    for (int32_t i = 0; i < args.tensor_count(); i++) {
+        meta.tags[i] = static_cast<uint8_t>(args.tag(i));
+    }
+    manual_task_meta_push(orch, meta);
+    return result;
+}
+
+void pto2_add_dependency(PTO2OrchestratorState *orch, PTO2TaskId producer_id, PTO2TaskId consumer_id) {
+    if (orch->fatal) {
+        return;
+    }
+
+    if (!in_manual_scope(orch)) {
+        LOG_ERROR("pto2_rt_add_dependency is only valid inside PTO2_SCOPE(PTO2ScopeMode::MANUAL)");
+        orch->sm_handle->header->orch_error_code.store(PTO2_ERROR_INVALID_ARGS, std::memory_order_release);
+        orch->fatal = true;
+        return;
+    }
+    int32_t producer_idx = find_current_manual_scope_task_index(orch, producer_id);
+    int32_t consumer_idx = find_current_manual_scope_task_index(orch, consumer_id);
+    if (producer_idx < 0 || consumer_idx < 0) {
+        LOG_ERROR("add_dependency requires producer and consumer to belong to the current manual scope");
+        orch->sm_handle->header->orch_error_code.store(PTO2_ERROR_INVALID_ARGS, std::memory_order_release);
+        orch->fatal = true;
+        return;
+    }
+
+    manual_edge_push(orch, PTO2ManualEdge{.producer_idx = producer_idx, .consumer_idx = consumer_idx});
+}
+
 // =============================================================================
 // Flow Control
 // =============================================================================
diff --git a/src/a2a3/runtime/tensormap_and_ringbuffer/runtime/pto_orchestrator.h b/src/a2a3/runtime/tensormap_and_ringbuffer/runtime/pto_orchestrator.h
index 9db96eaa1..5ce1d14a2 100644
--- a/src/a2a3/runtime/tensormap_and_ringbuffer/runtime/pto_orchestrator.h
+++ b/src/a2a3/runtime/tensormap_and_ringbuffer/runtime/pto_orchestrator.h
@@ -40,6 +40,18 @@
 // =============================================================================
 // Orchestrator State
 // =============================================================================
 
+struct PTO2ManualTaskMeta {
+    PTO2TaskSlotState *slot_state;
+    int32_t scope_task_index;
+    uint8_t tensor_count;
+    uint8_t tags[MAX_TENSOR_ARGS];
+};
+
+struct PTO2ManualEdge {
+    int32_t producer_idx;
+    int32_t consumer_idx;
+};
+
 /**
  * Orchestrator state structure (private to Orchestrator)
  *
@@ -63,8 +75,19 @@ struct PTO2OrchestratorState {
     int32_t scope_tasks_size;      // Number of task IDs currently in the buffer
     int32_t scope_tasks_capacity;  // Allocated capacity of scope_tasks
     int32_t *scope_begins;         // scope_begins[i] = start index of scope i in scope_tasks
-    int32_t scope_stack_top;       // Current top of stack (-1 = no scope open)
-    uint64_t scope_stack_capacity; // Max nesting depth (PTO2_MAX_SCOPE_DEPTH)
+    PTO2ScopeMode *scope_modes;       // Mode for each scope frame
+    int32_t *manual_task_meta_begins; // start index in manual_task_meta for each scope
+    int32_t *manual_edge_begins;      // start index in manual_edges for each scope
+    int32_t scope_stack_top;       // Current top of stack (-1 = no scope open)
+    uint64_t scope_stack_capacity; // Max nesting depth (PTO2_MAX_SCOPE_DEPTH)
+    bool manual_scope_active{false};
+    PTO2ManualTaskMeta *manual_task_meta;
+    int32_t manual_task_meta_size;
+    int32_t manual_task_meta_capacity;
+    PTO2ManualEdge *manual_edges;
+    int32_t manual_edges_size;
+    int32_t manual_edges_capacity;
+    PTO2TaskId last_submitted_task_id{PTO2TaskId::invalid()};
 
     // === SCHEDULER REFERENCE ===
     // Note: In simulated mode, orchestrator and scheduler share address space
@@ -151,7 +174,7 @@ void pto2_orchestrator_set_scheduler(PTO2OrchestratorState *orch, PTO2SchedulerS
  * Tasks submitted while this scope is at the top of the stack are
  * owned by it and have their fanout_count initialized to 1.
  */
-void pto2_scope_begin(PTO2OrchestratorState *orch);
+void pto2_scope_begin(PTO2OrchestratorState *orch, PTO2ScopeMode mode = PTO2ScopeMode::AUTO);
 
 /**
  * End current scope
@@ -190,6 +213,10 @@ pto2_submit_mixed_task(PTO2OrchestratorState *orch, const MixedKernels &mixed_ke
 * task id for scope lifetime and future creator-retention dependencies.
 */
 TaskOutputTensors pto2_alloc_tensors(PTO2OrchestratorState *orch, const Arg &args);
+PTO2ManualSubmitResult
+pto2_submit_mixed_task_manual(PTO2OrchestratorState *orch, const MixedKernels &mixed_kernels, const Arg &args);
+
+void pto2_add_dependency(PTO2OrchestratorState *orch, PTO2TaskId producer, PTO2TaskId consumer);
 
 // =============================================================================
 // Flow Control
diff --git a/src/a2a3/runtime/tensormap_and_ringbuffer/runtime/pto_runtime2.cpp b/src/a2a3/runtime/tensormap_and_ringbuffer/runtime/pto_runtime2.cpp
index 8085ed63d..5c0f837e3 100644
--- a/src/a2a3/runtime/tensormap_and_ringbuffer/runtime/pto_runtime2.cpp
+++ b/src/a2a3/runtime/tensormap_and_ringbuffer/runtime/pto_runtime2.cpp
@@ -45,7 +45,17 @@ static TaskOutputTensors alloc_tensors_impl(PTO2Runtime *rt, const Arg &args) {
     return pto2_alloc_tensors(&rt->orchestrator, args);
 }
 
-void pto2_rt_scope_begin(PTO2Runtime *rt) { pto2_scope_begin(&rt->orchestrator); }
+PTO2ManualSubmitResult pto2_rt_submit_task_manual(PTO2Runtime *rt, const MixedKernels &mixed_kernels, const Arg &args) {
+    return pto2_submit_mixed_task_manual(&rt->orchestrator, mixed_kernels, args);
+}
+
+void pto2_rt_add_dependency(PTO2Runtime *rt, PTO2TaskId producer, PTO2TaskId consumer) {
+    pto2_add_dependency(&rt->orchestrator, producer, consumer);
+}
+
+void pto2_rt_scope_begin(PTO2Runtime *rt, PTO2ScopeMode mode) {
+    pto2_scope_begin(&rt->orchestrator, mode);
+}
 
 void pto2_rt_scope_end(PTO2Runtime *rt) { pto2_scope_end(&rt->orchestrator); }
 
@@ -53,6 +63,20 @@ void pto2_rt_orchestration_done(PTO2Runtime *rt) { pto2_orchestrator_done(&rt->o
 
 static bool is_fatal_impl(PTO2Runtime *rt) { return rt->orchestrator.fatal; }
 
+static bool in_manual_scope_runtime(PTO2Runtime *rt) {
+    return rt->orchestrator.manual_scope_active;
+}
+
+static void fail_manual_tensor_access(PTO2Runtime *rt, const char *caller) {
+    PTO2OrchestratorState &orch = rt->orchestrator;
+    orch.sm_handle->header->orch_error_code.store(PTO2_ERROR_INVALID_ARGS, std::memory_order_release);
+    orch.fatal = true;
+    unified_log_error(
+        caller,
+        "blocking tensor data access is not supported inside PTO2_SCOPE(PTO2ScopeMode::MANUAL); exit the manual scope first"
+    );
+}
+
 // Wait for all producers of this tensor to be safe for data access.
 // Checks owner metadata (lifecycle anchor) and OverlapMap (modifier writers).
 // For reads: wait until each producer COMPLETED (done writing).
@@ -137,6 +161,10 @@ static bool wait_for_tensor_ready(PTO2Runtime *rt, const Tensor &tensor, bool wa
 MAYBE_UNINITIALIZED_END
 
 uint64_t pto2_get_tensor_data(PTO2Runtime *rt, const Tensor &tensor, uint32_t ndims, const uint32_t indices[]) {
+    if (in_manual_scope_runtime(rt)) {
+        fail_manual_tensor_access(rt, __FUNCTION__);
+        return 0;
+    }
     if (tensor.buffer.addr == 0) {
         unified_log_error(
             __FUNCTION__, "get_tensor_data: buffer not allocated (addr=0). "
@@ -160,6 +188,10 @@ uint64_t pto2_get_tensor_data(PTO2Runtime *rt, const Tensor &tensor, uint32_t nd
 void pto2_set_tensor_data(
     PTO2Runtime *rt, const Tensor &tensor, uint32_t ndims, const uint32_t indices[], uint64_t value
 ) {
+    if (in_manual_scope_runtime(rt)) {
+        fail_manual_tensor_access(rt, __FUNCTION__);
+        return;
+    }
     if (tensor.buffer.addr == 0) {
         unified_log_error(
             __FUNCTION__, "set_tensor_data: buffer not allocated (addr=0). "
@@ -181,6 +213,8 @@ void pto2_set_tensor_data(
 
 static const PTO2RuntimeOps s_runtime_ops = {
     .submit_task = submit_task_impl,
+    .submit_task_manual = pto2_rt_submit_task_manual,
+    .add_dependency = pto2_rt_add_dependency,
     .scope_begin = pto2_rt_scope_begin,
     .scope_end = pto2_rt_scope_end,
     .orchestration_done = pto2_rt_orchestration_done,
diff --git a/src/a2a3/runtime/tensormap_and_ringbuffer/runtime/pto_runtime2.h b/src/a2a3/runtime/tensormap_and_ringbuffer/runtime/pto_runtime2.h
index 779b75143..436ad091a 100644
--- a/src/a2a3/runtime/tensormap_and_ringbuffer/runtime/pto_runtime2.h
+++ b/src/a2a3/runtime/tensormap_and_ringbuffer/runtime/pto_runtime2.h
@@ -67,7 +67,9 @@ typedef struct PTO2Runtime PTO2Runtime; // forward declare for ops signatures
 
 struct PTO2RuntimeOps {
     TaskOutputTensors (*submit_task)(PTO2Runtime *rt, const MixedKernels &mixed_kernels, const Arg &args);
-    void (*scope_begin)(PTO2Runtime *rt);
+    PTO2ManualSubmitResult (*submit_task_manual)(PTO2Runtime *rt, const MixedKernels &mixed_kernels, const Arg &args);
+    void (*add_dependency)(PTO2Runtime *rt, PTO2TaskId producer, PTO2TaskId consumer);
+    void (*scope_begin)(PTO2Runtime *rt, PTO2ScopeMode mode);
     void (*scope_end)(PTO2Runtime *rt);
     void (*orchestration_done)(PTO2Runtime *rt);
     bool (*is_fatal)(PTO2Runtime *rt);
@@ -176,7 +178,7 @@ void pto2_runtime_set_mode(PTO2Runtime *rt, PTO2RuntimeMode mode);
 * bounded by the scope. When scope_end() is called, the scope
 * releases its reference to all enclosed tasks.
 */
-void pto2_rt_scope_begin(PTO2Runtime *rt);
+void pto2_rt_scope_begin(PTO2Runtime *rt, PTO2ScopeMode mode = PTO2ScopeMode::AUTO);
 
 /**
  * End current scope
@@ -186,6 +188,10 @@ void pto2_rt_scope_begin(PTO2Runtime *rt);
 */
 void pto2_rt_scope_end(PTO2Runtime *rt);
 
+PTO2ManualSubmitResult pto2_rt_submit_task_manual(PTO2Runtime *rt, const MixedKernels &mixed_kernels, const Arg &args);
+
+void pto2_rt_add_dependency(PTO2Runtime *rt, PTO2TaskId producer, PTO2TaskId consumer);
+
 /**
  * Mark orchestration as complete
  *
diff --git a/src/a2a3/runtime/tensormap_and_ringbuffer/runtime/pto_runtime2_types.h b/src/a2a3/runtime/tensormap_and_ringbuffer/runtime/pto_runtime2_types.h
index 247f09fed..7705e0e64 100644
--- a/src/a2a3/runtime/tensormap_and_ringbuffer/runtime/pto_runtime2_types.h
+++ b/src/a2a3/runtime/tensormap_and_ringbuffer/runtime/pto_runtime2_types.h
@@ -135,6 +135,20 @@ constexpr uint64_t PTO2_TENSOR_DATA_TIMEOUT_CYCLES = 15 * 1000 * 1000 * 1000ULL;
 * TaskId: defined in pto_task_id.h (included above).
 */
 
+// =============================================================================
+// Manual Scope Types
+// =============================================================================
+
+enum class PTO2ScopeMode : uint8_t {
+    AUTO = 0,
+    MANUAL = 1,
+};
+
+struct PTO2ManualSubmitResult {
+    PTO2TaskId task_id;
+    TaskOutputTensors outputs;
+};
+
 // =============================================================================
 // Worker Types
 // =============================================================================
diff --git a/src/a2a3/runtime/tensormap_and_ringbuffer/runtime/pto_scheduler.h b/src/a2a3/runtime/tensormap_and_ringbuffer/runtime/pto_scheduler.h
index 0c3f5a0ff..91c235308 100644
--- a/src/a2a3/runtime/tensormap_and_ringbuffer/runtime/pto_scheduler.h
+++ b/src/a2a3/runtime/tensormap_and_ringbuffer/runtime/pto_scheduler.h
@@ -653,6 +653,17 @@ struct PTO2SchedulerState {
 #endif
     }
 
+    void publish_manual_scope_tasks(PTO2TaskSlotState **task_slot_states, int32_t count) {
+        for (int32_t i = 0; i < count; i++) {
+            PTO2TaskSlotState &slot_state = *task_slot_states[i];
+            int32_t new_rc = slot_state.fanin_refcount.fetch_add(1, std::memory_order_acq_rel) + 1;
+            if (new_rc >= slot_state.fanin_count) {
+                PTO2ResourceShape shape = pto2_active_mask_to_shape(slot_state.active_mask);
+                ready_queues[static_cast<int>(shape)].push(&slot_state);
+            }
+        }
+    }
+
     /**
      * Subtask completion: atomic counter model.
      * Called when a single subtask (AIC, AIV0, or AIV1) finishes on any block.
diff --git a/tests/st/a2a3/tensormap_and_ringbuffer/paged_attention_partial_manual/golden.py b/tests/st/a2a3/tensormap_and_ringbuffer/paged_attention_partial_manual/golden.py
new file mode 100644
index 000000000..b03743d37
--- /dev/null
+++ b/tests/st/a2a3/tensormap_and_ringbuffer/paged_attention_partial_manual/golden.py
@@ -0,0 +1,7 @@
+from pathlib import Path
+import sys
+
+_BASE = Path(__file__).resolve().parents[1] / "paged_attention"
+sys.path.insert(0, str(_BASE))
+
+from golden import compute_golden, generate_inputs  # noqa: E402,F401
diff --git a/tests/st/a2a3/tensormap_and_ringbuffer/paged_attention_partial_manual/kernels/kernel_config.py b/tests/st/a2a3/tensormap_and_ringbuffer/paged_attention_partial_manual/kernels/kernel_config.py
new file mode 100644
index 000000000..92bba047b
--- /dev/null
+++ b/tests/st/a2a3/tensormap_and_ringbuffer/paged_attention_partial_manual/kernels/kernel_config.py
@@ -0,0 +1,71 @@
+# Copyright (c) PyPTO Contributors.
+# This program is free software, you can redistribute it and/or modify it under the terms and conditions of
+# CANN Open Software License Agreement Version 2.0 (the "License").
+# Please refer to the License for details. You may not use this file except in compliance with the License.
+# THIS SOFTWARE IS PROVIDED ON AN "AS IS" BASIS, WITHOUT WARRANTIES OF ANY KIND, EITHER EXPRESS OR IMPLIED,
+# INCLUDING BUT NOT LIMITED TO NON-INFRINGEMENT, MERCHANTABILITY, OR FITNESS FOR A PARTICULAR PURPOSE.
+# See LICENSE in the root of the software repository for the full text of the License.
+# -----------------------------------------------------------------------------------------------------------
+from pathlib import Path
+
+from task_interface import ArgDirection as D  # pyright: ignore[reportAttributeAccessIssue]
+
+_ROOT = Path(__file__).parent
+_PA_KERNELS = _ROOT.parent.parent / "paged_attention" / "kernels"
+
+ORCHESTRATION = {
+    "source": str(_ROOT / "orchestration" / "paged_attention_orch.cpp"),
+    "function_name": "aicpu_orchestration_entry",
+    "signature": [D.IN, D.IN, D.IN, D.IN, D.IN, D.OUT],
+}
+
+KERNELS = [
+    {
+        "func_id": 0,
+        "name": "QK",
+        "source": str(_PA_KERNELS / "aic" / "aic_qk_matmul.cpp"),
+        "core_type": "aic",
+        "signature": [D.IN, D.IN, D.OUT],
+    },
+    {
+        "func_id": 2,
+        "name": "PV",
+        "source": str(_PA_KERNELS / "aic" / "aic_pv_matmul.cpp"),
+        "core_type": "aic",
+        "signature": [D.IN, D.IN, D.OUT],
+    },
+    {
+        "func_id": 4,
+        "name": "AIC_HUB",
+        "source": str(_PA_KERNELS / "aic" / "aic_hub.cpp"),
+        "core_type": "aic",
+        "signature": [],
+    },
+    {
+        "func_id": 1,
+        "name": "SF",
+        "source": str(_PA_KERNELS / "aiv" / "aiv_softmax_prepare.cpp"),
+        "core_type": "aiv",
+        "signature": [D.IN, D.OUT, D.OUT, D.OUT],
+    },
+    {
+        "func_id": 3,
+        "name": "UP",
+        "source": str(_PA_KERNELS / "aiv" / "aiv_online_update.cpp"),
+        "core_type": "aiv",
+        "signature": [D.IN, D.IN, D.IN, D.INOUT, D.INOUT, D.INOUT, D.INOUT],
+    },
+    {
+        "func_id": 5,
+        "name": "AIV_HUB",
+        "source": str(_PA_KERNELS / "aiv" / "aiv_hub.cpp"),
+        "core_type": "aiv",
+        "signature": [],
+    },
+]
+
+RUNTIME_CONFIG = {
+    "runtime": "tensormap_and_ringbuffer",
+    "aicpu_thread_num": 4,
+    "block_dim": 24,
+}
diff --git a/tests/st/a2a3/tensormap_and_ringbuffer/paged_attention_partial_manual/kernels/orchestration/paged_attention_orch.cpp b/tests/st/a2a3/tensormap_and_ringbuffer/paged_attention_partial_manual/kernels/orchestration/paged_attention_orch.cpp
new file mode 100644
index 000000000..b067f2fe8
--- /dev/null
+++ 
b/tests/st/a2a3/tensormap_and_ringbuffer/paged_attention_partial_manual/kernels/orchestration/paged_attention_orch.cpp @@ -0,0 +1,171 @@ +/* + * Copyright (c) PyPTO Contributors. + * This program is free software, you can redistribute it and/or modify it under the terms and conditions of + * CANN Open Software License Agreement Version 2.0 (the "License"). + * Please refer to the License for details. You may not use this file except in compliance with the License. + * THIS SOFTWARE IS PROVIDED ON AN "AS IS" BASIS, WITHOUT WARRANTIES OF ANY KIND, EITHER EXPRESS OR IMPLIED, + * INCLUDING BUT NOT LIMITED TO NON-INFRINGEMENT, MERCHANTABILITY, OR FITNESS FOR A PARTICULAR PURPOSE. + * See LICENSE in the root of the software repository for the full text of the License. + * ----------------------------------------------------------------------------------------------------------- + */ + +#include +#include + +#include "pto_orchestration_api.h" // NOLINT(build/include_subdir) + +#define FUNC_QK_MATMUL 0 +#define FUNC_SOFTMAX_PREPARE 1 +#define FUNC_PV_MATMUL 2 +#define FUNC_ONLINE_UPDATE 3 +#define FUNC_AIC_HUB 4 +#define FUNC_AIV_HUB 5 + +extern "C" { + +__attribute__((visibility("default"))) PTO2OrchestrationConfig +aicpu_orchestration_config(const ChipStorageTaskArgs &orch_args) { + (void)orch_args; // NOLINT(readability/casting) + return PTO2OrchestrationConfig{ + .expected_arg_count = 7, + }; +} + +__attribute__((visibility("default"))) void +aicpu_orchestration_entry(const ChipStorageTaskArgs &orch_args, int orch_thread_num, int orch_thread_index) { + (void)orch_thread_num; // NOLINT(readability/casting) + (void)orch_thread_index; // NOLINT(readability/casting) + + uint64_t batch = orch_args.tensor(0).shapes[0]; + uint64_t num_heads = orch_args.tensor(0).shapes[1]; + uint64_t head_dim = orch_args.tensor(0).shapes[2]; + DataType data_type = orch_args.tensor(0).dtype; + uint64_t block_size = orch_args.tensor(1).shapes[1]; + uint64_t block_num = 
orch_args.tensor(3).shapes[1]; + uint64_t scale_value = orch_args.scalar(0); + + uint64_t q_head_num = num_heads; + uint64_t q_tile = std::min(num_heads, 128UL); + uint64_t q_loop = (q_head_num + q_tile - 1) / q_tile; + + void *query_ptr = orch_args.tensor(0).data_as(); + void *kc_ptr = orch_args.tensor(1).data_as(); + void *vc_ptr = orch_args.tensor(2).data_as(); + void *out_ptr = orch_args.tensor(5).data_as(); + + uint64_t total_blocks_count = orch_args.tensor(1).shapes[0]; + + uint32_t query_shapes[2] = {static_cast(batch * num_heads), static_cast(head_dim)}; + uint32_t key_cache_shapes[2] = { + static_cast(total_blocks_count * block_size), static_cast(head_dim) + }; + uint32_t value_cache_shapes[2] = { + static_cast(total_blocks_count * block_size), static_cast(head_dim) + }; + uint32_t out_shapes[2] = {static_cast(batch * num_heads), static_cast(head_dim)}; + Tensor query = make_tensor_external(query_ptr, query_shapes, 2, data_type); + Tensor key_cache = make_tensor_external(kc_ptr, key_cache_shapes, 2, data_type); + Tensor value_cache = make_tensor_external(vc_ptr, value_cache_shapes, 2, data_type); + Tensor out = make_tensor_external(out_ptr, out_shapes, 2, DataType::FLOAT32); + + int *host_block_table = orch_args.tensor(3).data_as(); + int *host_context_lens = orch_args.tensor(4).data_as(); + + uint32_t tile2d_shapes[2] = {static_cast(q_tile), static_cast(head_dim)}; + uint32_t scalar_shapes[1] = {static_cast(q_tile)}; + uint32_t sij_shapes[2] = {static_cast(q_tile), static_cast(block_size)}; + TensorCreateInfo tile2d_ci(tile2d_shapes, 2, DataType::FLOAT32); + TensorCreateInfo scalar_ci(scalar_shapes, 1, DataType::FLOAT32); + TensorCreateInfo sij_ci(sij_shapes, 2, DataType::FLOAT32); + TensorCreateInfo pij_f16_ci(sij_shapes, 2, data_type); + + for (uint64_t b_idx = 0; b_idx < batch; b_idx++) { + uint64_t cur_seq = host_context_lens[b_idx]; + uint64_t bn_this_batch = (cur_seq + block_size - 1) / block_size; + for (uint64_t q_idx = 0; q_idx < q_loop; q_idx++) 
{ + PTO2_SCOPE() { + uint64_t cur_offset = b_idx * q_head_num + q_idx * q_tile; + + uint32_t qi_offsets[2] = {static_cast(cur_offset), 0}; + uint32_t out_view_offsets[2] = {static_cast(cur_offset), 0}; + Tensor qi = query.view(tile2d_shapes, qi_offsets); + Tensor out_view = out.view(tile2d_shapes, out_view_offsets); + + Arg params_inplace; + params_inplace.add_output(tile2d_ci); + params_inplace.add_output(scalar_ci); + params_inplace.add_output(scalar_ci); + TaskOutputTensors hub_outs = pto2_rt_submit_aiv_task(FUNC_AIV_HUB, params_inplace); + const Tensor &oi = hub_outs.get_ref(0); + const Tensor &li_update = hub_outs.get_ref(1); + const Tensor &mi_update = hub_outs.get_ref(2); + + for (uint64_t bn = 0; bn < bn_this_batch; bn++) { + uint64_t cur_block_idx = host_block_table[b_idx * block_num + bn]; + uint64_t valid_len = std::min(block_size, cur_seq - bn * block_size); + + uint32_t kv_shapes[2] = {static_cast(block_size), static_cast(head_dim)}; + uint32_t kv_offsets[2] = {static_cast(cur_block_idx * block_size), 0}; + Tensor kj = key_cache.view(kv_shapes, kv_offsets); + Tensor vj = value_cache.view(kv_shapes, kv_offsets); + + PTO2_SCOPE(PTO2ScopeMode::MANUAL) { + Arg params_qk; + params_qk.add_input(qi); + params_qk.add_input(kj); + params_qk.add_output(sij_ci); + PTO2ManualSubmitResult qk_outs = pto2_rt_submit_aic_task_manual(FUNC_QK_MATMUL, params_qk); + const Tensor &sij = qk_outs.outputs.get_ref(0); + + uint32_t sij_valid_shapes[2] = { + static_cast(q_tile), static_cast(valid_len) + }; + uint32_t sij_valid_offsets[2] = {0, 0}; + Tensor sij_valid = sij.view(sij_valid_shapes, sij_valid_offsets); + + Arg params_sf; + params_sf.add_input(sij_valid); + params_sf.add_output(pij_f16_ci); + params_sf.add_output(scalar_ci); + params_sf.add_output(scalar_ci); + params_sf.add_scalar(scale_value); + PTO2ManualSubmitResult sf_outs = + pto2_rt_submit_aiv_task_manual(FUNC_SOFTMAX_PREPARE, params_sf); + const Tensor &pij_f16 = sf_outs.outputs.get_ref(0); + const Tensor &mi = 
sf_outs.outputs.get_ref(1); + const Tensor &li = sf_outs.outputs.get_ref(2); + + Arg params_pv; + params_pv.add_input(pij_f16); + params_pv.add_input(vj); + params_pv.add_output(tile2d_ci); + PTO2ManualSubmitResult pv_outs = pto2_rt_submit_aic_task_manual(FUNC_PV_MATMUL, params_pv); + const Tensor &oi_tmp = pv_outs.outputs.get_ref(0); + + uint64_t is_first = (bn == 0) ? 1 : 0; + uint64_t is_last = (bn == bn_this_batch - 1) ? 1 : 0; + + Arg params_up; + params_up.add_input(mi); + params_up.add_input(li); + params_up.add_input(oi_tmp); + params_up.add_inout(mi_update); + params_up.add_inout(li_update); + params_up.add_inout(oi); + params_up.add_inout(out_view); + params_up.add_scalar(is_first); + params_up.add_scalar(is_last); + PTO2ManualSubmitResult up_outs = pto2_rt_submit_aiv_task_manual(FUNC_ONLINE_UPDATE, params_up); + + pto2_rt_add_dependency(qk_outs.task_id, sf_outs.task_id); + pto2_rt_add_dependency(sf_outs.task_id, pv_outs.task_id); + pto2_rt_add_dependency(sf_outs.task_id, up_outs.task_id); + pto2_rt_add_dependency(pv_outs.task_id, up_outs.task_id); + } + } + } + } + } +} + +} // extern "C" diff --git a/tests/st/a2a3/tensormap_and_ringbuffer/paged_attention_unroll_partial_manual/golden.py b/tests/st/a2a3/tensormap_and_ringbuffer/paged_attention_unroll_partial_manual/golden.py new file mode 100644 index 000000000..95a72e23b --- /dev/null +++ b/tests/st/a2a3/tensormap_and_ringbuffer/paged_attention_unroll_partial_manual/golden.py @@ -0,0 +1,8 @@ +from pathlib import Path +import sys + +_BASE = Path(__file__).resolve().parents[1] / "paged_attention_unroll" +sys.path.insert(0, str(_BASE)) + +from golden import ALL_CASES, ATOL, DEFAULT_CASE, RTOL, generate_inputs # noqa: E402,F401 +from paged_attention_golden import compute_golden, run_golden_test # noqa: E402,F401 diff --git a/tests/st/a2a3/tensormap_and_ringbuffer/paged_attention_unroll_partial_manual/kernels/kernel_config.py 
b/tests/st/a2a3/tensormap_and_ringbuffer/paged_attention_unroll_partial_manual/kernels/kernel_config.py new file mode 100644 index 000000000..e3a0d13c3 --- /dev/null +++ b/tests/st/a2a3/tensormap_and_ringbuffer/paged_attention_unroll_partial_manual/kernels/kernel_config.py @@ -0,0 +1,72 @@ +# Copyright (c) PyPTO Contributors. +# This program is free software, you can redistribute it and/or modify it under the terms and conditions of +# CANN Open Software License Agreement Version 2.0 (the "License"). +# Please refer to the License for details. You may not use this file except in compliance with the License. +# THIS SOFTWARE IS PROVIDED ON AN "AS IS" BASIS, WITHOUT WARRANTIES OF ANY KIND, EITHER EXPRESS OR IMPLIED, +# INCLUDING BUT NOT LIMITED TO NON-INFRINGEMENT, MERCHANTABILITY, OR FITNESS FOR A PARTICULAR PURPOSE. +# See LICENSE in the root of the software repository for the full text of the License. +# ----------------------------------------------------------------------------------------------------------- +from pathlib import Path + +from task_interface import ArgDirection as D # pyright: ignore[reportAttributeAccessIssue] + +_ROOT = Path(__file__).parent +_PA_KERNELS = _ROOT.parent.parent / "paged_attention_unroll" / "kernels" + +ORCHESTRATION = { + "source": str(_ROOT / "orchestration" / "paged_attention_orch.cpp"), + "function_name": "aicpu_orchestration_entry", + "signature": [D.IN, D.IN, D.IN, D.IN, D.IN, D.OUT], +} + +KERNELS = [ + { + "func_id": 0, + "name": "QK", + "source": str(_PA_KERNELS / "aic" / "aic_qk_matmul.cpp"), + "core_type": "aic", + "signature": [D.IN, D.IN, D.OUT], + }, + { + "func_id": 2, + "name": "PV", + "source": str(_PA_KERNELS / "aic" / "aic_pv_matmul.cpp"), + "core_type": "aic", + "signature": [D.IN, D.IN, D.OUT], + }, + { + "func_id": 4, + "name": "AIC_HUB", + "source": str(_PA_KERNELS / "aic" / "aic_hub.cpp"), + "core_type": "aic", + "signature": [], + }, + { + "func_id": 1, + "name": "SF", + "source": str(_PA_KERNELS / "aiv" / 
"aiv_softmax_prepare.cpp"), + "core_type": "aiv", + "signature": [D.IN, D.OUT, D.OUT, D.OUT], + }, + { + "func_id": 3, + "name": "UP", + "source": str(_PA_KERNELS / "aiv" / "aiv_online_update.cpp"), + "core_type": "aiv", + "signature": [D.IN, D.IN, D.IN, D.INOUT, D.INOUT, D.INOUT, D.INOUT], + }, + { + "func_id": 5, + "name": "AIV_HUB", + "source": str(_PA_KERNELS / "aiv" / "aiv_hub.cpp"), + "core_type": "aiv", + "signature": [], + }, +] + +RUNTIME_CONFIG = { + "runtime": "tensormap_and_ringbuffer", + "aicpu_thread_num": 4, + "orch_thread_num": 1, + "block_dim": 24, +} diff --git a/tests/st/a2a3/tensormap_and_ringbuffer/paged_attention_unroll_partial_manual/kernels/orchestration/paged_attention_orch.cpp b/tests/st/a2a3/tensormap_and_ringbuffer/paged_attention_unroll_partial_manual/kernels/orchestration/paged_attention_orch.cpp new file mode 100644 index 000000000..843c2daf6 --- /dev/null +++ b/tests/st/a2a3/tensormap_and_ringbuffer/paged_attention_unroll_partial_manual/kernels/orchestration/paged_attention_orch.cpp @@ -0,0 +1,181 @@ +/* + * Copyright (c) PyPTO Contributors. + * This program is free software, you can redistribute it and/or modify it under the terms and conditions of + * CANN Open Software License Agreement Version 2.0 (the "License"). + * Please refer to the License for details. You may not use this file except in compliance with the License. + * THIS SOFTWARE IS PROVIDED ON AN "AS IS" BASIS, WITHOUT WARRANTIES OF ANY KIND, EITHER EXPRESS OR IMPLIED, + * INCLUDING BUT NOT LIMITED TO NON-INFRINGEMENT, MERCHANTABILITY, OR FITNESS FOR A PARTICULAR PURPOSE. + * See LICENSE in the root of the software repository for the full text of the License. 
+ * ----------------------------------------------------------------------------------------------------------- + */ + +#include +#include + +#include "pto_orchestration_api.h" // NOLINT(build/include_subdir) + +#define N_UNROLL 64 + +#define FUNC_QK_MATMUL 0 +#define FUNC_SOFTMAX_PREPARE 1 +#define FUNC_PV_MATMUL 2 +#define FUNC_ONLINE_UPDATE 3 +#define FUNC_AIC_HUB 4 +#define FUNC_AIV_HUB 5 + +extern "C" { + +__attribute__((visibility("default"))) PTO2OrchestrationConfig +aicpu_orchestration_config(const ChipStorageTaskArgs &orch_args) { + (void)orch_args; // NOLINT(readability/casting) + return PTO2OrchestrationConfig{ + .expected_arg_count = 7, + }; +} + +__attribute__((visibility("default"))) void +aicpu_orchestration_entry(const ChipStorageTaskArgs &orch_args, int orch_thread_num, int orch_thread_index) { + (void)orch_thread_num; // NOLINT(readability/casting) + (void)orch_thread_index; // NOLINT(readability/casting) + + uint64_t batch = orch_args.tensor(0).shapes[0]; + uint64_t num_heads = orch_args.tensor(0).shapes[1]; + uint64_t head_dim = orch_args.tensor(0).shapes[2]; + DataType data_type = orch_args.tensor(0).dtype; + uint64_t block_size = orch_args.tensor(1).shapes[1]; + uint64_t block_num = orch_args.tensor(3).shapes[1]; + uint64_t scale_value = orch_args.scalar(0); + + uint64_t q_head_num = num_heads; + uint64_t q_tile = std::min(num_heads, 128UL); + uint64_t q_loop = (q_head_num + q_tile - 1) / q_tile; + + void *query_ptr = orch_args.tensor(0).data_as(); + void *kc_ptr = orch_args.tensor(1).data_as(); + void *vc_ptr = orch_args.tensor(2).data_as(); + void *out_ptr = orch_args.tensor(5).data_as(); + + uint64_t total_blocks_count = orch_args.tensor(1).shapes[0]; + + uint32_t query_shapes[2] = {static_cast(batch * num_heads), static_cast(head_dim)}; + uint32_t key_cache_shapes[2] = { + static_cast(total_blocks_count * block_size), static_cast(head_dim) + }; + uint32_t value_cache_shapes[2] = { + static_cast(total_blocks_count * block_size), 
static_cast(head_dim) + }; + uint32_t out_shapes[2] = {static_cast(batch * num_heads), static_cast(head_dim)}; + Tensor query = make_tensor_external(query_ptr, query_shapes, 2, data_type, false); + Tensor key_cache = make_tensor_external(kc_ptr, key_cache_shapes, 2, data_type, false); + Tensor value_cache = make_tensor_external(vc_ptr, value_cache_shapes, 2, data_type, false); + Tensor out = make_tensor_external(out_ptr, out_shapes, 2, DataType::FLOAT32); + + int *host_block_table = orch_args.tensor(3).data_as(); + int *host_context_lens = orch_args.tensor(4).data_as(); + + uint32_t oi_shapes[2] = {static_cast(q_tile), static_cast(head_dim)}; + uint32_t li_shapes[1] = {static_cast(q_tile)}; + TensorCreateInfo tile2d_ci(oi_shapes, 2, DataType::FLOAT32); + TensorCreateInfo scalar_noinit_ci(li_shapes, 1, DataType::FLOAT32, false); + TensorCreateInfo scalar_ci(li_shapes, 1, DataType::FLOAT32); + + for (uint64_t b_idx = 0; b_idx < batch; b_idx++) { + uint64_t cur_seq = host_context_lens[b_idx]; + uint64_t bn_this_batch = (cur_seq + block_size - 1) / block_size; + int *bt_base = host_block_table + b_idx * block_num; + + for (uint64_t q_idx = 0; q_idx < q_loop; q_idx++) { + PTO2_SCOPE() { + uint64_t cur_offset = b_idx * q_head_num + q_idx * q_tile; + + uint32_t qi_shapes[2] = {static_cast(q_tile), static_cast(head_dim)}; + uint32_t qi_offsets[2] = {static_cast(cur_offset), 0}; + Tensor qi = query.view(qi_shapes, qi_offsets); + uint32_t out_view_shapes[2] = {static_cast(q_tile), static_cast(head_dim)}; + uint32_t out_view_offsets[2] = {static_cast(cur_offset), 0}; + Tensor out_view = out.view(out_view_shapes, out_view_offsets, true); + + Arg params_inplace; + params_inplace.add_output(tile2d_ci); + params_inplace.add_output(scalar_noinit_ci); + params_inplace.add_output(scalar_noinit_ci); + TaskOutputTensors hub_outs = pto2_rt_submit_aiv_task(FUNC_AIV_HUB, params_inplace); + const Tensor &oi = hub_outs.get_ref(0); + const Tensor &li_update = hub_outs.get_ref(1); + const 
Tensor &mi_update = hub_outs.get_ref(2); + + Arg params_qk; + Arg params_sf; + Arg params_pv; + Arg params_up; + + for (uint64_t bn = 0; bn < bn_this_batch; bn += N_UNROLL) { + uint64_t n_blocks = std::min(static_cast(N_UNROLL), bn_this_batch - bn); + uint64_t last_block_seq_start = (bn + n_blocks - 1) * block_size; + uint64_t valid_len_last = std::min(block_size, cur_seq - last_block_seq_start); + + PTO2_SCOPE(PTO2ScopeMode::MANUAL) { + uint32_t sij_buf_shapes[2] = { + static_cast(q_tile), static_cast(n_blocks * block_size) + }; + TensorCreateInfo sij_buf_ci(sij_buf_shapes, 2, DataType::FLOAT32); + + params_qk.reset(); + params_qk.add_input(qi); + params_qk.add_input(key_cache); + params_qk.add_output(sij_buf_ci); + params_qk.add_scalar(n_blocks); + params_qk.add_scalar(reinterpret_cast(bt_base + bn)); + PTO2ManualSubmitResult qk_outs = pto2_rt_submit_aic_task_manual(FUNC_QK_MATMUL, params_qk); + + uint32_t pij_buf_shapes[2] = { + static_cast(q_tile), static_cast(n_blocks * block_size) + }; + TensorCreateInfo pij_buf_ci(pij_buf_shapes, 2, data_type); + + params_sf.reset(); + params_sf.add_input(qk_outs.outputs.get_ref(0)); + params_sf.add_output(pij_buf_ci); + params_sf.add_output(scalar_ci); + params_sf.add_output(scalar_ci); + params_sf.add_scalar(scale_value); + params_sf.add_scalar(n_blocks); + params_sf.add_scalar(valid_len_last); + PTO2ManualSubmitResult sf_outs = + pto2_rt_submit_aiv_task_manual(FUNC_SOFTMAX_PREPARE, params_sf); + + params_pv.reset(); + params_pv.add_input(sf_outs.outputs.get_ref(0)); + params_pv.add_input(value_cache); + params_pv.add_output(tile2d_ci); + params_pv.add_scalar(n_blocks); + params_pv.add_scalar(reinterpret_cast(bt_base + bn)); + PTO2ManualSubmitResult pv_outs = pto2_rt_submit_aic_task_manual(FUNC_PV_MATMUL, params_pv); + + uint64_t is_first = (bn == 0) ? 1 : 0; + uint64_t is_last = (bn + n_blocks >= bn_this_batch) ? 
1 : 0; + + params_up.reset(); + params_up.add_input(sf_outs.outputs.get_ref(1)); + params_up.add_input(sf_outs.outputs.get_ref(2)); + params_up.add_input(pv_outs.outputs.get_ref(0)); + params_up.add_inout(mi_update); + params_up.add_inout(li_update); + params_up.add_inout(oi); + params_up.add_inout(out_view); + params_up.add_scalar(is_first); + params_up.add_scalar(is_last); + PTO2ManualSubmitResult up_outs = pto2_rt_submit_aiv_task_manual(FUNC_ONLINE_UPDATE, params_up); + + pto2_rt_add_dependency(qk_outs.task_id, sf_outs.task_id); + pto2_rt_add_dependency(sf_outs.task_id, pv_outs.task_id); + pto2_rt_add_dependency(sf_outs.task_id, up_outs.task_id); + pto2_rt_add_dependency(pv_outs.task_id, up_outs.task_id); + } + } + } + } + } +} + +} // extern "C" From 54e53244124a9e4c37813b9b8718582ab438f282 Mon Sep 17 00:00:00 2001 From: Youwei Xiao Date: Sun, 5 Apr 2026 21:31:29 +0800 Subject: [PATCH 05/35] Add unmodified tensormap runtime baseline --- .../aicore/aicore_executor.cpp | 140 + .../aicpu/aicpu_executor.cpp | 2473 +++++++++++++++++ .../build_config.py | 26 + .../common/intrinsic.h | 141 + .../docs/MULTI_RING.md | 237 ++ .../docs/ROADMAP.md | 86 + .../docs/RUNTIME_LOGIC.md | 658 +++++ .../docs/SCALAR_DATA_ACCESS.md | 137 + .../docs/SUBMIT_BY_CLUSTER.md | 226 ++ .../docs/device_log_profiling.md | 167 ++ .../docs/profiling_levels.md | 355 +++ .../host/runtime_compile_info.cpp | 18 + .../host/runtime_maker.cpp | 381 +++ .../orchestration/common.cpp | 174 ++ .../orchestration/pto_orchestration_api.h | 308 ++ .../runtime/common.h | 93 + .../runtime/pto2_dispatch_payload.h | 85 + .../runtime/pto_orchestrator.cpp | 759 +++++ .../runtime/pto_orchestrator.h | 225 ++ .../runtime/pto_ring_buffer.cpp | 78 + .../runtime/pto_ring_buffer.h | 508 ++++ .../runtime/pto_runtime2.cpp | 337 +++ .../runtime/pto_runtime2.h | 225 ++ .../runtime/pto_runtime2_types.h | 557 ++++ .../runtime/pto_scheduler.cpp | 220 ++ .../runtime/pto_scheduler.h | 819 ++++++ .../runtime/pto_shared_memory.cpp | 
273 ++ .../runtime/pto_shared_memory.h | 227 ++ .../runtime/pto_submit_types.h | 119 + .../runtime/pto_task_id.h | 50 + .../runtime/pto_tensormap.cpp | 256 ++ .../runtime/pto_tensormap.h | 521 ++++ .../runtime/pto_types.h | 284 ++ .../runtime/runtime.cpp | 144 + .../runtime/runtime.h | 290 ++ .../runtime/tensor.h | 493 ++++ .../paged_attention/golden.py | 19 + .../paged_attention/kernels/kernel_config.py | 20 + .../paged_attention_unroll/golden.py | 19 + .../kernels/kernel_config.py | 20 + tests/ut/py/test_runtime_builder.py | 8 + tools/benchmark_rounds.sh | 22 +- 42 files changed, 12196 insertions(+), 2 deletions(-) create mode 100644 src/a2a3/runtime/tensormap_and_ringbuffer_unmodified/aicore/aicore_executor.cpp create mode 100644 src/a2a3/runtime/tensormap_and_ringbuffer_unmodified/aicpu/aicpu_executor.cpp create mode 100644 src/a2a3/runtime/tensormap_and_ringbuffer_unmodified/build_config.py create mode 100644 src/a2a3/runtime/tensormap_and_ringbuffer_unmodified/common/intrinsic.h create mode 100644 src/a2a3/runtime/tensormap_and_ringbuffer_unmodified/docs/MULTI_RING.md create mode 100644 src/a2a3/runtime/tensormap_and_ringbuffer_unmodified/docs/ROADMAP.md create mode 100644 src/a2a3/runtime/tensormap_and_ringbuffer_unmodified/docs/RUNTIME_LOGIC.md create mode 100644 src/a2a3/runtime/tensormap_and_ringbuffer_unmodified/docs/SCALAR_DATA_ACCESS.md create mode 100644 src/a2a3/runtime/tensormap_and_ringbuffer_unmodified/docs/SUBMIT_BY_CLUSTER.md create mode 100644 src/a2a3/runtime/tensormap_and_ringbuffer_unmodified/docs/device_log_profiling.md create mode 100644 src/a2a3/runtime/tensormap_and_ringbuffer_unmodified/docs/profiling_levels.md create mode 100644 src/a2a3/runtime/tensormap_and_ringbuffer_unmodified/host/runtime_compile_info.cpp create mode 100644 src/a2a3/runtime/tensormap_and_ringbuffer_unmodified/host/runtime_maker.cpp create mode 100644 src/a2a3/runtime/tensormap_and_ringbuffer_unmodified/orchestration/common.cpp create mode 100644 
src/a2a3/runtime/tensormap_and_ringbuffer_unmodified/orchestration/pto_orchestration_api.h create mode 100644 src/a2a3/runtime/tensormap_and_ringbuffer_unmodified/runtime/common.h create mode 100644 src/a2a3/runtime/tensormap_and_ringbuffer_unmodified/runtime/pto2_dispatch_payload.h create mode 100644 src/a2a3/runtime/tensormap_and_ringbuffer_unmodified/runtime/pto_orchestrator.cpp create mode 100644 src/a2a3/runtime/tensormap_and_ringbuffer_unmodified/runtime/pto_orchestrator.h create mode 100644 src/a2a3/runtime/tensormap_and_ringbuffer_unmodified/runtime/pto_ring_buffer.cpp create mode 100644 src/a2a3/runtime/tensormap_and_ringbuffer_unmodified/runtime/pto_ring_buffer.h create mode 100644 src/a2a3/runtime/tensormap_and_ringbuffer_unmodified/runtime/pto_runtime2.cpp create mode 100644 src/a2a3/runtime/tensormap_and_ringbuffer_unmodified/runtime/pto_runtime2.h create mode 100644 src/a2a3/runtime/tensormap_and_ringbuffer_unmodified/runtime/pto_runtime2_types.h create mode 100644 src/a2a3/runtime/tensormap_and_ringbuffer_unmodified/runtime/pto_scheduler.cpp create mode 100644 src/a2a3/runtime/tensormap_and_ringbuffer_unmodified/runtime/pto_scheduler.h create mode 100644 src/a2a3/runtime/tensormap_and_ringbuffer_unmodified/runtime/pto_shared_memory.cpp create mode 100644 src/a2a3/runtime/tensormap_and_ringbuffer_unmodified/runtime/pto_shared_memory.h create mode 100644 src/a2a3/runtime/tensormap_and_ringbuffer_unmodified/runtime/pto_submit_types.h create mode 100644 src/a2a3/runtime/tensormap_and_ringbuffer_unmodified/runtime/pto_task_id.h create mode 100644 src/a2a3/runtime/tensormap_and_ringbuffer_unmodified/runtime/pto_tensormap.cpp create mode 100644 src/a2a3/runtime/tensormap_and_ringbuffer_unmodified/runtime/pto_tensormap.h create mode 100644 src/a2a3/runtime/tensormap_and_ringbuffer_unmodified/runtime/pto_types.h create mode 100644 src/a2a3/runtime/tensormap_and_ringbuffer_unmodified/runtime/runtime.cpp create mode 100644 
src/a2a3/runtime/tensormap_and_ringbuffer_unmodified/runtime/runtime.h create mode 100644 src/a2a3/runtime/tensormap_and_ringbuffer_unmodified/runtime/tensor.h create mode 100644 tests/st/a2a3/tensormap_and_ringbuffer_unmodified/paged_attention/golden.py create mode 100644 tests/st/a2a3/tensormap_and_ringbuffer_unmodified/paged_attention/kernels/kernel_config.py create mode 100644 tests/st/a2a3/tensormap_and_ringbuffer_unmodified/paged_attention_unroll/golden.py create mode 100644 tests/st/a2a3/tensormap_and_ringbuffer_unmodified/paged_attention_unroll/kernels/kernel_config.py diff --git a/src/a2a3/runtime/tensormap_and_ringbuffer_unmodified/aicore/aicore_executor.cpp b/src/a2a3/runtime/tensormap_and_ringbuffer_unmodified/aicore/aicore_executor.cpp new file mode 100644 index 000000000..1c03606e4 --- /dev/null +++ b/src/a2a3/runtime/tensormap_and_ringbuffer_unmodified/aicore/aicore_executor.cpp @@ -0,0 +1,140 @@ +/* + * Copyright (c) PyPTO Contributors. + * This program is free software, you can redistribute it and/or modify it under the terms and conditions of + * CANN Open Software License Agreement Version 2.0 (the "License"). + * Please refer to the License for details. You may not use this file except in compliance with the License. + * THIS SOFTWARE IS PROVIDED ON AN "AS IS" BASIS, WITHOUT WARRANTIES OF ANY KIND, EITHER EXPRESS OR IMPLIED, + * INCLUDING BUT NOT LIMITED TO NON-INFRINGEMENT, MERCHANTABILITY, OR FITNESS FOR A PARTICULAR PURPOSE. + * See LICENSE in the root of the software repository for the full text of the License. 
+ * ----------------------------------------------------------------------------------------------------------- + */ + +#include "aicore/aicore.h" +#include "aicore/performance_collector_aicore.h" +#include "common/perf_profiling.h" +#include "common/platform_config.h" // Register-based communication +#include "pto2_dispatch_payload.h" // NOLINT(build/include_subdir) +#include "runtime.h" // NOLINT(build/include_subdir) + +/** + * Unified function pointer type for kernel dispatch + * + * All kernels follow the same signature: void kernel(__gm__ int64_t* args) + * This enables simple, switch-free dispatch. + */ +typedef void (*UnifiedKernelFunc)(__gm__ int64_t*); + +/** + * Execute task from PTO2DispatchPayload. + * + * Reads function_bin_addr and args from the dispatch payload. + * + * @param payload Pointer to PTO2DispatchPayload in global memory + */ +__aicore__ __attribute__((always_inline)) static void execute_task(__gm__ PTO2DispatchPayload* payload) { + if (payload == nullptr || payload->function_bin_addr == 0) { + return; + } + + UnifiedKernelFunc kernel = (UnifiedKernelFunc)payload->function_bin_addr; + kernel(reinterpret_cast<__gm__ int64_t*>(payload->args)); + OUT_OF_ORDER_STORE_BARRIER(); +} + +/** + * AICore main execution loop + * + * Implements the AICPU-AICore register-based dispatch protocol: + * 1. Wait for AICPU ready signal via handshake buffer + * 2. Report physical core ID and core type, signal AICore ready + * 3. Cache per-core PTO2DispatchPayload pointer from hank->task + * 4. Poll DATA_MAIN_BASE register for task dispatch until exit signal + * + * AICPU writes &s_pto2_payload_per_core[i] to hank->task before setting + * aicpu_ready=1. AICore caches this pointer and reads function_bin_addr + + * args pointer from it on each dispatch. reg_val is a monotonically + * increasing task ID used only for dispatch signaling and ACK/FIN protocol. 
+ * + * @param runtime Pointer to Runtime in global memory + * @param block_idx Block index (core ID) + * @param core_type Core type (AIC or AIV) + */ +__aicore__ __attribute__((weak)) void aicore_execute(__gm__ Runtime* runtime, int block_idx, CoreType core_type) { + __gm__ Handshake* my_hank = (__gm__ Handshake*)(&runtime->workers[block_idx]); + + // Phase 1: Wait for AICPU initialization signal + while (my_hank->aicpu_ready == 0) { + dcci(my_hank, SINGLE_CACHE_LINE); + } + + // Phase 2: Report physical core ID, signal ready + my_hank->physical_core_id = get_physical_core_id(); + OUT_OF_ORDER_STORE_BARRIER(); + my_hank->aicore_regs_ready = 1; + dcci(&my_hank->aicore_regs_ready, SINGLE_CACHE_LINE, CACHELINE_OUT); + while (my_hank->aicpu_regs_ready == 0) { + dcci(&my_hank->aicpu_regs_ready, SINGLE_CACHE_LINE); + } + // Report initial idle status via register + write_reg(RegId::COND, AICORE_IDLE_VALUE); + + // Phase 3: Report core type, signal ready + my_hank->core_type = core_type; + OUT_OF_ORDER_STORE_BARRIER(); + my_hank->aicore_done = block_idx + 1; // Signal ready (use block_idx + 1 to avoid 0) + + dcci(my_hank, SINGLE_CACHE_LINE, CACHELINE_OUT); + + // Cache per-core dispatch payload pointer (set by AICPU before aicpu_ready) + __gm__ PTO2DispatchPayload* payload = reinterpret_cast<__gm__ PTO2DispatchPayload*>(my_hank->task); + + bool profiling_enabled = runtime->enable_profiling; + + // Phase 4: Main execution loop - poll register for tasks until exit signal + // Register encoding: AICPU_IDLE_TASK_ID=idle, task_id=task, AICORE_EXIT_SIGNAL=exit + uint32_t reg_val = AICPU_IDLE_TASK_ID; + uint32_t last_reg_val = AICPU_IDLE_TASK_ID; + + while (true) { + reg_val = static_cast(read_reg(RegId::DATA_MAIN_BASE)); + if (reg_val == AICORE_EXIT_SIGNAL) { + // Signal exit acknowledgment to AICPU + write_reg(RegId::COND, AICORE_EXITED_VALUE); + break; + } + + // Execute task if new (reg_val encoding: AICPU_IDLE_TASK_ID=idle, task_id=task) + if (reg_val == AICPU_IDLE_TASK_ID 
|| reg_val == last_reg_val) { + SPIN_WAIT_HINT(); + continue; + } + + { + uint32_t task_id = reg_val; // Decode: register holds task_id directly + + // Invalidate payload buffer (AICPU updates its content each dispatch) + dcci(payload, ENTIRE_DATA_CACHE); + + write_reg(RegId::COND, MAKE_ACK_VALUE(task_id)); + + // Performance profiling: record start time + uint64_t start_time = get_sys_cnt_aicore(); + + // Execute the task + execute_task(payload); + + // Performance profiling: record task execution + if (profiling_enabled) { + uint64_t end_time = get_sys_cnt_aicore(); + __gm__ PerfBuffer* perf_buf = (__gm__ PerfBuffer*)my_hank->perf_records_addr; + perf_aicore_record_task(perf_buf, task_id, start_time, end_time); + } + + last_reg_val = reg_val; + write_reg(RegId::COND, MAKE_FIN_VALUE(task_id)); + } + } + + // Flush all dirty cache lines to HBM before kernel exit. + dcci(my_hank, SINGLE_CACHE_LINE, CACHELINE_OUT); +} diff --git a/src/a2a3/runtime/tensormap_and_ringbuffer_unmodified/aicpu/aicpu_executor.cpp b/src/a2a3/runtime/tensormap_and_ringbuffer_unmodified/aicpu/aicpu_executor.cpp new file mode 100644 index 000000000..79b440e09 --- /dev/null +++ b/src/a2a3/runtime/tensormap_and_ringbuffer_unmodified/aicpu/aicpu_executor.cpp @@ -0,0 +1,2473 @@ +/* + * Copyright (c) PyPTO Contributors. + * This program is free software, you can redistribute it and/or modify it under the terms and conditions of + * CANN Open Software License Agreement Version 2.0 (the "License"). + * Please refer to the License for details. You may not use this file except in compliance with the License. + * THIS SOFTWARE IS PROVIDED ON AN "AS IS" BASIS, WITHOUT WARRANTIES OF ANY KIND, EITHER EXPRESS OR IMPLIED, + * INCLUDING BUT NOT LIMITED TO NON-INFRINGEMENT, MERCHANTABILITY, OR FITNESS FOR A PARTICULAR PURPOSE. + * See LICENSE in the root of the software repository for the full text of the License. 
+ * ----------------------------------------------------------------------------------------------------------- + */ +#include +#include +#include + +#include +#include +#include +#include +#include +#include +#include +#ifdef __linux__ +#include +#endif + +#include "aicpu/device_log.h" +#include "aicpu/device_time.h" +#include "pto2_dispatch_payload.h" +#include "runtime.h" +#include "spin_hint.h" + +// Runtime headers (full struct definition for create/destroy + PTO2_SCOPE) +#include "pto_runtime2.h" +#include "pto_runtime2_types.h" +#include "pto_shared_memory.h" + +// Performance profiling headers +#include "aicpu/performance_collector_aicpu.h" +#include "common/memory_barrier.h" +#include "common/perf_profiling.h" +#include "common/unified_log.h" + +// Register-based communication +#include "aicpu/platform_regs.h" +#include "common/platform_config.h" + +// Core type definitions +#include "common/core_type.h" + +// CoreCallable for resolved dispatch address +#include "callable.h" + +#if PTO2_PROFILING +// Accumulated nanoseconds per sub-step +#define CYCLE_COUNT_START() uint64_t _t0 = get_sys_cnt_aicpu(), _t1 +#define CYCLE_COUNT_LAP(acc) \ + do { \ + _t1 = get_sys_cnt_aicpu(); \ + acc += (_t1 - _t0); \ + _t0 = _t1; \ + } while (0) +#else +#define CYCLE_COUNT_START() +#define CYCLE_COUNT_LAP(acc) +#endif + +// Device orchestration function signature (loaded via dlopen). +// The executor binds the current thread's PTO2Runtime into orchestration TLS +// before calling the user entry. 
+typedef void (*DeviceOrchestrationFunc)( + const ChipStorageTaskArgs& orch_args, int32_t orch_thread_num, int32_t orch_thread_index); +typedef void (*DeviceOrchestrationBindRuntimeFunc)(PTO2Runtime* rt); + +// Config function exported by orchestration .so +typedef PTO2OrchestrationConfig (*DeviceOrchestrationConfigFunc)(const ChipStorageTaskArgs& orch_args); + +constexpr int32_t MAX_AICPU_THREADS = PLATFORM_MAX_AICPU_THREADS; +constexpr int32_t MAX_AIC_PER_THREAD = PLATFORM_MAX_AIC_PER_THREAD; +constexpr int32_t MAX_AIV_PER_THREAD = PLATFORM_MAX_AIV_PER_THREAD; +constexpr int32_t MAX_CORES_PER_THREAD = PLATFORM_MAX_CORES_PER_THREAD; + +constexpr int32_t MAX_IDLE_ITERATIONS = 800000; // ~20s idle then scheduler gives up (avoid long hang) +constexpr int32_t STALL_LOG_INTERVAL = 50000; // DEV_ALWAYS every N idle iters to debug hang +constexpr int32_t FATAL_ERROR_CHECK_INTERVAL = 1024; // Check orchestrator error every N idle iters +constexpr int32_t STALL_DUMP_READY_MAX = 8; +constexpr int32_t STALL_DUMP_WAIT_MAX = 4; +constexpr int32_t STALL_DUMP_CORE_MAX = 8; +constexpr int32_t PROGRESS_VERBOSE_THRESHOLD = 10; // log every completion for the first N tasks +constexpr int32_t PROGRESS_LOG_INTERVAL = 250; // log every N completions after threshold + +static PTO2Runtime* rt{nullptr}; + +// Per-core dispatch payload storage (one aligned cache line per physical core) +static PTO2DispatchPayload s_pto2_payload_per_core[RUNTIME_MAX_WORKER]; + +// Per-core state: one cache line per core to eliminate false sharing +// and co-locate all hot-path fields for minimal cache misses. 
+struct alignas(64) CoreExecState { + // --- Hot fields (completion + dispatch, every iteration) --- + uint64_t reg_addr; // offset 0: register address (set once in handshake) + PTO2TaskSlotState* executing_slot_state; // offset 8: slot state for running task + int32_t executing_reg_task_id; // offset 16: register task ID (AICPU_TASK_INVALID = idle) + uint32_t dispatch_seq; // offset 20: monotonic dispatch counter + PTO2SubtaskSlot executing_subslot; // offset 24: which subtask slot is running + uint8_t pad_[3]; // offset 25: alignment padding +#if PTO2_PROFILING + // --- Profiling fields (dispatch path, compile-time gated) --- + uint32_t dispatch_count; // offset 28: dispatched task count (buffer mgmt) + uint64_t dispatch_timestamp; // offset 32: AICPU dispatch timestamp +#endif + // --- Cold fields (init/diagnostics only, never in hot path) --- + int32_t worker_id; // index in runtime.workers[] + uint32_t physical_core_id; // hardware physical core ID + CoreType core_type; // AIC or AIV +}; +static_assert(sizeof(CoreExecState) == 64, "CoreExecState must occupy exactly one cache line"); + +// core_states_ encodes per-cluster core idle/running in 3 bits per cluster: +// bit i*3 = AIC of cluster i (1 = idle, 0 = running) +// bit i*3+1 = AIV0 of cluster i +// bit i*3+2 = AIV1 of cluster i +// Max 21 clusters per tracker (63 bits in uint64_t). 
+class alignas(64) CoreTracker { + public: + static inline int32_t MAX_CORE_PER_THREAD = 63; + static constexpr int32_t MAX_CLUSTERS = 63 / 3; + + public: + CoreTracker() = default; + + class BitStates { + public: // NOLINT(whitespace/indent) + BitStates() = default; + + explicit BitStates(uint64_t states) : states_(states) {} + void init() { states_ = 0; } + + BitStates operator~() const { return BitStates(~states_); } + BitStates operator&(const BitStates& other) const { return BitStates(states_ & other.states_); } + BitStates operator|(const BitStates& other) const { return BitStates(states_ | other.states_); } + BitStates operator^(const BitStates& other) const { return BitStates(states_ ^ other.states_); } + BitStates operator>>(int32_t offset) const { return BitStates(states_ >> offset); } + BitStates operator<<(int32_t offset) const { return BitStates(states_ << offset); } + void operator&=(const BitStates& other) { states_ &= other.states_; } + void operator|=(const BitStates& other) { states_ |= other.states_; } + void operator^=(const BitStates& other) { states_ ^= other.states_; } + + bool has_value() const { return states_ > 0; } + int32_t count() const { return __builtin_popcountll(states_); } + + // Extract the lowest set bit from mask, clear it, and return its position. + // Returns -1 if mask is empty. 
+    int32_t pop_first() {
+        if (states_ == 0) return -1;
+        int32_t pos = __builtin_ctzll(states_);
+        states_ &= states_ - 1;
+        return pos;
+    }
+
+   private:  // NOLINT(whitespace/indent)
+    uint64_t states_{0};
+  };
+
+ public:
+    void init(int32_t cluster_count) {
+        cluster_count_ = cluster_count;
+        aic_mask_.init();
+        aiv_mask_.init();
+        for (int32_t i = 0; i < cluster_count; i++) {
+            aic_mask_ |= BitStates(1ULL << (i * 3));
+            aiv_mask_ |= BitStates(6ULL << (i * 3));
+        }
+        core_states_ = aic_mask_ | aiv_mask_;
+    }
+
+    void set_cluster(int32_t cluster_idx, int32_t aic_wid, int32_t aiv0_wid, int32_t aiv1_wid) {
+        core_id_map_[cluster_idx * 3] = aic_wid;
+        core_id_map_[cluster_idx * 3 + 1] = aiv0_wid;
+        core_id_map_[cluster_idx * 3 + 2] = aiv1_wid;
+    }
+
+    int32_t get_cluster_count() const { return cluster_count_; }
+
+    // --- Running core queries ---
+
+    template <CoreType CT>
+    bool has_running_cores() const {
+        if constexpr (CT == CoreType::AIC) {
+            return ((~core_states_) & aic_mask_).has_value();
+        } else {
+            return ((~core_states_) & aiv_mask_).has_value();
+        }
+    }
+
+    bool has_any_running_cores() const { return ((~core_states_) & (aic_mask_ | aiv_mask_)).has_value(); }
+
+    template <CoreType CT>
+    int32_t get_running_count() const {
+        if constexpr (CT == CoreType::AIC) {
+            return ((~core_states_) & aic_mask_).count();
+        } else {
+            return ((~core_states_) & aiv_mask_).count();
+        }
+    }
+
+    // Return an opaque bitmask for iterating running cores of a given type.
+    // Use pop_first() to extract core bit offsets one at a time.
+    template <CoreType CT>
+    BitStates get_running_cores() const {
+        if constexpr (CT == CoreType::AIC) {
+            return (~core_states_) & aic_mask_;
+        } else {
+            return (~core_states_) & aiv_mask_;
+        }
+    }
+
+    BitStates get_all_running_cores() const { return (~core_states_) & (aic_mask_ | aiv_mask_); }
+
+    // --- Cluster matching ---
+
+    BitStates get_valid_cluster_offset_states(PTO2ResourceShape shape) const {
+        switch (shape) {
+            case PTO2ResourceShape::AIC:
+                return core_states_ & aic_mask_;
+            case PTO2ResourceShape::AIV:
+                return ((core_states_ >> 1) | (core_states_ >> 2)) & aic_mask_;
+            case PTO2ResourceShape::MIX:
+                return (core_states_ >> 1) & (core_states_ >> 2) & core_states_ & aic_mask_;
+        }
+        return BitStates(0ULL);
+    }
+
+    int32_t get_aic_core_id(int32_t cluster_offset) const { return core_id_map_[cluster_offset]; }
+    int32_t get_aiv0_core_id(int32_t cluster_offset) const { return core_id_map_[cluster_offset + 1]; }
+    int32_t get_aiv1_core_id(int32_t cluster_offset) const { return core_id_map_[cluster_offset + 2]; }
+
+    int32_t get_aic_core_offset(int32_t cluster_offset) const { return cluster_offset; }
+    int32_t get_aiv0_core_offset(int32_t cluster_offset) const { return cluster_offset + 1; }
+    int32_t get_aiv1_core_offset(int32_t cluster_offset) const { return cluster_offset + 2; }
+
+    bool is_aic_core_idle(int32_t cluster_offset) const {
+        return ((core_states_ >> cluster_offset) & BitStates(1ULL)).has_value();
+    }
+    bool is_aiv0_core_idle(int32_t cluster_offset) const {
+        return ((core_states_ >> (cluster_offset + 1)) & BitStates(1ULL)).has_value();
+    }
+    bool is_aiv1_core_idle(int32_t cluster_offset) const {
+        return ((core_states_ >> (cluster_offset + 2)) & BitStates(1ULL)).has_value();
+    }
+
+    // --- State mutation ---
+
+    // Toggle bit at the given bit offset (running <-> idle)
+    void change_core_state(int32_t bit_offset) { core_states_ ^= BitStates(1ULL << bit_offset); }
+
+    // --- Bit offset <-> worker_id mapping ---
+
+    int32_t get_core_id_by_offset(int32_t
offset) const { return core_id_map_[offset]; }
+
+ private:
+    int32_t cluster_count_;
+    BitStates aic_mask_;
+    BitStates aiv_mask_;
+    BitStates core_states_;
+    int32_t core_id_map_[63];  // bit_position -> worker_id, max 21 clusters * 3
+};
+
+struct AicpuExecutor {
+    int32_t orch_thread_num_;
+    int32_t sched_thread_num_;
+    bool orch_to_sched_{false};
+
+    // ===== Thread management state =====
+    std::atomic<int32_t> thread_idx_{0};
+    std::atomic<bool> initialized_{false};
+    std::atomic<bool> init_done_{false};
+    std::atomic<bool> init_failed_{false};
+    std::atomic<bool> finished_{false};
+
+    int32_t thread_num_{0};
+    int32_t cores_total_num_{0};
+    int32_t thread_cores_num_{0};  // Cores per scheduler thread (0 for orchestrator when thread_num_==4)
+    int32_t core_count_per_thread_[MAX_AICPU_THREADS];  // Actual core count per thread
+    int32_t core_assignments_[MAX_AICPU_THREADS][MAX_CORES_PER_THREAD];
+
+    // Per-core execution state, indexed by core_id (= worker_id)
+    CoreExecState core_exec_states_[RUNTIME_MAX_WORKER];
+
+    // Cluster-ordered worker_id lists for core assignment (init-only)
+    int32_t aic_worker_ids_[MAX_CORES_PER_THREAD];
+    int32_t aiv_worker_ids_[MAX_CORES_PER_THREAD];
+    int32_t aic_count_{0};
+    int32_t aiv_count_{0};
+
+    // Platform register base address array (set via get_platform_regs())
+    uint64_t regs_{0};
+
+    CoreTracker core_trackers_[MAX_AICPU_THREADS];
+
+    // ===== Task queue state (managed by scheduler ready queues) =====
+
+    // Task execution tracking
+    std::atomic<int32_t> completed_tasks_{0};
+    int32_t total_tasks_{0};
+    std::atomic<int32_t> finished_count_{0};
+    // Device orchestration: set by last orchestrator when graph is built; schedulers poll it.
+    // volatile prevents the compiler from hoisting the load out of spin loops.
+    volatile bool orchestrator_done_{false};
+    std::atomic<bool> pto2_init_done_{false};
+    std::atomic<bool> runtime_init_ready_{false};
+    std::atomic<bool> pto2_init_complete_{false};  // init block finished; others wait for this
+    std::atomic<int32_t> orch_finished_count_{0};  // Number of orchestrator threads that have finished
+
+    // ===== Dynamic core transition state =====
+    std::atomic<bool> transition_requested_{false};
+    std::atomic<int32_t> wait_reassign_{0};
+    std::atomic<bool> reassigned_{false};
+    std::atomic<bool> completed_{false};
+
+    // Orchestration SO handle - defer dlclose until all tasks complete
+    void* orch_so_handle_{nullptr};
+    char orch_so_path_[256]{};  // Path to orchestration SO file for cleanup
+
+    // Shared orchestration function pointer (loaded by first orch thread, used by all)
+    DeviceOrchestrationFunc orch_func_{nullptr};
+    DeviceOrchestrationBindRuntimeFunc orch_bind_runtime_{nullptr};
+    const ChipStorageTaskArgs* orch_args_cached_{nullptr};
+
+    uint64_t* func_id_to_addr_;
+    uint64_t get_function_bin_addr(int func_id) const {
+        if (func_id < 0 || func_id >= RUNTIME_MAX_FUNC_ID) return 0;
+        return func_id_to_addr_[func_id];
+    }
+
+    // ===== Methods =====
+    int32_t init(Runtime* runtime);
+    int32_t handshake_all_cores(Runtime* runtime);
+    bool assign_cores_to_threads();
+    void reassign_cores_for_all_threads();
+    int32_t resolve_and_dispatch_pto2(Runtime* runtime, int32_t thread_idx);
+    int32_t shutdown_aicore(Runtime* runtime, int32_t thread_idx, const int32_t* cur_thread_cores, int32_t core_num);
+    int32_t run(Runtime* runtime);
+    void deinit(Runtime* runtime);
+    void emergency_shutdown(Runtime* runtime);
+    void diagnose_stuck_state(
+        Runtime* runtime, int32_t thread_idx, const int32_t* cur_thread_cores, int32_t core_num, Handshake* hank);
+
+    template <CoreType CT>
+    void check_running_cores_for_completion(int32_t thread_idx,
+                                            Handshake* hank,
+                                            int32_t& completed_this_turn,
+                                            int32_t& cur_thread_completed,
+                                            bool& made_progress,
+                                            PTO2TaskSlotState* deferred_release_slot_states[],
+                                            int32_t&
deferred_release_count,
+                                            PTO2LocalReadyBuffer* local_bufs
+#if PTO2_PROFILING
+                                            ,
+                                            bool profiling_enabled,
+                                            uint32_t& phase_complete_count
+#endif
+#if PTO2_SCHED_PROFILING
+                                            ,
+                                            uint64_t& complete_probe_count,
+                                            uint64_t& complete_hit_count,
+                                            uint64_t& notify_edges_total,
+                                            int32_t& notify_max_degree,
+                                            uint64_t& notify_tasks_enqueued,
+                                            uint64_t& fanin_edges_total,
+                                            int32_t& fanin_max_degree,
+                                            uint64_t& sched_complete_perf_cycle
+#endif
+    ) {
+#if !PTO2_PROFILING
+        (void)hank;  // NOLINT(readability/casting)
+#endif
+        CoreTracker& tracker = core_trackers_[thread_idx];
+        auto running_core_states = tracker.get_running_cores<CT>();
+        while (running_core_states.has_value()) {
+            int32_t bit_pos = running_core_states.pop_first();
+            int32_t core_id = tracker.get_core_id_by_offset(bit_pos);
+            CoreExecState& core_exec_state = core_exec_states_[core_id];
+            uint64_t reg_addr = core_exec_state.reg_addr;
+
+            int32_t expected_reg_task_id = core_exec_state.executing_reg_task_id;
+            uint64_t reg_val = read_reg(reg_addr, RegId::COND);
+            int32_t reg_task_id = EXTRACT_TASK_ID(reg_val);
+            int32_t reg_state = EXTRACT_TASK_STATE(reg_val);
+            bool done = reg_task_id == expected_reg_task_id && reg_state == TASK_FIN_STATE;
+#if PTO2_SCHED_PROFILING
+            if (profiling_enabled) {
+                complete_probe_count++;
+                if (done) {
+                    complete_hit_count++;
+                }
+            }
+#endif
+
+            if (done) {
+                core_exec_state.executing_reg_task_id = AICPU_TASK_INVALID;
+                PTO2TaskSlotState& slot_state = *core_exec_state.executing_slot_state;
+
+                // Completion: increment atomic counter, trigger task-level completion on last subtask
+                bool mixed_complete = rt->scheduler.on_subtask_complete(slot_state);
+                if (mixed_complete) {
+#if PTO2_SCHED_PROFILING
+                    PTO2CompletionStats cstats =
+                        rt->scheduler.on_mixed_task_complete(slot_state, thread_idx, local_bufs);
+                    notify_edges_total += cstats.fanout_edges;
+                    if (cstats.fanout_edges > notify_max_degree) notify_max_degree = cstats.fanout_edges;
+                    notify_tasks_enqueued +=
cstats.tasks_enqueued;
+                    phase_complete_count++;
+#else
+                    rt->scheduler.on_mixed_task_complete(slot_state, local_bufs);
+#if PTO2_PROFILING
+                    phase_complete_count++;
+#endif
+#endif
+                    if (deferred_release_count < 256) {
+                        deferred_release_slot_states[deferred_release_count++] = &slot_state;
+                    } else {
+                        DEV_ALWAYS("Thread %d: release", thread_idx);
+                        while (deferred_release_count > 0) {
+#if PTO2_SCHED_PROFILING
+                            int32_t fe = rt->scheduler.on_task_release(
+                                *deferred_release_slot_states[--deferred_release_count], thread_idx);
+#else
+                            int32_t fe =
+                                rt->scheduler.on_task_release(*deferred_release_slot_states[--deferred_release_count]);
+#endif
+                            (void)fe;  // NOLINT(readability/casting)
+#if PTO2_SCHED_PROFILING
+                            fanin_edges_total += fe;
+                            if (fe > fanin_max_degree) fanin_max_degree = fe;
+#endif
+                        }
+                        deferred_release_slot_states[deferred_release_count++] = &slot_state;
+                    }
+                }
+                tracker.change_core_state(bit_pos);
+#if PTO2_PROFILING
+                if (profiling_enabled) {
+#if PTO2_SCHED_PROFILING
+                    uint64_t t_perf_start = get_sys_cnt_aicpu();
+#endif
+                    Handshake* h = &hank[core_id];
+                    uint64_t finish_ts = get_sys_cnt_aicpu();
+                    PerfBuffer* perf_buf = reinterpret_cast<PerfBuffer*>(h->perf_records_addr);
+
+                    // Pre-extract fanout (platform layer cannot depend on PTO2DepListEntry)
+                    uint64_t fanout_arr[RUNTIME_MAX_FANOUT];
+                    int32_t fanout_n = 0;
+                    PTO2DepListEntry* cur = slot_state.fanout_head;
+                    while (cur != nullptr && fanout_n < RUNTIME_MAX_FANOUT) {
+                        fanout_arr[fanout_n++] = cur->slot_state->task->task_id.raw;
+                        cur = cur->next;
+                    }
+
+                    int32_t perf_slot_idx = static_cast<int32_t>(core_exec_state.executing_subslot);
+                    if (perf_aicpu_complete_record(perf_buf,
+                                                   static_cast<uint32_t>(expected_reg_task_id),
+                                                   slot_state.task->task_id.raw,
+                                                   slot_state.task->kernel_id[perf_slot_idx],
+                                                   CT,
+                                                   core_exec_state.dispatch_timestamp,
+                                                   finish_ts,
+                                                   fanout_arr,
+                                                   fanout_n) != 0) {
+                        DEV_ERROR("Core %d: perf_aicpu_complete_record failed for task 0x%" PRIx64,
+                                  core_id,
static_cast<uint64_t>(slot_state.task->task_id.raw));
+                    }
+#if PTO2_SCHED_PROFILING
+                    sched_complete_perf_cycle += (get_sys_cnt_aicpu() - t_perf_start);
+#endif
+                }
+#endif
+
+                DEV_DEBUG("Thread %d: %s core %d completed PTO2 task %d (mixed_complete=%d)",
+                          thread_idx,
+                          CT == CoreType::AIC ? "AIC" : "AIV",
+                          core_id,
+                          expected_reg_task_id,
+                          mixed_complete ? 1 : 0);
+                cur_thread_completed++;
+                if (mixed_complete) {
+                    completed_this_turn++;
+                }
+                made_progress = true;
+            }
+        }
+    }
+
+    static const char* shape_name(PTO2ResourceShape shape) {
+        switch (shape) {
+            case PTO2ResourceShape::AIC:
+                return "AIC";
+            case PTO2ResourceShape::AIV:
+                return "AIV";
+            case PTO2ResourceShape::MIX:
+                return "MIX";
+        }
+        return "UNKNOWN";
+    }
+
+    /**
+     * Returns the dispatch probe order for a given scheduler thread.
+     * Widest shapes first to avoid consuming cluster resources with narrow tasks.
+     * Even/odd threads use different fallback orders (AIC-first vs AIV-first)
+     * to reduce contention on the same ready queue across adjacent threads.
+     */
+    static const PTO2ResourceShape* get_dispatch_order(int32_t thread_idx) {
+        // Even threads: AIC-first fallback after widest
+        static constexpr PTO2ResourceShape kEvenOrder[PTO2_NUM_RESOURCE_SHAPES] = {
+            PTO2ResourceShape::MIX,
+            PTO2ResourceShape::AIC,
+            PTO2ResourceShape::AIV,
+        };
+        // Odd threads: AIV-first fallback after widest
+        static constexpr PTO2ResourceShape kOddOrder[PTO2_NUM_RESOURCE_SHAPES] = {
+            PTO2ResourceShape::MIX,
+            PTO2ResourceShape::AIV,
+            PTO2ResourceShape::AIC,
+        };
+        return (thread_idx % 2 == 0) ?
kEvenOrder : kOddOrder; + } + + int pop_ready_tasks_batch(PTO2ResourceShape shape, + int32_t thread_idx, + PTO2LocalReadyBuffer& local_buf, + PTO2TaskSlotState** out, + int max_count +#if PTO2_SCHED_PROFILING + , + uint64_t& pop_hit, + uint64_t& pop_miss, + uint64_t& local_dispatch_count, + uint64_t& sched_dispatch_pop_cycle +#endif + ) { + (void)thread_idx; // NOLINT(readability/casting) +#if PTO2_SCHED_PROFILING + extern uint64_t g_sched_pop_atomic_count[], g_sched_pop_wait_cycle[]; + uint64_t t_pop_start = get_sys_cnt_aicpu(); + int count = rt->scheduler.get_ready_tasks_batch(shape, + local_buf, + out, + max_count, + g_sched_pop_atomic_count[thread_idx], + g_sched_pop_wait_cycle[thread_idx], + local_dispatch_count); + sched_dispatch_pop_cycle += (get_sys_cnt_aicpu() - t_pop_start); + if (count > 0) { + pop_hit += count; + } else { + pop_miss++; + } +#else + int count = rt->scheduler.get_ready_tasks_batch(shape, local_buf, out, max_count); +#endif + return count; + } + + /** + * Build per-core dispatch payload: copy tensor pointers and scalars into + * the per-core args[] array, then populate SPMD local context at the tail. + * + * Reads next_block_idx and block_num directly from the task descriptor + * to populate LocalContext. The caller is responsible for incrementing + * next_block_idx AFTER dispatch. + * + * GlobalContext (sub_block_id) is NOT written here — it is initialized once + * at runtime startup by init_global_context(). 
+     */
+    void build_payload(PTO2DispatchPayload& dispatch_payload, PTO2TaskSlotState& slot_state, PTO2SubtaskSlot subslot) {
+        int32_t slot_idx = static_cast<int32_t>(subslot);
+        uint64_t callable_addr = get_function_bin_addr(slot_state.task->kernel_id[slot_idx]);
+        const CoreCallable* callable = reinterpret_cast<const CoreCallable*>(callable_addr);
+        dispatch_payload.function_bin_addr = callable->resolved_addr();
+        auto& payload = *slot_state.payload;
+        int n = 0;
+        for (int32_t i = 0; i < payload.tensor_count; i++) {
+            dispatch_payload.args[n++] = reinterpret_cast<uint64_t>(&payload.tensors[i]);
+        }
+        for (int32_t i = 0; i < payload.scalar_count; i++) {
+            dispatch_payload.args[n++] = payload.scalars[i];
+        }
+        // Per-dispatch local context (read from slot state)
+        dispatch_payload.local_context.block_idx = slot_state.next_block_idx;
+        dispatch_payload.local_context.block_num = slot_state.block_num;
+        // Store context pointers at fixed suffix positions in args[]
+        // (GlobalContext content is already set by init_global_context, but the
+        // pointer must be written each dispatch since args[] is rebuilt entirely)
+        dispatch_payload.args[SPMD_LOCAL_CONTEXT_INDEX] = reinterpret_cast<uint64_t>(&dispatch_payload.local_context);
+        dispatch_payload.args[SPMD_GLOBAL_CONTEXT_INDEX] = reinterpret_cast<uint64_t>(&dispatch_payload.global_context);
+    }
+
+    void dispatch_subtask_to_core(Runtime* runtime,
+                                  int32_t thread_idx,
+                                  int32_t core_offset,
+                                  PTO2TaskSlotState& slot_state,
+                                  PTO2SubtaskSlot subslot
+#if PTO2_PROFILING
+                                  ,
+                                  bool profiling_enabled
+#endif
+    ) {
+        CoreTracker& tracker = core_trackers_[thread_idx];
+        auto core_id = tracker.get_core_id_by_offset(core_offset);
+#if !PTO2_PROFILING
+        (void)runtime;  // NOLINT(readability/casting)
+#endif
+        CoreExecState& core_exec_state = core_exec_states_[core_id];
+        PTO2DispatchPayload& payload = s_pto2_payload_per_core[core_id];
+        build_payload(payload, slot_state, subslot);
+        core_exec_state.executing_subslot = subslot;
+        core_exec_state.executing_slot_state = &slot_state;
+#if
PTO2_PROFILING
+        if (profiling_enabled) {
+            core_exec_state.dispatch_timestamp = get_sys_cnt_aicpu();
+            if (core_exec_state.dispatch_count >= PLATFORM_PROF_BUFFER_SIZE) {
+                perf_aicpu_switch_buffer(runtime, core_id, thread_idx);
+                core_exec_state.dispatch_count = 0;
+            }
+            core_exec_state.dispatch_count++;
+        }
+#endif
+        // Per-core monotonic counter for register protocol uniqueness (32-bit).
+        // PTO2 task_id encodes (ring_id << 32 | local_id); truncation to uint32 loses ring_id,
+        // so tasks from different rings with the same local_id would write identical DATA_MAIN_BASE
+        // values. The AICore uses last_reg_val to detect new dispatches and would skip the
+        // duplicate, while the stale COND register from the previous task (same local_id) would
+        // cause a false-positive completion.
+        // PerfRecord.task_id: register token (low 32) until AICPU overwrites with full (ring_id << 32 | local_id).
+        core_exec_state.dispatch_seq++;
+        uint32_t reg_task_id = core_exec_state.dispatch_seq & TASK_ID_MASK;
+        // Skip reserved sentinel range [AICORE_EXIT_SIGNAL, 0x7FFFFFFF]
+        if (reg_task_id >= AICORE_EXIT_SIGNAL) {
+            core_exec_state.dispatch_seq += (TASK_ID_MASK - reg_task_id + 1);
+            reg_task_id = core_exec_state.dispatch_seq & TASK_ID_MASK;
+        }
+        write_reg(core_exec_state.reg_addr, RegId::DATA_MAIN_BASE, static_cast<uint64_t>(reg_task_id));
+
+        tracker.change_core_state(core_offset);
+        core_exec_state.executing_reg_task_id = reg_task_id;
+    }
+};
+
+static AicpuExecutor g_aicpu_executor;
+
+// ===== AicpuExecutor Method Implementations =====
+
+/**
+ * Handshake with all cores and discover their types
+ * Sets up register addresses for fast dispatch.
+ */
+int32_t AicpuExecutor::handshake_all_cores(Runtime* runtime) {
+    Handshake* all_handshakes = reinterpret_cast<Handshake*>(runtime->workers);
+    cores_total_num_ = runtime->worker_count;
+
+    // Validate cores_total_num_ before using as array index
+    if (cores_total_num_ == 0 || cores_total_num_ > MAX_CORES_PER_THREAD) {
+        DEV_ERROR("Invalid cores_total_num %d (expected 1-%d)", cores_total_num_, MAX_CORES_PER_THREAD);
+        return -1;
+    }
+
+    aic_count_ = 0;
+    aiv_count_ = 0;
+
+    DEV_INFO("Handshaking with %d cores", cores_total_num_);
+
+    // Step 1: Write per-core payload addresses and send handshake signal
+    // OUT_OF_ORDER_STORE_BARRIER() ensures task is globally visible before
+    // aicpu_ready=1, so AICore reads the correct payload pointer after waking up.
+    for (int32_t i = 0; i < cores_total_num_; i++) {
+        all_handshakes[i].task = reinterpret_cast<uint64_t>(&s_pto2_payload_per_core[i]);
+        OUT_OF_ORDER_STORE_BARRIER();
+        all_handshakes[i].aicpu_ready = 1;
+    }
+
+    // Get platform physical cores count for validation
+    uint32_t max_physical_cores_count = platform_get_physical_cores_count();
+
+    // Step 2: Wait for all cores to respond, collect core type and register addresses
+    bool handshake_failed = false;
+    for (int32_t i = 0; i < cores_total_num_; i++) {
+        Handshake* hank = &all_handshakes[i];
+
+        while (hank->aicore_regs_ready == 0) {
+        }
+
+        uint32_t physical_core_id = hank->physical_core_id;
+
+        // Validate physical_core_id before using as array index
+        if (physical_core_id >= max_physical_cores_count) {
+            DEV_ERROR("Core %d reported invalid physical_core_id=%u (platform max=%u)",
+                      i,
+                      physical_core_id,
+                      max_physical_cores_count);
+            handshake_failed = true;
+            continue;
+        }
+
+        // Get register address using physical_core_id
+        uint64_t* regs = reinterpret_cast<uint64_t*>(regs_);
+        uint64_t reg_addr = regs[physical_core_id];
+
+        // Initialize AICore registers after discovery (first round)
+        platform_init_aicore_regs(reg_addr);
+        hank->aicpu_regs_ready = 1;
+
+        while (hank->aicore_done ==
0) { + } + + CoreType type = hank->core_type; + + core_exec_states_[i].reg_addr = reg_addr; + core_exec_states_[i].worker_id = i; + core_exec_states_[i].physical_core_id = physical_core_id; + core_exec_states_[i].core_type = type; + + if (type == CoreType::AIC) { + aic_worker_ids_[aic_count_++] = i; + DEV_INFO("Core %d: AIC, physical_id=%u, reg_addr=0x%lx", i, physical_core_id, reg_addr); + } else { + aiv_worker_ids_[aiv_count_++] = i; + DEV_INFO("Core %d: AIV, physical_id=%u, reg_addr=0x%lx", i, physical_core_id, reg_addr); + } + } + + if (handshake_failed) { + emergency_shutdown(runtime); + return -1; + } + + DEV_INFO("Core discovery complete: %d AIC, %d AIV", aic_count_, aiv_count_); + return 0; +} + +/** + * Assign discovered cores to scheduler threads + * (Aligned with host_build_graph mechanism) + */ +bool AicpuExecutor::assign_cores_to_threads() { + // Cluster-aligned round-robin assignment: cluster ci -> sched thread ci % divisor. + // Each cluster = 1 AIC + 2 adjacent AIV; the triple is always kept together. + int32_t divisor = (sched_thread_num_ > 0) ? sched_thread_num_ : thread_num_; + int32_t cluster_count = aic_count_; + + // Max clusters any single sched thread can hold: ceil(cluster_count / divisor). 
+    int32_t max_clusters_per_thread = (cluster_count + divisor - 1) / divisor;
+    thread_cores_num_ = max_clusters_per_thread * 3;
+
+    if (thread_cores_num_ > CoreTracker::MAX_CORE_PER_THREAD) {
+        DEV_ERROR("Can't assign more than %d cores per scheduler thread", CoreTracker::MAX_CORE_PER_THREAD);
+        return false;
+    }
+
+    DEV_INFO("Assigning cores (round-robin): %d clusters across %d sched threads (%d AIC, %d AIV)",
+             cluster_count,
+             divisor,
+             aic_count_,
+             aiv_count_);
+
+    for (int32_t i = 0; i < MAX_CORES_PER_THREAD; i++) {
+        core_exec_states_[i].executing_reg_task_id = AICPU_TASK_INVALID;
+    }
+
+    // Count clusters per thread first (round-robin may distribute unevenly)
+    int32_t clusters_per_thread[MAX_AICPU_THREADS] = {};
+    for (int32_t ci = 0; ci < cluster_count; ci++) {
+        clusters_per_thread[ci % divisor]++;
+    }
+    for (int32_t i = 0; i < divisor; i++) {
+        core_trackers_[i].init(clusters_per_thread[i]);
+        core_count_per_thread_[i] = 0;
+    }
+
+    // Mark orchestrator threads explicitly (no cores).
+    for (int32_t t = divisor; t < thread_num_; t++) {
+        DEV_INFO("Thread %d: orchestrator (0 cores)", t);
+    }
+
+    // Per-sched-thread running core index used while filling core_assignments_.
+ int32_t core_idx[MAX_AICPU_THREADS] = {}; + int32_t cluster_idx_per_thread[MAX_AICPU_THREADS] = {}; + + for (int32_t ci = 0; ci < cluster_count; ci++) { + int32_t t = ci % divisor; + int32_t& idx = core_idx[t]; + + int32_t aic_wid = aic_worker_ids_[ci]; + int32_t aiv0_wid = aiv_worker_ids_[2 * ci]; + int32_t aiv1_wid = aiv_worker_ids_[2 * ci + 1]; + + core_trackers_[t].set_cluster(cluster_idx_per_thread[t]++, aic_wid, aiv0_wid, aiv1_wid); + + core_assignments_[t][idx++] = aic_wid; + core_assignments_[t][idx++] = aiv0_wid; + core_assignments_[t][idx++] = aiv1_wid; + + DEV_INFO("Thread %d: cluster %d (AIC=%d, AIV0=%d, AIV1=%d)", t, ci, aic_wid, aiv0_wid, aiv1_wid); + } + + for (int32_t t = 0; t < divisor; t++) { + core_count_per_thread_[t] = core_idx[t]; + DEV_INFO("Thread %d: total %d cores (%d clusters)", t, core_idx[t], core_trackers_[t].get_cluster_count()); + } + + return true; +} + +/** + * Reassign all cores evenly across all threads (schedulers + orchestrators). + * Called by the last orchestrator thread when orchestration completes. + * Writes into new_core_assignments_ / new_core_count_per_thread_. 
+ */ +void AicpuExecutor::reassign_cores_for_all_threads() { + DEV_INFO("Reassigning cores (cluster-aligned) for %d threads: %d AIC, %d AIV", thread_num_, aic_count_, aiv_count_); + + // Collect running worker_ids from all current trackers + bool running_cores[MAX_CORES_PER_THREAD] = {}; + for (int32_t i = 0; i < thread_num_; i++) { + auto all_running = core_trackers_[i].get_all_running_cores(); + int32_t bp; + while ((bp = all_running.pop_first()) >= 0) { + running_cores[core_trackers_[i].get_core_id_by_offset(bp)] = true; + } + } + + // Count clusters per thread (round-robin across all threads) + int32_t cluster_count = aic_count_; + int32_t clusters_per_thread[MAX_AICPU_THREADS] = {}; + for (int32_t ci = 0; ci < cluster_count; ci++) { + clusters_per_thread[ci % thread_num_]++; + } + + // Re-init all trackers and reset core counts + for (int32_t i = 0; i < thread_num_; i++) { + core_trackers_[i].init(clusters_per_thread[i]); + core_count_per_thread_[i] = 0; + } + + // Assign clusters round-robin and restore running state + int32_t cluster_idx_per_thread[MAX_AICPU_THREADS] = {}; + for (int32_t ci = 0; ci < cluster_count; ci++) { + int32_t t = ci % thread_num_; + + int32_t aic_wid = aic_worker_ids_[ci]; + int32_t aiv0_wid = aiv_worker_ids_[2 * ci]; + int32_t aiv1_wid = aiv_worker_ids_[2 * ci + 1]; + + int32_t cl_idx = cluster_idx_per_thread[t]++; + core_trackers_[t].set_cluster(cl_idx, aic_wid, aiv0_wid, aiv1_wid); + + // init() marks all idle; toggle cores that were running + if (running_cores[aic_wid]) { + core_trackers_[t].change_core_state(cl_idx * 3); + } + if (running_cores[aiv0_wid]) { + core_trackers_[t].change_core_state(cl_idx * 3 + 1); + } + if (running_cores[aiv1_wid]) { + core_trackers_[t].change_core_state(cl_idx * 3 + 2); + } + + core_assignments_[t][core_count_per_thread_[t]++] = aic_wid; + core_assignments_[t][core_count_per_thread_[t]++] = aiv0_wid; + core_assignments_[t][core_count_per_thread_[t]++] = aiv1_wid; + } + + // Log final distribution + 
    DEV_INFO("Core reassignment complete:");
+    for (int32_t t = 0; t < thread_num_; t++) {
+        int32_t aic_running = core_trackers_[t].get_running_count<PTO2ResourceShape::AIC>();
+        int32_t aiv_running = core_trackers_[t].get_running_count<PTO2ResourceShape::AIV>();
+        DEV_INFO(" Thread %d: %d cores, %d clusters (AIC running=%d, AIV running=%d)",
+            t,
+            core_count_per_thread_[t],
+            core_trackers_[t].get_cluster_count(),
+            aic_running,
+            aiv_running);
+    }
+}
+
+int32_t AicpuExecutor::init(Runtime* runtime) {
+    bool expected = false;
+    if (!initialized_.compare_exchange_strong(expected, true, std::memory_order_acq_rel, std::memory_order_acquire)) {
+        return 0;
+    }
+
+    DEV_INFO("AicpuExecutor: Initializing");
+
+    if (runtime == nullptr) {
+        DEV_ERROR("runtime is nullptr");
+        init_failed_.store(true, std::memory_order_release);
+        return -1;
+    }
+
+    func_id_to_addr_ = runtime->func_id_to_addr_;
+
+    // Read execution parameters from runtime
+    thread_num_ = runtime->sche_cpu_num;
+    orch_thread_num_ = runtime->orch_thread_num;
+    sched_thread_num_ = thread_num_ - orch_thread_num_;
+    orch_to_sched_ = runtime->orch_to_sched;
+    if (thread_num_ == 0) thread_num_ = 1;
+
+    if (!orch_to_sched_ && sched_thread_num_ == 0) {
+        DEV_ERROR(
+            "no scheduler threads, and orchestrator threads do not transition to schedulers when finished; "
+            "set env PTO2_ORCH_TO_SCHED=1 or scale down the orchestrator thread count.");
+        init_failed_.store(true, std::memory_order_release);
+        return -1;
+    }
+
+    if (thread_num_ < 1 || thread_num_ > MAX_AICPU_THREADS) {
+        DEV_ERROR("Invalid thread_num: %d", thread_num_);
+        init_failed_.store(true, std::memory_order_release);
+        return -1;
+    }
+
+    // Zero all per-core execution state before handshake
+    memset(core_exec_states_, 0, sizeof(core_exec_states_));
+
+    // Use handshake mechanism to discover cores (aligned with host_build_graph)
+    int32_t rc = handshake_all_cores(runtime);
+    if (rc != 0) {
+        DEV_ERROR("handshake_all_cores failed");
+        init_failed_.store(true, std::memory_order_release);
+        return -1;
+    }
+
+    // Dynamically assign cores to threads
+    if (!assign_cores_to_threads()) {
+        return -1;
+    }
+
+    DEV_INFO("Config: threads=%d, cores=%d, cores_per_thread=%d", thread_num_, cores_total_num_, thread_cores_num_);
+
+    // Initialize runtime execution state
+    // Task count comes from PTO2 shared memory
+    if (runtime->get_pto2_gm_sm_ptr()) {
+        auto* header = static_cast<PTO2SharedMemoryHeader*>(runtime->get_pto2_gm_sm_ptr());
+        int32_t pto2_count = 0;
+        for (int r = 0; r < PTO2_MAX_RING_DEPTH; r++) {
+            pto2_count += header->rings[r].fc.current_task_index.load(std::memory_order_acquire);
+        }
+        total_tasks_ = pto2_count > 0 ? pto2_count : 0;
+    } else {
+        total_tasks_ = 0;
+    }
+    completed_tasks_.store(0, std::memory_order_release);
+    // Host orchestration: graph already built, no wait needed. Device orch: Thread 3 will set this.
+    bool orch_on_host = runtime->get_orch_built_on_host();
+    DEV_INFO("Init: orch_built_on_host=%d", orch_on_host ? 1 : 0);
+    orchestrator_done_ = orch_on_host;
+
+    // Initial ready tasks will be populated via scheduler ready queues
+
+    // Clear per-core dispatch payloads
+    memset(s_pto2_payload_per_core, 0, sizeof(s_pto2_payload_per_core));
+
+    // Initialize per-core GlobalContext (sub_block_id) based on cluster position.
+    // This is done once at startup and never modified afterwards.
+ for (int32_t t = 0; t < sched_thread_num_; t++) { + CoreTracker& tracker = core_trackers_[t]; + for (int32_t c = 0; c < tracker.get_cluster_count(); c++) { + int32_t cluster_offset = c * 3; // Each cluster = 1 AIC + 2 AIV + auto aiv0_id = tracker.get_core_id_by_offset(tracker.get_aiv0_core_offset(cluster_offset)); + auto aiv1_id = tracker.get_core_id_by_offset(tracker.get_aiv1_core_offset(cluster_offset)); + s_pto2_payload_per_core[aiv0_id].global_context.sub_block_id = 0; + s_pto2_payload_per_core[aiv1_id].global_context.sub_block_id = 1; + } + } + + DEV_INFO("Init: PTO2 mode, task count from shared memory"); + + finished_count_.store(0, std::memory_order_release); + + init_done_.store(true, std::memory_order_release); + DEV_INFO("AicpuExecutor: Init complete"); + return 0; +} + +/** + * Shutdown AICore - Send exit signal via registers to all AICore kernels + */ +int32_t AicpuExecutor::shutdown_aicore( + Runtime* runtime, int32_t thread_idx, const int32_t* cur_thread_cores, int32_t core_num) { + (void)runtime; // NOLINT(readability/casting) + if (core_num == 0) return 0; + + DEV_INFO("Thread %d: Shutting down %d cores", thread_idx, core_num); + + for (int32_t i = 0; i < core_num; i++) { + int32_t core_id = cur_thread_cores[i]; + uint64_t reg_addr = core_exec_states_[core_id].reg_addr; + if (reg_addr != 0) { + platform_deinit_aicore_regs(reg_addr); + } else { + DEV_ERROR("Thread %d: Core %d has invalid register address", thread_idx, core_id); + } + } + DEV_INFO("Thread %d: Shutdown complete", thread_idx); + return 0; +} + +int32_t AicpuExecutor::resolve_and_dispatch_pto2(Runtime* runtime, int32_t thread_idx) { + int32_t& core_num = core_count_per_thread_[thread_idx]; + CoreTracker& tracker = core_trackers_[thread_idx]; + DEV_INFO("Thread %d: resolve_and_dispatch_pto2 entry", thread_idx); + + void* sm_base = runtime->get_pto2_gm_sm_ptr(); + if (!sm_base) { + DEV_ERROR("PTO2 dispatch: sm_base is null"); + return -1; + } + DEV_INFO("Thread %d: sm_base=%p", 
thread_idx, sm_base);
+
+    PTO2SharedMemoryHeader* header = static_cast<PTO2SharedMemoryHeader*>(sm_base);
+    DEV_INFO("Thread %d: header=%p, task_desc_offset[0]=%lu, window_size=%lu",
+        thread_idx,
+        static_cast<void*>(header),
+        static_cast<unsigned long>(header->rings[0].task_descriptors_offset),
+        static_cast<unsigned long>(header->rings[0].task_window_size));
+
+    Handshake* hank = static_cast<Handshake*>(runtime->workers);
+    DEV_INFO("Thread %d: hank=%p, window_size=%lu",
+        thread_idx,
+        static_cast<void*>(hank),
+        static_cast<unsigned long>(header->rings[0].task_window_size));
+
+    // One-time init: assign perf buffers (one thread does it; others wait)
+    if (!pto2_init_done_.exchange(true, std::memory_order_acq_rel)) {
+        DEV_INFO("Thread %d: doing one-time init", thread_idx);
+
+#if PTO2_PROFILING
+        // Assign perf buffers to cores early so profiling captures all tasks
+        // (total_tasks written to header later when orchestrator completes)
+        if (runtime->enable_profiling) {
+            perf_aicpu_init_profiling(runtime);
+            // Initialize phase profiling for scheduler threads + orchestrator threads
+            perf_aicpu_init_phase_profiling(runtime, sched_thread_num_, orch_thread_num_);
+            perf_aicpu_set_orch_thread_idx(sched_thread_num_);
+        }
+#endif
+
+        DEV_INFO("Thread %d: one-time init done", thread_idx);
+        pto2_init_complete_.store(true, std::memory_order_release);
+    } else {
+        while (!pto2_init_complete_.load(std::memory_order_acquire)) {
+            SPIN_WAIT_HINT();
+        }
+    }
+
+    DEV_INFO("Thread %d: PTO2 dispatch starting with %d cores", thread_idx, core_num);
+    int32_t cur_thread_completed = 0;
+    int32_t idle_iterations = 0;
+    int32_t last_progress_count = 0;
+#if PTO2_PROFILING
+    bool profiling_enabled = runtime->enable_profiling;
+#endif
+
+    // Scheduler profiling counters
+#if PTO2_PROFILING
+    uint64_t sched_scan_cycle = 0;
+    uint64_t sched_complete_cycle = 0;
+    uint64_t sched_dispatch_cycle = 0;
+    uint64_t sched_idle_cycle = 0;
+    uint64_t sched_loop_count = 0;
+    uint32_t phase_complete_count = 0;
+    uint32_t phase_dispatch_count = 0;
+#if PTO2_SCHED_PROFILING
+    uint64_t
complete_probe_count = 0; + uint64_t complete_hit_count = 0; + uint64_t notify_edges_total = 0; + int32_t notify_max_degree = 0; + uint64_t notify_tasks_enqueued = 0; + uint64_t fanin_edges_total = 0; + int32_t fanin_max_degree = 0; + uint64_t pop_hit = 0; + uint64_t pop_miss = 0; + uint64_t local_dispatch_count = 0; + uint64_t local_overflow_count = 0; + uint64_t sched_complete_perf_cycle = 0; + uint64_t sched_dispatch_pop_cycle = 0; + uint64_t sched_dispatch_setup_cycle = 0; +#endif +#endif + + // Local-first dispatch buffers (stack-allocated, one per CoreType per scheduling thread). + // Initialized once; must be empty at the start of each iteration. + constexpr int LOCAL_READY_CAP_PER_TYPE = 64; + PTO2TaskSlotState* local_ptrs[PTO2_NUM_RESOURCE_SHAPES][LOCAL_READY_CAP_PER_TYPE]; + PTO2LocalReadyBuffer local_bufs[PTO2_NUM_RESOURCE_SHAPES]; + for (int32_t i = 0; i < PTO2_NUM_RESOURCE_SHAPES; i++) { + local_bufs[i].reset(local_ptrs[i], LOCAL_READY_CAP_PER_TYPE); + } + PTO2TaskSlotState* deferred_release_slot_states[256]; + int32_t deferred_release_count = 0; + + bool cores_released = false; + +#if PTO2_PROFILING + uint64_t sched_start_ts = get_sys_cnt_aicpu(); +#endif + + while (true) { + bool made_progress = false; +#if PTO2_PROFILING + CYCLE_COUNT_START(); + sched_loop_count++; + uint64_t _t0_phase = _t0; +#endif + int32_t task_count = 0; + if (!tracker.has_any_running_cores()) { + bool orch_done = orchestrator_done_; + if (orch_done) { + // Check for orchestrator fatal error — exit immediately + int32_t orch_err = header->orch_error_code.load(std::memory_order_acquire); + if (orch_err != PTO2_ERROR_NONE) { + DEV_ERROR( + "Thread %d: Fatal error (code=%d), sending EXIT_SIGNAL to all cores. 
" + "completed_tasks=%d, total_tasks=%d", + thread_idx, + orch_err, + completed_tasks_.load(std::memory_order_relaxed), + total_tasks_); + emergency_shutdown(runtime); + completed_.store(true, std::memory_order_release); + break; + } + + // Normal exit: all tasks complete + task_count = total_tasks_; + if (task_count > 0 && completed_tasks_.load(std::memory_order_relaxed) >= task_count) { + completed_.store(true, std::memory_order_release); + DEV_INFO("Thread %d: PTO2 completed tasks %d/%d", + thread_idx, + completed_tasks_.load(std::memory_order_relaxed), + task_count); + break; + } + } + } + + // Check for core transition request (execute once per thread) + if (!cores_released && orch_to_sched_ && transition_requested_.load(std::memory_order_acquire)) { + if (!reassigned_.load(std::memory_order_acquire)) { + wait_reassign_.fetch_add(1, std::memory_order_release); + while (!reassigned_.load(std::memory_order_acquire)) { + if (completed_.load(std::memory_order_acquire)) { + break; + } + SPIN_WAIT_HINT(); + } + if (completed_.load(std::memory_order_acquire)) { + break; + } + } + cores_released = true; + } + +#if PTO2_PROFILING + CYCLE_COUNT_LAP(sched_idle_cycle); +#endif + + // Process completed and dispatch FIRST to minimize Sched (dispatch→finish) latency. + // Sched time = finish_ts - dispatch_ts; recording finish_ts here at loop start reduces + // tail overhead (time from AICore done to AICPU recording finish). 
+
+        // Phase 1: Check running cores for completion, process and move to idle
+        int32_t completed_this_turn = 0;
+
+        // Check AIC running cores
+        bool try_completed = false;
+        if (tracker.has_running_cores<PTO2ResourceShape::AIC>()) {
+            try_completed = true;
+            check_running_cores_for_completion<PTO2ResourceShape::AIC>(thread_idx,
+                hank,
+                completed_this_turn,
+                cur_thread_completed,
+                made_progress,
+                deferred_release_slot_states,
+                deferred_release_count,
+                local_bufs
+#if PTO2_PROFILING
+                ,
+                profiling_enabled,
+                phase_complete_count
+#endif
+#if PTO2_SCHED_PROFILING
+                ,
+                complete_probe_count,
+                complete_hit_count,
+                notify_edges_total,
+                notify_max_degree,
+                notify_tasks_enqueued,
+                fanin_edges_total,
+                fanin_max_degree,
+                sched_complete_perf_cycle
+#endif
+            );
+        }
+
+        // Check AIV running cores
+        if (tracker.has_running_cores<PTO2ResourceShape::AIV>()) {
+            try_completed = true;
+            check_running_cores_for_completion<PTO2ResourceShape::AIV>(thread_idx,
+                hank,
+                completed_this_turn,
+                cur_thread_completed,
+                made_progress,
+                deferred_release_slot_states,
+                deferred_release_count,
+                local_bufs
+#if PTO2_PROFILING
+                ,
+                profiling_enabled,
+                phase_complete_count
+#endif
+#if PTO2_SCHED_PROFILING
+                ,
+                complete_probe_count,
+                complete_hit_count,
+                notify_edges_total,
+                notify_max_degree,
+                notify_tasks_enqueued,
+                fanin_edges_total,
+                fanin_max_degree,
+                sched_complete_perf_cycle
+#endif
+            );
+        }
+        if (completed_this_turn > 0) {
+#if PTO2_SCHED_PROFILING
+            rt->scheduler.tasks_completed.fetch_add(completed_this_turn, std::memory_order_relaxed);
+#endif
+            int32_t prev = completed_tasks_.fetch_add(completed_this_turn, std::memory_order_relaxed);
+            int32_t new_total = prev + completed_this_turn;
+            last_progress_count = new_total;
+            if (thread_idx == 0 && task_count > 0) {
+                if (new_total <= PROGRESS_VERBOSE_THRESHOLD ||
+                    new_total / PROGRESS_LOG_INTERVAL != prev / PROGRESS_LOG_INTERVAL || new_total >= task_count) {
+                    DEV_ALWAYS("PTO2 progress: completed=%d total=%d (%.1f%%)",
+                        new_total,
+                        task_count,
+                        100.0 * new_total / task_count);
+                }
+            }
+        }
+
+#if
PTO2_PROFILING + if (!try_completed) { + CYCLE_COUNT_LAP(sched_idle_cycle); + } else { + CYCLE_COUNT_LAP(sched_complete_cycle); + if (profiling_enabled && phase_complete_count > 0) { + perf_aicpu_record_phase( + thread_idx, AicpuPhaseId::SCHED_COMPLETE, _t0_phase, _t1, sched_loop_count, phase_complete_count); + _t0_phase = _t1; + phase_complete_count = 0; + } + } +#endif + + bool try_pushed = false; + const PTO2ResourceShape* dispatch_order = get_dispatch_order(thread_idx); + for (int32_t si = 0; si < PTO2_NUM_RESOURCE_SHAPES; si++) { + PTO2ResourceShape shape = dispatch_order[si]; + auto valid_cluster_states = tracker.get_valid_cluster_offset_states(shape); + if (!valid_cluster_states.has_value()) { + continue; + } + auto& local_buf = local_bufs[static_cast(shape)]; + + while (valid_cluster_states.has_value()) { + int want = valid_cluster_states.count(); + PTO2TaskSlotState* batch[CoreTracker::MAX_CLUSTERS]; + int got = pop_ready_tasks_batch(shape, + thread_idx, + local_buf, + batch, + want +#if PTO2_SCHED_PROFILING + , + pop_hit, + pop_miss, + local_dispatch_count, + sched_dispatch_pop_cycle +#endif + ); + if (got == 0) break; + + for (int bi = 0; bi < got; bi++) { + PTO2TaskSlotState* slot_state = batch[bi]; + try_pushed = true; +#if PTO2_SCHED_PROFILING + uint64_t t_setup_start = get_sys_cnt_aicpu(); +#endif + // Dispatch as many blocks as possible for this task using available clusters. + // For block_num=1 the inner body executes exactly once (no overhead). + do { + auto current_valid_cluster_offset = valid_cluster_states.pop_first(); + if (shape == PTO2ResourceShape::MIX) { + // Full-cluster: all active subtasks share the same block_idx. 
+ uint8_t mask = slot_state->active_mask; + if (mask & PTO2_SUBTASK_MASK_AIC) { + dispatch_subtask_to_core(runtime, + thread_idx, + tracker.get_aic_core_offset(current_valid_cluster_offset), + *slot_state, + PTO2SubtaskSlot::AIC +#if PTO2_PROFILING + , + profiling_enabled +#endif + ); + } + if (mask & PTO2_SUBTASK_MASK_AIV0) { + dispatch_subtask_to_core(runtime, + thread_idx, + tracker.get_aiv0_core_offset(current_valid_cluster_offset), + *slot_state, + PTO2SubtaskSlot::AIV0 +#if PTO2_PROFILING + , + profiling_enabled +#endif + ); + } + if (mask & PTO2_SUBTASK_MASK_AIV1) { + dispatch_subtask_to_core(runtime, + thread_idx, + tracker.get_aiv1_core_offset(current_valid_cluster_offset), + *slot_state, + PTO2SubtaskSlot::AIV1 +#if PTO2_PROFILING + , + profiling_enabled +#endif + ); + } + slot_state->next_block_idx++; + } else if (shape == PTO2ResourceShape::AIC) { + dispatch_subtask_to_core(runtime, + thread_idx, + tracker.get_aic_core_offset(current_valid_cluster_offset), + *slot_state, + PTO2SubtaskSlot::AIC +#if PTO2_PROFILING + , + profiling_enabled +#endif + ); + slot_state->next_block_idx++; + } else { // shape == PTO2ResourceShape::AIV + auto core_offset = tracker.is_aiv0_core_idle(current_valid_cluster_offset) + ? tracker.get_aiv0_core_offset(current_valid_cluster_offset) + : tracker.get_aiv1_core_offset(current_valid_cluster_offset); + dispatch_subtask_to_core(runtime, + thread_idx, + core_offset, + *slot_state, + PTO2SubtaskSlot::AIV0 +#if PTO2_PROFILING + , + profiling_enabled +#endif + ); + slot_state->next_block_idx++; + // Refresh idle state so the do-while naturally picks up + // the other AIV in the same cluster on the next iteration. 
+ if (slot_state->next_block_idx < slot_state->block_num) { + valid_cluster_states = tracker.get_valid_cluster_offset_states(shape); + } + } +#if PTO2_PROFILING + phase_dispatch_count += __builtin_popcount(slot_state->active_mask); +#endif + DEV_DEBUG("Thread %d: Dispatched %s task %" PRId64 " block %d/%d to cluster_offset %d", + thread_idx, + shape_name(shape), + static_cast(slot_state->task->task_id.raw), + slot_state->next_block_idx - 1, + slot_state->block_num, + current_valid_cluster_offset); + } while (slot_state->next_block_idx < slot_state->block_num && valid_cluster_states.has_value()); + + // Re-enqueue only if blocks remain after exhausting local clusters + if (slot_state->next_block_idx < slot_state->block_num) { + rt->scheduler.ready_queues[static_cast(shape)].push(slot_state); + } + made_progress = true; +#if PTO2_SCHED_PROFILING + sched_dispatch_setup_cycle += (get_sys_cnt_aicpu() - t_setup_start); +#endif + } + + // lazy update valid_cluster_states + if (!valid_cluster_states.has_value()) { + valid_cluster_states = tracker.get_valid_cluster_offset_states(shape); + } + } + } + + // requeue in global ready queue + for (int32_t si = 0; si < PTO2_NUM_RESOURCE_SHAPES; si++) { + PTO2ResourceShape shape = dispatch_order[si]; + auto& local_buf = local_bufs[static_cast(shape)]; + auto& ready_queue = rt->scheduler.ready_queues[static_cast(shape)]; +#if PTO2_SCHED_PROFILING + local_overflow_count += local_buf.count; +#endif + if (local_buf.count > 0) { + ready_queue.push_batch(local_buf.slot_states, local_buf.count); + local_buf.count = 0; + } + } + +#if PTO2_PROFILING + if (!try_pushed) { + CYCLE_COUNT_LAP(sched_idle_cycle); + } else { + CYCLE_COUNT_LAP(sched_dispatch_cycle); + if (profiling_enabled && phase_dispatch_count > 0) { + perf_aicpu_record_phase( + thread_idx, AicpuPhaseId::SCHED_DISPATCH, _t0_phase, _t1, sched_loop_count, phase_dispatch_count); + _t0_phase = _t1; + phase_dispatch_count = 0; + } + } +#endif + +#if !PTO2_PROFILING + 
(void)try_completed; // NOLINT(readability/casting) + (void)try_pushed; // NOLINT(readability/casting) +#endif + + if (made_progress) { + idle_iterations = 0; + } else { + // Batch deferred fanin releases during idle. + // Processing all pending releases at once advances the ring faster, + // freeing heap space for the orchestrator without blocking completion polling. + while (deferred_release_count > 0) { +#if PTO2_SCHED_PROFILING + int32_t fe = + rt->scheduler.on_task_release(*deferred_release_slot_states[--deferred_release_count], thread_idx); +#else + int32_t fe = rt->scheduler.on_task_release(*deferred_release_slot_states[--deferred_release_count]); +#endif + (void)fe; // NOLINT(readability/casting) +#if PTO2_SCHED_PROFILING + fanin_edges_total += fe; + if (fe > fanin_max_degree) fanin_max_degree = fe; +#endif + } + idle_iterations++; + + // Check for orchestrator fatal error during idle (every 1024 iterations) + // orch_error_code is set in shared memory by the orchestrator's spin loop + // BEFORE orchestrator_done_ is set, so this catches errors earlier. 
+ if (idle_iterations % FATAL_ERROR_CHECK_INTERVAL == 0) { + int32_t orch_err = header->orch_error_code.load(std::memory_order_acquire); + if (orch_err != PTO2_ERROR_NONE) { + DEV_ERROR("Thread %d: Fatal error detected (code=%d), sending EXIT_SIGNAL to all cores", + thread_idx, + orch_err); + emergency_shutdown(runtime); + completed_.store(true, std::memory_order_release); + break; + } + } + + if (thread_idx == 0 && task_count > 0 && idle_iterations % STALL_LOG_INTERVAL == 0) { + int32_t c = completed_tasks_.load(std::memory_order_relaxed); + DEV_ALWAYS("PTO2 stall: no progress for %d iterations, completed=%d total=%d (last progress at %d)", + idle_iterations, + c, + task_count, + last_progress_count); + // Scan all task slots to find truly stuck tasks using scheduler state + PTO2SchedulerState* sched = &rt->scheduler; + PTO2SharedMemoryHeader* sm_header_diag = static_cast(sm_base); + int32_t cnt_ready = 0, cnt_waiting = 0, cnt_inflight = 0; + for (int r = 0; r < PTO2_MAX_RING_DEPTH; r++) { + int32_t ring_task_count = + sm_header_diag->rings[r].fc.current_task_index.load(std::memory_order_relaxed); + for (int32_t si = 0; si < ring_task_count; si++) { + PTO2TaskSlotState& slot_state = sched->get_slot_state(r, si); + PTO2TaskState st = slot_state.task_state.load(std::memory_order_relaxed); + int32_t rc = slot_state.fanin_refcount.load(std::memory_order_relaxed); + int32_t fi = slot_state.fanin_count; + int32_t kid = slot_state.task->kernel_id[0]; + if (st >= PTO2_TASK_COMPLETED) continue; // Already done + if (st == PTO2_TASK_READY || st == PTO2_TASK_RUNNING) { + cnt_inflight++; + continue; + } + // PENDING + if (rc >= fi) { + // Ready (all deps satisfied) but not enqueued — this is the real bug + cnt_ready++; + if (cnt_ready <= STALL_DUMP_READY_MAX) { + DEV_ALWAYS( + " STUCK-READY ring=%d task_id=%" PRId64 + " kernel_id=%d refcount=%d fanin=%d state=%d", // NOLINT(whitespace/line_length) + r, + static_cast(slot_state.task->task_id.raw), + kid, + rc, + fi, + 
static_cast<int32_t>(st));
+                            }
+                        } else {
+                            cnt_waiting++;
+                            if (cnt_waiting <= STALL_DUMP_WAIT_MAX) {
+                                DEV_ALWAYS(
+                                    " STUCK-WAIT ring=%d task_id=%" PRId64
+                                    " kernel_id=%d refcount=%d fanin=%d state=%d", // NOLINT(whitespace/line_length)
+                                    r,
+                                    static_cast<int64_t>(slot_state.task->task_id.raw),
+                                    kid,
+                                    rc,
+                                    fi,
+                                    static_cast<int32_t>(st));
+                            }
+                        }
+                    }
+                }
+                DEV_ALWAYS(" scan result: stuck_ready=%d stuck_waiting=%d in_flight=%d",
+                    cnt_ready,
+                    cnt_waiting,
+                    cnt_inflight);
+                // Log this thread's dispatch state
+                int32_t aic_running = tracker.get_running_count<PTO2ResourceShape::AIC>();
+                int32_t aiv_running = tracker.get_running_count<PTO2ResourceShape::AIV>();
+                int32_t total_running = aic_running + aiv_running;
+                DEV_ALWAYS(" thread=%d running_cores=%d (AIC=%d AIV=%d) core_num=%d",
+                    thread_idx,
+                    total_running,
+                    aic_running,
+                    aiv_running,
+                    core_num);
+                // Dump running cores
+                auto all_running = tracker.get_all_running_cores();
+                int32_t dump_count = 0;
+                int32_t bp;
+                while (dump_count < STALL_DUMP_CORE_MAX && (bp = all_running.pop_first()) >= 0) {
+                    dump_count++;
+                    int32_t cid = tracker.get_core_id_by_offset(bp);
+                    int32_t sw_tid = core_exec_states_[cid].executing_reg_task_id;
+                    int32_t hw_kernel = -1;
+                    if (sw_tid >= 0 && core_exec_states_[cid].executing_slot_state) {
+                        int32_t diag_slot = static_cast<int32_t>(core_exec_states_[cid].executing_subslot);
+                        hw_kernel = core_exec_states_[cid].executing_slot_state->task->kernel_id[diag_slot];
+                    }
+                    uint64_t cond_reg = read_reg(core_exec_states_[cid].reg_addr, RegId::COND);
+                    DEV_ALWAYS(" core=%d cond=0x%x(state=%d,id=%d) exec_id=%d kernel=%d",
+                        cid,
+                        static_cast<uint32_t>(cond_reg),
+                        EXTRACT_TASK_STATE(cond_reg),
+                        EXTRACT_TASK_ID(cond_reg),
+                        sw_tid,
+                        hw_kernel);
+                }
+                // Dump cluster state
+                for (int32_t cli = 0; cli < tracker.get_cluster_count() && cli < STALL_DUMP_CORE_MAX; cli++) {
+                    int32_t offset = cli * 3;
+                    DEV_ALWAYS(" cluster[%d] aic=%d(%s) aiv0=%d(%s) aiv1=%d(%s)",
+                        cli,
+                        tracker.get_aic_core_id(offset),
+                        tracker.is_aic_core_idle(offset) ?
"idle" : "busy", + tracker.get_aiv0_core_id(offset), + tracker.is_aiv0_core_idle(offset) ? "idle" : "busy", + tracker.get_aiv1_core_id(offset), + tracker.is_aiv1_core_idle(offset) ? "idle" : "busy"); + } + } + if (idle_iterations > MAX_IDLE_ITERATIONS) { + DEV_ERROR("Thread %d: PTO2 timeout after %d idle iterations", thread_idx, idle_iterations); +#if PTO2_PROFILING + // Benchmark: scheduler lifetime end timestamp on timeout path + uint64_t sched_timeout_ts = get_sys_cnt_aicpu(); + DEV_ALWAYS("Thread %d: sched_start=%" PRIu64 " sched_end(timeout)=%" PRIu64 " sched_cost=%.3fus", + thread_idx, + static_cast(sched_start_ts), + static_cast(sched_timeout_ts), + cycles_to_us(sched_timeout_ts - sched_start_ts)); +#endif + return -1; + } else { + SPIN_WAIT_HINT(); + } +#if PTO2_PROFILING + CYCLE_COUNT_LAP(sched_idle_cycle); + if (profiling_enabled) { + perf_aicpu_record_phase(thread_idx, AicpuPhaseId::SCHED_IDLE_WAIT, _t0_phase, _t1, sched_loop_count, 0); + _t0_phase = _t1; + } +#endif + } + } + +#if PTO2_PROFILING + // Record sched_end before any DEV_ALWAYS to avoid init cost contamination + uint64_t sched_end_ts = get_sys_cnt_aicpu(); + DEV_ALWAYS("Thread %d: sched_start=%" PRIu64 " sched_end=%" PRIu64 " sched_cost=%.3fus", + thread_idx, + static_cast(sched_start_ts), + static_cast(sched_end_ts), + cycles_to_us(sched_end_ts - sched_start_ts)); + + // Scheduler summary logging (always print when PTO2_PROFILING=1) + uint64_t sched_total = sched_complete_cycle + sched_scan_cycle + sched_dispatch_cycle + sched_idle_cycle; + if (sched_total == 0) sched_total = 1; // avoid div-by-zero + +#if PTO2_SCHED_PROFILING + // Two-level tree display: sub-phase breakdown within complete and dispatch + { + PTO2SchedProfilingData sp = pto2_scheduler_get_profiling(thread_idx); + uint64_t otc_total = sp.lock_cycle + sp.fanout_cycle + sp.fanin_cycle + sp.self_consumed_cycle; + uint64_t complete_poll = (sched_complete_cycle > otc_total + sched_complete_perf_cycle) + ? 
(sched_complete_cycle - otc_total - sched_complete_perf_cycle) + : 0; + uint64_t dispatch_poll = (sched_dispatch_cycle > sched_dispatch_pop_cycle + sched_dispatch_setup_cycle) + ? (sched_dispatch_cycle - sched_dispatch_pop_cycle - sched_dispatch_setup_cycle) + : 0; + + DEV_ALWAYS("Thread %d: === Scheduler Phase Breakdown: total=%.3fus, %d tasks ===", + thread_idx, + cycles_to_us(sched_total), + cur_thread_completed); + + // Level 1: complete + double notify_avg = + cur_thread_completed > 0 ? static_cast(notify_edges_total) / cur_thread_completed : 0.0; + double fanin_avg = + cur_thread_completed > 0 ? static_cast(fanin_edges_total) / cur_thread_completed : 0.0; + DEV_ALWAYS("Thread %d: complete : %.3fus (%.1f%%) [fanout: edges=%" PRIu64 + ", max_degree=%d, avg=%.1f] [fanin: " // NOLINT(whitespace/line_length) + "edges=%" PRIu64 ", max_degree=%d, avg=%.1f]", + thread_idx, + cycles_to_us(sched_complete_cycle), + sched_complete_cycle * 100.0 / sched_total, + static_cast(notify_edges_total), + notify_max_degree, + notify_avg, + static_cast(fanin_edges_total), + fanin_max_degree, + fanin_avg); + + // Level 2: complete sub-phases (percentage relative to complete) + uint64_t c_parent = sched_complete_cycle > 0 ? sched_complete_cycle : 1; + uint64_t complete_miss_count = + (complete_probe_count > complete_hit_count) ? (complete_probe_count - complete_hit_count) : 0; + double complete_hit_rate = complete_probe_count > 0 ? 
complete_hit_count * 100.0 / complete_probe_count : 0.0; + DEV_ALWAYS("Thread %d: poll : %.3fus (%.1f%%) hit=%" PRIu64 ", miss=%" PRIu64 ", hit_rate=%.1f%%", + thread_idx, + cycles_to_us(complete_poll), + complete_poll * 100.0 / c_parent, + static_cast(complete_hit_count), + static_cast(complete_miss_count), + complete_hit_rate); + DEV_ALWAYS("Thread %d: otc_lock : %.3fus (%.1f%%) work=%.3fus wait=%.3fus atomics=%" PRIu64 "", + thread_idx, + cycles_to_us(sp.lock_cycle), + sp.lock_cycle * 100.0 / c_parent, + cycles_to_us(sp.lock_cycle - sp.lock_wait_cycle), + cycles_to_us(sp.lock_wait_cycle), + static_cast(sp.lock_atomic_count)); + DEV_ALWAYS("Thread %d: otc_fanout : %.3fus (%.1f%%) work=%.3fus wait=%.3fus atomics=%" PRIu64 "", + thread_idx, + cycles_to_us(sp.fanout_cycle), + sp.fanout_cycle * 100.0 / c_parent, + cycles_to_us(sp.fanout_cycle - sp.push_wait_cycle), + cycles_to_us(sp.push_wait_cycle), + static_cast(sp.fanout_atomic_count)); + DEV_ALWAYS("Thread %d: otc_fanin : %.3fus (%.1f%%) atomics=%" PRIu64 "", + thread_idx, + cycles_to_us(sp.fanin_cycle), + sp.fanin_cycle * 100.0 / c_parent, + static_cast(sp.fanin_atomic_count)); + DEV_ALWAYS("Thread %d: otc_self : %.3fus (%.1f%%) atomics=%" PRIu64 "", + thread_idx, + cycles_to_us(sp.self_consumed_cycle), + sp.self_consumed_cycle * 100.0 / c_parent, + static_cast(sp.self_atomic_count)); + DEV_ALWAYS("Thread %d: perf : %.3fus (%.1f%%)", + thread_idx, + cycles_to_us(sched_complete_perf_cycle), + sched_complete_perf_cycle * 100.0 / c_parent); + + // Level 1: dispatch + uint64_t pop_total = pop_hit + pop_miss; + double pop_hit_rate = pop_total > 0 ? 
pop_hit * 100.0 / pop_total : 0.0; + DEV_ALWAYS("Thread %d: dispatch : %.3fus (%.1f%%) [pop: hit=%" PRIu64 ", miss=%" PRIu64 + ", hit_rate=%.1f%%]", // NOLINT(whitespace/line_length) + thread_idx, + cycles_to_us(sched_dispatch_cycle), + sched_dispatch_cycle * 100.0 / sched_total, + static_cast(pop_hit), + static_cast(pop_miss), + pop_hit_rate); + uint64_t global_dispatch_count = pop_hit - local_dispatch_count; + uint64_t total_dispatched = local_dispatch_count + global_dispatch_count; + double local_hit_rate = total_dispatched > 0 ? local_dispatch_count * 100.0 / total_dispatched : 0.0; + DEV_ALWAYS("Thread %d: local_disp : local=%" PRIu64 ", global=%" PRIu64 ", overflow=%" PRIu64 + ", local_rate=%.1f%%", // NOLINT(whitespace/line_length) + thread_idx, + static_cast(local_dispatch_count), + static_cast(global_dispatch_count), + static_cast(local_overflow_count), + local_hit_rate); + + // Level 2: dispatch sub-phases (percentage relative to dispatch) + uint64_t d_parent = sched_dispatch_cycle > 0 ? 
sched_dispatch_cycle : 1; + DEV_ALWAYS("Thread %d: poll : %.3fus (%.1f%%)", + thread_idx, + cycles_to_us(dispatch_poll), + dispatch_poll * 100.0 / d_parent); + DEV_ALWAYS("Thread %d: pop : %.3fus (%.1f%%) work=%.3fus wait=%.3fus atomics=%" PRIu64 "", + thread_idx, + cycles_to_us(sched_dispatch_pop_cycle), + sched_dispatch_pop_cycle * 100.0 / d_parent, + cycles_to_us(sched_dispatch_pop_cycle - sp.pop_wait_cycle), + cycles_to_us(sp.pop_wait_cycle), + static_cast(sp.pop_atomic_count)); + DEV_ALWAYS("Thread %d: setup : %.3fus (%.1f%%)", + thread_idx, + cycles_to_us(sched_dispatch_setup_cycle), + sched_dispatch_setup_cycle * 100.0 / d_parent); + + // Level 1: scan + DEV_ALWAYS("Thread %d: scan : %.3fus (%.1f%%)", + thread_idx, + cycles_to_us(sched_scan_cycle), + sched_scan_cycle * 100.0 / sched_total); + + // Level 1: idle + DEV_ALWAYS("Thread %d: idle : %.3fus (%.1f%%)", + thread_idx, + cycles_to_us(sched_idle_cycle), + sched_idle_cycle * 100.0 / sched_total); + + // Average per completion + if (cur_thread_completed > 0) { + DEV_ALWAYS("Thread %d: avg/complete : %.3fus", + thread_idx, + cycles_to_us(sched_complete_cycle) / cur_thread_completed); + } + } +#endif + // Summary line (always print when PTO2_PROFILING=1) + DEV_ALWAYS("Thread %d: Scheduler summary: total_time=%.3fus, loops=%" PRIu64 ", tasks_scheduled=%d", + thread_idx, + cycles_to_us(sched_total), + static_cast(sched_loop_count), + cur_thread_completed); +#endif + +#if PTO2_PROFILING + // Flush performance buffers for cores managed by this thread + if (profiling_enabled) { + perf_aicpu_flush_buffers(runtime, thread_idx, core_assignments_[thread_idx], core_num); + perf_aicpu_flush_phase_buffers(thread_idx); + } +#endif + + return cur_thread_completed; +} + +int32_t AicpuExecutor::run(Runtime* runtime) { + int32_t thread_idx = thread_idx_++; + DEV_INFO("Thread %d: Start", thread_idx); + + // Orchestrator check + if (thread_idx >= sched_thread_num_) { +#if PTO2_PROFILING + uint64_t orch_cycle_start = 0; + 
int32_t pto2_submitted_tasks = -1; +#endif + int32_t orch_idx = thread_idx - sched_thread_num_; + if (runtime->get_orch_built_on_host()) { + DEV_INFO("Thread %d: Host orchestration mode, no-op (orch_idx=%d)", thread_idx, orch_idx); + } else { + // First orchestrator thread (orch_idx == 0): load SO, create runtime + if (orch_idx == 0) { + DEV_INFO("Thread %d: Primary orchestrator, loading SO via dlopen", thread_idx); + + const void* so_data = runtime->get_device_orch_so_data(); + size_t so_size = runtime->get_device_orch_so_size(); + + if (so_data == nullptr || so_size == 0) { + DEV_ERROR("Thread %d: Device orchestration SO not set", thread_idx); + return -1; + } + + // Try multiple paths that may allow execution on AICPU + char so_path[256]; + bool file_created = false; + const char* candidate_dirs[] = { + "/usr/lib64/aicpu_kernels/0/aicpu_kernels_device", "/usr/lib64", "/lib64", "/var/tmp", "/tmp"}; + const int32_t num_candidates = sizeof(candidate_dirs) / sizeof(candidate_dirs[0]); + + for (int32_t i = 0; i < num_candidates && !file_created; i++) { + snprintf(so_path, sizeof(so_path), "%s/libdevice_orch_%d.so", candidate_dirs[i], getpid()); + int32_t fd = open(so_path, O_WRONLY | O_CREAT | O_TRUNC, 0755); + if (fd < 0) { + DEV_INFO("Thread %d: Cannot create SO at %s (errno=%d), trying next path", + thread_idx, + so_path, + errno); + continue; + } + ssize_t written = write(fd, so_data, so_size); + close(fd); + if (written != static_cast(so_size)) { + DEV_INFO("Thread %d: Cannot write SO to %s (errno=%d), trying next path", + thread_idx, + so_path, + errno); + unlink(so_path); + continue; + } + file_created = true; + DEV_INFO("Thread %d: Created SO file at %s (%zu bytes)", thread_idx, so_path, so_size); + } + + if (!file_created) { + DEV_ERROR("Thread %d: Failed to create SO file in any candidate path", thread_idx); + return -1; + } + + dlerror(); + void* handle = dlopen(so_path, RTLD_LAZY | RTLD_LOCAL); + const char* dlopen_err = dlerror(); + if (handle == 
nullptr) { + DEV_ERROR("Thread %d: dlopen failed: %s", thread_idx, dlopen_err ? dlopen_err : "unknown"); + unlink(so_path); + return -1; + } + DEV_INFO("Thread %d: dlopen succeeded, handle=%p", thread_idx, handle); + + dlerror(); + auto config_func = + reinterpret_cast<PTO2OrchestrationConfig (*)(const ChipStorageTaskArgs&)>(dlsym(handle, "aicpu_orchestration_config")); + + dlerror(); + DeviceOrchestrationFunc orch_func = + reinterpret_cast<DeviceOrchestrationFunc>(dlsym(handle, "aicpu_orchestration_entry")); + const char* dlsym_error = dlerror(); + if (dlsym_error != nullptr) { + DEV_ERROR("Thread %d: dlsym failed: %s", thread_idx, dlsym_error); + dlclose(handle); + unlink(so_path); + return -1; + } + if (orch_func == nullptr) { + DEV_ERROR("Thread %d: dlsym returned NULL for aicpu_orchestration_entry", thread_idx); + dlclose(handle); + unlink(so_path); + return -1; + } + + dlerror(); + auto bind_runtime_func = + reinterpret_cast<void (*)(PTO2Runtime*)>(dlsym(handle, "pto2_framework_bind_runtime")); + const char* bind_runtime_error = dlerror(); + if (bind_runtime_error != nullptr) { + DEV_INFO("Thread %d: Optional TLS runtime binder not found: %s", thread_idx, bind_runtime_error); + bind_runtime_func = nullptr; + } + + const ChipStorageTaskArgs& args = runtime->get_orch_args(); + int32_t arg_count = args.tensor_count() + args.scalar_count(); + DEV_INFO("Thread %d: sm_ptr=%p, arg_count=%d", thread_idx, runtime->get_pto2_gm_sm_ptr(), arg_count); + for (int32_t i = 0; i < args.tensor_count() && i < 20; i++) { + const ContinuousTensor& t = args.tensor(i); + DEV_INFO("Thread %d: orch_args[%d] = TENSOR(data=0x%lx, ndims=%u, dtype=%u)", + thread_idx, + i, + static_cast<uint64_t>(t.data), + t.ndims, + static_cast<uint32_t>(t.dtype)); + } + for (int32_t i = 0; i < args.scalar_count() && (args.tensor_count() + i) < 20; i++) { + DEV_INFO("Thread %d: orch_args[%d] = SCALAR(0x%lx)", + thread_idx, + args.tensor_count() + i, + static_cast<uint64_t>(args.scalar(i))); + } + + uint64_t task_window_size = PTO2_TASK_WINDOW_SIZE; + uint64_t heap_size = PTO2_HEAP_SIZE; + int32_t expected_arg_count = 0; + if 
(config_func) { + PTO2OrchestrationConfig cfg = config_func(args); + expected_arg_count = cfg.expected_arg_count; + DEV_INFO("Thread %d: Config: expected_args=%d", thread_idx, expected_arg_count); + } else { + DEV_INFO("Thread %d: No config function, using defaults", thread_idx); + } + + if (expected_arg_count > 0 && arg_count < expected_arg_count) { + DEV_ERROR("Thread %d: arg_count %d < expected %d", thread_idx, arg_count, expected_arg_count); + dlclose(handle); + unlink(so_path); + return -1; + } + + if (runtime->pto2_task_window_size > 0) { + task_window_size = runtime->pto2_task_window_size; + } + if (runtime->pto2_heap_size > 0) { + heap_size = runtime->pto2_heap_size; + } + int32_t dep_pool_capacity = PTO2_DEP_LIST_POOL_SIZE; + if (runtime->pto2_dep_pool_size > 0) { + dep_pool_capacity = static_cast<int32_t>(runtime->pto2_dep_pool_size); + } + DEV_INFO("Thread %d: Ring sizes: task_window=%lu, heap=%lu, dep_pool=%d", + thread_idx, + static_cast<unsigned long>(task_window_size), + static_cast<unsigned long>(heap_size), + dep_pool_capacity); + + void* sm_ptr = runtime->get_pto2_gm_sm_ptr(); + void* gm_heap = runtime->get_pto2_gm_heap_ptr(); + + uint64_t sm_size = pto2_sm_calculate_size(task_window_size); + PTO2SharedMemoryHandle* sm_handle = + pto2_sm_create_from_buffer(sm_ptr, sm_size, task_window_size, heap_size); + if (!sm_handle) { + DEV_ERROR("Thread %d: Failed to create shared memory handle", thread_idx); + dlclose(handle); + unlink(so_path); + return -1; + } + + rt = pto2_runtime_create_from_sm( + PTO2_MODE_EXECUTE, sm_handle, gm_heap, heap_size, orch_thread_num_, dep_pool_capacity); + if (!rt) { + DEV_ERROR("Thread %d: Failed to create PTO2Runtime", thread_idx); + pto2_sm_destroy(sm_handle); + dlclose(handle); + unlink(so_path); + return -1; + } + +#if PTO2_PROFILING + for (int i = 0; i < orch_thread_num_; i++) { + rt->orchestrators[i].enable_profiling = runtime->enable_profiling; + } +#endif + + // With multi-ring, slot_states are per-ring inside the scheduler. 
+ runtime->set_pto2_slot_states_ptr(nullptr); + + // Store shared state for other orchestrator threads + orch_func_ = orch_func; + orch_bind_runtime_ = bind_runtime_func; + orch_args_cached_ = &args; + orch_so_handle_ = handle; + snprintf(orch_so_path_, sizeof(orch_so_path_), "%s", so_path); + + // All-orchestrator mode: primary orchestrator does one-time init + if (sched_thread_num_ == 0) { + DEV_INFO("Thread %d: All-orchestrator mode, doing one-time init", thread_idx); + if (runtime->enable_profiling) { + perf_aicpu_init_profiling(runtime); + // After transition, all threads become schedulers + perf_aicpu_init_phase_profiling(runtime, thread_num_, orch_thread_num_); + perf_aicpu_set_orch_thread_idx(0); + } + pto2_init_done_.store(true, std::memory_order_release); + pto2_init_complete_.store(true, std::memory_order_release); + DEV_INFO("Thread %d: One-time init done", thread_idx); + } + + runtime_init_ready_.store(true, std::memory_order_release); + } else { + // Non-primary orchestrator: wait for primary to finish setup + while (!runtime_init_ready_.load(std::memory_order_acquire)) { + SPIN_WAIT_HINT(); + } + } + + // Wait for scheduler's one-time init to complete + // (or primary orchestrator's init in all-orchestrator mode) + while (!pto2_init_complete_.load(std::memory_order_acquire)) { + SPIN_WAIT_HINT(); + } + + pto2_set_orch_thread_idx(orch_idx); + +#if PTO2_PROFILING + // Each orchestrator thread sets its own phase buffer index (thread-local) + if (runtime->enable_profiling) { + perf_aicpu_set_orch_thread_idx(thread_idx); + } +#endif + +#if PTO2_PROFILING + orch_cycle_start = get_sys_cnt_aicpu(); +#endif + if (orch_bind_runtime_ != nullptr) { + orch_bind_runtime_(rt); + } + pto2_rt_scope_begin(rt); + orch_func_(*orch_args_cached_, orch_thread_num_, orch_idx); + pto2_rt_scope_end(rt); +#if PTO2_PROFILING + uint64_t orch_cycle_end = get_sys_cnt_aicpu(); + (void)orch_cycle_end; // NOLINT(readability/casting) +#endif + + // Print orchestrator profiling data 
+#if PTO2_ORCH_PROFILING + PTO2OrchProfilingData p = pto2_orchestrator_get_profiling(); + uint64_t total = + p.sync_cycle + p.alloc_cycle + p.params_cycle + p.lookup_cycle + p.insert_cycle + p.fanin_cycle; + if (total == 0) total = 1; // avoid div-by-zero + DEV_ALWAYS("Thread %d: === Orchestrator Profiling: %" PRId64 " tasks, total=%.3fus ===", + thread_idx, + static_cast<int64_t>(p.submit_count), + cycles_to_us(total)); + DEV_ALWAYS("Thread %d: task+heap_alloc: %.3fus (%.1f%%) work=%.3fus wait=%.3fus atomics=%" PRIu64 "", + thread_idx, + cycles_to_us(p.alloc_cycle), + p.alloc_cycle * 100.0 / total, + cycles_to_us(p.alloc_cycle - p.alloc_wait_cycle), + cycles_to_us(p.alloc_wait_cycle), + static_cast<uint64_t>(p.alloc_atomic_count)); + DEV_ALWAYS("Thread %d: sync_tensormap : %.3fus (%.1f%%)", + thread_idx, + cycles_to_us(p.sync_cycle), + p.sync_cycle * 100.0 / total); + DEV_ALWAYS("Thread %d: lookup+dep : %.3fus (%.1f%%)", + thread_idx, + cycles_to_us(p.lookup_cycle), + p.lookup_cycle * 100.0 / total); + DEV_ALWAYS("Thread %d: tensormap_ins : %.3fus (%.1f%%)", + thread_idx, + cycles_to_us(p.insert_cycle), + p.insert_cycle * 100.0 / total); + DEV_ALWAYS("Thread %d: param_copy : %.3fus (%.1f%%) atomics=%" PRIu64 "", + thread_idx, + cycles_to_us(p.params_cycle), + p.params_cycle * 100.0 / total, + static_cast<uint64_t>(p.params_atomic_count)); + DEV_ALWAYS("Thread %d: fanin+ready : %.3fus (%.1f%%) work=%.3fus wait=%.3fus atomics=%" PRIu64 "", + thread_idx, + cycles_to_us(p.fanin_cycle), + p.fanin_cycle * 100.0 / total, + cycles_to_us(p.fanin_cycle - p.fanin_wait_cycle), + cycles_to_us(p.fanin_wait_cycle), + static_cast<uint64_t>(p.fanin_atomic_count)); + DEV_ALWAYS("Thread %d: avg/task : %.3fus", + thread_idx, + p.submit_count > 0 ? 
cycles_to_us(total) / p.submit_count : 0.0); + +#if PTO2_TENSORMAP_PROFILING + PTO2TensorMapProfilingData tp = pto2_tensormap_get_profiling(); + DEV_ALWAYS("Thread %d: === TensorMap Lookup Stats ===", thread_idx); + DEV_ALWAYS("Thread %d: lookups : %" PRIu64 ", inserts: %" PRIu64 "", + thread_idx, + static_cast<uint64_t>(tp.lookup_count), + static_cast<uint64_t>(tp.insert_count)); + DEV_ALWAYS("Thread %d: chain walked : total=%" PRIu64 ", avg=%.1f, max=%d", + thread_idx, + static_cast<uint64_t>(tp.lookup_chain_total), + tp.lookup_count > 0 ? static_cast<double>(tp.lookup_chain_total) / tp.lookup_count : 0.0, + tp.lookup_chain_max); + DEV_ALWAYS("Thread %d: overlap checks : %" PRIu64 ", hits=%" PRIu64 " (%.1f%%)", + thread_idx, + static_cast<uint64_t>(tp.overlap_checks), + static_cast<uint64_t>(tp.overlap_hits), + tp.overlap_checks > 0 ? tp.overlap_hits * 100.0 / tp.overlap_checks : 0.0); +#endif + +#if PTO2_PROFILING + // Write orchestrator summary to shared memory for host-side export (only if profiling enabled) + if (runtime->enable_profiling) { + AicpuOrchSummary orch_summary = {}; + orch_summary.start_time = orch_cycle_start; + orch_summary.end_time = orch_cycle_end; + orch_summary.sync_cycle = p.sync_cycle; + orch_summary.alloc_cycle = p.alloc_cycle; + orch_summary.args_cycle = p.args_cycle; + orch_summary.lookup_cycle = p.lookup_cycle; + orch_summary.heap_cycle = 0; // Now included in alloc_cycle + orch_summary.insert_cycle = p.insert_cycle; + orch_summary.fanin_cycle = p.fanin_cycle; + orch_summary.scope_end_cycle = p.scope_end_cycle; + orch_summary.submit_count = p.submit_count; + perf_aicpu_write_orch_summary(&orch_summary); + } +#endif +#endif + +#if PTO2_PROFILING + // Write core-to-thread mapping (one-time, after orchestration) + if (runtime->enable_profiling) { + perf_aicpu_write_core_assignments( + core_assignments_, core_count_per_thread_, sched_thread_num_, cores_total_num_); + // Flush orchestrator's phase record buffer + perf_aicpu_flush_phase_buffers(thread_idx); + } +#endif + + // Coordinate 
orchestrator completion + int32_t finished = orch_finished_count_.fetch_add(1, std::memory_order_acq_rel) + 1; + if (finished == orch_thread_num_) { + // Last orchestrator: signal completion and trigger core transition + pto2_rt_orchestration_done(rt); + + void* sm = runtime->get_pto2_gm_sm_ptr(); + PTO2SharedMemoryHeader* sm_header = static_cast<PTO2SharedMemoryHeader*>(sm); + int32_t pto2_task_count = 0; + if (sm_header) { + for (int r = 0; r < PTO2_MAX_RING_DEPTH; r++) { + pto2_task_count += sm_header->rings[r].fc.current_task_index.load(std::memory_order_acquire); + } + } +#if PTO2_PROFILING + pto2_submitted_tasks = pto2_task_count; +#endif + total_tasks_ = pto2_task_count; + if (runtime->enable_profiling && pto2_task_count > 0) { + perf_aicpu_update_total_tasks(runtime, static_cast<uint32_t>(pto2_task_count)); + } + orchestrator_done_ = true; + { + int32_t orch_err = 0; + void* sm = runtime->get_pto2_gm_sm_ptr(); + if (sm) { + orch_err = + static_cast<PTO2SharedMemoryHeader*>(sm)->orch_error_code.load(std::memory_order_relaxed); + } + + // Fatal error: shutdown AICore immediately before core transition. 
+ if (orch_err != PTO2_ERROR_NONE) { + emergency_shutdown(runtime); + completed_.store(true, std::memory_order_release); + } + } + +#if PTO2_ORCH_PROFILING + uint64_t reassign_cycle_start = get_sys_cnt_aicpu(); +#endif + + // Skip core transition on fatal error — cores already shut down above + if (completed_.load(std::memory_order_acquire)) { + // Signal transition to unblock scheduler threads waiting at core transition + transition_requested_.store(true, std::memory_order_release); + reassigned_.store(true, std::memory_order_release); + } else if (orch_to_sched_) { + // Compute new core assignments for all threads and initialize donated slots + DEV_INFO("Thread %d: Set orchestrator_done=true, requesting core transition", thread_idx); +#if PTO2_PROFILING + uint64_t orch_stage_end_ts = get_sys_cnt_aicpu(); +#endif + transition_requested_.store(true, std::memory_order_release); +#if PTO2_PROFILING + DEV_ALWAYS( + "Thread %d: orch_stage_end=%" PRIu64 "", thread_idx, static_cast<uint64_t>(orch_stage_end_ts)); +#endif + + // Wait for scheduler threads to acknowledge transition request + // All-orchestrator mode (sched_thread_num_ == 0): skip the wait + if (sched_thread_num_ > 0) { + while (wait_reassign_.load(std::memory_order_acquire) != sched_thread_num_) { + if (completed_.load(std::memory_order_acquire)) { + break; + } + SPIN_WAIT_HINT(); + } + } + if (!completed_.load(std::memory_order_acquire)) { + reassign_cores_for_all_threads(); + reassigned_.store(true, std::memory_order_release); + } + } + +#if PTO2_ORCH_PROFILING + uint64_t reassign_cycle_end = get_sys_cnt_aicpu(); + DEV_ALWAYS("Thread %d: reassign, cost %.3fus (orch_idx=%d)", + thread_idx, + cycles_to_us(reassign_cycle_end - reassign_cycle_start), + orch_idx); +#endif + } else { + // Non-last orchestrator: wait for last orchestrator to finish setup + if (orch_to_sched_) { + while (!transition_requested_.load(std::memory_order_acquire)) { + SPIN_WAIT_HINT(); + } + while (!reassigned_.load(std::memory_order_acquire)) 
{ + if (completed_.load(std::memory_order_acquire)) { + break; + } + SPIN_WAIT_HINT(); + } + } + } + } +#if PTO2_PROFILING + uint64_t orch_end_ts = get_sys_cnt_aicpu(); + DEV_ALWAYS("Thread %d: orch_start=%" PRIu64 " orch_end=%" PRIu64 " orch_cost=%.3fus", + thread_idx, + static_cast<uint64_t>(orch_cycle_start), + static_cast<uint64_t>(orch_end_ts), + cycles_to_us(orch_end_ts - orch_cycle_start)); + if (pto2_submitted_tasks >= 0) { + DEV_ALWAYS("PTO2 total submitted tasks = %d, already executed %d tasks", + pto2_submitted_tasks, + completed_tasks_.load(std::memory_order_acquire)); + } +#endif + DEV_INFO("Thread %d: Orchestrator completed (orch_idx=%d)", thread_idx, orch_idx); + } + + // Scheduler thread (orchestrator threads skip dispatch when orch_to_sched_ is false) + if (!completed_.load(std::memory_order_acquire) && (thread_idx < sched_thread_num_ || orch_to_sched_)) { + // Device orchestration: wait for primary orchestrator to initialize SM header + if (!runtime->get_orch_built_on_host()) { + while (!runtime_init_ready_.load(std::memory_order_acquire)) { + SPIN_WAIT_HINT(); + } + } + always_assert(rt != nullptr); + int32_t completed = resolve_and_dispatch_pto2(runtime, thread_idx); + DEV_INFO("Thread %d: Executed %d tasks from runtime", thread_idx, completed); + } + + // Always shutdown AICore — even if completed_ was already true. + // platform_deinit_aicore_regs is idempotent; orchestrator threads have + // core_count_per_thread_ == 0 so they skip the loop harmlessly. 
+ { + const int32_t* shutdown_cores = core_assignments_[thread_idx]; + int32_t shutdown_count = core_count_per_thread_[thread_idx]; + if (shutdown_count > 0) { + auto rc = shutdown_aicore(runtime, thread_idx, shutdown_cores, shutdown_count); + if (rc != 0) { + return rc; + } + } + } + + DEV_INFO("Thread %d: Completed", thread_idx); + + // Check if this is the last thread to finish + int32_t prev_finished = finished_count_.fetch_add(1, std::memory_order_acq_rel); + if (prev_finished + 1 == thread_num_) { + finished_.store(true, std::memory_order_release); + // Destroy PTO2 runtime and close orchestration SO (moved from orchestrator path) + if (!runtime->get_orch_built_on_host() && orch_so_handle_ != nullptr) { + // Clear the borrowed pointer in the orchestration SO before destroying + // rt, so g_pto2_current_runtime never points to freed memory. + if (orch_bind_runtime_ != nullptr) { + orch_bind_runtime_(nullptr); + } + pto2_runtime_destroy(rt); + dlclose(orch_so_handle_); + unlink(orch_so_path_); + } + } + + return 0; +} + +void AicpuExecutor::deinit(Runtime* runtime) { + // 1. Invalidate AICPU cache for Runtime address range. + // Next round's Host DMA (rtMemcpy) writes fresh Runtime to HBM but + // bypasses this cache. Invalidating now ensures next round reads from HBM. 
+ cache_invalidate_range(runtime, sizeof(Runtime)); + + // Reset all per-core execution state + for (int32_t i = 0; i < RUNTIME_MAX_WORKER; i++) { + core_exec_states_[i] = {}; + core_exec_states_[i].executing_reg_task_id = AICPU_TASK_INVALID; + } + + // Clear per-core dispatch payloads + memset(s_pto2_payload_per_core, 0, sizeof(s_pto2_payload_per_core)); + + completed_tasks_.store(0, std::memory_order_release); + total_tasks_ = 0; + finished_count_.store(0, std::memory_order_release); + orchestrator_done_ = false; + pto2_init_done_.store(false, std::memory_order_release); + pto2_init_complete_.store(false, std::memory_order_release); + runtime_init_ready_.store(false, std::memory_order_release); + + // Reset core transition state + transition_requested_.store(false, std::memory_order_release); + wait_reassign_.store(0, std::memory_order_release); + reassigned_.store(false, std::memory_order_release); + completed_.store(false, std::memory_order_release); + orch_finished_count_.store(0, std::memory_order_release); + + // Reset core discovery state + aic_count_ = 0; + aiv_count_ = 0; + + regs_ = 0; + orch_func_ = nullptr; + orch_bind_runtime_ = nullptr; + orch_args_cached_ = nullptr; + orch_so_handle_ = nullptr; + orch_so_path_[0] = '\0'; + + // Clear file-scope PTO2Runtime pointer (freed by orchestrator thread before deinit) + rt = nullptr; + + DEV_INFO("DeInit: Runtime execution state reset"); + + initialized_.store(false, std::memory_order_release); + init_done_.store(false, std::memory_order_release); + init_failed_.store(false, std::memory_order_release); + thread_idx_.store(0, std::memory_order_release); + finished_.store(false, std::memory_order_release); + + DEV_INFO("DeInit: AicpuExecutor reset complete"); +} + +void AicpuExecutor::emergency_shutdown(Runtime* runtime) { + DEV_WARN("Emergency shutdown: sending exit signal to all initialized cores"); + Handshake* all_handshakes = reinterpret_cast<Handshake*>(runtime->workers); + for (int32_t i = 0; i < cores_total_num_; 
i++) { + Handshake* hank = &all_handshakes[i]; + OUT_OF_ORDER_STORE_BARRIER(); + hank->aicpu_regs_ready = 1; + if (core_exec_states_[i].reg_addr != 0) { + platform_deinit_aicore_regs(core_exec_states_[i].reg_addr); + } + } + + DEV_WARN("Emergency shutdown complete"); +} + +void AicpuExecutor::diagnose_stuck_state( + Runtime* runtime, int32_t thread_idx, const int32_t* cur_thread_cores, int32_t core_num, Handshake* hank) { + (void)runtime; // NOLINT(readability/casting) + PTO2SchedulerState* sched = &rt->scheduler; + DEV_ALWAYS("========== DIAGNOSTIC REPORT: Thread %d ==========", thread_idx); + + int32_t completed = completed_tasks_.load(std::memory_order_acquire); + int32_t total = total_tasks_; + DEV_ALWAYS("Progress: %d/%d tasks (%.1f%%)", completed, total, total > 0 ? completed * 100.0 / total : 0.0); + + uint64_t aic_ready = 0, aiv_ready = 0, mix_ready = 0; + if (rt) { + aic_ready = sched->ready_queues[static_cast<size_t>(PTO2ResourceShape::AIC)].size(); + aiv_ready = sched->ready_queues[static_cast<size_t>(PTO2ResourceShape::AIV)].size(); + mix_ready = sched->ready_queues[static_cast<size_t>(PTO2ResourceShape::MIX)].size(); + } + DEV_ALWAYS("Ready Queues: AIC=%lu, AIV=%lu, MIX=%lu", aic_ready, aiv_ready, mix_ready); + + int32_t busy_cores = 0; + int32_t idle_cores = 0; + + DEV_ALWAYS("Core Status:"); + for (int32_t i = 0; i < core_num; i++) { + int32_t core_id = cur_thread_cores[i]; + Handshake* h = &hank[core_id]; + const char* core_type_str = core_type_to_string(h->core_type); + + uint64_t reg_addr = core_exec_states_[core_id].reg_addr; + uint64_t reg_val = read_reg(reg_addr, RegId::COND); + int32_t reg_task_id = EXTRACT_TASK_ID(reg_val); + int32_t reg_state = EXTRACT_TASK_STATE(reg_val); + int32_t task_id = core_exec_states_[core_id].executing_reg_task_id; + + if (reg_state != TASK_FIN_STATE || task_id >= 0) { + busy_cores++; + if (task_id >= 0) { + int32_t kernel_id = -1; + if (rt && rt->sm_handle && core_exec_states_[core_id].executing_slot_state) { + int32_t diag_slot = 
static_cast<int32_t>(core_exec_states_[core_id].executing_subslot); + kernel_id = core_exec_states_[core_id].executing_slot_state->task->kernel_id[diag_slot]; + } + DEV_ALWAYS( + " Core %d [%s, BUSY]: COND=0x%lx (reg_task_id=%d, reg_state=%s), executing_reg_task_id=%d, " + "kernel_id=%d", + core_id, + core_type_str, + reg_val, + reg_task_id, + reg_state == TASK_FIN_STATE ? "FIN" : "ACK", + task_id, + kernel_id); + } else { + DEV_ALWAYS(" Core %d [%s, BUSY]: COND=0x%lx (reg_task_id=%d, reg_state=%s) but task_id not tracked", + core_id, + core_type_str, + reg_val, + reg_task_id, + reg_state == TASK_FIN_STATE ? "FIN" : "ACK"); + } + } else { + idle_cores++; + } + } + + DEV_ALWAYS("Summary: %d busy, %d idle", busy_cores, idle_cores); + + // Diagnose deadlock vs livelock + if (busy_cores == 0 && aic_ready == 0 && aiv_ready == 0 && completed < total) { + DEV_ALWAYS("*** DEADLOCK DETECTED ***"); + DEV_ALWAYS("All cores idle, no ready tasks, but %d tasks incomplete", total - completed); + DEV_ALWAYS("Check PTO2 shared memory for task dependency state"); + } else if (busy_cores > 0) { + DEV_ALWAYS("*** LIVELOCK / HUNG TASK ***"); + DEV_ALWAYS("%d cores executing but no progress", busy_cores); + } + + DEV_ALWAYS("========== END DIAGNOSTIC =========="); +} + +// ===== Public Entry Point ===== + +/** + * aicpu_execute - Main AICPU kernel execution entry point + * + * This is called by DynTileFwkBackendKernelServer in kernel.cpp. + * Orchestrates the complete task runtime execution: + * 1. Initialize executor (thread-safe, first thread only) + * 2. Wait for initialization to complete + * 3. Execute tasks on managed cores + * 4. 
Cleanup when last thread finishes + * + * @param runtime Pointer to Runtime structure + * @return 0 on success, non-zero on error + */ +extern "C" int32_t aicpu_execute(Runtime* runtime) { + if (runtime == nullptr) { + DEV_ERROR("%s", "Invalid argument: null Runtime pointer"); + return -1; + } + + DEV_INFO("%s", "aicpu_execute: Starting AICPU kernel execution"); + + // Get platform register addresses from platform-level global + g_aicpu_executor.regs_ = get_platform_regs(); + + g_aicpu_executor.init(runtime); + + while (!g_aicpu_executor.init_done_.load(std::memory_order_acquire)) { + if (g_aicpu_executor.init_failed_.load(std::memory_order_acquire)) { + DEV_ERROR("%s", "aicpu_execute: Initialization failed, aborting execution"); + return -1; + } + } + + int32_t rc = g_aicpu_executor.run(runtime); + if (rc != 0) { + DEV_ERROR("aicpu_execute: Thread execution failed with rc=%d", rc); + return rc; + } + + // Last thread cleans up + if (g_aicpu_executor.finished_.load(std::memory_order_acquire)) { + DEV_INFO("aicpu_execute: Last thread finished, cleaning up"); + g_aicpu_executor.deinit(runtime); + } + + DEV_INFO("%s", "aicpu_execute: Kernel execution completed successfully"); + return 0; +} diff --git a/src/a2a3/runtime/tensormap_and_ringbuffer_unmodified/build_config.py b/src/a2a3/runtime/tensormap_and_ringbuffer_unmodified/build_config.py new file mode 100644 index 000000000..e0a10c982 --- /dev/null +++ b/src/a2a3/runtime/tensormap_and_ringbuffer_unmodified/build_config.py @@ -0,0 +1,26 @@ +# Copyright (c) PyPTO Contributors. +# This program is free software, you can redistribute it and/or modify it under the terms and conditions of +# CANN Open Software License Agreement Version 2.0 (the "License"). +# Please refer to the License for details. You may not use this file except in compliance with the License. 
+# THIS SOFTWARE IS PROVIDED ON AN "AS IS" BASIS, WITHOUT WARRANTIES OF ANY KIND, EITHER EXPRESS OR IMPLIED, +# INCLUDING BUT NOT LIMITED TO NON-INFRINGEMENT, MERCHANTABILITY, OR FITNESS FOR A PARTICULAR PURPOSE. +# See LICENSE in the root of the software repository for the full text of the License. +# ----------------------------------------------------------------------------------------------------------- +# Tensormap and Ringbuffer Runtime build configuration +# All paths are relative to this file's directory (src/runtime/tensormap_and_ringbuffer/) +# +# This is a device-orchestration runtime where: +# - AICPU thread 3 runs the orchestrator (builds task graph on device) +# - AICPU threads 0/1/2 run schedulers (dispatch tasks to AICore) +# - AICore executes tasks via an aligned PTO2DispatchPayload + pre-built dispatch_args +# +# The "orchestration" directory contains source files compiled into both +# runtime targets AND the orchestration .so (e.g., tensor methods needed +# by the Tensor constructor's validation logic). + +BUILD_CONFIG = { + "aicore": {"include_dirs": ["runtime", "common"], "source_dirs": ["aicore", "orchestration"]}, + "aicpu": {"include_dirs": ["runtime", "common"], "source_dirs": ["aicpu", "runtime", "orchestration"]}, + "host": {"include_dirs": ["runtime", "common"], "source_dirs": ["host", "runtime", "orchestration"]}, + "orchestration": {"include_dirs": ["runtime", "orchestration", "common"], "source_dirs": ["orchestration"]}, +} diff --git a/src/a2a3/runtime/tensormap_and_ringbuffer_unmodified/common/intrinsic.h b/src/a2a3/runtime/tensormap_and_ringbuffer_unmodified/common/intrinsic.h new file mode 100644 index 000000000..9ea70625a --- /dev/null +++ b/src/a2a3/runtime/tensormap_and_ringbuffer_unmodified/common/intrinsic.h @@ -0,0 +1,141 @@ +/* + * Copyright (c) PyPTO Contributors. 
+ * This program is free software, you can redistribute it and/or modify it under the terms and conditions of + * CANN Open Software License Agreement Version 2.0 (the "License"). + * Please refer to the License for details. You may not use this file except in compliance with the License. + * THIS SOFTWARE IS PROVIDED ON AN "AS IS" BASIS, WITHOUT WARRANTIES OF ANY KIND, EITHER EXPRESS OR IMPLIED, + * INCLUDING BUT NOT LIMITED TO NON-INFRINGEMENT, MERCHANTABILITY, OR FITNESS FOR A PARTICULAR PURPOSE. + * See LICENSE in the root of the software repository for the full text of the License. + * ----------------------------------------------------------------------------------------------------------- + */ + +/** + * @file intrinsic.h + * @brief SPMD execution context for AICore user kernels + * + * Topology data exposed to user kernels has two distinct lifetimes: + * + * 1. Global topology (per-core, fixed after runtime init): + * - sub_block_id : identifies the AIV lane within a cluster + * (0 = AIV0/left, 1 = AIV1/right). Initialized once at runtime + * startup based on each core's cluster position; never changes. + * Only meaningful for AIV kernels in MIX tasks. + * + * 2. Local topology (per-dispatch, changes each dispatch): + * - block_idx : which logical block the current worker is executing + * - block_num : total number of blocks in this task (= block_dim) + * Written by build_payload() before each dispatch. + * + * Both categories are injected via two pointer slots appended at the tail + * of the kernel args[] array: + * + * args layout: + * [0 .. tensor_count-1] = tensor GM pointers + * [tensor_count .. +scalar_count-1] = scalar values + * ... + * [SPMD_LOCAL_CONTEXT_INDEX] = (uint64_t)&LocalContext (per-dispatch) + * [SPMD_GLOBAL_CONTEXT_INDEX] = (uint64_t)&GlobalContext (per-core) + * + * The suffix positions are compile-time constants and do not depend on the + * runtime tensor_count or scalar_count. 
+ * + * Include this header in AICore kernel source files to use the Get* accessors. + * Do NOT depend on the raw index constants; always use the accessor functions. + * + * On CCEC (real hardware), __gm__ and __aicore__ must be defined before + * including this header (e.g. via a toolchain-provided header or a manual #define). + * The #ifndef guards below provide fallbacks for non-kernel builds + * (AICPU, HOST) where these qualifiers are not needed. + */ + +#pragma once + +#include <cstdint> + +#ifndef __gm__ +#define __gm__ +#endif + +#ifndef __aicore__ +#define __aicore__ +#endif + +/** Number of extra pointer slots appended to the args[] tail (LocalContext + GlobalContext). */ +static constexpr int32_t PTO2_EXT_PARAMS_COUNT = 2; + +/** + * Args[] suffix indices for context pointers. + * Derived from MAX_TENSOR_ARGS(16) + MAX_SCALAR_ARGS(128). + * Users should not depend on these values; use the Get* functions below. + */ +static constexpr int32_t SPMD_LOCAL_CONTEXT_INDEX = 144; +static constexpr int32_t SPMD_GLOBAL_CONTEXT_INDEX = 145; + +/** + * Per-core global context, stored in PTO2DispatchPayload. + * Initialized once at runtime startup (init_global_context) based on each + * core's cluster position. Never modified after initialization. + */ +struct GlobalContext { + // AIV lane within cluster: 0=AIV0(left), 1=AIV1(right). + // Used by AIV to select the correct intra-cluster hw instruction. + // Not meaningful for AIC kernels or single-AIV tasks. + int32_t sub_block_id; +}; + +/** + * Per-dispatch local context, stored in PTO2DispatchPayload. + * Written by build_payload() before each dispatch. Different blocks of the + * same task receive different block_idx values but the same block_num. + */ +struct LocalContext { + int32_t block_idx; // Logical block index within the task [0, block_num) + int32_t block_num; // How many logical blocks this task requires. + // Currently fixed to 1 (block_dim > 1 not yet implemented). 
+ // NOT the same as RUNTIME_CONFIG.block_dim in kernel_config.py, + // which controls how many physical cores the runtime launches. +}; + +/** + * Return the AIV lane index within the cluster. + * In a MIX 1C2V task: AIV0(left)=0, AIV1(right)=1. + * + * This value is only meaningful for AIV kernels in MIX tasks. It tells + * the AIV whether it is the left lane or the right lane within the cluster, + * which determines the correct hardware instruction for intra-cluster + * communication. + * + * AIC kernels should NOT call this function. + * Single-AIV tasks have no intra-cluster communication, so sub_block_id + * has no meaning and should not be used. + */ +static __aicore__ inline int32_t get_sub_block_id(__gm__ int64_t* args) { + __gm__ GlobalContext* ctx = + reinterpret_cast<__gm__ GlobalContext*>(static_cast<uintptr_t>(args[SPMD_GLOBAL_CONTEXT_INDEX])); + return ctx->sub_block_id; +} + +/** + * Return the logical block index assigned to the current worker. + * Range: [0, get_block_num(args)). + * Within the same task, different blocks receive different indices. + */ +static __aicore__ inline int32_t get_block_idx(__gm__ int64_t* args) { + __gm__ LocalContext* ctx = + reinterpret_cast<__gm__ LocalContext*>(static_cast<uintptr_t>(args[SPMD_LOCAL_CONTEXT_INDEX])); + return ctx->block_idx; +} + +/** + * Return how many logical blocks the current task requires. + * All blocks of the same task see the same value. + * Currently always returns 1 (block_dim>1 not yet implemented). + * + * Note: this is NOT the same as RUNTIME_CONFIG.block_dim in + * kernel_config.py, which controls how many physical cores are launched. 
+ */ +static __aicore__ inline int32_t get_block_num(__gm__ int64_t* args) { + __gm__ LocalContext* ctx = + reinterpret_cast<__gm__ LocalContext*>(static_cast<uintptr_t>(args[SPMD_LOCAL_CONTEXT_INDEX])); + return ctx->block_num; +} diff --git a/src/a2a3/runtime/tensormap_and_ringbuffer_unmodified/docs/MULTI_RING.md b/src/a2a3/runtime/tensormap_and_ringbuffer_unmodified/docs/MULTI_RING.md new file mode 100644 index 000000000..339c1ee5a --- /dev/null +++ b/src/a2a3/runtime/tensormap_and_ringbuffer_unmodified/docs/MULTI_RING.md @@ -0,0 +1,237 @@ +# Multi-Ring Buffer Architecture + +> Extension to the PTO2 runtime. For the base architecture, see [RUNTIME_LOGIC.md](RUNTIME_LOGIC.md). + +## 1. Problem + +The single-ring design uses one `last_task_alive` watermark shared by HeapRing, TaskRing, and DepPool. When tasks from an inner scope (e.g., per-block iteration) complete, their resources cannot be reclaimed until **all** prior tasks — including those from the outer scope — also complete. This wastes ring capacity and can trigger deadlocks when ring sizes are small. + +## 2. Solution + +Split HeapRing, TaskRing, and DepPool into arrays of `PTO2_MAX_RING_DEPTH` (4) independent instances. Each scope depth maps to its own ring, with an independent `last_task_alive` watermark. + +``` +Scope depth 0 ──► rings[0] = { HeapRing, TaskRing, DepPool } +Scope depth 1 ──► rings[1] = { HeapRing, TaskRing, DepPool } +Scope depth 2 ──► rings[2] = { HeapRing, TaskRing, DepPool } +Scope depth ≥3 ──► rings[3] = { HeapRing, TaskRing, DepPool } (clamped) +``` + +Inner-scope tasks can now be reclaimed independently without waiting for outer-scope tasks to complete. + +## 3. 
Task ID Encoding + +Task IDs are widened from 32-bit to 64-bit to carry the ring identity: + +``` +task_id.raw = (ring_id << 32) | local_id +``` + +`PTO2TaskId` exposes direct accessors in `pto_runtime2_types.h`: + +| API | Purpose | +|-----|---------| +| `pto2_make_task_id(ring_id, local_id)` | Compose a 64-bit task ID (`PTO2TaskId`) | +| `task_id.ring()` | Extract `ring_id` (bits 63-32) | +| `task_id.local()` | Extract `local_id` (bits 31-0) | +| `task_id.raw` | Access the packed 64-bit encoding | + +Type changes: + +| Field | Before | After | +|-------|--------|-------| +| `PTO2TaskDescriptor.task_id` | `int32_t` | `PTO2TaskId` | +| `PTO2TensorMapEntry.producer_task_id` | `int32_t` | `PTO2TaskId` | +| `PTO2TaskSlotState.ring_id` | N/A | `uint8_t` (new, denormalized for fast access) | + +## 4. Data Structures + +### 4.1 PTO2RingSet (new) + +Bundles the three per-ring resources into a single aggregate (`pto_ring_buffer.h`): + +```cpp +struct PTO2RingSet { + PTO2HeapRing heap_ring; + PTO2TaskRing task_ring; + PTO2DepListPool dep_pool; +}; +``` + +### 4.2 PTO2OrchestratorState (modified) + +```cpp +// Before: single ring +PTO2HeapRing heap_ring; +PTO2TaskRing task_ring; +PTO2DepListPool dep_pool; + +// After: per-ring array +PTO2RingSet rings[PTO2_MAX_RING_DEPTH]; +int32_t dep_pool_last_reclaimed[PTO2_MAX_RING_DEPTH]; +``` + +Ring selection: `current_ring_id() = min(scope_stack_top, PTO2_MAX_RING_DEPTH - 1)`. 
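The packing in the table above can be sketched as a small standalone helper. This is an illustrative re-implementation of the encoding, not the runtime's code: the real `PTO2TaskId` lives in `pto_runtime2_types.h`, and the `TaskId`/`make_task_id` names below are hypothetical stand-ins for `pto2_make_task_id` and the accessors.

```cpp
#include <cstdint>

// Illustrative 64-bit task ID: ring_id in bits 63-32, local_id in bits 31-0,
// mirroring task_id.raw = (ring_id << 32) | local_id from the text.
struct TaskId {
    uint64_t raw;
    int32_t ring() const { return static_cast<int32_t>(raw >> 32); }           // bits 63-32
    int32_t local() const { return static_cast<int32_t>(raw & 0xFFFFFFFFu); }  // bits 31-0
};

inline TaskId make_task_id(int32_t ring_id, int32_t local_id) {
    // Widen through uint32_t first so the shift is well-defined.
    return TaskId{(static_cast<uint64_t>(static_cast<uint32_t>(ring_id)) << 32) |
                  static_cast<uint32_t>(local_id)};
}
```

With this encoding, ring 0/local 0 and ring 1/local 0 have distinct `raw` values even though their `local()` parts match, which is exactly the distinction the 32-bit register protocol in section 6 loses.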
+
+### 4.3 PTO2SharedMemoryHeader (modified)
+
+Per-ring flow control and per-ring layout info are grouped together:
+
+```cpp
+struct PTO2RingFlowControl {
+    std::atomic<int64_t> current_task_index;  // task ring head
+    std::atomic<int64_t> last_task_alive;     // task ring tail
+    std::atomic<uint64_t> heap_top;           // heap alloc pointer
+    std::atomic<uint64_t> heap_tail;          // heap reclaim pointer
+};
+
+struct PTO2SharedMemoryRingHeader {
+    PTO2RingFlowControl fc;
+    uint64_t task_window_size;
+    uint64_t heap_size;
+    uint64_t task_descriptors_offset;
+};
+
+// In header:
+PTO2SharedMemoryRingHeader rings[PTO2_MAX_RING_DEPTH];
+```
+
+The global `heap_tail_gen` ticket counter is removed; each ring's scheduler state serializes ring-advance via a per-ring try-lock.
+
+### 4.4 PTO2SharedMemoryHandle (modified)
+
+Per-ring descriptor and payload arrays:
+
+```cpp
+PTO2TaskDescriptor* task_descriptors[PTO2_MAX_RING_DEPTH];
+PTO2TaskPayload* task_payloads[PTO2_MAX_RING_DEPTH];
+```
+
+### 4.5 PTO2SchedulerState (modified)
+
+```cpp
+struct RingSchedState {
+    PTO2TaskSlotState* slot_states;
+    int32_t task_window_size;
+    int32_t task_window_mask;
+    std::atomic<bool> advance_lock;
+};
+
+RingSchedState ring_sched_states[PTO2_MAX_RING_DEPTH];
+```
+
+### 4.6 PTO2TensorMap (modified)
+
+```cpp
+PTO2TensorMapEntry** task_entry_heads[PTO2_MAX_RING_DEPTH];
+int64_t last_task_alives[PTO2_MAX_RING_DEPTH];
+```
+
+Entry validity checks and `cleanup_retired` operate per-ring:
+
+```cpp
+bool entry_valid(const PTO2TensorMapEntry& e) {
+    int32_t ring = e.producer_task_id.ring();
+    int32_t local = e.producer_task_id.local();
+    return local >= last_task_alives[ring];
+}
+```
+
+### 4.7 Unchanged Structures
+
+| Structure | Reason |
+|-----------|--------|
+| `PTO2DepListEntry` | Stores `PTO2TaskSlotState*` pointer — naturally crosses ring boundaries |
+| `PTO2TaskPayload` | `fanin_slot_states[]` are pointers — no ring coupling |
+| `PTO2ReadyQueue` | Global ready queues shared across all rings (tasks ready to dispatch regardless of origin ring) |
+| `PTO2DispatchPayload` | Built per-dispatch, no ring state needed |
+
+## 5. Reclamation
+
+### 5.1 Per-Ring Watermark Advancement
+
+Each ring's `last_task_alive` advances independently:
+
+```
+advance_ring_pointers(ring_id):
+    la = rings[ring_id].fc.last_task_alive
+    while task_state[la & mask] >= CONSUMED:
+        advance heap_tail from packed_buffer_end
+        reset fanin_refcount
+        CAS(last_task_alive, la, la+1)
+        la++
+```
+
+Per-ring try-locks in the scheduler state prevent concurrent scheduler threads from interleaving `heap_tail` writes within the same ring.
+
+### 5.2 Cross-Ring Dependencies
+
+Dependency edges use `PTO2TaskSlotState*` pointers, which naturally span rings:
+
+- Ring 1 task depends on ring 0 producer → ring 0's `fanout_head` linked list contains a ring 1 `PTO2TaskSlotState*`
+- When a ring 0 task completes, it walks its fanout list and increments ring 1 consumers' `fanin_refcount`
+- No special cross-ring logic needed — the pointer-based design is ring-agnostic
+
+### 5.3 DepPool Reclamation
+
+```
+pto2_dep_pool_reclaim(ring_id):
+    la = rings[ring_id].fc.last_task_alive
+    newest_consumed = la - 1
+    mark = task_payloads[ring_id][slot(newest_consumed)].dep_pool_mark
+    if mark > 0:
+        rings[ring_id].dep_pool.advance_tail(mark)
+```
+
+Note: dep entries from ring N's pool may appear in ring M's fanout lists. Reclamation is safe because the entries are accessed during fanout traversal (completion time), which always happens before the consumer task — and therefore the dep entry — becomes eligible for reclamation.
+
+## 6. AICPU Register Protocol Fix
+
+The AICore dispatch protocol uses 32-bit registers. With multi-ring, `task_id` truncation to 32-bit loses the `ring_id`, causing collisions:
+
+```
+Ring 0, local_id=0 → DATA_MAIN_BASE = 0 + 1 = 1
+Ring 1, local_id=0 → DATA_MAIN_BASE = 0 + 1 = 1 (collision!)
+```
+
+AICore uses `last_reg_val` to detect new dispatches — identical values cause skipped tasks and false completions from stale COND registers.
+ +**Fix**: Per-core monotonic dispatch counter `s_dispatch_seq[core_id]` replaces `task_id` in register writes, guaranteeing unique `DATA_MAIN_BASE` values per core regardless of ring origin. + +## 7. Configuration + +### 7.1 Compile-Time Defaults (per ring) + +| Constant | Default | Total (×4 rings) | +|----------|---------|-------------------| +| `PTO2_TASK_WINDOW_SIZE` | 16384 | 65536 | +| `PTO2_HEAP_SIZE` | 256 MB | 1 GB | +| `PTO2_DEP_LIST_POOL_SIZE` | 16384 | 65536 | + +### 7.2 Runtime Environment Overrides + +Uniform (applies to all rings): + +``` +PTO2_RING_TASK_WINDOW=1024 +PTO2_RING_HEAP=1048576 +PTO2_RING_DEP_POOL=1024 +``` + +In `kernel_config.py`: + +```python +RUNTIME_ENV = { + "PTO2_RING_TASK_WINDOW": "128", + "PTO2_RING_HEAP": "262144", + "PTO2_RING_DEP_POOL": "256", +} +``` + +### 7.3 Sizing Guidelines + +- `task_window` must be ≥ max tasks in any single scope + headroom for concurrent scopes +- `heap` must accommodate peak output buffer allocation across all in-flight tasks on that ring +- `dep_pool` must be ≥ total dependency entries for all in-flight tasks on that ring +- On hardware, back-pressure latency is higher than in simulation — size conservatively +- Adding inner `PTO2_SCOPE` reduces peak per-ring usage, enabling smaller sizes diff --git a/src/a2a3/runtime/tensormap_and_ringbuffer_unmodified/docs/ROADMAP.md b/src/a2a3/runtime/tensormap_and_ringbuffer_unmodified/docs/ROADMAP.md new file mode 100644 index 000000000..0dae7b8c8 --- /dev/null +++ b/src/a2a3/runtime/tensormap_and_ringbuffer_unmodified/docs/ROADMAP.md @@ -0,0 +1,86 @@ +# PTO2 Runtime Roadmap: Advanced Scheduling Features + +This document outlines the planned features and architectural changes for the PTO2 runtime system, specifically focusing on advanced cluster-aware and block-level scheduling semantics. + +## 1. 
In-Cluster Function Group Scheduling + +**Goal:** Enable co-scheduling of multiple tasks onto the same physical hardware cluster to leverage local interconnects and optimize data locality. + +### Concept +An **in-cluster function group** consists of all incore functions submitted between an `allocate_cluster()` and a `free_cluster()` call (or within a managed scope). The runtime treats this group as a co-scheduled unit: every task in the group executes on the **same physical cluster** (identified by a `clusterID`). + +### Required Architectural Changes + +#### 1. Task Descriptor Extension +The `PTO2TaskDescriptor` will be extended to record function group membership: +- `cluster_id` (int32_t): ID of the allocated cluster (-1 = unconstrained). +- `group_id` (int32_t): Function group identifier. + +#### 2. Orchestration API Additions +```cpp +// Allocate a cluster. Blocks if no cluster is available. +int32_t pto2_rt_allocate_cluster(PTO2Runtime* rt); + +// Release a cluster back to the free pool. +void pto2_rt_free_cluster(PTO2Runtime* rt, int32_t cluster_id); + +// Submit a task constrained to a specific cluster. +void pto2_rt_submit_task_clustered(PTO2Runtime* rt, int kernel_id, + int worker_type, Arg* args, + int n, int32_t cluster_id); +``` + +#### 3. Scheduler Enhancements +- **Cluster ↔ Core mapping**: A static, platform-specific mapping from `cluster_id` to the set of physical cores (e.g., cluster 0 = {AIC0, AIV0, AIV1}). +- **Cluster-Aware Dispatch**: When popping a task, if `cluster_id >= 0`, the scheduler dispatches it *only* to a core belonging to that specific cluster. +- **Cluster Free Pool**: A ring or bitset tracking free clusters to handle allocation and release. +- **Back-Pressure**: `pto2_rt_allocate_cluster` will implement a spin-wait pattern with deadlock detection, similar to the existing task and heap rings. + +--- + +## 2. 
`block_incore` (SPMD → MPMD) Task Submission + +**Goal:** Support executing a single logical SPMD block function as multiple independent MPMD tasks across available cores. + +### Execution Model +At the runtime level, the orchestration layer will **expand** a single `block_incore` call (with a specified `block_dim`) into `block_dim` individual tasks, each with a distinct `block_id`. + +```cpp +// Orchestration expansion logic +PTO2_SCOPE(rt) { + for (int bid = 0; bid < block_dim; bid++) { + // ... build args with make_scalar_param(bid) ... + pto2_rt_submit_task(rt, KERNEL_FUNC_ID, PTO2_WORKER_VECTOR, args, 4); + } +} +``` + +### Future Optimization Path +While the initial implementation will use O(N) expansion (submitting N individual task descriptors), future optimizations may include: +- **Batch Descriptors**: A single descriptor containing a `block_dim` field. +- **Group-Aware Dispatch**: The scheduler scans one descriptor and expands it into `block_dim` hardware dispatches. +- **Shared-Tensor Optimization**: Reducing TensorMap entries by having one entry per logical tensor instead of per-block tensor. + +--- + +## 3. `block_incore` as InCore Function (Cube + Vector) + +**Goal:** Allow a `block_incore` function to be a composite subgraph requiring both AIC (Cube) and AIV (Vector) cores working cooperatively on the same data block. + +### Execution Model +When combined with cluster allocation, both the cube and vector tasks of each block are pinned to the **same cluster**. This ensures they execute on co-located cores and can utilize local interconnects (e.g., `PIPE_IN`/`PIPE_OUT`) without round-tripping to Global Memory. 
+ +```cpp +// Each block runs its cube and vector kernels on the same cluster +int32_t cid = pto2_rt_allocate_cluster(rt); +PTO2_SCOPE(rt) { + pto2_rt_submit_task_clustered(rt, CUBE_KERNEL, PTO2_WORKER_CUBE, ..., cid); + pto2_rt_submit_task_clustered(rt, VEC_KERNEL, PTO2_WORKER_VECTOR, ..., cid); +} +pto2_rt_free_cluster(rt, cid); +``` + +### Data Structure Impact Summary +- `PTO2TaskDescriptor`: Add `cluster_id`, `group_id`, `block_id`, `block_dim`. +- `PTO2SharedMemoryHeader`: Add cluster free pool tracking. +- **Scheduler**: Cluster-aware dispatch logic and cluster-to-core mapping tables. diff --git a/src/a2a3/runtime/tensormap_and_ringbuffer_unmodified/docs/RUNTIME_LOGIC.md b/src/a2a3/runtime/tensormap_and_ringbuffer_unmodified/docs/RUNTIME_LOGIC.md new file mode 100644 index 000000000..ecadc65c5 --- /dev/null +++ b/src/a2a3/runtime/tensormap_and_ringbuffer_unmodified/docs/RUNTIME_LOGIC.md @@ -0,0 +1,658 @@ +# PTO2 Runtime System Design + +## Overview + +PTO2 (Parallel Task Orchestration v2) is a runtime system for executing task graphs on Ascend AI processors. It coordinates four layers of execution: + +- **Host** (x86/ARM CPU): compiles kernels, allocates device memory, initializes the Runtime, and launches AICPU/AICore threads. +- **AICPU** (device ARM cores): runs the orchestrator (task graph builder) and scheduler threads. +- **AICore** (AI compute cores): executes kernel functions dispatched by the scheduler. +- **Shared Memory** (Global Memory): ring buffers, task descriptors, heap, and TensorMap shared between orchestrator and schedulers. 
+ +``` +┌───────────────────────────────────────────────────────────────────────┐ +│ Host (CPU) │ +│ golden.py → code_runner.py → compile kernels → init Runtime │ +│ → upload binaries → launch AICPU/AICore → collect results │ +└───────────────────────────┬───────────────────────────────────────────┘ + │ device memory / GM +┌───────────────────────────▼───────────────────────────────────────────┐ +│ AICPU (4 threads) │ +│ Thread 3: Orchestrator (builds task graph) │ +│ Threads 0-2: Schedulers (dispatch tasks to AICore) │ +│ │ +│ ┌─────────────────────────────────────────────────────────────────┐ │ +│ │ Shared Memory (GM) │ │ +│ │ SharedMemoryHeader │ TaskDescriptors[] │ DepListPool │ │ +│ │ GM Heap (output buffers) │ │ +│ └─────────────────────────────────────────────────────────────────┘ │ +│ │ +│ Scheduler ──Handshake/Registers──► AICore workers (AIC + AIV) │ +└───────────────────────────────────────────────────────────────────────┘ +``` + +--- + +## 1. Runtime Variants + +Three runtime backends exist under `src/runtime/`, each representing a different orchestration and scheduling strategy. + +### 1.1 host_build_graph + +The host builds the complete task graph before launching device execution. The orchestration SO is loaded and executed on the host CPU. + +- **Task storage**: fixed `Task[]` array (up to 131072 tasks) +- **Scheduling**: AICPU receives the pre-built graph and dispatches tasks by traversing dependencies +- **Use case**: development and debugging; no device-side orchestration overhead + +### 1.2 aicpu_build_graph + +The orchestration runs on an AICPU thread, building the task graph on device. Supports concurrent build + schedule (`build_mode=1`). 
+ +- **Task storage**: same `Task[]` array as host_build_graph +- **AicpuBuildApi**: `add_task`, `add_successor_conditional`, `publish_task`, `device_malloc` +- **Use case**: reduced host→device data transfer; graph can depend on device-side data + +### 1.3 tensormap_and_ringbuffer (PTO2) + +The primary production runtime. Uses ring buffers for task slots and output memory, with a TensorMap for automatic dependency tracking. + +- **Task storage**: `PTO2TaskDescriptor[]` in shared memory ring buffer +- **Memory**: GM Heap ring for output buffer allocation +- **Dependencies**: automatically derived from tensor read/write patterns via TensorMap +- **Thread model**: 3 scheduler threads + 1 orchestrator thread on AICPU +- **Multi-ring**: HeapRing, TaskRing, and DepPool are split into `PTO2_MAX_RING_DEPTH` (4) independent instances for nested scope isolation. See [MULTI_RING.md](MULTI_RING.md) for details. +- **Use case**: production workloads; supports streaming, flow control, and large batch sizes + +--- + +## 2. Platform Abstraction + +Two platform implementations exist under `src/platform/`, sharing a common interface. 
+ +### 2.1 a2a3 (Real Ascend Hardware) + +| Component | Description | +|-----------|-------------| +| `device_runner.cpp` | Uses CANN APIs: `rtMalloc`, `rtMemcpy`, `rtLaunchKernel` | +| `memory_allocator.cpp` | Wraps `rtMalloc`/`rtFree` with allocation tracking | +| `aicore/kernel.cpp` | `KERNEL_ENTRY(aicore_kernel)` → `aicore_execute` | +| `aicpu/kernel.cpp` | `DynTileFwkBackendKernelServer` entry → `aicpu_execute` | +| `spin_hint.h` | ARM `wfe`/`yield` instructions for efficient spinning | + +### 2.2 a2a3sim (Thread Simulation) + +| Component | Description | +|-----------|-------------| +| `device_runner.cpp` | Uses `std::thread` to simulate AICPU/AICore | +| `memory_allocator.cpp` | Wraps `malloc`/`free` | +| `aicore/kernel.cpp` | `aicore_execute_wrapper` sets `g_sim_reg_base` per core | +| `upload_kernel_binary` | `dlopen` kernel SO, `dlsym` entry point | + +### 2.3 Platform Constants (`platform_config.h`) + +| Constant | Value | Description | +|----------|-------|-------------| +| `PLATFORM_MAX_BLOCKDIM` | 24 | Maximum blocks (each = 1 AIC + 2 AIV) | +| `PLATFORM_MAX_AICPU_THREADS` | 4 | AICPU thread count (3 schedulers + 1 orchestrator) | +| `PLATFORM_MAX_AIC_PER_THREAD` | 24 | Max AIC cores per scheduler thread | +| `PLATFORM_MAX_AIV_PER_THREAD` | 48 | Max AIV cores per scheduler thread | +| `PLATFORM_PROF_SYS_CNT_FREQ` | 50 MHz | System counter frequency for profiling | + +--- + +## 3. Shared Memory Layout + +The orchestrator and schedulers communicate through a contiguous shared memory region in Global Memory (GM). Each ring level has its own TaskDescriptor and DepListPool sections. See [MULTI_RING.md §4.3–4.4](MULTI_RING.md) for the per-ring shared memory header and handle layout. 
+ +``` +┌─────────────────────────────┐ offset 0 +│ PTO2SharedMemoryHeader │ (flow control, config, sync flags) +├─────────────────────────────┤ aligned +│ PTO2TaskDescriptor[N] │ N = task_window_size (default 65536) +├─────────────────────────────┤ aligned +│ PTO2DepListEntry[M+1] │ M = dep_list_pool_size (entry 0 = NULL sentinel) +└─────────────────────────────┘ +``` + +### 3.1 SharedMemoryHeader Fields + +| Field | Writer | Reader | Purpose | +|-------|--------|--------|---------| +| `current_task_index` | Orchestrator | Scheduler | Next task ID to allocate (task ring head) | +| `last_task_alive` | Scheduler | Orchestrator | Oldest still-active task (task ring tail) | +| `heap_top` | Orchestrator | Scheduler | Heap ring allocation pointer | +| `heap_tail` | Scheduler | Orchestrator | Heap ring reclamation pointer | +| `heap_tail_gen` | Scheduler | Scheduler | Ticket counter for serialized `heap_tail` writes | +| `orchestrator_done` | Orchestrator | Scheduler | Signals orchestration completion | +| `task_window_size` | Init | Both | Number of task slots | +| `heap_size` | Init | Both | Heap total size | +| `dep_list_pool_size` | Init | Both | Dependency list pool size | +| `task_descriptors_offset` | Init | Both | Offset to TaskDescriptor array in SM | +| `dep_list_pool_offset` | Init | Both | Offset to DepListPool in SM | +| `total_size` | Init | Both | Total shared memory size | +| `graph_output_ptr` | Orchestrator | Host | Address of final output (packed buffer) | +| `graph_output_size` | Orchestrator | Host | Size of final output in bytes | + +### 3.2 Size Calculation + +``` +total = ALIGN(Header) + ALIGN(window_size * sizeof(TaskDescriptor)) + + ALIGN((dep_pool_size + 1) * sizeof(DepListEntry)) +``` + +Alignment is 64 bytes (`PTO2_ALIGN_SIZE`). + +--- + +## 4. Ring Buffer Mechanisms + +> **Multi-ring extension**: All three ring buffers (TaskRing, HeapRing, DepPool) are replicated per scope depth. Each ring level has independent watermarks and reclamation. 
See [MULTI_RING.md](MULTI_RING.md) for details. + +### 4.1 Task Ring + +The task ring manages task slot allocation with back-pressure flow control. + +**Structure** (`PTO2TaskRing`): +- `descriptors`: pointer to `TaskDescriptor[]` in shared memory +- `window_size`: number of slots (power of 2) +- `current_index`: next task ID to allocate (monotonically increasing) +- `last_alive_ptr`: pointer to `header->last_task_alive` + +**Slot mapping**: `slot = task_id & (window_size - 1)` + +**Allocation** (`pto2_task_ring_alloc`): +``` +active_count = current_index - *last_alive_ptr +if active_count < window_size - 1: + allocate slot, advance current_index +else: + spin-wait (back-pressure from scheduler) +``` + +**Reclamation**: Scheduler threads advance `last_task_alive` via lock-free CAS when the oldest task reaches state CONSUMED (4). This frees slots for reuse. + +**Flow control**: When the ring is full, the orchestrator blocks until the scheduler advances `last_task_alive`. With `PTO2_RING_TASK_WINDOW=16` and 208 tasks, slots are recycled ~13 times each. + +### 4.2 Heap Ring + +The heap ring manages output buffer allocation from a circular GM heap. + +**Structure** (`PTO2HeapRing`): +- `base`: GM heap base address +- `size`: total heap size (default 1 GB) +- `top`: allocation pointer (local to orchestrator) +- `tail_ptr`: pointer to `header->heap_tail` (updated by scheduler) + +**Allocation**: Buffers are allocated contiguously from `top`. When reaching the end, allocation wraps to the beginning if `tail` has advanced far enough. Buffers never straddle the wrap-around boundary. + +**Reclamation**: When `last_task_alive` advances past a task, its `packed_buffer_end` is used to advance `heap_tail`, freeing the memory region. + +### 4.3 Dependency List Pool + +A simple bump allocator for `PTO2DepListEntry` nodes used in fanin/fanout linked lists. 
+ +- **Entry 0**: NULL sentinel (`task_id=-1, next_offset=0`) +- **Allocation**: `pool->top++`, wraps around when full +- **Reclamation**: implicit — old entries become unreachable as `last_task_alive` advances + +### 4.4 Flow Control and Back-Pressure + +The ring buffer mechanism provides **flow control** between the orchestrator (producer) and the scheduler (consumer). When a ring is exhausted, the orchestrator **blocks** — it cannot submit new tasks or allocate more output memory until the scheduler reclaims slots/space by advancing the watermarks. + +**Task Ring back-pressure**: When `active_count = current_index - last_task_alive >= window_size - 1`, `pto2_task_ring_alloc` spin-waits until the scheduler completes tasks and advances `last_task_alive`. + +**Heap Ring back-pressure**: When the heap has insufficient contiguous space, `pto2_heap_ring_alloc` spin-waits until the scheduler advances `heap_tail` past completed tasks' output buffers. + +**TensorMap pool back-pressure**: When the entry pool is exhausted, `new_entry()` spin-waits on `pto2_orchestrator_sync_tensormap(force=true)` until cleanup frees entries (see Section 5.4). + +This back-pressure is essential for correctness with small ring sizes — for example, with `PTO2_RING_TASK_WINDOW=16` and 208 tasks, the orchestrator blocks ~192 times, each time waiting for the scheduler to drain completed tasks before continuing. + +### 4.5 Deadlock Detection + +A ring that is **too small** can cause a **deadlock**. The root cause is the scope mechanism: each task's `fanout_count` includes a reference from its owning scope. The scope reference is only released when `scope_end()` runs — but `scope_end()` is called by the orchestrator, which is blocked waiting for ring space. 
This creates a circular dependency: + +``` +Orchestrator blocked on task_ring_alloc (ring full) + → needs scheduler to advance last_task_alive + → needs tasks to reach CONSUMED state (fanout_count == 0) + → needs scope_end() to release scope reference + → needs orchestrator to continue + → DEADLOCK +``` + +The runtime detects this automatically by counting spin iterations in the allocation functions: + +**Periodic BLOCKED warnings** (every 10,000 spins): +``` +[TaskRing] BLOCKED (Flow Control): current=208, last_alive=192, active=16/16 (100.0%), spins=10000 +[HeapRing] BLOCKED: requesting 4096 bytes, available=0, top=65536, tail=0, spins=10000 +``` + +**Deadlock detection** (after 100,000 spins with no progress): +``` +FATAL: Flow Control Deadlock Detected! +Task Ring is FULL and no progress after 100000 spins. + - Active tasks: 16 + - Window size: 16 +Root Cause: + Tasks cannot transition to CONSUMED state because fanout_count + includes 1 for the owning scope, and scope_end() requires the + orchestrator to continue — creating a circular dependency. +Solution: + Recommended: 32 (at least 2x current active tasks) +``` + +The FATAL message is logged to the device log and the process exits. The solution is to increase the ring size so that it can hold at least all tasks within the largest parallel scope. For example, if a scope submits 13 tasks, `task_window >= 14` is required (13 + 1 to distinguish full from empty). + +**Sizing guideline**: `task_window_size` must be larger than the maximum number of tasks in any single `PTO2_SCOPE`. A safe choice is `2 × max_tasks_per_scope` or simply the default 65536 for production. + +--- + +## 5. TensorMap and Automatic Dependency Tracking + +### 5.1 Purpose + +TensorMap maintains a mapping from tensor memory regions to their producer task IDs. When a new task reads a tensor (INPUT/INOUT), TensorMap automatically discovers the producer and establishes a dependency edge. 
+ +### 5.2 Hash Table Design + +- **Key**: tensor base address (`buffer.addr`) +- **Value**: producer task ID, with overlap detection for sub-regions +- **Overlap**: `COVERED` (new region fully contains old) or `OTHER` (partial overlap) +- Sub-tensors of the same base tensor hash to the same bucket, enabling overlap detection + +### 5.3 Entry Pool Management + +Unlike the Task Ring and Heap Ring, TensorMap entries are **not** managed by a ring buffer. Instead, a **fixed-size pool + free list** is used: + +1. **Free list first**: `free_entry_list[]` stores pointers to released entries. Allocation pops from here (O(1)). +2. **Bump allocation**: if free list is empty, `entry_pool[next_entry_idx++]` allocates from the end of the pool. +3. **Blocking reclaim**: if the pool is fully exhausted, `pto2_orchestrator_sync_tensormap(force=true)` reads the latest `last_task_alive` and calls `cleanup_retired` to batch-free all entries belonging to retired tasks, returning them to the free list. + +This design avoids the complexity of ring-based wrapping while still being bounded by `PTO2_TENSORMAP_POOL_SIZE` (default 65536 entries). + +### 5.4 Stale Entry Cleanup: Three-Layer Defense + +TensorMap must ensure entries for retired tasks (`producer_task_id < last_task_alive`) are removed, so that: +- The pool does not grow unboundedly (capacity is finite) +- Lookup performance does not degrade as stale entries accumulate in bucket chains + +Three complementary mechanisms achieve this: + +**Layer 1 — Chain Truncation during Lookup** (lazy, per-bucket): + +Since `insert` always prepends to the bucket head, entries in each bucket chain are in **descending task_id order**. When `pto2_tensormap_lookup` encounters the first stale entry (`producer_task_id < last_task_alive`), all subsequent entries in the chain are guaranteed stale too. The entire tail is truncated in one operation using `prev_in_bucket` pointers for O(1) unlinking. 
+ +This guarantees lookup only traverses valid entries — O(valid_entries_in_bucket), not O(total_entries). + +**Layer 2 — Periodic Batch Cleanup** (`cleanup_retired`, per-task): + +Every time the orchestrator submits a task (Step 0 of `pto2_submit_task`), it calls `pto2_orchestrator_sync_tensormap`. When `last_task_alive` has advanced by more than `PTO2_TENSORMAP_CLEANUP_INTERVAL` (default 64) tasks since the last cleanup, `pto2_tensormap_cleanup_retired` runs: + +This uses the **per-task entry chain** (`task_entry_head[task_slot]`) — each task's entries are doubly-linked together at insert time via `next_in_task`/`prev_in_task`, allowing O(entries_per_task) cleanup without scanning the entire pool or all buckets. Freed entries are returned to `free_entry_list` for immediate reuse. + +**Layer 3 — Back-Pressure on Pool Exhaustion** (blocking): + +If both the free list and bump region are depleted, `new_entry()` blocks until `pto2_orchestrator_sync_tensormap(force=true)` frees entries by advancing `last_task_alive` through `cleanup_retired`. + +This forms a back-pressure mechanism analogous to the Task Ring's flow control. + +**Summary**: + +| Layer | Trigger | Method | Guarantees | +|-------|---------|--------|------------| +| Chain Truncation | Every lookup | Truncate stale tail of bucket chain | Lookup only visits valid entries | +| Periodic Cleanup | Every 64 retired tasks | Walk per-task chains, free entries | Pool capacity reclaimed in bounded time | +| Pool Back-Pressure | Pool exhausted | Block until scheduler advances watermark | Hard capacity bound, no OOM | + +In steady state, the number of valid TensorMap entries ≈ `active_tasks × avg_outputs_per_task`. With the default `task_window=65536` and `pool_size=65536`, this is well within bounds. With small windows (e.g., `task_window=16`), active entries are even fewer (~16 × a few), and cleanup runs frequently. + +### 5.5 Dependency Discovery Flow + +When `pto2_submit_task` processes parameters: + +1. 
**INPUT/INOUT**: `pto2_tensormap_lookup` searches for overlapping producers (with chain truncation)
+2. For each producer found: `pto2_add_consumer_to_producer` adds the dependency
+3. **OUTPUT/INOUT**: `pto2_tensormap_insert` registers the current task as the new producer at bucket head
+4. Stale entries are pruned lazily during lookup (Layer 1) and periodically by cleanup (Layer 2)
+
+---
+
+## 6. Task Descriptor and States
+
+### 6.1 PTO2TaskDescriptor (Hot Path)
+
+| Field | Description |
+|-------|-------------|
+| `task_id` | Canonical mixed-task ID (64-bit: `ring_id << 32 \| local_id`). See [MULTI_RING.md §3](MULTI_RING.md). |
+| `kernel_id[3]` | Per-slot kernel IDs: `[AIC, AIV0, AIV1]`; `INVALID_KERNEL_ID` = inactive |
+| `active_mask` | Bitmask of active subtask slots: `bit0=AIC`, `bit1=AIV0`, `bit2=AIV1` |
+| `subtask_done_mask` | Atomic bitmask; each subtask sets its done bit on completion |
+| `fanin_count` | Number of producer dependencies |
+| `fanout_lock` | Per-task spinlock for concurrent fanout modification |
+| `fanout_head` | Head of fanout consumer list (pointer, protected by `fanout_lock`) |
+| `fanout_count` | 1 (scope ref) + number of consumers |
+| `packed_buffer_base` | Start of packed buffer in GM Heap |
+| `packed_buffer_end` | End of packed buffer (for heap reclamation) |
+
+### 6.1b PTO2TaskPayload (Cold Path)
+
+| Field | Description |
+|-------|-------------|
+| `tensors[16]` | Tensor descriptors for parameters |
+| `scalar_value[16]` | Scalar parameter values |
+| `is_tensor[16]` | Whether each parameter is tensor or scalar |
+| `param_count` | Number of valid parameters |
+| `fanin_slot_states[]` | Producer slot state pointers (used by `on_task_release`) |
+| `fanin_actual_count` | Actual fanin count |
+
+### 6.2 Task State Machine
+
+```
+  [0] PENDING ──fanin satisfied──► [1] READY ──dispatch──► [2] RUNNING
+       ▲                                                       │
+       │                                                       ▼
+  slot recycled ◄── [4] CONSUMED ◄──fanout done── [3] COMPLETED
+```
+
+In the scheduler's `task_state[]` array (`std::atomic<int32_t>`):
+- **0 (PENDING)**: waiting for dependencies (`fanin_refcount < fanin_count`)
+- **1 (READY)**: all dependencies satisfied, waiting in ready queue
+- **2 (RUNNING)**: currently executing on a worker
+- **3 (COMPLETED)**: hardware execution complete, output may still be in use
+- **4 (CONSUMED)**: output fully consumed, buffers can be released
+
+---
+
+## 7. Orchestrator
+
+### 7.1 PTO2OrchestratorState
+
+The orchestrator runs on AICPU Thread 3 and builds the task graph by calling the user-provided orchestration function.
+
+Key members:
+- `rings[PTO2_MAX_RING_DEPTH]`: per-ring `PTO2RingSet` (HeapRing + TaskRing + DepPool). See [MULTI_RING.md §4.2](MULTI_RING.md).
+- `tensor_map`, `tensor_pool`: dependency tracking
+- `scope_tasks[]`, `scope_begins[]`, `scope_stack_top`: scope nesting stack (flat buffer partitioned by level)
+- `scheduler`: pointer to scheduler state (for simulated mode or `init_task_on_submit`)
+- `gm_heap_base`, `gm_heap_size`: GM heap for output buffers
+
+### 7.2 Task Submission Flow (`pto2_submit_task`)
+
+| Step | Operation |
+|------|-----------|
+| 0 | `pto2_orchestrator_sync_tensormap` — prune stale TensorMap entries |
+| 1 | `pto2_task_ring_alloc` — allocate task slot (may block on flow control) |
+| 2 | Initialize task descriptor, copy parameters |
+| 3 | **Lookup**: for each INPUT/INOUT param, search TensorMap for producers |
+| 4 | **Dependency**: `pto2_add_consumer_to_producer` for each producer found |
+| 5 | **Heap alloc**: `pto2_alloc_packed_buffer` for OUTPUT args (addr=0) |
+| 6 | **Insert**: register OUTPUT/INOUT args in TensorMap |
+| 7 | **Fanin**: finalize `fanin_count`; if `init_task_on_submit`, call scheduler's `init_task` |
+| 8 | **Publish**: `STORE_RELEASE(current_task_index)` makes task visible to scanners |
+
+### 7.3 Lock Protocol for Concurrent Dependency Setup
+
+The orchestrator and scheduler run concurrently. When adding a consumer to a producer's fanout list:
+
+1. 
**Orchestrator acquires** the producer's `fanout_lock` via `pto2_fanout_lock(task)` (CAS spin-lock) +2. **Normal path**: prepend consumer to the producer's fanout list, increment `fanout_count` +3. **Release** `fanout_lock` + +The scheduler's completion handler mirrors this: +1. Mark `task_state[slot] = COMPLETED` +2. **Acquire** `fanout_lock`, read `fanout_head`, **release** lock +3. Traverse fanout list, incrementing each consumer's `fanin_refcount` +4. Mark `task_state[slot] = CONSUMED` when `fanout_refcount` reaches `fanout_count` + +This lock protocol guarantees every consumer is accounted for exactly once. + +### 7.4 Scope Mechanism (`PTO2_SCOPE`) + +Scopes control the lifetime of intermediate buffers. Each scope: +- Tracks tasks submitted within it via a flat `scope_tasks[]` buffer partitioned by `scope_begins[]` +- On `scope_end`: increments `fanout_refcount` for scope tasks; when it reaches `fanout_count`, the task's packed buffer can be reclaimed + +```cpp +PTO2_SCOPE(rt) { + // Tasks submitted here belong to this scope + pto2_rt_submit_aic_task(FUNC_QK, args); + pto2_rt_submit_aiv_task(FUNC_SF, args); +} +// scope_end: scope reference released from all tasks above +``` + +--- + +## 8. Scheduler + +### 8.1 Thread Model + +With `aicpu_thread_num=4`, the AICPU runs 4 threads: + +| Thread | Role | Cores | +|--------|------|-------| +| 0 | Scheduler | 6 AIC + ~13 AIV | +| 1 | Scheduler | 6 AIC + ~13 AIV | +| 2 | Scheduler | 6 AIC + ~13 AIV | +| 3 | Orchestrator | none | + +Core assignment: AICs and AIVs are divided equally among the 3 scheduler threads. 
+ +### 8.2 Scheduler Main Loop + +Each scheduler thread runs a tight loop with two main phases: + +**Phase 1 — Completion Handling**: +- Poll register `COND` on each managed core +- When `TASK_FIN_STATE` detected: record completion timestamps, call `on_subtask_complete(task_id, subslot)` to set the done bit; when `subtask_done_mask == active_mask`, trigger `on_mixed_task_complete(task_id)` which marks `task_state[slot] = COMPLETED`, acquires fanout lock, traverses fanout list (incrementing consumers' `fanin_refcount`), marks `task_state[slot] = CONSUMED`, and advances `last_task_alive` watermark + +**Phase 2 — Dispatch**: +- For each idle core: pop a task from the matching shape-based ready queue (lock-free MPMC Vyukov queue, one per resource shape) +- Build `PTO2DispatchPayload` from `TaskDescriptor` with `task_id`, `subslot`, `kernel_id`, and `core_type` +- Write task pointer to `Handshake.task`, signal AICore via register `DATA_MAIN_BASE` + +After these phases, the scheduler updates profiling headers and checks for termination (all tasks completed and orchestrator done). 
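The Phase 1 aggregation rule (`subtask_done_mask == active_mask`) can be sketched as an atomic bit-set. The slot-to-bit mapping below (bit0 = AIC, bit1 = AIV0, bit2 = AIV1) and the names are assumptions for illustration:

```cpp
#include <atomic>
#include <cstdint>

// Sketch of two-stage completion aggregation (hypothetical names).
// Assumed bit layout: bit0 = AIC, bit1 = AIV0, bit2 = AIV1.
struct MixedTaskSketch {
    uint32_t              active_mask{0};  // which subtask slots are in use
    std::atomic<uint32_t> done_mask{0};    // bits set as subslots finish
};

// Returns true exactly once, when the last active subslot finishes,
// i.e. when on_mixed_task_complete should fire.
bool subtask_complete(MixedTaskSketch& t, int subslot) {
    uint32_t bit  = 1u << subslot;
    uint32_t prev = t.done_mask.fetch_or(bit);
    return (prev & bit) == 0 && (prev | bit) == t.active_mask;
}
```

Because `fetch_or` returns the prior value, concurrent scheduler threads completing different subslots of the same mixed task agree on which one observed the final bit, so downstream release fires exactly once.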
+ +### 8.3 Ready Queue Design + +Ready queues use a lock-free bounded MPMC (Vyukov) design: + +- One `PTO2ReadyQueue` per resource shape (5 shapes: `AIC_ONLY`, `AIV_X1`, `AIV_X2`, `AIC_AIV_X1`, `AIC_AIV_X2`) +- **Push**: any thread (orchestrator via `init_task`, or scheduler on completion) pushes newly-ready tasks to the queue matching `pto2_active_mask_to_shape(task->active_mask)` +- **Pop**: scheduler threads pop from the queue matching the idle core's resource shape +- Per-slot sequence counters prevent ABA problems +- `enqueue_pos` and `dequeue_pos` are on separate cache lines to avoid false sharing + +### 8.4 Watermark Advancement (last_task_alive) + +After a task reaches state CONSUMED (4), the scheduler tries to advance `last_task_alive`: + +``` +while la < current_task_index: + if task_state[la & mask] < CONSUMED: break + reset fanin_refcount[la & mask] = 0 + CAS(last_task_alive, la, la+1) + advance heap_tail from task's packed_buffer_end + la++ +``` + +This is lock-free (CAS-based) and multiple scheduler threads can attempt it concurrently. The `heap_tail_gen` ticket counter serializes `heap_tail` writes to ensure tasks' buffer regions are freed in order. + +--- + +## 9. AICore Worker Interaction + +### 9.1 Handshake Protocol + +Each AICore worker has a `Handshake` struct in shared memory: + +| Field | Direction | Purpose | +|-------|-----------|---------| +| `task` | AICPU→AICore | Pointer to `PTO2DispatchPayload` | +| `control` | AICPU→AICore | 0=normal, 1=shutdown | +| `perf_records_addr` | AICPU→AICore | Performance buffer address | + +### 9.2 Register-Based Dispatch + +Instead of polling `Handshake.task_status`, the production protocol uses hardware registers. + +> **Multi-ring note**: `task_id` is 64-bit but registers are 32-bit. A per-core monotonic dispatch counter (`s_dispatch_seq`) replaces `task_id` in register writes to prevent collisions. See [MULTI_RING.md §6](MULTI_RING.md). 
+ +| Register | Direction | Usage | +|----------|-----------|-------| +| `DATA_MAIN_BASE` | AICPU→AICore | Write `task_id` to dispatch (idle=0x7FFFFFFD); `EXIT_SIGNAL` to shut down | +| `COND` | AICore→AICPU | `[bit31=state, bits30:0=task_id]`: ACK (state=0) or FIN (state=1) | + +**AICore execution loop**: +1. Poll `DATA_MAIN_BASE` for value != AICPU_IDLE_TASK_ID +2. Read payload from `Handshake.task` +3. Write ACK to `COND` +4. Execute kernel function via `func_id_to_addr` lookup +5. Write FIN to `COND` + +### 9.3 PTO2DispatchPayload + +Built by the scheduler from `PTO2TaskDescriptor`: + +| Field | Description | +|-------|-------------| +| `task_id` | Mixed-task identifier (for completion aggregation) | +| `subslot` | Which subtask slot this dispatch represents (`AIC`, `AIV0`, or `AIV1`) | +| `kernel_id` | Function ID for this subtask slot | +| `core_type` | AIC or AIV | +| `function_bin_addr` | GM address of compiled kernel binary | +| `num_args` | Number of arguments | +| `args[]` | Tensor addresses and scalar values | + +--- + +## 10. Kernel and Orchestration Loading + +### 10.1 Kernel Binary Loading + +1. **Host** compiles each kernel source (`.cpp`) into a binary (`.o` or `.so`) +2. `host_api.upload_kernel_binary(func_id, binary, size)` uploads to GM +3. The returned GM address is stored in `Runtime.func_id_to_addr_[func_id]` +4. When dispatching, the scheduler copies this address into `PTO2DispatchPayload.function_bin_addr` + +### 10.2 Orchestration SO Loading + +1. **Host** compiles the orchestration source into a shared library (`.so`) +2. The SO binary is embedded into `Runtime.device_orch_so_storage_[]` and copied to device +3. **AICPU Thread 3** writes the SO to a temp file, calls `dlopen` +4. `dlsym("aicpu_orchestration_config")` returns configuration (expected arg count) +5. `dlsym("aicpu_orchestration_entry")` returns the orchestration function pointer +6. Thread 3 creates a `PTO2Runtime`, calls the orchestration function within a `PTO2_SCOPE` +7. 
After orchestration completes: `dlclose`, delete temp file + +### 10.3 Thread Startup Synchronization + +| Flag | Set by | Waited by | Purpose | +|------|--------|-----------|---------| +| `runtime_init_ready_` | Thread 3 | Threads 0-2 | Runtime and SM handle initialized | +| `pto2_init_done_` | First init thread | Others | One-time memset of arrays started (exchange guard) | +| `pto2_init_complete_` | Init thread | Thread 3 + others | One-time init of per-task arrays done | + +Startup sequence: +1. Thread 3: create SM handle + runtime → set `runtime_init_ready_` +2. Scheduler threads: wait for `runtime_init_ready_` → one thread wins `pto2_init_done_` exchange → memset per-task arrays → set `pto2_init_complete_`; other threads wait for `pto2_init_complete_` +3. Thread 3: wait for `pto2_init_complete_` → configure orchestrator-scheduler pointers +4. Scheduler threads: enter main loop +5. Thread 3: call orchestration function → set `orchestrator_done_` + +--- + +## 11. PTO2 Orchestration API + +The orchestration API is defined in `pto_orchestration_api.h`. Orchestration code depends only on this header. + +### 11.1 Core API + +| Function/Macro | Purpose | +|----------------|---------| +| `pto2_rt_submit_task(mixed_kernels, args)` | Submit a mixed task with `MixedKernels` struct | +| `pto2_rt_submit_aic_task(kernel_id, args)` | Convenience: submit AIC-only task | +| `pto2_rt_submit_aiv_task(kernel_id, args)` | Convenience: submit AIV-only task | +| `PTO2_SCOPE() { ... 
}` | RAII scope for buffer lifetime | +| `pto2_rt_orchestration_done()` | Signal orchestration complete | + +### 11.2 Parameter Construction + +| Function | Description | +|----------|-------------| +| `make_tensor_external(ptr, shapes, ndim, dtype)` | Wrap an existing device pointer as a tensor | +| `TensorCreateInfo(shapes, ndim, dtype)` | Describe a runtime-created output buffer | +| `Arg::add_input(tensor)` | INPUT parameter — read by the task | +| `Arg::add_output(create_info)` | OUTPUT parameter — runtime allocates and returns a Tensor | +| `Arg::add_inout(tensor)` | INOUT parameter — existing tensor read then written | +| `Arg::add_scalar(value)` | 64-bit scalar parameter | + +### 11.3 Resource Shapes + +Tasks are queued by resource shape, which is derived from the `active_mask` in the `MixedKernels` struct: + +| Shape | Active Mask | Description | +|-------|-------------|-------------| +| `AIC_ONLY` | AIC only | AIC cores (matrix multiplication) | +| `AIV_X1` | AIV0 or AIV1 only | Single AIV core (vector operations) | +| `AIV_X2` | AIV0 + AIV1 | Two AIV cores | +| `AIC_AIV_X1` | AIC + one AIV | AIC + single AIV core | +| `AIC_AIV_X2` | AIC + AIV0 + AIV1 | Full cluster (AIC + two AIV cores) | + +### 11.4 Orchestration Export Interface + +Each orchestration `.so` must export: + +```cpp +extern "C" PTO2OrchestrationConfig aicpu_orchestration_config(uint64_t* args, int arg_count); +extern "C" void aicpu_orchestration_entry(uint64_t* args, int arg_count, int orch_thread_num, int orch_thread_index); +``` + +--- + +## 12. 
Example: Batch Paged Attention + +### 12.1 Kernel Configuration (`kernel_config.py`) + +```python +KERNELS = [ + {"func_id": 0, "name": "QK", "source": "aic/aic_qk_matmul.cpp", "core_type": "aic"}, + {"func_id": 1, "name": "SF", "source": "aiv/aiv_softmax_prepare.cpp", "core_type": "aiv"}, + {"func_id": 2, "name": "PV", "source": "aic/aic_pv_matmul.cpp", "core_type": "aic"}, + {"func_id": 3, "name": "UP", "source": "aiv/aiv_online_update.cpp", "core_type": "aiv"}, + {"func_id": 5, "name": "AIV_HUB", "source": "aiv/aiv_hub.cpp", "core_type": "aiv"}, +] + +ORCHESTRATION = { + "source": "orchestration/paged_attention_orch.cpp", + "function_name": "aicpu_orchestration_entry", +} + +RUNTIME_CONFIG = { + "runtime": "tensormap_and_ringbuffer", + "aicpu_thread_num": 4, + "block_dim": 24, +} +``` + +### 12.2 Orchestration Structure + +```cpp +void aicpu_orchestration_entry(uint64_t* args, int arg_count, int orch_thread_num, int orch_thread_index) { + // Unpack args: query, key_cache, value_cache, block_table, context_lens, out, config + for (q_idx = 0; q_idx < q_loop; q_idx++) { + for (batch_start = 0; batch_start < batch; batch_start += IN_CORE_BATCH) { + PTO2_SCOPE() { + // Describe accumulator tensors (oi, li, mi) with TensorCreateInfo + // Submit AIV_HUB to initialize accumulators + for (bn = 0; bn < max_bn; bn++) { + // Allocate intermediate tensors (sij, pij, mij, lij, oi_new) + // Submit QK (CUBE) → SF (VECTOR) → PV (CUBE) → UP (VECTOR) + } + } + } + } +} +``` diff --git a/src/a2a3/runtime/tensormap_and_ringbuffer_unmodified/docs/SCALAR_DATA_ACCESS.md b/src/a2a3/runtime/tensormap_and_ringbuffer_unmodified/docs/SCALAR_DATA_ACCESS.md new file mode 100644 index 000000000..98a4893ea --- /dev/null +++ b/src/a2a3/runtime/tensormap_and_ringbuffer_unmodified/docs/SCALAR_DATA_ACCESS.md @@ -0,0 +1,137 @@ +# Scalar Data Access — get/set_tensor_data Design + +## 1. 
Overview
+
+During task graph construction, orchestration sometimes needs to read InCore kernel results (for control-flow decisions) or write initial values into tensors. `get_tensor_data` / `set_tensor_data` provide **blocking** cross-layer data access, allowing orchestration to safely read and write tensor data.
+
+**Core design principle**: Reuse the existing TensorMap dependency tracking mechanism — no new synchronization infrastructure.
+
+## 2. API
+
+```cpp
+// Blocking read: returns value at the given indices (default: raw uint64_t bits)
+// Specify T for typed read: float val = get_tensor_data<float>(tensor, 1, idx);
+template <typename T = uint64_t>
+T get_tensor_data(const Tensor& tensor, uint32_t ndims, const uint32_t indices[]);
+
+// Blocking write: stores value at the given indices (type deduced from argument)
+// Typed write: set_tensor_data(tensor, 1, idx, 42.0f);
+template <typename T>
+void set_tensor_data(Tensor& tensor, uint32_t ndims, const uint32_t indices[], T value);
+```
+
+Both call into the runtime through the ops table — orchestration .so needs no runtime symbol linkage.
+
+## 3. Blocking Interface Design
+
+### 3.1 get_tensor_data Flow
+
+```text
+addr null-check → TensorMap lookup → spin-wait producer COMPLETED → compute flat offset → memcpy read
+```
+
+- **addr null-check**: `buffer.addr == 0` means unallocated — log error, return 0
+- **TensorMap lookup**: find producer task by `buffer.addr`
+- **spin-wait**: wait until producer `task_state >= PTO2_TASK_COMPLETED`
+- **No producer** (`lookup_result.count == 0`): skip waiting, read immediately
+
+### 3.2 set_tensor_data Flow
+
+```text
+addr null-check → TensorMap lookup → spin-wait producer COMPLETED → spin-wait consumers done → memcpy write
+```
+
+One extra step versus get_tensor_data: wait for all consumers to finish (`fanout_refcount >= fanout_count - 1`, excluding the scope reference). 
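The "compute flat offset" step in §3.1 can be sketched as Horner-style index folding. The helper name is hypothetical, and row-major layout is an assumption here:

```cpp
#include <cstdint>

// Row-major flat element index from multi-dimensional indices (sketch).
// Multiply the result by the element size to get the byte offset.
uint64_t flat_offset(const uint32_t shapes[], const uint32_t indices[],
                     uint32_t ndims) {
    uint64_t off = 0;
    for (uint32_t d = 0; d < ndims; ++d) {
        off = off * shapes[d] + indices[d];  // Horner-style accumulation
    }
    return off;
}
```

For a `{2, 3}` tensor, indices `{1, 2}` fold to element 5, the last element of the buffer.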
+
+### 3.3 Timeout
+
+- Uses cycle counter (`get_sys_cnt_aicpu()`), checked every 1024 spins
+- Threshold: `PTO2_TENSOR_DATA_TIMEOUT_CYCLES` (~10 s at 1.5 GHz)
+- On timeout: sets `orch.fatal = true`, preventing further task submission
+
+## 4. add_output with Initial Value
+
+```cpp
+TensorCreateInfo ci(shapes, ndims, dtype);
+ci.set_initial_value(initial_value);
+args.add_output(ci);
+```
+
+**Mechanism**:
+
+1. `ci.set_initial_value(value)` marks the create-info with an initial value before submission
+2. `add_output(ci)` stores a pointer to `ci` in `Arg` (the original must remain valid until submit)
+3. During payload init, the output tensor is materialized via `init_from_create_info()` which triggers the fill
+4. Fill strategy:
+   - Small buffer (< 64 B): element-by-element memcpy directly into dst
+   - Large buffer (≥ 64 B): fill the first 64 bytes as a template block, then bulk-memcpy in 64 B chunks; partial tail copy for remainder
+
+**Constraint**: existing tensors are write targets only through `add_inout()`.
+
+## 5. Scalar Dependencies via 1-Element Tensors
+
+Traditional scalars (`Arg::add_scalar`) are one-way inputs with no TensorMap tracking. For cross-task scalar values, use a 1-element tensor as the carrier:
+
+```cpp
+uint32_t shapes[1] = {1};
+TensorCreateInfo scalar_ci(shapes, 1, DataType::FLOAT32);
+
+// Submit with initial value and keep the returned tensor
+scalar_ci.set_initial_value(float_to_u64(77.0f));
+Arg args;
+args.add_output(scalar_ci);
+TaskOutputTensors outs = pto2_rt_submit_aiv_task(FUNC_NOOP, args);
+const Tensor& scalar_tensor = outs.get_ref(0);
+
+// Orchestration-side blocking typed read (waits for kernel completion)
+uint32_t idx[1] = {0};
+float val = get_tensor_data<float>(scalar_tensor, 1, idx);
+```
+
+**Advantage**: Fully reuses existing TensorMap (producer tracking, fanin/fanout dependencies) — no new infrastructure needed.
+
+## 6. 
Data Hazard Analysis + +Three actors: + +- **Kernel**: InCore task submitted via add_input/add_output/add_inout (asynchronous execution) +- **Orch Read**: orchestration calls `get_tensor_data` (blocking read) +- **Orch Write**: orchestration calls `set_tensor_data` (blocking write) + +### Hazard Matrix (earlier operation → later operation) + +| # | Earlier Op | Later Op | Hazard | Guarantee | Safe? | +| - | ---------- | -------- | ------ | --------- | ----- | +| 1 | Kernel write (OUTPUT) | Orch Read | RAW | spin-wait producer COMPLETED | Yes | +| 2 | Kernel write (OUTPUT) | Orch Write | WAW | spin-wait producer COMPLETED | Yes | +| 3 | Kernel read (INPUT) | Orch Write | WAR | spin-wait fanout_refcount | **Needs INOUT** | +| 4 | Kernel read-write (INOUT) | Orch Read | RAW | spin-wait producer COMPLETED | Yes | +| 5 | Kernel read-write (INOUT) | Orch Write | WAW+WAR | spin-wait producer + consumers | Yes | +| 6 | Orch Write | Kernel read (INPUT) | RAW | blocking completes before next submit | Yes | +| 7 | Orch Write | Kernel write (OUTPUT) | WAW | same — serial guarantee | Yes | +| 8 | Orch Read | Kernel write (OUTPUT) | WAR | same — serial guarantee | Yes | +| 9–12 | Orch ↔ Orch | — | — | same-thread serial execution | Yes | + +### Key Design Points + +**Scenario #3 is the only case requiring special attention**: + +TensorMap tracks only producers (OUTPUT/INOUT), not pure INPUT consumers. If a tensor is only registered via `add_input()`, TensorMap has no producer entry for it. `set_tensor_data`'s `wait_for_tensor_ready()` sees `lookup_result.count == 0` and returns immediately — but the kernel may still be reading → **WAR data race**. + +**Solution**: For tensors that may later be written via `set_tensor_data`, use `add_inout()` instead of `add_input()`. INOUT registers a producer entry in TensorMap, enabling `set_tensor_data` to track all consumers through `fanout_refcount`. 
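The producer-only lookup behind scenario #3 can be sketched with a plain map. This is a deliberate simplification (the real TensorMap does overlap search, and the names are hypothetical), but it shows why an INPUT-only tensor slips past `set_tensor_data`'s wait:

```cpp
#include <cstdint>
#include <unordered_map>

// Sketch: TensorMap records producers (OUTPUT/INOUT) only.
std::unordered_map<uint64_t, int> tensor_map;  // buffer addr -> producer task id

// add_output / add_inout register a producer entry.
void submit_as_output(uint64_t addr, int task_id) { tensor_map[addr] = task_id; }

// add_input performs a lookup but never inserts.
void submit_as_input(uint64_t addr) { (void)addr; }

// set_tensor_data's gate: no entry means "no producer", so the
// consumer-drain wait is skipped — the WAR window of scenario #3.
bool lookup_has_producer(uint64_t addr) { return tensor_map.count(addr) != 0; }
```

Registering the tensor via `add_inout()` instead makes `lookup_has_producer` succeed, so the write path can spin on `fanout_refcount` until all readers have finished.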
+ +**Scenarios #6–8 serial guarantee**: + +get/set_tensor_data are blocking calls, and orchestration is single-threaded serial submission. After a blocking operation completes, subsequent code (including task submissions) executes strictly afterward. + +## 7. External Tensor Behavior + +`make_tensor_external()` creates tensors with a pre-set `buffer.addr` (pointing to host-allocated device memory). + +| Scenario | Behavior | +| -------- | -------- | +| External tensor never submitted as OUTPUT/INOUT | No TensorMap entry — get/set execute immediately | +| External tensor previously submitted as OUTPUT/INOUT | TensorMap has producer entry — get/set spin-wait | +| External tensor submitted as INPUT, then set_tensor_data | **WAR risk** — must use INOUT instead (same as scenario #3) | + +**Key rule**: If an external tensor will later be written via `set_tensor_data`, all prior kernel accesses must use `add_inout()`, not `add_input()`. diff --git a/src/a2a3/runtime/tensormap_and_ringbuffer_unmodified/docs/SUBMIT_BY_CLUSTER.md b/src/a2a3/runtime/tensormap_and_ringbuffer_unmodified/docs/SUBMIT_BY_CLUSTER.md new file mode 100644 index 000000000..54e0f4196 --- /dev/null +++ b/src/a2a3/runtime/tensormap_and_ringbuffer_unmodified/docs/SUBMIT_BY_CLUSTER.md @@ -0,0 +1,226 @@ +# Submit by Cluster - Requirements and Main-Branch-Aligned Design + +## 1. Goal + +Define a single, main-branch-aligned specification for PTO2 cluster submission that combines: + +1. Product requirements (what must be true). +2. Runtime design (how it is implemented on current main baseline). + +The target model is: one submitted graph node is one `MixedTask`, and dispatch/completion is mixed-task-granular. + +## 2. Background and Motivation + +Future Ascend hardware is expected to provide stronger locality within an AICore cluster (`1 AIC + 2 AIV`). +The runtime therefore needs a "submit together, run together" model for related AIC/AIV kernels. 
+ +Legacy per-task submit (`kernel_id + worker_type`) cannot express atomic co-dispatch of multiple kernels to one cluster. + +## 3. Scope + +### In Scope + +1. New orchestration-facing submit API for cluster-aware mixed submission. +2. Runtime/backend scheduler and executor changes to treat a mixed submit as one atomic scheduling unit. +3. Dependency gating, readiness, dispatch, completion, and reclamation at mixed-task granularity. +4. AIV slot equivalence (`AIV0` and `AIV1` are equivalent execution targets). + +### Out of Scope + +1. User-facing cluster pinning (`allocate_cluster/free_cluster`-style APIs). +2. New worker types beyond AIC/AIV. +3. Cross-cluster user placement policies. +4. Hardware topology changes beyond `1 AIC + 2 AIV` per cluster. + +## 4. Main-Branch Baseline Constraints + +Design must preserve the current main runtime architecture: + +1. Multi-orchestrator runtime wiring (`orchestrators[]`, `orch_count`, thread-local `pto2_current_orch_idx`). +2. Executor threading split (orchestrator threads vs scheduler threads), and post-orchestrator transition (`transition_requested_` + `reassign_cores_for_all_threads()`). +3. Shared-memory hot/cold split (`PTO2TaskDescriptor` hot + `PTO2TaskPayload` cold). + +## 5. Terminology + +1. `cluster`: one physical unit with `1 AIC + 2 AIV`. +2. `MixedKernels`: 3 submit slots (`AIC`, `AIV0`, `AIV1`) with `INVALID_KERNEL_ID` for inactive slots. +3. `MixedTask`: one runtime graph node created by one submit call. +4. `active_mask`: bitmask of active subtask slots. +5. `resource shape`: normalized lane demand class of a mixed task. + +## 6. 
API Contract + +```cpp +inline constexpr int32_t INVALID_KERNEL_ID = -1; + +struct MixedKernels { + int32_t aic_kernel_id{INVALID_KERNEL_ID}; + int32_t aiv0_kernel_id{INVALID_KERNEL_ID}; + int32_t aiv1_kernel_id{INVALID_KERNEL_ID}; +}; + +static inline void pto2_rt_submit_task(PTO2Runtime* rt, + const MixedKernels& mixed_kernels, + Arg* args, + int32_t num_args); + +static inline void pto2_rt_submit_aic_task(PTO2Runtime* rt, + int32_t kernel_id, + Arg* args, + int32_t num_args); + +static inline void pto2_rt_submit_aiv_task(PTO2Runtime* rt, + int32_t kernel_id, + Arg* args, + int32_t num_args); +``` + +Rules: + +1. One submit call creates one `MixedTask`. +2. All active slots share the same `args` and `num_args`. +3. At least one slot must be active. +4. `aiv0_kernel_id` and `aiv1_kernel_id` are semantically equivalent. +5. Wrappers are orchestration sugar only (inline in orchestration API); no dedicated runtime ops entries. +6. Submit-contract types are defined once in a shared header-only submit-types surface consumed by orchestration and runtime headers. +7. Invalid submits follow existing PTO2 behavior (`always_assert`), not a new recoverable return-code API. + +## 7. Data Model (Requirements + Design) + +`PTO2TaskDescriptor` (hot path) carries mixed-task identity/state: + +1. `task_id` +2. `active_mask` +3. `subtask_done_mask` +4. `kernel_id[3]` for `(AIC, AIV0, AIV1)` +5. dependency heads/counters and packed-buffer metadata + +`PTO2TaskPayload` (cold path) carries: + +1. shared args/tensors/scalars copied once per mixed submit +2. fanin mixed-task IDs +3. other cold-path submit metadata + +Producer identity in TensorMap is mixed-task ID end-to-end. + +## 8. Scheduling Model + +### 8.1 Resource Shapes + +Runtime uses shape-based ready queues (not worker-type queues): + +1. `AIC_ONLY` +2. `AIV_X1` +3. `AIV_X2` +4. `AIC_AIV_X1` +5. `AIC_AIV_X2` + +Queueing key is normalized resource shape (not raw slot label). + +### 8.2 Atomic Cluster Dispatch + +1. 
Dispatch decision unit is one mixed task. +2. For multi-slot mixed tasks, partial launch is forbidden. +3. A mixed task is dispatchable only when one local owned cluster can satisfy all required lanes. +4. Compatible mixed tasks may co-reside over time if they use disjoint free lanes. + +### 8.3 Dependency and Completion + +1. Fanin release/readiness remains dependency-correct and graph-level. +2. Two-stage completion: + - `on_subtask_complete(task_id, subslot)` + - `on_mixed_task_complete(task_id)` only when `subtask_done_mask == active_mask` +3. Downstream release is triggered once per mixed task completion, not once per subslot. + +## 9. Executor Ownership and Numbering + +### 9.1 Canonical Flattened Numbering (Unchanged) + +Given `block_dim` clusters: + +1. AIC IDs: `[0, block_dim)` +2. AIV IDs: `[block_dim, 3 * block_dim)` +3. Cluster `i`: `{i, block_dim + i, 2 * block_dim + i}` + +This project-defined flattened numbering is kept unchanged. + +### 9.2 Cluster Ownership + +1. One cluster must be owned by one scheduler domain/thread at a time. +2. No split-cluster ownership in either: + - initial `assign_cores_to_threads()` + - post-orchestrator `reassign_cores_for_all_threads()` +3. Lane occupancy bookkeeping must remain consistent with ownership after reassignment. + +## 10. Functional Requirements + +### 10.1 Valid Mixed Shapes + +1. AIC only +2. AIV only (1 or 2 AIV lanes) +3. AIC + 1 AIV +4. AIC + 2 AIV + +### 10.2 Runtime Behavior per Submit + +1. Validate submit arguments. +2. Allocate mixed-task ID and initialize descriptor/payload once. +3. Build fanin/fanout at mixed-task granularity. +4. Enqueue by shape when ready. +5. Dispatch all active lanes atomically when resources allow. +6. Aggregate completion and release downstream once. + +## 11. Non-Functional Requirements + +1. Correctness: no dependency violation, no partial mixed-task dispatch. +2. 
Determinism: dependency-correct ordering preserved; AIV lane choice may vary but remains semantically equivalent. +3. Fairness: resource-aware polling heuristic is allowed; strict starvation-free guarantee across all shapes is not required. +4. Performance: no obvious regression for non-cluster workflows. +5. Observability: lifecycle visibility for submit/ready/dispatch/block/complete. + +## 12. Acceptance Criteria + +Feature is accepted when: + +1. Orchestration compiles and submits via `MixedKernels` API/wrappers. +2. Scheduler dispatches each mixed task as one cluster scheduling decision. +3. Dependencies gate mixed-task readiness correctly. +4. AIV execution remains cluster-local and semantically equivalent across lanes. +5. Existing non-cluster workflows continue to pass without behavior regression. +6. Cluster ownership is never split across scheduler domains before/after transition. + +## 13. Verification Matrix + +Recommended validation coverage: + +1. Mapping correctness for cluster-to-core ID relation. +2. Atomic dispatch for multi-slot shapes. +3. Dependency gating and completion aggregation (`done_mask == active_mask`). +4. Lane-occupancy co-residency behavior for compatible shapes. +5. Multi-orchestrator and core-transition ownership stability. +6. Invalid submit handling (`always_assert` path). +7. Regression coverage for existing examples/tests. + +Milestone command (device): + +```bash +python examples/scripts/run_example.py \ + -k tests/st/tensormap_and_ringbuffer/batch_paged_attention/kernels \ + -g tests/st/tensormap_and_ringbuffer/batch_paged_attention/golden.py \ + -p a2a3 -d 9 +``` + +Final validation: + +```bash +./ci.sh +``` + +## 14. Resolved Decisions + +1. Legacy orchestration-facing single-task submit is replaced by mixed submit contract. +2. Invalid mixed submits fail with existing submit-time assert behavior. +3. Per-cluster concurrent capacity is lane-occupancy-driven, not a fixed constant. +4. 
Submit-contract types live in one shared header-only surface.
+5. Resource-aware dispatch heuristics are allowed without a strict starvation-free guarantee.
+
diff --git a/src/a2a3/runtime/tensormap_and_ringbuffer_unmodified/docs/device_log_profiling.md b/src/a2a3/runtime/tensormap_and_ringbuffer_unmodified/docs/device_log_profiling.md
new file mode 100644
index 000000000..010e6c682
--- /dev/null
+++ b/src/a2a3/runtime/tensormap_and_ringbuffer_unmodified/docs/device_log_profiling.md
@@ -0,0 +1,167 @@
+# PTO2 Device Log Profiling Guide
+
+## How to Find Device Logs
+
+AICPU logs (via `DEV_ALWAYS`) are written by CANN's **dlog** subsystem and do **not** appear in the `run_example.py` terminal output. They are written to CANN's device log directory:
+
+```text
+$HOME/ascend/log/debug/device-<id>/device-<id>_<timestamp>.log
+```
+
+Each run produces a new log file (or appends to an existing one). Find the most recent file by modification time:
+
+```bash
+ls -lt $HOME/ascend/log/debug/device-<id>/ | head -5
+```
+
+## Log Structure Overview
+
+A single run produces two profiling blocks in the device log:
+
+| Block | Emitted by | Function | Content |
+| ----- | ---------- | -------- | ------- |
+| **Orchestrator Profiling** | Thread 3 (orchestrator) | `aicpu_orchestration_entry` | Time breakdown of graph construction on device |
+| **PTO2 Scheduler Summary** | Threads 0/1/2 (schedulers) | `resolve_and_dispatch_pto2` | Per-thread scheduling statistics, phase timing, and lock contention |
+
+All timing values are in microseconds (us), converted from AICPU cycle counters.
+
+---
+
+## Block 1: Orchestrator Profiling
+
+Thread 3 loads the orchestration `.so` via `dlopen`, calls `aicpu_orchestration_entry`, and prints a profiling summary after it returns. 
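The microsecond values in these blocks are converted from raw cycle counts. A sketch of that conversion, assuming a hypothetical counter frequency (the runtime uses its own calibrated constant):

```cpp
#include <cstdint>

// Convert a cycle-counter delta to microseconds (sketch).
// counter_freq_hz is an assumption for illustration, e.g. 1.5e9 for 1.5 GHz.
double cycles_to_us(uint64_t cycles, double counter_freq_hz) {
    return static_cast<double>(cycles) / counter_freq_hz * 1e6;
}
```

At an assumed 1.5 GHz counter, 1,500,000 cycles convert to 1000 us, which is the scale of the per-phase figures shown below.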
+ +### Example (from a real run: batch=64, 16704 tasks) + +```text +Thread 3: Calling aicpu_orchestration_entry from SO +aicpu_orchestration_entry ">>>>>> batch = 64" +Thread 3: aicpu_orchestration_entry returned, cost 20943.940us +Thread 3: === Orchestrator Profiling: 16704 tasks, total=14601.580us === +Thread 3: sync_tensormap : 286.300us (2.0%) +Thread 3: task_ring_alloc: 380.400us (2.6%) +Thread 3: param_copy : 2147.800us (14.7%) +Thread 3: lookup+dep : 7290.300us (49.9%) +Thread 3: heap_alloc : 701.500us (4.8%) +Thread 3: tensormap_ins : 1890.380us (12.9%) +Thread 3: fanin+ready : 1207.400us (8.3%) +Thread 3: finalize+SM : 697.500us (4.8%) +Thread 3: scope_end : 364.080us +Thread 3: avg/task : 0.874us +Thread 3: PTO2 total submitted tasks = 16704 +``` + +### Field Reference + +| Field | Source (`pto_orchestrator.cpp`) | Description | +| ----- | ------------------------------- | ----------- | +| **cost** | Wall-clock around `orch_func()` call | Total time including orchestration logic + scope overhead | +| **total** | Sum of all sub-steps below | Accumulated time inside `pto2_submit_task` across all tasks | +| **sync_tensormap** | `g_orch_sync_cycle` | TensorMap validity sync and optional cleanup before each submission | +| **task_ring_alloc** | `g_orch_alloc_cycle` | Allocating a task slot from the task ring buffer | +| **param_copy** | `g_orch_args_cycle` | Copying param descriptors + tensor descriptor copies into task-owned storage | +| **lookup+dep** | `g_orch_lookup_cycle` | TensorMap lookup for inputs/inouts + building fanin/fanout dependency edges | +| **heap_alloc** | `g_orch_heap_cycle` | Allocating packed output buffers from the heap ring | +| **tensormap_ins** | `g_orch_insert_cycle` | Inserting output/inout tensors into the TensorMap | +| **fanin+ready** | `g_orch_fanin_cycle` | Building the fanin list + checking if task is already ready (Step 5/5b) | +| **scope_end** | `g_orch_scope_end_cycle` | `pto2_scope_end` overhead (notifying scheduler of 
scope completion) | +| **avg/task** | `total / submit_count` | Average orchestrator time per task submission | + +### Interpreting the Numbers + +- **cost > total**: The difference is overhead outside `pto2_submit_task` (the orchestration user code itself, scope_begin/end, TensorCreateInfo construction, etc.). +- **lookup+dep** is typically the dominant cost (~50%) because it involves TensorMap hash lookups and building dependency edges with spinlock-protected fanout list insertions. +- **param_copy** scales with the number of parameters per task. +- **avg/task < 1us** indicates efficient graph construction. + +--- + +## Block 2: PTO2 Scheduler Summary + +Each of the 3 scheduler threads (Thread 0, 1, 2) prints its own summary after completing all tasks. The output has two sub-sections: **summary** and **phase breakdown**. + +### Example (Thread 0, from a different run: batch=1, 1044 tasks) + +```text +Thread 0: completed=352 tasks in 3477.420us (147 loops, 2.4 tasks/loop) +Thread 0: --- Phase Breakdown --- +Thread 0: complete: 1485.020us (42.7%) [fanout: edges=432, max_degree=2, avg=1.2] [fanin: edges=320, max_degree=3, avg=0.9] +Thread 0: scan: 14.400us (0.4%) +Thread 0: dispatch: 1973.060us (56.7%) [pop: hit=352, miss=3043, hit_rate=10.4%] +Thread 0: idle: 4.940us (0.1%) +``` + +### Summary Line + +```text +Thread N: completed=X tasks in Yus (Z loops, W tasks/loop) +``` + +| Field | Description | +| ----- | ----------- | +| **completed** | Number of tasks this thread processed to completion | +| **Y us** | Total scheduler loop time (sum of all phase cycles) | +| **Z loops** | Number of scheduler loop iterations | +| **W tasks/loop** | Average tasks completed per loop iteration; higher = better throughput | + +### Phase Breakdown + +The scheduler loop runs four phases each iteration. Each phase's time is accumulated across all loop iterations. 
+ +| Phase | What it does | Inline stats | +| ----- | ------------ | ------------ | +| **complete** | Polls handshake on each managed core; when a core completes, calls `on_subtask_complete(task_id, subslot)` to set the done bit; when `subtask_done_mask == active_mask`, triggers `on_mixed_task_complete` which traverses fanout list (notify consumers) and fanin list (release producers) | `fanout`: edges/max_degree/avg for consumer notification; `fanin`: edges/max_degree/avg for producer release | +| **scan** | Updates the perf profiling header with latest scheduler state | — | +| **dispatch** | For each idle core, pops a task from the shape-based ready queue via `get_ready_task(shape)`, builds the dispatch payload, and writes the task to the core's handshake register | `pop`: `hit` = successful pops (task dispatched), `miss` = empty queue pops, `hit_rate` = hit/(hit+miss) | +| **idle** | Scheduler loop iteration where no progress was made (no completions, no dispatches) | — | + +**Interpreting phase percentages:** + +- **dispatch** is typically the largest (~55-60%) because it includes ready-queue pops (with spinlock), payload construction, and cache flush (`dc cvac` + `dsb sy`). +- **complete** is the second largest (~40-45%) because it traverses both fanout (CAS-based fanin decrement, conditional ready-queue push) and fanin (release_producer, check_consumed, ring pointer advancement). +- **scan** is small (<1%) — only updates the perf header. +- **idle** is negligible when tasks are flowing; high idle% indicates the scheduler is starved. + +**Interpreting pop hit_rate:** + +- **High hit_rate (>50%)**: Ready queue is well-supplied; dispatch is efficient. +- **Low hit_rate (<10%)**: Ready queue is mostly empty when cores become idle. The bottleneck is upstream (orchestrator submission speed or fanout resolution latency), not dispatch itself. 
+ +### Per-Task Averages + +Divide each thread's phase times by its `completed` count to get per-task scheduling cost: + +| Metric | Formula | Typical value | +| ------ | ------- | ------------- | +| Scheduling overhead per task | total_time / completed | ~5-10 us/task | +| Dispatch per task | dispatch_time / completed | ~3-6 us/task | +| Complete per task | complete_time / completed | ~2-4 us/task | + +--- + +## Cross-Referencing with Host Profiling + +When `--enable-profiling` is used, the host terminal prints a **Task Statistics by Function** table with `Total_Exec` (total AICore kernel execution time). Combined with device log data: + +| Metric | Source | Description | +| ------ | ------ | ----------- | +| Avg kernel exec time | `Total_Exec / total_tasks` (host) | Time AICore spends executing each kernel | +| Avg scheduling overhead | `sum(thread_total) / total_tasks` (device log) | Time AICPU spends scheduling each task | +| Sched/Exec ratio | scheduling / execution | Scheduling overhead relative to kernel execution | + +A high sched/exec ratio (e.g., >3x) indicates that scheduling overhead dominates, and optimizations should target the scheduler's dispatch hot path (cache flush, payload construction) or upstream task flow. 
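
The divisions described above are simple enough to sketch end to end. The scheduler numbers below come from the Thread 0 example earlier in this document; `total_exec_us` is a made-up stand-in for the host-side `Total_Exec` value:

```python
# Per-task averages and sched/exec ratio, as described above.
completed = 352            # tasks completed by one scheduler thread (example run)
total_time_us = 3477.42    # that thread's total scheduler loop time
dispatch_us = 1973.06
complete_us = 1485.02

sched_per_task = total_time_us / completed      # scheduling overhead per task
dispatch_per_task = dispatch_us / completed
complete_per_task = complete_us / completed

# Cross-reference with host profiling: Total_Exec / total_tasks.
# total_exec_us is hypothetical here, standing in for the host table value.
total_exec_us = 2000.0
exec_per_task = total_exec_us / completed
ratio = sched_per_task / exec_per_task

print(f"sched/task={sched_per_task:.2f}us exec/task={exec_per_task:.2f}us ratio={ratio:.2f}x")
```

With these numbers the ratio stays well under the >3x threshold flagged above, so dispatch-path optimization would not be the first place to look.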
+
+---
+
+## Quick Reference: Extracting Profiling Data
+
+```bash
+# Find the latest device log for device 2
+LOG=$(ls -t $HOME/ascend/log/debug/device-2/device-*.log | head -1)
+
+# Extract orchestrator profiling (Thread 3)
+grep "Thread 3:" "$LOG"
+
+# Extract scheduler profiling (Threads 0/1/2)
+grep -E "Thread [012]:" "$LOG"
+```
diff --git a/src/a2a3/runtime/tensormap_and_ringbuffer_unmodified/docs/profiling_levels.md b/src/a2a3/runtime/tensormap_and_ringbuffer_unmodified/docs/profiling_levels.md
new file mode 100644
index 000000000..0578b5327
--- /dev/null
+++ b/src/a2a3/runtime/tensormap_and_ringbuffer_unmodified/docs/profiling_levels.md
@@ -0,0 +1,355 @@
+# PTO Runtime2 Profiling Levels
+
+This document describes the profiling macro hierarchy and logging control in the PTO Runtime2 system.
+
+## Overview
+
+PTO Runtime2 uses a hierarchical profiling system with compile-time macros to control profiling code compilation and log output. The `enable_profiling` runtime flag controls data collection (performance buffers, shared memory writes) but does NOT control log output.
+ +## Profiling Macro Hierarchy + +``` +PTO2_PROFILING (base level, default=1) +├── PTO2_ORCH_PROFILING (orchestrator, default=0, requires PTO2_PROFILING=1) +| └──PTO2_TENSORMAP_PROFILING (tensormap, default=0, requires PTO2_ORCH_PROFILING=1) +├── PTO2_SCHED_PROFILING (scheduler, default=0, requires PTO2_PROFILING=1) +└── --enable-profiling (Dump profiling merged swimlane json file for visualization, requires PTO2_PROFILING=1) + +``` + +### Compile-Time Validation + +Each sub-level macro requires `PTO2_PROFILING=1`: + +```cpp +#if PTO2_ORCH_PROFILING && !PTO2_PROFILING +#error "PTO2_ORCH_PROFILING requires PTO2_PROFILING=1" +#endif + +#if PTO2_SCHED_PROFILING && !PTO2_PROFILING +#error "PTO2_SCHED_PROFILING requires PTO2_PROFILING=1" +#endif + +#if PTO2_TENSORMAP_PROFILING && !PTO2_ORCH_PROFILING +#error "PTO2_TENSORMAP_PROFILING requires PTO2_ORCH_PROFILING=1" +#endif +``` + +## Profiling Levels + +### Level 0: No Profiling (PTO2_PROFILING=0) + +**What's compiled:** +- Debug/diagnostic logs (always present) +- Progress tracking (`PTO2 progress: completed=...`) +- Stall detection and dump (triggered only after `MAX_IDLE_ITERATIONS` idle loops) +- Deadlock/livelock detection (`diagnose_stuck_state`, called on stall) + +**What's NOT compiled:** +- All `CYCLE_COUNT_*` timing counters (`sched_*_cycle`, orchestrator cost counters) +- Scheduler/Orchestrator profiling summary logs guarded by `#if PTO2_PROFILING` +- Performance data collection paths (`enable_profiling` runtime flag becomes ineffective because profiling code is not compiled) + +**Log output (normal run, no stall):** +- No `sched_start/sched_end/sched_cost` timestamps +- No `orch_start/orch_end/orch_cost` timestamps +- No `Scheduler summary: total_time=...` +- No `PTO2 total submitted tasks` log +- `PTO2 progress: completed=... 
total=...` may appear (thread 0 only, at task completion milestones) + + +--- + +### Level 1: Basic Profiling (PTO2_PROFILING=1) + +**What's compiled:** +- Base timing counters for scheduler loop (`sched_complete/dispatch/idle/scan`) +- Per-thread orchestration timing (`orch_start`, `orch_end`, `orch_cost`) +- Stage-level orchestration end timestamp (`orch_stage_end`, printed by last orch thread only, marks the moment all orch threads have finished and core transition is about to be requested; only when `orch_to_sched_` is true) +- PTO2 total submitted tasks count (printed by last orch thread, after orch timing line) +- Scheduler summary output (`total_time`, `loops`, `tasks_scheduled`) +- Scheduler lifetime timestamps and cost (`sched_start`, `sched_end`, `sched_cost` — captured inside `resolve_and_dispatch_pto2()`, printed before Scheduler summary) + +**What's NOT compiled:** +- Detailed phase breakdowns +- TensorMap statistics + +**Log output (additional lines vs Level 0, per normal run):** +- `Thread %d: orch_start=%llu orch_end=%llu orch_cost=%.3fus` — each orch thread, after orchestration fully complete +- `PTO2 total submitted tasks = %d, already executed %d tasks` — last orch thread only (×1), after orch timing line +- `Thread %d: orch_stage_end=%llu` — last orch thread only (×1), only when `orch_to_sched_=true` +- `Thread %d: sched_start=%llu sched_end=%llu sched_cost=%.3fus` — each sched thread, printed before Scheduler summary +- `Thread %d: Scheduler summary: total_time=%.3fus, loops=%llu, tasks_scheduled=%d` — each sched thread +- `Thread %d: sched_start=%llu sched_end(timeout)=%llu sched_cost=%.3fus` — timeout path only (replaces normal `sched_end`) + +**DEV_ALWAYS count (normal run):** +- `orch_to_sched_=false` (default): `N_sched*2 + N_orch*1 + 1` (orch_timing + PTO2_total + sched_timing + Scheduler_summary) +- `orch_to_sched_=true` (`PTO2_ORCH_TO_SCHED=1`): adds 1 (`orch_stage_end`) + +> See the table at the end for concrete counts based on the 
`paged_attention` example. + +**Example log output — `orch_to_sched_=false`** (from `paged_attention`, device 10): +``` +Thread 2: orch_start=48214752948321 orch_end=48214752959379 orch_cost=230.000us +Thread 3: orch_start=48214752948316 orch_end=48214752961505 orch_cost=275.000us +PTO2 total submitted tasks = 13, already executed 13 tasks +Thread 1: sched_start=48214752948235 sched_end=48214752962379 sched_cost=295.000us +Thread 1: Scheduler summary: total_time=159.560us, loops=3782, tasks_scheduled=6 +Thread 0: sched_start=48214752948200 sched_end=48214752963571 sched_cost=320.000us +Thread 0: Scheduler summary: total_time=183.180us, loops=4611, tasks_scheduled=7 +``` + +**Example log output — `orch_to_sched_=true`** (`PTO2_ORCH_TO_SCHED=1`, from `paged_attention`, device 11): +``` +Thread 3: orch_stage_end=48236915058307 +Thread 3: orch_start=48236915044001 orch_end=48236915058781 orch_cost=308.000us +Thread 2: orch_start=48236915044003 orch_end=48236915058782 orch_cost=308.000us +PTO2 total submitted tasks = 13, already executed 13 tasks +Thread 0: sched_start=48236915043911 sched_end=48236915059191 sched_cost=318.000us +Thread 0: Scheduler summary: total_time=187.920us, loops=4561, tasks_scheduled=4 +Thread 1: sched_start=48236915043947 sched_end=48236915061881 sched_cost=372.000us +Thread 1: Scheduler summary: total_time=168.620us, loops=3880, tasks_scheduled=9 +``` + +> With `orch_to_sched_=true`, orch threads transition to schedulers after orchestration. They print `orch_end` but do NOT print `Scheduler summary` or `sched_end` (they have no cores assigned at shutdown time). + +**Note:** +- All logs above are controlled by compile-time macro `PTO2_PROFILING`, not by `enable_profiling`. +- `enable_profiling` only controls shared-memory data collection / swimlane export. +- Enable `orch_to_sched_` via environment variable: `PTO2_ORCH_TO_SCHED=1`. 
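
The DEV_ALWAYS count formula quoted above can be sanity-checked with a few lines (the helper name is ours):

```python
# Level 1 DEV_ALWAYS count: N_sched*2 + N_orch*1 + 1, plus one extra line
# (orch_stage_end) when orch_to_sched_ is true.
def dev_always_count(n_sched: int, n_orch: int, orch_to_sched: bool) -> int:
    count = n_sched * 2 + n_orch * 1 + 1  # sched timing+summary, orch timing, PTO2 total
    if orch_to_sched:
        count += 1                        # orch_stage_end line
    return count

# paged_attention example: 2 scheduler threads + 2 orchestrator threads
print(dev_always_count(2, 2, False))  # 7
print(dev_always_count(2, 2, True))   # 8
```

These match the Level 1 column of the summary table at the end of this document.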
+ +--- + +### Level 2: Scheduler Detailed Profiling (PTO2_SCHED_PROFILING=1) + +**Requires:** `PTO2_PROFILING=1` + +**What's compiled:** +- All Level 1 features +- Detailed scheduler phase counters +- Phase-specific statistics (complete, scan, dispatch, idle) +- Hit rate tracking (complete poll, ready queue pop) + +**Log output:** 18 DEV_ALWAYS logs (11 debug + 2 basic + 7 scheduler detailed - 2 replaced) +- Replaces scheduler summary with detailed breakdown + +**Scheduler output:** +``` +Thread X: === Scheduler Phase Breakdown: total=XXXus, XXX tasks === +Thread X: complete : XXXus (XX.X%) [fanout: edges=XXX, max_degree=X, avg=X.X] [fanin: edges=XXX, max_degree=X, avg=X.X] +Thread X: poll : XXXus (XX.X%) hit=XXX, miss=XXX, hit_rate=XX.X% +Thread X: otc_lock : XXXus (XX.X%) work=XXXus wait=XXXus atomics=XXX +Thread X: otc_fanout : XXXus (XX.X%) work=XXXus wait=XXXus atomics=XXX +Thread X: otc_fanin : XXXus (XX.X%) atomics=XXX +Thread X: otc_self : XXXus (XX.X%) atomics=XXX +Thread X: perf : XXXus (XX.X%) +Thread X: dispatch : XXXus (XX.X%) [pop: hit=XXX, miss=XXX, hit_rate=XX.X%] +Thread X: poll : XXXus (XX.X%) +Thread X: pop : XXXus (XX.X%) work=XXXus wait=XXXus atomics=XXX +Thread X: setup : XXXus (XX.X%) +Thread X: scan : XXXus (XX.X%) +Thread X: idle : XXXus (XX.X%) +Thread X: avg/complete : XXXus +Thread X: Scheduler summary: total_time=XXXus, loops=XXX, tasks_scheduled=XXX +``` + +--- + +### Level 3: Orchestrator Detailed Profiling (PTO2_ORCH_PROFILING=1) + +**Requires:** `PTO2_PROFILING=1` + +**What's compiled:** +- All Level 1 features +- Detailed orchestrator phase counters +- Per-phase cycle tracking +- Atomic operation counters +- Wait time tracking + +**Log output:** 30 DEV_ALWAYS logs (11 debug + 2 basic + 1 scheduler summary + 17 orchestrator detailed - 1 replaced) +- Replaces basic orchestration completion with detailed breakdown + +**Orchestrator output:** +``` +Thread X: === Orchestrator Profiling: XXX tasks, total=XXXus === +Thread X: 
sync_tensormap : XXXus (XX.X%) +Thread X: task_ring_alloc: XXXus (XX.X%) work=XXXus wait=XXXus atomics=XXX +Thread X: param_copy : XXXus (XX.X%) atomics=XXX +Thread X: lookup+dep : XXXus (XX.X%) +Thread X: heap_alloc : XXXus (XX.X%) work=XXXus wait=XXXus atomics=XXX +Thread X: tensormap_ins : XXXus (XX.X%) +Thread X: fanin+ready : XXXus (XX.X%) work=XXXus wait=XXXus atomics=XXX +Thread X: finalize+SM : XXXus (XX.X%) work=XXXus wait=XXXus atomics=XXX +Thread X: scope_end : XXXus atomics=XXX +Thread X: avg/task : XXXus +``` + +**Note:** Orchestrator logs always print when `PTO2_ORCH_PROFILING=1`, regardless of `enable_profiling` flag. + +--- + +### Level 4: TensorMap Profiling (PTO2_TENSORMAP_PROFILING=1) + +**Requires:** `PTO2_PROFILING=1` AND `PTO2_ORCH_PROFILING=1` + +**What's compiled:** +- All Level 3 features +- TensorMap lookup statistics +- Hash chain walk tracking +- Overlap check counters + +**Log output:** 34 DEV_ALWAYS logs (30 from Level 3 + 4 tensormap) + +**TensorMap output:** +``` +Thread X: === TensorMap Lookup Stats === +Thread X: lookups : XXX, inserts: XXX +Thread X: chain walked : total=XXX, avg=X.X, max=X +Thread X: overlap checks : XXX, hits=XXX (XX.X%) +``` + +--- + +## Runtime Flag: enable_profiling + +The `runtime->enable_profiling` flag controls **data collection**, NOT log output. 
+ +### When enable_profiling=true: +- Performance buffers are allocated and written +- Per-task timing data is collected +- Phase profiling data is recorded +- Orchestrator summary is written to shared memory + +### When enable_profiling=false: +- No performance data collection +- No shared memory writes +- Logs still print (controlled by macros only) + +### Usage: +```cpp +// Initialize runtime with profiling enabled +runtime->enable_profiling = true; +``` + +--- + +## Common Profiling Configurations + +### Development (minimal overhead) +```bash +# No profiling overhead +PTO2_PROFILING=0 +``` + +### Basic Performance Monitoring +```bash +# Minimal overhead, summary logs only +PTO2_PROFILING=1 +PTO2_ORCH_PROFILING=0 +PTO2_SCHED_PROFILING=0 +``` + +### Scheduler Performance Analysis +```bash +# Detailed scheduler breakdown +PTO2_PROFILING=1 +PTO2_ORCH_PROFILING=0 +PTO2_SCHED_PROFILING=1 +``` + +### Orchestrator Performance Analysis +```bash +# Detailed orchestrator breakdown +PTO2_PROFILING=1 +PTO2_ORCH_PROFILING=1 +PTO2_SCHED_PROFILING=0 +``` + +### Full Profiling (maximum overhead) +```bash +# All profiling features enabled +PTO2_PROFILING=1 +PTO2_ORCH_PROFILING=1 +PTO2_SCHED_PROFILING=1 +PTO2_TENSORMAP_PROFILING=1 +``` + +--- + +## Setting Profiling Macros + +### At compile time: +```bash +# In CMakeLists.txt or build command +add_definitions(-DPTO2_PROFILING=1) +add_definitions(-DPTO2_ORCH_PROFILING=1) +``` + +### In source code (before including headers): +```cpp +#define PTO2_PROFILING 1 +#define PTO2_ORCH_PROFILING 1 +#include "pto_runtime2_types.h" +``` + +--- + +## Log Output Summary + +> Example: `paged_attention` on Ascend hardware, 2 sched threads + 2 orch threads, normal run (no stall/timeout). 
+ +| Level | Macro Settings | DEV_ALWAYS Count (`orch_to_sched_=false`) | DEV_ALWAYS Count (`orch_to_sched_=true`) | Description | +|-------|---------------|------------------------------------------|------------------------------------------|-------------| +| 0 | `PTO2_PROFILING=0` | 0 | 0 | No timing output | +| 1 | `PTO2_PROFILING=1` | 7 | 8 | Timing timestamps + scheduler summary | +| 2 | `+PTO2_SCHED_PROFILING=1` | — | — | Scheduler detailed phase breakdown | +| 3 | `+PTO2_ORCH_PROFILING=1` | — | — | Orchestrator detailed phase breakdown | +| 4 | `+PTO2_TENSORMAP_PROFILING=1` | — | — | TensorMap lookup stats | + +--- + +## Implementation Notes + +### Key Principles + +1. **Macros control compilation and logging** + - `#if PTO2_PROFILING` controls whether profiling code is compiled + - Logs print when macro is enabled, regardless of runtime flag + +2. **Runtime flag controls data collection** + - `enable_profiling` controls performance buffer allocation + - Controls shared memory writes for host-side export + - Does NOT control log output + +3. 
**Consistent behavior across components**
+   - Scheduler logs: macro-controlled only
+   - Orchestrator logs: macro-controlled only
+   - Data collection: runtime flag controlled
+
+### Code Locations
+
+- Macro definitions: `src/a2a3/runtime/tensormap_and_ringbuffer/runtime/pto_runtime2_types.h`
+- Scheduler profiling: `src/a2a3/runtime/tensormap_and_ringbuffer/aicpu/aicpu_executor.cpp` (lines 770-835)
+- Orchestrator profiling: `src/a2a3/runtime/tensormap_and_ringbuffer/aicpu/aicpu_executor.cpp` (lines 1035-1105)
+- TensorMap profiling: `src/a2a3/runtime/tensormap_and_ringbuffer/runtime/pto_tensormap.h`
+
+---
+
+## Performance Impact
+
+### Compilation overhead:
+- Level 0: No overhead
+- Level 1: Minimal (counter increments, basic arithmetic)
+- Level 2-4: Low to moderate (additional counters, cycle measurements)
+
+### Runtime overhead:
+- Logging: Negligible (device logs are asynchronous)
+- Data collection (`enable_profiling=true`): Low to moderate
+  - Performance buffer writes
+  - Shared memory updates
+  - Per-task timing measurements
+
+### Recommendation:
+- Use Level 0 for production
+- Use Level 1-2 for performance monitoring
+- Use Level 3-4 for detailed performance analysis only
diff --git a/src/a2a3/runtime/tensormap_and_ringbuffer_unmodified/host/runtime_compile_info.cpp b/src/a2a3/runtime/tensormap_and_ringbuffer_unmodified/host/runtime_compile_info.cpp
new file mode 100644
index 000000000..76c0e8a74
--- /dev/null
+++ b/src/a2a3/runtime/tensormap_and_ringbuffer_unmodified/host/runtime_compile_info.cpp
@@ -0,0 +1,18 @@
+#include "host/platform_compile_info.h"
+#include "host/runtime_compile_info.h"
+#include <string.h>
+
+extern "C" {
+
+ToolchainType get_incore_compiler(void) {
+    if (strcmp(get_platform(), "a2a3") == 0) return TOOLCHAIN_CCEC;
+    return TOOLCHAIN_HOST_GXX_15;
+}
+
+ToolchainType get_orchestration_compiler(void) {
+    // tensormap_and_ringbuffer: a2a3 needs aarch64 cross-compile (AICPU is aarch64)
+    if (strcmp(get_platform(), "a2a3") == 0) return 
TOOLCHAIN_AARCH64_GXX; + return TOOLCHAIN_HOST_GXX; +} + +} diff --git a/src/a2a3/runtime/tensormap_and_ringbuffer_unmodified/host/runtime_maker.cpp b/src/a2a3/runtime/tensormap_and_ringbuffer_unmodified/host/runtime_maker.cpp new file mode 100644 index 000000000..e29bee245 --- /dev/null +++ b/src/a2a3/runtime/tensormap_and_ringbuffer_unmodified/host/runtime_maker.cpp @@ -0,0 +1,381 @@ +/* + * Copyright (c) PyPTO Contributors. + * This program is free software, you can redistribute it and/or modify it under the terms and conditions of + * CANN Open Software License Agreement Version 2.0 (the "License"). + * Please refer to the License for details. You may not use this file except in compliance with the License. + * THIS SOFTWARE IS PROVIDED ON AN "AS IS" BASIS, WITHOUT WARRANTIES OF ANY KIND, EITHER EXPRESS OR IMPLIED, + * INCLUDING BUT NOT LIMITED TO NON-INFRINGEMENT, MERCHANTABILITY, OR FITNESS FOR A PARTICULAR PURPOSE. + * See LICENSE in the root of the software repository for the full text of the License. + * ----------------------------------------------------------------------------------------------------------- + */ +/** + * Runtime Builder - rt2 Implementation (Device Orchestration) + * + * Provides init_runtime_impl and validate_runtime_impl functions for rt2 runtime. + * Supports device orchestration where AICPU thread 3 runs the orchestrator. 
+ * + * init_runtime_impl: + * - Converts host tensor pointers to device pointers (all tensors copied both directions) + * - Copies orchestration SO to device memory + * - Sets up runtime state for device orchestration + * + * validate_runtime_impl: + * - Copies recorded tensors back from device to host + * - Frees device memory + */ + +#include +#include +#include + +#include +#include +#include +#include +#include +#include + +#include "../runtime/pto_shared_memory.h" +#include "../runtime/runtime.h" +#include "callable.h" +#include "common/platform_config.h" +#include "common/unified_log.h" + +// Helper: return current time in milliseconds +static int64_t _now_ms() { + struct timeval tv; + gettimeofday(&tv, nullptr); + return static_cast(tv.tv_sec) * 1000 + tv.tv_usec / 1000; +} + +/** + * Parse an environment variable as uint64_t with optional power-of-2 constraint. + * Returns the parsed value on success, or 0 if unset or validation fails. + */ +static uint64_t parse_env_uint64(const char* name, uint64_t min_val, bool require_power_of_2) { + const char* env = std::getenv(name); + if (!env) return 0; + char* endptr; + errno = 0; + uint64_t val = strtoull(env, &endptr, 10); + if (errno == ERANGE || endptr == env || *endptr != '\0' || val < min_val) { + LOG_WARN("%s=%s invalid (must be a valid integer >= %" PRIu64 "), ignored", name, env, min_val); + return 0; + } + if (require_power_of_2 && (val & (val - 1)) != 0) { + LOG_WARN("%s=%s invalid (must be a power of 2, >= %" PRIu64 "), ignored", name, env, min_val); + return 0; + } + return static_cast(val); +} + +/** + * Initialize a pre-allocated runtime for device orchestration. + * + * For rt2 runtime, orchestration runs on AICPU thread 3 (device-side). 
+ * This function: + * - Copies tensor metadata and replaces host pointers with device pointers + * - Copies all tensor data to device + * - Records all tensors for copy-back + * - Copies orchestration SO to device memory + * - Sets up runtime state for device orchestration + * + * @param runtime Pointer to pre-constructed Runtime + * @param callable ChipCallable containing orch binary, func_name, and child kernels + * @param orch_args Separated tensor/scalar arguments + * @return 0 on success, -1 on failure + */ +extern "C" int init_runtime_impl(Runtime* runtime, const ChipCallable* callable, const ChipStorageTaskArgs* orch_args) { + // Validate inputs + if (runtime == nullptr) { + LOG_ERROR("Runtime pointer is null"); + return -1; + } + + // Register kernel binaries from ChipCallable children + if (callable->child_count() > 0) { + LOG_INFO("Registering %d kernel(s) in init_runtime_impl", callable->child_count()); + for (int32_t i = 0; i < callable->child_count(); i++) { + int func_id = callable->child_func_id(i); + const auto& kernel = callable->child(i); + uint64_t addr = runtime->host_api.upload_kernel_binary(func_id, + reinterpret_cast(&kernel), + CoreCallable::binary_data_offset() + kernel.binary_size()); + if (addr == 0) { + LOG_ERROR("Failed to upload kernel binary for func_id=%d", func_id); + return -1; + } + runtime->set_function_bin_addr(func_id, addr); + } + } + + const uint8_t* orch_so_binary = static_cast(callable->binary_data()); + size_t orch_so_size = callable->binary_size(); + + if (orch_so_binary == nullptr || orch_so_size == 0) { + LOG_ERROR("Orchestration SO binary is required for device orchestration"); + return -1; + } + + if (orch_args == nullptr) { + LOG_ERROR("orch_args pointer is null"); + return -1; + } + + int tensor_count = orch_args->tensor_count(); + int scalar_count = orch_args->scalar_count(); + LOG_INFO("RT2 init: %d tensors + %d scalars, device orchestration mode", tensor_count, scalar_count); + + int64_t t_total_start = 
_now_ms();
+
+    // Build device args: copy from input, replace host tensor pointers with device pointers
+    ChipStorageTaskArgs device_args;
+
+    int64_t t_args_start = _now_ms();
+    for (int i = 0; i < tensor_count; i++) {
+        ContinuousTensor t = orch_args->tensor(i);
+
+        void* host_ptr = reinterpret_cast<void*>(static_cast<uintptr_t>(t.data));
+        size_t size = static_cast<size_t>(t.nbytes());
+
+        void* dev_ptr = runtime->host_api.device_malloc(size);
+        if (dev_ptr == nullptr) {
+            LOG_ERROR("Failed to allocate device memory for tensor %d", i);
+            return -1;
+        }
+
+        int rc = runtime->host_api.copy_to_device(dev_ptr, host_ptr, size);
+        if (rc != 0) {
+            LOG_ERROR("Failed to copy tensor %d to device", i);
+            runtime->host_api.device_free(dev_ptr);
+            return -1;
+        }
+        runtime->record_tensor_pair(host_ptr, dev_ptr, size);
+        LOG_INFO(" Tensor %d: %zu bytes at %p", i, size, dev_ptr);
+
+        t.data = reinterpret_cast<uint64_t>(dev_ptr);
+        device_args.add_tensor(t);
+    }
+    for (int i = 0; i < scalar_count; i++) {
+        device_args.add_scalar(orch_args->scalar(i));
+    }
+    int64_t t_args_end = _now_ms();
+
+    // Copy orchestration SO to device memory (AICPU cannot access host memory)
+    int64_t t_so_start = _now_ms();
+    void* dev_so = runtime->host_api.device_malloc(orch_so_size);
+    if (dev_so == nullptr) {
+        LOG_ERROR("Failed to allocate device memory for orchestration SO");
+        return -1;
+    }
+    int rc = runtime->host_api.copy_to_device(dev_so, orch_so_binary, orch_so_size);
+    if (rc != 0) {
+        LOG_ERROR("Failed to copy orchestration SO to device");
+        runtime->host_api.device_free(dev_so);
+        return -1;
+    }
+    // Copy SO binary into Runtime's internal storage (device_orch_so_storage_)
+    // Pass the HOST pointer (orch_so_binary), not the device pointer (dev_so)
+    // AICPU Thread 3 will read from get_device_orch_so_data() which returns this storage
+    runtime->set_device_orch_so(orch_so_binary, orch_so_size);
+    runtime->record_tensor_pair(nullptr, dev_so, orch_so_size);
+    LOG_INFO("Orchestration SO: %zu bytes copied to device", 
orch_so_size);
+    int64_t t_so_end = _now_ms();
+
+    // Read ready queue shard count from environment for AICPU scheduler
+    {
+        const char* env_shards = std::getenv("PTO2_READY_QUEUE_SHARDS");
+        if (env_shards) {
+            char* endptr;
+            int64_t val = strtol(env_shards, &endptr, 10);
+            if (endptr != env_shards && *endptr == '\0' && val >= 1 && val <= PLATFORM_MAX_AICPU_THREADS) {
+                runtime->ready_queue_shards = static_cast<int>(val);
+            } else {
+                LOG_WARN("PTO2_READY_QUEUE_SHARDS=%s is invalid or out of range [1,%d], using default %d",
+                         env_shards,
+                         PLATFORM_MAX_AICPU_THREADS,
+                         RUNTIME_DEFAULT_READY_QUEUE_SHARDS);
+                runtime->ready_queue_shards = RUNTIME_DEFAULT_READY_QUEUE_SHARDS;
+            }
+        }
+        LOG_INFO("Ready queue shards: %d", runtime->ready_queue_shards);
+    }
+
+    // Read orchestrator-to-scheduler transition flag from environment
+    {
+        const char* env_val = std::getenv("PTO2_ORCH_TO_SCHED");
+        if (env_val && (env_val[0] == '1' || env_val[0] == 't' || env_val[0] == 'T')) {
+            runtime->orch_to_sched = true;
+        }
+        LOG_INFO("Orchestrator-to-scheduler transition: %s", runtime->orch_to_sched ? "enabled" : "disabled");
+    }
+
+    // Read ring buffer size overrides from environment
+    {
+        runtime->pto2_task_window_size = parse_env_uint64("PTO2_RING_TASK_WINDOW", 4, true);
+        runtime->pto2_heap_size = parse_env_uint64("PTO2_RING_HEAP", 1024, true);
+        runtime->pto2_dep_pool_size = parse_env_uint64("PTO2_RING_DEP_POOL", 4, false);
+        if (runtime->pto2_task_window_size || runtime->pto2_heap_size || runtime->pto2_dep_pool_size) {
+            LOG_INFO("Ring buffer overrides: task_window=%" PRIu64 " heap=%" PRIu64 " dep_pool=%" PRIu64,
+                     (uint64_t)(runtime->pto2_task_window_size ? runtime->pto2_task_window_size : PTO2_TASK_WINDOW_SIZE),
+                     (uint64_t)(runtime->pto2_heap_size ? runtime->pto2_heap_size : PTO2_HEAP_SIZE),
+                     (uint64_t)(runtime->pto2_dep_pool_size ? 
runtime->pto2_dep_pool_size : PTO2_DEP_LIST_POOL_SIZE));
+        }
+    }
+
+    // Resolve effective sizes (env override or compile-time default)
+    uint64_t eff_heap_size = runtime->pto2_heap_size ? runtime->pto2_heap_size : PTO2_HEAP_SIZE;
+    uint64_t eff_task_window_size =
+        runtime->pto2_task_window_size ? runtime->pto2_task_window_size : PTO2_TASK_WINDOW_SIZE;
+
+    // Allocate GM heap for orchestrator output buffers (all rings combined)
+    uint64_t total_heap_size = eff_heap_size * PTO2_MAX_RING_DEPTH;
+    int64_t t_heap_start = _now_ms();
+    void* gm_heap = runtime->host_api.device_malloc(total_heap_size);
+    int64_t t_heap_end = _now_ms();
+    if (gm_heap == nullptr) {
+        LOG_ERROR("Failed to allocate GM heap");
+        return -1;
+    }
+    runtime->record_tensor_pair(nullptr, gm_heap, total_heap_size);
+    runtime->set_pto2_gm_heap(gm_heap);
+
+    // Allocate PTO2 shared memory
+    int64_t t_sm_start = _now_ms();
+    uint64_t sm_size = pto2_sm_calculate_size(eff_task_window_size);
+    void* sm_ptr = runtime->host_api.device_malloc(sm_size);
+    int64_t t_sm_end = _now_ms();
+    if (sm_ptr == nullptr) {
+        LOG_ERROR("Failed to allocate PTO2 shared memory");
+        return -1;
+    }
+    runtime->set_pto2_gm_sm_ptr(sm_ptr);
+    runtime->record_tensor_pair(nullptr, sm_ptr, static_cast<size_t>(sm_size));
+
+    // Set up device orchestration state
+    runtime->set_orch_built_on_host(false);
+    runtime->set_orch_args(device_args);
+
+    LOG_INFO("Device orchestration ready: %d tensors + %d scalars", tensor_count, scalar_count);
+
+    int64_t t_total_end = _now_ms();
+    LOG_INFO("TIMING: args_malloc_copy = %" PRId64 "ms", t_args_end - t_args_start);
+    LOG_INFO("TIMING: orch_so_copy = %" PRId64 "ms", t_so_end - t_so_start);
+    LOG_INFO("TIMING: gm_heap_alloc(1GB) = %" PRId64 "ms", t_heap_end - t_heap_start);
+    LOG_INFO("TIMING: shared_mem_alloc = %" PRId64 "ms", t_sm_end - t_sm_start);
+    LOG_INFO("TIMING: total_init_runtime_impl = %" PRId64 "ms", t_total_end - t_total_start);
+
+    return 0;
+}
+
+/**
+ * Validate runtime results and cleanup.
+ * + * This function: + * 1. Copies recorded tensors from device back to host + * 2. Frees device memory for recorded tensors + * 3. Clears tensor pair state + * + * @param runtime Pointer to Runtime + * @return 0 on success, -1 on failure + */ +extern "C" int validate_runtime_impl(Runtime* runtime) { + if (runtime == nullptr) { + LOG_ERROR("Runtime pointer is null"); + return -1; + } + + int rc = 0; + + LOG_INFO("=== Copying Results Back to Host ==="); + + // Copy all recorded tensors from device back to host + TensorPair* tensor_pairs = runtime->get_tensor_pairs(); + int tensor_pair_count = runtime->get_tensor_pair_count(); + + LOG_INFO("Tensor pairs to process: %d", tensor_pair_count); + + // PTO2 (device orchestration): graph output may be in packed buffer + void* pto2_sm = runtime->get_pto2_gm_sm_ptr(); + uint64_t graph_out_ptr = 0; + uint64_t graph_out_size = 0; + + if (pto2_sm != nullptr) { + // Copy header from device to host to read graph_output_ptr/size + PTO2SharedMemoryHeader host_header; + int hdr_rc = runtime->host_api.copy_from_device(&host_header, pto2_sm, sizeof(PTO2SharedMemoryHeader)); + if (hdr_rc == 0) { + graph_out_ptr = host_header.graph_output_ptr; + graph_out_size = host_header.graph_output_size; + if (graph_out_ptr != 0) { + LOG_INFO("Graph output buffer: ptr=0x%" PRIx64 ", size=%" PRIu64, graph_out_ptr, graph_out_size); + } + } else { + LOG_WARN("Failed to copy PTO2 header from device"); + } + } + + bool first_output_tensor = true; + for (int i = 0; i < tensor_pair_count; i++) { + const TensorPair& pair = tensor_pairs[i]; + + // Skip if device pointer is null + if (pair.dev_ptr == nullptr) { + LOG_WARN("Tensor %d has null device pointer, skipping", i); + continue; + } + + // If host pointer is null, this is a device-only allocation (no copy-back) + if (pair.host_ptr == nullptr) { + LOG_INFO("Tensor %d: device-only allocation (no copy-back)", i); + continue; + } + + void* src_ptr = pair.dev_ptr; + size_t copy_size = pair.size; + + // Use 
graph_output_ptr for the first output tensor if available
+        if (first_output_tensor && graph_out_ptr != 0 && graph_out_size > 0) {
+            src_ptr = reinterpret_cast<void*>(static_cast<uintptr_t>(graph_out_ptr));
+            copy_size = static_cast<size_t>(graph_out_size);
+            LOG_INFO("Using packed output buffer for tensor %d", i);
+            first_output_tensor = false;
+        }
+
+        int copy_rc = runtime->host_api.copy_from_device(pair.host_ptr, src_ptr, copy_size);
+        if (copy_rc != 0) {
+            LOG_ERROR("Failed to copy tensor %d from device: %d", i, copy_rc);
+            rc = copy_rc;
+        } else {
+            LOG_INFO("Tensor %d: %zu bytes copied to host", i, pair.size);
+        }
+    }
+
+    // Cleanup device tensors
+    LOG_INFO("=== Cleaning Up ===");
+    for (int i = 0; i < tensor_pair_count; i++) {
+        if (tensor_pairs[i].dev_ptr != nullptr) {
+            runtime->host_api.device_free(tensor_pairs[i].dev_ptr);
+        }
+    }
+    LOG_INFO("Freed %d device allocations", tensor_pair_count);
+
+    // Cleanup kernel binaries
+    int kernel_count = runtime->get_registered_kernel_count();
+    for (int i = 0; i < kernel_count; i++) {
+        int func_id = runtime->get_registered_kernel_func_id(i);
+        runtime->host_api.remove_kernel_binary(func_id);
+        runtime->set_function_bin_addr(func_id, 0);
+    }
+    if (kernel_count > 0) {
+        LOG_INFO("Freed %d kernel binaries", kernel_count);
+    }
+    runtime->clear_registered_kernels();
+
+    // Clear tensor pairs
+    runtime->clear_tensor_pairs();
+
+    LOG_INFO("=== Finalize Complete ===");
+
+    return rc;
+}
diff --git a/src/a2a3/runtime/tensormap_and_ringbuffer_unmodified/orchestration/common.cpp b/src/a2a3/runtime/tensormap_and_ringbuffer_unmodified/orchestration/common.cpp
new file mode 100644
index 000000000..f0c666908
--- /dev/null
+++ b/src/a2a3/runtime/tensormap_and_ringbuffer_unmodified/orchestration/common.cpp
@@ -0,0 +1,174 @@
+#include "common.h"
+#include "pto_orchestration_api.h"
+
+#ifdef __linux__
+#include <execinfo.h>
+#include <dlfcn.h>
+#include <cxxabi.h>
+#include <cstdio>
+
+#include <array>
+#include <string>
+#include <vector>
+#endif
+
+struct PTO2Runtime;
+
+namespace {
+// Plain global (not thread_local) to 
avoid glibc TLSDESC stale-resolution
+// crash (BZ #32412) when the orchestration SO is dlclose'd/re-dlopen'd
+// between execution rounds. All orchestrator threads bind the same rt
+// value, so per-thread storage is unnecessary.
+PTO2Runtime* g_pto2_current_runtime = nullptr;
+}
+
+extern "C" __attribute__((visibility("default"))) void pto2_framework_bind_runtime(PTO2Runtime* rt) {
+    g_pto2_current_runtime = rt;
+}
+
+// Keep current_runtime local to this .so so orchestration helpers do not
+// accidentally bind to the AICPU binary's same-named symbol.
+extern "C" __attribute__((visibility("hidden"))) PTO2Runtime* pto2_framework_current_runtime() {
+    return g_pto2_current_runtime;
+}
+
+/**
+ * Use addr2line to translate an address into file:line information.
+ * The -i flag expands inlining; the first line returned is the innermost
+ * actual code location. When inlining is present, the outer call chain is
+ * also returned through inline_chain.
+ */
+#ifdef __linux__
+static std::string addr_to_line(const char* executable, void* addr,
+                                std::string* inline_chain = nullptr) {
+    char cmd[512];
+    snprintf(cmd, sizeof(cmd), "addr2line -e %s -f -C -p -i %p 2>/dev/null", executable, addr);
+
+    std::array<char, 256> buffer;
+    std::string raw_output;
+
+    FILE* pipe = popen(cmd, "r");
+    if (pipe) {
+        while (fgets(buffer.data(), buffer.size(), pipe) != nullptr) {
+            raw_output += buffer.data();
+        }
+        pclose(pipe);
+    }
+
+    if (raw_output.empty() || raw_output.find("??") != std::string::npos) {
+        return "";
+    }
+
+    // Split into lines
+    std::vector<std::string> lines;
+    size_t pos = 0;
+    while (pos < raw_output.size()) {
+        size_t nl = raw_output.find('\n', pos);
+        if (nl == std::string::npos) nl = raw_output.size();
+        std::string line = raw_output.substr(pos, nl - pos);
+        while (!line.empty() && line.back() == '\r') line.pop_back();
+        if (!line.empty()) lines.push_back(line);
+        pos = nl + 1;
+    }
+
+    if (lines.empty()) return "";
+
+    // The first line is the innermost actual code location; the remaining
+    // lines are the outer inline callers.
+    if (inline_chain && lines.size() > 1) {
+        *inline_chain = "";
+        for (size_t j = 1; j < lines.size(); j++) {
+            *inline_chain += " [inlined by] " + lines[j] + "\n";
+        }
+    }
+
+    return lines.front();
+}
+#endif + +/** + * Get the current call-stack information (including file paths and line + * numbers). Uses dladdr to locate the shared library containing each + * stack frame, then invokes addr2line with the module-relative address. + */ +std::string get_stacktrace(int skip_frames) { + (void)skip_frames; // May be unused on non-Linux platforms + std::string result; +#ifdef __linux__ + const int max_frames = 64; + void* buffer[max_frames]; + int nframes = backtrace(buffer, max_frames); + char** symbols = backtrace_symbols(buffer, nframes); + + if (symbols) { + result = "Stack trace:\n"; + for (int i = skip_frames; i < nframes; i++) { + std::string frame_info; + + void* addr = (void*)((char*)buffer[i] - 1); + + Dl_info dl_info; + std::string inline_chain; + if (dladdr(addr, &dl_info) && dl_info.dli_fname) { + void* rel_addr = (void*)((char*)addr - (char*)dl_info.dli_fbase); + std::string addr2line_result = addr_to_line(dl_info.dli_fname, rel_addr, &inline_chain); + + if (addr2line_result.empty()) { + addr2line_result = addr_to_line(dl_info.dli_fname, addr, &inline_chain); + } + + if (!addr2line_result.empty()) { + frame_info = std::string(dl_info.dli_fname) + ": " + addr2line_result; + } + } + + if (frame_info.empty()) { + std::string frame(symbols[i]); + + size_t start = frame.find('('); + size_t end = frame.find('+', start); + if (start != std::string::npos && end != std::string::npos) { + std::string mangled = frame.substr(start + 1, end - start - 1); + int status; + char* demangled = abi::__cxa_demangle(mangled.c_str(), nullptr, nullptr, &status); + if (status == 0 && demangled) { + frame = frame.substr(0, start + 1) + demangled + frame.substr(end); + free(demangled); + } + } + frame_info = frame; + } + + char buf[16]; + snprintf(buf, sizeof(buf), " #%d ", i - skip_frames); + result += buf + frame_info + "\n"; + if (!inline_chain.empty()) { + result += inline_chain; + } + } + free(symbols); + } +#else + result = "(stack traces are only available on Linux)\n"; +#endif + return result; +} + +// AssertionError constructor +static std::string build_assert_message(const char* condition, const char* file, int line) { + std::string msg = 
"Assertion failed: " + std::string(condition) + "\n"; + msg += " Location: " + std::string(file) + ":" + std::to_string(line) + "\n"; + msg += get_stacktrace(3); + return msg; +} + +AssertionError::AssertionError(const char* condition, const char* file, int line) + : std::runtime_error(build_assert_message(condition, file, line)), + condition_(condition), file_(file), line_(line) {} + +[[noreturn]] void assert_impl(const char* condition, const char* file, int line) { + LOG_ERROR("\n========================================"); + LOG_ERROR("Assertion failed: %s", condition); + LOG_ERROR("Location: %s:%d", file, line); + LOG_ERROR("%s", get_stacktrace(2).c_str()); + LOG_ERROR("========================================\n"); + + throw AssertionError(condition, file, line); +} diff --git a/src/a2a3/runtime/tensormap_and_ringbuffer_unmodified/orchestration/pto_orchestration_api.h b/src/a2a3/runtime/tensormap_and_ringbuffer_unmodified/orchestration/pto_orchestration_api.h new file mode 100644 index 000000000..00b4899cb --- /dev/null +++ b/src/a2a3/runtime/tensormap_and_ringbuffer_unmodified/orchestration/pto_orchestration_api.h @@ -0,0 +1,308 @@ +/* + * Copyright (c) PyPTO Contributors. + * This program is free software, you can redistribute it and/or modify it under the terms and conditions of + * CANN Open Software License Agreement Version 2.0 (the "License"). + * Please refer to the License for details. You may not use this file except in compliance with the License. + * THIS SOFTWARE IS PROVIDED ON AN "AS IS" BASIS, WITHOUT WARRANTIES OF ANY KIND, EITHER EXPRESS OR IMPLIED, + * INCLUDING BUT NOT LIMITED TO NON-INFRINGEMENT, MERCHANTABILITY, OR FITNESS FOR A PARTICULAR PURPOSE. + * See LICENSE in the root of the software repository for the full text of the License. 
+ * ----------------------------------------------------------------------------------------------------------- + */ +/** + * PTO Orchestration API - Slim header for orchestration .so files + * + * This header provides everything an orchestration source needs without + * pulling in runtime implementation headers. The orchestration .so has + * zero link dependencies on runtime .cpp files; all runtime calls go + * through the PTO2RuntimeOps function-pointer table embedded in + * PTO2Runtime. + * + * Orchestration sources include ONLY this header: + * #include "pto_orchestration_api.h" + * + * Runtime sources continue to use pto_runtime2.h (which defines the + * full PTO2Runtime struct with all internal fields). + */ + +#ifndef SRC_A2A3_RUNTIME_TENSORMAP_AND_RINGBUFFER_ORCHESTRATION_PTO_ORCHESTRATION_API_H_ +#define SRC_A2A3_RUNTIME_TENSORMAP_AND_RINGBUFFER_ORCHESTRATION_PTO_ORCHESTRATION_API_H_ + +#include +#include +#include + +// Type headers needed by orchestration +#include "pto_submit_types.h" // MixedKernels, INVALID_KERNEL_ID, subtask slots // NOLINT(build/include_subdir) +#include "pto_types.h" // Arg, TaskOutputTensors, TensorArgType // NOLINT(build/include_subdir) +#include "task_args.h" // ChipStorageTaskArgs, ContinuousTensor // NOLINT(build/include_subdir) +#include "tensor.h" // Tensor, TensorCreateInfo // NOLINT(build/include_subdir) + +// ============================================================================= +// Tensor Factory Helpers +// ============================================================================= + +/** + * Create a Tensor for pre-allocated external memory. 
+ */ +inline Tensor make_tensor_external(void* addr, + const uint32_t shapes[], + uint32_t ndims, + DataType dtype = DataType::FLOAT32, + bool manual_dep = false, + int32_t version = 0) { + static uint32_t zero_offsets[RUNTIME_MAX_TENSOR_DIMS] = {}; + uint64_t total = 1; + for (uint32_t i = 0; i < ndims; i++) { + total *= shapes[i]; + } + return Tensor(addr, + total * get_element_size(dtype), + shapes, + shapes, + zero_offsets, + ndims, + dtype, + version, + /*is_all_offset_zero=*/true, + /*is_raw_eq_shapes=*/true, + manual_dep); +} + +// Convert ContinuousTensor to Tensor +static_assert( + CONTINUOUS_TENSOR_MAX_DIMS == RUNTIME_MAX_TENSOR_DIMS, "ContinuousTensor and runtime max dims must match"); +inline Tensor from_tensor_arg(const ContinuousTensor& t, bool manual_dep = false, int32_t version = 0) { + return make_tensor_external( + reinterpret_cast<void*>(static_cast<uintptr_t>(t.data)), t.shapes, t.ndims, t.dtype, manual_dep, version); +} + +// ============================================================================= +// Ops Table and Opaque Runtime +// ============================================================================= + +/** + * Forward declaration — the orchestration sees PTO2Runtime as a partial + * struct whose first field is the ops pointer. The full definition + * lives in pto_runtime2.h (used only by runtime .cpp files). + */ +typedef struct PTO2Runtime PTO2Runtime; + +#ifdef __cplusplus +extern "C" { +#endif + +/** + * Framework-internal runtime-binding bridge (implemented as a process-wide + * global; see common.cpp). + * + * The executor binds the current thread's runtime before invoking + * aicpu_orchestration_entry(), so orchestration helpers can fetch the + * current PTO2Runtime without explicit parameter threading. + */ +PTO2Runtime* pto2_framework_current_runtime(void); +void pto2_framework_bind_runtime(PTO2Runtime* rt); + +#ifdef __cplusplus +} +#endif + +/** + * Function-pointer table for runtime operations. + * Populated by the runtime; called by orchestration through inline wrappers. 
+ */ +typedef struct PTO2RuntimeOps { + TaskOutputTensors (*submit_task)(PTO2Runtime* rt, const MixedKernels& mixed_kernels, const Arg& args); + void (*scope_begin)(PTO2Runtime* rt); + void (*scope_end)(PTO2Runtime* rt); + void (*orchestration_done)(PTO2Runtime* rt); + bool (*is_fatal)(PTO2Runtime* rt); + + // Logging (populated by runtime, called by orchestration) + void (*log_error)(const char* func, const char* fmt, ...); + void (*log_warn)(const char* func, const char* fmt, ...); + void (*log_info)(const char* func, const char* fmt, ...); + void (*log_debug)(const char* func, const char* fmt, ...); + void (*log_always)(const char* func, const char* fmt, ...); + + // Cross-layer data access (orchestration reads/writes tensor values via runtime) + // Placed after logging to avoid shifting hot-path field offsets. + uint64_t (*get_tensor_data)(PTO2Runtime* rt, const Tensor& tensor, uint32_t ndims, const uint32_t indices[]); + void (*set_tensor_data)( + PTO2Runtime* rt, const Tensor& tensor, uint32_t ndims, const uint32_t indices[], uint64_t value); +} PTO2RuntimeOps; + +/** + * Partial PTO2Runtime definition for orchestration. + * + * Only the ops pointer is visible. The real struct (in pto_runtime2.h) + * has the same first field, so accessing rt->ops through this definition + * is well-defined (C struct layout guarantee). + */ +struct PTO2Runtime { + const PTO2RuntimeOps* ops; +}; + +// ============================================================================= +// Inline Convenience Wrappers (call through ops table) +// ============================================================================= + +static inline PTO2Runtime* pto2_current_runtime() { return pto2_framework_current_runtime(); } + +static inline TaskOutputTensors pto2_rt_submit_task(const MixedKernels& mixed_kernels, const Arg& args) { + PTO2Runtime* rt = pto2_current_runtime(); + return rt->ops->submit_task(rt, mixed_kernels, args); +} + +/** + * Convenience wrapper: submit an AIC-only task. 
+ */ +static inline TaskOutputTensors pto2_rt_submit_aic_task(int32_t kernel_id, const Arg& args) { + PTO2Runtime* rt = pto2_current_runtime(); + MixedKernels mk; + mk.aic_kernel_id = kernel_id; + return rt->ops->submit_task(rt, mk, args); +} + +/** + * Convenience wrapper: submit an AIV-only task (uses AIV0 slot). + */ +static inline TaskOutputTensors pto2_rt_submit_aiv_task(int32_t kernel_id, const Arg& args) { + PTO2Runtime* rt = pto2_current_runtime(); + MixedKernels mk; + mk.aiv0_kernel_id = kernel_id; + return rt->ops->submit_task(rt, mk, args); +} + +static inline void pto2_rt_scope_begin() { + PTO2Runtime* rt = pto2_current_runtime(); + rt->ops->scope_begin(rt); +} + +static inline void pto2_rt_scope_end() { + PTO2Runtime* rt = pto2_current_runtime(); + rt->ops->scope_end(rt); +} + +static inline void pto2_rt_orchestration_done() { + PTO2Runtime* rt = pto2_current_runtime(); + rt->ops->orchestration_done(rt); +} + +static inline bool pto2_rt_is_fatal() { + PTO2Runtime* rt = pto2_current_runtime(); + return rt->ops->is_fatal(rt); +} + +// ============================================================================= +// Logging Macros for Orchestration (call through ops table) +// ============================================================================= + +#define LOG_ERROR(fmt, ...) pto2_current_runtime()->ops->log_error(__FUNCTION__, fmt, ##__VA_ARGS__) +#define LOG_WARN(fmt, ...) pto2_current_runtime()->ops->log_warn(__FUNCTION__, fmt, ##__VA_ARGS__) +#define LOG_INFO(fmt, ...) pto2_current_runtime()->ops->log_info(__FUNCTION__, fmt, ##__VA_ARGS__) +#define LOG_DEBUG(fmt, ...) pto2_current_runtime()->ops->log_debug(__FUNCTION__, fmt, ##__VA_ARGS__) +#define LOG_ALWAYS(fmt, ...) 
pto2_current_runtime()->ops->log_always(__FUNCTION__, fmt, ##__VA_ARGS__) + +// ============================================================================= +// Cross-Layer Data Access +// ============================================================================= + +/** + * Read a value from a tensor at the given multi-dimensional indices. + * + * Default T = uint64_t preserves old behavior (raw bits). + * Specify T to get automatic type conversion: + * + * uint64_t raw = get_tensor_data(tensor, 1, idx); // old usage unchanged + * float val = get_tensor_data<float>(tensor, 1, idx); // typed read + * + * If the tensor has a producer in TensorMap, spin-waits until the producer + * task completes before reading. External tensors (make_tensor_external) + * are read immediately without waiting. + */ +template <typename T = uint64_t> +static inline T get_tensor_data(const Tensor& tensor, uint32_t ndims, const uint32_t indices[]) { + PTO2Runtime* rt = pto2_current_runtime(); + return from_u64<T>(rt->ops->get_tensor_data(rt, tensor, ndims, indices)); +} + +/** + * Write a value to a tensor at the given multi-dimensional indices. + * + * Type is deduced from value argument; uint64_t by default: + * + * set_tensor_data(tensor, 1, idx, raw_u64); // old usage unchanged + * set_tensor_data(tensor, 1, idx, 42.0f); // typed write (T = float) + * + * If the tensor has a producer in TensorMap, spin-waits until the producer + * and all its consumers complete before writing (WAW + WAR safety). + * External tensors (make_tensor_external) with no TensorMap entry are + * written immediately without waiting. + * + * Limitation: TensorMap only tracks producers (OUTPUT/INOUT), not consumers + * that used the tensor as INPUT. If a kernel reads this tensor as INPUT + * (not INOUT) and the tensor has no TensorMap producer entry, set_tensor_data + * cannot detect the reader and may cause a data race. 
+ * + * To ensure WAR safety for all access patterns, use add_inout() instead of + * add_input() for kernel parameters that may later be written via + * set_tensor_data. INOUT creates a TensorMap entry that enables automatic + * consumer tracking via fanout_refcount. + * + * The tensor must already have an allocated buffer (addr != 0). + * For runtime-created outputs, call this only on the Tensor returned by + * add_output(TensorCreateInfo) after submit returns. + */ +template <typename T = uint64_t> +static inline void set_tensor_data(const Tensor& tensor, uint32_t ndims, const uint32_t indices[], T value) { + PTO2Runtime* rt = pto2_current_runtime(); + rt->ops->set_tensor_data(rt, tensor, ndims, indices, to_u64(value)); +} + +// ============================================================================= +// C++ Scope Guards and Macros +// ============================================================================= + +/** + * RAII Scope Guard (calls through ops table) + */ +class PTO2ScopeGuard { +public: // NOLINT(whitespace/indent) + PTO2ScopeGuard() : rt_(pto2_current_runtime()) { rt_->ops->scope_begin(rt_); } + ~PTO2ScopeGuard() { rt_->ops->scope_end(rt_); } + +private: // NOLINT(whitespace/indent) + PTO2Runtime* rt_; +}; + +#define _PTO2_CONCATENATE_IMPL(x, y) x##y +#define _PTO2_CONCATENATE(x, y) _PTO2_CONCATENATE_IMPL(x, y) + +#define PTO2_SCOPE_GUARD() [[maybe_unused]] PTO2ScopeGuard _PTO2_CONCATENATE(scope_guard_, __COUNTER__) + +/** + * Scoped block macro: + * PTO2_SCOPE() { + * pto2_rt_submit_task(...); + * } + */ +#define PTO2_SCOPE() if (PTO2_SCOPE_GUARD(); true) + +// ============================================================================= +// Orchestration Config +// ============================================================================= + +/** + * Configuration exported by orchestration .so via aicpu_orchestration_config(). + * The executor reads these values to set up shared memory and runtime. 
+ * + * This struct is defined identically in pto_runtime2.h (with an include + * guard) so the executor can use the same type without including this header. + */ +#ifndef PTO2_ORCHESTRATION_CONFIG_DEFINED +#define PTO2_ORCHESTRATION_CONFIG_DEFINED +struct PTO2OrchestrationConfig { + int expected_arg_count; +}; +#endif + +#endif // SRC_A2A3_RUNTIME_TENSORMAP_AND_RINGBUFFER_ORCHESTRATION_PTO_ORCHESTRATION_API_H_ diff --git a/src/a2a3/runtime/tensormap_and_ringbuffer_unmodified/runtime/common.h b/src/a2a3/runtime/tensormap_and_ringbuffer_unmodified/runtime/common.h new file mode 100644 index 000000000..1a5af9de3 --- /dev/null +++ b/src/a2a3/runtime/tensormap_and_ringbuffer_unmodified/runtime/common.h @@ -0,0 +1,93 @@ +/* + * Copyright (c) PyPTO Contributors. + * This program is free software, you can redistribute it and/or modify it under the terms and conditions of + * CANN Open Software License Agreement Version 2.0 (the "License"). + * Please refer to the License for details. You may not use this file except in compliance with the License. + * THIS SOFTWARE IS PROVIDED ON AN "AS IS" BASIS, WITHOUT WARRANTIES OF ANY KIND, EITHER EXPRESS OR IMPLIED, + * INCLUDING BUT NOT LIMITED TO NON-INFRINGEMENT, MERCHANTABILITY, OR FITNESS FOR A PARTICULAR PURPOSE. + * See LICENSE in the root of the software repository for the full text of the License. + * ----------------------------------------------------------------------------------------------------------- + */ + +#pragma once + +#include +#include + +#include +#include + +/** + * Get the current stack trace, including file paths and line numbers. + * Implemented in common.cpp. + */ +std::string get_stacktrace(int skip_frames = 1); + +/** + * Assertion failure exception with condition, file, line, and stack trace. 
+ */ +class AssertionError : public std::runtime_error { + public: + AssertionError(const char* condition, const char* file, int line); + + const char* condition() const { return condition_; } + const char* file() const { return file_; } + int line() const { return line_; } + + private: + const char* condition_; + const char* file_; + int line_; +}; + +/** + * Assertion failure handler. + * Implemented in common.cpp. + */ +[[noreturn]] void assert_impl(const char* condition, const char* file, int line); + +/** + * debug_assert macro: + * checks the condition in debug builds and throws with a stack trace on failure. + * It is a no-op in release builds (NDEBUG). + */ +#ifdef NDEBUG +#define debug_assert(cond) ((void)0) +#else +#define debug_assert(cond) \ + do { \ + if (!(cond)) { \ + assert_impl(#cond, __FILE__, __LINE__); \ + } \ + } while (0) +#endif + +/** + * always_assert macro: + * checks the condition in both debug and release builds. + */ +#define always_assert(cond) \ + do { \ + if (!(cond)) { \ + assert_impl(#cond, __FILE__, __LINE__); \ + } \ + } while (0) + +#define PTO_PRAGMA(x) _Pragma(#x) + +#if defined(__clang__) +#define MAYBE_UNINITIALIZED_BEGIN \ + PTO_PRAGMA(clang diagnostic push) \ + PTO_PRAGMA(clang diagnostic ignored "-Wuninitialized") \ + PTO_PRAGMA(clang diagnostic ignored "-Wsometimes-uninitialized") +#define MAYBE_UNINITIALIZED_END PTO_PRAGMA(clang diagnostic pop) +#elif defined(__GNUC__) +#define MAYBE_UNINITIALIZED_BEGIN \ + PTO_PRAGMA(GCC diagnostic push) \ + PTO_PRAGMA(GCC diagnostic ignored "-Wuninitialized") \ + PTO_PRAGMA(GCC diagnostic ignored "-Wmaybe-uninitialized") +#define MAYBE_UNINITIALIZED_END PTO_PRAGMA(GCC diagnostic pop) +#else +#define MAYBE_UNINITIALIZED_BEGIN +#define MAYBE_UNINITIALIZED_END +#endif diff --git a/src/a2a3/runtime/tensormap_and_ringbuffer_unmodified/runtime/pto2_dispatch_payload.h b/src/a2a3/runtime/tensormap_and_ringbuffer_unmodified/runtime/pto2_dispatch_payload.h new file mode 100644 index 
000000000..914a3f92a --- /dev/null +++ b/src/a2a3/runtime/tensormap_and_ringbuffer_unmodified/runtime/pto2_dispatch_payload.h @@ -0,0 +1,85 @@ +/* + * Copyright (c) PyPTO Contributors. + * This program is free software, you can redistribute it and/or modify it under the terms and conditions of + * CANN Open Software License Agreement Version 2.0 (the "License"). + * Please refer to the License for details. You may not use this file except in compliance with the License. + * THIS SOFTWARE IS PROVIDED ON AN "AS IS" BASIS, WITHOUT WARRANTIES OF ANY KIND, EITHER EXPRESS OR IMPLIED, + * INCLUDING BUT NOT LIMITED TO NON-INFRINGEMENT, MERCHANTABILITY, OR FITNESS FOR A PARTICULAR PURPOSE. + * See LICENSE in the root of the software repository for the full text of the License. + * ----------------------------------------------------------------------------------------------------------- + */ + +/** + * @file pto2_dispatch_payload.h + * @brief Per-core dispatch payload for AICore kernel execution + * + * PTO2DispatchPayload holds the kernel function address, a per-core args[] + * array, and embedded SPMD context (LocalContext + GlobalContext). AICPU + * maintains a static array of these (one per core). + * + * GlobalContext (sub_block_id) is initialized once at runtime startup via + * init_global_context() and never modified afterwards. + * + * LocalContext (block_idx, block_num) and args[] are rebuilt by + * build_payload() before each dispatch. Both context struct pointers are + * written into the args[] suffix on every dispatch (since args[] is rebuilt + * entirely each time). + * + * AICore caches a pointer to its per-core slot at startup and reads from + * it on each dispatch. The struct is cache-line aligned to avoid false + * sharing across concurrently dispatched cores. + * + * The DATA_MAIN_BASE register protocol is unchanged from the base runtime: + * a monotonically increasing reg_task_id signals new work to AICore. 
+ */ + +#pragma once + +#include <cstdint> + +#include "intrinsic.h" +#include "pto_types.h" + +/** Max dispatch arguments: 128 scalars + up to 16 tensor pointers + ext params */ +#ifndef PTO2_DISPATCH_MAX_ARGS +#define PTO2_DISPATCH_MAX_ARGS (MAX_TENSOR_ARGS + MAX_SCALAR_ARGS + PTO2_EXT_PARAMS_COUNT) +#endif + +#ifndef PTO2_ALIGN_UP +#define PTO2_ALIGN_UP(x, align) (((x) + (align) - 1) & ~((align) - 1)) +#endif + +// Verify hardcoded indices in intrinsic.h match the computed values. +static_assert((MAX_TENSOR_ARGS + MAX_SCALAR_ARGS) == SPMD_LOCAL_CONTEXT_INDEX, + "LOCAL_CONTEXT_INDEX out of sync with intrinsic.h"); +static_assert((MAX_TENSOR_ARGS + MAX_SCALAR_ARGS + 1) == SPMD_GLOBAL_CONTEXT_INDEX, + "GLOBAL_CONTEXT_INDEX out of sync with intrinsic.h"); + +/** + * Per-core dispatch payload: function address + args[] + SPMD context. + * + * AICPU maintains a static array s_pto2_payload_per_core[RUNTIME_MAX_WORKER]. + * AICore caches a pointer to its per-core slot at startup (via Handshake.task) + * and reads from it on each dispatch. + * + * The struct is cache-line aligned to prevent false sharing across + * concurrently dispatched cores. + */ +struct alignas(64) PTO2DispatchPayload { + uint64_t function_bin_addr; /**< Kernel entry address in GM (set by Scheduler) */ + uint64_t args[PTO2_DISPATCH_MAX_ARGS]; /**< Kernel arguments (GM pointers + scalars + ext params) */ + + /** Per-dispatch context: block_idx and block_num. + * Written by build_payload() before each dispatch. + * args[SPMD_LOCAL_CONTEXT_INDEX] points here. */ + LocalContext local_context; + + /** Per-core global context: sub_block_id (AIV lane identity). + * Initialized once by init_global_context() at runtime startup. + * args[SPMD_GLOBAL_CONTEXT_INDEX] points here. 
*/ + GlobalContext global_context; + + static_assert(sizeof(args[0]) == 8); + static_assert(PTO2_ALIGN_UP((MAX_TENSOR_ARGS + MAX_SCALAR_ARGS) * sizeof(args[0]), 64) == + (MAX_TENSOR_ARGS + MAX_SCALAR_ARGS) * sizeof(args[0])); +}; diff --git a/src/a2a3/runtime/tensormap_and_ringbuffer_unmodified/runtime/pto_orchestrator.cpp b/src/a2a3/runtime/tensormap_and_ringbuffer_unmodified/runtime/pto_orchestrator.cpp new file mode 100644 index 000000000..9a6b5fad8 --- /dev/null +++ b/src/a2a3/runtime/tensormap_and_ringbuffer_unmodified/runtime/pto_orchestrator.cpp @@ -0,0 +1,759 @@ +/* + * Copyright (c) PyPTO Contributors. + * This program is free software, you can redistribute it and/or modify it under the terms and conditions of + * CANN Open Software License Agreement Version 2.0 (the "License"). + * Please refer to the License for details. You may not use this file except in compliance with the License. + * THIS SOFTWARE IS PROVIDED ON AN "AS IS" BASIS, WITHOUT WARRANTIES OF ANY KIND, EITHER EXPRESS OR IMPLIED, + * INCLUDING BUT NOT LIMITED TO NON-INFRINGEMENT, MERCHANTABILITY, OR FITNESS FOR A PARTICULAR PURPOSE. + * See LICENSE in the root of the software repository for the full text of the License. + * ----------------------------------------------------------------------------------------------------------- + */ + +/** + * PTO Runtime2 - Orchestrator Implementation + * + * Implements orchestrator state management, scope handling, and task submission. 
+ * + * Based on: docs/RUNTIME_LOGIC.md + */ + +#include "pto_orchestrator.h" + +#include <atomic> +#include <cassert> +#include <cinttypes> +#include <cstdlib> +#include <cstring> + +#include "common/unified_log.h" +#include "pto_runtime2_types.h" +#include "pto_shared_memory.h" +#include "pto_tensormap.h" +#include "pto_types.h" +#include "tensor.h" + +// ============================================================================= +// Orchestrator Profiling (compile-time toggle) +// ============================================================================= +#if PTO2_ORCH_PROFILING +#include "aicpu/device_time.h" +#include "aicpu/performance_collector_aicpu.h" +// Weak fallback for builds that don't link device_time.cpp (e.g. host). +// The strong symbol from platform/.../device_time.cpp wins in the AICPU build. +// +// IMPORTANT: visibility("hidden") is required to prevent the HOST .so from +// exporting this weak fallback into the global dynamic symbol table via +// RTLD_GLOBAL. Without it, when the AICPU .so is loaded and its PLT entry +// for get_sys_cnt_aicpu is resolved, the dynamic linker finds the HOST .so's +// weak definition first (already in global table) and uses it — returning 0. +// With hidden visibility, the HOST .so does not export this symbol globally, +// so the AICPU .so's PLT resolves to its own strong definition from +// device_time.cpp. +__attribute__((weak, visibility("hidden"))) uint64_t get_sys_cnt_aicpu() { return 0; } +// Weak fallback for builds that don't link performance_collector_aicpu.cpp. +// The strong symbol from the AICPU build wins when profiling is available. +// Also hidden to prevent HOST .so from polluting the global symbol table. 
+__attribute__((weak, visibility("hidden"))) void perf_aicpu_record_orch_phase( + AicpuPhaseId, uint64_t, uint64_t, uint32_t, uint64_t) {} +// Accumulated cycles per sub-step (only needed for ORCH_PROFILING export) +static uint64_t g_orch_sync_cycle = 0; // tensormap sync +static uint64_t g_orch_alloc_cycle = 0; // unified task+heap alloc +static uint64_t g_orch_args_cycle = 0; // param copy +static uint64_t g_orch_lookup_cycle = 0; // tensormap lookup + dep building +static uint64_t g_orch_insert_cycle = 0; // tensormap insert +static uint64_t g_orch_fanin_cycle = 0; // fanin list + early-return check +static uint64_t g_orch_scope_end_cycle = 0; // scope_end overhead +static int64_t g_orch_submit_count = 0; +static uint32_t g_orch_submit_idx = 0; +uint64_t g_orch_alloc_wait_cycle = 0; +uint64_t g_orch_fanin_wait_cycle = 0; +uint64_t g_orch_alloc_atomic_count = 0; +uint64_t g_orch_args_atomic_count = 0; +uint64_t g_orch_fanin_atomic_count = 0; +uint64_t g_orch_finalize_atomic_count = 0; +uint64_t g_orch_scope_end_atomic_count = 0; +#define CYCLE_COUNT_START() uint64_t _t0 = get_sys_cnt_aicpu(), _t1 +#define CYCLE_COUNT_LAP(acc) \ + do { \ + _t1 = get_sys_cnt_aicpu(); \ + acc += (_t1 - _t0); \ + _t0 = _t1; \ + } while (0) +#define CYCLE_COUNT_LAP_RECORD(acc, phase_id, tid) \ + do { \ + _t1 = get_sys_cnt_aicpu(); \ + acc += (_t1 - _t0); \ + perf_aicpu_record_orch_phase((phase_id), _t0, _t1, g_orch_submit_idx, (tid)); \ + _t0 = _t1; \ + } while (0) +#elif PTO2_PROFILING +#include "aicpu/device_time.h" +#include "aicpu/performance_collector_aicpu.h" +__attribute__((weak, visibility("hidden"))) uint64_t get_sys_cnt_aicpu() { return 0; } +__attribute__((weak, visibility("hidden"))) void perf_aicpu_record_orch_phase( + AicpuPhaseId, uint64_t, uint64_t, uint32_t, uint64_t) {} +// submit_idx needed for swimlane task_id tagging (no cycle accumulation at this level) +static uint32_t g_orch_submit_idx = 0; +#define CYCLE_COUNT_START() \ + bool _prof_active = 
orch->enable_profiling; \ + uint64_t _t0 = _prof_active ? get_sys_cnt_aicpu() : 0, _t1 = 0 +#define CYCLE_COUNT_LAP(acc) \ + do { \ + } while (0) +#define CYCLE_COUNT_LAP_RECORD(acc, phase_id, tid) \ + do { \ + if (_prof_active) { \ + _t1 = get_sys_cnt_aicpu(); \ + perf_aicpu_record_orch_phase((phase_id), _t0, _t1, g_orch_submit_idx, (tid)); \ + _t0 = _t1; \ + } \ + } while (0) +#else +#define CYCLE_COUNT_START() +#define CYCLE_COUNT_LAP(acc) +#define CYCLE_COUNT_LAP_RECORD(acc, phase_id, tid) +#endif + +static bool pto2_append_fanin_or_fail(PTO2OrchestratorState* orch, + PTO2TaskId task_id, + int32_t tensor_arg_index, + TensorArgType ptype, + PTO2TaskSlotState* prod_state, + PTO2TaskSlotState* fanin_states[], + int32_t* fanin_count, + const char* reason) { + for (int32_t j = 0; j < *fanin_count; j++) { + if (fanin_states[j] == prod_state) { + return true; + } + } + + if (*fanin_count >= PTO2_MAX_INPUTS) { + LOG_ERROR("========================================"); + LOG_ERROR("FATAL: Dependency Overflow Detected!"); + LOG_ERROR("========================================"); + LOG_ERROR("Task requires more than PTO2_MAX_INPUTS unique fanin dependencies."); + LOG_ERROR(" task_id.raw: %" PRIu64, task_id.raw); + LOG_ERROR(" tensor_arg_index: %d", tensor_arg_index); + LOG_ERROR(" tensor_arg_type: %d", static_cast<int>(ptype)); + LOG_ERROR(" fanin_count: %d / %d", *fanin_count, PTO2_MAX_INPUTS); + LOG_ERROR(" reason: %s", reason); + LOG_ERROR("This is a runtime dependency-tracking limit."); + LOG_ERROR("========================================"); + orch->sm_handle->header->orch_error_code.store(PTO2_ERROR_DEPENDENCY_OVERFLOW, std::memory_order_release); + orch->fatal = true; + return false; + } + + fanin_states[(*fanin_count)++] = prod_state; + return true; +} + +// ============================================================================= +// Orchestrator Initialization +// ============================================================================= + +bool 
pto2_orchestrator_init(PTO2OrchestratorState* orch, + PTO2SharedMemoryHandle* sm_handle, + void* gm_heap, + uint64_t heap_size, + int32_t dep_pool_capacity) { + *orch = PTO2OrchestratorState{}; + + orch->sm_handle = sm_handle; + orch->gm_heap_base = gm_heap; + orch->gm_heap_size = heap_size * PTO2_MAX_RING_DEPTH; + orch->fatal = false; + + // Initialize per-ring resources + for (int r = 0; r < PTO2_MAX_RING_DEPTH; r++) { + void* ring_heap_base = reinterpret_cast<char*>(gm_heap) + r * heap_size; + auto& fc = sm_handle->header->rings[r].fc; + + // Initialize unified task allocator + orch->rings[r].task_allocator.init(sm_handle->task_descriptors[r], + sm_handle->header->rings[r].task_window_size, + &fc.current_task_index, + &fc.last_task_alive, + ring_heap_base, + heap_size, + &sm_handle->header->orch_error_code); + + // Allocate and initialize dependency list pool (per-ring) + PTO2DepListEntry* dep_entries = + reinterpret_cast<PTO2DepListEntry*>(calloc(dep_pool_capacity, sizeof(PTO2DepListEntry))); + if (!dep_entries) { + // Cleanup previously allocated rings + for (int j = 0; j < r; j++) { + free(orch->rings[j].dep_pool.base); + } + return false; + } + orch->rings[r].dep_pool.init(dep_entries, dep_pool_capacity, &sm_handle->header->orch_error_code); + } + + // Initialize TensorMap with per-ring task window sizes + int32_t task_window_sizes[PTO2_MAX_RING_DEPTH]; + for (int r = 0; r < PTO2_MAX_RING_DEPTH; r++) { + task_window_sizes[r] = sm_handle->header->rings[r].task_window_size; + } + if (!orch->tensor_map.init_default(task_window_sizes)) { + for (int r = 0; r < PTO2_MAX_RING_DEPTH; r++) { + free(orch->rings[r].dep_pool.base); + } + return false; + } + orch->tensor_map.orch = orch; + + // Initialize scope stack: one flat buffer for task IDs + one array for begin offsets + uint64_t max_depth = PTO2_MAX_SCOPE_DEPTH; + int32_t init_cap = PTO2_SCOPE_TASKS_INIT_CAP; + orch->scope_tasks = reinterpret_cast<PTO2TaskSlotState**>(malloc(init_cap * sizeof(PTO2TaskSlotState*))); + orch->scope_begins = 
reinterpret_cast<int32_t*>(malloc(max_depth * sizeof(int32_t))); + if (!orch->scope_tasks || !orch->scope_begins) { + free(orch->scope_tasks); + free(orch->scope_begins); + for (int r = 0; r < PTO2_MAX_RING_DEPTH; r++) { + free(orch->rings[r].dep_pool.base); + } + orch->tensor_map.destroy(); + return false; + } + orch->scope_tasks_size = 0; + orch->scope_tasks_capacity = init_cap; + orch->scope_stack_top = -1; + orch->scope_stack_capacity = max_depth; + + return true; +} + +void pto2_orchestrator_destroy(PTO2OrchestratorState* orch) { + orch->tensor_map.destroy(); + + for (int r = 0; r < PTO2_MAX_RING_DEPTH; r++) { + free(orch->rings[r].dep_pool.base); + orch->rings[r].dep_pool.base = NULL; + } + + free(orch->scope_tasks); + orch->scope_tasks = NULL; + free(orch->scope_begins); + orch->scope_begins = NULL; +} + +void pto2_orchestrator_set_scheduler(PTO2OrchestratorState* orch, PTO2SchedulerState* scheduler) { + orch->scheduler = scheduler; +} + +// ============================================================================= +// Scope Management +// ============================================================================= + +static void scope_tasks_push(PTO2OrchestratorState* orch, PTO2TaskSlotState* task_slot_state) { + if (orch->scope_tasks_size >= orch->scope_tasks_capacity) { + int32_t new_cap = orch->scope_tasks_capacity * 2; + PTO2TaskSlotState** new_buf = + reinterpret_cast<PTO2TaskSlotState**>(realloc(orch->scope_tasks, new_cap * sizeof(PTO2TaskSlotState*))); + assert(new_buf && "Failed to grow scope task buffer"); + orch->scope_tasks = new_buf; + orch->scope_tasks_capacity = new_cap; + } + orch->scope_tasks[orch->scope_tasks_size++] = task_slot_state; +} + +void pto2_scope_begin(PTO2OrchestratorState* orch) { + if (orch->fatal) { + return; + } + assert(orch->scope_stack_top < static_cast<int32_t>(orch->scope_stack_capacity - 1) && "Scope stack overflow"); + + ++orch->scope_stack_top; + orch->scope_begins[orch->scope_stack_top] = orch->scope_tasks_size; +} + +void 
pto2_scope_end(PTO2OrchestratorState* orch) { + if (orch->fatal) { + return; + } + assert(orch->scope_stack_top >= 0 && "Scope stack underflow"); + +#if PTO2_ORCH_PROFILING + uint64_t _se0 = get_sys_cnt_aicpu(); +#endif + + int32_t begin = orch->scope_begins[orch->scope_stack_top--]; + int32_t count = orch->scope_tasks_size - begin; + + if (orch->scheduler && count > 0) { + orch->scheduler->on_scope_end(&orch->scope_tasks[begin], count); + } + + // Rewind the task buffer — these entries are no longer needed + orch->scope_tasks_size = begin; + +#if PTO2_ORCH_PROFILING + uint64_t _se1 = get_sys_cnt_aicpu(); + g_orch_scope_end_cycle += (_se1 - _se0); + // perf_aicpu_record_orch_phase(AicpuPhaseId::ORCH_SCOPE_END, _se0, _se1, g_orch_submit_idx, -1); +#endif +} + +// ============================================================================= +// Task Submission +// ============================================================================= +TaskOutputTensors pto2_submit_mixed_task( + PTO2OrchestratorState* orch, const MixedKernels& mixed_kernels, const Arg& args) { + CYCLE_COUNT_START(); + + TaskOutputTensors result; + + // Fast path after fatal error — all subsequent submits are no-ops + if (orch->fatal) { + return result; + } + + // Validate Arg construction (errors recorded by add_input/add_output/etc.) + if (args.has_error) { + LOG_ERROR("========================================"); + LOG_ERROR("FATAL: Invalid Arg Detected!"); + LOG_ERROR("========================================"); + LOG_ERROR("Error: %s", args.error_msg ? 
args.error_msg : "(unknown)"); + LOG_ERROR(" tensor_count: %d, scalar_count: %d", args.tensor_count(), args.scalar_count()); + LOG_ERROR("This is a bug in the orchestration code."); + LOG_ERROR("========================================"); + orch->sm_handle->header->orch_error_code.store(PTO2_ERROR_INVALID_ARGS, std::memory_order_release); + orch->fatal = true; + return result; + } + + // Determine which ring this task belongs to + uint8_t ring_id = orch->current_ring_id(); + auto& allocator = orch->rings[ring_id].task_allocator; + PTO2SchedulerState* sched = orch->scheduler; + PTO2RingFlowControl& fc = orch->sm_handle->header->rings[ring_id].fc; + + // === Validate submit inputs === + uint8_t active_mask = pto2_mixed_kernels_to_active_mask(mixed_kernels); + always_assert(active_mask != 0 && "MixedKernels must have at least one active slot"); + + int16_t block_num = args.launch_spec.block_num(); + always_assert(block_num >= 1 && "block_num must be >= 1"); + + // Normalize single-AIV tasks: if only aiv1 is set (no aic, no aiv0), move + // it to the aiv0 slot. This guarantees the dispatch path can always use + // PTO2SubtaskSlot::AIV0 for single-AIV shapes without inspecting active_mask. + // Mixed tasks (AIC+AIV) keep their original AIV identity so the correct + // hardware channel (AIV0→AIC vs AIV1→AIC) is used at dispatch time. 
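The single-AIV normalization rule described in the comment above can be modeled in isolation. A minimal sketch under assumed mask constants and a simplified kernel struct — `MASK_AIC`, `Kernels`, `active_mask`, and `normalize` below are illustrative stand-ins, not the runtime's real types:

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-ins for the runtime's mask constants and kernel IDs.
constexpr uint8_t MASK_AIC  = 1u << 0;
constexpr uint8_t MASK_AIV0 = 1u << 1;
constexpr uint8_t MASK_AIV1 = 1u << 2;
constexpr int32_t KERNEL_INVALID = -1;

struct Kernels {
    int32_t aic  = KERNEL_INVALID;
    int32_t aiv0 = KERNEL_INVALID;
    int32_t aiv1 = KERNEL_INVALID;
};

uint8_t active_mask(const Kernels& k) {
    uint8_t m = 0;
    if (k.aic  != KERNEL_INVALID) m |= MASK_AIC;
    if (k.aiv0 != KERNEL_INVALID) m |= MASK_AIV0;
    if (k.aiv1 != KERNEL_INVALID) m |= MASK_AIV1;
    return m;
}

// If only AIV1 is active, move it to the AIV0 slot so the single-AIV dispatch
// path can always use slot AIV0; mixed AIC+AIV tasks keep their AIV identity.
Kernels normalize(Kernels k) {
    if (active_mask(k) == MASK_AIV1) {
        k.aiv0 = k.aiv1;
        k.aiv1 = KERNEL_INVALID;
    }
    return k;
}
```

A solo-AIV1 task collapses to the AIV0 slot, while an AIC+AIV1 task is left untouched so the correct hardware channel survives to dispatch.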
+ MixedKernels normalized = mixed_kernels; + bool has_aic = (active_mask & PTO2_SUBTASK_MASK_AIC) != 0; + bool has_aiv0 = (active_mask & PTO2_SUBTASK_MASK_AIV0) != 0; + bool has_aiv1 = (active_mask & PTO2_SUBTASK_MASK_AIV1) != 0; + if (!has_aic && has_aiv1 && !has_aiv0) { + normalized.aiv0_kernel_id = normalized.aiv1_kernel_id; + normalized.aiv1_kernel_id = INVALID_KERNEL_ID; + active_mask = pto2_mixed_kernels_to_active_mask(normalized); + } + + // Submission without an open scope is illegal + always_assert(orch->scope_stack_top >= 0 && "Cannot submit task outside a scope"); + + // === Scope deadlock pre-check === + // Tasks within a scope hold a fanout_count reference released only at scope_end. + // If scope task count >= window_size, no slots can ever be reclaimed → deadlock. + { + int32_t scope_task_count = orch->scope_tasks_size - orch->scope_begins[orch->scope_stack_top]; + if (scope_task_count >= allocator.window_size() - 1) { + int32_t active_count = allocator.active_count(); + + LOG_ERROR("========================================"); + LOG_ERROR("FATAL: Scope Deadlock Detected! (ring %d)", ring_id); + LOG_ERROR("========================================"); + LOG_ERROR( + "Tasks in current scope (%d) >= task_window_size (%d).", scope_task_count, allocator.window_size()); + LOG_ERROR(" scope_depth: %d", orch->scope_stack_top + 1); + LOG_ERROR(" ring_id: %d", ring_id); + LOG_ERROR(" scope_task_count: %d", scope_task_count); + LOG_ERROR(" active_tasks: %d / %d", active_count, allocator.window_size()); + LOG_ERROR("Root Cause:"); + LOG_ERROR(" Tasks within a scope hold a fanout_count reference that is only"); + LOG_ERROR(" released at scope_end. When scope task count >= window_size,"); + LOG_ERROR(" no slots can be reclaimed -> deadlock."); + LOG_ERROR("Solution:"); + LOG_ERROR(" 1. Reduce tasks per scope (use batching/unroll)"); + LOG_ERROR(" 2. 
Increase task window (current: %d)", allocator.window_size());
+        LOG_ERROR("     Compile-time: PTO2_TASK_WINDOW_SIZE in pto_runtime2_types.h");
+        LOG_ERROR("     Runtime env: PTO2_RING_TASK_WINDOW=");
+        LOG_ERROR("  3. Split work across multiple scopes");
+        LOG_ERROR("========================================");
+        orch->sm_handle->header->orch_error_code.store(PTO2_ERROR_SCOPE_DEADLOCK, std::memory_order_release);
+        orch->fatal = true;
+        return result;
+        }
+    }
+
+    // === Calculate output size (from runtime-created OUTPUT args) ===
+    uint64_t offsets[MAX_TENSOR_ARGS] = {};
+    uint64_t buffer_sizes[MAX_TENSOR_ARGS] = {};
+    int32_t total_output_size = 0;
+    for (int i = 0; i < args.tensor_count(); i++) {
+        if (args.tag(i) == TensorArgType::OUTPUT) {
+            offsets[i] = total_output_size;
+            buffer_sizes[i] = PTO2_ALIGN_UP(args.tensor(i).create_info->buffer_size_bytes(), PTO2_PACKED_OUTPUT_ALIGN);
+            total_output_size += buffer_sizes[i];
+        }
+    }
+
+    // === STEP 1: Unified alloc — task slot + packed output buffer (blocks until available) ===
+    PTO2TaskAllocResult alloc_result = allocator.alloc(total_output_size);
+    if (alloc_result.failed()) {
+        orch->fatal = true;
+        return result;
+    }
+
+    int32_t local_id = alloc_result.task_id;
+    int32_t slot = alloc_result.slot;
+    PTO2TaskId task_id = PTO2TaskId::make(ring_id, static_cast<uint32_t>(local_id));
+
+    PTO2TaskDescriptor& task = allocator.task_by_slot(slot);
+    PTO2TaskPayload* payload = &orch->sm_handle->task_payloads[ring_id][slot];
+
+    // Early write-prefetch payload GM cache lines to issue RFO in background.
+    // ~130 lines of computation (lookup, insert) follow before
+    // param_copy writes, giving ample time for prefetch to complete.
+    // Use locality=3 (PSTL1KEEP) so prefetched CLs survive lookup/insert eviction. 
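The prefetch calls that follow rely only on the standard GCC/Clang `__builtin_prefetch(addr, rw, locality)` hint, which never changes program results — `rw=1` requests write intent (a read-for-ownership) and `locality=3` keeps the line in cache. A small self-contained sketch of the same read-ahead pattern; the stride of 16 is an arbitrary illustration, not a tuned value:

```cpp
#include <cstdint>

// __builtin_prefetch is purely a performance hint: the sum is identical
// with or without it, only memory latency changes.
uint64_t sum_with_prefetch(const uint32_t* data, int n) {
    uint64_t total = 0;
    for (int i = 0; i < n; i++) {
        if (i + 16 < n) {
            __builtin_prefetch(&data[i + 16], 0, 3);  // read intent, high locality
        }
        total += data[i];
    }
    return total;
}
```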
+    for (int32_t i = 0; i < args.tensor_count(); i++) {
+        __builtin_prefetch(&payload->tensors[i], 1, 3);
+        __builtin_prefetch(reinterpret_cast<char*>(&payload->tensors[i]) + 64, 1, 3);
+    }
+    for (int32_t i = 0; i < args.scalar_count(); i += 8) {
+        __builtin_prefetch(&payload->scalars[i], 1, 3);
+    }
+    __builtin_prefetch(payload, 1, 3);
+    __builtin_prefetch(reinterpret_cast<char*>(payload) + 64, 1, 3);
+    __builtin_prefetch(reinterpret_cast<char*>(payload) + 128, 1, 3);
+
+    // Initialize slot state (scheduler-private)
+    if (sched) {
+        auto& rs = sched->ring_sched_states[ring_id];
+        PTO2TaskSlotState& slot_state = rs.get_slot_state_by_slot(slot);
+        slot_state.fanin_count = 0;
+        slot_state.fanout_head = nullptr;
+        slot_state.fanout_lock.store(0, std::memory_order_relaxed);
+        // Initial fanout_count = 1 (the owning scope holds one reference)
+        slot_state.fanout_count = 1;
+        slot_state.fanout_refcount.store(0, std::memory_order_release);
+        slot_state.fanin_refcount.store(0, std::memory_order_release);
+        slot_state.payload = payload;
+        slot_state.task = &task;
+        slot_state.active_mask = active_mask;
+        slot_state.subtask_done_mask.store(0, std::memory_order_relaxed);
+        slot_state.ring_id = ring_id;
+        scope_tasks_push(orch, &slot_state);
+    } else {
+        scope_tasks_push(orch, nullptr);
+    }
+
+    // Temporary storage for fanin (cached slot state pointers, avoids repeated ring/slot lookups)
+    PTO2TaskSlotState* fanin_states[PTO2_MAX_INPUTS];
+    int32_t fanin_count = 0;
+
+    CYCLE_COUNT_LAP_RECORD(g_orch_alloc_cycle, AicpuPhaseId::ORCH_ALLOC, task_id.raw);
+
+#if PTO2_PROFILING
+    if (total_output_size > 0) {
+        orch->buffers_allocated++;
+        orch->bytes_allocated += total_output_size;
+    }
+#endif
+
+    // === STEP 2: Sync TensorMap validity and optional cleanup ===
+    // Read current last_task_alive from shared memory for this ring
+    int32_t sm_last_task_alive = fc.last_task_alive.load(std::memory_order_acquire);
+
+    orch->tensor_map.sync_tensormap(ring_id, sm_last_task_alive);
+
+    if (sched) {
+        
orch->rings[ring_id].dep_pool.reclaim(*sched, ring_id, sm_last_task_alive); + } + + CYCLE_COUNT_LAP_RECORD(g_orch_sync_cycle, AicpuPhaseId::ORCH_SYNC, task_id.raw); + + // === STEP 3: Lookup inputs + materialize runtime-created outputs === + for (int i = 0; i < args.tensor_count(); i++) { + TensorArgType ptype = args.tag(i); + if (ptype == TensorArgType::OUTPUT) { + // Runtime-created OUTPUT tensors are not looked up in the TensorMap since they have no dependencies. + continue; + } + + const Tensor* tensor = args.tensor(i).ptr; + + // Step A: creator retention — all existing tensors extend their creator lifetime. + PTO2TaskId owner = tensor->owner_task_id; + if (owner.is_valid() && sched != nullptr) { + PTO2TaskSlotState* prod_state = + &sched->ring_sched_states[owner.ring()].get_slot_state_by_task_id(owner.local()); + if (!pto2_append_fanin_or_fail( + orch, task_id, i, ptype, prod_state, fanin_states, &fanin_count, "creator retention")) { + return result; + } + } + + // Step B: only INPUT/INOUT need modifier dependency lookup. 
+ if (ptype != TensorArgType::INPUT && ptype != TensorArgType::INOUT) { + continue; + } + if (tensor->manual_dep) { + continue; + } + + PTO2LookupResult lookup_result; + orch->tensor_map.lookup(*tensor, lookup_result); + + for (int r = 0; r < lookup_result.count; r++) { + PTO2TensorMapEntry& entry = *lookup_result.entries[r].entry; + auto overlap_status = lookup_result.entries[r].overlap_status; + auto prod_ring = entry.producer_task_id.ring(); + auto prod_local = entry.producer_task_id.local(); + PTO2TaskSlotState* prod_state = &sched->ring_sched_states[prod_ring].get_slot_state_by_task_id(prod_local); + if (!pto2_append_fanin_or_fail( + orch, task_id, i, ptype, prod_state, fanin_states, &fanin_count, "overlap lookup")) { + return result; + } + if (ptype == TensorArgType::INOUT && overlap_status == OverlapStatus::COVERED) { + orch->tensor_map.remove_entry(entry); + } + } + } + + CYCLE_COUNT_LAP_RECORD(g_orch_lookup_cycle, AicpuPhaseId::ORCH_LOOKUP, task_id.raw); + + // === STEP 4: Register outputs/inouts in TensorMap (must be separate from lookup) === + { + for (int i = 0; i < args.tensor_count(); i++) { + TensorArgType ptype = args.tag(i); + if (ptype == TensorArgType::INOUT || ptype == TensorArgType::OUTPUT_EXISTING) { + if (!args.tensor(i).ptr->manual_dep) { + orch->tensor_map.insert(*args.tensor(i).ptr, task_id); + } + } + } + } + + CYCLE_COUNT_LAP_RECORD(g_orch_insert_cycle, AicpuPhaseId::ORCH_INSERT, task_id.raw); + + // === STEP 5: Batch-write to GM (single cache line burst) === + // Deferred from allocation phase to avoid scattered GM writes that get + // evicted by TensorMap lookup/insert cache pressure. 
+    __builtin_prefetch(&task, 1, 1);
+    task.task_id = task_id;
+    task.kernel_id[static_cast<int>(PTO2SubtaskSlot::AIC)] = normalized.aic_kernel_id;
+    task.kernel_id[static_cast<int>(PTO2SubtaskSlot::AIV0)] = normalized.aiv0_kernel_id;
+    task.kernel_id[static_cast<int>(PTO2SubtaskSlot::AIV1)] = normalized.aiv1_kernel_id;
+    task.packed_buffer_base = alloc_result.packed_base;
+    task.packed_buffer_end = alloc_result.packed_end;
+
+    // Prefetch producer slot_states and cur_slot_state (written at init but likely
+    // evicted by lookup/insert/heap). param_copy below provides hide time.
+    if (sched) {
+        auto& rs = sched->ring_sched_states[ring_id];
+        __builtin_prefetch(&rs.get_slot_state_by_slot(slot), 1, 0);
+        for (int i = 0; i < fanin_count; i++) {
+            __builtin_prefetch(fanin_states[i], 1, 0);
+        }
+    }
+
+    payload->init(args, result, alloc_result.packed_base, offsets, buffer_sizes);
+
+    // Write owner_task_id into materialized OUTPUT tensors so creator-only dependency
+    // tracking remains available even when manual_dep skips TensorMap publication.
+    for (int i = 0; i < args.tensor_count(); i++) {
+        if (args.tag(i) == TensorArgType::OUTPUT) {
+            payload->tensors[i].owner_task_id = task_id;
+        }
+    }
+
+    CYCLE_COUNT_LAP_RECORD(g_orch_args_cycle, AicpuPhaseId::ORCH_PARAMS, task_id.raw);
+#if PTO2_ORCH_PROFILING
+    g_orch_args_atomic_count += 2;  // fanout_lock.store + fanout_count.store
+#endif
+
+    // === STEP 6: Finalize fanin list ===
+    // First build the fanin list
+    if (sched) {
+        auto& rs = sched->ring_sched_states[ring_id];
+        PTO2TaskSlotState& cur_slot_state = rs.get_slot_state_by_slot(slot);
+        // Initialize scheduler state BEFORE adding to producer fanout lists,
+        // so concurrent on_mixed_task_complete can safely access task_state/fanout_refcount. 
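The init-before-publish rule stated above is the classic release/acquire publication pattern: plain initialization writes must be ordered before the store that makes the object reachable. A minimal stand-alone model, where `SlotState` is a hypothetical struct holding just two of the real fields and `published` plays the role of the producer's fanout list:

```cpp
#include <atomic>
#include <thread>

// Model of the init-before-publish rule: a consumer slot's fields must be
// written BEFORE the slot becomes reachable from another thread. The release
// store orders the plain writes ahead of visibility; the acquire load pairs
// with it, so a reader never observes an uninitialized slot.
struct SlotState {
    int task_state = 0;
    int fanout_refcount = -1;
};

std::atomic<SlotState*> published{nullptr};

void submitter(SlotState* s) {
    s->task_state = 1;       // e.g. PENDING
    s->fanout_refcount = 0;  // initialized before publication
    published.store(s, std::memory_order_release);  // now visible to completers
}

int completer() {
    SlotState* s = nullptr;
    while ((s = published.load(std::memory_order_acquire)) == nullptr) {
        // spin until published
    }
    return s->task_state;  // guaranteed to see the initialized value
}
```

Reversing the order — publishing first, initializing after — would let a concurrent completion handler read garbage state, which is exactly what the comment above guards against.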
+        cur_slot_state.task_state.store(PTO2_TASK_PENDING, std::memory_order_relaxed);
+        cur_slot_state.fanout_refcount.store(0, std::memory_order_relaxed);
+        cur_slot_state.completed_subtasks.store(0, std::memory_order_relaxed);
+        cur_slot_state.total_required_subtasks = static_cast<int16_t>(block_num * __builtin_popcount(active_mask));
+        cur_slot_state.block_num = block_num;
+        cur_slot_state.next_block_idx = 0;
+
+        auto& dep_pool = orch->rings[ring_id].dep_pool;
+        // Ensure dep pool has space: fanin_count entries + 1 pre-alloc
+        dep_pool.ensure_space(*sched, fc, ring_id, fanin_count + 1);
+
+        int32_t early_finished = 0;
+        cur_slot_state.fanin_count = fanin_count + 1;  // +1 guard so the task cannot become ready prematurely
+        payload->fanin_actual_count = fanin_count;
+        for (int i = 0; i < fanin_count; i++) {
+            payload->fanin_slot_states[i] = fanin_states[i];
+        }
+        for (int i = 0; i < fanin_count; i++) {
+            PTO2TaskSlotState& producer_slot_state = *fanin_states[i];
+#if PTO2_ORCH_PROFILING
+            pto2_fanout_lock(producer_slot_state, g_orch_fanin_atomic_count, g_orch_fanin_wait_cycle);
+#else
+            pto2_fanout_lock(producer_slot_state);
+#endif
+            // Normal path: prepend consumer to producer's fanout list
+            producer_slot_state.fanout_count += 1;
+            int32_t prod_state = producer_slot_state.task_state.load(std::memory_order_acquire);
+            if (prod_state >= PTO2_TASK_COMPLETED) {
+                // Early-return optimization: the producer already completed, so skip the
+                // fanout-list insertion and count this edge as already released.
+                early_finished++;
+            } else {
+                producer_slot_state.fanout_head = dep_pool.prepend(producer_slot_state.fanout_head, &cur_slot_state);
+            }
+            pto2_fanout_unlock(producer_slot_state);
+        }
+        // Combined release: merge early_finished batch with the +1 init release
+        // into a single atomic fetch_add (saves one acq_rel cache-line bounce per task). 
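The combined-release bookkeeping above can be sketched as a tiny refcount gate: `fanin_count` is set to real producers + 1 (the guard), and readiness fires exactly once, when accumulated releases reach that threshold. `FaninGate` below is an illustrative model, not the runtime's type:

```cpp
#include <atomic>
#include <cstdint>

// Sketch of the fanin guard: fanin_count = real_fanins + 1, so the task
// cannot become ready while edges are still being linked. The submitter
// releases the guard (+1) together with any already-completed producers
// (early_finished) in one fetch_add; each remaining producer releases one
// on completion. The release that reaches the threshold makes the task ready.
struct FaninGate {
    int32_t fanin_count = 0;                // real_fanins + 1 (guard)
    std::atomic<int32_t> fanin_refcount{0};

    // Returns true when this release makes the task ready.
    bool release(int32_t n) {
        int32_t rc = fanin_refcount.fetch_add(n, std::memory_order_acq_rel) + n;
        return rc >= fanin_count;
    }
};
```

For example, with three producers of which one already finished, the submitter releases 2 (guard + early-finished) and the last completing producer trips readiness.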
+        int32_t initial_refcount = early_finished + 1;  // +1 for the init release
+        int32_t new_rc =
+            cur_slot_state.fanin_refcount.fetch_add(initial_refcount, std::memory_order_acq_rel) + initial_refcount;
+        if (new_rc >= fanin_count + 1) {
+            PTO2ResourceShape shape = pto2_active_mask_to_shape(active_mask);
+            sched->ready_queues[static_cast<int>(shape)].push(&cur_slot_state);
+        }
+        // Record dep pool watermark in local slot state (used by tail reclamation)
+        cur_slot_state.dep_pool_mark = orch->rings[ring_id].dep_pool.top;
+#if PTO2_ORCH_PROFILING
+        // Per producer: fetch_add(fanout_count) + load(task_state) + store(unlock) = 3 atomics
+        // Lock atomics (loads + CAS) are counted inside pto2_fanout_lock
+        g_orch_fanin_atomic_count += fanin_count * 3;
+        if (early_finished > 0) {
+            g_orch_fanin_atomic_count += 1;  // fanin_refcount.fetch_add
+        }
+#endif
+    }
+
+    CYCLE_COUNT_LAP_RECORD(g_orch_fanin_cycle, AicpuPhaseId::ORCH_FANIN, task_id.raw);
+
+#if PTO2_PROFILING
+    orch->tasks_submitted++;
+#if PTO2_ORCH_PROFILING
+    g_orch_submit_count++;
+#endif
+    g_orch_submit_idx++;
+#endif
+    return result;
+}
+
+// =============================================================================
+// Flow Control
+// =============================================================================
+
+void pto2_orchestrator_done(PTO2OrchestratorState* orch) {
+    for (int r = 0; r < PTO2_MAX_RING_DEPTH; r++) {
+        int32_t total_tasks = orch->rings[r].task_allocator.active_count();
+        if (total_tasks > 0) {
+            LOG_INFO("=== [Orchestrator] ring %d: total_tasks=%d ===", r, total_tasks);
+        }
+        auto& pool = orch->rings[r].dep_pool;
+        if (pool.top > 0) {
+            LOG_INFO("=== [DepPool %d] top=%d tail=%d used=%d high_water=%d capacity=%d ===",
+                     r,
+                     pool.top,
+                     pool.tail,
+                     pool.top - pool.tail,
+                     pool.high_water,
+                     pool.capacity);
+        }
+    }
+    orch->sm_handle->header->orchestrator_done.store(1, std::memory_order_release);
+#if !PTO2_ORCH_PROFILING && PTO2_PROFILING
+    g_orch_submit_idx = 0;
+#endif
+}
+
+// 
============================================================================= +// Debug Utilities +// ============================================================================= + +void pto2_orchestrator_print_stats(PTO2OrchestratorState* orch) { + LOG_INFO("=== Orchestrator Statistics ==="); +#if PTO2_PROFILING + LOG_INFO("Tasks submitted: %" PRId64, orch->tasks_submitted); + LOG_INFO("Buffers allocated: %" PRId64, orch->buffers_allocated); + LOG_INFO("Bytes allocated: %" PRId64, orch->bytes_allocated); +#endif + LOG_INFO("Current scope depth: %d", orch->scope_stack_top + 1); + for (int r = 0; r < PTO2_MAX_RING_DEPTH; r++) { + int32_t active = orch->rings[r].task_allocator.active_count(); + if (active > 0) { + LOG_INFO("Ring %d task active: %d", r, active); + LOG_INFO("Ring %d heap used: %" PRIu64 " / %" PRIu64, + r, + orch->rings[r].task_allocator.heap_top(), + orch->rings[r].task_allocator.heap_capacity()); + LOG_INFO( + "Ring %d dep pool: %d / %d", r, orch->rings[r].dep_pool.used(), orch->rings[r].dep_pool.capacity); + } + } + LOG_INFO("TensorMap valid: %d", orch->tensor_map.valid_count()); + LOG_INFO("==============================="); +} + +void pto2_orchestrator_print_scope_stack(PTO2OrchestratorState* orch) { + LOG_INFO("=== Scope Stack ==="); + LOG_INFO("Depth: %d", orch->scope_stack_top + 1); + + for (int i = 0; i <= orch->scope_stack_top; i++) { + int32_t begin = orch->scope_begins[i]; + int32_t end = (i < orch->scope_stack_top) ? 
orch->scope_begins[i + 1] : orch->scope_tasks_size; + LOG_INFO(" [%d] tasks_owned = %d", i, end - begin); + } + + LOG_INFO("=================="); +} + +#if PTO2_ORCH_PROFILING +PTO2OrchProfilingData pto2_orchestrator_get_profiling() { + PTO2OrchProfilingData d; + d.sync_cycle = g_orch_sync_cycle; + d.alloc_cycle = g_orch_alloc_cycle; + d.args_cycle = g_orch_args_cycle; + d.lookup_cycle = g_orch_lookup_cycle; + d.insert_cycle = g_orch_insert_cycle; + d.fanin_cycle = g_orch_fanin_cycle; + d.scope_end_cycle = g_orch_scope_end_cycle; + d.submit_count = g_orch_submit_count; + d.alloc_wait_cycle = g_orch_alloc_wait_cycle; + d.fanin_wait_cycle = g_orch_fanin_wait_cycle; + d.alloc_atomic_count = g_orch_alloc_atomic_count; + d.args_atomic_count = g_orch_args_atomic_count; + d.fanin_atomic_count = g_orch_fanin_atomic_count; + d.finalize_atomic_count = g_orch_finalize_atomic_count; + d.scope_end_atomic_count = g_orch_scope_end_atomic_count; + + // Reset + g_orch_sync_cycle = g_orch_alloc_cycle = g_orch_args_cycle = 0; + g_orch_lookup_cycle = g_orch_insert_cycle = 0; + g_orch_fanin_cycle = g_orch_scope_end_cycle = 0; + g_orch_submit_count = 0; + g_orch_submit_idx = 0; + g_orch_alloc_wait_cycle = 0; + g_orch_fanin_wait_cycle = 0; + g_orch_alloc_atomic_count = 0; + g_orch_args_atomic_count = 0; + g_orch_fanin_atomic_count = 0; + g_orch_finalize_atomic_count = 0; + g_orch_scope_end_atomic_count = 0; + return d; +} +#endif diff --git a/src/a2a3/runtime/tensormap_and_ringbuffer_unmodified/runtime/pto_orchestrator.h b/src/a2a3/runtime/tensormap_and_ringbuffer_unmodified/runtime/pto_orchestrator.h new file mode 100644 index 000000000..80d33e4f2 --- /dev/null +++ b/src/a2a3/runtime/tensormap_and_ringbuffer_unmodified/runtime/pto_orchestrator.h @@ -0,0 +1,225 @@ +/** + * PTO Runtime2 - Orchestrator Interface + * + * The Orchestrator is responsible for: + * 1. Executing the orchestration function (Turing-complete control flow) + * 2. Allocating intermediate buffers from the heap + * 3. 
Submitting tasks via async InCore function calls + * 4. Building the dependency graph using TensorMap + * 5. Managing buffer scopes for lifecycle control + * + * The Orchestrator can run on either: + * - Host CPU (lower latency for complex control, easier debugging) + * - Device AI_CPU (lower latency for task submission) + * + * Based on: docs/RUNTIME_LOGIC.md + */ + +#ifndef PTO_ORCHESTRATOR_H +#define PTO_ORCHESTRATOR_H + +#include "pto_ring_buffer.h" +#include "pto_runtime2_types.h" +#include "pto_submit_types.h" +#include "pto_scheduler.h" +#include "pto_shared_memory.h" +#include "pto_tensormap.h" +#include "pto_types.h" + +// ============================================================================= +// Orchestrator State +// ============================================================================= + +/** + * Orchestrator state structure (private to Orchestrator) + * + * Contains all state needed for task graph construction and buffer management. + */ +struct PTO2OrchestratorState { + // === SHARED MEMORY ACCESS === + PTO2SharedMemoryHandle* sm_handle; + + // === PER-RING RESOURCES === + PTO2RingSet rings[PTO2_MAX_RING_DEPTH]; + + // === TENSOR MAP (Private) === + PTO2TensorMap tensor_map; // Producer lookup + + // === SCOPE STACK (Private) === + // Single contiguous buffer of task IDs, partitioned by scope level. + // scope_begins[i] is the index into scope_tasks where scope i starts. + // Tasks for the top scope occupy [scope_begins[top], scope_tasks_size). 
+    PTO2TaskSlotState** scope_tasks;  // Flat buffer of slot-state pointers (all scopes concatenated)
+    int32_t scope_tasks_size;         // Number of entries currently in the buffer
+    int32_t scope_tasks_capacity;     // Allocated capacity of scope_tasks
+    int32_t* scope_begins;            // scope_begins[i] = start index of scope i in scope_tasks
+    int32_t scope_stack_top;          // Current top of stack (-1 = no scope open)
+    uint64_t scope_stack_capacity;    // Max nesting depth (PTO2_MAX_SCOPE_DEPTH)
+
+    // === SCHEDULER REFERENCE ===
+    // Note: In simulated mode, orchestrator and scheduler share address space.
+    // In real mode, they communicate via shared memory only.
+    PTO2SchedulerState* scheduler;  // For simulated mode only
+#if PTO2_PROFILING
+    // Runtime profiling switch copied from Runtime::enable_profiling.
+    bool enable_profiling;
+#endif
+
+    // === GM HEAP (for output buffers) ===
+    void* gm_heap_base;     // Base address of GM heap
+    uint64_t gm_heap_size;  // Total size of GM heap (all rings)
+
+    // === FATAL ERROR ===
+    // Fatal error flag (single-thread access by orchestrator, no atomic needed).
+    // Cross-thread notification uses the shared memory orch_error_code (atomic).
+    bool fatal;
+
+    // === STATISTICS ===
+#if PTO2_PROFILING
+    int64_t tasks_submitted;
+    int64_t buffers_allocated;
+    int64_t bytes_allocated;
+#endif
+
+    /**
+     * Get current ring index from scope depth.
+     * Maps scope depth to ring_id: min(scope_depth, PTO2_MAX_RING_DEPTH - 1)
+     */
+    uint8_t current_ring_id() const {
+        int32_t depth = scope_stack_top;
+        if (depth < 0) depth = 0;
+        return depth < PTO2_MAX_RING_DEPTH ? 
static_cast<uint8_t>(depth) : PTO2_MAX_RING_DEPTH - 1;
+    }
+
+};
+
+// =============================================================================
+// Orchestrator API
+// =============================================================================
+
+/**
+ * Initialize orchestrator state
+ *
+ * @param orch Orchestrator state to initialize
+ * @param sm_handle Shared memory handle
+ * @param gm_heap GM heap memory for output buffers
+ * @param heap_size Size of GM heap
+ * @param dep_pool_capacity Capacity of the per-ring dependency list pool
+ * @return true on success
+ */
+bool pto2_orchestrator_init(
+    PTO2OrchestratorState* orch, PTO2SharedMemoryHandle* sm_handle, void* gm_heap, uint64_t heap_size,
+    int32_t dep_pool_capacity = PTO2_DEP_LIST_POOL_SIZE);
+
+/**
+ * Destroy orchestrator state and free resources
+ */
+void pto2_orchestrator_destroy(PTO2OrchestratorState* orch);
+
+/**
+ * Set scheduler reference (for simulated mode)
+ */
+void pto2_orchestrator_set_scheduler(PTO2OrchestratorState* orch, PTO2SchedulerState* scheduler);
+
+
+// =============================================================================
+// Scope Management
+// =============================================================================
+
+/**
+ * Begin a new scope
+ *
+ * Pushes a new empty task list onto the scope stack.
+ * Tasks submitted while this scope is at the top of the stack are
+ * owned by it and have their fanout_count initialized to 1.
+ */
+void pto2_scope_begin(PTO2OrchestratorState* orch);
+
+/**
+ * End current scope
+ *
+ * Pops the top scope and increments fanout_refcount for each task
+ * directly owned by that scope.
+ * May trigger buffer release for tasks that are now fully consumed. 
+ */ +void pto2_scope_end(PTO2OrchestratorState* orch); + +// ============================================================================= +// Task Submission +// ============================================================================= + +/** + * Submit a task with InCore function and parameters + * + * This is the main API for building the task graph: + * 1. Allocates task slot + packed output buffer via TaskAllocator (blocks until available) + * 2. Looks up inputs in TensorMap to find dependencies + * 3. Updates producer's fanout_count/list (with spinlock) + * 4. Registers outputs in TensorMap + * 5. Initializes task state in scheduler + * + * @param orch Orchestrator state + * @param mixed_kernels Kernel IDs for AIC/AIV0/AIV1 slots + * @param args Aggregated tensor and scalar parameters + */ +TaskOutputTensors pto2_submit_mixed_task(PTO2OrchestratorState* orch, + const MixedKernels& mixed_kernels, + const Arg& args); + +// ============================================================================= +// Flow Control +// ============================================================================= + +/** + * Mark orchestration as complete + * + * Signals to scheduler that no more tasks will be submitted. 
+ */ +void pto2_orchestrator_done(PTO2OrchestratorState* orch); + +// ============================================================================= +// Debug Utilities +// ============================================================================= + +/** + * Print orchestrator statistics + */ +void pto2_orchestrator_print_stats(PTO2OrchestratorState* orch); + +/** + * Print scope stack state + */ +void pto2_orchestrator_print_scope_stack(PTO2OrchestratorState* orch); + +// ============================================================================= +// Orchestrator Profiling Data +// ============================================================================= + +#if PTO2_ORCH_PROFILING +struct PTO2OrchProfilingData { + uint64_t sync_cycle; + uint64_t alloc_cycle; // Combined task slot + heap allocation + uint64_t args_cycle; + uint64_t lookup_cycle; + uint64_t insert_cycle; + uint64_t fanin_cycle; + uint64_t scope_end_cycle; + int64_t submit_count; + // Wait time tracking for blocking phases + uint64_t alloc_wait_cycle; // Cycles spent waiting in unified alloc + uint64_t fanin_wait_cycle; // Cycles spent waiting in fanout_lock + // Atomic operation counts per phase + uint64_t alloc_atomic_count; + uint64_t args_atomic_count; + uint64_t fanin_atomic_count; + uint64_t finalize_atomic_count; + uint64_t scope_end_atomic_count; +}; + +/** + * Get and reset orchestrator profiling data. + * Returns accumulated profiling data and resets counters. 
+ */ +PTO2OrchProfilingData pto2_orchestrator_get_profiling(); +#endif + +#endif // PTO_ORCHESTRATOR_H diff --git a/src/a2a3/runtime/tensormap_and_ringbuffer_unmodified/runtime/pto_ring_buffer.cpp b/src/a2a3/runtime/tensormap_and_ringbuffer_unmodified/runtime/pto_ring_buffer.cpp new file mode 100644 index 000000000..47a0ec1a6 --- /dev/null +++ b/src/a2a3/runtime/tensormap_and_ringbuffer_unmodified/runtime/pto_ring_buffer.cpp @@ -0,0 +1,78 @@ +/** + * PTO Runtime2 - Ring Buffer Implementation + * + * Implements DepListPool ring buffer for zero-overhead dependency management. + * TaskAllocator methods are defined inline in pto_ring_buffer.h. + * + * Based on: docs/RUNTIME_LOGIC.md + */ + +#include "pto_ring_buffer.h" +#include +#include +#include // for exit() +#include "common/unified_log.h" +#include "pto_scheduler.h" + +// ============================================================================= +// Dependency List Pool Implementation +// ============================================================================= +void PTO2DepListPool::reclaim(PTO2SchedulerState& sched, uint8_t ring_id, int32_t sm_last_task_alive) { + if (sm_last_task_alive >= last_reclaimed + PTO2_DEP_POOL_CLEANUP_INTERVAL && sm_last_task_alive > 0) { + int32_t mark = sched.ring_sched_states[ring_id].get_slot_state_by_task_id(sm_last_task_alive - 1).dep_pool_mark; + if (mark > 0) { + advance_tail(mark); + } + last_reclaimed = sm_last_task_alive; + } +} + +void PTO2DepListPool::ensure_space( + PTO2SchedulerState& sched, PTO2RingFlowControl& fc, uint8_t ring_id, int32_t needed) { + if (available() >= needed) return; + + int spin_count = 0; + int32_t prev_last_alive = fc.last_task_alive.load(std::memory_order_acquire); + while (available() < needed) { + reclaim(sched, ring_id, prev_last_alive); + if (available() >= needed) return; + + spin_count++; + + // Progress detection: reset spin counter if last_task_alive advances + int32_t cur_last_alive = 
fc.last_task_alive.load(std::memory_order_acquire); + if (cur_last_alive > prev_last_alive) { + spin_count = 0; + prev_last_alive = cur_last_alive; + } + + if (spin_count >= PTO2_DEP_POOL_SPIN_LIMIT) { + int32_t current = fc.current_task_index.load(std::memory_order_acquire); + LOG_ERROR("========================================"); + LOG_ERROR("FATAL: Dependency Pool Deadlock Detected! (ring %d)", ring_id); + LOG_ERROR("========================================"); + LOG_ERROR("DepListPool cannot reclaim space after %d spins (no progress).", spin_count); + LOG_ERROR(" - Pool used: %d / %d (%.1f%%)", + used(), + capacity, + (capacity > 0) ? (100.0 * used() / capacity) : 0.0); + LOG_ERROR(" - Pool top: %d (linear)", top); + LOG_ERROR(" - Pool tail: %d (linear)", tail); + LOG_ERROR(" - High water: %d", high_water); + LOG_ERROR(" - Needed: %d entries", needed); + LOG_ERROR(" - last_task_alive: %d (stuck here)", cur_last_alive); + LOG_ERROR(" - current_task: %d", current); + LOG_ERROR(" - In-flight tasks: %d", current - cur_last_alive); + LOG_ERROR("Diagnosis:"); + LOG_ERROR(" last_task_alive is not advancing, so dep pool tail"); + LOG_ERROR(" cannot reclaim. 
Check TaskRing diagnostics for root cause.");
+            LOG_ERROR("Solution:");
+            LOG_ERROR("  Increase dep pool capacity (current: %d, recommended: %d)", capacity, high_water * 2);
+            LOG_ERROR("    Compile-time: PTO2_DEP_LIST_POOL_SIZE in pto_runtime2_types.h");
+            LOG_ERROR("    Runtime env: PTO2_RING_DEP_POOL=%d", high_water * 2);
+            LOG_ERROR("========================================");
+            exit(1);
+        }
+        SPIN_WAIT_HINT();
+    }
+}
diff --git a/src/a2a3/runtime/tensormap_and_ringbuffer_unmodified/runtime/pto_ring_buffer.h b/src/a2a3/runtime/tensormap_and_ringbuffer_unmodified/runtime/pto_ring_buffer.h
new file mode 100644
index 000000000..6f9f655ba
--- /dev/null
+++ b/src/a2a3/runtime/tensormap_and_ringbuffer_unmodified/runtime/pto_ring_buffer.h
@@ -0,0 +1,508 @@
+/**
+ * PTO Runtime2 - Ring Buffer Data Structures
+ *
+ * Implements ring buffer designs for zero-overhead memory management:
+ *
+ * 1. TaskAllocator - Unified task slot + output buffer allocation
+ *    - Combines task ring (slot allocation) and heap ring (output buffer allocation)
+ *    - Single spin-wait loop with unified back-pressure and deadlock detection
+ *    - O(1) bump allocation for both task slots and heap buffers
+ *
+ * 2. DepListPool - Dependency list entry allocation
+ *    - Ring buffer for linked list entries
+ *    - O(1) prepend operation
+ *    - Implicit reclamation with task ring
+ *
+ * Based on: docs/RUNTIME_LOGIC.md
+ */
+
+#ifndef PTO_RING_BUFFER_H
+#define PTO_RING_BUFFER_H
+
+#include <atomic>
+
+#include "pto_runtime2_types.h"
+#include "pto_shared_memory.h"
+#include "common/unified_log.h"
+
+struct PTO2SchedulerState;  // Forward declaration for dep_pool reclaim
+
+// Set to 1 to enable periodic BLOCKED/Unblocked messages during spin-wait. 
+#ifndef PTO2_SPIN_VERBOSE_LOGGING +#define PTO2_SPIN_VERBOSE_LOGGING 1 +#endif + +// Block notification interval (in spin counts) +#define PTO2_BLOCK_NOTIFY_INTERVAL 10000 +// Alloc spin limit - after this, report deadlock and exit +#define PTO2_ALLOC_SPIN_LIMIT 100000 + +// Dep pool spin limit - if exceeded, dep pool capacity too small for workload +#define PTO2_DEP_POOL_SPIN_LIMIT 100000 + +// ============================================================================= +// Task Allocator (unified task slot + heap buffer allocation) +// ============================================================================= + +/** + * Result of a unified task allocation. + */ +struct PTO2TaskAllocResult { + int32_t task_id; // Absolute task ID (not wrapped), -1 on failure + int32_t slot; // task_id & (window_size - 1) + void* packed_base; // Heap allocation result (nullptr if output_size == 0) + void* packed_end; // packed_base + aligned output_size + + bool failed() const { return task_id < 0; } +}; + +/** + * Unified task slot + heap buffer allocator. + * + * Since task and heap are always allocated together and the orchestrator is + * single-threaded, both pointers (task index, heap top) are tracked locally + * and published to shared memory via plain store — no fetch_add or CAS needed. + * + * The alloc() method checks both resources BEFORE committing to either, + * eliminating the need for rollback on partial failure. + */ +class PTO2TaskAllocator { +public: + /** + * Initialize the allocator with task ring and heap ring resources. 
+     */
+    void init(PTO2TaskDescriptor* descriptors, int32_t window_size,
+              std::atomic<int32_t>* current_index_ptr,
+              std::atomic<int32_t>* last_alive_ptr,
+              void* heap_base, uint64_t heap_size,
+              std::atomic<int32_t>* error_code_ptr) {
+        descriptors_ = descriptors;
+        window_size_ = window_size;
+        window_mask_ = window_size - 1;
+        current_index_ptr_ = current_index_ptr;
+        last_alive_ptr_ = last_alive_ptr;
+        heap_base_ = heap_base;
+        heap_size_ = heap_size;
+        error_code_ptr_ = error_code_ptr;
+        local_task_id_ = current_index_ptr->load(std::memory_order_relaxed);
+        heap_top_ = 0;
+        heap_tail_ = 0;
+        last_alive_seen_ = 0;
+    }
+
+    /**
+     * Allocate a task slot and its associated output buffer in one call.
+     *
+     * Both task index and heap top are maintained as local counters and
+     * published to shared memory only on success. Since the orchestrator is
+     * single-threaded, no CAS or fetch_add is needed — just check-then-commit.
+     *
+     * @param output_size Total packed output size in bytes (0 = no heap needed)
+     * @return Allocation result; check failed() for errors
+     */
+    PTO2TaskAllocResult alloc(int32_t output_size) {
+        uint64_t aligned_size = output_size > 0
+            ? PTO2_ALIGN_UP(static_cast<uint64_t>(output_size), PTO2_ALIGN_SIZE) : 0;
+
+        int spin_count = 0;
+        int32_t prev_last_alive = last_alive_ptr_->load(std::memory_order_acquire);
+        int32_t last_alive = prev_last_alive;
+        update_heap_tail(last_alive);
+        bool blocked_on_heap = false;
+#if PTO2_ORCH_PROFILING
+        uint64_t wait_start = 0;
+        bool waiting = false;
+#endif
+
+        while (true) {
+            // Check both resources; commit only if both available
+            if (local_task_id_ - last_alive + 1 < window_size_) {
+                void* heap_ptr = try_bump_heap(aligned_size);
+                if (heap_ptr) {
+                    int32_t task_id = commit_task();
+#if PTO2_ORCH_PROFILING
+                    record_wait(spin_count, wait_start, waiting);
+#endif
+                    return {task_id, task_id & window_mask_,
+                            heap_ptr, static_cast<char*>(heap_ptr) + aligned_size};
+                }
+                blocked_on_heap = true;
+            } else {
+                blocked_on_heap = false;
+            }
+
+            // Spin: wait for scheduler to advance last_task_alive
+            spin_count++;
+#if PTO2_ORCH_PROFILING
+            if (!waiting) { wait_start = get_sys_cnt_aicpu(); waiting = true; }
+#endif
+            last_alive = last_alive_ptr_->load(std::memory_order_acquire);
+            update_heap_tail(last_alive);
+            if (last_alive > prev_last_alive) {
+                spin_count = 0;
+                prev_last_alive = last_alive;
+            } else {
+#if PTO2_SPIN_VERBOSE_LOGGING
+                if (spin_count % PTO2_BLOCK_NOTIFY_INTERVAL == 0) {
+                    LOG_WARN("[TaskAllocator] BLOCKED: tasks=%d/%d, heap=%" PRIu64 "/%" PRIu64 ", on=%s, spins=%d",
+                             local_task_id_ - last_alive,
+                             window_size_,
+                             heap_top_,
+                             heap_size_,
+                             blocked_on_heap ?
"heap" : "task", + spin_count); + } +#endif + if (spin_count >= PTO2_ALLOC_SPIN_LIMIT) { + report_deadlock(output_size, blocked_on_heap); + return {-1, -1, nullptr, nullptr}; + } + } + SPIN_WAIT_HINT(); + } + } + + // ========================================================================= + // Task descriptor accessors + // ========================================================================= + + PTO2TaskDescriptor& task(int32_t task_id) const { + return descriptors_[task_id & window_mask_]; + } + + PTO2TaskDescriptor& task_by_slot(int32_t slot) const { + return descriptors_[slot]; + } + + // ========================================================================= + // State queries + // ========================================================================= + + int32_t active_count() const { + int32_t last_alive = last_alive_ptr_->load(std::memory_order_acquire); + return local_task_id_ - last_alive; + } + + int32_t window_size() const { return window_size_; } + + uint64_t heap_available() const { + uint64_t tail = heap_tail_; + if (heap_top_ >= tail) { + uint64_t at_end = heap_size_ - heap_top_; + uint64_t at_begin = tail; + return at_end > at_begin ? 
at_end : at_begin;
+        }
+        return tail - heap_top_;
+    }
+
+    uint64_t heap_top() const { return heap_top_; }
+    uint64_t heap_capacity() const { return heap_size_; }
+
+private:
+    // --- Task Ring ---
+    PTO2TaskDescriptor* descriptors_ = nullptr;
+    int32_t window_size_ = 0;
+    int32_t window_mask_ = 0;
+    std::atomic<int32_t>* current_index_ptr_ = nullptr;
+    std::atomic<int32_t>* last_alive_ptr_ = nullptr;
+
+    // --- Heap ---
+    void* heap_base_ = nullptr;
+    uint64_t heap_size_ = 0;
+
+    // --- Local state (single-writer, no atomics needed) ---
+    int32_t local_task_id_ = 0;    // Next task ID to allocate
+    uint64_t heap_top_ = 0;        // Current heap allocation pointer
+    uint64_t heap_tail_ = 0;       // Heap reclamation pointer (derived from consumed tasks)
+    int32_t last_alive_seen_ = 0;  // last_task_alive at last heap_tail derivation
+
+    // --- Shared ---
+    std::atomic<int32_t>* error_code_ptr_ = nullptr;
+
+    // =========================================================================
+    // Internal helpers
+    // =========================================================================
+
+    /**
+     * Commit a task slot: bump local counter and publish to shared memory.
+     * Must only be called after space check has passed.
+     */
+    int32_t commit_task() {
+        int32_t task_id = local_task_id_++;
+        current_index_ptr_->store(local_task_id_, std::memory_order_release);
+        return task_id;
+    }
+
+    /**
+     * Derive heap_tail_ from the last consumed task's packed_buffer_end.
+     *
+     * Every task has a valid packed_buffer_end (equal to packed_buffer_base
+     * for zero-size allocations), so the last consumed task always determines
+     * the correct heap_tail — no backward scan needed.
+     */
+    void update_heap_tail(int32_t last_alive) {
+        if (last_alive <= last_alive_seen_) return;
+        last_alive_seen_ = last_alive;
+
+        PTO2TaskDescriptor& desc = descriptors_[(last_alive - 1) & window_mask_];
+        heap_tail_ = static_cast<uint64_t>(
+            static_cast<char*>(desc.packed_buffer_end) - static_cast<char*>(heap_base_));
+    }
+
+    /**
+     * Bump the heap pointer for the given allocation size.
+     * Returns the allocated pointer, or nullptr if insufficient space.
+     * When alloc_size == 0, returns current position without advancing.
+     */
+    void* try_bump_heap(uint64_t alloc_size) {
+        uint64_t top = heap_top_;
+        if (alloc_size == 0) {
+            return static_cast<char*>(heap_base_) + top;
+        }
+        uint64_t tail = heap_tail_;
+        void* result;
+
+        if (top >= tail) {
+            uint64_t space_at_end = heap_size_ - top;
+            if (space_at_end >= alloc_size) {
+                result = static_cast<char*>(heap_base_) + top;
+                heap_top_ = top + alloc_size;
+            } else if (tail > alloc_size) {
+                result = heap_base_;
+                heap_top_ = alloc_size;
+            } else {
+                return nullptr;
+            }
+        } else {
+            if (tail - top >= alloc_size) {
+                result = static_cast<char*>(heap_base_) + top;
+                heap_top_ = top + alloc_size;
+            } else {
+                return nullptr;
+            }
+        }
+
+        return result;
+    }
+
+#if PTO2_ORCH_PROFILING
+    void record_wait(int spin_count, uint64_t wait_start, bool waiting) {
+        if (waiting) {
+            extern uint64_t g_orch_alloc_wait_cycle;
+            g_orch_alloc_wait_cycle += (get_sys_cnt_aicpu() - wait_start);
+        }
+        {
+            extern uint64_t g_orch_alloc_atomic_count;
+            g_orch_alloc_atomic_count += spin_count + 1;
+        }
+    }
+#endif
+
+    /**
+     * Report deadlock with targeted diagnostics.
+     */
+    void report_deadlock(int32_t requested_output_size, bool heap_blocked) {
+        int32_t last_alive = last_alive_ptr_->load(std::memory_order_acquire);
+        int32_t active_tasks = local_task_id_ - last_alive;
+        uint64_t htail = heap_tail_;
+
+        LOG_ERROR("========================================");
+        if (heap_blocked) {
+            LOG_ERROR("FATAL: Task Allocator Deadlock - Heap Exhausted!");
+        } else {
+            LOG_ERROR("FATAL: Task Allocator Deadlock - Task Ring Full!");
+        }
+        LOG_ERROR("========================================");
+        LOG_ERROR("No progress after %d spins.", PTO2_ALLOC_SPIN_LIMIT);
+        LOG_ERROR(" Task ring: current=%d, last_alive=%d, active=%d/%d (%.1f%%)",
+                  local_task_id_, last_alive, active_tasks, window_size_,
+                  100.0 * active_tasks / window_size_);
+        LOG_ERROR(" Heap ring: top=%" PRIu64 ", tail=%" PRIu64 ", size=%" PRIu64
+                  ", available=%" PRIu64,
+                  heap_top_, htail, heap_size_, heap_available());
+        if (heap_blocked) {
+            LOG_ERROR(" Requested: %d bytes", requested_output_size);
+        }
+        LOG_ERROR("Diagnosis:");
+        LOG_ERROR(" last_task_alive is stuck at %d, meaning task %d",
+                  last_alive, last_alive);
+        LOG_ERROR(" cannot transition to CONSUMED. Possible causes:");
+        LOG_ERROR(" 1. Task %d still executing (subtasks not complete)", last_alive);
+        LOG_ERROR(" 2. Task %d fanout not fully released (downstream not done)", last_alive);
+        LOG_ERROR(" 3. Scope reference not released (scope_end not called)");
+        LOG_ERROR(" 4. Orchestrator blocked here -> can't call scope_end -> circular wait");
+        LOG_ERROR("Solution:");
+        if (heap_blocked) {
+            LOG_ERROR(" Increase heap size (current: %" PRIu64 ", recommended: %" PRIu64 ")",
+                      heap_size_, heap_size_ * 2);
+            LOG_ERROR(" Compile-time: PTO2_HEAP_SIZE in pto_runtime2_types.h");
+            LOG_ERROR(" Runtime env: PTO2_RING_HEAP=<bytes> (e.g. %" PRIu64 ")",
+                      heap_size_ * 2);
+        } else {
+            LOG_ERROR(" Increase task window size (current: %d, recommended: %d)",
+                      window_size_, active_tasks * 2);
+            LOG_ERROR(" Compile-time: PTO2_TASK_WINDOW_SIZE in pto_runtime2_types.h");
+            LOG_ERROR(" Runtime env: PTO2_RING_TASK_WINDOW=<count> (e.g. %d)",
+                      active_tasks * 2);
+        }
+        LOG_ERROR("========================================");
+        if (error_code_ptr_) {
+            int32_t code = heap_blocked ? PTO2_ERROR_HEAP_RING_DEADLOCK
+                                        : PTO2_ERROR_FLOW_CONTROL_DEADLOCK;
+            error_code_ptr_->store(code, std::memory_order_release);
+        }
+    }
+};
+
+// =============================================================================
+// Dependency List Pool
+// =============================================================================
+
+/**
+ * Dependency list pool structure
+ *
+ * True ring buffer for allocating linked list entries.
+ * Entries are reclaimed when their producer tasks become CONSUMED,
+ * as tracked by the orchestrator via dep_pool_mark per task.
+ *
+ * Linear counters (top, tail) grow monotonically; the physical index
+ * is obtained via modulo: base[linear_index % capacity].
+ */
+struct PTO2DepListPool {
+    PTO2DepListEntry* base;    // Pool base address
+    int32_t capacity;          // Total number of entries
+    int32_t top;               // Linear next-allocation counter (starts from 1)
+    int32_t tail;              // Linear first-alive counter (entries before this are dead)
+    int32_t high_water;        // Peak concurrent usage (top - tail)
+    int32_t last_reclaimed{0}; // last_task_alive at last successful reclamation
+
+    // Error code pointer for fatal error reporting (→ sm_header->orch_error_code)
+    std::atomic<int32_t>* error_code_ptr = nullptr;
+
+    /**
+     * Initialize dependency list pool
+     *
+     * @param base Pool base address from shared memory
+     * @param capacity Total number of entries
+     */
+    void init(PTO2DepListEntry* in_base, int32_t in_capacity, std::atomic<int32_t>* in_error_code_ptr) {
+        base = in_base;
+        capacity = in_capacity;
+        top = 1;  // Start from 1, 0 means NULL/empty
+        tail = 1; // Match initial top (no reclaimable entries yet)
+        high_water = 0;
+        last_reclaimed = 0;
+
+        // Initialize entry 0 as NULL marker
+        base[0].slot_state = nullptr;
+        base[0].next = nullptr;
+
+        error_code_ptr = in_error_code_ptr;
+    }
+
+    /**
+     * Reclaim dead entries based on scheduler's slot state dep_pool_mark.
+     * Safe to call multiple times — only advances tail forward.
+     *
+     * @param sched Scheduler state (for reading slot dep_pool_mark)
+     * @param ring_id Ring layer index
+     * @param sm_last_task_alive Current last_task_alive from shared memory
+     */
+    void reclaim(PTO2SchedulerState& sched, uint8_t ring_id, int32_t sm_last_task_alive);
+
+    /**
+     * Ensure dep pool for a specific ring has at least `needed` entries available.
+     * Spin-waits for reclamation if under pressure. Detects deadlock if no progress.
+     */
+    void ensure_space(PTO2SchedulerState& sched, PTO2RingFlowControl& fc, uint8_t ring_id, int32_t needed);
+
+    /**
+     * Allocate a single entry from the pool (single-thread per pool instance)
+     *
+     * @return Pointer to allocated entry, or nullptr on fatal error
+     */
+    PTO2DepListEntry* alloc() {
+        int32_t used = top - tail;
+        if (used >= capacity) {
+            LOG_ERROR("========================================");
+            LOG_ERROR("FATAL: Dependency Pool Overflow!");
+            LOG_ERROR("========================================");
+            LOG_ERROR("DepListPool exhausted: %d entries alive (capacity=%d).", used, capacity);
+            LOG_ERROR(" - Pool top: %d (linear)", top);
+            LOG_ERROR(" - Pool tail: %d (linear)", tail);
+            LOG_ERROR(" - High water: %d", high_water);
+            LOG_ERROR("Solution:");
+            LOG_ERROR(" Increase dep pool capacity (current: %d, recommended: %d).", capacity, capacity * 2);
+            LOG_ERROR(" Compile-time: PTO2_DEP_LIST_POOL_SIZE in pto_runtime2_types.h");
+            LOG_ERROR(" Runtime env: PTO2_RING_DEP_POOL=%d", capacity * 2);
+            LOG_ERROR("========================================");
+            if (error_code_ptr) {
+                error_code_ptr->store(PTO2_ERROR_DEP_POOL_OVERFLOW, std::memory_order_release);
+            }
+            return nullptr;
+        }
+        int32_t idx = top % capacity;
+        top++;
+        used++;
+        if (used > high_water) high_water = used;
+        return &base[idx];
+    }
+
+    /**
+     * Advance the tail pointer, reclaiming dead entries.
+     * Called by the orchestrator based on last_task_alive advancement.
+     */
+    void advance_tail(int32_t new_tail) {
+        if (new_tail > tail) {
+            tail = new_tail;
+        }
+    }
+
+    /**
+     * Prepend a task ID to a dependency list
+     *
+     * O(1) operation: allocates new entry and links to current head.
+     *
+     * @param cur Current list head (nullptr = empty list)
+     * @param slot_state Producer task slot state to prepend
+     * @return New head entry, or nullptr on pool overflow
+     */
+    PTO2DepListEntry* prepend(PTO2DepListEntry* cur, PTO2TaskSlotState* slot_state) {
+        PTO2DepListEntry* new_entry = alloc();
+        if (!new_entry) return nullptr;
+        new_entry->slot_state = slot_state;
+        new_entry->next = cur;
+        return new_entry;
+    }
+
+    /**
+     * Get entry by offset
+     */
+    PTO2DepListEntry* pto2_dep_pool_get(int32_t offset) {
+        if (offset <= 0) return nullptr;
+        return &base[offset];
+    }
+
+    int32_t used() const {
+        return top - tail;
+    }
+
+    int32_t available() const {
+        return capacity - used();
+    }
+};
+
+// =============================================================================
+// Ring Set (per-depth aggregate)
+// =============================================================================
+
+/**
+ * Groups a TaskAllocator and DepPool into one per-depth unit.
+ * PTO2_MAX_RING_DEPTH instances provide independent reclamation per scope depth.
+ */
+struct PTO2RingSet {
+    PTO2TaskAllocator task_allocator;
+    PTO2DepListPool dep_pool;
+};
+
+#endif  // PTO_RING_BUFFER_H
diff --git a/src/a2a3/runtime/tensormap_and_ringbuffer_unmodified/runtime/pto_runtime2.cpp b/src/a2a3/runtime/tensormap_and_ringbuffer_unmodified/runtime/pto_runtime2.cpp
new file mode 100644
index 000000000..bd049ecf8
--- /dev/null
+++ b/src/a2a3/runtime/tensormap_and_ringbuffer_unmodified/runtime/pto_runtime2.cpp
@@ -0,0 +1,337 @@
+/*
+ * Copyright (c) PyPTO Contributors.
+ * This program is free software, you can redistribute it and/or modify it under the terms and conditions of
+ * CANN Open Software License Agreement Version 2.0 (the "License").
+ * Please refer to the License for details. You may not use this file except in compliance with the License.
+ * THIS SOFTWARE IS PROVIDED ON AN "AS IS" BASIS, WITHOUT WARRANTIES OF ANY KIND, EITHER EXPRESS OR IMPLIED,
+ * INCLUDING BUT NOT LIMITED TO NON-INFRINGEMENT, MERCHANTABILITY, OR FITNESS FOR A PARTICULAR PURPOSE.
+ * See LICENSE in the root of the software repository for the full text of the License.
+ * -----------------------------------------------------------------------------------------------------------
+ */
+
+/**
+ * PTO Runtime2 - Main Implementation
+ *
+ * Implements the unified runtime API that combines orchestrator and scheduler.
+ *
+ * Based on: docs/RUNTIME_LOGIC.md
+ */
+
+#include "pto_runtime2.h"
+
+#include <cstdint>
+#include <cstdlib>
+#include <cstring>
+
+#include <atomic>
+
+#include "aicpu/device_time.h"
+#include "common/unified_log.h"
+
+// Weak fallback for HOST .so builds (never called, but satisfies linker).
+// The AICPU build links the strong symbol from platform/.../device_time.cpp.
+// Hidden visibility prevents HOST .so from polluting global symbol table.
+__attribute__((weak, visibility("hidden"))) uint64_t get_sys_cnt_aicpu() { return 0; }
+
+// =============================================================================
+// Thread-local orchestrator index for multi-orchestrator dispatch
+// =============================================================================
+
+thread_local int pto2_current_orch_idx = 0;
+
+void pto2_set_orch_thread_idx(int idx) { pto2_current_orch_idx = idx; }
+
+// =============================================================================
+// Orchestration Ops Table (function-pointer dispatch for orchestration .so)
+// =============================================================================
+
+static TaskOutputTensors submit_task_impl(PTO2Runtime* rt, const MixedKernels& mixed_kernels, const Arg& args) {
+    return pto2_submit_mixed_task(&rt->orchestrators[pto2_current_orch_idx], mixed_kernels, args);
+}
+
+void pto2_rt_scope_begin(PTO2Runtime* rt) { pto2_scope_begin(&rt->orchestrators[pto2_current_orch_idx]); }
+
+void
pto2_rt_scope_end(PTO2Runtime* rt) { pto2_scope_end(&rt->orchestrators[pto2_current_orch_idx]); } + +void pto2_rt_orchestration_done(PTO2Runtime* rt) { pto2_orchestrator_done(&rt->orchestrators[pto2_current_orch_idx]); } + +static bool is_fatal_impl(PTO2Runtime* rt) { return rt->orchestrators[pto2_current_orch_idx].fatal; } + +// Wait for all producers of this tensor to be safe for data access. +// Checks owner metadata (lifecycle anchor) and OverlapMap (modifier writers). +// For reads: wait until each producer COMPLETED (done writing). +// For writes: also wait until all consumers done reading +// (fanout_refcount >= fanout_count - 1, excluding scope reference). +// Uses cycle-based timeout (checked every 1024 spins). +// Returns false on timeout (sets orch.fatal). +MAYBE_UNINITIALIZED_BEGIN +static bool wait_for_tensor_ready(PTO2Runtime* rt, const Tensor& tensor, bool wait_for_consumers, const char* caller) { + PTO2OrchestratorState& orch = rt->orchestrators[pto2_current_orch_idx]; + + // Collect producer slot states from both maps, deduplicated by pointer. + // +1: one creator slot + up to PTO2_LOOKUP_MAX_RESULTS modifier slots. 
+    constexpr int kMaxWait = PTO2_LOOKUP_MAX_RESULTS + 1;
+    PTO2TaskSlotState* slots[kMaxWait];
+    int slot_count = 0;
+
+    // Step A: creator retention — read owner directly from tensor metadata
+    PTO2TaskId owner = tensor.owner_task_id;
+    if (owner.is_valid()) {
+        slots[slot_count++] = &rt->scheduler.ring_sched_states[owner.ring()].get_slot_state_by_task_id(owner.local());
+    }
+
+    // Step B: modifier writer lookup (OverlapMap)
+    PTO2LookupResult lookup_result;
+    orch.tensor_map.lookup(tensor, lookup_result);
+    for (int r = 0; r < lookup_result.count; r++) {
+        PTO2TaskId pid = lookup_result.entries[r].entry->producer_task_id;
+        PTO2TaskSlotState* s = &rt->scheduler.ring_sched_states[pid.ring()].get_slot_state_by_task_id(pid.local());
+        bool already = false;
+        for (int j = 0; j < slot_count; j++) {
+            if (slots[j] == s) {
+                already = true;
+                break;
+            }
+        }
+        if (!already && slot_count < kMaxWait) {
+            slots[slot_count++] = s;
+        }
+    }
+
+    // Wait for each producer
+    for (int p = 0; p < slot_count; p++) {
+        PTO2TaskSlotState& slot = *slots[p];
+        uint8_t ring_id = slot.ring_id;
+        int32_t local_id = static_cast<int32_t>(slot.task->task_id.local());
+
+        uint64_t t0 = get_sys_cnt_aicpu();
+        int32_t spin_count = 0;
+        while (slot.task_state.load(std::memory_order_acquire) < PTO2_TASK_COMPLETED) {
+            SPIN_WAIT_HINT();
+            if ((++spin_count & 1023) == 0 && get_sys_cnt_aicpu() - t0 > PTO2_TENSOR_DATA_TIMEOUT_CYCLES) {
+                orch.fatal = true;
+                unified_log_error(caller,
+                                  "Timeout (%llu cycles): producer (ring=%d, local=%d) not completed",
+                                  (unsigned long long)PTO2_TENSOR_DATA_TIMEOUT_CYCLES,  // NOLINT(runtime/int)
+                                  ring_id,
+                                  local_id);
+                return false;
+            }
+        }
+
+        if (wait_for_consumers) {
+            t0 = get_sys_cnt_aicpu();
+            spin_count = 0;
+            while (slot.fanout_refcount.load(std::memory_order_acquire) < slot.fanout_count - 1) {
+                SPIN_WAIT_HINT();
+                if ((++spin_count & 1023) == 0 && get_sys_cnt_aicpu() - t0 > PTO2_TENSOR_DATA_TIMEOUT_CYCLES) {
+                    orch.fatal = true;
+                    unified_log_error(caller,
+                                      "Timeout (%llu cycles): consumers of producer (ring=%d, local=%d) not done",
+                                      (unsigned long long)PTO2_TENSOR_DATA_TIMEOUT_CYCLES,  // NOLINT(runtime/int)
+                                      ring_id,
+                                      local_id);
+                    return false;
+                }
+            }
+        }
+    }
+    return true;
+}
+MAYBE_UNINITIALIZED_END
+
+uint64_t pto2_get_tensor_data(PTO2Runtime* rt, const Tensor& tensor, uint32_t ndims, const uint32_t indices[]) {
+    if (tensor.buffer.addr == 0) {
+        unified_log_error(__FUNCTION__,
+                          "get_tensor_data: buffer not allocated (addr=0). "
+                          "Use the Tensor returned by add_output(TensorCreateInfo) after submit returns.");
+        return 0;
+    }
+
+    if (!wait_for_tensor_ready(rt, tensor, false, __FUNCTION__)) {
+        return 0;
+    }
+
+    uint64_t flat_offset = tensor.compute_flat_offset(indices, ndims);
+    uint64_t elem_size = get_element_size(tensor.dtype);
+    const void* ptr = reinterpret_cast<const void*>(tensor.buffer.addr + flat_offset * elem_size);
+    uint64_t result = 0;
+    memcpy(&result, ptr, elem_size);
+    return result;
+}
+
+void pto2_set_tensor_data(
+    PTO2Runtime* rt, const Tensor& tensor, uint32_t ndims, const uint32_t indices[], uint64_t value) {
+    if (tensor.buffer.addr == 0) {
+        unified_log_error(__FUNCTION__,
+                          "set_tensor_data: buffer not allocated (addr=0). "
" + "Use the Tensor returned by add_output(TensorCreateInfo) after submit returns."); + return; + } + + // Wait for producer + all consumers before writing (WAW + WAR safety) + if (!wait_for_tensor_ready(rt, tensor, true, __FUNCTION__)) { + return; + } + + uint64_t flat_offset = tensor.compute_flat_offset(indices, ndims); + uint64_t elem_size = get_element_size(tensor.dtype); + void* ptr = reinterpret_cast(tensor.buffer.addr + flat_offset * elem_size); + memcpy(ptr, &value, elem_size); +} + +static const PTO2RuntimeOps s_runtime_ops = { + .submit_task = submit_task_impl, + .scope_begin = pto2_rt_scope_begin, + .scope_end = pto2_rt_scope_end, + .orchestration_done = pto2_rt_orchestration_done, + .is_fatal = is_fatal_impl, + .log_error = unified_log_error, + .log_warn = unified_log_warn, + .log_info = unified_log_info, + .log_debug = unified_log_debug, + .log_always = unified_log_always, + .get_tensor_data = pto2_get_tensor_data, + .set_tensor_data = pto2_set_tensor_data, +}; + +// ============================================================================= +// Runtime Creation and Destruction +// ============================================================================= + +PTO2Runtime* pto2_runtime_create(PTO2RuntimeMode mode) { + return pto2_runtime_create_custom(mode, PTO2_TASK_WINDOW_SIZE, PTO2_HEAP_SIZE); +} + +PTO2Runtime* pto2_runtime_create_custom( + PTO2RuntimeMode mode, uint64_t task_window_size, uint64_t heap_size, int32_t dep_pool_capacity) { + // Allocate runtime context + PTO2Runtime* rt = static_cast(calloc(1, sizeof(PTO2Runtime))); + if (!rt) { + return NULL; + } + + rt->ops = &s_runtime_ops; + rt->mode = mode; + rt->orch_count = 1; + rt->sm_handle = pto2_sm_create(task_window_size, heap_size); + if (!rt->sm_handle) { + free(rt); + return NULL; + } + + // Allocate GM heap for output buffers (all rings combined) + uint64_t total_heap_size = heap_size * PTO2_MAX_RING_DEPTH; + rt->gm_heap_size = total_heap_size; +#if defined(_POSIX_C_SOURCE) && 
_POSIX_C_SOURCE >= 200112L + if (posix_memalign(&rt->gm_heap, PTO2_ALIGN_SIZE, total_heap_size) != 0) { + pto2_sm_destroy(rt->sm_handle); + free(rt); + return NULL; + } +#else + rt->gm_heap = aligned_alloc(PTO2_ALIGN_SIZE, total_heap_size); + if (!rt->gm_heap) { + pto2_sm_destroy(rt->sm_handle); + free(rt); + return NULL; + } +#endif + rt->gm_heap_owned = true; + + // Initialize first orchestrator + if (!pto2_orchestrator_init(&rt->orchestrators[0], rt->sm_handle, rt->gm_heap, heap_size, dep_pool_capacity)) { + free(rt->gm_heap); + pto2_sm_destroy(rt->sm_handle); + free(rt); + return NULL; + } + + // Initialize scheduler (heap_size = per-ring heap size) + if (!pto2_scheduler_init(&rt->scheduler, rt->sm_handle)) { + pto2_orchestrator_destroy(&rt->orchestrators[0]); + free(rt->gm_heap); + pto2_sm_destroy(rt->sm_handle); + free(rt); + return NULL; + } + + // Connect orchestrator to scheduler (for simulated mode) + pto2_orchestrator_set_scheduler(&rt->orchestrators[0], &rt->scheduler); + + return rt; +} + +PTO2Runtime* pto2_runtime_create_from_sm(PTO2RuntimeMode mode, + PTO2SharedMemoryHandle* sm_handle, + void* gm_heap, + uint64_t heap_size, + int orch_count, + int32_t dep_pool_capacity) { + if (!sm_handle) return NULL; + if (orch_count < 1) orch_count = 1; + if (orch_count > PTO2_MAX_ORCH_THREADS) orch_count = PTO2_MAX_ORCH_THREADS; + + PTO2Runtime* rt = static_cast(calloc(1, sizeof(PTO2Runtime))); + if (!rt) return NULL; + + rt->ops = &s_runtime_ops; + rt->mode = mode; + rt->sm_handle = sm_handle; + rt->gm_heap = gm_heap; + rt->gm_heap_size = heap_size > 0 ? 
heap_size * PTO2_MAX_RING_DEPTH : 0; + rt->gm_heap_owned = false; + rt->orch_count = orch_count; + + // Initialize all orchestrator states + for (int i = 0; i < orch_count; i++) { + if (!pto2_orchestrator_init(&rt->orchestrators[i], rt->sm_handle, rt->gm_heap, heap_size, dep_pool_capacity)) { + for (int j = 0; j < i; j++) { + pto2_orchestrator_destroy(&rt->orchestrators[j]); + } + free(rt); + return NULL; + } + } + + // Initialize scheduler (heap_size = per-ring heap size) + if (!pto2_scheduler_init(&rt->scheduler, rt->sm_handle)) { + for (int i = 0; i < orch_count; i++) { + pto2_orchestrator_destroy(&rt->orchestrators[i]); + } + free(rt); + return NULL; + } + + // Connect all orchestrators to scheduler + for (int i = 0; i < orch_count; i++) { + pto2_orchestrator_set_scheduler(&rt->orchestrators[i], &rt->scheduler); + } + + return rt; +} + +void pto2_runtime_destroy(PTO2Runtime* rt) { + if (!rt) return; + + pto2_scheduler_destroy(&rt->scheduler); + for (int i = 0; i < rt->orch_count; i++) { + pto2_orchestrator_destroy(&rt->orchestrators[i]); + } + + if (rt->gm_heap_owned && rt->gm_heap) { + free(rt->gm_heap); + } + + if (rt->sm_handle) { + pto2_sm_destroy(rt->sm_handle); + } + + free(rt); +} + +void pto2_runtime_set_mode(PTO2Runtime* rt, PTO2RuntimeMode mode) { + if (rt) { + rt->mode = mode; + } +} diff --git a/src/a2a3/runtime/tensormap_and_ringbuffer_unmodified/runtime/pto_runtime2.h b/src/a2a3/runtime/tensormap_and_ringbuffer_unmodified/runtime/pto_runtime2.h new file mode 100644 index 000000000..31b28acd4 --- /dev/null +++ b/src/a2a3/runtime/tensormap_and_ringbuffer_unmodified/runtime/pto_runtime2.h @@ -0,0 +1,225 @@ +/** + * PTO Runtime2 - Main Interface + * + * This is the main header for the PTO Runtime2 system. + * It provides a unified API for task graph construction and execution. 
+ * + * Key Features: + * - Ring buffer based memory management (zero allocation overhead) + * - Lazy invalidation TensorMap for dependency discovery + * - Scope-based buffer lifecycle management + * - Per-task spinlocks for concurrent fanout updates + * - Orchestrator-Scheduler decoupling via shared memory + * + * Usage: + * 1. Create runtime: pto2_runtime_create() + * 2. Build task graph in orchestration function: + * - pto2_scope_begin() / pto2_scope_end() + * - pto2_submit_task() + * 3. Mark orchestration complete: pto2_orchestrator_done() + * 4. Destroy runtime: pto2_runtime_destroy() + * + * Based on: docs/RUNTIME_LOGIC.md + */ + +#ifndef PTO_RUNTIME2_H +#define PTO_RUNTIME2_H + +#include "pto_runtime2_types.h" +#include "pto_submit_types.h" +#include "pto_shared_memory.h" +#include "pto_ring_buffer.h" +#include "pto_tensormap.h" +#include "pto_scheduler.h" +#include "pto_orchestrator.h" + +// Maximum number of orchestrator threads supported +constexpr int PTO2_MAX_ORCH_THREADS = 4; + +// ============================================================================= +// Runtime Context +// ============================================================================= + +/** + * Runtime execution mode + */ +enum PTO2RuntimeMode { + PTO2_MODE_EXECUTE = 0, // Execute tasks on workers + PTO2_MODE_SIMULATE = 1, // Simulate task execution with cycle counting + PTO2_MODE_GRAPH_ONLY = 2 // Build graph only, no execution +}; + +/** + * Function-pointer ops table for runtime operations. + * + * The orchestration .so calls runtime functions through this table + * (via pto_orchestration_api.h inline wrappers), so it has zero link + * dependencies on runtime .cpp files. 
+ */ +typedef struct PTO2Runtime PTO2Runtime; // forward declare for ops signatures + +struct PTO2RuntimeOps { + TaskOutputTensors (*submit_task)(PTO2Runtime* rt, const MixedKernels& mixed_kernels, + const Arg& args); + void (*scope_begin)(PTO2Runtime* rt); + void (*scope_end)(PTO2Runtime* rt); + void (*orchestration_done)(PTO2Runtime* rt); + bool (*is_fatal)(PTO2Runtime* rt); + + // Logging (populated by runtime, called by orchestration) + void (*log_error)(const char* func, const char* fmt, ...); + void (*log_warn)(const char* func, const char* fmt, ...); + void (*log_info)(const char* func, const char* fmt, ...); + void (*log_debug)(const char* func, const char* fmt, ...); + void (*log_always)(const char* func, const char* fmt, ...); + + // Cross-layer data access (orchestration reads/writes tensor values via runtime) + // Placed after logging to avoid shifting hot-path field offsets. + uint64_t (*get_tensor_data)(PTO2Runtime* rt, const Tensor& tensor, + uint32_t ndims, const uint32_t indices[]); + void (*set_tensor_data)(PTO2Runtime* rt, const Tensor& tensor, + uint32_t ndims, const uint32_t indices[], + uint64_t value); +}; + +/** + * PTO Runtime2 context + * + * Contains all state for orchestration and scheduling. + * In simulated mode, runs in single process with shared address space. 
+ */ +struct PTO2Runtime { + // Ops table (first field — used by orchestration .so via function pointers) + const PTO2RuntimeOps* ops; + + // Components + PTO2SharedMemoryHandle* sm_handle; + PTO2OrchestratorState orchestrators[PTO2_MAX_ORCH_THREADS]; + int orch_count; // Number of active orchestrator states + PTO2SchedulerState scheduler; + + // GM Heap for output buffers + void* gm_heap; + uint64_t gm_heap_size; + bool gm_heap_owned; // True if we allocated it + + // Mode + PTO2RuntimeMode mode; + + // Statistics + int64_t total_cycles; +}; + +// ============================================================================= +// Runtime Lifecycle API +// ============================================================================= + +/** + * Create a new runtime instance + * + * @param mode Execution mode + * @return Runtime context, or NULL on failure + */ +PTO2Runtime* pto2_runtime_create(PTO2RuntimeMode mode); + +/** + * Create runtime with custom sizes + * + * @param mode Execution mode + * @param task_window_size Number of task slots + * @param heap_size Size of GM heap + * @return Runtime context, or NULL on failure + */ +PTO2Runtime* pto2_runtime_create_custom(PTO2RuntimeMode mode, + uint64_t task_window_size, + uint64_t heap_size, + int32_t dep_pool_capacity = PTO2_DEP_LIST_POOL_SIZE); + +/** + * Create runtime from existing shared memory and GM heap (e.g. on device). + * Does not allocate sm_handle or gm_heap; caller owns them. + * + * @param mode Execution mode + * @param sm_handle Pre-created shared memory handle (e.g. 
from pto2_sm_create_from_buffer) + * @param gm_heap GM heap base for output buffers (or NULL if not used) + * @param heap_size GM heap size in bytes + * @return Runtime context, or NULL on failure + */ +PTO2Runtime* pto2_runtime_create_from_sm(PTO2RuntimeMode mode, + PTO2SharedMemoryHandle* sm_handle, + void* gm_heap, + uint64_t heap_size, + int orch_count = 1, + int32_t dep_pool_capacity = PTO2_DEP_LIST_POOL_SIZE); + +/** + * Destroy runtime and free all resources + */ +void pto2_runtime_destroy(PTO2Runtime* rt); + +/** + * Set execution mode + */ +void pto2_runtime_set_mode(PTO2Runtime* rt, PTO2RuntimeMode mode); + +/** + * Set the orchestrator index for the current thread. + * Must be called before any orchestration API calls on a given thread. + */ +void pto2_set_orch_thread_idx(int idx); + +// ============================================================================= +// Orchestration API (called by orchestration function) +// ============================================================================= + +/** + * Begin a new scope + * + * All tasks submitted within this scope will have their lifetime + * bounded by the scope. When scope_end() is called, the scope + * releases its reference to all enclosed tasks. + */ +void pto2_rt_scope_begin(PTO2Runtime* rt); + +/** + * End current scope + * + * Releases scope reference for all tasks submitted since scope_begin(). + * Tasks whose refcount reaches zero will have their buffers released. + */ +void pto2_rt_scope_end(PTO2Runtime* rt); + +/** + * Mark orchestration as complete + * + * Signals that no more tasks will be submitted. + */ +void pto2_rt_orchestration_done(PTO2Runtime* rt); + +/** + * Cross-layer data access: read a tensor value by waiting for its producer. + */ +uint64_t pto2_get_tensor_data(PTO2Runtime* rt, const Tensor& tensor, + uint32_t ndims, const uint32_t indices[]); + +/** + * Cross-layer data access: write a value to a tensor at given indices. 
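The cross-layer accessors above block by spin-waiting on the producer task, bounded by `PTO2_TENSOR_DATA_TIMEOUT_CYCLES` from `pto_runtime2_types.h`. The sketch below illustrates that wait pattern only; `wait_for_producer` is a hypothetical name, and the hardware cycle counter is stubbed as a plain increment rather than `get_sys_cnt_aicpu()`.

```cpp
#include <atomic>
#include <cstdint>

// Illustrative bounded spin-wait, mirroring how pto2_get_tensor_data is
// documented to wait for its producer. Not part of this patch.
static bool wait_for_producer(const std::atomic<bool>& done, uint64_t timeout_cycles) {
    uint64_t elapsed = 0;  // stand-in for a cycle-counter delta on device
    while (!done.load(std::memory_order_acquire)) {
        if (++elapsed >= timeout_cycles) {
            return false;  // timed out; the runtime would report an error code
        }
    }
    return true;
}
```

The acquire load matters: it orders the readiness check before the subsequent read of the tensor bytes the producer wrote.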
+ * Waits for producer completion (WAW) and all consumers (WAR) via TensorMap. + * See set_tensor_data in pto_orchestration_api.h for full documentation. + */ +void pto2_set_tensor_data(PTO2Runtime* rt, const Tensor& tensor, + uint32_t ndims, const uint32_t indices[], + uint64_t value); + +/** + * Slim config struct exported by orchestration .so via aicpu_orchestration_config(). + * Shared definition with pto_orchestration_api.h (same layout, guarded). + */ +#ifndef PTO2_ORCHESTRATION_CONFIG_DEFINED +#define PTO2_ORCHESTRATION_CONFIG_DEFINED +struct PTO2OrchestrationConfig { + int expected_arg_count; +}; +#endif + +#endif // PTO_RUNTIME2_H diff --git a/src/a2a3/runtime/tensormap_and_ringbuffer_unmodified/runtime/pto_runtime2_types.h b/src/a2a3/runtime/tensormap_and_ringbuffer_unmodified/runtime/pto_runtime2_types.h new file mode 100644 index 000000000..8f89afb25 --- /dev/null +++ b/src/a2a3/runtime/tensormap_and_ringbuffer_unmodified/runtime/pto_runtime2_types.h @@ -0,0 +1,557 @@ +/* + * Copyright (c) PyPTO Contributors. + * This program is free software, you can redistribute it and/or modify it under the terms and conditions of + * CANN Open Software License Agreement Version 2.0 (the "License"). + * Please refer to the License for details. You may not use this file except in compliance with the License. + * THIS SOFTWARE IS PROVIDED ON AN "AS IS" BASIS, WITHOUT WARRANTIES OF ANY KIND, EITHER EXPRESS OR IMPLIED, + * INCLUDING BUT NOT LIMITED TO NON-INFRINGEMENT, MERCHANTABILITY, OR FITNESS FOR A PARTICULAR PURPOSE. + * See LICENSE in the root of the software repository for the full text of the License. 
 + * -----------------------------------------------------------------------------------------------------------
+ */
+
+/**
+ * PTO Runtime2 - Core Type Definitions
+ *
+ * This header defines all fundamental types used by the PTO Runtime2 system:
+ * - Configuration constants
+ * - Worker types and task states
+ * - Tensor regions and task parameters
+ * - Task descriptors with fanin/fanout tracking
+ * - Dependency list entries
+ *
+ * Based on: docs/RUNTIME_LOGIC.md
+ */
+
+#ifndef SRC_A2A3_RUNTIME_TENSORMAP_AND_RINGBUFFER_RUNTIME_PTO_RUNTIME2_TYPES_H_
+#define SRC_A2A3_RUNTIME_TENSORMAP_AND_RINGBUFFER_RUNTIME_PTO_RUNTIME2_TYPES_H_
+
+#include <cstddef>
+#include <cstdint>
+#include <cstring>
+
+#include <atomic>
+
+#include "pto2_dispatch_payload.h"
+#include "pto_submit_types.h"
+#include "pto_task_id.h"
+#include "pto_types.h"
+
+// =============================================================================
+// Profiling Configuration
+// =============================================================================
+
+#ifndef PTO2_PROFILING
+#define PTO2_PROFILING 1
+#endif
+
+#ifndef PTO2_ORCH_PROFILING
+#define PTO2_ORCH_PROFILING 0
+#endif
+
+#ifndef PTO2_SCHED_PROFILING
+#define PTO2_SCHED_PROFILING 0
+#endif
+
+#ifndef PTO2_TENSORMAP_PROFILING
+#define PTO2_TENSORMAP_PROFILING 0
+#endif
+
+#if PTO2_ORCH_PROFILING && !PTO2_PROFILING
+#error "PTO2_ORCH_PROFILING requires PTO2_PROFILING=1"
+#endif
+
+#if PTO2_SCHED_PROFILING && !PTO2_PROFILING
+#error "PTO2_SCHED_PROFILING requires PTO2_PROFILING=1"
+#endif
+
+#if PTO2_TENSORMAP_PROFILING && !PTO2_ORCH_PROFILING
+#error "PTO2_TENSORMAP_PROFILING requires PTO2_ORCH_PROFILING=1"
+#endif
+
+// =============================================================================
+// AICPU Error Codes (written to shared memory for Host-side diagnosis)
+// =============================================================================
+
+// Orchestrator errors (1-99): detected in orchestrator thread
+#define PTO2_ERROR_NONE 0
+#define 
PTO2_ERROR_SCOPE_DEADLOCK 1 +#define PTO2_ERROR_HEAP_RING_DEADLOCK 2 +#define PTO2_ERROR_FLOW_CONTROL_DEADLOCK 3 +#define PTO2_ERROR_DEP_POOL_OVERFLOW 4 +#define PTO2_ERROR_INVALID_ARGS 5 // Arg construction error (invalid args) +#define PTO2_ERROR_DEPENDENCY_OVERFLOW 6 // Too many unique fanin dependencies for one task + +// Scheduler errors (100+): detected in scheduler threads +#define PTO2_ERROR_SCHEDULER_TIMEOUT 100 + +// ============================================================================= +// Configuration Constants +// ============================================================================= + +// Task management +// NOTE: PTO2_TASK_WINDOW_SIZE is now a per-ring default value. +// Actual window size is passed at runtime to pto2_runtime_create_threaded_custom(). +// Use pto2_task_slot(sched, task_id) for slot calculation. +#define PTO2_TASK_WINDOW_SIZE 16384 // Default per-ring task window size (power of 2) + +// Multi-ring: number of independent ring layers (HeapRing + TaskRing + DepPool per layer) +// Scope depth maps to ring index via: min(scope_depth, PTO2_MAX_RING_DEPTH - 1) +#define PTO2_MAX_RING_DEPTH 4 + +// Memory pools (per-ring defaults; total = value × PTO2_MAX_RING_DEPTH) +#define PTO2_HEAP_SIZE (256 * 1024 * 1024) // 256MB per ring (1GB total) +#define PTO2_DEP_LIST_POOL_SIZE 16384 // Per-ring dependency list pool entries +#define PTO2_TENSORMAP_POOL_SIZE (65536) // TensorMap entry pool +#define PTO2_TENSORMAP_NUM_BUCKETS 4096 // Power of 2 for fast hash (4096×8B=32KB fits L1) + +// Scope management +#define PTO2_MAX_SCOPE_DEPTH 64 // Maximum nesting depth +#define PTO2_SCOPE_TASKS_INIT_CAP 65536 // Initial capacity for scope task buffer + +// Ready queue +#define PTO2_READY_QUEUE_SIZE 65536 // Per-shape queue size + +// Memory alignment +#define PTO2_ALIGN_SIZE 64 // Cache line alignment +#define PTO2_PACKED_OUTPUT_ALIGN 1024 // Each output in packed buffer aligned to 1024B; gap is padding +#define PTO2_ALIGN_UP(x, align) (((x) + 
(align) - 1) & ~((align) - 1)) + +// TensorMap cleanup interval +#define PTO2_TENSORMAP_CLEANUP_INTERVAL 64 // Cleanup every N retired tasks +#define PTO2_DEP_POOL_CLEANUP_INTERVAL 64 // Cleanup every N retired tasks + +// get_tensor_data/set_tensor_data spin wait timeout in cycles. +// ~10s on hardware (1.5 GHz counter), ~10s on simulation (chrono-based). +constexpr uint64_t PTO2_TENSOR_DATA_TIMEOUT_CYCLES = 15 * 1000 * 1000 * 1000ULL; + +// ============================================================================= +// Multi-Ring task_id Encoding +// ============================================================================= + +/** + * TaskId: defined in pto_task_id.h (included above). + */ + +// ============================================================================= +// Worker Types +// ============================================================================= + +/** + * Worker type enumeration + * Each worker type has its own ready queue for load balancing + */ +typedef enum { + PTO2_WORKER_CUBE = 0, // AICore CUBE unit (matrix ops) + PTO2_WORKER_VECTOR = 1, // AICore VECTOR unit (element-wise ops) + PTO2_WORKER_AI_CPU = 2, // AI_CPU (scalar ops, control flow) + PTO2_WORKER_ACCELERATOR = 3, // Fixed-function accelerators (DMA, etc.) 
+ PTO2_NUM_WORKER_TYPES = 4 +} PTO2WorkerType; + +// ============================================================================= +// Task States +// ============================================================================= + +/** + * Task state enumeration + * + * State transitions: + * PENDING -> READY -> RUNNING -> COMPLETED -> CONSUMED + * + * Conditions: + * PENDING->READY: fanin_refcount == fanin_count + * COMPLETED->CONSUMED: fanout_refcount == fanout_count && state == COMPLETED + */ +typedef enum { + PTO2_TASK_PENDING = 0, // Waiting for dependencies (fanin_refcount < fanin_count) + PTO2_TASK_READY = 1, // All dependencies satisfied, waiting in ready queue + PTO2_TASK_RUNNING = 2, // Currently executing on a worker + PTO2_TASK_COMPLETED = 3, // Execution finished, output may still be in use + PTO2_TASK_CONSUMED = 4 // Output fully consumed, buffers can be released +} PTO2TaskState; + +// ============================================================================= +// Logical Tensor (for view/reshape/transpose operations) +// ============================================================================= + +/** + * Maximum dimensions supported for logical tensors + */ +#define PTO2_MAX_TENSOR_DIM 8 + +/** + * Maximum depth of layout history for HBB overlap detection + * Simple (contiguous) tensor has depth=1, non-contiguous has depth>1 + */ +#define PTO2_MAX_LAYOUT_DEPTH 8 + +/** + * Layout operation type for HBB + */ +typedef enum { + PTO2_LAYOUT_VIEW = 0, // View/slice: records bounding box + PTO2_LAYOUT_RESHAPE = 1, // Reshape: records new shape + PTO2_LAYOUT_TRANSPOSE = 2 // Transpose: records permutation +} PTO2LayoutOpType; + +/** + * Layout operation entry for HBB + * Each entry records one derivation step from the parent tensor. 
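The transition conditions in the task-state comment above can be modeled in a few lines. This is an illustrative sketch only, not runtime code from the patch: `MiniSlot`, `is_ready`, and `is_consumed` are invented names that mirror the documented rules (PENDING to READY when `fanin_refcount == fanin_count`; COMPLETED to CONSUMED when the fanout refcount, which includes the owning scope's reference, is exhausted).

```cpp
// Illustrative stand-in for PTO2TaskSlotState; not part of this patch.
struct MiniSlot {
    int fanin_refcount = 0;   // completed producers so far
    int fanin_count = 0;      // total producer dependencies
    int fanout_refcount = 0;  // released references so far
    int fanout_count = 1;     // 1 (owning scope) + number of consumers
    bool completed = false;   // stands in for state == COMPLETED
};

// PENDING -> READY: all producers have completed.
inline bool is_ready(const MiniSlot& s) { return s.fanin_refcount == s.fanin_count; }

// COMPLETED -> CONSUMED: every reference, including the scope's, released.
inline bool is_consumed(const MiniSlot& s) {
    return s.completed && s.fanout_refcount == s.fanout_count;
}
```

Note how the scope's own reference in `fanout_count` is what keeps a finished task's buffers alive until `scope_end`, matching the lifetime model described in the design notes.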
+ */ +typedef struct { + PTO2LayoutOpType type; + union { + struct { // PTO2_LAYOUT_VIEW + int64_t bbox_min; // First byte accessed + int64_t bbox_max; // Last byte accessed + } view; + struct { // PTO2_LAYOUT_RESHAPE + int32_t ndim; + int64_t shape[PTO2_MAX_TENSOR_DIM]; + } reshape; + struct { // PTO2_LAYOUT_TRANSPOSE + int32_t ndim; + int32_t perm[PTO2_MAX_TENSOR_DIM]; + } transpose; + }; +} PTO2LayoutOp; + +/** + * Tensor extraction type (for tracking how tensor was created) + */ +typedef enum { + PTO2_TENSOR_RAW = 0, // Original raw tensor (owns storage) + PTO2_TENSOR_VIEW = 1, // view() - subset selection, shared storage + PTO2_TENSOR_RESHAPE = 2, // reshape() - shape change, shared storage + PTO2_TENSOR_TRANSPOSE = 3, // transpose() - dimension permute, shared storage + PTO2_TENSOR_DEEP_VIEW = 4, // deep_view() - copied subset, new storage + PTO2_TENSOR_DEEP_RESHAPE = 5, // deep_reshape() - copied reshape, new storage + PTO2_TENSOR_DEEP_TRANSPOSE = 6 // deep_transpose() - copied transpose, new storage +} PTO2TensorExtractionType; + +/** + * Raw tensor (storage provider) + * + * The raw tensor owns the actual memory allocation. + * Multiple logical tensors can share the same raw tensor (aliasing). + */ +typedef struct { + void* base_ptr; // Base pointer of allocated memory + int64_t total_size; // Total size in bytes + int32_t refcount; // Number of logical tensors referencing this storage + // (for memory management, 0 = can be freed) +} PTO2RawTensor; + +/** + * Logical tensor structure + * + * A "view" into raw tensor storage with specific layout. + * Supports multi-dimensional tensors with strides (for view/reshape/transpose). 
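The stride-based addressing this struct documents can be checked with a small sketch. `element_offset` is a hypothetical helper, not part of the patch; it implements the rule byte_offset = storage_offset + sum(i_d * strides[d]) stated in the comment.

```cpp
#include <cstdint>

// Sketch of the PTO2LogicalTensor addressing rule:
// byte_offset = storage_offset + sum(indices[d] * strides[d]).
static int64_t element_offset(int64_t storage_offset, const int64_t* indices,
                              const int64_t* strides, int32_t ndim) {
    int64_t off = storage_offset;
    for (int32_t d = 0; d < ndim; d++) {
        off += indices[d] * strides[d];
    }
    return off;
}
```

With 4-byte elements, a contiguous (3,4) tensor has strides {16, 4}, and its transposed (4,3) view has strides {4, 16}; element [1,2] of the former and [2,1] of the latter resolve to the same storage byte, which is exactly why overlap detection must reason about strides rather than shapes.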
+ * + * Memory footprint is determined by: + * - storage_offset: byte offset from raw_base to first element + * - shape[d]: number of elements in dimension d + * - strides[d]: byte offset between consecutive elements in dimension d + * + * For element at indices [i0, i1, ..., i_{n-1}]: + * byte_offset = storage_offset + sum(i_d * strides[d]) + * + * Examples: + * - Contiguous row-major (3,4): shape=[3,4], strides=[4*elem_size, elem_size] + * - Transposed (4,3): shape=[4,3], strides=[elem_size, 4*elem_size] + * - Sliced [1:3, 1:3]: offset adjusted, shape=[2,2], strides unchanged + */ +typedef struct { + // === Raw tensor reference (shared storage) === + void* raw_base; // Pointer to raw tensor's base (for aliasing check) + int64_t raw_total_size; // Total size of raw tensor in bytes + + // === Storage offset === + int64_t storage_offset; // Byte offset from raw_base to first element + + // === Shape and strides === + int64_t shape[PTO2_MAX_TENSOR_DIM]; // Size in each dimension + int64_t strides[PTO2_MAX_TENSOR_DIM]; // Byte stride in each dimension + int32_t ndim; // Number of dimensions (0 = scalar) + + // === Precomputed bounding box (for fast overlap detection) === + int64_t min_byte_offset; // First byte accessed (relative to raw_base) + int64_t max_byte_offset; // Last byte accessed (relative to raw_base) + + // === Element info === + int64_t elem_size; // Size of each element in bytes + int64_t numel; // Total number of elements + + // === Extraction tracking === + PTO2TensorExtractionType extraction_type; // How this tensor was created + bool is_contiguous; // True if memory is contiguous (no gaps) + // Equivalent to layout_depth == 1 + + // === Layout history for HBB overlap detection === + int32_t layout_depth; // Number of layout ops (1=simple) + PTO2LayoutOp layout_ops[PTO2_MAX_LAYOUT_DEPTH]; // Derivation history +} PTO2LogicalTensor; + +// ============================================================================= +// Dependency List Entry +// 
============================================================================= + +/** + * Dependency list entry (singly-linked list node) + * Stored in DepListPool ring buffer + * + * Used for both fanin_list and fanout_list + */ +struct PTO2TaskSlotState; // Forward declaration +struct PTO2DepListEntry { + PTO2TaskSlotState* slot_state; // Consumer slot state (direct pointer) + PTO2DepListEntry* next; // next entry +}; + +// ============================================================================= +// Task Descriptor +// ============================================================================= + +/** + * Task descriptor structure (shared memory) + * + * Stored in the TaskDescriptor ring buffer in shared memory. + * Contains static identification and buffer pointers only. + * Dynamic scheduling state (fanin/fanout/task_state) is in PTO2TaskSlotState. + * + * Fields set by Orchestrator at submission, read by Scheduler for dispatch. + */ +struct PTO2TaskDescriptor { + // Mixed-task identification (encodes ring_id in upper 32 bits) + PTO2TaskId task_id; // raw: (ring_id << 32) | local_id + + // Per-slot kernel IDs (INVALID_KERNEL_ID = inactive) + int32_t kernel_id[PTO2_SUBTASK_SLOT_COUNT]; + + // Packed output buffer (all outputs packed into single contiguous buffer) + void* packed_buffer_base; // Start of packed buffer in GM Heap + void* packed_buffer_end; // End of packed buffer (for heap reclamation) +}; + +// ============================================================================= +// Per-Slot Scheduling State +// ============================================================================= + +/** + * Task payload data (cold path - only accessed during orchestration and dispatch) + * + * Layout: metadata (counts, fanin pointers) packed in the first 3 cache lines, + * followed by bulk tensor and scalar data. 
This gives sequential write access + * during orchestration and groups scheduler-hot fields (fanin_actual_count + + * fanin_slot_states) together for on_task_release. + */ +struct PTO2TaskPayload { + // === Cache lines 0-2 (192B) — metadata === + int32_t tensor_count{0}; + int32_t scalar_count{0}; + int32_t fanin_actual_count{0}; // Actual fanin count (without the +1 redundance) + int32_t _reserved{0}; // Reserved (dep_pool_mark moved to SlotState for local access) + PTO2TaskSlotState* fanin_slot_states[PTO2_MAX_INPUTS]; // Producer slot states (used by on_task_release) + // === Cache lines 3-34 (2048B) — tensors (alignas(64) forces alignment) === + Tensor tensors[MAX_TENSOR_ARGS]; + // === Cache lines 35-50 (1024B) — scalars === + uint64_t scalars[MAX_SCALAR_ARGS]; + + // Layout verification (size checks that don't need offsetof). + static_assert(sizeof(Tensor) == 128, "Tensor must be 2 cache lines"); + static_assert(MAX_SCALAR_ARGS * sizeof(uint64_t) == 1024, "scalar region must be 1024B (16 cache lines)"); + + /** + * Initialize payload: copy tensors, store scalars. 
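The payload layout above is pure arithmetic over `PTO2_ALIGN_UP` and the fixed region sizes. The constants below are inferences from the static_asserts, not quoted from the patch: 192B of metadata (3 cache lines) and a 2048B tensor region imply `MAX_TENSOR_ARGS` of 16 given the asserted 128B `Tensor`.

```cpp
#include <cstdint>

// PTO2_ALIGN_UP as defined in this header (power-of-two align only).
#define PTO2_ALIGN_UP(x, align) (((x) + (align) - 1) & ~((align) - 1))

// Region sizes inferred from the static_asserts in this header.
constexpr uint64_t kMetadataBytes = 3 * 64;    // cache lines 0-2
constexpr uint64_t kTensorBytes   = 16 * 128;  // 16 tensors x 2 cache lines each
constexpr uint64_t kScalarOffset  = kMetadataBytes + kTensorBytes;  // = 2240
```

The rounded `memcpy` in `init()` relies on the same macro: copying `PTO2_ALIGN_UP(n * 8, 64)` bytes stays within the 1024B scalar region for any valid scalar count.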
 + *
+ * For each param slot, the tensor source is determined by TensorArgType:
+ * - OUTPUT -> materialized from its TensorCreateInfo into the packed
+ * output buffer at base_addr + offsets[i]
+ * - INPUT / INOUT -> copied from the caller-provided tensor
+ *
+ * @param args Task arguments (tensors + scalars)
+ * @param result Receives the materialized output tensors
+ * @param base_addr Packed output buffer base in GM Heap
+ * @param offsets Per-output byte offsets into the packed buffer
+ * @param buffer_sizes Per-output buffer sizes in bytes
+ */
+ void init(
+ const Arg& args, TaskOutputTensors& result, void* base_addr, uint64_t offsets[], uint64_t buffer_sizes[]) {
+ tensor_count = args.tensor_count();
+ scalar_count = args.scalar_count();
+
+ for (int32_t i = 0; i < args.tensor_count(); i++) {
+ if (args.tag(i) != TensorArgType::OUTPUT) {
+ tensors[i].copy(*args.tensor(i).ptr);
+ } else {
+ tensors[i].init_from_create_info(*args.tensor(i).create_info,
+ reinterpret_cast<void*>(reinterpret_cast<uintptr_t>(base_addr) + offsets[i]),
+ buffer_sizes[i]);
+ result.materialize_output(tensors[i]);
+ }
+ tensors[i].update_start_offset();
+ }
+ // Round up to cache line boundary. Both source and destination scalar
+ // arrays are 1024B, so the rounded copy cannot overrun.
+ // Eliminates branches; extra bytes within the same CL have zero additional cost.
+ memcpy(scalars, args.scalars(), PTO2_ALIGN_UP(args.scalar_count() * sizeof(uint64_t), 64));
+ }
+};
+
+// PTO2TaskPayload layout verification (offsetof requires complete type).
+static_assert(offsetof(PTO2TaskPayload, tensors) == 192, "tensors must start at byte 192 (cache line 3)");
+static_assert(offsetof(PTO2TaskPayload, scalars) == 192 + MAX_TENSOR_ARGS * sizeof(Tensor),
+ "scalars must immediately follow tensors");
+
+/**
+ * Per-task slot scheduling state (scheduler-private, NOT in shared memory)
+ *
+ * Consolidates all hot-path scheduling fields into a single cache-friendly
+ * structure (64 bytes = one cache line). Accessing any field of a task's
+ * slot state brings all related fields into the same cache line. 
 + *
+ * Concurrency notes:
+ * - fanout_head, fanout_count protected by fanout_lock (per-task spinlock)
+ * - fanin_count set once at submission, read-only after (hot path for ready check)
+ * - task_state, fanin_refcount, fanout_refcount updated atomically
+ */
+struct alignas(64) PTO2TaskSlotState {
+ // Fanout lock + list (accessed together under lock in on_task_complete)
+ std::atomic<int32_t> fanout_lock; // Per-task spinlock (0=unlocked, 1=locked)
+ int32_t fanout_count; // 1 (owning scope) + number of consumers
+
+ PTO2DepListEntry* fanout_head; // Pointer to first fanout entry (nullptr = empty)
+
+ // Task state (completion, consumed check, ready check)
+ std::atomic<PTO2TaskState> task_state; // PENDING/READY/RUNNING/COMPLETED/CONSUMED
+
+ // Fanin (accessed together in release_fanin_and_check_ready)
+ std::atomic<int32_t> fanin_refcount; // Dynamic: counts completed producers
+ int32_t fanin_count; // Number of producer dependencies (set once)
+
+ // Fanout refcount (accessed with fanout_count in check_and_handle_consumed)
+ std::atomic<int32_t> fanout_refcount; // Dynamic: counts released references
+
+ PTO2TaskPayload* payload;
+
+ PTO2TaskDescriptor* task;
+
+ // Hot-path completion fields (moved from TaskDescriptor to avoid cross-struct access)
+ uint8_t active_mask; // Bitmask of active subtask slots (set once)
+ std::atomic<uint8_t> subtask_done_mask; // Deprecated: superseded by completed_subtasks
+ uint8_t ring_id; // Ring layer this task belongs to (for per-ring reclamation)
+ int32_t dep_pool_mark{0}; // Dep pool top after this task's submission (orchestrator-only, local memory)
+
+ // SPMD multi-block (occupies the 8 tail bytes previously implicit padding)
+ std::atomic<int16_t> completed_subtasks{0}; // Each core completion increments by 1
+ int16_t total_required_subtasks{0}; // = block_num * popcount(active_mask)
+ int16_t block_num{1}; // Total logical blocks (set by orchestrator)
+ int16_t next_block_idx{0}; // Next block to dispatch (scheduler state)
+};
+
+static_assert(sizeof(PTO2TaskSlotState) 
== 64); + +// ============================================================================= +// Cycle Cost Function Type +// ============================================================================= + +/** + * Cycle cost function pointer type + * Returns estimated cycle count for the InCore function + */ +typedef int64_t (*PTO2CycleCostFunc)(void** args, int32_t num_args); + +// ============================================================================= +// InCore Function Type +// ============================================================================= + +/** + * InCore function signature + * All InCore functions must match this signature + */ +typedef void (*PTO2InCoreFunc)(void** args, int32_t num_args); + +// ============================================================================= +// Utility Macros +// ============================================================================= + +/** + * Memory barrier macros for different architectures + */ +#if defined(__aarch64__) +#define PTO2_MEMORY_BARRIER() __asm__ __volatile__("dmb sy" ::: "memory") +#elif defined(__x86_64__) +#define PTO2_MEMORY_BARRIER() __asm__ __volatile__("mfence" ::: "memory") +#else +#define PTO2_MEMORY_BARRIER() __sync_synchronize() +#endif + +// Spin-wait hint for AICPU threads. On real hardware the AICPU has dedicated +// ARM A55 cores — no OS yield is needed, so the hint is a no-op. In simulation +// all threads share host CPU cores, so we yield to prevent starvation. +// This header is also compiled into the Host .so (for struct definitions only), +// where the hint is never called — the fallback no-op keeps Host builds clean. +#if __has_include("spin_hint.h") +#include "spin_hint.h" +#else +#define SPIN_WAIT_HINT() ((void)0) +#endif + +// ============================================================================= +// Per-task fanout spinlock helpers +// +// Used by BOTH the orchestrator (pto_orchestrator.cpp) and the scheduler +// (aicpu_executor.cpp). 
Placing them here ensures both translation units use +// identical acquire/release semantics. +// +// The fanout_lock MUST be held whenever reading or writing fanout_head / +// fanout_count, because the orchestrator adds consumers concurrently with the +// scheduler traversing the list after task completion. +// ============================================================================= + +#if PTO2_ORCH_PROFILING || PTO2_SCHED_PROFILING +#include "aicpu/device_time.h" +#endif + +#if PTO2_ORCH_PROFILING || PTO2_SCHED_PROFILING +static inline void pto2_fanout_lock(PTO2TaskSlotState& slot_state, uint64_t& atomic_count, uint64_t& wait_cycle) { + uint64_t t0 = get_sys_cnt_aicpu(); + bool contended = false; + uint32_t atomic_ops = 0; + + for (;;) { + while (slot_state.fanout_lock.load(std::memory_order_acquire) != 0) { + contended = true; + atomic_ops++; // each load = 1 atomic + SPIN_WAIT_HINT(); + } + int32_t expected = 0; + if (slot_state.fanout_lock.compare_exchange_weak( + expected, 1, std::memory_order_acquire, std::memory_order_relaxed)) { + atomic_ops++; // successful CAS = 1 atomic + atomic_count += atomic_ops; + if (contended) { + wait_cycle += (get_sys_cnt_aicpu() - t0); + } + return; + } + contended = true; + atomic_ops++; // failed CAS = 1 atomic + } +} +#endif + +static inline void pto2_fanout_lock(PTO2TaskSlotState& slot_state) { + for (;;) { + while (slot_state.fanout_lock.load(std::memory_order_acquire) != 0) { + SPIN_WAIT_HINT(); + } + int32_t expected = 0; + if (slot_state.fanout_lock.compare_exchange_weak( + expected, 1, std::memory_order_acquire, std::memory_order_relaxed)) { + return; + } + } +} + +static inline void pto2_fanout_unlock(PTO2TaskSlotState& slot_state) { + slot_state.fanout_lock.store(0, std::memory_order_release); +} + +#endif // SRC_A2A3_RUNTIME_TENSORMAP_AND_RINGBUFFER_RUNTIME_PTO_RUNTIME2_TYPES_H_ diff --git a/src/a2a3/runtime/tensormap_and_ringbuffer_unmodified/runtime/pto_scheduler.cpp 
b/src/a2a3/runtime/tensormap_and_ringbuffer_unmodified/runtime/pto_scheduler.cpp
new file mode 100644
index 000000000..2fa7b0cd3
--- /dev/null
+++ b/src/a2a3/runtime/tensormap_and_ringbuffer_unmodified/runtime/pto_scheduler.cpp
@@ -0,0 +1,220 @@
+/**
+ * PTO Runtime2 - Scheduler Implementation
+ *
+ * Implements scheduler state management, ready queues, and task lifecycle.
+ *
+ * Based on: docs/RUNTIME_LOGIC.md
+ */
+
+#include "pto_scheduler.h"
+#include <cinttypes>
+#include <cstdlib>
+#include <new>
+#include <utility>
+#include "common/unified_log.h"
+
+// =============================================================================
+// Scheduler Profiling Counters
+// =============================================================================
+
+#if PTO2_SCHED_PROFILING
+#include "common/platform_config.h"
+
+uint64_t g_sched_lock_cycle[PLATFORM_MAX_AICPU_THREADS] = {};
+uint64_t g_sched_fanout_cycle[PLATFORM_MAX_AICPU_THREADS] = {};
+uint64_t g_sched_fanin_cycle[PLATFORM_MAX_AICPU_THREADS] = {};
+uint64_t g_sched_self_consumed_cycle[PLATFORM_MAX_AICPU_THREADS] = {};
+uint64_t g_sched_lock_wait_cycle[PLATFORM_MAX_AICPU_THREADS] = {};
+uint64_t g_sched_push_wait_cycle[PLATFORM_MAX_AICPU_THREADS] = {};
+uint64_t g_sched_pop_wait_cycle[PLATFORM_MAX_AICPU_THREADS] = {};
+uint64_t g_sched_lock_atomic_count[PLATFORM_MAX_AICPU_THREADS] = {};
+uint64_t g_sched_fanout_atomic_count[PLATFORM_MAX_AICPU_THREADS] = {};
+uint64_t g_sched_fanin_atomic_count[PLATFORM_MAX_AICPU_THREADS] = {};
+uint64_t g_sched_self_atomic_count[PLATFORM_MAX_AICPU_THREADS] = {};
+uint64_t g_sched_pop_atomic_count[PLATFORM_MAX_AICPU_THREADS] = {};
+uint64_t g_sched_complete_count[PLATFORM_MAX_AICPU_THREADS] = {};
+
+PTO2SchedProfilingData pto2_scheduler_get_profiling(int thread_idx) {
+ PTO2SchedProfilingData d;
+ d.lock_cycle = std::exchange(g_sched_lock_cycle[thread_idx], 0);
+ d.fanout_cycle = std::exchange(g_sched_fanout_cycle[thread_idx], 0);
+ d.fanin_cycle = std::exchange(g_sched_fanin_cycle[thread_idx], 0);
+ 
d.self_consumed_cycle = std::exchange(g_sched_self_consumed_cycle[thread_idx], 0); + d.lock_wait_cycle = std::exchange(g_sched_lock_wait_cycle[thread_idx], 0); + d.push_wait_cycle = std::exchange(g_sched_push_wait_cycle[thread_idx], 0); + d.pop_wait_cycle = std::exchange(g_sched_pop_wait_cycle[thread_idx], 0); + d.lock_atomic_count = std::exchange(g_sched_lock_atomic_count[thread_idx], 0); + d.fanout_atomic_count = std::exchange(g_sched_fanout_atomic_count[thread_idx], 0); + d.fanin_atomic_count = std::exchange(g_sched_fanin_atomic_count[thread_idx], 0); + d.self_atomic_count = std::exchange(g_sched_self_atomic_count[thread_idx], 0); + d.pop_atomic_count = std::exchange(g_sched_pop_atomic_count[thread_idx], 0); + d.complete_count = std::exchange(g_sched_complete_count[thread_idx], 0); + return d; +} +#endif + +// ============================================================================= +// Task State Names +// ============================================================================= + +const char* pto2_task_state_name(PTO2TaskState state) { + switch (state) { + case PTO2_TASK_PENDING: return "PENDING"; + case PTO2_TASK_READY: return "READY"; + case PTO2_TASK_RUNNING: return "RUNNING"; + case PTO2_TASK_COMPLETED: return "COMPLETED"; + case PTO2_TASK_CONSUMED: return "CONSUMED"; + default: return "UNKNOWN"; + } +} + +// ============================================================================= +// Ready Queue Implementation +// ============================================================================= + +bool pto2_ready_queue_init(PTO2ReadyQueue* queue, uint64_t capacity) { + queue->slots = (PTO2ReadyQueueSlot*)malloc(capacity * sizeof(PTO2ReadyQueueSlot)); + if (!queue->slots) { + return false; + } + + queue->capacity = capacity; + queue->mask = capacity - 1; + queue->enqueue_pos.store(0, std::memory_order_relaxed); + queue->dequeue_pos.store(0, std::memory_order_relaxed); + + for (uint64_t i = 0; i < capacity; i++) { + 
queue->slots[i].sequence.store((int64_t)i, std::memory_order_relaxed);
+ queue->slots[i].slot_state = nullptr;
+ }
+
+ return true;
+}
+
+void pto2_ready_queue_destroy(PTO2ReadyQueue* queue) {
+ if (queue->slots) {
+ free(queue->slots);
+ queue->slots = NULL;
+ }
+}
+
+// =============================================================================
+// Scheduler Initialization
+// =============================================================================
+
+bool PTO2SchedulerState::RingSchedState::init(
+ PTO2SharedMemoryHandle* sm_handle, int32_t ring_id) {
+ task_descriptors = sm_handle->task_descriptors[ring_id];
+ task_window_size = sm_handle->header->rings[ring_id].task_window_size;
+ task_window_mask = static_cast<uint32_t>(task_window_size - 1);
+ last_task_alive = 0;
+ slot_states = nullptr;
+ advance_lock.store(0, std::memory_order_relaxed);
+
+ // Allocate per-task slot state array (dynamically sized based on runtime window_size)
+ slot_states = new (std::nothrow) PTO2TaskSlotState[task_window_size];
+ if (!slot_states) {
+ return false;
+ }
+
+ // Zero-initialize all per-task slot state fields. 
 + for (uint64_t i = 0; i < task_window_size; i++) {
+ slot_states[i].fanout_lock.store(0, std::memory_order_relaxed);
+ slot_states[i].fanout_count = 0;
+ slot_states[i].fanout_head = nullptr;
+ slot_states[i].task_state.store(PTO2_TASK_PENDING, std::memory_order_relaxed);
+ slot_states[i].fanin_refcount.store(0, std::memory_order_relaxed);
+ slot_states[i].fanin_count = 0;
+ slot_states[i].fanout_refcount.store(0, std::memory_order_relaxed);
+ slot_states[i].payload = nullptr;
+ slot_states[i].task = nullptr;
+ slot_states[i].active_mask = 0;
+ slot_states[i].subtask_done_mask.store(0, std::memory_order_relaxed);
+ slot_states[i].ring_id = 0;
+ }
+
+ return true;
+}
+
+void PTO2SchedulerState::RingSchedState::destroy() {
+ if (!slot_states) return;
+ delete[] slot_states;
+ slot_states = nullptr;
+}
+
+bool pto2_scheduler_init(PTO2SchedulerState* sched,
+ PTO2SharedMemoryHandle* sm_handle) {
+ sched->sm_handle = sm_handle;
+#if PTO2_SCHED_PROFILING
+ sched->tasks_completed.store(0, std::memory_order_relaxed);
+ sched->tasks_consumed.store(0, std::memory_order_relaxed);
+#endif
+
+ // Initialize per-ring state
+ for (int r = 0; r < PTO2_MAX_RING_DEPTH; r++) {
+ if (!sched->ring_sched_states[r].init(sm_handle, r)) {
+ for (int j = 0; j < r; j++) {
+ sched->ring_sched_states[j].destroy();
+ }
+ return false;
+ }
+ }
+
+ // Initialize ready queues (one per resource shape, global)
+ for (int i = 0; i < PTO2_NUM_RESOURCE_SHAPES; i++) {
+ if (!pto2_ready_queue_init(&sched->ready_queues[i], PTO2_READY_QUEUE_SIZE)) {
+ // Cleanup on failure
+ for (int j = 0; j < i; j++) {
+ pto2_ready_queue_destroy(&sched->ready_queues[j]);
+ }
+ for (int r = 0; r < PTO2_MAX_RING_DEPTH; r++) {
+ sched->ring_sched_states[r].destroy();
+ }
+ return false;
+ }
+ }
+
+ return true;
+}
+
+void pto2_scheduler_destroy(PTO2SchedulerState* sched) {
+ for (int r = 0; r < PTO2_MAX_RING_DEPTH; r++) {
+ sched->ring_sched_states[r].destroy();
+ }
+
+ for (int i = 0; i < PTO2_NUM_RESOURCE_SHAPES; i++) {
+ 
pto2_ready_queue_destroy(&sched->ready_queues[i]); + } +} + +// ============================================================================= +// Debug Utilities +// ============================================================================= + +void pto2_scheduler_print_stats(PTO2SchedulerState* sched) { + LOG_INFO("=== Scheduler Statistics ==="); + for (int r = 0; r < PTO2_MAX_RING_DEPTH; r++) { + if (sched->ring_sched_states[r].last_task_alive > 0) { + LOG_INFO("Ring %d:", r); + LOG_INFO(" last_task_alive: %d", sched->ring_sched_states[r].last_task_alive); + } + } +#if PTO2_SCHED_PROFILING + LOG_INFO("tasks_completed: %lld", (long long)sched->tasks_completed.load(std::memory_order_relaxed)); + LOG_INFO("tasks_consumed: %lld", (long long)sched->tasks_consumed.load(std::memory_order_relaxed)); +#endif + LOG_INFO("============================"); +} + +void pto2_scheduler_print_queues(PTO2SchedulerState* sched) { + LOG_INFO("=== Ready Queues ==="); + + const char* shape_names[] = {"AIC", "AIV", "MIX"}; + + for (int i = 0; i < PTO2_NUM_RESOURCE_SHAPES; i++) { + LOG_INFO(" %s: count=%" PRIu64, shape_names[i], + sched->ready_queues[i].size()); + } + + LOG_INFO("===================="); +} diff --git a/src/a2a3/runtime/tensormap_and_ringbuffer_unmodified/runtime/pto_scheduler.h b/src/a2a3/runtime/tensormap_and_ringbuffer_unmodified/runtime/pto_scheduler.h new file mode 100644 index 000000000..d9b984ce0 --- /dev/null +++ b/src/a2a3/runtime/tensormap_and_ringbuffer_unmodified/runtime/pto_scheduler.h @@ -0,0 +1,819 @@ +/* + * Copyright (c) PyPTO Contributors. + * This program is free software, you can redistribute it and/or modify it under the terms and conditions of + * CANN Open Software License Agreement Version 2.0 (the "License"). + * Please refer to the License for details. You may not use this file except in compliance with the License. 
+ * THIS SOFTWARE IS PROVIDED ON AN "AS IS" BASIS, WITHOUT WARRANTIES OF ANY KIND, EITHER EXPRESS OR IMPLIED,
+ * INCLUDING BUT NOT LIMITED TO NON-INFRINGEMENT, MERCHANTABILITY, OR FITNESS FOR A PARTICULAR PURPOSE.
+ * See LICENSE in the root of the software repository for the full text of the License.
+ * -----------------------------------------------------------------------------------------------------------
+ */
+
+/**
+ * PTO Runtime2 - Scheduler Interface
+ *
+ * The Scheduler is responsible for:
+ * 1. Maintaining per-resource-shape ready queues
+ * 2. Tracking task state (PENDING -> READY -> RUNNING -> COMPLETED -> CONSUMED)
+ * 3. Managing fanin/fanout refcounts for dependency resolution
+ * 4. Advancing last_task_alive for heap reclamation
+ * 5. Two-stage mixed-task completion (subtask done bits → mixed-task complete)
+ *
+ * The Scheduler runs on Device AI_CPU and processes:
+ * - Task state transitions based on fanin_refcount
+ * - Buffer lifecycle based on fanout_refcount
+ * - Ring pointer advancement for flow control
+ *
+ * Based on: docs/RUNTIME_LOGIC.md
+ */
+
+#pragma once
+
+#include <atomic>
+
+#include "common/core_type.h"
+#include "pto_ring_buffer.h"
+#include "pto_runtime2_types.h"
+#include "pto_shared_memory.h"
+
+#if PTO2_SCHED_PROFILING
+#include "aicpu/device_time.h"
+#define PTO2_SCHED_CYCLE_START() uint64_t _st0 = get_sys_cnt_aicpu(), _st1
+#define PTO2_SCHED_CYCLE_LAP(acc)   \
+    do {                            \
+        _st1 = get_sys_cnt_aicpu(); \
+        acc += (_st1 - _st0);       \
+        _st0 = _st1;                \
+    } while (0)
+#endif
+
+// =============================================================================
+// Ready Queue (Lock-free bounded MPMC — Vyukov design)
+// =============================================================================
+
+/**
+ * Per-slot entry: sequence counter for ABA safety + task payload
+ */
+struct PTO2ReadyQueueSlot {
+    std::atomic<int64_t> sequence;
+    PTO2TaskSlotState* slot_state;
+};
+
+/**
+ * Thread-local ready buffer for local-first dispatch optimization.
+ *
+ * Two buffers per scheduling thread, one per CoreType (AIC=0, AIV=1).
+ * Initialized once before the scheduling loop; must be empty at
+ * the start of each iteration (verified by always_assert).
+ *
+ * Phase 1 fills per-CoreType buffers via on_task_complete().
+ * dispatch_ready_tasks_to_idle_cores drains them: local-first via
+ * get_ready_task_batch, then remaining tasks pushed to global readyQ.
+ */
+// Number of CoreType values eligible for local dispatch (AIC=0, AIV=1)
+static constexpr int PTO2_LOCAL_DISPATCH_TYPE_NUM = 2;
+
+struct PTO2LocalReadyBuffer {
+    PTO2TaskSlotState** slot_states = nullptr;
+    int count = 0;
+    int capacity = 0;
+
+    void reset(PTO2TaskSlotState** buf, int cap) {
+        slot_states = buf;
+        count = 0;
+        capacity = cap;
+    }
+
+    bool try_push(PTO2TaskSlotState* s) {
+        if (slot_states && count < capacity) {
+            slot_states[count++] = s;
+            return true;
+        }
+        return false;
+    }
+
+    PTO2TaskSlotState* pop() { return (count > 0) ? slot_states[--count] : nullptr; }
+};
+
+/**
+ * Lock-free bounded MPMC queue (Dmitry Vyukov design)
+ *
+ * Key properties:
+ * - enqueue_pos and dequeue_pos on separate cache lines (no false sharing)
+ * - Per-slot sequence counter prevents ABA problem
+ * - Empty queue pop returns immediately (single atomic load, no lock)
+ * - CAS contention is split: producers only touch enqueue_pos,
+ *   consumers only touch dequeue_pos
+ */
+struct alignas(64) PTO2ReadyQueue {
+    PTO2ReadyQueueSlot* slots;
+    uint64_t capacity;
+    uint64_t mask;        // capacity - 1
+    char _pad0[64 - 24];  // Pad to own cache line
+
+    std::atomic<uint64_t> enqueue_pos;
+    char _pad1[64 - sizeof(std::atomic<uint64_t>)];  // Own cache line
+
+    std::atomic<uint64_t> dequeue_pos;
+    char _pad2[64 - sizeof(std::atomic<uint64_t>)];  // Own cache line
+
+    uint64_t size() {
+        uint64_t e = enqueue_pos.load(std::memory_order_relaxed);
+        uint64_t d = dequeue_pos.load(std::memory_order_relaxed);
+        return (e >= d) ?
+            (e - d) : 0;
+    }
+
+    bool push(PTO2TaskSlotState* slot_state) {
+        uint64_t pos;
+        PTO2ReadyQueueSlot* slot;
+        while (true) {
+            pos = enqueue_pos.load(std::memory_order_relaxed);
+            slot = &slots[pos & mask];
+            int64_t seq = slot->sequence.load(std::memory_order_acquire);
+            int64_t diff = seq - static_cast<int64_t>(pos);
+            if (diff == 0) {
+                if (enqueue_pos.compare_exchange_weak(
+                        pos, pos + 1, std::memory_order_relaxed, std::memory_order_relaxed)) {
+                    break;
+                }
+            } else if (diff < 0) {
+                return false;  // Queue full
+            }
+        }
+
+        slot->slot_state = slot_state;
+        slot->sequence.store(static_cast<int64_t>(pos + 1), std::memory_order_release);
+        return true;
+    }
+
+    // Batch push: reserve count slots with a single CAS after confirming
+    // every target slot is available under the usual Vyukov sequence check.
+    void push_batch(PTO2TaskSlotState** items, int count) {
+        if (count == 0) return;
+
+        uint64_t pos;
+        while (true) {
+            pos = enqueue_pos.load(std::memory_order_relaxed);
+            bool ready = true;
+            for (int i = 0; i < count; i++) {
+                PTO2ReadyQueueSlot* slot = &slots[(pos + i) & mask];
+                int64_t seq = slot->sequence.load(std::memory_order_acquire);
+                int64_t diff = seq - static_cast<int64_t>(pos + i);
+                if (diff != 0) {
+                    ready = false;
+                    break;
+                }
+            }
+            if (!ready) {
+                continue;
+            }
+            if (enqueue_pos.compare_exchange_weak(
+                    pos, pos + count, std::memory_order_relaxed, std::memory_order_relaxed)) {
+                break;
+            }
+        }
+
+        for (int i = 0; i < count; i++) {
+            PTO2ReadyQueueSlot* slot = &slots[(pos + i) & mask];
+            slot->slot_state = items[i];
+            slot->sequence.store(static_cast<int64_t>(pos + i + 1), std::memory_order_release);
+        }
+    }
+
+#if PTO2_ORCH_PROFILING || PTO2_SCHED_PROFILING
+    bool push(PTO2TaskSlotState* slot_state, uint64_t& atomic_count, uint64_t& wait_cycle) {
+        uint64_t pos;
+        PTO2ReadyQueueSlot* slot;
+        uint64_t t0 = get_sys_cnt_aicpu();
+        bool contended = false;
+        uint32_t atomic_ops = 0;
+        while (true) {
+            pos = enqueue_pos.load(std::memory_order_relaxed);
+            slot = &slots[pos & mask];
+            int64_t seq = slot->sequence.load(std::memory_order_acquire);
+            int64_t diff = seq - static_cast<int64_t>(pos);
+            atomic_ops += 2;  // enqueue_pos.load + sequence.load
+            if (diff == 0) {
+                if (enqueue_pos.compare_exchange_weak(
+                        pos, pos + 1, std::memory_order_relaxed, std::memory_order_relaxed)) {
+                    atomic_ops++;  // successful CAS
+                    break;
+                }
+                contended = true;
+                atomic_ops++;  // failed CAS
+            } else if (diff < 0) {
+                return false;  // Queue full
+            } else {
+                contended = true;  // diff > 0: slot not yet released, spin
+            }
+        }
+        atomic_ops++;  // final sequence.store
+        atomic_count += atomic_ops;
+        if (contended) {
+            wait_cycle += (get_sys_cnt_aicpu() - t0);
+        }
+
+        slot->slot_state = slot_state;
+        slot->sequence.store(static_cast<int64_t>(pos + 1), std::memory_order_release);
+        return true;
+    }
+#endif
+
+    PTO2TaskSlotState* pop() {
+        // Fast-path: skip slot load when queue is clearly empty
+        uint64_t d = dequeue_pos.load(std::memory_order_relaxed);
+        uint64_t e = enqueue_pos.load(std::memory_order_relaxed);
+        if (d >= e) {
+            return nullptr;
+        }
+
+        uint64_t pos;
+        PTO2ReadyQueueSlot* slot;
+        while (true) {
+            pos = dequeue_pos.load(std::memory_order_relaxed);
+            slot = &slots[pos & mask];
+            int64_t seq = slot->sequence.load(std::memory_order_acquire);
+            int64_t diff = seq - static_cast<int64_t>(pos + 1);
+            if (diff == 0) {
+                if (dequeue_pos.compare_exchange_weak(
+                        pos, pos + 1, std::memory_order_relaxed, std::memory_order_relaxed))
+                    break;
+            } else if (diff < 0) {
+                return nullptr;  // Queue empty
+            }
+        }
+
+        PTO2TaskSlotState* result = slot->slot_state;
+        slot->sequence.store(static_cast<int64_t>(pos + mask + 1), std::memory_order_release);
+        return result;
+    }
+
+#if PTO2_SCHED_PROFILING
+    PTO2TaskSlotState* pop(uint64_t& atomic_count, uint64_t& wait_cycle) {
+        // Fast-path: skip slot load when queue is clearly empty
+        uint64_t d = dequeue_pos.load(std::memory_order_relaxed);
+        uint64_t e = enqueue_pos.load(std::memory_order_relaxed);
+        atomic_count += 2;  // dequeue_pos.load + enqueue_pos.load
+        if
+            (d >= e) {
+            return nullptr;
+        }
+
+        uint64_t pos;
+        PTO2ReadyQueueSlot* slot;
+        uint64_t t0 = get_sys_cnt_aicpu();
+        bool contended = false;
+        uint32_t atomic_ops = 0;
+        while (true) {
+            pos = dequeue_pos.load(std::memory_order_relaxed);
+            slot = &slots[pos & mask];
+            int64_t seq = slot->sequence.load(std::memory_order_acquire);
+            int64_t diff = seq - static_cast<int64_t>(pos + 1);
+            atomic_ops += 2;  // dequeue_pos.load + sequence.load
+            if (diff == 0) {
+                if (dequeue_pos.compare_exchange_weak(
+                        pos, pos + 1, std::memory_order_relaxed, std::memory_order_relaxed)) {
+                    atomic_ops++;  // successful CAS
+                    break;
+                }
+                contended = true;
+                atomic_ops++;  // failed CAS
+            } else if (diff < 0) {
+                atomic_count += atomic_ops;
+                return nullptr;  // Queue empty
+            } else {
+                contended = true;
+            }
+        }
+        atomic_ops++;  // final sequence.store
+        atomic_count += atomic_ops;
+        if (contended) {
+            wait_cycle += (get_sys_cnt_aicpu() - t0);
+        }
+
+        PTO2TaskSlotState* result = slot->slot_state;
+        slot->sequence.store(static_cast<int64_t>(pos + mask + 1), std::memory_order_release);
+        return result;
+    }
+#endif
+
+    // Batch pop: reserve a contiguous run of ready slots with a single CAS.
+    // Returns actual number of items popped (may be less than max_count).
+    int pop_batch(PTO2TaskSlotState** out, int max_count) {
+        uint64_t pos;
+        int count;
+        while (true) {
+            pos = dequeue_pos.load(std::memory_order_relaxed);
+            count = 0;
+            while (count < max_count) {
+                PTO2ReadyQueueSlot* slot = &slots[(pos + count) & mask];
+                int64_t seq = slot->sequence.load(std::memory_order_acquire);
+                int64_t diff = seq - static_cast<int64_t>(pos + count + 1);
+                if (diff == 0) {
+                    count++;
+                    continue;
+                }
+                if (diff < 0) {
+                    break;
+                }
+                count = -1;
+                break;
+            }
+            if (count == 0) return 0;
+            if (count < 0) continue;
+            if (dequeue_pos.compare_exchange_weak(
+                    pos, pos + count, std::memory_order_relaxed, std::memory_order_relaxed)) {
+                break;
+            }
+        }
+
+        for (int i = 0; i < count; i++) {
+            PTO2ReadyQueueSlot* slot = &slots[(pos + i) & mask];
+            out[i] = slot->slot_state;
+            slot->sequence.store(static_cast<int64_t>(pos + i + mask + 1), std::memory_order_release);
+        }
+        return count;
+    }
+
+#if PTO2_SCHED_PROFILING
+    int pop_batch(PTO2TaskSlotState** out, int max_count, uint64_t& atomic_count, uint64_t& wait_cycle) {
+        uint64_t pos;
+        int count;
+        uint64_t t0 = get_sys_cnt_aicpu();
+        bool contended = false;
+        uint32_t atomic_ops = 0;
+        while (true) {
+            pos = dequeue_pos.load(std::memory_order_relaxed);
+            atomic_ops++;  // dequeue_pos.load
+            count = 0;
+            while (count < max_count) {
+                PTO2ReadyQueueSlot* slot = &slots[(pos + count) & mask];
+                int64_t seq = slot->sequence.load(std::memory_order_acquire);
+                int64_t diff = seq - static_cast<int64_t>(pos + count + 1);
+                atomic_ops++;  // sequence.load
+                if (diff == 0) {
+                    count++;
+                    continue;
+                }
+                if (diff < 0) {
+                    break;
+                }
+                contended = true;
+                count = -1;
+                break;
+            }
+            if (count == 0) {
+                atomic_count += atomic_ops;
+                return 0;
+            }
+            if (count < 0) {
+                continue;
+            }
+            if (dequeue_pos.compare_exchange_weak(
+                    pos, pos + count, std::memory_order_relaxed, std::memory_order_relaxed)) {
+                atomic_ops++;  // successful CAS
+                break;
+            }
+            contended = true;
+            atomic_ops++;  // failed CAS
+        }
+
+        for (int i = 0; i < count; i++) {
+            PTO2ReadyQueueSlot* slot = &slots[(pos + i) & mask];
+            out[i] = slot->slot_state;
+            slot->sequence.store(static_cast<int64_t>(pos + i + mask + 1), std::memory_order_release);
+            atomic_ops++;  // sequence.store
+        }
+        atomic_count += atomic_ops;
+        if (contended) {
+            wait_cycle += (get_sys_cnt_aicpu() - t0);
+        }
+        return count;
+    }
+#endif
+};
+
+// Cold-path ready queue operations (defined in pto_scheduler.cpp)
+bool pto2_ready_queue_init(PTO2ReadyQueue* queue, uint64_t capacity);
+void pto2_ready_queue_destroy(PTO2ReadyQueue* queue);
+
+// =============================================================================
+// Scheduler State
+// =============================================================================
+
+/**
+ * Statistics returned by mixed-task completion processing
+ */
+struct PTO2CompletionStats {
+    int32_t fanout_edges;       // Number of fanout edges traversed (notify consumers)
+    int32_t tasks_enqueued;     // Number of consumers that became READY
+    int32_t fanin_edges;        // Number of fanin edges traversed (release producers)
+    bool mixed_task_completed;  // True only when this callback completed a mixed task
+};
+
+/**
+ * Scheduler state structure
+ *
+ * Contains dynamic state updated during task execution.
+ * Separated from shared memory for cache efficiency.
+ * Hot-path methods are defined inline (implicitly inline as member functions).
+ */
+struct PTO2SchedulerState {
+    // Shared memory access
+    PTO2SharedMemoryHandle* sm_handle;
+
+    // Per-ring state
+    struct RingSchedState {
+        PTO2TaskDescriptor* task_descriptors;
+        PTO2TaskSlotState* slot_states;
+        int32_t last_task_alive;
+        int32_t task_window_mask;
+        uint64_t task_window_size;
+        // Try-lock used to advance this ring's last_task_alive pointer.
+        std::atomic<int32_t> advance_lock;
+
+        bool init(PTO2SharedMemoryHandle* sm_handle, int32_t ring_id);
+        void destroy();
+
+        PTO2TaskSlotState& get_slot_state_by_task_id(int32_t local_id) {
+            return slot_states[local_id & task_window_mask];
+        }
+        PTO2TaskSlotState& get_slot_state_by_slot(int32_t slot) { return slot_states[slot]; }
+
+        void sync_to_sm(PTO2SharedMemoryRingHeader& ring) {
+            ring.fc.last_task_alive.store(last_task_alive, std::memory_order_release);
+        }
+
+        void advance_ring_pointers(PTO2SharedMemoryRingHeader& ring) {
+            int32_t current_task_index = ring.fc.current_task_index.load(std::memory_order_acquire);
+
+            while (last_task_alive < current_task_index) {
+                PTO2TaskSlotState& slot_state = get_slot_state_by_task_id(last_task_alive);
+                if (slot_state.task_state.load(std::memory_order_acquire) != PTO2_TASK_CONSUMED) {
+                    break;
+                }
+                last_task_alive++;
+            }
+
+            sync_to_sm(ring);
+        }
+    } ring_sched_states[PTO2_MAX_RING_DEPTH];
+
+    // Ready queues remain global (scheduling is ring-agnostic)
+    PTO2ReadyQueue ready_queues[PTO2_NUM_RESOURCE_SHAPES];
+
+    // Statistics
+#if PTO2_SCHED_PROFILING
+    std::atomic<int64_t> tasks_completed;
+    std::atomic<int64_t> tasks_consumed;
+#endif
+    // =========================================================================
+    // Inline hot-path methods
+    // =========================================================================
+    PTO2TaskSlotState& get_slot_state(int32_t ring_id, int32_t local_id) {
+        return ring_sched_states[ring_id].get_slot_state_by_task_id(local_id);
+    }
+    PTO2TaskSlotState& get_slot_state_by_slot(int32_t ring_id, int32_t slot) {
+        return ring_sched_states[ring_id].get_slot_state_by_slot(slot);
+    }
+
+    void check_and_handle_consumed(PTO2TaskSlotState& slot_state) {
+        if (slot_state.fanout_refcount.load(std::memory_order_acquire) != slot_state.fanout_count) return;
+
+        PTO2TaskState expected = PTO2_TASK_COMPLETED;
+        if (!slot_state.task_state.compare_exchange_strong(
+                expected, PTO2_TASK_CONSUMED, std::memory_order_acq_rel,
std::memory_order_acquire)) { + return; + } + +#if PTO2_SCHED_PROFILING + tasks_consumed.fetch_add(1, std::memory_order_relaxed); +#endif + + int32_t ring_id = slot_state.ring_id; + // Try-lock — if another thread is advancing this ring, it will scan our CONSUMED task + int32_t expected_lock = 0; + if (ring_sched_states[ring_id].advance_lock.compare_exchange_strong( + expected_lock, 1, std::memory_order_acquire, std::memory_order_relaxed)) { + ring_sched_states[ring_id].advance_ring_pointers(sm_handle->header->rings[ring_id]); + ring_sched_states[ring_id].advance_lock.store(0, std::memory_order_release); + } + } + +#if PTO2_ORCH_PROFILING || PTO2_SCHED_PROFILING + void check_and_handle_consumed(PTO2TaskSlotState& slot_state, uint64_t& atomic_count) { + int32_t fc = slot_state.fanout_count; + int32_t rc = slot_state.fanout_refcount.load(std::memory_order_acquire); + + atomic_count += 2; // fanout_count.load + fanout_refcount.load + + if (rc != fc) return; + + PTO2TaskState expected = PTO2_TASK_COMPLETED; + if (!slot_state.task_state.compare_exchange_strong( + expected, PTO2_TASK_CONSUMED, std::memory_order_acq_rel, std::memory_order_acquire)) { + atomic_count += 1; // failed CAS + return; + } + + atomic_count += 1; // successful CAS + +#if PTO2_SCHED_PROFILING + tasks_consumed.fetch_add(1, std::memory_order_relaxed); +#endif + + int32_t ring_id = slot_state.ring_id; + // Try-lock — if another thread is advancing this ring, it will scan our CONSUMED task + int32_t expected_lock = 0; + if (ring_sched_states[ring_id].advance_lock.compare_exchange_strong( + expected_lock, 1, std::memory_order_acquire, std::memory_order_relaxed)) { + ring_sched_states[ring_id].advance_ring_pointers(sm_handle->header->rings[ring_id]); + ring_sched_states[ring_id].advance_lock.store(0, std::memory_order_release); + atomic_count += 2; // try-lock CAS + unlock store + } else { + atomic_count += 1; // failed try-lock CAS + } + } +#endif + + void release_producer(PTO2TaskSlotState& slot_state) 
+    {
+        slot_state.fanout_refcount.fetch_add(1, std::memory_order_acq_rel);
+        check_and_handle_consumed(slot_state);
+    }
+
+#if PTO2_ORCH_PROFILING || PTO2_SCHED_PROFILING
+    void release_producer(PTO2TaskSlotState& slot_state, uint64_t& atomic_count) {
+        slot_state.fanout_refcount.fetch_add(1, std::memory_order_acq_rel);
+        atomic_count += 1;  // fanout_refcount.fetch_add
+        check_and_handle_consumed(slot_state, atomic_count);
+    }
+#endif
+
+    bool release_fanin_and_check_ready(PTO2TaskSlotState& slot_state, PTO2LocalReadyBuffer* local_bufs = nullptr) {
+        // Atomically increment fanin_refcount and check if all producers are done
+        // ACQ_REL on fanin_refcount already synchronizes with the orchestrator's
+        // init release, making fanin_count visible — plain load suffices.
+        int32_t new_refcount = slot_state.fanin_refcount.fetch_add(1, std::memory_order_acq_rel) + 1;
+
+        if (new_refcount == slot_state.fanin_count) {
+            // Local-first: try per-CoreType thread-local buffer before global queue
+            // Route by active_mask: AIC-containing tasks → buf[0], AIV-only → buf[1]
+            PTO2ResourceShape shape = pto2_active_mask_to_shape(slot_state.active_mask);
+            if (!local_bufs || !local_bufs[static_cast<int>(shape)].try_push(&slot_state)) {
+                ready_queues[static_cast<int>(shape)].push(&slot_state);
+            }
+            return true;
+        }
+        return false;
+    }
+
+#if PTO2_ORCH_PROFILING || PTO2_SCHED_PROFILING
+    bool release_fanin_and_check_ready(PTO2TaskSlotState& slot_state,
+                                       uint64_t& atomic_count,
+                                       uint64_t& push_wait,
+                                       PTO2LocalReadyBuffer* local_bufs = nullptr) {
+        int32_t new_refcount = slot_state.fanin_refcount.fetch_add(1, std::memory_order_acq_rel) + 1;
+        atomic_count += 1;  // fanin_refcount.fetch_add
+
+        if (new_refcount == slot_state.fanin_count) {
+            PTO2TaskState expected = PTO2_TASK_PENDING;
+            if (slot_state.task_state.compare_exchange_strong(
+                    expected, PTO2_TASK_READY, std::memory_order_acq_rel, std::memory_order_acquire)) {
+                atomic_count += 1;  // CAS(task_state PENDING→READY)
+                // Local-first: try
+                // per-CoreType thread-local buffer before global queue
+                PTO2ResourceShape shape = pto2_active_mask_to_shape(slot_state.active_mask);
+                if (!local_bufs || !local_bufs[static_cast<int>(shape)].try_push(&slot_state)) {
+                    ready_queues[static_cast<int>(shape)].push(&slot_state, atomic_count, push_wait);
+                }
+                return true;
+            }
+        }
+        return false;
+    }
+#endif
+
+    int get_ready_tasks_batch(
+        PTO2ResourceShape shape, PTO2LocalReadyBuffer& local_buf, PTO2TaskSlotState** out, int max_count) {
+        int count = 0;
+        while (count < max_count && local_buf.count > 0) {
+            out[count++] = local_buf.slot_states[--local_buf.count];
+        }
+        int remaining = max_count - count;
+        if (remaining > 0) {
+            count += ready_queues[static_cast<int>(shape)].pop_batch(out + count, remaining);
+        }
+        return count;
+    }
+
+#if PTO2_SCHED_PROFILING
+    int get_ready_tasks_batch(PTO2ResourceShape shape,
+                              PTO2LocalReadyBuffer& local_buf,
+                              PTO2TaskSlotState** out,
+                              int max_count,
+                              uint64_t& atomic_count,
+                              uint64_t& wait_cycle,
+                              uint64_t& local_dispatch_count) {
+        int count = 0;
+        while (count < max_count && local_buf.count > 0) {
+            local_dispatch_count++;
+            out[count++] = local_buf.slot_states[--local_buf.count];
+        }
+        int remaining = max_count - count;
+        if (remaining > 0) {
+            count +=
+                ready_queues[static_cast<int>(shape)].pop_batch(out + count, remaining, atomic_count, wait_cycle);
+        }
+        return count;
+    }
+#endif
+
+    void on_scope_end(PTO2TaskSlotState** task_slot_states, int32_t count) {
+#if PTO2_ORCH_PROFILING
+        extern uint64_t g_orch_scope_end_atomic_count;
+        if (count > 0) __builtin_prefetch(task_slot_states[0], 1, 0);
+        for (int32_t i = 0; i < count; i++) {
+            if (i + 1 < count) __builtin_prefetch(task_slot_states[i + 1], 1, 0);
+            release_producer(*task_slot_states[i], g_orch_scope_end_atomic_count);
+        }
+#else
+        if (count > 0) __builtin_prefetch(task_slot_states[0], 1, 0);
+        for (int32_t i = 0; i < count; i++) {
+            if (i + 1 < count) __builtin_prefetch(task_slot_states[i + 1], 1, 0);
release_producer(*task_slot_states[i]); + } +#endif + } + + /** + * Subtask completion: atomic counter model. + * Called when a single subtask (AIC, AIV0, or AIV1) finishes on any block. + * Atomically increments completed_subtasks and checks whether all subtasks + * across all blocks are done. + * + * @return true if this was the last subtask, completing the entire task. + */ + bool on_subtask_complete(PTO2TaskSlotState& slot_state) { + int16_t prev = slot_state.completed_subtasks.fetch_add(1, std::memory_order_acq_rel); + return (prev + 1) == slot_state.total_required_subtasks; + } + + /** + * Two-stage completion: second stage. + * Called exactly once when all subtasks of a mixed task are done + * (i.e., on_subtask_complete returned true). + * Handles fanout notification, fanin release, and self-consumption check. + */ +#if PTO2_SCHED_PROFILING + PTO2CompletionStats +#else + void +#endif + on_mixed_task_complete(PTO2TaskSlotState& slot_state, +#if PTO2_SCHED_PROFILING + int thread_idx, +#endif + + PTO2LocalReadyBuffer* local_bufs = nullptr) { +#if PTO2_SCHED_PROFILING + PTO2CompletionStats stats = {0, 0, 0, true}; +#endif +#if PTO2_SCHED_PROFILING + extern uint64_t g_sched_lock_cycle[], g_sched_fanout_cycle[]; + extern uint64_t g_sched_lock_atomic_count[], g_sched_lock_wait_cycle[]; + extern uint64_t g_sched_fanout_atomic_count[], g_sched_push_wait_cycle[]; + uint64_t lock_atomics = 0, lock_wait = 0; + PTO2_SCHED_CYCLE_START(); +#endif + +#if PTO2_SCHED_PROFILING + pto2_fanout_lock(slot_state, lock_atomics, lock_wait); +#else + pto2_fanout_lock(slot_state); +#endif + slot_state.task_state.store(PTO2_TASK_COMPLETED, std::memory_order_release); + PTO2DepListEntry* current = slot_state.fanout_head; // Protected by fanout_lock + pto2_fanout_unlock(slot_state); + +#if PTO2_SCHED_PROFILING + lock_atomics += 2; // state.store + unlock.store + g_sched_lock_atomic_count[thread_idx] += lock_atomics; + g_sched_lock_wait_cycle[thread_idx] += lock_wait; + 
PTO2_SCHED_CYCLE_LAP(g_sched_lock_cycle[thread_idx]); +#endif + + // Fanout: notify consumers +#if PTO2_SCHED_PROFILING + uint64_t fanout_atomics = 0, push_wait = 0; +#endif + while (current != nullptr) { + PTO2TaskSlotState& consumer_slot = *current->slot_state; +#if PTO2_SCHED_PROFILING + stats.fanout_edges++; + if (release_fanin_and_check_ready(consumer_slot, fanout_atomics, push_wait, local_bufs)) { + stats.tasks_enqueued++; + } +#else + release_fanin_and_check_ready(consumer_slot, local_bufs); +#endif + current = current->next; + } + +#if PTO2_SCHED_PROFILING + g_sched_fanout_atomic_count[thread_idx] += fanout_atomics; + g_sched_push_wait_cycle[thread_idx] += push_wait; + PTO2_SCHED_CYCLE_LAP(g_sched_fanout_cycle[thread_idx]); + return stats; +#endif + } + + /** + * Cold path: release producers (fanin traversal) + check self for CONSUMED. + * Returns fanin edge count for profiling. + */ + +#if PTO2_SCHED_PROFILING + int32_t on_task_release(PTO2TaskSlotState& slot_state, int32_t thread_idx) { + PTO2_SCHED_CYCLE_START(); + extern uint64_t g_sched_fanin_cycle[], g_sched_fanin_atomic_count[]; + extern uint64_t g_sched_self_atomic_count[]; + extern uint64_t g_sched_self_consumed_cycle[]; + extern uint64_t g_sched_complete_count[]; + uint64_t fanin_atomics = 0; +#else + int32_t on_task_release(PTO2TaskSlotState& slot_state) { +#endif + PTO2TaskPayload* payload = slot_state.payload; + int32_t fanin_edges = payload->fanin_actual_count; + for (int32_t i = 0; i < fanin_edges; i++) { +#if PTO2_SCHED_PROFILING + release_producer(*payload->fanin_slot_states[i], fanin_atomics); +#else + release_producer(*payload->fanin_slot_states[i]); +#endif + } +#if PTO2_SCHED_PROFILING + g_sched_fanin_atomic_count[thread_idx] += fanin_atomics; + PTO2_SCHED_CYCLE_LAP(g_sched_fanin_cycle[thread_idx]); +#endif + + // Self consumed check +#if PTO2_SCHED_PROFILING + uint64_t self_atomics = 0; + check_and_handle_consumed(slot_state, self_atomics); + g_sched_self_atomic_count[thread_idx] += 
self_atomics; + PTO2_SCHED_CYCLE_LAP(g_sched_self_consumed_cycle[thread_idx]); + g_sched_complete_count[thread_idx]++; +#else + check_and_handle_consumed(slot_state); +#endif + return fanin_edges; + } +}; // NOLINT(readability/braces) + +// ============================================================================= +// Scheduler API (cold path, defined in pto_scheduler.cpp) +// ============================================================================= + +bool pto2_scheduler_init(PTO2SchedulerState* sched, PTO2SharedMemoryHandle* sm_handle); +void pto2_scheduler_destroy(PTO2SchedulerState* sched); + +// ============================================================================= +// Debug Utilities (cold path, defined in pto_scheduler.cpp) +// ============================================================================= + +void pto2_scheduler_print_stats(PTO2SchedulerState* sched); +void pto2_scheduler_print_queues(PTO2SchedulerState* sched); +const char* pto2_task_state_name(PTO2TaskState state); + +// ============================================================================= +// Scheduler Profiling Data +// ============================================================================= + +#if PTO2_SCHED_PROFILING +struct PTO2SchedProfilingData { + // Sub-phase cycle breakdown within on_mixed_task_complete + uint64_t lock_cycle; // pto2_fanout_lock + state store + unlock + uint64_t fanout_cycle; // fanout traversal + uint64_t fanin_cycle; // fanin traversal + uint64_t self_consumed_cycle; // self check_and_handle_consumed + + // Wait times + uint64_t lock_wait_cycle; // spin-wait in fanout_lock + uint64_t push_wait_cycle; // CAS contention in push() + uint64_t pop_wait_cycle; // CAS contention in pop() + + // Atomic counts per sub-phase + uint64_t lock_atomic_count; + uint64_t fanout_atomic_count; + uint64_t fanin_atomic_count; + uint64_t self_atomic_count; + uint64_t pop_atomic_count; + + int64_t complete_count; +}; + +/** + * Get and reset scheduler 
+ * profiling data for a specific thread.
+ * Returns accumulated profiling data and resets counters.
+ */
+PTO2SchedProfilingData pto2_scheduler_get_profiling(int thread_idx);
+#endif
diff --git a/src/a2a3/runtime/tensormap_and_ringbuffer_unmodified/runtime/pto_shared_memory.cpp b/src/a2a3/runtime/tensormap_and_ringbuffer_unmodified/runtime/pto_shared_memory.cpp
new file mode 100644
index 000000000..633caa048
--- /dev/null
+++ b/src/a2a3/runtime/tensormap_and_ringbuffer_unmodified/runtime/pto_shared_memory.cpp
@@ -0,0 +1,273 @@
+/**
+ * PTO Runtime2 - Shared Memory Implementation
+ *
+ * Implements shared memory allocation, initialization, and management
+ * for Orchestrator-Scheduler communication.
+ *
+ * Based on: docs/RUNTIME_LOGIC.md
+ */
+
+#include "pto_shared_memory.h"
+#include <cinttypes>
+#include <cstdlib>
+#include <cstring>
+#include "common/unified_log.h"
+
+// =============================================================================
+// Size Calculation
+// =============================================================================
+
+uint64_t pto2_sm_calculate_size(uint64_t task_window_size) {
+    uint64_t task_window_sizes[PTO2_MAX_RING_DEPTH];
+    for (int r = 0; r < PTO2_MAX_RING_DEPTH; r++) {
+        task_window_sizes[r] = task_window_size;
+    }
+    return pto2_sm_calculate_size_per_ring(task_window_sizes);
+}
+
+uint64_t pto2_sm_calculate_size_per_ring(const uint64_t task_window_sizes[PTO2_MAX_RING_DEPTH]) {
+    uint64_t size = 0;
+
+    // Header (aligned to cache line)
+    size += PTO2_ALIGN_UP(sizeof(PTO2SharedMemoryHeader), PTO2_ALIGN_SIZE);
+
+    // Per-ring task descriptors and payloads
+    for (int r = 0; r < PTO2_MAX_RING_DEPTH; r++) {
+        size += PTO2_ALIGN_UP(task_window_sizes[r] * sizeof(PTO2TaskDescriptor), PTO2_ALIGN_SIZE);
+        size += PTO2_ALIGN_UP(task_window_sizes[r] * sizeof(PTO2TaskPayload), PTO2_ALIGN_SIZE);
+    }
+
+    return size;
+}
+
+// =============================================================================
+// Creation and Destruction
+//
+// =============================================================================
+
+static void pto2_sm_setup_pointers_per_ring(
+    PTO2SharedMemoryHandle* handle,
+    const uint64_t task_window_sizes[PTO2_MAX_RING_DEPTH]) {
+    char* ptr = (char*)handle->sm_base;
+
+    // Header
+    handle->header = (PTO2SharedMemoryHeader*)ptr;
+    ptr += PTO2_ALIGN_UP(sizeof(PTO2SharedMemoryHeader), PTO2_ALIGN_SIZE);
+
+    // Per-ring task descriptors and payloads
+    for (int r = 0; r < PTO2_MAX_RING_DEPTH; r++) {
+        handle->task_descriptors[r] = (PTO2TaskDescriptor*)ptr;
+        ptr += PTO2_ALIGN_UP(task_window_sizes[r] * sizeof(PTO2TaskDescriptor), PTO2_ALIGN_SIZE);
+
+        handle->task_payloads[r] = (PTO2TaskPayload*)ptr;
+        ptr += PTO2_ALIGN_UP(task_window_sizes[r] * sizeof(PTO2TaskPayload), PTO2_ALIGN_SIZE);
+    }
+}
+
+static void pto2_sm_setup_pointers(PTO2SharedMemoryHandle* handle, uint64_t task_window_size) {
+    uint64_t task_window_sizes[PTO2_MAX_RING_DEPTH];
+    for (int r = 0; r < PTO2_MAX_RING_DEPTH; r++) {
+        task_window_sizes[r] = task_window_size;
+    }
+    pto2_sm_setup_pointers_per_ring(handle, task_window_sizes);
+}
+
+PTO2SharedMemoryHandle* pto2_sm_create(uint64_t task_window_size,
+                                       uint64_t heap_size) {
+    // Allocate handle
+    PTO2SharedMemoryHandle* handle = (PTO2SharedMemoryHandle*)calloc(1, sizeof(PTO2SharedMemoryHandle));
+    if (!handle) {
+        return NULL;
+    }
+
+    // Calculate total size
+    uint64_t sm_size = pto2_sm_calculate_size(task_window_size);
+
+    // Allocate shared memory (aligned for DMA efficiency)
+    #if defined(_POSIX_C_SOURCE) && _POSIX_C_SOURCE >= 200112L
+    if (posix_memalign(&handle->sm_base, PTO2_ALIGN_SIZE, static_cast<size_t>(sm_size)) != 0) {
+        free(handle);
+        return NULL;
+    }
+    #else
+    handle->sm_base = aligned_alloc(PTO2_ALIGN_SIZE, static_cast<size_t>(sm_size));
+    if (!handle->sm_base) {
+        free(handle);
+        return NULL;
+    }
+    #endif
+
+    handle->sm_size = sm_size;
+    handle->is_owner = true;
+
+    // Initialize to zero
+    memset(handle->sm_base, 0, static_cast<size_t>(sm_size));
+
+    // Set up pointers
pto2_sm_setup_pointers(handle, task_window_size); + + // Initialize header + pto2_sm_init_header(handle, task_window_size, heap_size); + + return handle; +} + +PTO2SharedMemoryHandle* pto2_sm_create_default(void) { + return pto2_sm_create(PTO2_TASK_WINDOW_SIZE, + PTO2_HEAP_SIZE); +} + +PTO2SharedMemoryHandle* pto2_sm_create_from_buffer(void* sm_base, + uint64_t sm_size, + uint64_t task_window_size, + uint64_t heap_size) { + if (!sm_base || sm_size == 0) return NULL; + + uint64_t required = pto2_sm_calculate_size(task_window_size); + if (sm_size < required) return NULL; + + PTO2SharedMemoryHandle* handle = (PTO2SharedMemoryHandle*)calloc(1, sizeof(PTO2SharedMemoryHandle)); + if (!handle) return NULL; + + handle->sm_base = sm_base; + handle->sm_size = sm_size; + handle->is_owner = false; + + pto2_sm_setup_pointers(handle, task_window_size); + pto2_sm_init_header(handle, task_window_size, heap_size); + + return handle; +} + +void pto2_sm_destroy(PTO2SharedMemoryHandle* handle) { + if (!handle) return; + + if (handle->is_owner && handle->sm_base) { + free(handle->sm_base); + } + + free(handle); +} + +// ============================================================================= +// Initialization +// ============================================================================= +// +// no need init data in pool, init pool data when used +void pto2_sm_init_header(PTO2SharedMemoryHandle* handle, + uint64_t task_window_size, + uint64_t heap_size) { + uint64_t task_window_sizes[PTO2_MAX_RING_DEPTH]; + uint64_t heap_sizes[PTO2_MAX_RING_DEPTH]; + for (int r = 0; r < PTO2_MAX_RING_DEPTH; r++) { + task_window_sizes[r] = task_window_size; + heap_sizes[r] = heap_size; + } + pto2_sm_init_header_per_ring(handle, task_window_sizes, heap_sizes); +} + +void pto2_sm_init_header_per_ring( + PTO2SharedMemoryHandle* handle, + const uint64_t task_window_sizes[PTO2_MAX_RING_DEPTH], + const uint64_t heap_sizes[PTO2_MAX_RING_DEPTH]) { + PTO2SharedMemoryHeader* header = handle->header; + + 
// Per-ring flow control (start at 0) + for (int r = 0; r < PTO2_MAX_RING_DEPTH; r++) { + header->rings[r].fc.init(); + } + + header->orchestrator_done.store(0, std::memory_order_relaxed); + + // Per-ring layout info + uint64_t offset = PTO2_ALIGN_UP(sizeof(PTO2SharedMemoryHeader), PTO2_ALIGN_SIZE); + for (int r = 0; r < PTO2_MAX_RING_DEPTH; r++) { + header->rings[r].task_window_size = task_window_sizes[r]; + header->rings[r].heap_size = heap_sizes[r]; + header->rings[r].task_descriptors_offset = offset; + offset += PTO2_ALIGN_UP(task_window_sizes[r] * sizeof(PTO2TaskDescriptor), PTO2_ALIGN_SIZE); + offset += PTO2_ALIGN_UP(task_window_sizes[r] * sizeof(PTO2TaskPayload), PTO2_ALIGN_SIZE); + } + + header->total_size = handle->sm_size; + header->graph_output_ptr.store(0, std::memory_order_relaxed); + header->graph_output_size.store(0, std::memory_order_relaxed); + + // Error reporting + header->orch_error_code.store(PTO2_ERROR_NONE, std::memory_order_relaxed); + header->sched_error_bitmap.store(0, std::memory_order_relaxed); + header->sched_error_code.store(PTO2_ERROR_NONE, std::memory_order_relaxed); + header->sched_error_thread.store(-1, std::memory_order_relaxed); +} + +// ============================================================================= +// Debug Utilities +// ============================================================================= + +void pto2_sm_print_layout(PTO2SharedMemoryHandle* handle) { + if (!handle || !handle->header) return; + + PTO2SharedMemoryHeader* h = handle->header; + + LOG_INFO("=== PTO2 Shared Memory Layout ==="); + LOG_INFO("Base address: %p", handle->sm_base); + LOG_INFO("Total size: %" PRIu64 " bytes", h->total_size); + LOG_INFO("Ring depth: %d", PTO2_MAX_RING_DEPTH); + for (int r = 0; r < PTO2_MAX_RING_DEPTH; r++) { + LOG_INFO("Ring %d:", r); + LOG_INFO(" task_window_size: %" PRIu64, h->rings[r].task_window_size); + LOG_INFO(" heap_size: %" PRIu64 " bytes", h->rings[r].heap_size); + LOG_INFO(" descriptors_off: %" PRIu64 " 
(0x%" PRIx64 ")", + h->rings[r].task_descriptors_offset, h->rings[r].task_descriptors_offset); + LOG_INFO(" heap_top: %" PRIu64, h->rings[r].fc.heap_top.load(std::memory_order_acquire)); + LOG_INFO(" heap_tail: %" PRIu64, h->rings[r].fc.heap_tail.load(std::memory_order_acquire)); + LOG_INFO(" current_task_idx: %d", h->rings[r].fc.current_task_index.load(std::memory_order_acquire)); + LOG_INFO(" last_task_alive: %d", h->rings[r].fc.last_task_alive.load(std::memory_order_acquire)); + } + LOG_INFO("orchestrator_done: %d", h->orchestrator_done.load(std::memory_order_acquire)); + LOG_INFO("Error state:"); + LOG_INFO(" orch_error_code: %d", h->orch_error_code.load(std::memory_order_relaxed)); + LOG_INFO(" sched_error_bitmap: 0x%x", h->sched_error_bitmap.load(std::memory_order_relaxed)); + LOG_INFO(" sched_error_code: %d", h->sched_error_code.load(std::memory_order_relaxed)); + LOG_INFO(" sched_error_thread: %d", h->sched_error_thread.load(std::memory_order_relaxed)); + LOG_INFO("================================"); +} + +bool pto2_sm_validate(PTO2SharedMemoryHandle* handle) { + if (!handle) return false; + if (!handle->sm_base) return false; + if (!handle->header) return false; + + PTO2SharedMemoryHeader* h = handle->header; + + for (int r = 0; r < PTO2_MAX_RING_DEPTH; r++) { + if (!h->rings[r].fc.validate(handle, r)) return false; + } + + return true; +} + +bool PTO2RingFlowControl::validate(PTO2SharedMemoryHandle* handle, int32_t ring_id) const { + if (!handle) return false; + if (!handle->header) return false; + if (ring_id < 0 || ring_id >= PTO2_MAX_RING_DEPTH) return false; + + const PTO2SharedMemoryHeader* h = handle->header; + + // Check that offsets are within bounds + if (h->rings[ring_id].task_descriptors_offset >= h->total_size) return false; + + // Check pointer alignment + if ((uintptr_t)handle->task_descriptors[ring_id] % PTO2_ALIGN_SIZE != 0) return false; + + // Check flow control pointer sanity + int32_t current = 
current_task_index.load(std::memory_order_acquire); + int32_t last_alive = last_task_alive.load(std::memory_order_acquire); + uint64_t top = heap_top.load(std::memory_order_acquire); + uint64_t tail = heap_tail.load(std::memory_order_acquire); + if (current < 0) return false; + if (last_alive < 0) return false; + if (top > h->rings[ring_id].heap_size) return false; + if (tail > h->rings[ring_id].heap_size) return false; + + return true; +} diff --git a/src/a2a3/runtime/tensormap_and_ringbuffer_unmodified/runtime/pto_shared_memory.h b/src/a2a3/runtime/tensormap_and_ringbuffer_unmodified/runtime/pto_shared_memory.h new file mode 100644 index 000000000..b0200da4d --- /dev/null +++ b/src/a2a3/runtime/tensormap_and_ringbuffer_unmodified/runtime/pto_shared_memory.h @@ -0,0 +1,227 @@ +/** + * PTO Runtime2 - Shared Memory Layout + * + * Defines the shared memory structure for Orchestrator-Scheduler communication. + * + * Memory Layout (per-ring sections repeat for each ring 0..PTO2_MAX_RING_DEPTH-1): + * +---------------------------+ + * | SharedMemoryHeader | (per-ring flow control + sync) + * +---------------------------+ + * | Ring 0: TaskDescriptor[] | + * | Ring 0: TaskPayload[] | + * +---------------------------+ + * | Ring 1: TaskDescriptor[] | + * | Ring 1: TaskPayload[] | + * +---------------------------+ + * | ... 
|
+ * +---------------------------+
+ *
+ * Design principles:
+ * - Only data needed for Orchestrator<->Scheduler communication is here
+ * - TensorMap, scope_stack, ready_queues, dep_pool are in private memory
+ * - Flow control via atomic counters/flags (no locks needed for single-word R/W)
+ *
+ * Based on: docs/RUNTIME_LOGIC.md
+ */
+
+#ifndef PTO_SHARED_MEMORY_H
+#define PTO_SHARED_MEMORY_H
+
+#include "pto_runtime2_types.h"
+
+#ifdef __cplusplus
+extern "C" {
+#endif
+
+// =============================================================================
+// Shared Memory Header
+// =============================================================================
+
+struct PTO2SharedMemoryHandle;
+
+/**
+ * Per-ring flow control state in shared memory.
+ * Written/read by Orchestrator and Scheduler for synchronization.
+ */
+struct PTO2RingFlowControl {
+    // Written by Orchestrator, Read by Scheduler
+    std::atomic<uint64_t> heap_top;           // Heap ring allocation pointer
+    std::atomic<int32_t> current_task_index;  // Task ring head (next to allocate)
+    int32_t _pad0;                            // Alignment padding
+
+    // Written by Scheduler, Read by Orchestrator (for back-pressure)
+    std::atomic<uint64_t> heap_tail;          // Heap ring free pointer
+    std::atomic<int32_t> last_task_alive;     // Task ring tail (oldest active task)
+    int32_t _pad1;                            // Alignment padding
+
+    void init() {
+        heap_top.store(0, std::memory_order_relaxed);
+        current_task_index.store(0, std::memory_order_relaxed);
+        heap_tail.store(0, std::memory_order_relaxed);
+        last_task_alive.store(0, std::memory_order_relaxed);
+    }
+
+    bool validate(PTO2SharedMemoryHandle* handle, int32_t ring_id) const;
+};
+
+/**
+ * Per-ring shared memory header section.
+ *
+ * Groups flow-control and layout info for a single ring to avoid parallel arrays.
+ */
+struct PTO2SharedMemoryRingHeader {
+    PTO2RingFlowControl fc;
+    uint64_t task_window_size;
+    uint64_t heap_size;
+    uint64_t task_descriptors_offset;  // Offset from SM base, in bytes
+};
+
+/**
+ * Shared memory header structure
+ *
+ * Contains per-ring flow control and global layout information.
+ */
+struct alignas(PTO2_ALIGN_SIZE) PTO2SharedMemoryHeader {
+    // === PER-RING FLOW CONTROL + LAYOUT INFO (set once at init) ===
+    PTO2SharedMemoryRingHeader rings[PTO2_MAX_RING_DEPTH];
+
+    // === GLOBAL FIELDS ===
+    std::atomic<int32_t> orchestrator_done;  // Flag: orchestration complete
+
+    // Total shared memory size (for validation)
+    uint64_t total_size;
+
+    // Graph output for copy-back (set by orchestrator when using packed buffer)
+    // Host finalize copies from this address instead of dev_ptr when non-zero
+    std::atomic<uint64_t> graph_output_ptr;   // Address where final output was written (packed buffer)
+    std::atomic<uint64_t> graph_output_size;  // Size in bytes
+
+    // === ERROR REPORTING ===
+
+    // Orchestrator fatal error code (Orchestrator → Scheduler, AICPU → Host)
+    // Non-zero signals fatal error. Written by orchestrator, read by scheduler and host.
+    std::atomic<int32_t> orch_error_code;
+
+    // Scheduler error state (Scheduler → Host, independent of orchestrator)
+    // Written by scheduler threads on timeout; read by orchestrator and host.
+    std::atomic<uint32_t> sched_error_bitmap;  // Bit X set = thread X had error
+    std::atomic<int32_t> sched_error_code;     // Last scheduler error code (last-writer-wins)
+    std::atomic<int32_t> sched_error_thread;   // Thread index of last error writer
+};
+
+static_assert(sizeof(PTO2SharedMemoryHeader) % PTO2_ALIGN_SIZE == 0,
+              "PTO2SharedMemoryHeader must be aligned to cache line (PTO2_ALIGN_SIZE)");
+
+// =============================================================================
+// Shared Memory Handle
+// =============================================================================
+
+/**
+ * Handle for shared memory access
+ * Provides both Orchestrator and Scheduler views of the same memory
+ */
+struct PTO2SharedMemoryHandle {
+    void* sm_base;     // Base address of shared memory
+    uint64_t sm_size;  // Total size of shared memory
+
+    // Quick pointers into shared memory regions (per-ring)
+    PTO2SharedMemoryHeader* header;
+    PTO2TaskDescriptor* task_descriptors[PTO2_MAX_RING_DEPTH];
+    PTO2TaskPayload* task_payloads[PTO2_MAX_RING_DEPTH];
+
+    // Ownership flag
+    bool is_owner;  // True if this handle allocated the memory
+
+};
+
+// =============================================================================
+// Shared Memory API
+// =============================================================================
+
+/**
+ * Calculate required shared memory size
+ *
+ * @param task_window_size Number of task slots per ring
+ * @return Total bytes required
+ */
+uint64_t pto2_sm_calculate_size(uint64_t task_window_size);
+
+/**
+ * Calculate required shared memory size for per-ring task windows.
+ * + * @param task_window_sizes Array of window sizes per ring + * @return Total bytes required + */ +uint64_t pto2_sm_calculate_size_per_ring(const uint64_t task_window_sizes[PTO2_MAX_RING_DEPTH]); + +/** + * Create shared memory for Orchestrator and Scheduler + * + * @param task_window_size Number of task slots per ring + * @param heap_size Heap size per ring for output buffers + * @return Handle with both views, or NULL on failure + */ +PTO2SharedMemoryHandle* pto2_sm_create(uint64_t task_window_size, + uint64_t heap_size); + +/** + * Create shared memory with default sizes + */ +PTO2SharedMemoryHandle* pto2_sm_create_default(void); + +/** + * Wrap an existing buffer as shared memory (e.g. device GM buffer). + * Caller owns the buffer; handle will not free sm_base. + * + * @param sm_base Base address of pre-allocated buffer + * @param sm_size Total size in bytes + * @param task_window_size Number of task slots per ring (must match buffer layout) + * @param heap_size Heap size per ring (for layout; buffer has no heap region) + * @return Handle, or NULL on failure + */ +PTO2SharedMemoryHandle* pto2_sm_create_from_buffer(void* sm_base, + uint64_t sm_size, + uint64_t task_window_size, + uint64_t heap_size); + +/** + * Destroy shared memory and free resources + */ +void pto2_sm_destroy(PTO2SharedMemoryHandle* handle); + +/** + * Initialize shared memory header with layout information + * Called after memory is allocated + */ +void pto2_sm_init_header(PTO2SharedMemoryHandle* handle, + uint64_t task_window_size, + uint64_t heap_size); + +/** + * Initialize shared memory header with per-ring layout information. 
+ */ +void pto2_sm_init_header_per_ring( + PTO2SharedMemoryHandle* handle, + const uint64_t task_window_sizes[PTO2_MAX_RING_DEPTH], + const uint64_t heap_sizes[PTO2_MAX_RING_DEPTH]); + +// ============================================================================= +// Debug Utilities +// ============================================================================= + +/** + * Print shared memory layout info + */ +void pto2_sm_print_layout(PTO2SharedMemoryHandle* handle); + +/** + * Validate shared memory integrity + * @return true if valid, false if corrupted + */ +bool pto2_sm_validate(PTO2SharedMemoryHandle* handle); + +#ifdef __cplusplus +} +#endif + +#endif // PTO_SHARED_MEMORY_H diff --git a/src/a2a3/runtime/tensormap_and_ringbuffer_unmodified/runtime/pto_submit_types.h b/src/a2a3/runtime/tensormap_and_ringbuffer_unmodified/runtime/pto_submit_types.h new file mode 100644 index 000000000..a0df3c4a6 --- /dev/null +++ b/src/a2a3/runtime/tensormap_and_ringbuffer_unmodified/runtime/pto_submit_types.h @@ -0,0 +1,119 @@ +/* + * Copyright (c) PyPTO Contributors. + * This program is free software, you can redistribute it and/or modify it under the terms and conditions of + * CANN Open Software License Agreement Version 2.0 (the "License"). + * Please refer to the License for details. You may not use this file except in compliance with the License. + * THIS SOFTWARE IS PROVIDED ON AN "AS IS" BASIS, WITHOUT WARRANTIES OF ANY KIND, EITHER EXPRESS OR IMPLIED, + * INCLUDING BUT NOT LIMITED TO NON-INFRINGEMENT, MERCHANTABILITY, OR FITNESS FOR A PARTICULAR PURPOSE. + * See LICENSE in the root of the software repository for the full text of the License. + * ----------------------------------------------------------------------------------------------------------- + */ + +/** + * PTO Submit Types - Shared submit-contract definitions + * + * Header-only definitions shared by orchestration-facing and runtime-facing + * headers. 
Keeps orchestration slim (no dependency on pto_runtime2_types.h).
+ */
+
+#pragma once
+
+#include <cstdint>
+
+inline constexpr int32_t INVALID_KERNEL_ID = -1;
+
+/**
+ * Subtask slot count: AIC, AIV0, AIV1
+ */
+inline constexpr int32_t PTO2_SUBTASK_SLOT_COUNT = 3;
+
+/**
+ * Subtask slot indices
+ */
+enum class PTO2SubtaskSlot : uint8_t {
+    AIC = 0,
+    AIV0 = 1,
+    AIV1 = 2,
+};
+
+/**
+ * Subtask mask bits (for active_mask / subtask_done_mask)
+ */
+inline constexpr uint8_t PTO2_SUBTASK_MASK_AIC = (1u << 0);   // 0x1
+inline constexpr uint8_t PTO2_SUBTASK_MASK_AIV0 = (1u << 1);  // 0x2
+inline constexpr uint8_t PTO2_SUBTASK_MASK_AIV1 = (1u << 2);  // 0x4
+
+/**
+ * Test whether a subtask slot is active in a given mask
+ */
+static inline bool pto2_subtask_active(uint8_t mask, PTO2SubtaskSlot slot) {
+    return (mask & (1u << static_cast<uint8_t>(slot))) != 0;
+}
+
+/**
+ * Mixed-task submit contract.
+ *
+ * Each field holds either a valid kernel ID or INVALID_KERNEL_ID (inactive).
+ * At least one slot must be valid.
+ */
+struct MixedKernels {
+    int32_t aic_kernel_id{INVALID_KERNEL_ID};
+    int32_t aiv0_kernel_id{INVALID_KERNEL_ID};
+    int32_t aiv1_kernel_id{INVALID_KERNEL_ID};
+};
+
+/**
+ * Resource shape — classifies a MixedKernels into one of 3 scheduling buckets.
+ *
+ * Multi-subtask tasks (2+ active slots) are all scheduled as MIX, which
+ * requires a fully-idle cluster (1 AIC + 2 AIV). The actual cores used
+ * are determined at dispatch time by active_mask — unused cores in the
+ * cluster remain idle and available for single-core tasks.
+ */
+enum class PTO2ResourceShape : uint8_t {
+    AIC = 0,  // Single AIC
+    AIV = 1,  // Single AIV
+    MIX = 2,  // Full cluster (dispatch uses active_mask)
+};
+
+inline constexpr int32_t PTO2_NUM_RESOURCE_SHAPES = 3;
+
+/**
+ * Derive resource shape from active_mask.
+ * Caller must ensure active_mask is valid (at least one bit set).
+ */ +static inline PTO2ResourceShape pto2_active_mask_to_shape(uint8_t active_mask) { + int bit_count = __builtin_popcount(active_mask); + if (bit_count >= 2) return PTO2ResourceShape::MIX; + if (active_mask & PTO2_SUBTASK_MASK_AIC) return PTO2ResourceShape::AIC; + return PTO2ResourceShape::AIV; +} + +/** + * Compute active_mask from MixedKernels. + */ +static inline uint8_t pto2_mixed_kernels_to_active_mask(const MixedKernels& mk) { + uint8_t mask = 0; + if (mk.aic_kernel_id != INVALID_KERNEL_ID) mask |= PTO2_SUBTASK_MASK_AIC; + if (mk.aiv0_kernel_id != INVALID_KERNEL_ID) mask |= PTO2_SUBTASK_MASK_AIV0; + if (mk.aiv1_kernel_id != INVALID_KERNEL_ID) mask |= PTO2_SUBTASK_MASK_AIV1; + return mask; +} + +/** + * SPMD launch parameters carried inside Arg. + * + * Controls how many logical blocks (SPMD dimension) a single task + * is expanded into at dispatch time. Each block receives a unique + * block_idx in [0, block_num) via the per-dispatch LocalContext. + */ +class PTO2LaunchSpec { + public: + constexpr PTO2LaunchSpec() = default; + + int16_t block_num() const { return block_num_; } + void set_block_num(int16_t n) { block_num_ = n; } + + private: + int16_t block_num_{1}; +}; diff --git a/src/a2a3/runtime/tensormap_and_ringbuffer_unmodified/runtime/pto_task_id.h b/src/a2a3/runtime/tensormap_and_ringbuffer_unmodified/runtime/pto_task_id.h new file mode 100644 index 000000000..595372f90 --- /dev/null +++ b/src/a2a3/runtime/tensormap_and_ringbuffer_unmodified/runtime/pto_task_id.h @@ -0,0 +1,50 @@ +/* + * Copyright (c) PyPTO Contributors. + * This program is free software, you can redistribute it and/or modify it under the terms and conditions of + * CANN Open Software License Agreement Version 2.0 (the "License"). + * Please refer to the License for details. You may not use this file except in compliance with the License. 
* THIS SOFTWARE IS PROVIDED ON AN "AS IS" BASIS, WITHOUT WARRANTIES OF ANY KIND, EITHER EXPRESS OR IMPLIED,
+ * INCLUDING BUT NOT LIMITED TO NON-INFRINGEMENT, MERCHANTABILITY, OR FITNESS FOR A PARTICULAR PURPOSE.
+ * See LICENSE in the root of the software repository for the full text of the License.
+ * -----------------------------------------------------------------------------------------------------------
+ */
+
+/**
+ * PTO2TaskId — minimal standalone header.
+ *
+ * Factored out of pto_runtime2_types.h so that tensor.h can include it
+ * without pulling in scheduler-internal constants (heap sizes, timeouts, etc.).
+ */
+
+#pragma once
+
+#include <cstdint>
+
+/**
+ * TaskId: 64-bit encoding used across Runtime2.
+ *
+ * raw encoding: (ring_id << 32) | local_id
+ *
+ * ring_id: which ring layer (0..PTO2_MAX_RING_DEPTH-1)
+ * local_id: per-ring monotonic counter
+ *
+ * Invalid sentinel: raw == UINT64_MAX (no valid task has this encoding).
+ */
+struct PTO2TaskId {
+    uint64_t raw;
+
+    static constexpr PTO2TaskId make(uint8_t ring_id, uint32_t local_id) {
+        return PTO2TaskId{(static_cast<uint64_t>(ring_id) << 32) | static_cast<uint64_t>(local_id)};
+    }
+
+    static constexpr PTO2TaskId invalid() { return PTO2TaskId{UINT64_MAX}; }
+
+    constexpr uint8_t ring() const { return static_cast<uint8_t>(raw >> 32); }
+    constexpr uint32_t local() const { return static_cast<uint32_t>(raw & 0xFFFFFFFFu); }
+    constexpr bool is_valid() const { return raw != UINT64_MAX; }
+
+    constexpr bool operator==(const PTO2TaskId& other) const { return raw == other.raw; }
+    constexpr bool operator!=(const PTO2TaskId& other) const { return raw != other.raw; }
+};
+
+static_assert(sizeof(PTO2TaskId) == 8, "PTO2TaskId must stay 8 bytes (shared memory ABI)");
diff --git a/src/a2a3/runtime/tensormap_and_ringbuffer_unmodified/runtime/pto_tensormap.cpp b/src/a2a3/runtime/tensormap_and_ringbuffer_unmodified/runtime/pto_tensormap.cpp
new file mode 100644
index 000000000..794636e3a
--- /dev/null
+++ 
b/src/a2a3/runtime/tensormap_and_ringbuffer_unmodified/runtime/pto_tensormap.cpp
@@ -0,0 +1,256 @@
+/**
+ * PTO Runtime2 - TensorMap Implementation
+ *
+ * Implements TensorMap with ring buffer pool, lazy invalidation,
+ * and chain truncation optimization.
+ *
+ * Key features:
+ * 1. O(1) insert at bucket head
+ * 2. O(valid_entries) lookup with chain truncation
+ * 3. Automatic stale entry cleanup during lookup
+ * 4. Periodic explicit cleanup for long chains
+ *
+ * Based on: docs/RUNTIME_LOGIC.md
+ */
+
+#include "pto_tensormap.h"
+
+#include <cstdlib>
+#include <cstring>
+
+#include "common.h"
+#include "common/unified_log.h"
+#include "pto_orchestrator.h"
+
+// =============================================================================
+// TensorMap Lookup Chain Length Statistics (compile-time toggle)
+// =============================================================================
+#if PTO2_TENSORMAP_PROFILING
+uint64_t g_lookup_chain_total = 0;
+uint64_t g_lookup_count = 0;
+int32_t g_lookup_chain_max = 0;
+uint64_t g_lookup_overlap_checks = 0;
+uint64_t g_lookup_overlap_hits = 0;
+uint64_t g_insert_count = 0;
+#endif
+
+// =============================================================================
+// Initialization and Destruction
+// =============================================================================
+
+bool PTO2TensorMap::init(int32_t new_num_buckets, int32_t new_pool_size, const int32_t new_task_window_sizes[PTO2_MAX_RING_DEPTH]) {
+    // Validate power of 2 for fast modulo
+    if ((new_num_buckets & (new_num_buckets - 1)) != 0) {
+        return false;  // num_buckets must be power of 2
+    }
+
+    // Allocate buckets
+    buckets = (PTO2TensorMapEntry**)malloc(new_num_buckets * sizeof(PTO2TensorMapEntry*));
+    if (!buckets) {
+        return false;
+    }
+
+    // Initialize all buckets to empty (nullptr)
+    for (int32_t i = 0; i < new_num_buckets; i++) {
+        buckets[i] = nullptr;
+    }
+
+    num_buckets = new_num_buckets;
+
+    // Allocate entry pool (64-byte aligned for cache-line-aligned entries)
+ entry_pool = (PTO2TensorMapEntry*)aligned_alloc(alignof(PTO2TensorMapEntry), new_pool_size * sizeof(PTO2TensorMapEntry)); + if (!entry_pool) { + free(buckets); + buckets = NULL; + return false; + } + memset(entry_pool, 0, new_pool_size * sizeof(PTO2TensorMapEntry)); + + // Allocate free entry list + free_entry_list = (PTO2TensorMapEntry**)calloc(new_pool_size, sizeof(PTO2TensorMapEntry*)); + if (!free_entry_list) { + free(buckets); + free(entry_pool); + buckets = NULL; + entry_pool = NULL; + return false; + } + + pool_size = new_pool_size; + next_entry_idx = 0; + free_num = 0; + + // Initialize all entries as not in bucket + for (int32_t i = 0; i < pool_size; i++) { + entry_pool[i].bucket_index = -1; + entry_pool[i].next_in_bucket = nullptr; + entry_pool[i].prev_in_bucket = nullptr; + entry_pool[i].next_in_task = nullptr; + entry_pool[i].prev_in_task = nullptr; + entry_pool[i].producer_task_id = PTO2TaskId{}; + } + + // Allocate per-ring per-task entry tracking (each ring has its own window size) + for (int r = 0; r < PTO2_MAX_RING_DEPTH; r++) { + task_entry_heads[r] = (PTO2TensorMapEntry**)malloc(new_task_window_sizes[r] * sizeof(PTO2TensorMapEntry*)); + if (!task_entry_heads[r]) { + // Cleanup previously allocated rings + for (int j = 0; j < r; j++) { + free(task_entry_heads[j]); + task_entry_heads[j] = NULL; + } + free(entry_pool); + free(buckets); + free(free_entry_list); + entry_pool = NULL; + buckets = NULL; + free_entry_list = NULL; + return false; + } + for (int32_t i = 0; i < new_task_window_sizes[r]; i++) { + task_entry_heads[r][i] = nullptr; + } + task_window_sizes[r] = new_task_window_sizes[r]; + } + + for (int r = 0; r < PTO2_MAX_RING_DEPTH; r++) { + last_task_alives[r] = 0; + last_cleanup[r] = 0; + } + + return true; +} + +bool PTO2TensorMap::init_default(const int32_t new_task_window_sizes[PTO2_MAX_RING_DEPTH]) { + return init(PTO2_TENSORMAP_NUM_BUCKETS, PTO2_TENSORMAP_POOL_SIZE, new_task_window_sizes); +} + +void PTO2TensorMap::destroy() { + if 
(buckets) { + free(buckets); + buckets = NULL; + } + + if (entry_pool) { + free(entry_pool); + entry_pool = NULL; + } + + for (int r = 0; r < PTO2_MAX_RING_DEPTH; r++) { + if (task_entry_heads[r]) { + free(task_entry_heads[r]); + task_entry_heads[r] = NULL; + } + } + + if (free_entry_list) { + free(free_entry_list); + free_entry_list = NULL; + } +} + +// ============================================================================= +// Debug Utilities +// ============================================================================= + +void PTO2TensorMap::print_stats() { + int32_t valid = 0; + int32_t stale = 0; + int32_t empty_buckets = 0; + int32_t max_chain = 0; + int64_t total_chain = 0; + int32_t non_empty_buckets = 0; + + // Count entries + for (int32_t i = 0; i < pool_size; i++) { + if (entry_pool[i].bucket_index != -1) { + if (entry_valid(entry_pool[i])) { + valid++; + } else { + stale++; + } + } + } + + // Count bucket stats + for (int32_t b = 0; b < num_buckets; b++) { + int32_t chain_len = 0; + auto cur_entry = buckets[b]; + + while (cur_entry != nullptr) { + chain_len++; + cur_entry = cur_entry->next_in_bucket; + } + + if (chain_len == 0) { + empty_buckets++; + } else { + non_empty_buckets++; + total_chain += chain_len; + if (chain_len > max_chain) { + max_chain = chain_len; + } + } + } + + LOG_INFO("=== TensorMap Statistics ==="); + LOG_INFO("Pool size: %d", pool_size); + LOG_INFO("Pool next entry idx: %d", next_entry_idx); + LOG_INFO("Pool free_num: %d", free_num); + LOG_INFO("Num buckets: %d", num_buckets); + LOG_INFO("Valid entries: %d", valid); + LOG_INFO("Stale entries: %d", stale); + LOG_INFO("Empty buckets: %d", empty_buckets); + LOG_INFO("Max chain len: %d", max_chain); + LOG_INFO("Avg chain len: %.2f", non_empty_buckets > 0 ? 
(float)total_chain / non_empty_buckets : 0); + for (int r = 0; r < PTO2_MAX_RING_DEPTH; r++) { + LOG_INFO("Last task alive[%d]: %d", r, last_task_alives[r]); + } + LOG_INFO("============================"); +} + +int32_t PTO2TensorMap::valid_count() { + int32_t count = 0; + + for (int32_t i = 0; i < pool_size; i++) { + if (entry_pool[i].bucket_index != -1 && entry_valid(entry_pool[i])) { + count++; + } + } + + return count; +} + +void PTO2TensorMap::sync_tensormap(uint8_t ring_id, int32_t sm_last_task_alive) { + sync_validity(ring_id, sm_last_task_alive); + // Only attempt cleanup when last_task_alive has actually advanced; + // otherwise cleanup_retired would empty-loop and we'd spin forever. + if (sm_last_task_alive - last_cleanup[ring_id] >= PTO2_TENSORMAP_CLEANUP_INTERVAL) { + cleanup_retired(ring_id, last_cleanup[ring_id], sm_last_task_alive); + last_cleanup[ring_id] = sm_last_task_alive; + } +} + +// ============================================================================= +// TensorMap Lookup Profiling +// ============================================================================= +#if PTO2_TENSORMAP_PROFILING +PTO2TensorMapProfilingData pto2_tensormap_get_profiling() { + PTO2TensorMapProfilingData d; + d.lookup_chain_total = g_lookup_chain_total; + d.lookup_count = g_lookup_count; + d.lookup_chain_max = g_lookup_chain_max; + d.overlap_checks = g_lookup_overlap_checks; + d.overlap_hits = g_lookup_overlap_hits; + d.insert_count = g_insert_count; + + // Reset + g_lookup_chain_total = 0; + g_lookup_count = 0; + g_lookup_chain_max = 0; + g_lookup_overlap_checks = 0; + g_lookup_overlap_hits = 0; + g_insert_count = 0; + return d; +} +#endif diff --git a/src/a2a3/runtime/tensormap_and_ringbuffer_unmodified/runtime/pto_tensormap.h b/src/a2a3/runtime/tensormap_and_ringbuffer_unmodified/runtime/pto_tensormap.h new file mode 100644 index 000000000..0916d96da --- /dev/null +++ b/src/a2a3/runtime/tensormap_and_ringbuffer_unmodified/runtime/pto_tensormap.h @@ -0,0 
+1,521 @@ +/* + * Copyright (c) PyPTO Contributors. + * This program is free software, you can redistribute it and/or modify it under the terms and conditions of + * CANN Open Software License Agreement Version 2.0 (the "License"). + * Please refer to the License for details. You may not use this file except in compliance with the License. + * THIS SOFTWARE IS PROVIDED ON AN "AS IS" BASIS, WITHOUT WARRANTIES OF ANY KIND, EITHER EXPRESS OR IMPLIED, + * INCLUDING BUT NOT LIMITED TO NON-INFRINGEMENT, MERCHANTABILITY, OR FITNESS FOR A PARTICULAR PURPOSE. + * See LICENSE in the root of the software repository for the full text of the License. + * ----------------------------------------------------------------------------------------------------------- + */ + +/** + * PTO Runtime2 - TensorMap Interface + * + * TensorMap provides producer lookup for dependency discovery: + * - Maps Tensor -> producer task ID + * - Used by pto_submit_task() to find dependencies + * + * Key design features: + * 1. Ring buffer pool for entries (no malloc/free) + * 2. Lazy invalidation (entries become stale when producer retires) + * 3. Per-task per-ring entry tracking for efficient cleanup + * 4. OVERLAP DETECTION: Detects dependencies for overlapping sub-regions + * + * Hash table with chaining: + * - buckets[] array of head offsets + * - Entries linked via next_in_bucket + * - Insert at head (newest first) for sorted chains + * + * CRITICAL: Hash only by base_ptr + * ============================== + * For overlap detection to work, ALL sub-regions of the same base tensor + * MUST be in the SAME hash bucket. This allows lookup to compare all + * potentially overlapping regions. + * + * Overlap detection: Two regions create a dependency if: + * 1. Same base_ptr (raw tensor pointer) + * 2. 
Byte ranges [offset, offset+size) intersect + * + * Based on: docs/RUNTIME_LOGIC.md + */ + +#pragma once + +#include "common.h" // NOLINT(build/include_subdir) +#include "pto_runtime2_types.h" // NOLINT(build/include_subdir) +#include "tensor.h" // NOLINT(build/include_subdir) + +struct PTO2OrchestratorState; // forward declare + +// ============================================================================= +// TensorMap Lookup Profiling (must precede inline lookup/insert methods) +// ============================================================================= +#ifndef PTO2_TENSORMAP_PROFILING +#define PTO2_TENSORMAP_PROFILING 0 +#endif + +#if PTO2_TENSORMAP_PROFILING +extern uint64_t g_lookup_chain_total; +extern uint64_t g_lookup_count; +extern int32_t g_lookup_chain_max; +extern uint64_t g_lookup_overlap_checks; +extern uint64_t g_lookup_overlap_hits; +extern uint64_t g_insert_count; +#endif + +// ============================================================================= +// TensorMap Structure +// ============================================================================= + +/** + * TensorMap entry structure — cache-line optimized for lookup + * + * Cache line 1 (64B, lookup hot path): + * next_in_bucket, producer_task_id, buffer_addr — chain traversal + validity + hash match + * version, ndims, is_all_offset_zero, bucket_index — overlap fast path + * shapes[5] — overlap comparison + * + * Cache line 2 (64B, insert/remove/slow-path only): + * prev_in_bucket, next_in_task, prev_in_task — chain manipulation + * offsets[5] — only read when !is_all_offset_zero + * + * When is_all_offset_zero is true, lookup touches only cache line 1. + * Entry size: 128B (2 cache lines) vs previous 192B (3 cache lines with embedded Tensor). 
+ */ +struct alignas(64) PTO2TensorMapEntry { + // === Cache line 1 (64B) — lookup hot path === + uint64_t buffer_addr; // 8B: tensor base address (hash key) + PTO2TensorMapEntry* next_in_bucket; // 8B: next entry in hash bucket chain + PTO2TaskId producer_task_id; // 8B: raw (ring_id << 32) | local_id + int32_t bucket_index; // 4B: bucket index (-1 if unlinked) + uint32_t __padding0__; // 4B: occupies Tensor::start_offset high half + int32_t version; // 4B: tensor version for overlap detection + uint32_t ndims; // 4B: number of dimensions + DataType __padding_dtype__; // 1B: occupies Tensor::dtype + bool is_all_offset_zero; // 1B: fast-path flag + uint8_t __padding1__[2]; + uint32_t shapes[RUNTIME_MAX_TENSOR_DIMS]; // 20B: shape per dimension + + // === Cache line 2 (64B) — insert/remove/slow-path === + PTO2TensorMapEntry* prev_in_bucket; // 8B: prev in hash bucket chain + PTO2TensorMapEntry* next_in_task; // 8B: next entry for same task + PTO2TensorMapEntry* prev_in_task; // 8B: prev entry for same task + uint32_t offsets[RUNTIME_MAX_TENSOR_DIMS]; // 20B: only when !is_all_offset_zero + // padding: 20B to fill 64B + + /** + * Copy overlap-relevant fields from a Tensor into this entry. + */ + void copy_from_tensor(const Tensor& tensor) { + memcpy(this, &tensor, 64); + if (!tensor.is_all_offset_zero) { + for (uint32_t i = 0; i < tensor.ndims; i++) { + offsets[i] = tensor.offsets[i]; + } + } + } + + void copy_tensor_create_info(const TensorCreateInfo& tensor_create_info, uint64_t addr) { + memcpy(this, &tensor_create_info, 64); + buffer_addr = addr; + } + + /** + * Check overlap between input tensor and this entry (the producer output). + * Mirrors Tensor::is_overlap() logic but operates on entry fields directly. 
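As a doc-comment aside, the interval logic described here can be exercised in isolation. The sketch below uses hypothetical names (`Overlap`, `classify_1d`) rather than the runtime's `OverlapStatus`/`Segment` types, and shows the decision for a single dimension: no intersection means no dependency, and the producer region is "covered" only when the consumer range contains it.

```cpp
#include <cstdint>

// Hypothetical stand-ins for the runtime's overlap classification:
// two 1-D regions [off, off+len) create a dependency iff they intersect;
// the result is COVERED iff the consumer range contains the producer range.
enum class Overlap { NO_OVERLAP, COVERED, OTHER };

Overlap classify_1d(uint64_t in_off, uint64_t in_len,
                    uint64_t ent_off, uint64_t ent_len) {
    uint64_t in_end = in_off + in_len;
    uint64_t ent_end = ent_off + ent_len;
    // Disjoint ranges: no dependency at all.
    if (in_end <= ent_off || ent_end <= in_off) return Overlap::NO_OVERLAP;
    // Intersecting: COVERED only if the consumer contains the producer.
    bool contains = (in_off <= ent_off) && (ent_end <= in_end);
    return contains ? Overlap::COVERED : Overlap::OTHER;
}
```

In the real `check_overlap`, this test runs per dimension and NO_OVERLAP in any dimension short-circuits the loop.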
+ */
+    OverlapStatus check_overlap(const Tensor& input) const {
+        debug_assert(input.buffer.addr == buffer_addr);
+        debug_assert(input.version >= version);
+        if (input.version > version) {
+            return OverlapStatus::OTHER;
+        }
+        // Fast path: both have zero offsets → ranges are [0, shape[i])
+        if (input.is_all_offset_zero && is_all_offset_zero) {
+            bool contains = true;
+            for (uint32_t i = 0; i < ndims; i++) {
+                if (input.shapes[i] < shapes[i]) {
+                    contains = false;
+                    break;
+                }
+            }
+            return contains ? OverlapStatus::COVERED : OverlapStatus::OTHER;
+        }
+        // Slow path: at least one has non-zero offsets
+        bool contains = true;
+        for (uint32_t i = 0; i < ndims; i++) {
+            uint64_t in_off = input.is_all_offset_zero ? 0 : input.offsets[i];
+            uint64_t ent_off = is_all_offset_zero ? 0 : offsets[i];
+            Segment in_range{in_off, in_off + static_cast<uint64_t>(input.shapes[i])};
+            Segment ent_range{ent_off, ent_off + static_cast<uint64_t>(shapes[i])};
+            if (!in_range.line_segment_intersection(ent_range)) {
+                return OverlapStatus::NO_OVERLAP;
+            } else if (!in_range.contains(ent_range)) {
+                contains = false;
+            }
+        }
+        return contains ? 
OverlapStatus::COVERED : OverlapStatus::OTHER;
+    }
+};
+
+static_assert(sizeof(PTO2TensorMapEntry) == 128, "TensorMapEntry must be exactly 2 cache lines (128 bytes)");
+static_assert(offsetof(PTO2TensorMapEntry, buffer_addr) == offsetof(Tensor, buffer.addr));
+static_assert(offsetof(PTO2TensorMapEntry, version) == offsetof(Tensor, version));
+static_assert(offsetof(PTO2TensorMapEntry, ndims) == offsetof(Tensor, ndims));
+static_assert(offsetof(PTO2TensorMapEntry, is_all_offset_zero) == offsetof(Tensor, is_all_offset_zero));
+static_assert(offsetof(PTO2TensorMapEntry, shapes) == offsetof(Tensor, shapes));
+static_assert(
+    offsetof(PTO2TensorMapEntry, prev_in_bucket) == 64, "TensorMapEntry must be exactly 2 cache lines (128 bytes)");
+
+/**
+ * Stack-allocated lookup result (avoids heap allocation per lookup)
+ */
+#define PTO2_LOOKUP_MAX_RESULTS 16
+
+struct PTO2LookupResult {
+    struct Entry {
+        PTO2TensorMapEntry* entry;
+        OverlapStatus overlap_status;
+    };
+    Entry entries[PTO2_LOOKUP_MAX_RESULTS];
+    int32_t count{0};
+
+    void push(PTO2TensorMapEntry* entry, OverlapStatus s) {
+        if (count < PTO2_LOOKUP_MAX_RESULTS) {
+            entries[count++] = {entry, s};
+        }
+    }
+};
+
+/**
+ * TensorMap structure
+ *
+ * Hash table with ring buffer entry pool and lazy invalidation. 
+ */
+struct PTO2TensorMap {
+    // Hash table buckets (fixed size, power of 2)
+    PTO2TensorMapEntry** buckets;          // Array of bucket head pointers (nullptr = empty)
+    int32_t num_buckets;                   // Must be power of 2 for fast modulo
+
+    // Entry pool as ring buffer
+    PTO2TensorMapEntry* entry_pool;        // Ring buffer of entries
+    PTO2TensorMapEntry** free_entry_list;  // Stack of freed entry pointers
+    int32_t pool_size;                     // Total pool capacity
+    int32_t next_entry_idx;                // Index of the next never-used entry in the pool
+    int32_t free_num;                      // Number of entries currently on free_entry_list
+
+    // Per-ring per-task entry tracking (for efficient bucket cleanup)
+    // Indexed by [ring_id][local_id & (task_window_sizes[ring_id] - 1)]
+    PTO2TensorMapEntry** task_entry_heads[PTO2_MAX_RING_DEPTH];
+    int32_t task_window_sizes[PTO2_MAX_RING_DEPTH];  // Per-ring task window size (for slot masking)
+
+    // Per-ring validity threshold (for lazy invalidation)
+    int32_t last_task_alives[PTO2_MAX_RING_DEPTH];  // Cached from shared memory per ring
+
+    // Per-ring cleanup progress (for periodic cleanup_retired)
+    int32_t last_cleanup[PTO2_MAX_RING_DEPTH]{};
+
+    PTO2OrchestratorState* orch{nullptr};
+
+    // new_entry only allocates memory; it does not assign attributes
+    PTO2TensorMapEntry* new_entry() {
+        if (free_num > 0) {
+            PTO2TensorMapEntry* res = free_entry_list[--free_num];
+            debug_assert(res->bucket_index == -1);
+            return res;
+        }
+        always_assert(next_entry_idx < pool_size);
+        PTO2TensorMapEntry* res = &entry_pool[next_entry_idx++];
+        debug_assert(res->bucket_index == -1);
+        return res;
+    }
+
+    void free_entry(PTO2TensorMapEntry& entry) {
+        always_assert(entry.bucket_index != -1);  // must still be in a bucket
+
+        // Update predecessor's next pointer (O(1) via prev_in_bucket)
+        if (entry.prev_in_bucket == nullptr) {
+            // Entry is the head of its bucket chain, update the bucket head
+            buckets[entry.bucket_index] = entry.next_in_bucket;
+        } else {
+            entry.prev_in_bucket->next_in_bucket = entry.next_in_bucket;
+        }
+ + // Update successor's prev pointer + if (entry.next_in_bucket != nullptr) { + entry.next_in_bucket->prev_in_bucket = entry.prev_in_bucket; + } + + free_entry_list[free_num++] = &entry; + entry.bucket_index = -1; + entry.next_in_bucket = nullptr; + entry.prev_in_bucket = nullptr; + entry.next_in_task = nullptr; + entry.prev_in_task = nullptr; + } + + // ============================================================================= + // TensorMap API + // ============================================================================= + + /** + * Initialize TensorMap + * + * @param num_buckets Number of hash buckets (must be power of 2) + * @param pool_size Size of entry pool + * @return true on success, false on allocation failure + */ + bool init(int32_t num_buckets, int32_t pool_size, const int32_t task_window_sizes[PTO2_MAX_RING_DEPTH]); + + /** + * Initialize TensorMap with default sizes + */ + bool init_default(const int32_t task_window_sizes[PTO2_MAX_RING_DEPTH]); + + /** + * Destroy TensorMap and free resources + */ + void destroy(); + + /** + * Update validity threshold from shared memory + * Called periodically to refresh the lazy invalidation threshold. + * + * @param last_task_alive Current value from shared memory + */ + void sync_validity(int32_t ring_id, int32_t last_task_alive) { this->last_task_alives[ring_id] = last_task_alive; } + + /** + * Lookup producer for a tensor region + * + * Searches the hash table for a matching region. + * Returns producer entry if found and valid. + * Stale entries from different rings are skipped (not truncated). 
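The skip-without-truncation rule relies on a per-ring validity threshold. This minimal sketch (a hypothetical free function mirroring `entry_valid()` below, not the runtime's member) shows the rule lookup assumes: an entry is stale once its producer's local id falls below the ring's `last_task_alive` threshold.

```cpp
#include <cstdint>

// Lazy invalidation: entries are never eagerly deleted when a producer
// retires; lookup just skips any entry whose producer has fallen below
// the per-ring last_task_alive watermark.
bool is_entry_live(int32_t producer_local_id, int32_t last_task_alive) {
    return producer_local_id >= last_task_alive;
}
```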
+ * + * @param tensor Tensor to look up + * @param result Output: stack-allocated result buffer + */ + void lookup(const Tensor& tensor, PTO2LookupResult& result) { + uint32_t bucket_index = hash(tensor.buffer.addr); + PTO2TensorMapEntry* cur_entry = buckets[bucket_index]; + + result.count = 0; +#if PTO2_TENSORMAP_PROFILING + g_lookup_count++; + int32_t chain_len = 0; +#endif + + while (cur_entry != nullptr) { + PTO2TensorMapEntry* next_entry = cur_entry->next_in_bucket; + +#if PTO2_TENSORMAP_PROFILING + chain_len++; +#endif + // Skip stale entries (no chain truncation — entries from different + // rings can be interleaved, so a stale entry from one ring does NOT + // imply subsequent entries from other rings are also stale) + if (!entry_valid(*cur_entry)) { + cur_entry = next_entry; + continue; + } + + // Entry is valid - check if regions OVERLAP (not just exact match) + // Since we hash only by base_ptr, all entries in this bucket have + // potential to overlap. We must check actual byte-range overlap. + if (tensor.buffer.addr == cur_entry->buffer_addr) { +#if PTO2_TENSORMAP_PROFILING + g_lookup_overlap_checks++; +#endif + auto overlap_status = cur_entry->check_overlap(tensor); + if (overlap_status != OverlapStatus::NO_OVERLAP) { + result.push(cur_entry, overlap_status); +#if PTO2_TENSORMAP_PROFILING + g_lookup_overlap_hits++; +#endif + } + } + + // Move to next entry + cur_entry = next_entry; + } +#if PTO2_TENSORMAP_PROFILING + g_lookup_chain_total += chain_len; + if (chain_len > g_lookup_chain_max) g_lookup_chain_max = chain_len; +#endif + } + + /** + * Insert a new entry (called when task produces output) + * + * Allocates from ring buffer pool, may overwrite stale entries. + * Inserts at head of hash bucket chain (maintains task_id ordering). 
+     *
+     * @param tensor Tensor produced
+     * @param producer_task_id Task ID of producer
+     */
+    void insert(const Tensor& tensor, PTO2TaskId producer_task_id) {
+        PTO2TensorMapEntry* entry = new_entry();
+        entry->copy_from_tensor(tensor);
+        link_entry(entry, tensor.buffer.addr, producer_task_id);
+    }
+
+    /**
+     * Cleanup stale entries for retired tasks
+     *
+     * Called periodically by Orchestrator when last_task_alive advances.
+     * Removes entries from bucket chains for tasks in [old, new) range.
+     *
+     * @param old_last_task_alive Previous threshold
+     * @param new_last_task_alive New threshold
+     */
+    void cleanup_retired(int32_t ring_id, int32_t old_last_task_alive, int32_t new_last_task_alive) {
+        // Iterate through retired tasks on this ring and remove their entries
+        for (int32_t local_id = old_last_task_alive; local_id < new_last_task_alive; local_id++) {
+            int32_t task_slot = local_id & (task_window_sizes[ring_id] - 1);
+            PTO2TensorMapEntry* cur_entry = task_entry_heads[ring_id][task_slot];
+
+            while (cur_entry != nullptr) {
+                PTO2TensorMapEntry* next_entry = cur_entry->next_in_task;  // Save before clearing
+                // Only remove if this entry belongs to the retiring task
+                // (slot may have been reused by a newer task)
+                debug_assert(cur_entry->producer_task_id ==
+                             PTO2TaskId::make(static_cast<uint32_t>(ring_id), static_cast<uint32_t>(local_id)));
+                free_entry(*cur_entry);
+                cur_entry = next_entry;
+            }
+
+            // Clear task's entry head (slot will be reused by local_id + task_window_sizes[ring_id])
+            task_entry_heads[ring_id][task_slot] = nullptr;
+        }
+    }
+
+    // =============================================================================
+    // Internal Helpers (exposed for testing)
+    // =============================================================================
+
+    /**
+     * Compute hash for tensor addr
+     *
+     * Multiplicative hash using the golden-ratio constant. Multiplication
+     * mixes ALL input bits into the high bits of the product, so aligned
+     * addresses (low bits all-zero) still distribute evenly. We extract
+     * the top log2(num_buckets) bits which carry the most entropy.
+     */
+    uint32_t hash(uint64_t key) {
+        key *= 0x9E3779B97F4A7C15ULL;
+        return static_cast<uint32_t>(key >> (64 - __builtin_ctz(num_buckets)));
+    }
+
+    /**
+     * Link an initialized entry into bucket and task chains.
+     */
+    void link_entry(PTO2TensorMapEntry* entry, uint64_t addr, PTO2TaskId producer_task_id) {
+#if PTO2_TENSORMAP_PROFILING
+        g_insert_count++;
+#endif
+        uint32_t bucket_index = hash(addr);
+        auto ring_id = producer_task_id.ring();
+        auto local_id = producer_task_id.local();
+        int32_t task_slot = local_id & (task_window_sizes[ring_id] - 1);
+
+        entry->producer_task_id = producer_task_id;
+
+        // Insert at head of hash bucket
+        entry->bucket_index = bucket_index;
+        entry->next_in_bucket = buckets[bucket_index];
+        if (entry->next_in_bucket != nullptr) {
+            entry->next_in_bucket->prev_in_bucket = entry;
+        }
+        buckets[bucket_index] = entry;
+        entry->prev_in_bucket = nullptr;
+
+        // Link to task's entry list
+        entry->next_in_task = task_entry_heads[ring_id][task_slot];
+        entry->prev_in_task = nullptr;
+        if (entry->next_in_task != nullptr) {
+            entry->next_in_task->prev_in_task = entry;
+        }
+        task_entry_heads[ring_id][task_slot] = entry;
+    }
+
+    /**
+     * Check if entry is valid (producer has not retired)
+     */
+    bool entry_valid(const PTO2TensorMapEntry& entry) const {
+        return static_cast<int32_t>(entry.producer_task_id.local()) >= last_task_alives[entry.producer_task_id.ring()];
+    }
+
+    void remove_entry(PTO2TensorMapEntry& entry) {
+        remove_from_task(entry);
+        free_entry(entry);
+    }
+
+    /**
+     * Remove entry from its task chain (O(1) with prev pointer)
+     * Called during pool wrap-around to unlink reused entries. 
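The golden-ratio multiplicative hash described above can be exercised standalone. `bucket_of` is a hypothetical free-function mirror of `hash()` (same constant, same top-bit extraction), shown here only to illustrate the technique; it relies on the GCC/Clang builtin `__builtin_ctz`.

```cpp
#include <cstdint>

// Fibonacci/multiplicative hashing: multiply by the 64-bit golden-ratio
// constant, then keep the top log2(num_buckets) bits, which receive
// mixing from every input bit (including the all-zero low bits of
// cache-line-aligned addresses).
uint32_t bucket_of(uint64_t key, uint32_t num_buckets) {  // num_buckets must be a power of 2
    key *= 0x9E3779B97F4A7C15ULL;
    return static_cast<uint32_t>(key >> (64 - __builtin_ctz(num_buckets)));
}
```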
+     */
+    void remove_from_task(PTO2TensorMapEntry& entry) {
+        always_assert(entry.bucket_index != -1);  // must still be in a bucket
+        // Update predecessor's next pointer (O(1) via prev_in_task)
+        if (entry.prev_in_task == nullptr) {
+            // Entry is the head of its task chain, update task_entry_heads
+            int32_t ring_id = entry.producer_task_id.ring();
+            int32_t local_id = static_cast<int32_t>(entry.producer_task_id.local());
+            int32_t task_slot = local_id & (task_window_sizes[ring_id] - 1);
+            task_entry_heads[ring_id][task_slot] = entry.next_in_task;
+        } else {
+            entry.prev_in_task->next_in_task = entry.next_in_task;
+        }
+
+        // Update successor's prev pointer
+        if (entry.next_in_task != nullptr) {
+            entry.next_in_task->prev_in_task = entry.prev_in_task;
+        }
+
+        entry.next_in_task = nullptr;
+        entry.prev_in_task = nullptr;
+    }
+
+    // =============================================================================
+    // Debug Utilities
+    // =============================================================================
+
+    /**
+     * Print TensorMap statistics
+     */
+    void print_stats();
+
+    /**
+     * Get count of valid entries
+     */
+    int32_t valid_count();
+
+    // =============================================================================
+    // TensorMap Synchronization
+    // =============================================================================
+
+    /**
+     * Sync TensorMap validity threshold from shared memory
+     *
+     * Called periodically to refresh the lazy invalidation threshold.
+     * Also triggers cleanup if threshold has advanced significantly. 
+ */ + void sync_tensormap(uint8_t ring_id, int32_t sm_last_task_alive); +}; + +#if PTO2_TENSORMAP_PROFILING +struct PTO2TensorMapProfilingData { + uint64_t lookup_chain_total; + uint64_t lookup_count; + int32_t lookup_chain_max; + uint64_t overlap_checks; + uint64_t overlap_hits; + uint64_t insert_count; +}; + +PTO2TensorMapProfilingData pto2_tensormap_get_profiling(); +#endif diff --git a/src/a2a3/runtime/tensormap_and_ringbuffer_unmodified/runtime/pto_types.h b/src/a2a3/runtime/tensormap_and_ringbuffer_unmodified/runtime/pto_types.h new file mode 100644 index 000000000..6c3eb3acc --- /dev/null +++ b/src/a2a3/runtime/tensormap_and_ringbuffer_unmodified/runtime/pto_types.h @@ -0,0 +1,284 @@ +/* + * Copyright (c) PyPTO Contributors. + * This program is free software, you can redistribute it and/or modify it under the terms and conditions of + * CANN Open Software License Agreement Version 2.0 (the "License"). + * Please refer to the License for details. You may not use this file except in compliance with the License. + * THIS SOFTWARE IS PROVIDED ON AN "AS IS" BASIS, WITHOUT WARRANTIES OF ANY KIND, EITHER EXPRESS OR IMPLIED, + * INCLUDING BUT NOT LIMITED TO NON-INFRINGEMENT, MERCHANTABILITY, OR FITNESS FOR A PARTICULAR PURPOSE. + * See LICENSE in the root of the software repository for the full text of the License. + * ----------------------------------------------------------------------------------------------------------- + */ +/** + * Orchestration Build Graph Types - Data structures for orchestration runtime extensions + * + * Standalone header defining orchestration-specific types for: + * - TaskOutputTensors: Return value from submit containing materialized output Tensors + * - Arg: Aggregated argument container for pto_submit_task API + * + * Tensor descriptor types (Tensor, PTOBufferHandle, TensorCreateInfo) are + * defined in tensor.h. 
+ *
+ * This header is independent of orch_build_graph_runtime.h to allow inclusion from runtime.h
+ * without type conflicts (Handshake, TensorPair, HostApi).
+ */
+
+#ifndef SRC_A2A3_RUNTIME_TENSORMAP_AND_RINGBUFFER_RUNTIME_PTO_TYPES_H_
+#define SRC_A2A3_RUNTIME_TENSORMAP_AND_RINGBUFFER_RUNTIME_PTO_TYPES_H_
+
+#include <cstdint>
+#include <cstring>
+
+#if defined(__aarch64__)
+#include <arm_neon.h>
+#endif
+
+#include "pto_submit_types.h"  // NOLINT(build/include_subdir) -- PTO2LaunchSpec
+#include "task_args.h"         // NOLINT(build/include_subdir) -- TaskArgs base class
+#include "tensor.h"            // NOLINT(build/include_subdir)
+#include "tensor_arg.h"        // NOLINT(build/include_subdir) -- canonical TensorArgType definition
+
+// Task arguments
+#define MAX_TENSOR_ARGS 16   // Maximum tensor arguments per task
+#define MAX_SCALAR_ARGS 128  // Maximum scalar arguments per task
+#define PTO2_MAX_OUTPUTS 16  // Maximum outputs per task
+#define PTO2_MAX_INPUTS 16   // Maximum inputs per task
+#define PTO2_MAX_INOUTS 8    // Maximum in-out args per task
+
+// =============================================================================
+// Task Output Tensors (return value from submit)
+// =============================================================================
+
+/**
+ * TaskOutputTensors — returned by submit, holds materialized output Tensors.
+ *
+ * Only runtime-created outputs are stored here, indexed in add_output order.
+ *
+ * The underlying storage is uninitialized; only output_count elements are
+ * valid after submit returns. This avoids default-constructing Tensor[]
+ * on the hot path (2 KB of unnecessary zeroing per submit).
+ *
+ * Users must hold a named TaskOutputTensors variable and borrow via get_ref();
+ * binding get_ref() on an rvalue is compile-time rejected to prevent dangling. 
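The rvalue rejection mentioned above is built from ref-qualified overloads. This minimal sketch (a hypothetical `Holder` type, not `TaskOutputTensors` itself) shows the pattern: deleting the `const&&` overload makes borrowing from a temporary a compile-time error while named objects still work.

```cpp
// Lvalue-only borrow idiom: overload resolution on an rvalue picks the
// deleted const&& overload, so a dangling borrow cannot compile.
struct Holder {
    int value = 42;
    const int& get_ref() const& { return value; }
    const int& get_ref() const&& = delete;
};
// Holder h;  h.get_ref();   // OK: h outlives the borrowed reference
// Holder{}.get_ref();       // ill-formed: selects the deleted const&& overload
```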
+ */ +class TaskOutputTensors { + public: // NOLINT(whitespace/indent) + TaskOutputTensors() : output_count_(0) {} + + bool empty() const { return output_count_ == 0; } + uint32_t size() const { return output_count_; } + + /// Borrow a materialized output tensor by index (lvalue only). + const Tensor& get_ref(uint32_t index) const& { + always_assert(index < output_count_); + return *tensors_[index]; + } + const Tensor& get_ref(uint32_t index) const&& = delete; + + /// Runtime-internal: append one materialized output Tensor. + void materialize_output(const Tensor& tensor) { + always_assert(output_count_ < PTO2_MAX_OUTPUTS); + tensors_[output_count_++] = &tensor; + } + + private: // NOLINT(whitespace/indent) + uint32_t output_count_; + const Tensor* tensors_[PTO2_MAX_OUTPUTS]; +}; + +// ============================================================================= +// Argument Types (for pto_submit_task API) +// ============================================================================= + +// TensorArgType is defined in tensor_arg.h (included above) + +/** + * Tagged union for a single Arg slot — either a Tensor* or a TensorCreateInfo value. + * The active member is determined by TensorArgType (OUTPUT → create_info, else → ptr). + */ +union TensorRef { + const Tensor* ptr; + const TensorCreateInfo* create_info; + TensorRef() : ptr(nullptr) {} +}; + +/** + * Aggregated argument container for pto_submit_task + * + * Inherits storage from TaskArgs. + * Each tensor slot stores a TensorRef union (Tensor* or TensorCreateInfo) + * discriminated by the corresponding tag(). + * Tensors are dispatched first in kernel args, followed by scalars. + * + * Output arguments follow two distinct ownership models: + * - add_output(const TensorCreateInfo&): OUTPUT — runtime allocates buffer + * and materializes a new Tensor, returned via TaskOutputTensors. + * - add_inout(const Tensor&): INOUT — reuses an existing Tensor as the write target. 
+ * + * Example: + * Tensor x = make_tensor_external(dev_a, shapes, 2); + * TensorCreateInfo ci(shapes, 2); // must outlive submit + * Arg args; + * args.add_input(x); + * args.add_output(ci); + * args.add_scalar(some_value); + * TaskOutputTensors outs = pto2_rt_submit_aic_task(kernel_id, args); + * const Tensor& y = outs.get_ref(0); + */ +struct Arg : TaskArgs { + bool has_error{false}; + const char* error_msg{nullptr}; + PTO2LaunchSpec launch_spec; // SPMD launch parameters (block_num, etc.) + + void reset() { + clear(); + has_error = false; + error_msg = nullptr; + } + + void set_error(const char* msg) { + if (!has_error) { + has_error = true; + error_msg = msg; + } + } + + bool check_add_tensor_valid() { + if (scalar_count_ != 0) { + set_error( + "add_input/add_output/add_inout called after add_scalar: " + "all tensors must be added before any scalars"); + return false; + } + if (tensor_count_ >= MAX_TENSOR_ARGS) { + set_error("Too many tensor args (exceeds MAX_TENSOR_ARGS=16)"); + return false; + } + return true; + } + + void add_input(const Tensor& t) { + if (!check_add_tensor_valid()) { + return; + } + tensors_[tensor_count_].ptr = &t; + tags_[tensor_count_] = TensorArgType::INPUT; + tensor_count_++; + } + + /// Standard future-output path: runtime allocates buffer from heap, + /// materializes Tensor into TaskOutputTensors. + /// The TensorCreateInfo must outlive the submit call (pointer is stored). + void add_output(const TensorCreateInfo& ci) { + if (!check_add_tensor_valid()) { + return; + } + tensors_[tensor_count_].create_info = &ci; + tags_[tensor_count_] = TensorArgType::OUTPUT; + tensor_count_++; + } + + /// Prevent passing temporaries — the pointer would dangle before submit. 
+    void add_output(TensorCreateInfo&&) = delete;
+
+    void add_inout(const Tensor& t) {
+        if (!check_add_tensor_valid()) {
+            return;
+        }
+        tensors_[tensor_count_].ptr = &t;
+        tags_[tensor_count_] = TensorArgType::INOUT;
+        tensor_count_++;
+    }
+
+    /// Write-only existing tensor: skips OverlapMap lookup, depends on creator.
+    void add_output(const Tensor& t) {
+        if (!check_add_tensor_valid()) return;
+        tensors_[tensor_count_].ptr = &t;
+        tags_[tensor_count_] = TensorArgType::OUTPUT_EXISTING;
+        tensor_count_++;
+    }
+
+    /// No-dependency existing tensor: skips OverlapMap lookup, depends on creator only.
+    void add_no_dep(const Tensor& t) {
+        if (!check_add_tensor_valid()) return;
+        tensors_[tensor_count_].ptr = &t;
+        tags_[tensor_count_] = TensorArgType::NO_DEP;
+        tensor_count_++;
+    }
+
+    /**
+     * Add a scalar value. Type is deduced from the argument;
+     * the value is bit-cast to uint64_t for storage.
+     *
+     *   args.add_scalar(uint64_val);   // existing usage unchanged
+     *   args.add_scalar(3.14f);        // float, auto bit-cast
+     *   args.add_scalar(int32_t(42));  // int32, auto bit-cast
+     */
+    template <typename T>
+    void add_scalar(T value) {
+        if (scalar_count_ >= MAX_SCALAR_ARGS) {
+            set_error("Too many scalar args (exceeds MAX_SCALAR_ARGS=128)");
+            return;
+        }
+        scalars_[scalar_count_++] = to_u64(value);
+    }
+
+    void add_scalars(const uint64_t* values, int count) {
+        if (scalar_count_ + count > MAX_SCALAR_ARGS) {
+            set_error("Too many scalar args (exceeds MAX_SCALAR_ARGS=128)");
+            return;
+        }
+        memcpy(&scalars_[scalar_count_], values, count * sizeof(uint64_t));
+        scalar_count_ += count;
+    }
+
+    /**
+     * Zero-extend int32 bit patterns into uint64 scalar slots.
+     * Negative values are treated as their unsigned 32-bit representation
+     * (e.g., -1 → 0x00000000FFFFFFFF, not 0xFFFFFFFFFFFFFFFF).
+     * Uses NEON to process 4 elements per iteration on aarch64. 
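The zero-extension contract documented above reduces to a double cast. `widen_i32` below is a hypothetical scalar helper mirroring the non-NEON fallback path, shown to make the sign-vs-zero extension distinction concrete.

```cpp
#include <cstdint>

// int32 bit pattern zero-extended (not sign-extended) into a 64-bit slot:
// casting through uint32_t first discards the sign, so -1 widens to
// 0x00000000FFFFFFFF rather than 0xFFFFFFFFFFFFFFFF.
uint64_t widen_i32(int32_t v) {
    return static_cast<uint64_t>(static_cast<uint32_t>(v));
}
```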
+     */
+    void add_scalars_i32(const int32_t* values, int count) {
+        if (scalar_count_ + count > MAX_SCALAR_ARGS) {
+            set_error("Too many scalar args (exceeds MAX_SCALAR_ARGS=128)");
+            return;
+        }
+        uint64_t* dst = &scalars_[scalar_count_];
+#if defined(__aarch64__)
+        int i = 0;
+        for (; i + 4 <= count; i += 4) {
+            uint32x4_t v = vld1q_u32(reinterpret_cast<const uint32_t*>(values + i));
+            uint64x2_t lo = vmovl_u32(vget_low_u32(v));
+            uint64x2_t hi = vmovl_u32(vget_high_u32(v));
+            vst1q_u64(dst + i, lo);
+            vst1q_u64(dst + i + 2, hi);
+        }
+        for (; i < count; i++) {
+            dst[i] = static_cast<uint64_t>(static_cast<uint32_t>(values[i]));
+        }
+#else
+        for (int i = 0; i < count; i++) {
+            dst[i] = static_cast<uint64_t>(static_cast<uint32_t>(values[i]));
+        }
+#endif
+        scalar_count_ += count;
+    }
+
+    /**
+     * Copy scalars from another Arg's scalar array.
+     * Useful when multiple tasks share the same scalar data (e.g., block indices).
+     */
+    void copy_scalars_from(const Arg& src, int src_offset, int count) {
+        if (src_offset + count > src.scalar_count_) {
+            set_error("Source scalar range out of bounds in copy_scalars_from");
+            return;
+        }
+        if (scalar_count_ + count > MAX_SCALAR_ARGS) {
+            set_error("Too many scalar args (exceeds MAX_SCALAR_ARGS=128)");
+            return;
+        }
+        memcpy(&scalars_[scalar_count_], &src.scalars_[src_offset], count * sizeof(uint64_t));
+        scalar_count_ += count;
+    }
+};
+
+#endif  // SRC_A2A3_RUNTIME_TENSORMAP_AND_RINGBUFFER_RUNTIME_PTO_TYPES_H_
diff --git a/src/a2a3/runtime/tensormap_and_ringbuffer_unmodified/runtime/runtime.cpp b/src/a2a3/runtime/tensormap_and_ringbuffer_unmodified/runtime/runtime.cpp
new file mode 100644
index 000000000..38f74b8bc
--- /dev/null
+++ b/src/a2a3/runtime/tensormap_and_ringbuffer_unmodified/runtime/runtime.cpp
@@ -0,0 +1,144 @@
+/*
+ * Copyright (c) PyPTO Contributors.
+ * This program is free software, you can redistribute it and/or modify it under the terms and conditions of
+ * CANN Open Software License Agreement Version 2.0 (the "License").
+ * Please refer to the License for details. 
You may not use this file except in compliance with the License. + * THIS SOFTWARE IS PROVIDED ON AN "AS IS" BASIS, WITHOUT WARRANTIES OF ANY KIND, EITHER EXPRESS OR IMPLIED, + * INCLUDING BUT NOT LIMITED TO NON-INFRINGEMENT, MERCHANTABILITY, OR FITNESS FOR A PARTICULAR PURPOSE. + * See LICENSE in the root of the software repository for the full text of the License. + * ----------------------------------------------------------------------------------------------------------- + */ +/** + * Runtime Class - Implementation + * + * Device execution and handshake control. + * Task graph construction is handled by PTO2Runtime. + */ + +#include "runtime.h" // NOLINT(build/include_subdir) + +#include "common/unified_log.h" +#include "pto_runtime2_types.h" // NOLINT(build/include_subdir) +#include "pto_shared_memory.h" // NOLINT(build/include_subdir) + +// ============================================================================= +// Constructor +// ============================================================================= + +Runtime::Runtime() { + // NOTE: host_api is initialized in InitRuntime() (host-only code) + // because the CApi functions don't exist when compiled for device. 
+ + // Initialize handshake buffers + memset(workers, 0, sizeof(workers)); + worker_count = 0; + sche_cpu_num = 1; + orch_thread_num = 1; + ready_queue_shards = RUNTIME_DEFAULT_READY_QUEUE_SHARDS; + pto2_task_window_size = 0; + pto2_heap_size = 0; + pto2_dep_pool_size = 0; + orch_to_sched = false; + + // Initialize tensor pairs + tensor_pair_count = 0; + + // Initialize device orchestration state + orch_built_on_host_ = true; + pto2_gm_sm_ptr_ = nullptr; + pto2_gm_heap_ptr_ = nullptr; + pto2_slot_states_ptr_ = nullptr; + orch_args_storage_.clear(); + + // Initialize device orchestration SO binary + device_orch_so_size_ = 0; + + // Initialize kernel binary tracking + registered_kernel_count_ = 0; + + // Initialize function address mapping + for (int i = 0; i < RUNTIME_MAX_FUNC_ID; i++) { + func_id_to_addr_[i] = 0; + } +} + +// ============================================================================= +// Tensor Pair Management +// ============================================================================= + +void Runtime::record_tensor_pair(void* host_ptr, void* dev_ptr, size_t size) { + if (tensor_pair_count >= RUNTIME_MAX_TENSOR_PAIRS) { + LOG_ERROR("[Runtime] Tensor pairs full (max=%d)", RUNTIME_MAX_TENSOR_PAIRS); + return; + } + tensor_pairs[tensor_pair_count].host_ptr = host_ptr; + tensor_pairs[tensor_pair_count].dev_ptr = dev_ptr; + tensor_pairs[tensor_pair_count].size = size; + tensor_pair_count++; + LOG_INFO("Recorded tensor pair: host=%p dev=%p size=%zu", host_ptr, dev_ptr, size); +} + +TensorPair* Runtime::get_tensor_pairs() { return tensor_pairs; } + +int Runtime::get_tensor_pair_count() const { return tensor_pair_count; } + +void Runtime::clear_tensor_pairs() { tensor_pair_count = 0; } + +// ============================================================================= +// Device orchestration +// ============================================================================= + +bool Runtime::get_orch_built_on_host() const { return orch_built_on_host_; 
} +void* Runtime::get_pto2_gm_sm_ptr() const { return pto2_gm_sm_ptr_; } +void* Runtime::get_pto2_gm_heap_ptr() const { return pto2_gm_heap_ptr_; } +const ChipStorageTaskArgs& Runtime::get_orch_args() const { return orch_args_storage_; } +void Runtime::set_orch_built_on_host(bool v) { orch_built_on_host_ = v; } +void Runtime::set_pto2_gm_sm_ptr(void* p) { pto2_gm_sm_ptr_ = p; } +void Runtime::set_pto2_gm_heap(void* p) { pto2_gm_heap_ptr_ = p; } +void Runtime::set_pto2_slot_states_ptr(void* p) { pto2_slot_states_ptr_ = p; } +void Runtime::set_orch_args(const ChipStorageTaskArgs& args) { orch_args_storage_ = args; } + +// Device orchestration SO binary (for dlopen on AICPU thread 3) +// Copies data to internal storage to avoid lifetime issues with Python ctypes arrays +void Runtime::set_device_orch_so(const void* data, size_t size) { + if (data == nullptr || size == 0) { + device_orch_so_size_ = 0; + return; + } + if (size > RUNTIME_MAX_ORCH_SO_SIZE) { + LOG_ERROR("[Runtime] Orchestration SO too large (%zu > %d)", size, RUNTIME_MAX_ORCH_SO_SIZE); + device_orch_so_size_ = 0; + return; + } + memcpy(device_orch_so_storage_, data, size); + device_orch_so_size_ = size; +} + +const void* Runtime::get_device_orch_so_data() const { + return device_orch_so_size_ > 0 ? 
device_orch_so_storage_ : nullptr; +} + +size_t Runtime::get_device_orch_so_size() const { return device_orch_so_size_; } + +uint64_t Runtime::get_function_bin_addr(int func_id) const { + if (func_id < 0 || func_id >= RUNTIME_MAX_FUNC_ID) return 0; + return func_id_to_addr_[func_id]; +} + +void Runtime::set_function_bin_addr(int func_id, uint64_t addr) { + if (func_id >= 0 && func_id < RUNTIME_MAX_FUNC_ID) { + func_id_to_addr_[func_id] = addr; + if (addr != 0 && registered_kernel_count_ < RUNTIME_MAX_FUNC_ID) { + registered_kernel_func_ids_[registered_kernel_count_++] = func_id; + } + } +} + +int Runtime::get_registered_kernel_count() const { return registered_kernel_count_; } + +int Runtime::get_registered_kernel_func_id(int index) const { + if (index < 0 || index >= registered_kernel_count_) return -1; + return registered_kernel_func_ids_[index]; +} + +void Runtime::clear_registered_kernels() { registered_kernel_count_ = 0; } diff --git a/src/a2a3/runtime/tensormap_and_ringbuffer_unmodified/runtime/runtime.h b/src/a2a3/runtime/tensormap_and_ringbuffer_unmodified/runtime/runtime.h new file mode 100644 index 000000000..208a6b13a --- /dev/null +++ b/src/a2a3/runtime/tensormap_and_ringbuffer_unmodified/runtime/runtime.h @@ -0,0 +1,290 @@ +/* + * Copyright (c) PyPTO Contributors. + * This program is free software, you can redistribute it and/or modify it under the terms and conditions of + * CANN Open Software License Agreement Version 2.0 (the "License"). + * Please refer to the License for details. You may not use this file except in compliance with the License. + * THIS SOFTWARE IS PROVIDED ON AN "AS IS" BASIS, WITHOUT WARRANTIES OF ANY KIND, EITHER EXPRESS OR IMPLIED, + * INCLUDING BUT NOT LIMITED TO NON-INFRINGEMENT, MERCHANTABILITY, OR FITNESS FOR A PARTICULAR PURPOSE. + * See LICENSE in the root of the software repository for the full text of the License. 
+ * -----------------------------------------------------------------------------------------------------------
+ */
+/**
+ * Runtime Class - Device Execution and Handshake Control
+ *
+ * This class manages device-side execution through AICPU-AICore handshake
+ * protocol. Task graph construction is handled by PTO2Runtime; this class
+ * only handles:
+ * - Handshake buffers for AICPU-AICore communication
+ * - Execution parameters (block_dim, sche_cpu_num)
+ * - Tensor pair management for host-device memory tracking
+ * - Device orchestration state (pto2_gm_sm_ptr_, orch_args_)
+ * - Function address mapping (func_id_to_addr_)
+ *
+ * Task dispatch uses a per-core PTO2DispatchPayload written by the scheduler.
+ * At dispatch time, build_payload() copies tensor pointers and scalars from
+ * the task payload into the per-core args[], populates SPMD context, then
+ * signals AICore via DATA_MAIN_BASE.
+ */
+
+#ifndef SRC_A2A3_RUNTIME_TENSORMAP_AND_RINGBUFFER_RUNTIME_RUNTIME_H_
+#define SRC_A2A3_RUNTIME_TENSORMAP_AND_RINGBUFFER_RUNTIME_RUNTIME_H_
+
+#include <cstddef>
+#include <cstdint>
+#include <cstdio> // for fprintf, printf
+#include <cstring> // for memset
+
+#include "common/core_type.h"
+#include "common/perf_profiling.h"
+#include "common/platform_config.h"
+#include "pto2_dispatch_payload.h"
+#include "task_args.h"
+
+// =============================================================================
+// Configuration Macros
+// =============================================================================
+
+#define RUNTIME_MAX_ARGS 128
+#define RUNTIME_MAX_WORKER 72 // 24 AIC + 48 AIV cores
+#define RUNTIME_MAX_TENSOR_PAIRS 64
+#define RUNTIME_MAX_FUNC_ID 32
+#define RUNTIME_MAX_ORCH_SO_SIZE (4 * 1024 * 1024) // 4MB max for orchestration SO
+
+// Default ready queue shards: one shard per worker thread (total minus orchestrator)
+constexpr int RUNTIME_DEFAULT_READY_QUEUE_SHARDS = PLATFORM_MAX_AICPU_THREADS - 1;
+
+// =============================================================================
+// 
Data Structures +// ============================================================================= + +/** + * Handshake Structure - Shared between Host, AICPU, and AICore + * + * This structure facilitates communication and synchronization between + * AICPU and AICore during task execution. + * + * Protocol State Machine: + * 1. Initialization: AICPU sets aicpu_ready=1 + * 2. Acknowledgment: AICore sets aicore_done=core_id+1 + * 3. Task Dispatch: AICPU writes DATA_MAIN_BASE after updating the per-core payload + * 4. Task Execution: AICore reads the cached PTO2DispatchPayload and executes + * 5. Task Completion: AICore writes FIN to COND; AICPU observes completion + * 6. Shutdown: AICPU sets control=1, AICore exits + * + * Each AICore instance has its own handshake buffer to enable concurrent + * task execution across multiple cores. + */ + +/** + * Handshake buffer for AICPU-AICore communication + * + * Each AICore has its own handshake buffer for synchronization with AICPU. + * The structure is cache-line aligned (64 bytes) to prevent false sharing + * between cores and optimize cache coherency operations. 
+ *
+ * Field Access Patterns:
+ * - aicpu_ready: Written by AICPU, read by AICore
+ * - aicore_done: Written by AICore, read by AICPU
+ * - task: Written by AICPU, read by AICore (0 = not ready, non-zero = PTO2DispatchPayload*)
+ * - task_status: Written by both (AICPU=1 on dispatch, AICore=0 on completion)
+ * - control: Written by AICPU, read by AICore (0 = continue, 1 = quit)
+ * - core_type: Written by AICPU, read by AICore (CoreType::AIC or CoreType::AIV)
+ */
+struct Handshake {
+ volatile uint32_t aicpu_ready; // AICPU ready signal: 0=not ready, 1=ready
+ volatile uint32_t aicore_done; // AICore ready signal: 0=not ready, core_id+1=ready
+ volatile uint64_t task; // Init: PTO2DispatchPayload* (set before aicpu_ready); runtime: unused
+ volatile int32_t task_status; // Task execution status: 0=idle, 1=busy
+ volatile int32_t control; // Control signal: 0=execute, 1=quit
+ volatile CoreType core_type; // Core type: CoreType::AIC or CoreType::AIV
+ volatile uint64_t perf_records_addr; // Performance records address
+ volatile uint32_t perf_buffer_status; // 0 = not full, 1 = full
+ volatile uint32_t physical_core_id; // Physical core ID
+ volatile uint32_t aicpu_regs_ready; // AICPU register init done: 0=pending, 1=done
+ volatile uint32_t aicore_regs_ready; // AICore ID reported: 0=pending, 1=done
+} __attribute__((aligned(64)));
+
+/**
+ * Tensor pair for tracking host-device memory mappings.
+ * Used for copy-back during finalize.
+ */
+struct TensorPair {
+ void* host_ptr;
+ void* dev_ptr;
+ size_t size;
+};
+
+/**
+ * Host API function pointers for device memory operations.
+ * Allows runtime to use pluggable device memory backends.
+ */ +struct HostApi { + void* (*device_malloc)(size_t size); + void (*device_free)(void* dev_ptr); + int (*copy_to_device)(void* dev_ptr, const void* host_ptr, size_t size); + int (*copy_from_device)(void* host_ptr, const void* dev_ptr, size_t size); + uint64_t (*upload_kernel_binary)(int func_id, const uint8_t* bin_data, size_t bin_size); + void (*remove_kernel_binary)(int func_id); +}; + +/** + * Task structure - Compatibility stub for platform layer + * + * RT2 uses PTO2DispatchPayload instead of Task for task dispatch. + * This stub exists only for API compatibility with device_runner.cpp. + * Since get_task_count() returns 0, this struct is never actually used. + */ +struct Task { + int func_id; + uint64_t function_bin_addr; +}; + +// ============================================================================= +// Runtime Class +// ============================================================================= + +/** + * Runtime class for device execution and handshake control + * + * This class manages AICPU-AICore communication through handshake buffers. + * Task graph construction is handled by PTO2Runtime; this class only handles + * execution control and device orchestration state. 
+ */ +class Runtime { + public: // NOLINT(whitespace/indent) + // Handshake buffers for AICPU-AICore communication + Handshake workers[RUNTIME_MAX_WORKER]; // Worker (AICore) handshake buffers + int worker_count; // Number of active workers + + // Execution parameters for AICPU scheduling + int sche_cpu_num; // Number of AICPU threads for scheduling + int orch_thread_num; // Number of orchestrator threads (default 1) + int ready_queue_shards; // Number of ready queue shards (1..MAX_AICPU_THREADS, default MAX-1) + + // Ring buffer size overrides (0 = use compile-time defaults) + uint64_t pto2_task_window_size; + uint64_t pto2_heap_size; + uint64_t pto2_dep_pool_size; + + // PTO2 integration: kernel_id -> GM function_bin_addr mapping + // NOTE: Made public for direct access from aicore code + uint64_t func_id_to_addr_[RUNTIME_MAX_FUNC_ID]; + + // Profiling support + bool enable_profiling; // Enable profiling flag + + // Orchestrator-to-scheduler transition control + // When true, orchestrator threads convert to scheduler threads after orchestration completes. + // When false (default), orchestrator threads exit after orchestration without dispatching tasks. + // Controlled via PTO2_ORCH_TO_SCHED environment variable. 
+ bool orch_to_sched; + uint64_t perf_data_base; // Performance data shared memory base address (device-side) + + private: // NOLINT(whitespace/indent) + // Tensor pairs for host-device memory tracking + TensorPair tensor_pairs[RUNTIME_MAX_TENSOR_PAIRS]; + int tensor_pair_count; + + // Kernel binary tracking for cleanup + int registered_kernel_func_ids_[RUNTIME_MAX_FUNC_ID]; + int registered_kernel_count_; + + // Device orchestration: when false, orchestration runs on device (thread 3) + bool orch_built_on_host_; + void* pto2_gm_sm_ptr_; // GM pointer to PTO2 shared memory (device) + void* pto2_gm_heap_ptr_; // GM heap for orchestrator output buffers (device) + void* pto2_slot_states_ptr_; // Pointer to PTO2TaskSlotState array (scheduler-private, for profiling) + ChipStorageTaskArgs orch_args_storage_; // Copy of args for device + + // Device orchestration SO binary (for dlopen on AICPU thread 3) + // Stored as a copy to avoid lifetime issues with Python ctypes arrays + uint8_t device_orch_so_storage_[RUNTIME_MAX_ORCH_SO_SIZE]; + size_t device_orch_so_size_; + + public: // NOLINT(whitespace/indent) + /** + * Constructor - zero-initialize all arrays + */ + Runtime(); + + // ========================================================================= + // Tensor Pair Management + // ========================================================================= + + /** + * Record a host-device tensor pair for copy-back during finalize. + */ + void record_tensor_pair(void* host_ptr, void* dev_ptr, size_t size); + + /** + * Get pointer to tensor pairs array. + */ + TensorPair* get_tensor_pairs(); + + /** + * Get number of recorded tensor pairs. + */ + int get_tensor_pair_count() const; + + /** + * Clear all recorded tensor pairs. 
+ */ + void clear_tensor_pairs(); + + // ========================================================================= + // Performance Profiling + // ========================================================================= + + // ========================================================================= + // Device orchestration (for AICPU thread 3) + // ========================================================================= + + bool get_orch_built_on_host() const; + void* get_pto2_gm_sm_ptr() const; + void* get_pto2_gm_heap_ptr() const; + const ChipStorageTaskArgs& get_orch_args() const; + void set_orch_built_on_host(bool v); + void set_pto2_gm_sm_ptr(void* p); + void set_pto2_gm_heap(void* p); + void set_pto2_slot_states_ptr(void* p); + void set_orch_args(const ChipStorageTaskArgs& args); + + // Device orchestration SO binary (for dlopen on AICPU thread 3) + void set_device_orch_so(const void* data, size_t size); + const void* get_device_orch_so_data() const; + size_t get_device_orch_so_size() const; + + uint64_t get_function_bin_addr(int func_id) const; + void set_function_bin_addr(int func_id, uint64_t addr); + + int get_registered_kernel_count() const; + int get_registered_kernel_func_id(int index) const; + void clear_registered_kernels(); + + // ========================================================================= + // Deprecated API (for platform compatibility, always returns 0/nullptr) + // Task graph is now managed by PTO2Runtime, not Runtime + // ========================================================================= + + /** @deprecated Task count is now in PTO2 shared memory */ + int get_task_count() const { return 0; } + + /** @deprecated RT2 uses PTO2DispatchPayload, not Task. Always returns nullptr. 
*/ + Task* get_task(int) { return nullptr; } + + /** @deprecated Use PTO2 dispatch mode */ + bool get_use_pto2_dispatch() const { return true; } + + /** @deprecated Use PTO2 dispatch mode */ + void set_use_pto2_dispatch(bool) {} + + // ========================================================================= + // Host API (host-only, not copied to device) + // ========================================================================= + + // Host API function pointers for device memory operations + // NOTE: Placed at end of class to avoid affecting device memory layout + HostApi host_api; +}; + +#endif // SRC_A2A3_RUNTIME_TENSORMAP_AND_RINGBUFFER_RUNTIME_RUNTIME_H_ diff --git a/src/a2a3/runtime/tensormap_and_ringbuffer_unmodified/runtime/tensor.h b/src/a2a3/runtime/tensormap_and_ringbuffer_unmodified/runtime/tensor.h new file mode 100644 index 000000000..ae836df47 --- /dev/null +++ b/src/a2a3/runtime/tensormap_and_ringbuffer_unmodified/runtime/tensor.h @@ -0,0 +1,493 @@ +/* + * Copyright (c) PyPTO Contributors. + * This program is free software, you can redistribute it and/or modify it under the terms and conditions of + * CANN Open Software License Agreement Version 2.0 (the "License"). + * Please refer to the License for details. You may not use this file except in compliance with the License. + * THIS SOFTWARE IS PROVIDED ON AN "AS IS" BASIS, WITHOUT WARRANTIES OF ANY KIND, EITHER EXPRESS OR IMPLIED, + * INCLUDING BUT NOT LIMITED TO NON-INFRINGEMENT, MERCHANTABILITY, OR FITNESS FOR A PARTICULAR PURPOSE. + * See LICENSE in the root of the software repository for the full text of the License. 
+ * -----------------------------------------------------------------------------------------------------------
+ */
+#pragma once
+
+#include <cstdint>
+#include <cstring>
+
+#include <algorithm>
+#include <sstream>
+#include <string>
+#include <utility>
+
+#include "common.h" // NOLINT(build/include_subdir)
+#include "data_type.h" // NOLINT(build/include_subdir)
+#include "pto_task_id.h" // NOLINT(build/include_subdir)
+
+constexpr int RUNTIME_MAX_TENSOR_DIMS = 5;
+
+/**
+ * Buffer Handle
+ *
+ * Represents a device memory buffer with address and total size in bytes.
+ * This is the underlying memory allocation that a Tensor describes access patterns for.
+ */
+struct PTOBufferHandle {
+ uint64_t addr; // Device memory address (bytes)
+ uint64_t size; // Total buffer size in bytes
+};
+
+enum class OverlapStatus {
+ NO_OVERLAP,
+ COVERED,
+ OTHER,
+};
+
+struct Segment {
+ uint64_t begin;
+ uint64_t end;
+
+ bool line_segment_intersection(const Segment& other) const { return end > other.begin && other.end > begin; }
+ bool contains(const Segment& other) const { return begin <= other.begin && other.end <= end; }
+};
+
+/**
+ * TensorCreateInfo — submit-time create-info for runtime-allocated outputs.
+ *
+ * Carries the metadata required to materialize a fresh contiguous output:
+ * dtype, ndims, raw_shapes (== shapes), manual_dep, and an optional
+ * initial value fill.
+ *
+ * Layout (64B) is aligned with Tensor cacheline 1 so that
+ * init_from_create_info() can copy the entire cacheline with a single memcpy,
+ * then overwrite buffer/owner metadata and refresh start_offset later.
+ *
+ * Arg::add_output() stores a pointer to this object, so the original
+ * must remain valid (not a temporary) until after the submit call.
+ */
+class alignas(64) TensorCreateInfo {
+ public: // NOLINT(whitespace/indent)
+ TensorCreateInfo(
+ const uint32_t shapes[], uint32_t ndims, DataType dtype = DataType::FLOAT32, bool manual_dep = false)
+ : initial_value(0),
+ has_initial_value(false),
+ version(0),
+ ndims(ndims),
+ dtype(dtype),
+ is_all_offset_zero(true),
+ is_raw_eq_shapes(true),
+ manual_dep(manual_dep) {
+ for (uint32_t i = 0; i < ndims; i++) {
+ raw_shapes[i] = shapes[i];
+ }
+ }
+
+ void copy(const TensorCreateInfo& other) { memcpy(this, &other, sizeof(other)); }
+
+ template <typename T>
+ void set_initial_value(T value) {
+ has_initial_value = true;
+ initial_value = to_u64(value);
+ }
+
+ uint64_t buffer_size_bytes() const {
+ uint64_t total = 1;
+ for (uint32_t i = 0; i < ndims; i++) {
+ total *= raw_shapes[i];
+ }
+ return total * get_element_size(dtype);
+ }
+
+ public: // NOLINT(whitespace/indent)
+ // --- Bytes [0, 32): TensorCreateInfo-only fields ---
+ // These occupy the same positions as Tensor::buffer, Tensor::owner_task_id,
+ // and Tensor::start_offset. The runtime overwrites owner metadata after the
+ // memcpy and refreshes start_offset during payload materialization.
+ uint64_t initial_value; + bool has_initial_value; + uint8_t __pad1__[7]; + uint64_t __pad2__; // → Tensor::owner_task_id + uint64_t __pad3__; // → Tensor::start_offset (zeroed) + + // --- Bytes [32, 64): Matches Tensor cacheline 1 layout --- + int32_t version; // Always 0 for create-info outputs + uint32_t ndims; + DataType dtype; + bool is_all_offset_zero; // Always true for create-info outputs + bool is_raw_eq_shapes; // Always true for create-info outputs + bool manual_dep; + uint32_t raw_shapes[RUNTIME_MAX_TENSOR_DIMS]; // → Tensor::shapes + + TensorCreateInfo() = default; + + friend struct Arg; +}; + +static_assert(sizeof(TensorCreateInfo) == 64); + +/** + * Tensor descriptor for Task input/output (128B = 2 cache lines) + * + * Describes a memory access pattern on Global Memory (GM) using + * raw_shapes (underlying buffer dimensions), shapes (current view dimensions), + * and offsets (multi-dimensional offset into the buffer). + * + * - `buffer` contains the underlying memory allocation (addr in bytes, size in bytes) + * - `raw_shapes[]`, `shapes[]`, `offsets[]` are in ELEMENTS + * - `dtype` specifies element type for interpreting buffer contents + * + * Fast-path flags (all on cache line 1): + * - is_all_offset_zero: when true, offsets[] are implicitly zero — skip offset read/write + * - is_raw_eq_shapes: when true, raw_shapes[] == shapes[] — skip raw_shapes read/write, + * use shapes[] wherever raw_shapes would be needed + * - manual_dep: when true, keep creator retention only and skip OverlapMap dependency tracking + * + * When BOTH flags are true, cache line 2 is never accessed. + * + * Layout: cache line 1 holds hot-path fields (buffer, owner_task_id, start_offset, + * version, ndims, dtype, flags, shapes); cache line 2 holds warm-path fields (raw_shapes, offsets). + * + * Construction: + * Users cannot default-construct or directly construct a Tensor. + * Valid Tensors are obtained only through controlled entry points: + * - make_tensor_external(...) 
+ * - from_tensor_arg(...) + * - TaskOutputTensors returned by submit(...) + * - Tensor::view() / reshape() / transpose() on an existing valid Tensor + */ +struct alignas(64) Tensor { + // === Cache line 1 (64B) — hot path === + PTOBufferHandle buffer; // Underlying memory buffer (addr in bytes, size in bytes) + PTO2TaskId owner_task_id; // Creator task; PTO2TaskId::invalid() for external tensors + uint64_t start_offset; // Cached 1D element offset (precomputed from raw_shapes + offsets) + int32_t version; // Tensor version for overlap detection + uint32_t ndims; // Number of dimensions used + DataType dtype; // Data type of tensor elements + bool is_all_offset_zero; // True when all offsets[] are zero (skip offset read/write) + bool is_raw_eq_shapes; // True when raw_shapes[] == shapes[] (skip raw_shapes read/write) + bool manual_dep; // True when dependency tracking is creator-only (skip OverlapMap lookup/insert) + uint32_t shapes[RUNTIME_MAX_TENSOR_DIMS]; // Current view shape per dimension + + // === Cache line 2 (64B) — warm path === + uint32_t raw_shapes[RUNTIME_MAX_TENSOR_DIMS]; // Underlying buffer shape per dimension + uint32_t offsets[RUNTIME_MAX_TENSOR_DIMS]; // Multi-dimensional offset per dimension + uint8_t _pad_cl2[24]; // Tail padding (bytes 104–127) + + // --- Copy / move / destroy are public (valid tensors can be freely copied) --- + Tensor(const Tensor&) = default; + Tensor& operator=(const Tensor&) = default; + Tensor(Tensor&&) = default; + Tensor& operator=(Tensor&&) = default; + ~Tensor() = default; + + /// Return the effective raw_shapes pointer (shapes[] when is_raw_eq_shapes). + /// Avoids cache line 2 access for the common case. + const uint32_t* get_raw_shapes() const { return is_raw_eq_shapes ? 
shapes : raw_shapes; }
+
+ // --- Initialization (operates on already-constructed Tensor) ---
+ void init(void* addr,
+ uint64_t buffer_size_bytes,
+ const uint32_t in_raw_shapes[],
+ const uint32_t in_shapes[],
+ const uint32_t in_offsets[],
+ uint32_t in_ndims,
+ DataType in_dtype,
+ int32_t in_version,
+ bool in_is_all_offset_zero = false,
+ bool in_is_raw_eq_shapes = false,
+ bool in_manual_dep = false) {
+ buffer = {reinterpret_cast<uint64_t>(addr), buffer_size_bytes};
+ ndims = in_ndims;
+ dtype = in_dtype;
+ version = in_version;
+ is_all_offset_zero = in_is_all_offset_zero;
+ is_raw_eq_shapes = in_is_raw_eq_shapes;
+ manual_dep = in_manual_dep;
+ for (uint32_t i = 0; i < in_ndims; i++) {
+ shapes[i] = in_shapes[i];
+ }
+ if (!in_is_raw_eq_shapes) {
+ for (uint32_t i = 0; i < in_ndims; i++) {
+ raw_shapes[i] = in_raw_shapes[i];
+ }
+ }
+ if (!in_is_all_offset_zero) {
+ for (uint32_t i = 0; i < in_ndims; i++) {
+ offsets[i] = in_offsets[i];
+ }
+ }
+ owner_task_id = PTO2TaskId::invalid();
+ }
+
+ void init(const Tensor& other) {
+ memcpy(this, &other, 64); // fast copy cache line 1
+ if (!other.is_raw_eq_shapes) {
+ for (uint32_t i = 0; i < other.ndims; i++) {
+ raw_shapes[i] = other.raw_shapes[i];
+ }
+ }
+ if (!other.is_all_offset_zero) {
+ for (uint32_t i = 0; i < other.ndims; i++) {
+ offsets[i] = other.offsets[i];
+ }
+ }
+ }
+
+ void init_with_view(
+ const Tensor& other, const uint32_t view_shapes[], const uint32_t view_offsets[], bool in_manual_dep = false) {
+ buffer = other.buffer;
+ ndims = other.ndims;
+ dtype = other.dtype;
+ version = other.version;
+ manual_dep = in_manual_dep;
+ // view always diverges shapes from raw_shapes, so is_raw_eq_shapes = false.
+ // Read parent's effective raw_shapes (avoids parent cache line 2 when parent is_raw_eq_shapes).
+ is_raw_eq_shapes = false; + const uint32_t* parent_raw = other.get_raw_shapes(); + for (uint32_t i = 0; i < ndims; i++) { + raw_shapes[i] = parent_raw[i]; + shapes[i] = view_shapes[i]; + } + // Compute offsets and zero-flag + bool all_zero = true; + if (other.is_all_offset_zero) { + for (uint32_t i = 0; i < ndims; i++) { + if (view_offsets[i] != 0) { + all_zero = false; + break; + } + } + if (!all_zero) { + for (uint32_t i = 0; i < ndims; i++) { + offsets[i] = view_offsets[i]; + } + } + } else { + all_zero = false; + for (uint32_t i = 0; i < ndims; i++) { + offsets[i] = other.offsets[i] + view_offsets[i]; + } + } + is_all_offset_zero = all_zero; + owner_task_id = other.owner_task_id; + } + + /// Compute 1D flat element offset from multi-dimensional indices. + /// Uses Horner's method (forward traversal, no stride variable). + uint64_t compute_flat_offset(const uint32_t indices[], uint32_t in_ndims) const { + if (in_ndims == 0) return 0; + const uint32_t* rs = get_raw_shapes(); + uint64_t offset = 0; + if (is_all_offset_zero) { + for (uint32_t d = 0; d < in_ndims; d++) offset = offset * rs[d] + indices[d]; + } else { + for (uint32_t d = 0; d < in_ndims; d++) offset = offset * rs[d] + indices[d] + offsets[d]; + } + return offset; + } + + /// Materialize a TensorCreateInfo into this Tensor (fresh contiguous output). + /// Single 64B memcpy covers the entire cacheline 1, then buffer is overwritten. 
+ void init_from_create_info(const TensorCreateInfo& ci, void* addr, uint64_t buffer_size) {
+ memcpy(this, &ci, 64);
+ buffer = {reinterpret_cast<uint64_t>(addr), buffer_size};
+ owner_task_id = PTO2TaskId::invalid(); // caller (orchestrator) overwrites with actual task_id
+ if (ci.has_initial_value) {
+ fill_initial_value(ci.initial_value);
+ }
+ }
+
+ void fill_initial_value(uint64_t initial_value) {
+ always_assert(reinterpret_cast<void*>(buffer.addr) != nullptr);
+ uint64_t elem_size = get_element_size(dtype);
+ char* dst = reinterpret_cast<char*>(buffer.addr);
+ constexpr uint64_t BLK = 64;
+ uint64_t blk = (buffer.size < BLK) ? buffer.size : BLK;
+ for (uint64_t b = 0; b < blk; b += elem_size) {
+ memcpy(dst + b, &initial_value, elem_size);
+ }
+ uint64_t filled = blk;
+ while (filled < buffer.size) {
+ uint64_t copy_size = ((buffer.size - filled) < filled) ? (buffer.size - filled) : filled;
+ memcpy(dst + filled, dst, copy_size);
+ filled += copy_size;
+ }
+ }
+
+ // --- Operations ---
+ void update_start_offset() {
+ if (is_all_offset_zero) {
+ start_offset = 0;
+ return;
+ }
+ const uint32_t* rs = get_raw_shapes();
+ uint64_t result = 0;
+ uint64_t stride = 1;
+ for (int i = static_cast<int>(ndims) - 1; i >= 0; i--) {
+ result += offsets[i] * stride;
+ stride *= rs[i];
+ }
+ start_offset = result;
+ }
+
+ void copy(const Tensor& other) { init(other); }
+
+ Tensor view(const uint32_t view_shapes[], const uint32_t view_offsets[], bool manual_dep = false) const {
+ Tensor result;
+ result.init_with_view(*this, view_shapes, view_offsets, manual_dep);
+ return result;
+ }
+
+ bool is_contiguous() const {
+ if (is_raw_eq_shapes || ndims == 0) {
+ return true;
+ }
+ for (uint32_t i = 1; i < ndims; i++) {
+ if (shapes[i] != raw_shapes[i]) {
+ return false;
+ }
+ }
+ return true;
+ }
+
+ bool valid_reshape(const uint32_t new_shapes[], uint32_t new_ndims) const {
+ uint64_t x = numel();
+ uint64_t y = 1;
+ for (uint32_t i = 0; i < new_ndims; i++) {
+ y *= new_shapes[i];
+ }
+ return x == y;
+ 
} + + Tensor reshape(const uint32_t new_shapes[], uint32_t new_ndims, bool manual_dep = false) const { + debug_assert(valid_reshape(new_shapes, new_ndims)); + always_assert(is_contiguous()); + Tensor result; + result.copy(*this); + result.ndims = new_ndims; + result.is_all_offset_zero = true; + result.is_raw_eq_shapes = true; + result.manual_dep = manual_dep; + for (uint32_t i = 0; i < new_ndims; i++) { + result.shapes[i] = new_shapes[i]; + } + return result; + } + + bool valid_transpose(uint32_t x, uint32_t y) const { return x < ndims && y < ndims; } + + Tensor transpose(uint32_t x, uint32_t y, bool manual_dep = false) const { + debug_assert(valid_transpose(x, y)); + Tensor result; + result.copy(*this); + result.manual_dep = manual_dep; + // transpose swaps the same dims in both arrays, so equality is preserved + std::swap(result.shapes[x], result.shapes[y]); + if (!result.is_raw_eq_shapes) { + std::swap(result.raw_shapes[x], result.raw_shapes[y]); + } + if (!result.is_all_offset_zero) { + std::swap(result.offsets[x], result.offsets[y]); + } + return result; + } + + uint64_t numel() const { + if (ndims == 0) { + return 0; + } + uint64_t total = 1; + for (uint32_t i = 0; i < ndims; i++) { + total *= shapes[i]; + } + return total; + } + + bool is_same_memref(const Tensor& other) const { return buffer.addr == other.buffer.addr; } + + std::string dump() const { + std::stringstream ss; + std::string indent = " "; + ss << "{" << std::endl; + ss << indent << "buffer.addr: " << buffer.addr << std::endl; + ss << indent << "buffer.size: " << buffer.size << " bytes" << std::endl; + ss << indent << "dtype: " << get_dtype_name(dtype) << std::endl; + ss << indent << "ndims: " << ndims << std::endl; + ss << indent << "version: " << version << std::endl; + + const uint32_t* rs = get_raw_shapes(); + ss << indent << "raw_shapes: ["; + for (uint32_t i = 0; i < ndims; i++) { + if (i > 0) { + ss << ", "; + } + ss << rs[i]; + } + ss << "]" << std::endl; + ss << indent << "shapes: ["; + 
for (uint32_t i = 0; i < ndims; i++) { + if (i > 0) { + ss << ", "; + } + ss << shapes[i]; + } + ss << "]" << std::endl; + ss << indent << "offsets: ["; + for (uint32_t i = 0; i < ndims; i++) { + if (i > 0) { + ss << ", "; + } + ss << (is_all_offset_zero ? 0u : offsets[i]); + } + ss << "]" << std::endl; + ss << "}" << std::endl; + return ss.str(); + } + + private: + // Default and parameterized constructors are private. + // Valid Tensors come only from controlled entry points. + Tensor() = default; + + Tensor(void* addr, + uint64_t buffer_size_bytes, + const uint32_t raw_shapes[], + const uint32_t shapes[], + const uint32_t offsets[], + uint32_t ndims, + DataType dtype, + int32_t version, + bool is_all_offset_zero = false, + bool is_raw_eq_shapes = false, + bool manual_dep = false) { + init(addr, + buffer_size_bytes, + raw_shapes, + shapes, + offsets, + ndims, + dtype, + version, + is_all_offset_zero, + is_raw_eq_shapes, + manual_dep); + } + + // Friends that need to construct Tensors + friend struct PTO2TaskPayload; + friend inline Tensor make_tensor_external( + void* addr, const uint32_t shapes[], uint32_t ndims, DataType dtype, bool manual_dep, int32_t version); +}; + +static_assert(sizeof(Tensor) == 128, "Tensor must be exactly 2 cache lines (128 bytes)"); +static_assert(offsetof(Tensor, raw_shapes) == 64); +static_assert(offsetof(Tensor, owner_task_id) == 16, "owner_task_id must be at bytes 16-23 (cacheline 1)"); +static_assert(offsetof(Tensor, start_offset) == 24, "start_offset must be at bytes 24-31 (cacheline 1)"); + +// TensorCreateInfo layout must match Tensor cacheline 1 for memcpy optimization +static_assert(sizeof(TensorCreateInfo) == 64, "TensorCreateInfo must match Tensor cacheline 1 size (64 bytes)"); +static_assert(offsetof(TensorCreateInfo, version) == offsetof(Tensor, version)); +static_assert(offsetof(TensorCreateInfo, ndims) == offsetof(Tensor, ndims)); +static_assert(offsetof(TensorCreateInfo, dtype) == offsetof(Tensor, dtype)); 
+static_assert(offsetof(TensorCreateInfo, is_all_offset_zero) == offsetof(Tensor, is_all_offset_zero));
+static_assert(offsetof(TensorCreateInfo, is_raw_eq_shapes) == offsetof(Tensor, is_raw_eq_shapes));
+static_assert(offsetof(TensorCreateInfo, manual_dep) == offsetof(Tensor, manual_dep));
+static_assert(offsetof(TensorCreateInfo, raw_shapes) == offsetof(Tensor, shapes));
diff --git a/tests/st/a2a3/tensormap_and_ringbuffer_unmodified/paged_attention/golden.py b/tests/st/a2a3/tensormap_and_ringbuffer_unmodified/paged_attention/golden.py
new file mode 100644
index 000000000..d97a6b9fe
--- /dev/null
+++ b/tests/st/a2a3/tensormap_and_ringbuffer_unmodified/paged_attention/golden.py
@@ -0,0 +1,19 @@
+from importlib.util import module_from_spec, spec_from_file_location
+from pathlib import Path
+
+
+_BASE_GOLDEN = (
+    Path(__file__).resolve().parents[2] / "tensormap_and_ringbuffer" / "paged_attention" / "golden.py"
+)
+_SPEC = spec_from_file_location("tmr_unmodified_paged_attention_golden", _BASE_GOLDEN)
+_MODULE = module_from_spec(_SPEC)
+assert _SPEC is not None and _SPEC.loader is not None
+_SPEC.loader.exec_module(_MODULE)
+
+ALL_CASES = _MODULE.ALL_CASES
+ATOL = _MODULE.ATOL
+DEFAULT_CASE = _MODULE.DEFAULT_CASE
+RTOL = _MODULE.RTOL
+__outputs__ = _MODULE.__outputs__
+compute_golden = _MODULE.compute_golden
+generate_inputs = _MODULE.generate_inputs
diff --git a/tests/st/a2a3/tensormap_and_ringbuffer_unmodified/paged_attention/kernels/kernel_config.py b/tests/st/a2a3/tensormap_and_ringbuffer_unmodified/paged_attention/kernels/kernel_config.py
new file mode 100644
index 000000000..91c09945c
--- /dev/null
+++ b/tests/st/a2a3/tensormap_and_ringbuffer_unmodified/paged_attention/kernels/kernel_config.py
@@ -0,0 +1,20 @@
+from copy import deepcopy
+from importlib.util import module_from_spec, spec_from_file_location
+from pathlib import Path
+
+
+_BASE_KERNEL_CONFIG = (
+    Path(__file__).resolve().parents[3]
+    / "tensormap_and_ringbuffer"
+    / "paged_attention"
+    / "kernels"
+    / "kernel_config.py"
+)
+_SPEC = spec_from_file_location("tmr_unmodified_paged_attention_kernel_config", _BASE_KERNEL_CONFIG)
+_MODULE = module_from_spec(_SPEC)
+assert _SPEC is not None and _SPEC.loader is not None
+_SPEC.loader.exec_module(_MODULE)
+
+ORCHESTRATION = deepcopy(_MODULE.ORCHESTRATION)
+KERNELS = deepcopy(_MODULE.KERNELS)
+RUNTIME_CONFIG = {**_MODULE.RUNTIME_CONFIG, "runtime": "tensormap_and_ringbuffer_unmodified"}
diff --git a/tests/st/a2a3/tensormap_and_ringbuffer_unmodified/paged_attention_unroll/golden.py b/tests/st/a2a3/tensormap_and_ringbuffer_unmodified/paged_attention_unroll/golden.py
new file mode 100644
index 000000000..74aa2506c
--- /dev/null
+++ b/tests/st/a2a3/tensormap_and_ringbuffer_unmodified/paged_attention_unroll/golden.py
@@ -0,0 +1,19 @@
+from importlib.util import module_from_spec, spec_from_file_location
+from pathlib import Path
+
+
+_BASE_GOLDEN = (
+    Path(__file__).resolve().parents[2] / "tensormap_and_ringbuffer" / "paged_attention_unroll" / "golden.py"
+)
+_SPEC = spec_from_file_location("tmr_unmodified_paged_attention_unroll_golden", _BASE_GOLDEN)
+_MODULE = module_from_spec(_SPEC)
+assert _SPEC is not None and _SPEC.loader is not None
+_SPEC.loader.exec_module(_MODULE)
+
+ALL_CASES = _MODULE.ALL_CASES
+ATOL = _MODULE.ATOL
+DEFAULT_CASE = _MODULE.DEFAULT_CASE
+RTOL = _MODULE.RTOL
+__outputs__ = _MODULE.__outputs__
+compute_golden = _MODULE.compute_golden
+generate_inputs = _MODULE.generate_inputs
diff --git a/tests/st/a2a3/tensormap_and_ringbuffer_unmodified/paged_attention_unroll/kernels/kernel_config.py b/tests/st/a2a3/tensormap_and_ringbuffer_unmodified/paged_attention_unroll/kernels/kernel_config.py
new file mode 100644
index 000000000..bdf8846cd
--- /dev/null
+++ b/tests/st/a2a3/tensormap_and_ringbuffer_unmodified/paged_attention_unroll/kernels/kernel_config.py
@@ -0,0 +1,20 @@
+from copy import deepcopy
+from importlib.util import module_from_spec, spec_from_file_location
+from pathlib import Path
+
+
+_BASE_KERNEL_CONFIG = (
+    Path(__file__).resolve().parents[3]
+    / "tensormap_and_ringbuffer"
+    / "paged_attention_unroll"
+    / "kernels"
+    / "kernel_config.py"
+)
+_SPEC = spec_from_file_location("tmr_unmodified_paged_attention_unroll_kernel_config", _BASE_KERNEL_CONFIG)
+_MODULE = module_from_spec(_SPEC)
+assert _SPEC is not None and _SPEC.loader is not None
+_SPEC.loader.exec_module(_MODULE)
+
+ORCHESTRATION = deepcopy(_MODULE.ORCHESTRATION)
+KERNELS = deepcopy(_MODULE.KERNELS)
+RUNTIME_CONFIG = {**_MODULE.RUNTIME_CONFIG, "runtime": "tensormap_and_ringbuffer_unmodified"}
diff --git a/tests/ut/py/test_runtime_builder.py b/tests/ut/py/test_runtime_builder.py
index 648f48c05..1852accb5 100644
--- a/tests/ut/py/test_runtime_builder.py
+++ b/tests/ut/py/test_runtime_builder.py
@@ -43,6 +43,14 @@ def test_discovers_aicpu_build_graph(self, default_test_platform):
         runtimes = builder.list_runtimes()
         assert "aicpu_build_graph" in runtimes
 
+    def test_discovers_unmodified_tensormap_runtime(self, default_test_platform):
+        """RuntimeBuilder discovers the unmodified tensormap runtime clone."""
+        from runtime_builder import RuntimeBuilder  # noqa: PLC0415
+
+        builder = RuntimeBuilder(platform=default_test_platform)
+        runtimes = builder.list_runtimes()
+        assert "tensormap_and_ringbuffer_unmodified" in runtimes
+
     def test_runtime_dir_resolves_to_project_root(self, default_test_platform, test_arch):
         """runtime_dir resolves to src/{arch}/runtime/ under the project root."""
         from runtime_builder import RuntimeBuilder  # noqa: PLC0415
diff --git a/tools/benchmark_rounds.sh b/tools/benchmark_rounds.sh
index 64b283e81..6f07d6dd4 100755
--- a/tools/benchmark_rounds.sh
+++ b/tools/benchmark_rounds.sh
@@ -23,21 +23,35 @@ RUN_EXAMPLE="$PROJECT_ROOT/examples/scripts/run_example.py"
 declare -A TMR_EXAMPLE_CASES=(
     [alternating_matmul_add]=""
     [benchmark_bgemm]=""
+    [paged_attention]="Case1,Case2"
     [paged_attention_unroll]="Case1,Case2"
     [batch_paged_attention]=""
 )
 TMR_EXAMPLE_ORDER=(
     alternating_matmul_add
     benchmark_bgemm
+    paged_attention
     paged_attention_unroll
     batch_paged_attention
 )
 
 # --- aicpu_build_graph ---
 declare -A ABG_EXAMPLE_CASES=(
+    [paged_attention]="Case1,Case2"
     [paged_attention_unroll]="Case1,Case2"
 )
 ABG_EXAMPLE_ORDER=(
+    paged_attention
+    paged_attention_unroll
+)
+
+# --- tensormap_and_ringbuffer_unmodified ---
+declare -A TMR_UNMODIFIED_EXAMPLE_CASES=(
+    [paged_attention]="Case1,Case2"
+    [paged_attention_unroll]="Case1,Case2"
+)
+TMR_UNMODIFIED_EXAMPLE_ORDER=(
+    paged_attention
     paged_attention_unroll
 )
 
@@ -84,7 +98,7 @@ Options:
   -p, --platform   Platform to run on (default: a2a3)
   -d, --device     Device ID (default: 0)
   -n, --rounds     Override number of rounds for each example (default: 100)
-  -r, --runtime    Runtime to benchmark: tensormap_and_ringbuffer (default), aicpu_build_graph
+  -r, --runtime    Runtime to benchmark: tensormap_and_ringbuffer (default), tensormap_and_ringbuffer_unmodified, aicpu_build_graph
   -v, --verbose    Save detailed run_example.py output to a timestamped log file
   -h, --help       Show this help
 
@@ -139,12 +153,16 @@ case "$RUNTIME" in
        declare -n EXAMPLE_CASES=TMR_EXAMPLE_CASES
        EXAMPLE_ORDER=("${TMR_EXAMPLE_ORDER[@]}")
        ;;
+    tensormap_and_ringbuffer_unmodified)
+        declare -n EXAMPLE_CASES=TMR_UNMODIFIED_EXAMPLE_CASES
+        EXAMPLE_ORDER=("${TMR_UNMODIFIED_EXAMPLE_ORDER[@]}")
+        ;;
    aicpu_build_graph)
        declare -n EXAMPLE_CASES=ABG_EXAMPLE_CASES
        EXAMPLE_ORDER=("${ABG_EXAMPLE_ORDER[@]}")
        ;;
    *)
-        echo "ERROR: unknown runtime '$RUNTIME'. Use tensormap_and_ringbuffer or aicpu_build_graph."
+        echo "ERROR: unknown runtime '$RUNTIME'. Use tensormap_and_ringbuffer, tensormap_and_ringbuffer_unmodified, or aicpu_build_graph."
        exit 1
        ;;
esac

From bf1fc845e0c2b5445da0748a9feeb371730d2ebd Mon Sep 17 00:00:00 2001
From: Youwei Xiao
Date: Sun, 5 Apr 2026 21:52:22 +0800
Subject: [PATCH 06/35] Restore zero-overhead auto path

---
 src/a2a3/docs/runtimes.md        |  34 +++--
 .../runtime/pto_orchestrator.cpp | 122 ++++++++++++++----
 .../runtime/pto_orchestrator.h   |   3 +-
 .../runtime/pto_runtime2.cpp     |  10 ++
 4 files changed, 129 insertions(+), 40 deletions(-)

diff --git a/src/a2a3/docs/runtimes.md b/src/a2a3/docs/runtimes.md
index cb8fcbb1c..72248e273 100644
--- a/src/a2a3/docs/runtimes.md
+++ b/src/a2a3/docs/runtimes.md
@@ -1,20 +1,20 @@
 # Runtime Variants (a2a3)
 
-Three runtime implementations live under `src/a2a3/runtime/`, each providing a different graph-building strategy. The `RUNTIME_CONFIG.runtime` field in `kernel_config.py` selects which runtime to use.
+Four runtime implementations live under `src/a2a3/runtime/`, each providing a different graph-building strategy. The `RUNTIME_CONFIG.runtime` field in `kernel_config.py` selects which runtime to use.
 
 ## Comparison
 
-| Feature | host_build_graph | aicpu_build_graph | tensormap_and_ringbuffer |
-| ------- | ---------------- | ----------------- | ------------------------ |
-| Graph built on | Host CPU | AICPU (device) | AICPU (device) |
-| Task storage | Fixed `Task[]` array | Fixed `Task[]` array | Ring buffer (`PTO2TaskDescriptor[]`) |
-| Dependencies | Explicit edges | Explicit edges | Auto-derived via TensorMap |
-| Memory management | Host-side | Host + device malloc | Ring buffer heap (GM) |
-| Concurrent build+schedule | No | Optional (`build_mode=1`) | Yes (always) |
-| Profiling support | Basic | Basic | Multi-level hierarchy |
-| Batch/streaming | No | No | Yes (flow control, back-pressure) |
-| Thread model | N scheduler threads | 1 builder + N schedulers | 1 orchestrator + 3 schedulers |
-| Use case | Development, debugging | Reduced host-device transfer | Production workloads |
+| Feature | host_build_graph | aicpu_build_graph | tensormap_and_ringbuffer_unmodified | tensormap_and_ringbuffer |
+| ------- | ---------------- | ----------------- | ----------------------------------- | ------------------------ |
+| Graph built on | Host CPU | AICPU (device) | AICPU (device) | AICPU (device) |
+| Task storage | Fixed `Task[]` array | Fixed `Task[]` array | Ring buffer (`PTO2TaskDescriptor[]`) | Ring buffer (`PTO2TaskDescriptor[]`) |
+| Dependencies | Explicit edges | Explicit edges | Auto-derived via TensorMap | Auto-derived via TensorMap, plus optional manual dependencies |
+| Memory management | Host-side | Host + device malloc | Ring buffer heap (GM) | Ring buffer heap (GM) |
+| Concurrent build+schedule | No | Optional (`build_mode=1`) | Yes (always) | Yes (always) |
+| Profiling support | Basic | Basic | Multi-level hierarchy | Multi-level hierarchy |
+| Batch/streaming | No | No | Yes (flow control, back-pressure) | Yes (flow control, back-pressure) |
+| Thread model | N scheduler threads | 1 builder + N schedulers | 1 orchestrator + 3 schedulers | 1 orchestrator + 3 schedulers |
+| Use case | Development, debugging | Reduced host-device transfer | Baseline PTO2 comparison | Production PTO2 with manual-scope extensions |
 
 ## host_build_graph
 
@@ -47,6 +47,16 @@ The primary production runtime. Uses ring buffers for task slots and output memo
 - Multi-ring: HeapRing, TaskRing, and DepPool split into 4 independent instances for nested scope isolation
 - Supports streaming, flow control, large batch sizes, and multi-level profiling
 
+## tensormap_and_ringbuffer_unmodified
+
+An unmodified clone of the baseline PTO2 runtime, kept side-by-side for apples-to-apples comparison against the extended `tensormap_and_ringbuffer` runtime.
+
+- Same TensorMap and ring-buffer architecture as the original PTO2 implementation
+- No manual-scope dependency extensions
+- Intended for benchmarking and regression isolation, not new feature development
+
+See [tensormap_and_ringbuffer_unmodified/docs/](../runtime/tensormap_and_ringbuffer_unmodified/docs/) for the baseline runtime logic and profiling notes.
+
 See [tensormap_and_ringbuffer/docs/](../runtime/tensormap_and_ringbuffer/docs/):
 
 - [RUNTIME_LOGIC.md](../runtime/tensormap_and_ringbuffer/docs/RUNTIME_LOGIC.md) — Full system design
diff --git a/src/a2a3/runtime/tensormap_and_ringbuffer/runtime/pto_orchestrator.cpp b/src/a2a3/runtime/tensormap_and_ringbuffer/runtime/pto_orchestrator.cpp
index 1edd9d047..871701c3d 100644
--- a/src/a2a3/runtime/tensormap_and_ringbuffer/runtime/pto_orchestrator.cpp
+++ b/src/a2a3/runtime/tensormap_and_ringbuffer/runtime/pto_orchestrator.cpp
@@ -485,7 +485,7 @@ static void manual_task_meta_push(PTO2OrchestratorState *orch, const PTO2ManualT
     orch->manual_task_meta[orch->manual_task_meta_size++] = meta;
 }
 
-static void manual_edge_push(PTO2OrchestratorState *orch, const PTO2ManualEdge &edge) {
+static int32_t manual_edge_push(PTO2OrchestratorState *orch, const PTO2ManualEdge &edge) {
     if (orch->manual_edges_size >= orch->manual_edges_capacity) {
         int32_t new_cap = orch->manual_edges_capacity * 2;
         PTO2ManualEdge *new_buf =
@@ -494,7 +494,9 @@ static void manual_edge_push(PTO2OrchestratorState *orch, const PTO2ManualEdge &
         orch->manual_edges = new_buf;
         orch->manual_edges_capacity = new_cap;
     }
+    int32_t edge_idx = orch->manual_edges_size;
     orch->manual_edges[orch->manual_edges_size++] = edge;
+    return edge_idx;
 }
 
 static bool in_manual_scope(const PTO2OrchestratorState *orch) {
@@ -511,6 +513,28 @@ static int32_t find_current_manual_scope_task_index(const PTO2OrchestratorState
     }
 
     int32_t begin = current_manual_scope_begin(orch);
+    int32_t count = orch->scope_tasks_size - begin;
+    if (count <= 0) {
+        return -1;
+    }
+
+    PTO2TaskSlotState *first_slot_state = orch->scope_tasks[begin];
+    if (first_slot_state != nullptr) {
+        PTO2TaskId first_task_id = first_slot_state->task->task_id;
+        if (first_task_id.ring() == task_id.ring()) {
+            uint32_t window_size = orch->rings[first_task_id.ring()].task_allocator.window_size();
+            uint32_t first_local = first_task_id.local();
+            uint32_t task_local = task_id.local();
+            uint32_t delta = task_local >= first_local ? task_local - first_local : task_local + window_size - first_local;
+            if (delta < static_cast<uint32_t>(count)) {
+                PTO2TaskSlotState *candidate = orch->scope_tasks[begin + static_cast<int32_t>(delta)];
+                if (candidate != nullptr && candidate->task->task_id == task_id) {
+                    return static_cast<int32_t>(delta);
+                }
+            }
+        }
+    }
+
     for (int32_t i = begin; i < orch->scope_tasks_size; i++) {
         PTO2TaskSlotState *slot_state = orch->scope_tasks[i];
         if (slot_state != nullptr && slot_state->task->task_id == task_id) {
@@ -531,7 +555,9 @@ void pto2_scope_begin(PTO2OrchestratorState *orch, PTO2ScopeMode mode) {
     assert(orch->scope_stack_top < static_cast<int32_t>(orch->scope_stack_capacity - 1) && "Scope stack overflow");
 
     if (in_manual_scope(orch)) {
-        LOG_ERROR("nested scope inside PTO2_SCOPE(PTO2ScopeMode::MANUAL) is not supported in v1");
+        LOG_ERROR(
+            "nested PTO2_SCOPE(PTO2ScopeMode::MANUAL) is not supported in v1; manual scope inside manual scope is not supported"
+        );
         orch->sm_handle->header->orch_error_code.store(PTO2_ERROR_INVALID_ARGS, std::memory_order_release);
         orch->fatal = true;
         return;
@@ -541,8 +567,10 @@ void pto2_scope_begin(PTO2OrchestratorState *orch, PTO2ScopeMode mode) {
     orch->scope_begins[orch->scope_stack_top] = orch->scope_tasks_size;
     orch->scope_modes[orch->scope_stack_top] = mode;
     orch->manual_scope_active = (mode == PTO2ScopeMode::MANUAL);
-    orch->manual_task_meta_begins[orch->scope_stack_top] = orch->manual_task_meta_size;
-    orch->manual_edge_begins[orch->scope_stack_top] = orch->manual_edges_size;
+    if (mode == PTO2ScopeMode::MANUAL) {
+        orch->manual_task_meta_begins[orch->scope_stack_top] = orch->manual_task_meta_size;
+        orch->manual_edge_begins[orch->scope_stack_top] = orch->manual_edges_size;
+    }
 }
 
 void pto2_scope_end(PTO2OrchestratorState *orch) {
@@ -559,10 +587,27 @@ void pto2_scope_end(PTO2OrchestratorState *orch) {
     int32_t begin = orch->scope_begins[top];
     int32_t count = orch->scope_tasks_size - begin;
     PTO2ScopeMode mode = orch->scope_modes[top];
+
+    if (mode != PTO2ScopeMode::MANUAL) {
+        orch->scope_stack_top--;
+
+        if (orch->scheduler && count > 0) {
+            orch->scheduler->on_scope_end(&orch->scope_tasks[begin], count);
+        }
+
+        orch->scope_tasks_size = begin;
+
+#if PTO2_ORCH_PROFILING
+        uint64_t _se1 = get_sys_cnt_aicpu();
+        g_orch_scope_end_cycle += (_se1 - _se0);
+#endif
+        return;
+    }
+
     int32_t manual_meta_begin = orch->manual_task_meta_begins[top];
     int32_t manual_edge_begin = orch->manual_edge_begins[top];
 
-    if (mode == PTO2ScopeMode::MANUAL && orch->scheduler && count > 0) {
+    if (orch->scheduler && count > 0) {
         int32_t manual_task_count = orch->manual_task_meta_size - manual_meta_begin;
         if (manual_task_count != count) {
             LOG_ERROR("manual scope requires pto2_rt_submit_*_manual for every submitted task");
@@ -596,12 +641,9 @@ void pto2_scope_end(PTO2OrchestratorState *orch) {
             PTO2FaninBuilder fanin_builder;
             fanin_builder.spill_pool = &orch->rings[ring_id].fanin_pool;
 
-            for (int32_t edge_idx = manual_edge_begin; edge_idx < orch->manual_edges_size; edge_idx++) {
+            for (int32_t edge_idx = meta.incoming_edge_head; edge_idx >= 0;
+                 edge_idx = orch->manual_edges[edge_idx].next_consumer_edge) {
                 const PTO2ManualEdge &edge = orch->manual_edges[edge_idx];
-                if (edge.consumer_idx != task_idx) {
-                    continue;
-                }
-
                 PTO2TaskSlotState *prod_state = orch->scope_tasks[begin + edge.producer_idx];
                 if (!pto2_append_fanin_or_fail(
                         orch, task_id, edge.consumer_idx, TensorArgType::INPUT, prod_state, &fanin_builder,
@@ -739,8 +781,10 @@ void pto2_scope_end(PTO2OrchestratorState *orch) {
 // =============================================================================
 // Task Submission
 // =============================================================================
+template <bool kManualSubmit>
 static TaskOutputTensors pto2_submit_mixed_task_impl(
-    PTO2OrchestratorState *orch, const MixedKernels &mixed_kernels, const Arg &args, bool manual_submit
+    PTO2OrchestratorState *orch, const MixedKernels &mixed_kernels, const Arg &args,
+    PTO2TaskId *submitted_task_id = nullptr
 ) {
     CYCLE_COUNT_START();
 
@@ -773,11 +817,13 @@ static TaskOutputTensors pto2_submit_mixed_task_impl(
     // Submission without an open scope is illegal.
     always_assert(orch->scope_stack_top >= 0 && "Cannot submit task outside a scope");
 
-    if (!manual_submit && in_manual_scope(orch)) {
-        LOG_ERROR("PTO2_SCOPE(PTO2ScopeMode::MANUAL) requires pto2_rt_submit_*_manual task APIs");
-        orch->sm_handle->header->orch_error_code.store(PTO2_ERROR_INVALID_ARGS, std::memory_order_release);
-        orch->fatal = true;
-        return result;
+    if constexpr (!kManualSubmit) {
+        if (in_manual_scope(orch)) {
+            LOG_ERROR("PTO2_SCOPE(PTO2ScopeMode::MANUAL) requires pto2_rt_submit_*_manual task APIs");
+            orch->sm_handle->header->orch_error_code.store(PTO2_ERROR_INVALID_ARGS, std::memory_order_release);
+            orch->fatal = true;
+            return result;
+        }
     }
 
     // Encode require_sync_start into active_mask bit 3 (only meaningful for tasks with block_num > 1)
@@ -809,7 +855,6 @@ static TaskOutputTensors pto2_submit_mixed_task_impl(
     PTO2FaninBuilder fanin_builder;
     fanin_builder.spill_pool = &orch->rings[ring_id].fanin_pool;
 
-    bool defer_publish = manual_submit;
 
     CYCLE_COUNT_LAP_RECORD(g_orch_alloc_cycle, AicpuPhaseId::ORCH_ALLOC, task_id.raw);
 
@@ -833,7 +878,7 @@ static TaskOutputTensors pto2_submit_mixed_task_impl(
     CYCLE_COUNT_LAP_RECORD(g_orch_sync_cycle, AicpuPhaseId::ORCH_SYNC, task_id.raw);
 
     // === STEP 3: Lookup inputs + materialize runtime-created outputs ===
-    if (!defer_publish) {
+    if constexpr (!kManualSubmit) {
         for (int i = 0; i < args.tensor_count(); i++) {
             TensorArgType ptype = args.tag(i);
             if (ptype == TensorArgType::OUTPUT) {
@@ -888,11 +933,11 @@ static TaskOutputTensors pto2_submit_mixed_task_impl(
     CYCLE_COUNT_LAP_RECORD(g_orch_lookup_cycle, AicpuPhaseId::ORCH_LOOKUP, task_id.raw);
 
     // === STEP 4: Register outputs/inouts in TensorMap (must be separate from lookup) ===
-    {
+    if constexpr (!kManualSubmit) {
         for (int i = 0; i < args.tensor_count(); i++) {
             TensorArgType ptype = args.tag(i);
             if (ptype == TensorArgType::INOUT || ptype == TensorArgType::OUTPUT_EXISTING) {
-                if (!defer_publish && !args.tensor(i).ptr->manual_dep) {
+                if (!args.tensor(i).ptr->manual_dep) {
                     orch->tensor_map.insert(*args.tensor(i).ptr, task_id);
                 }
             }
@@ -953,7 +998,7 @@ static TaskOutputTensors pto2_submit_mixed_task_impl(
     cur_slot_state.block_num = block_num;
     cur_slot_state.next_block_idx = 0;
 
-    if (defer_publish) {
+    if constexpr (kManualSubmit) {
         cur_slot_state.fanin_count = 1;
         payload->fanin_actual_count = 0;
         payload->fanin_spill_start = 0;
@@ -1024,7 +1069,6 @@ static TaskOutputTensors pto2_submit_mixed_task_impl(
 #endif
     g_orch_submit_idx++;
 #endif
-    orch->last_submitted_task_id = task_id;
 
     return result;
 }
@@ -1115,7 +1159,13 @@ TaskOutputTensors pto2_alloc_tensors(PTO2OrchestratorState *orch, const Arg &arg
 
 TaskOutputTensors
 pto2_submit_mixed_task(PTO2OrchestratorState *orch, const MixedKernels &mixed_kernels, const Arg &args) {
-    return pto2_submit_mixed_task_impl(orch, mixed_kernels, args, false);
+    if (in_manual_scope(orch)) {
+        LOG_ERROR("PTO2_SCOPE(PTO2ScopeMode::MANUAL) requires pto2_rt_submit_*_manual task APIs");
+        orch->sm_handle->header->orch_error_code.store(PTO2_ERROR_INVALID_ARGS, std::memory_order_release);
+        orch->fatal = true;
+        return {};
+    }
+    return pto2_submit_mixed_task_impl<false>(orch, mixed_kernels, args);
 }
 
 PTO2ManualSubmitResult
@@ -1127,16 +1177,18 @@ pto2_submit_mixed_task_manual(PTO2OrchestratorState *orch, const MixedKernels &m
         orch->fatal = true;
         return result;
     }
-    TaskOutputTensors outputs = pto2_submit_mixed_task_impl(orch, mixed_kernels, args, true);
-    if (orch->fatal || !orch->last_submitted_task_id.is_valid()) {
+    PTO2TaskId task_id = PTO2TaskId::invalid();
+    TaskOutputTensors outputs = pto2_submit_mixed_task_impl<true>(orch, mixed_kernels, args, &task_id);
+    if (orch->fatal || !task_id.is_valid()) {
         return result;
     }
-    result.task_id = orch->last_submitted_task_id;
+    result.task_id = task_id;
     result.outputs = outputs;
 
     PTO2ManualTaskMeta meta{};
     meta.slot_state = orch->scope_tasks[orch->scope_tasks_size - 1];
     meta.scope_task_index = orch->scope_tasks_size - 1 - current_manual_scope_begin(orch);
+    meta.incoming_edge_head = -1;
     meta.tensor_count = static_cast<uint8_t>(args.tensor_count());
     for (int32_t i = 0; i < args.tensor_count(); i++) {
         meta.tags[i] = static_cast<uint8_t>(args.tag(i));
@@ -1156,6 +1208,12 @@ void pto2_add_dependency(PTO2OrchestratorState *orch, PTO2TaskId producer_id, PT
         orch->fatal = true;
         return;
     }
+    if (producer_id == consumer_id) {
+        LOG_ERROR("add_dependency does not allow self-dependency");
+        orch->sm_handle->header->orch_error_code.store(PTO2_ERROR_INVALID_ARGS, std::memory_order_release);
+        orch->fatal = true;
+        return;
+    }
     int32_t producer_idx = find_current_manual_scope_task_index(orch, producer_id);
     int32_t consumer_idx = find_current_manual_scope_task_index(orch, consumer_id);
     if (producer_idx < 0 || consumer_idx < 0) {
@@ -1165,7 +1223,17 @@ void pto2_add_dependency(PTO2OrchestratorState *orch, PTO2TaskId producer_id, PT
         return;
     }
 
-    manual_edge_push(orch, PTO2ManualEdge{.producer_idx = producer_idx, .consumer_idx = consumer_idx});
+    int32_t meta_begin = orch->manual_task_meta_begins[orch->scope_stack_top];
+    PTO2ManualTaskMeta &consumer_meta = orch->manual_task_meta[meta_begin + consumer_idx];
+    int32_t edge_idx = manual_edge_push(
+        orch,
+        PTO2ManualEdge{
+            .producer_idx = producer_idx,
+            .consumer_idx = consumer_idx,
+            .next_consumer_edge = consumer_meta.incoming_edge_head,
+        }
+    );
+    consumer_meta.incoming_edge_head = edge_idx;
 }
 
 // =============================================================================
diff --git a/src/a2a3/runtime/tensormap_and_ringbuffer/runtime/pto_orchestrator.h b/src/a2a3/runtime/tensormap_and_ringbuffer/runtime/pto_orchestrator.h
index 5ce1d14a2..e78b8028a 100644
--- a/src/a2a3/runtime/tensormap_and_ringbuffer/runtime/pto_orchestrator.h
+++ b/src/a2a3/runtime/tensormap_and_ringbuffer/runtime/pto_orchestrator.h
@@ -43,6 +43,7 @@ struct PTO2ManualTaskMeta {
     PTO2TaskSlotState *slot_state;
     int32_t scope_task_index;
+    int32_t incoming_edge_head;
     uint8_t tensor_count;
     uint8_t tags[MAX_TENSOR_ARGS];
 };
@@ -50,6 +51,7 @@ struct PTO2ManualTaskMeta {
 struct PTO2ManualEdge {
     int32_t producer_idx;
     int32_t consumer_idx;
+    int32_t next_consumer_edge;
 };
 
 /**
@@ -87,7 +89,6 @@ struct PTO2OrchestratorState {
     PTO2ManualEdge *manual_edges;
     int32_t manual_edges_size;
     int32_t manual_edges_capacity;
-    PTO2TaskId last_submitted_task_id{PTO2TaskId::invalid()};
 
     // === SCHEDULER REFERENCE ===
     // Note: In simulated mode, orchestrator and scheduler share address space
diff --git a/src/a2a3/runtime/tensormap_and_ringbuffer/runtime/pto_runtime2.cpp b/src/a2a3/runtime/tensormap_and_ringbuffer/runtime/pto_runtime2.cpp
index 5c0f837e3..b3ebdd3ba 100644
--- a/src/a2a3/runtime/tensormap_and_ringbuffer/runtime/pto_runtime2.cpp
+++ b/src/a2a3/runtime/tensormap_and_ringbuffer/runtime/pto_runtime2.cpp
@@ -77,6 +77,16 @@ static void fail_manual_tensor_access(PTO2Runtime *rt, const char *caller) {
     );
 }
 
+static void fail_manual_tensor_access(PTO2Runtime *rt, const char *caller) {
+    PTO2OrchestratorState &orch = rt->orchestrators[pto2_current_orch_idx];
+    orch.sm_handle->header->orch_error_code.store(PTO2_ERROR_INVALID_ARGS, std::memory_order_release);
+    orch.fatal = true;
+    unified_log_error(
+        caller,
+        "blocking tensor data access is not supported inside PTO2_SCOPE(PTO2ScopeMode::MANUAL); exit the manual scope first"
+    );
+}
+
 // Wait for all producers of this tensor to be safe for data access.
 // Checks owner metadata (lifecycle anchor) and OverlapMap (modifier writers).
 // For reads: wait until each producer COMPLETED (done writing).
From c4b5fa80c0c4a9f033e3d732756f7feba6f49b6d Mon Sep 17 00:00:00 2001
From: Youwei Xiao
Date: Sun, 5 Apr 2026 22:04:53 +0800
Subject: [PATCH 07/35] Restore zero-overhead auto scope path

---
 .../runtime/pto_orchestrator.cpp | 16 ++++++----------
 .../runtime/pto_orchestrator.h   | 19 ++++++++++---------
 .../runtime/pto_runtime2.cpp     | 10 ----------
 3 files changed, 16 insertions(+), 29 deletions(-)

diff --git a/src/a2a3/runtime/tensormap_and_ringbuffer/runtime/pto_orchestrator.cpp b/src/a2a3/runtime/tensormap_and_ringbuffer/runtime/pto_orchestrator.cpp
index 871701c3d..7c410de5c 100644
--- a/src/a2a3/runtime/tensormap_and_ringbuffer/runtime/pto_orchestrator.cpp
+++ b/src/a2a3/runtime/tensormap_and_ringbuffer/runtime/pto_orchestrator.cpp
@@ -392,16 +392,14 @@ bool pto2_orchestrator_init(
     int32_t init_cap = PTO2_SCOPE_TASKS_INIT_CAP;
     orch->scope_tasks = reinterpret_cast<PTO2TaskSlotState **>(malloc(init_cap * sizeof(PTO2TaskSlotState *)));
     orch->scope_begins = reinterpret_cast<int32_t *>(malloc(max_depth * sizeof(int32_t)));
-    orch->scope_modes = reinterpret_cast<PTO2ScopeMode *>(malloc(max_depth * sizeof(PTO2ScopeMode)));
     orch->manual_task_meta_begins = reinterpret_cast<int32_t *>(malloc(max_depth * sizeof(int32_t)));
     orch->manual_edge_begins = reinterpret_cast<int32_t *>(malloc(max_depth * sizeof(int32_t)));
     orch->manual_task_meta = reinterpret_cast<PTO2ManualTaskMeta *>(malloc(init_cap * sizeof(PTO2ManualTaskMeta)));
     orch->manual_edges = reinterpret_cast<PTO2ManualEdge *>(malloc(init_cap * sizeof(PTO2ManualEdge)));
-    if (!orch->scope_tasks || !orch->scope_begins || !orch->scope_modes || !orch->manual_task_meta_begins ||
-        !orch->manual_edge_begins || !orch->manual_task_meta || !orch->manual_edges) {
+    if (!orch->scope_tasks || !orch->scope_begins || !orch->manual_task_meta_begins || !orch->manual_edge_begins ||
+        !orch->manual_task_meta || !orch->manual_edges) {
         free(orch->scope_tasks);
         free(orch->scope_begins);
-        free(orch->scope_modes);
         free(orch->manual_task_meta_begins);
         free(orch->manual_edge_begins);
         free(orch->manual_task_meta);
@@ -440,8 +438,6 @@ void pto2_orchestrator_destroy(PTO2OrchestratorState *orch) {
     orch->scope_tasks = NULL;
     free(orch->scope_begins);
     orch->scope_begins = NULL;
-    free(orch->scope_modes);
-    orch->scope_modes = NULL;
     free(orch->manual_task_meta_begins);
     orch->manual_task_meta_begins = NULL;
     free(orch->manual_edge_begins);
@@ -500,7 +496,7 @@ static int32_t manual_edge_push(PTO2OrchestratorState *orch, const PTO2ManualEdg
 }
 
 static bool in_manual_scope(const PTO2OrchestratorState *orch) {
-    return orch->scope_stack_top >= 0 && orch->scope_modes[orch->scope_stack_top] == PTO2ScopeMode::MANUAL;
+    return orch->manual_scope_active;
 }
 
 static int32_t current_manual_scope_begin(const PTO2OrchestratorState *orch) {
@@ -565,7 +561,6 @@ void pto2_scope_begin(PTO2OrchestratorState *orch, PTO2ScopeMode mode) {
 
     ++orch->scope_stack_top;
     orch->scope_begins[orch->scope_stack_top] = orch->scope_tasks_size;
-    orch->scope_modes[orch->scope_stack_top] = mode;
     orch->manual_scope_active = (mode == PTO2ScopeMode::MANUAL);
     if (mode == PTO2ScopeMode::MANUAL) {
         orch->manual_task_meta_begins[orch->scope_stack_top] = orch->manual_task_meta_size;
@@ -586,10 +581,11 @@ void pto2_scope_end(PTO2OrchestratorState *orch) {
     int32_t top = orch->scope_stack_top;
     int32_t begin = orch->scope_begins[top];
     int32_t count = orch->scope_tasks_size - begin;
-    PTO2ScopeMode mode = orch->scope_modes[top];
+    bool manual_scope = orch->manual_scope_active;
 
-    if (mode != PTO2ScopeMode::MANUAL) {
+    if (!manual_scope) {
         orch->scope_stack_top--;
+        orch->manual_scope_active = false;
 
         if (orch->scheduler && count > 0) {
             orch->scheduler->on_scope_end(&orch->scope_tasks[begin], count);
diff --git a/src/a2a3/runtime/tensormap_and_ringbuffer/runtime/pto_orchestrator.h b/src/a2a3/runtime/tensormap_and_ringbuffer/runtime/pto_orchestrator.h
index e78b8028a..430f86604 100644
--- a/src/a2a3/runtime/tensormap_and_ringbuffer/runtime/pto_orchestrator.h
+++ b/src/a2a3/runtime/tensormap_and_ringbuffer/runtime/pto_orchestrator.h
@@ -77,18 +77,9 @@ struct PTO2OrchestratorState {
     int32_t scope_tasks_size;          // Number of task IDs currently in the buffer
     int32_t scope_tasks_capacity;      // Allocated capacity of scope_tasks
     int32_t *scope_begins;             // scope_begins[i] = start index of scope i in scope_tasks
-    PTO2ScopeMode *scope_modes;        // Mode for each scope frame
-    int32_t *manual_task_meta_begins;  // start index in manual_task_meta for each scope
-    int32_t *manual_edge_begins;       // start index in manual_edges for each scope
     int32_t scope_stack_top;           // Current top of stack (-1 = no scope open)
     uint64_t scope_stack_capacity;     // Max nesting depth (PTO2_MAX_SCOPE_DEPTH)
     bool manual_scope_active{false};
-    PTO2ManualTaskMeta *manual_task_meta;
-    int32_t manual_task_meta_size;
-    int32_t manual_task_meta_capacity;
-    PTO2ManualEdge *manual_edges;
-    int32_t manual_edges_size;
-    int32_t manual_edges_capacity;
 
     // === SCHEDULER REFERENCE ===
     // Note: In simulated mode, orchestrator and scheduler share address space
@@ -112,6 +103,16 @@ struct PTO2OrchestratorState {
     // Cross-thread notification uses shared memory orch_error_code (atomic)
     bool fatal;
 
+    // === MANUAL-SCOPE METADATA ===
+    int32_t *manual_task_meta_begins;  // start index in manual_task_meta for each scope
+    int32_t *manual_edge_begins;       // start index in manual_edges for each scope
+    PTO2ManualTaskMeta *manual_task_meta;
+    int32_t manual_task_meta_size;
+    int32_t manual_task_meta_capacity;
+    PTO2ManualEdge *manual_edges;
+    int32_t manual_edges_size;
+    int32_t manual_edges_capacity;
+
     // Hidden alloc tasks complete synchronously inside the orchestrator and
     // therefore bypass the executor's normal worker-completion counter path.
     // The executor adds this count into its completed_tasks_ progress counter
diff --git a/src/a2a3/runtime/tensormap_and_ringbuffer/runtime/pto_runtime2.cpp b/src/a2a3/runtime/tensormap_and_ringbuffer/runtime/pto_runtime2.cpp
index b3ebdd3ba..5c0f837e3 100644
--- a/src/a2a3/runtime/tensormap_and_ringbuffer/runtime/pto_runtime2.cpp
+++ b/src/a2a3/runtime/tensormap_and_ringbuffer/runtime/pto_runtime2.cpp
@@ -77,16 +77,6 @@ static void fail_manual_tensor_access(PTO2Runtime *rt, const char *caller) {
     );
 }
 
-static void fail_manual_tensor_access(PTO2Runtime *rt, const char *caller) {
-    PTO2OrchestratorState &orch = rt->orchestrators[pto2_current_orch_idx];
-    orch.sm_handle->header->orch_error_code.store(PTO2_ERROR_INVALID_ARGS, std::memory_order_release);
-    orch.fatal = true;
-    unified_log_error(
-        caller,
-        "blocking tensor data access is not supported inside PTO2_SCOPE(PTO2ScopeMode::MANUAL); exit the manual scope first"
-    );
-}
-
 // Wait for all producers of this tensor to be safe for data access.
 // Checks owner metadata (lifecycle anchor) and OverlapMap (modifier writers).
 // For reads: wait until each producer COMPLETED (done writing).
From 5b10117304b9f846f2780f2a300d7569d50a0e9a Mon Sep 17 00:00:00 2001 From: Youwei Xiao Date: Sun, 5 Apr 2026 22:07:27 +0800 Subject: [PATCH 08/35] Add manual scope guard regression tests --- .../manual_scope_guard_negative/golden.py | 35 ++++++++ .../kernels/kernel_config.py | 36 +++++++++ .../orchestration/manual_scope_guard_orch.cpp | 63 +++++++++++++++ tests/ut/test_manual_scope_guards.py | 81 +++++++++++++++++++ 4 files changed, 215 insertions(+) create mode 100644 tests/st/a2a3/tensormap_and_ringbuffer/manual_scope_guard_negative/golden.py create mode 100644 tests/st/a2a3/tensormap_and_ringbuffer/manual_scope_guard_negative/kernels/kernel_config.py create mode 100644 tests/st/a2a3/tensormap_and_ringbuffer/manual_scope_guard_negative/kernels/orchestration/manual_scope_guard_orch.cpp create mode 100644 tests/ut/test_manual_scope_guards.py diff --git a/tests/st/a2a3/tensormap_and_ringbuffer/manual_scope_guard_negative/golden.py b/tests/st/a2a3/tensormap_and_ringbuffer/manual_scope_guard_negative/golden.py new file mode 100644 index 000000000..0f19662d7 --- /dev/null +++ b/tests/st/a2a3/tensormap_and_ringbuffer/manual_scope_guard_negative/golden.py @@ -0,0 +1,35 @@ +# Copyright (c) PyPTO Contributors. +# This program is free software, you can redistribute it and/or modify it under the terms and conditions of +# CANN Open Software License Agreement Version 2.0 (the "License"). +# Please refer to the License for details. You may not use this file except in compliance with the License. +# THIS SOFTWARE IS PROVIDED ON AN "AS IS" BASIS, WITHOUT WARRANTIES OF ANY KIND, EITHER EXPRESS OR IMPLIED, +# INCLUDING BUT NOT LIMITED TO NON-INFRINGEMENT, MERCHANTABILITY, OR FITNESS FOR A PARTICULAR PURPOSE. +# See LICENSE in the root of the software repository for the full text of the License. 
+# ----------------------------------------------------------------------------------------------------------- + +import ctypes + +import torch + + +ALL_CASES = { + "NestedManualScope": {"mode": 1}, + "ManualGetTensorData": {"mode": 2}, + "ManualSetTensorData": {"mode": 3}, + "ManualSelfDependency": {"mode": 4}, +} + +DEFAULT_CASE = "NestedManualScope" +__outputs__ = ["tensor"] + + +def generate_inputs(params: dict) -> list: + tensor = torch.arange(16, dtype=torch.float32) + return [ + ("tensor", tensor), + ("mode", ctypes.c_uint64(params["mode"])), + ] + + +def compute_golden(tensors: dict, params: dict) -> None: + del tensors, params diff --git a/tests/st/a2a3/tensormap_and_ringbuffer/manual_scope_guard_negative/kernels/kernel_config.py b/tests/st/a2a3/tensormap_and_ringbuffer/manual_scope_guard_negative/kernels/kernel_config.py new file mode 100644 index 000000000..358bf59fb --- /dev/null +++ b/tests/st/a2a3/tensormap_and_ringbuffer/manual_scope_guard_negative/kernels/kernel_config.py @@ -0,0 +1,36 @@ +# Copyright (c) PyPTO Contributors. +# This program is free software, you can redistribute it and/or modify it under the terms and conditions of +# CANN Open Software License Agreement Version 2.0 (the "License"). +# Please refer to the License for details. You may not use this file except in compliance with the License. +# THIS SOFTWARE IS PROVIDED ON AN "AS IS" BASIS, WITHOUT WARRANTIES OF ANY KIND, EITHER EXPRESS OR IMPLIED, +# INCLUDING BUT NOT LIMITED TO NON-INFRINGEMENT, MERCHANTABILITY, OR FITNESS FOR A PARTICULAR PURPOSE. +# See LICENSE in the root of the software repository for the full text of the License. 
+# ----------------------------------------------------------------------------------------------------------- + +from pathlib import Path + +from task_interface import ArgDirection as D # pyright: ignore[reportAttributeAccessIssue] + +_KERNELS_ROOT = Path(__file__).parent +_SCALAR_DATA_ROOT = _KERNELS_ROOT.parents[1] / "scalar_data_test" / "kernels" + +ORCHESTRATION = { + "source": str(_KERNELS_ROOT / "orchestration" / "manual_scope_guard_orch.cpp"), + "function_name": "aicpu_orchestration_entry", + "signature": [D.IN], +} + +KERNELS = [ + { + "func_id": 0, + "source": str(_SCALAR_DATA_ROOT / "aiv" / "kernel_noop.cpp"), + "core_type": "aiv", + "signature": [], + }, +] + +RUNTIME_CONFIG = { + "runtime": "tensormap_and_ringbuffer", + "aicpu_thread_num": 4, + "block_dim": 3, +} diff --git a/tests/st/a2a3/tensormap_and_ringbuffer/manual_scope_guard_negative/kernels/orchestration/manual_scope_guard_orch.cpp b/tests/st/a2a3/tensormap_and_ringbuffer/manual_scope_guard_negative/kernels/orchestration/manual_scope_guard_orch.cpp new file mode 100644 index 000000000..e9a4e9c1c --- /dev/null +++ b/tests/st/a2a3/tensormap_and_ringbuffer/manual_scope_guard_negative/kernels/orchestration/manual_scope_guard_orch.cpp @@ -0,0 +1,63 @@ +/* + * Copyright (c) PyPTO Contributors. + * This program is free software, you can redistribute it and/or modify it under the terms and conditions of + * CANN Open Software License Agreement Version 2.0 (the "License"). + * Please refer to the License for details. You may not use this file except in compliance with the License. + * THIS SOFTWARE IS PROVIDED ON AN "AS IS" BASIS, WITHOUT WARRANTIES OF ANY KIND, EITHER EXPRESS OR IMPLIED, + * INCLUDING BUT NOT LIMITED TO NON-INFRINGEMENT, MERCHANTABILITY, OR FITNESS FOR A PARTICULAR PURPOSE. + * See LICENSE in the root of the software repository for the full text of the License. 
+ * ----------------------------------------------------------------------------------------------------------- + */ + +#include <cstdint> + +#include "pto_orchestration_api.h" // NOLINT(build/include_subdir) + +#define FUNC_NOOP 0 + +extern "C" { + +__attribute__((visibility("default"))) PTO2OrchestrationConfig aicpu_orchestration_config( + const ChipStorageTaskArgs &orch_args) { + (void)orch_args; // NOLINT(readability/casting) + return PTO2OrchestrationConfig{ + .expected_arg_count = 2, + }; +} + +__attribute__((visibility("default"))) void aicpu_orchestration_entry( + const ChipStorageTaskArgs &orch_args, int orch_thread_num, int orch_thread_index) { + (void)orch_thread_num; // NOLINT(readability/casting) + (void)orch_thread_index; // NOLINT(readability/casting) + + Tensor tensor = from_tensor_arg(orch_args.tensor(0)); + uint64_t mode = orch_args.scalar(0); + uint32_t idx[1] = {0}; + + switch (mode) { + case 1: + PTO2_SCOPE(PTO2ScopeMode::MANUAL) { + PTO2_SCOPE(PTO2ScopeMode::MANUAL) {} + } + break; + case 2: + PTO2_SCOPE(PTO2ScopeMode::MANUAL) { + (void)get_tensor_data(tensor, 1, idx); // NOLINT(readability/casting) + } + break; + case 3: + PTO2_SCOPE(PTO2ScopeMode::MANUAL) { set_tensor_data(tensor, 1, idx, 1.0f); } + break; + case 4: + PTO2_SCOPE(PTO2ScopeMode::MANUAL) { + Arg args; + PTO2ManualSubmitResult submit_result = pto2_rt_submit_aiv_task_manual(FUNC_NOOP, args); + pto2_rt_add_dependency(submit_result.task_id, submit_result.task_id); + } + break; + default: + PTO2_SCOPE() {} + break; + } +} +} diff --git a/tests/ut/test_manual_scope_guards.py b/tests/ut/test_manual_scope_guards.py new file mode 100644 index 000000000..11bb6c175 --- /dev/null +++ b/tests/ut/test_manual_scope_guards.py @@ -0,0 +1,81 @@ +import os +import subprocess +import sys +import time +from pathlib import Path + +import pytest + + +PROJECT_ROOT = Path(__file__).parent.parent.parent +RUN_EXAMPLE = PROJECT_ROOT / "examples" / "scripts" / "run_example.py" +KERNELS_DIR = PROJECT_ROOT / "tests" /
"st" / "a2a3" / "tensormap_and_ringbuffer" / "manual_scope_guard_negative" / "kernels" +GOLDEN = PROJECT_ROOT / "tests" / "st" / "a2a3" / "tensormap_and_ringbuffer" / "manual_scope_guard_negative" / "golden.py" +PTO_ISA_COMMIT = "6622890" + + +@pytest.mark.requires_hardware +@pytest.mark.skipif(not os.getenv("ASCEND_HOME_PATH"), reason="ASCEND_HOME_PATH not set; Ascend toolkit required") +@pytest.mark.parametrize( + ("case_name", "expected_message"), + [ + ( + "NestedManualScope", + "manual scope inside manual scope is not supported", + ), + ( + "ManualGetTensorData", + "blocking tensor data access is not supported inside PTO2_SCOPE(PTO2ScopeMode::MANUAL); exit the manual scope first", + ), + ( + "ManualSetTensorData", + "blocking tensor data access is not supported inside PTO2_SCOPE(PTO2ScopeMode::MANUAL); exit the manual scope first", + ), + ( + "ManualSelfDependency", + "add_dependency does not allow self-dependency", + ), + ], +) +def test_manual_scope_guard_failures(case_name, expected_message): + device_id = os.environ.get("PTO_TEST_DEVICE_ID", "0") + log_dir = Path.home() / "ascend" / "log" / "debug" / f"device-{device_id}" + if os.getenv("ASCEND_WORK_PATH"): + work_log_dir = Path(os.environ["ASCEND_WORK_PATH"]).expanduser() / "log" / "debug" / f"device-{device_id}" + if work_log_dir.exists(): + log_dir = work_log_dir + before_logs = set(log_dir.glob("*.log")) if log_dir.exists() else set() + command = ( + f"source {os.environ['ASCEND_HOME_PATH']}/bin/setenv.bash >/dev/null 2>&1 && " + f"{sys.executable} {RUN_EXAMPLE} --build --silent " + f"-k {KERNELS_DIR} -g {GOLDEN} -p a2a3 -d {device_id} " + f"--case {case_name} --clone-protocol https -c {PTO_ISA_COMMIT}" + ) + result = subprocess.run( + ["bash", "-lc", command], + cwd=PROJECT_ROOT, + capture_output=True, + text=True, + check=False, + ) + + assert result.returncode != 0 + combined_output = result.stdout + result.stderr + + new_log = None + deadline = time.monotonic() + 20 + while time.monotonic() < 
deadline: + current_logs = set(log_dir.glob("*.log")) if log_dir.exists() else set() + created = current_logs - before_logs + if created: + new_log = max(created, key=lambda path: path.stat().st_mtime) + break + time.sleep(0.5) + + if new_log is None: + logs = list(log_dir.glob("*.log")) if log_dir.exists() else [] + assert logs, "expected a device log for the failed manual-scope case" + new_log = max(logs, key=lambda path: path.stat().st_mtime) + + log_text = new_log.read_text(encoding="utf-8", errors="ignore") + assert expected_message in combined_output or expected_message in log_text From 0bb19c932699edd31b40e09d76cf60ec65bf5b81 Mon Sep 17 00:00:00 2001 From: Youwei Xiao Date: Sun, 5 Apr 2026 22:12:37 +0800 Subject: [PATCH 09/35] Harden manual scope guard coverage --- docs/manual-dep-for-tensormap-design.md | 65 ++++++++++--------- .../paged_attention_partial_manual/golden.py | 2 +- .../aicpu/aicpu_executor.cpp | 11 ++++ .../orchestration/manual_scope_guard_orch.cpp | 5 +- .../paged_attention_partial_manual/golden.py | 2 +- .../golden.py | 2 +- tools/benchmark_rounds.sh | 28 ++++++-- 7 files changed, 72 insertions(+), 43 deletions(-) diff --git a/docs/manual-dep-for-tensormap-design.md b/docs/manual-dep-for-tensormap-design.md index ae42f6f4c..35f3ddcf8 100644 --- a/docs/manual-dep-for-tensormap-design.md +++ b/docs/manual-dep-for-tensormap-design.md @@ -4,7 +4,7 @@ Bring the human-created dependency workflow from `aicpu_build_graph` into `tensormap_and_ringbuffer` in a scoped way: -- `PTO2_SCOPE(true) { ... }` +- `PTO2_SCOPE(PTO2ScopeMode::MANUAL) { ... }` - Tensors crossing scope boundaries use TensorMap semantics - Tensors used entirely inside the manual scope use explicit `add_dependency` @@ -50,7 +50,7 @@ The implementation PR must follow these rules: - Keep the change strictly scoped to manual dependency support in `tensormap_and_ringbuffer`. - Do not refactor unrelated runtime behavior while doing this work. 
-- Do not change existing normal-scope TensorMap semantics. +- Do not change existing auto-scope TensorMap semantics. - Do not change scope lifetime semantics. - Prefer the smallest invasive write set that cleanly supports the feature. - Preserve existing examples/tests unless a targeted update is required to cover the new feature. @@ -168,7 +168,7 @@ Concise conclusion: If we simply copy `aicpu_build_graph` semantics into `tensormap_and_ringbuffer`, we get a wrong boundary model: -- suppressing TensorMap for all tensors inside `PTO2_SCOPE(true)` is incorrect +- suppressing TensorMap for all tensors inside `PTO2_SCOPE(PTO2ScopeMode::MANUAL)` is incorrect - delaying publication of an outer tensor until `scope_end` is incorrect The reason is that cross-scope tensors must become visible at the actual writer frontier. Outside consumers should depend on the task that really produced the latest visible state, not on scope closure. @@ -182,7 +182,7 @@ So the correct split is: ## Core rule -`PTO2_SCOPE(true)` means: +`PTO2_SCOPE(PTO2ScopeMode::MANUAL)` means: - if a tensor was created inside this manual scope and is reused inside this manual scope, the dependency must be established by explicit `add_dependency` - all outer-scope tensors still use existing TensorMap/owner metadata @@ -199,7 +199,7 @@ The design must distinguish two different kinds of publication: ### Scheduler publication -For tasks inside `PTO2_SCOPE(true)`: +For tasks inside `PTO2_SCOPE(PTO2ScopeMode::MANUAL)`: - submit builds the internal task records and records explicit dependencies - those tasks are not yet published as executable scheduler work @@ -209,7 +209,7 @@ This is required so all same-scope explicit edges are fully wired before any tas ### TensorMap boundary publication -For cross-scope tensors touched by tasks inside `PTO2_SCOPE(true)`: +For cross-scope tensors touched by tasks inside `PTO2_SCOPE(PTO2ScopeMode::MANUAL)`: - outside tasks submitted after the manual scope ends must still be 
able to discover the internal writer frontier - therefore the producer frontier for an external tensor written inside the manual scope must become visible to later TensorMap lookups at manual `scope_end` @@ -280,7 +280,7 @@ Everything else stays on the existing TensorMap path. - That boundary frontier becomes visible at manual `scope_end`, so later outside submissions can attach to the correct writer task. - Readiness of the written tensor is the completion of that writer task. - Multiple writes inside the same manual scope are allowed. -- TensorMap should continue tracking the latest producer frontier exactly as in normal scope once the manual scope is finalized. +- TensorMap should continue tracking the latest producer frontier exactly as in auto scope once the manual scope is finalized. ### 3. Tensor created inside this manual scope and reused only inside this manual scope @@ -316,15 +316,20 @@ Add explicit edge wiring to `tensormap_and_ringbuffer` orchestration API: void pto2_rt_add_dependency(PTO2TaskId producer, PTO2TaskId consumer); ``` -Extend scope syntax to accept an optional manual flag: +Use an explicit scope mode enum for the scoped API: ```cpp -PTO2_SCOPE(true) { +enum class PTO2ScopeMode : uint8_t { + AUTO = 0, + MANUAL = 1, +}; + +PTO2_SCOPE(PTO2ScopeMode::MANUAL) { ... } ``` -`PTO2_SCOPE()` remains the normal-scope form. `PTO2_SCOPE(true)` enters manual mode. +`PTO2_SCOPE()` remains the auto-scope form by default. `PTO2_SCOPE(PTO2ScopeMode::MANUAL)` enters manual mode explicitly. Do not change `TaskOutputTensors`. @@ -341,11 +346,11 @@ PTO2ManualSubmitResult pto2_rt_submit_aic_task_manual(int32_t kernel_id, const A PTO2ManualSubmitResult pto2_rt_submit_aiv_task_manual(int32_t kernel_id, const Arg& args); ``` -These APIs are intended for use inside `PTO2_SCOPE(true)` where explicit dependency wiring is required. +These APIs are intended for use inside `PTO2_SCOPE(PTO2ScopeMode::MANUAL)` where explicit dependency wiring is required. 
This design intentionally splits task APIs, not tensor storage APIs: -- normal scope uses existing `pto2_rt_submit_*` task APIs +- auto scope uses existing `pto2_rt_submit_*` task APIs - manual scope uses `pto2_rt_submit_*_manual` task APIs - both modes continue using the same `Tensor`, `Arg`, and `TensorArgType` model @@ -374,15 +379,15 @@ Add runtime ops support: ```cpp void (*add_dependency)(PTO2Runtime* rt, PTO2TaskId producer, PTO2TaskId consumer); -void (*scope_begin)(PTO2Runtime* rt, bool manual_dep); +void (*scope_begin)(PTO2Runtime* rt, PTO2ScopeMode mode); ``` -The orchestration-facing helper can stay TLS-style and hide the runtime pointer, for example by plumbing the flag through the existing `pto2_rt_scope_begin()` / `PTO2ScopeGuard` path. +The orchestration-facing helper can stay TLS-style and hide the runtime pointer, for example by plumbing the mode through the existing `pto2_rt_scope_begin()` / `PTO2ScopeGuard` path. Add manual-scope entry/exit plumbing by extending the existing runtime entry point with a mode flag: ```cpp -void pto2_rt_scope_begin(PTO2Runtime* rt, bool manual_dep); +void pto2_rt_scope_begin(PTO2Runtime* rt, PTO2ScopeMode mode = PTO2ScopeMode::AUTO); ``` Recommendation: extend scope state with a mode flag and keep one scope stack. Avoid separate manual/non-manual stacks. @@ -481,20 +486,20 @@ This is why a separate user-facing “external tensor” API is not required for ## Scheduler-Safe Hybrid Design -The scheduler changes should be localized and should not disturb existing normal-scope behavior. +The scheduler changes should be localized and should not disturb existing auto-scope behavior. ### Design principle Keep two execution paths: -- normal scope path: existing `tensormap_and_ringbuffer` behavior +- auto scope path: existing `tensormap_and_ringbuffer` behavior - manual scope path: deferred dependency realization and deferred scheduler publication -The normal path should remain unchanged as much as possible. 
+The auto path should remain unchanged as much as possible. ### What a manual-scope task must count as dependencies -For a task inside `PTO2_SCOPE(true)`, total fanin is: +For a task inside `PTO2_SCOPE(PTO2ScopeMode::MANUAL)`, total fanin is: - explicit manual dependencies added by `add_dependency` - external dependencies derived from TensorMap/owner logic for outer-scope reads @@ -742,7 +747,7 @@ This is supported by the same fanin accounting model. Example: ```cpp -PTO2_SCOPE(true) { +PTO2_SCOPE(PTO2ScopeMode::MANUAL) { t0 = in-scope producer of tmp t1 = consumer of tmp and outer tensor X add_dependency(t0, t1) @@ -820,7 +825,7 @@ This case must be supported in v1. Example: ```cpp -PTO2_SCOPE(true) { +PTO2_SCOPE(PTO2ScopeMode::MANUAL) { t1 writes outer C t2 writes outer C add_dependency(t1, t2) @@ -839,7 +844,7 @@ Correct behavior: Potential invalid user pattern: ```cpp -PTO2_SCOPE(true) { +PTO2_SCOPE(PTO2ScopeMode::MANUAL) { t1 writes outer C t2 also writes outer C // missing add_dependency(t1, t2) @@ -869,8 +874,8 @@ This is a strict requirement: Supported: -- normal scope contains manual scope -- normal scope contains normal scope +- auto scope contains manual scope +- auto scope contains auto scope Not supported in v1: @@ -881,7 +886,7 @@ Reason: - current ring selection depends on scope depth - the top scope frame is also the publication and lifetime-release boundary -- allowing a child scope inside `PTO2_SCOPE(true)` would split one manual region across multiple scope/ring boundaries unless extra machinery is added +- allowing a child scope inside `PTO2_SCOPE(PTO2ScopeMode::MANUAL)` would split one manual region across multiple scope/ring boundaries unless extra machinery is added - rejecting nested scopes inside manual mode keeps `current_manual_scope_owns(...)` a simple membership check over one manual frame Recommendation: @@ -891,9 +896,9 @@ Recommendation: Required error text quality: -- the message must explicitly say that nested scope inside 
`PTO2_SCOPE(true)` is not supported in v1 +- the message must explicitly say that nested scope inside `PTO2_SCOPE(PTO2ScopeMode::MANUAL)` is not supported in v1 - the message must explicitly say that `manual scope inside manual scope is not supported` -- the message must identify the offending operation as nested `PTO2_SCOPE(true)` +- the message must identify the offending operation as nested `PTO2_SCOPE(PTO2ScopeMode::MANUAL)` - the message must not use vague wording such as only `invalid scope state` ## Blocking Cross-Layer Tensor Access @@ -904,12 +909,12 @@ That assumption does not hold inside manual scope because tasks remain unpublish So v1 should fail fast: -- `get_tensor_data` inside `PTO2_SCOPE(true)` is an error -- `set_tensor_data` inside `PTO2_SCOPE(true)` is an error +- `get_tensor_data` inside `PTO2_SCOPE(PTO2ScopeMode::MANUAL)` is an error +- `set_tensor_data` inside `PTO2_SCOPE(PTO2ScopeMode::MANUAL)` is an error Required error text quality: -- the message must explicitly say that blocking tensor data access is not supported inside `PTO2_SCOPE(true)` +- the message must explicitly say that blocking tensor data access is not supported inside `PTO2_SCOPE(PTO2ScopeMode::MANUAL)` - the message should tell the user to exit the manual scope first ## Diagnostics diff --git a/examples/a2a3/tensormap_and_ringbuffer/paged_attention_partial_manual/golden.py b/examples/a2a3/tensormap_and_ringbuffer/paged_attention_partial_manual/golden.py index b03743d37..89df96225 100644 --- a/examples/a2a3/tensormap_and_ringbuffer/paged_attention_partial_manual/golden.py +++ b/examples/a2a3/tensormap_and_ringbuffer/paged_attention_partial_manual/golden.py @@ -4,4 +4,4 @@ _BASE = Path(__file__).resolve().parents[1] / "paged_attention" sys.path.insert(0, str(_BASE)) -from golden import compute_golden, generate_inputs # noqa: E402,F401 +from golden import ALL_CASES, ATOL, DEFAULT_CASE, RTOL, __outputs__, compute_golden, generate_inputs # noqa: E402,F401 diff --git 
a/src/a2a3/runtime/tensormap_and_ringbuffer/aicpu/aicpu_executor.cpp b/src/a2a3/runtime/tensormap_and_ringbuffer/aicpu/aicpu_executor.cpp index 97afd6a4e..18f739264 100644 --- a/src/a2a3/runtime/tensormap_and_ringbuffer/aicpu/aicpu_executor.cpp +++ b/src/a2a3/runtime/tensormap_and_ringbuffer/aicpu/aicpu_executor.cpp @@ -2364,6 +2364,17 @@ int32_t AicpuExecutor::run(Runtime *runtime) { } } + if (rt != nullptr) { + void* sm = runtime->get_pto2_gm_sm_ptr(); + if (sm != nullptr) { + int32_t orch_err = static_cast(sm)->orch_error_code.load(std::memory_order_acquire); + if (orch_err != PTO2_ERROR_NONE) { + DEV_ERROR("Thread %d: Exiting with orchestrator error code=%d", thread_idx, orch_err); + return -1; + } + } + } + return 0; } diff --git a/tests/st/a2a3/tensormap_and_ringbuffer/manual_scope_guard_negative/kernels/orchestration/manual_scope_guard_orch.cpp b/tests/st/a2a3/tensormap_and_ringbuffer/manual_scope_guard_negative/kernels/orchestration/manual_scope_guard_orch.cpp index e9a4e9c1c..f9a37ddd7 100644 --- a/tests/st/a2a3/tensormap_and_ringbuffer/manual_scope_guard_negative/kernels/orchestration/manual_scope_guard_orch.cpp +++ b/tests/st/a2a3/tensormap_and_ringbuffer/manual_scope_guard_negative/kernels/orchestration/manual_scope_guard_orch.cpp @@ -50,9 +50,8 @@ __attribute__((visibility("default"))) void aicpu_orchestration_entry( break; case 4: PTO2_SCOPE(PTO2ScopeMode::MANUAL) { - Arg args; - PTO2ManualSubmitResult submit_result = pto2_rt_submit_aiv_task_manual(FUNC_NOOP, args); - pto2_rt_add_dependency(submit_result.task_id, submit_result.task_id); + PTO2TaskId invalid = PTO2TaskId::invalid(); + pto2_rt_add_dependency(invalid, invalid); } break; default: diff --git a/tests/st/a2a3/tensormap_and_ringbuffer/paged_attention_partial_manual/golden.py b/tests/st/a2a3/tensormap_and_ringbuffer/paged_attention_partial_manual/golden.py index b03743d37..89df96225 100644 --- a/tests/st/a2a3/tensormap_and_ringbuffer/paged_attention_partial_manual/golden.py +++ 
b/tests/st/a2a3/tensormap_and_ringbuffer/paged_attention_partial_manual/golden.py @@ -4,4 +4,4 @@ _BASE = Path(__file__).resolve().parents[1] / "paged_attention" sys.path.insert(0, str(_BASE)) -from golden import compute_golden, generate_inputs # noqa: E402,F401 +from golden import ALL_CASES, ATOL, DEFAULT_CASE, RTOL, __outputs__, compute_golden, generate_inputs # noqa: E402,F401 diff --git a/tests/st/a2a3/tensormap_and_ringbuffer/paged_attention_unroll_partial_manual/golden.py b/tests/st/a2a3/tensormap_and_ringbuffer/paged_attention_unroll_partial_manual/golden.py index 95a72e23b..d3f3c1ac1 100644 --- a/tests/st/a2a3/tensormap_and_ringbuffer/paged_attention_unroll_partial_manual/golden.py +++ b/tests/st/a2a3/tensormap_and_ringbuffer/paged_attention_unroll_partial_manual/golden.py @@ -4,5 +4,5 @@ _BASE = Path(__file__).resolve().parents[1] / "paged_attention_unroll" sys.path.insert(0, str(_BASE)) -from golden import ALL_CASES, ATOL, DEFAULT_CASE, RTOL, generate_inputs # noqa: E402,F401 +from golden import ALL_CASES, ATOL, DEFAULT_CASE, RTOL, __outputs__, generate_inputs # noqa: E402,F401 from paged_attention_golden import compute_golden, run_golden_test # noqa: E402,F401 diff --git a/tools/benchmark_rounds.sh b/tools/benchmark_rounds.sh index 6f07d6dd4..f166cbc32 100755 --- a/tools/benchmark_rounds.sh +++ b/tools/benchmark_rounds.sh @@ -21,18 +21,12 @@ RUN_EXAMPLE="$PROJECT_ROOT/examples/scripts/run_example.py" # --- tensormap_and_ringbuffer --- declare -A TMR_EXAMPLE_CASES=( - [alternating_matmul_add]="" - [benchmark_bgemm]="" [paged_attention]="Case1,Case2" [paged_attention_unroll]="Case1,Case2" - [batch_paged_attention]="" ) TMR_EXAMPLE_ORDER=( - alternating_matmul_add - benchmark_bgemm paged_attention paged_attention_unroll - batch_paged_attention ) # --- aicpu_build_graph --- @@ -63,6 +57,7 @@ ROUNDS=100 PLATFORM=a2a3 RUNTIME=tensormap_and_ringbuffer VERBOSE=0 +EXAMPLE_FILTER="" EXTRA_ARGS=() while [[ $# -gt 0 ]]; do @@ -83,6 +78,10 @@ while [[ $# -gt 0 ]]; do 
RUNTIME="$2" shift 2 ;; + -e|--examples) + EXAMPLE_FILTER="$2" + shift 2 + ;; -v|--verbose) VERBOSE=1 shift @@ -92,13 +91,14 @@ while [[ $# -gt 0 ]]; do benchmark_rounds.sh — run all examples and report per-round timing from device logs Usage: - ./tools/benchmark_rounds.sh [-p ] [-d ] [-n ] [-r ] [-v] + ./tools/benchmark_rounds.sh [-p ] [-d ] [-n ] [-r ] [-e ] [-v] Options: -p, --platform Platform to run on (default: a2a3) -d, --device Device ID (default: 0) -n, --rounds Override number of rounds for each example (default: 100) -r, --runtime Runtime to benchmark: tensormap_and_ringbuffer (default), tensormap_and_ringbuffer_unmodified, aicpu_build_graph + -e, --examples Comma-separated example names to run (default: runtime-specific full list) -v, --verbose Save detailed run_example.py output to a timestamped log file -h, --help Show this help @@ -167,6 +167,20 @@ case "$RUNTIME" in ;; esac +if [[ -n "$EXAMPLE_FILTER" ]]; then + IFS=',' read -ra REQUESTED_EXAMPLES <<< "$EXAMPLE_FILTER" + FILTERED_ORDER=() + for requested in "${REQUESTED_EXAMPLES[@]}"; do + if [[ -n "${EXAMPLE_CASES[$requested]+x}" ]]; then + FILTERED_ORDER+=("$requested") + else + echo "ERROR: example '$requested' is not available for runtime '$RUNTIME'." 
+ exit 1 + fi + done + EXAMPLE_ORDER=("${FILTERED_ORDER[@]}") +fi + # --------------------------------------------------------------------------- # Resolve device log directory (mirrors run_example.py / device_log_resolver.py) # --------------------------------------------------------------------------- From 58d3c1a28ff7e4ede7a4142fb24382731e2007a7 Mon Sep 17 00:00:00 2001 From: Youwei Xiao Date: Sun, 5 Apr 2026 23:30:22 +0800 Subject: [PATCH 10/35] Add manual scope outer-write boundary test --- .../manual_scope_outer_multiwrite/golden.py | 46 +++++++++ .../kernels/kernel_config.py | 36 +++++++ .../manual_scope_outer_multiwrite_orch.cpp | 94 +++++++++++++++++++ tests/ut/test_manual_scope_boundary.py | 36 +++++++ 4 files changed, 212 insertions(+) create mode 100644 tests/st/a2a3/tensormap_and_ringbuffer/manual_scope_outer_multiwrite/golden.py create mode 100644 tests/st/a2a3/tensormap_and_ringbuffer/manual_scope_outer_multiwrite/kernels/kernel_config.py create mode 100644 tests/st/a2a3/tensormap_and_ringbuffer/manual_scope_outer_multiwrite/kernels/orchestration/manual_scope_outer_multiwrite_orch.cpp create mode 100644 tests/ut/test_manual_scope_boundary.py diff --git a/tests/st/a2a3/tensormap_and_ringbuffer/manual_scope_outer_multiwrite/golden.py b/tests/st/a2a3/tensormap_and_ringbuffer/manual_scope_outer_multiwrite/golden.py new file mode 100644 index 000000000..019102803 --- /dev/null +++ b/tests/st/a2a3/tensormap_and_ringbuffer/manual_scope_outer_multiwrite/golden.py @@ -0,0 +1,46 @@ +# Copyright (c) PyPTO Contributors. +# This program is free software, you can redistribute it and/or modify it under the terms and conditions of +# CANN Open Software License Agreement Version 2.0 (the "License"). +# Please refer to the License for details. You may not use this file except in compliance with the License. 
+# THIS SOFTWARE IS PROVIDED ON AN "AS IS" BASIS, WITHOUT WARRANTIES OF ANY KIND, EITHER EXPRESS OR IMPLIED, +# INCLUDING BUT NOT LIMITED TO NON-INFRINGEMENT, MERCHANTABILITY, OR FITNESS FOR A PARTICULAR PURPOSE. +# See LICENSE in the root of the software repository for the full text of the License. +# ----------------------------------------------------------------------------------------------------------- + +import torch + + +__outputs__ = ["out", "result", "check"] + +RTOL = 1e-5 +ATOL = 1e-5 + + +def generate_inputs(params: dict) -> list: + del params + size = 128 * 128 + a = torch.full((size,), 1.0, dtype=torch.float32) + b = torch.full((size,), 2.0, dtype=torch.float32) + out = torch.zeros(size, dtype=torch.float32) + result = torch.zeros(size, dtype=torch.float32) + check = torch.zeros(4, dtype=torch.float32) + return [ + ("a", a), + ("b", b), + ("out", out), + ("result", result), + ("check", check), + ] + + +def compute_golden(tensors: dict, params: dict) -> None: + del params + out = torch.as_tensor(tensors["out"]) + result = torch.as_tensor(tensors["result"]) + check = torch.as_tensor(tensors["check"]) + + out.fill_(5.0) + result.fill_(7.0) + check[0] = 5.0 + check[1] = 7.0 + check[2] = 5.0 diff --git a/tests/st/a2a3/tensormap_and_ringbuffer/manual_scope_outer_multiwrite/kernels/kernel_config.py b/tests/st/a2a3/tensormap_and_ringbuffer/manual_scope_outer_multiwrite/kernels/kernel_config.py new file mode 100644 index 000000000..81bbb5465 --- /dev/null +++ b/tests/st/a2a3/tensormap_and_ringbuffer/manual_scope_outer_multiwrite/kernels/kernel_config.py @@ -0,0 +1,36 @@ +# Copyright (c) PyPTO Contributors. +# This program is free software, you can redistribute it and/or modify it under the terms and conditions of +# CANN Open Software License Agreement Version 2.0 (the "License"). +# Please refer to the License for details. You may not use this file except in compliance with the License. 
+# THIS SOFTWARE IS PROVIDED ON AN "AS IS" BASIS, WITHOUT WARRANTIES OF ANY KIND, EITHER EXPRESS OR IMPLIED, +# INCLUDING BUT NOT LIMITED TO NON-INFRINGEMENT, MERCHANTABILITY, OR FITNESS FOR A PARTICULAR PURPOSE. +# See LICENSE in the root of the software repository for the full text of the License. +# ----------------------------------------------------------------------------------------------------------- + +from pathlib import Path + +from task_interface import ArgDirection as D # pyright: ignore[reportAttributeAccessIssue] + +_KERNELS_ROOT = Path(__file__).parent +_SCALAR_DATA_ROOT = _KERNELS_ROOT.parents[1] / "scalar_data_test" / "kernels" + +ORCHESTRATION = { + "source": str(_KERNELS_ROOT / "orchestration" / "manual_scope_outer_multiwrite_orch.cpp"), + "function_name": "aicpu_orchestration_entry", + "signature": [D.IN, D.IN, D.OUT, D.OUT, D.OUT], +} + +KERNELS = [ + { + "func_id": 0, + "source": str(_SCALAR_DATA_ROOT / "aiv" / "kernel_add.cpp"), + "core_type": "aiv", + "signature": [D.IN, D.IN, D.OUT], + }, +] + +RUNTIME_CONFIG = { + "runtime": "tensormap_and_ringbuffer", + "aicpu_thread_num": 4, + "block_dim": 3, +} diff --git a/tests/st/a2a3/tensormap_and_ringbuffer/manual_scope_outer_multiwrite/kernels/orchestration/manual_scope_outer_multiwrite_orch.cpp b/tests/st/a2a3/tensormap_and_ringbuffer/manual_scope_outer_multiwrite/kernels/orchestration/manual_scope_outer_multiwrite_orch.cpp new file mode 100644 index 000000000..7d363f9b2 --- /dev/null +++ b/tests/st/a2a3/tensormap_and_ringbuffer/manual_scope_outer_multiwrite/kernels/orchestration/manual_scope_outer_multiwrite_orch.cpp @@ -0,0 +1,94 @@ +/* + * Copyright (c) PyPTO Contributors. + * This program is free software, you can redistribute it and/or modify it under the terms and conditions of + * CANN Open Software License Agreement Version 2.0 (the "License"). + * Please refer to the License for details. You may not use this file except in compliance with the License. 
+ * THIS SOFTWARE IS PROVIDED ON AN "AS IS" BASIS, WITHOUT WARRANTIES OF ANY KIND, EITHER EXPRESS OR IMPLIED, + * INCLUDING BUT NOT LIMITED TO NON-INFRINGEMENT, MERCHANTABILITY, OR FITNESS FOR A PARTICULAR PURPOSE. + * See LICENSE in the root of the software repository for the full text of the License. + * ----------------------------------------------------------------------------------------------------------- + */ + +#include <cstdint> + +#include "pto_orchestration_api.h" // NOLINT(build/include_subdir) + +#define FUNC_ADD 0 + +extern "C" { + +__attribute__((visibility("default"))) PTO2OrchestrationConfig aicpu_orchestration_config( + const ChipStorageTaskArgs &orch_args) { + (void)orch_args; // NOLINT(readability/casting) + return PTO2OrchestrationConfig{ + .expected_arg_count = 5, + }; +} + +__attribute__((visibility("default"))) void aicpu_orchestration_entry( + const ChipStorageTaskArgs &orch_args, int orch_thread_num, int orch_thread_index) { + (void)orch_thread_num; // NOLINT(readability/casting) + (void)orch_thread_index; // NOLINT(readability/casting) + + Tensor ext_a = from_tensor_arg(orch_args.tensor(0)); + Tensor ext_b = from_tensor_arg(orch_args.tensor(1)); + Tensor ext_out = from_tensor_arg(orch_args.tensor(2)); + Tensor ext_result = from_tensor_arg(orch_args.tensor(3)); + Tensor ext_check = from_tensor_arg(orch_args.tensor(4)); + + uint32_t size = orch_args.tensor(0).shapes[0]; + uint32_t inter_shapes[1] = {size}; + TensorCreateInfo inter_ci(inter_shapes, 1, DataType::FLOAT32); + + PTO2_SCOPE() { + PTO2_SCOPE(PTO2ScopeMode::MANUAL) { + Arg tmp0_args; + tmp0_args.add_input(ext_a); + tmp0_args.add_input(ext_a); + tmp0_args.add_output(inter_ci); + PTO2ManualSubmitResult tmp0 = pto2_rt_submit_aiv_task_manual(FUNC_ADD, tmp0_args); + + Arg write0_args; + write0_args.add_input(tmp0.outputs.get_ref(0)); + write0_args.add_input(ext_a); + write0_args.add_output(ext_out); + PTO2ManualSubmitResult write0 = pto2_rt_submit_aiv_task_manual(FUNC_ADD, write0_args); +
pto2_rt_add_dependency(tmp0.task_id, write0.task_id); + + Arg tmp1_args; + tmp1_args.add_input(ext_b); + tmp1_args.add_input(ext_b); + tmp1_args.add_output(inter_ci); + PTO2ManualSubmitResult tmp1 = pto2_rt_submit_aiv_task_manual(FUNC_ADD, tmp1_args); + + Arg write1_args; + write1_args.add_input(tmp1.outputs.get_ref(0)); + write1_args.add_input(ext_a); + write1_args.add_output(ext_out); + PTO2ManualSubmitResult write1 = pto2_rt_submit_aiv_task_manual(FUNC_ADD, write1_args); + pto2_rt_add_dependency(tmp1.task_id, write1.task_id); + pto2_rt_add_dependency(write0.task_id, write1.task_id); + } + + Arg consumer_args; + consumer_args.add_input(ext_out); + consumer_args.add_input(ext_b); + consumer_args.add_output(ext_result); + pto2_rt_submit_aiv_task(FUNC_ADD, consumer_args); + + uint32_t idx0[1] = {0}; + uint32_t idx100[1] = {100}; + + float out0 = get_tensor_data(ext_out, 1, idx0); + float result0 = get_tensor_data(ext_result, 1, idx0); + float out100 = get_tensor_data(ext_out, 1, idx100); + + idx0[0] = 0; + set_tensor_data(ext_check, 1, idx0, out0); + idx0[0] = 1; + set_tensor_data(ext_check, 1, idx0, result0); + idx0[0] = 2; + set_tensor_data(ext_check, 1, idx0, out100); + } +} +} diff --git a/tests/ut/test_manual_scope_boundary.py b/tests/ut/test_manual_scope_boundary.py new file mode 100644 index 000000000..a7028178d --- /dev/null +++ b/tests/ut/test_manual_scope_boundary.py @@ -0,0 +1,36 @@ +import os +import subprocess +import sys +from pathlib import Path + +import pytest + + +PROJECT_ROOT = Path(__file__).parent.parent.parent +RUN_EXAMPLE = PROJECT_ROOT / "examples" / "scripts" / "run_example.py" +KERNELS_DIR = ( + PROJECT_ROOT / "tests" / "st" / "a2a3" / "tensormap_and_ringbuffer" / "manual_scope_outer_multiwrite" / "kernels" +) +GOLDEN = PROJECT_ROOT / "tests" / "st" / "a2a3" / "tensormap_and_ringbuffer" / "manual_scope_outer_multiwrite" / "golden.py" +PTO_ISA_COMMIT = "6622890" + + +@pytest.mark.requires_hardware +@pytest.mark.skipif(not 
os.getenv("ASCEND_HOME_PATH"), reason="ASCEND_HOME_PATH not set; Ascend toolkit required") +def test_manual_scope_outer_multiwrite_boundary(): + device_id = os.environ.get("PTO_TEST_DEVICE_ID", "0") + command = ( + f"source {os.environ['ASCEND_HOME_PATH']}/bin/setenv.bash >/dev/null 2>&1 && " + f"{sys.executable} {RUN_EXAMPLE} --build --silent " + f"-k {KERNELS_DIR} -g {GOLDEN} -p a2a3 -d {device_id} " + f"--clone-protocol https -c {PTO_ISA_COMMIT}" + ) + result = subprocess.run( + ["bash", "-lc", command], + cwd=PROJECT_ROOT, + capture_output=True, + text=True, + check=False, + ) + + assert result.returncode == 0, result.stdout + result.stderr From eed20b57ced33dddd143eb4db21fb6009ac6722b Mon Sep 17 00:00:00 2001 From: Youwei Xiao Date: Mon, 6 Apr 2026 00:11:21 +0800 Subject: [PATCH 11/35] Support: add partial-manual benchmark selector - add a benchmark_rounds runtime selector for the partial-manual scenes - keep the current tensormap runtime on the direct selector path - record fresh 2026-04-06 hardware comparison data in the design doc --- docs/manual-dep-for-tensormap-design.md | 79 +++++++++++++++++++++++++ tools/benchmark_rounds.sh | 26 +++++++- 2 files changed, 102 insertions(+), 3 deletions(-) diff --git a/docs/manual-dep-for-tensormap-design.md b/docs/manual-dep-for-tensormap-design.md index 35f3ddcf8..5b1e8870d 100644 --- a/docs/manual-dep-for-tensormap-design.md +++ b/docs/manual-dep-for-tensormap-design.md @@ -969,6 +969,85 @@ Use a small example first, such as vector-style or BGEMM-style, to validate: Only then move to more complex orchestration such as paged attention.
+## Fresh Hardware Benchmark + +Fresh benchmark data was rerun on real hardware on 2026-04-06 with: + +- platform: `a2a3` +- device: `2` +- rounds: `5` +- pinned PTO-ISA commit: `6622890` +- runner: `tools/benchmark_rounds.sh` + +The four compared variants are: + +- `aicpu_build_graph` +- `tensormap_and_ringbuffer_unmodified` +- `tensormap_and_ringbuffer` +- `tensormap_and_ringbuffer_partial_manual` + +`tensormap_and_ringbuffer` is the current/new AUTO-path runtime under evaluation. +`tensormap_and_ringbuffer_partial_manual` is the same runtime tree, but benchmarked +through the `_partial_manual` paged-attention scenes. + +### Benchmark Script Selectors + +The benchmark wrapper enables the variants as follows: + +- `./tools/benchmark_rounds.sh -r tensormap_and_ringbuffer` + - benchmarks `tests/st/a2a3/tensormap_and_ringbuffer/paged_attention` + - benchmarks `tests/st/a2a3/tensormap_and_ringbuffer/paged_attention_unroll` +- `./tools/benchmark_rounds.sh -r tensormap_and_ringbuffer_unmodified` + - benchmarks `tests/st/a2a3/tensormap_and_ringbuffer_unmodified/paged_attention` + - benchmarks `tests/st/a2a3/tensormap_and_ringbuffer_unmodified/paged_attention_unroll` +- `./tools/benchmark_rounds.sh -r tensormap_and_ringbuffer_partial_manual` + - uses the same ST root as `tensormap_and_ringbuffer` + - benchmarks `tests/st/a2a3/tensormap_and_ringbuffer/paged_attention_partial_manual` + - benchmarks `tests/st/a2a3/tensormap_and_ringbuffer/paged_attention_unroll_partial_manual` +- `./tools/benchmark_rounds.sh -r aicpu_build_graph` + - benchmarks `tests/st/a2a3/aicpu_build_graph/paged_attention` + - benchmarks `tests/st/a2a3/aicpu_build_graph/paged_attention_unroll` + +That means the current/new AUTO runtime is already selected directly by `-r +tensormap_and_ringbuffer`, while the manual-mode comparison is selected by `-r +tensormap_and_ringbuffer_partial_manual`. + +### Fresh Results + +Units below are `elapsed_us (orch_us)`. 
+ +| Workload | Case | `aicpu_build_graph` | `tensormap_and_ringbuffer_unmodified` | `tensormap_and_ringbuffer` | `tensormap_and_ringbuffer_partial_manual` | +| --- | --- | --- | --- | --- | --- | +| `paged_attention` | `Case1` | `31129.9 (-)` | `35789.6 (35788.6)` | `36643.8 (36643.0)` | `41611.6 (41605.3)` | +| `paged_attention` | `Case2` | `16348.7 (-)` | `18884.1 (18883.4)` | `18873.5 (18872.6)` | `21047.6 (20983.7)` | +| `paged_attention_unroll` | `Case1` | `1431.7 (-)` | `1320.5 (827.5)` | `1317.2 (833.3)` | `1329.8 (946.2)` | +| `paged_attention_unroll` | `Case2` | `716.7 (-)` | `630.5 (375.9)` | `633.5 (371.4)` | `649.1 (429.3)` | + +### Benchmark Takeaways + +1. The current/new AUTO-path runtime is close to the unmodified runtime on three of + the four fresh cells. + - `paged_attention/Case2`: effectively flat + - `paged_attention_unroll/Case1`: slightly faster end-to-end, slightly slower in orch + - `paged_attention_unroll/Case2`: slightly slower end-to-end, slightly faster in orch + +2. The remaining AUTO-path gap is the heavy `paged_attention/Case1` cell. + - unmodified: `35789.6 us` + - current/new: `36643.8 us` + - regression: about `+2.4%` + +3. Partial-manual mode is still materially slower on the heavy paged-attention scene. + - `paged_attention/Case1`: about `+16.3%` vs unmodified + - `paged_attention/Case2`: about `+11.5%` vs unmodified + +4. Partial-manual mode also adds visible orch cost on the unroll scene even when + elapsed stays close. + - `paged_attention_unroll/Case1`: `946.2 us` orch vs `827.5 us` unmodified + - `paged_attention_unroll/Case2`: `429.3 us` orch vs `375.9 us` unmodified + +5. `aicpu_build_graph` remains fastest on the heavy `paged_attention` scene, but it is + slower than the tensormap runtimes on both `paged_attention_unroll` cells. + ## Main Risks 1. Treating manual scope as a global TensorMap disable switch. 
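The percentage figures quoted in the takeaways can be reproduced directly from the table cells. A quick arithmetic check — `pct` is a local helper for this note, and the values are copied from the elapsed column above:

```python
def pct(new, base):
    """Relative regression of `new` versus `base`, in percent."""
    return (new - base) / base * 100.0

# paged_attention/Case1: current/new AUTO path vs unmodified -> about +2.4%
auto_gap = pct(36643.8, 35789.6)

# Partial-manual vs unmodified on the heavy paged-attention scene:
manual_gap_c1 = pct(41611.6, 35789.6)   # Case1 -> about +16.3%
manual_gap_c2 = pct(21047.6, 18884.1)   # Case2 -> about +11.5%
```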
diff --git a/tools/benchmark_rounds.sh b/tools/benchmark_rounds.sh index f166cbc32..f3a19b9a1 100755 --- a/tools/benchmark_rounds.sh +++ b/tools/benchmark_rounds.sh @@ -49,6 +49,16 @@ TMR_UNMODIFIED_EXAMPLE_ORDER=( paged_attention_unroll ) +# --- tensormap_and_ringbuffer_partial_manual --- +declare -A TMR_PARTIAL_MANUAL_EXAMPLE_CASES=( + [paged_attention_partial_manual]="Case1,Case2" + [paged_attention_unroll_partial_manual]="Case1,Case2" +) +TMR_PARTIAL_MANUAL_EXAMPLE_ORDER=( + paged_attention_partial_manual + paged_attention_unroll_partial_manual +) + # --------------------------------------------------------------------------- # Parse arguments # --------------------------------------------------------------------------- @@ -97,7 +107,10 @@ Options: -p, --platform Platform to run on (default: a2a3) -d, --device Device ID (default: 0) -n, --rounds Override number of rounds for each example (default: 100) - -r, --runtime Runtime to benchmark: tensormap_and_ringbuffer (default), tensormap_and_ringbuffer_unmodified, aicpu_build_graph + -r, --runtime Runtime to benchmark: tensormap_and_ringbuffer (default), + tensormap_and_ringbuffer_unmodified, + tensormap_and_ringbuffer_partial_manual, + aicpu_build_graph -e, --examples Comma-separated example names to run (default: runtime-specific full list) -v, --verbose Save detailed run_example.py output to a timestamped log file -h, --help Show this help @@ -138,7 +151,7 @@ vlog() { # --------------------------------------------------------------------------- # Derive arch from platform and set examples directory # --------------------------------------------------------------------------- -EXAMPLES_DIR="$PROJECT_ROOT/tests/st/${PLATFORM}/${RUNTIME}" +TESTS_RUNTIME_DIR="$RUNTIME" # Clock frequency (MHz) for converting cycle counts to microseconds case "$PLATFORM" in @@ -157,16 +170,23 @@ case "$RUNTIME" in declare -n EXAMPLE_CASES=TMR_UNMODIFIED_EXAMPLE_CASES EXAMPLE_ORDER=("${TMR_UNMODIFIED_EXAMPLE_ORDER[@]}") ;; + 
tensormap_and_ringbuffer_partial_manual)
+        TESTS_RUNTIME_DIR="tensormap_and_ringbuffer"
+        declare -n EXAMPLE_CASES=TMR_PARTIAL_MANUAL_EXAMPLE_CASES
+        EXAMPLE_ORDER=("${TMR_PARTIAL_MANUAL_EXAMPLE_ORDER[@]}")
+        ;;
     aicpu_build_graph)
         declare -n EXAMPLE_CASES=ABG_EXAMPLE_CASES
         EXAMPLE_ORDER=("${ABG_EXAMPLE_ORDER[@]}")
         ;;
     *)
-        echo "ERROR: unknown runtime '$RUNTIME'. Use tensormap_and_ringbuffer, tensormap_and_ringbuffer_unmodified, or aicpu_build_graph."
+        echo "ERROR: unknown runtime '$RUNTIME'. Use tensormap_and_ringbuffer, tensormap_and_ringbuffer_unmodified, tensormap_and_ringbuffer_partial_manual, or aicpu_build_graph."
         exit 1
         ;;
 esac
 
+EXAMPLES_DIR="$PROJECT_ROOT/tests/st/${PLATFORM}/${TESTS_RUNTIME_DIR}"
+
 if [[ -n "$EXAMPLE_FILTER" ]]; then
     IFS=',' read -ra REQUESTED_EXAMPLES <<< "$EXAMPLE_FILTER"
     FILTERED_ORDER=()

From a3b96ba82dbb6d5c965fb8d8068f89251895c11b Mon Sep 17 00:00:00 2001
From: Youwei Xiao
Date: Mon, 6 Apr 2026 00:14:11 +0800
Subject: [PATCH 12/35] Update: refresh manual-dep benchmark data

- replace the interim comparison table with the newest fresh device-2 reruns
- keep the benchmark workflow section aligned with the partial-manual selector
- record the remaining AUTO-path and partial-manual performance gaps
---
 docs/manual-dep-for-tensormap-design.md | 32 ++++++++++++++-----------
 1 file changed, 18 insertions(+), 14 deletions(-)

diff --git a/docs/manual-dep-for-tensormap-design.md b/docs/manual-dep-for-tensormap-design.md
index 5b1e8870d..7d8ff82c8 100644
--- a/docs/manual-dep-for-tensormap-design.md
+++ b/docs/manual-dep-for-tensormap-design.md
@@ -1018,34 +1018,38 @@ Units below are `elapsed_us (orch_us)`.
| Workload | Case | `aicpu_build_graph` | `tensormap_and_ringbuffer_unmodified` | `tensormap_and_ringbuffer` | `tensormap_and_ringbuffer_partial_manual` | | --- | --- | --- | --- | --- | --- | -| `paged_attention` | `Case1` | `31129.9 (-)` | `35789.6 (35788.6)` | `36643.8 (36643.0)` | `41611.6 (41605.3)` | -| `paged_attention` | `Case2` | `16348.7 (-)` | `18884.1 (18883.4)` | `18873.5 (18872.6)` | `21047.6 (20983.7)` | -| `paged_attention_unroll` | `Case1` | `1431.7 (-)` | `1320.5 (827.5)` | `1317.2 (833.3)` | `1329.8 (946.2)` | -| `paged_attention_unroll` | `Case2` | `716.7 (-)` | `630.5 (375.9)` | `633.5 (371.4)` | `649.1 (429.3)` | +| `paged_attention` | `Case1` | `31864.8 (-)` | `36218.5 (36217.7)` | `36643.8 (36643.0)` | `41611.6 (41605.3)` | +| `paged_attention` | `Case2` | `16295.2 (-)` | `18325.8 (18324.7)` | `18873.5 (18872.6)` | `21047.6 (20983.7)` | +| `paged_attention_unroll` | `Case1` | `1431.7 (-)` | `1321.8 (815.7)` | `1317.2 (833.3)` | `1329.8 (946.2)` | +| `paged_attention_unroll` | `Case2` | `716.7 (-)` | `629.6 (388.7)` | `633.5 (371.4)` | `649.1 (429.3)` | ### Benchmark Takeaways 1. The current/new AUTO-path runtime is close to the unmodified runtime on three of the four fresh cells. - - `paged_attention/Case2`: effectively flat - `paged_attention_unroll/Case1`: slightly faster end-to-end, slightly slower in orch - `paged_attention_unroll/Case2`: slightly slower end-to-end, slightly faster in orch 2. The remaining AUTO-path gap is the heavy `paged_attention/Case1` cell. - - unmodified: `35789.6 us` + - unmodified: `36218.5 us` - current/new: `36643.8 us` - - regression: about `+2.4%` + - regression: about `+1.2%` -3. Partial-manual mode is still materially slower on the heavy paged-attention scene. - - `paged_attention/Case1`: about `+16.3%` vs unmodified - - `paged_attention/Case2`: about `+11.5%` vs unmodified +3. `paged_attention/Case2` still shows a measurable AUTO-path regression. 
+ - unmodified: `18325.8 us` + - current/new: `18873.5 us` + - regression: about `+3.0%` -4. Partial-manual mode also adds visible orch cost on the unroll scene even when +4. Partial-manual mode is still materially slower on the heavy paged-attention scene. + - `paged_attention/Case1`: about `+14.9%` vs unmodified + - `paged_attention/Case2`: about `+14.8%` vs unmodified + +5. Partial-manual mode also adds visible orch cost on the unroll scene even when elapsed stays close. - - `paged_attention_unroll/Case1`: `946.2 us` orch vs `827.5 us` unmodified - - `paged_attention_unroll/Case2`: `429.3 us` orch vs `375.9 us` unmodified + - `paged_attention_unroll/Case1`: `946.2 us` orch vs `815.7 us` unmodified + - `paged_attention_unroll/Case2`: `429.3 us` orch vs `388.7 us` unmodified -5. `aicpu_build_graph` remains fastest on the heavy `paged_attention` scene, but it is +6. `aicpu_build_graph` remains fastest on the heavy `paged_attention` scene, but it is slower than the tensormap runtimes on both `paged_attention_unroll` cells. 
## Main Risks From 578747043324cf28a6d5db2d44019b5fd8a5d608 Mon Sep 17 00:00:00 2001 From: Youwei Xiao Date: Mon, 6 Apr 2026 01:01:55 +0800 Subject: [PATCH 13/35] Update: cut manual scope overhead - cache manual-local tensor classification to avoid repeated scope scans - fuse manual publish and scope-end release into one scheduler pass - limit manual scope sync to active rings and keep submit-path work deferred - chunk paged_attention_partial_manual scopes and add carry deps between updates --- .../runtime/pto_orchestrator.cpp | 122 ++++++++-------- .../runtime/pto_orchestrator.h | 1 + .../runtime/pto_scheduler.h | 27 ++++ .../orchestration/paged_attention_orch.cpp | 132 ++++++++++-------- 4 files changed, 164 insertions(+), 118 deletions(-) diff --git a/src/a2a3/runtime/tensormap_and_ringbuffer/runtime/pto_orchestrator.cpp b/src/a2a3/runtime/tensormap_and_ringbuffer/runtime/pto_orchestrator.cpp index 7c410de5c..6930bcdbb 100644 --- a/src/a2a3/runtime/tensormap_and_ringbuffer/runtime/pto_orchestrator.cpp +++ b/src/a2a3/runtime/tensormap_and_ringbuffer/runtime/pto_orchestrator.cpp @@ -612,7 +612,7 @@ void pto2_scope_end(PTO2OrchestratorState *orch) { return; } - for (int32_t ring = 0; ring < PTO2_MAX_RING_DEPTH; ring++) { + for (int32_t ring = 0; ring <= orch->current_ring_id(); ring++) { int32_t sm_last_task_alive = orch->sm_handle->header->rings[ring].fc.last_task_alive.load(std::memory_order_acquire); orch->tensor_map.sync_tensormap(static_cast(ring), sm_last_task_alive); @@ -656,8 +656,7 @@ void pto2_scope_end(PTO2OrchestratorState *orch) { } const Tensor &tensor = payload->tensors[t]; - bool manual_local_tensor = task_owned_by_current_manual_scope(orch, tensor.owner_task_id); - if (manual_local_tensor) { + if ((meta.manual_local_mask & static_cast(1u << t)) != 0) { continue; } @@ -746,18 +745,14 @@ void pto2_scope_end(PTO2OrchestratorState *orch) { continue; } const Tensor &tensor = payload->tensors[t]; - if (task_owned_by_current_manual_scope(orch, 
tensor.owner_task_id) || tensor.manual_dep) { + if ((meta.manual_local_mask & static_cast(1u << t)) != 0 || tensor.manual_dep) { continue; } orch->tensor_map.insert(tensor, task_id); } - } - - orch->scheduler->publish_manual_scope_tasks(&orch->scope_tasks[begin], count); - } - if (orch->scheduler && count > 0) { - orch->scheduler->on_scope_end(&orch->scope_tasks[begin], count); + } + orch->scheduler->publish_manual_scope_tasks_and_end_scope(&orch->scope_tasks[begin], count); } // Rewind the task buffer — these entries are no longer needed @@ -862,7 +857,6 @@ static TaskOutputTensors pto2_submit_mixed_task_impl( #endif // === STEP 2: Sync TensorMap validity and optional cleanup === - // Read current last_task_alive from shared memory for this ring int32_t sm_last_task_alive = fc.last_task_alive.load(std::memory_order_acquire); orch->tensor_map.sync_tensormap(ring_id, sm_last_task_alive); @@ -874,54 +868,57 @@ static TaskOutputTensors pto2_submit_mixed_task_impl( CYCLE_COUNT_LAP_RECORD(g_orch_sync_cycle, AicpuPhaseId::ORCH_SYNC, task_id.raw); // === STEP 3: Lookup inputs + materialize runtime-created outputs === - if constexpr (!kManualSubmit) { - for (int i = 0; i < args.tensor_count(); i++) { - TensorArgType ptype = args.tag(i); - if (ptype == TensorArgType::OUTPUT) { - // Runtime-created OUTPUT tensors are not looked up in the TensorMap since they have no dependencies. - continue; - } - - const Tensor *tensor = args.tensor(i).ptr; - - // Step A: creator retention — all existing tensors extend their creator lifetime. 
- PTO2TaskId owner = tensor->owner_task_id; - if (owner.is_valid() && sched != nullptr) { - PTO2TaskSlotState *prod_state = - &sched->ring_sched_states[owner.ring()].get_slot_state_by_task_id(owner.local()); - if (!pto2_append_fanin_or_fail( - orch, task_id, i, ptype, prod_state, &fanin_builder, sched, fc, ring_id, "creator retention" - )) { - return result; - } - } + for (int i = 0; i < args.tensor_count(); i++) { + TensorArgType ptype = args.tag(i); + if (ptype == TensorArgType::OUTPUT) { + // Runtime-created OUTPUT tensors are not looked up in the TensorMap since they have no dependencies. + continue; + } - // Step B: only INPUT/INOUT need modifier dependency lookup. - if (ptype != TensorArgType::INPUT && ptype != TensorArgType::INOUT) { + const Tensor *tensor = args.tensor(i).ptr; + if constexpr (kManualSubmit) { + if (task_owned_by_current_manual_scope(orch, tensor->owner_task_id)) { continue; } - if (tensor->manual_dep) { - continue; + } + + // Step A: creator retention — all existing tensors extend their creator lifetime. + PTO2TaskId owner = tensor->owner_task_id; + if (owner.is_valid() && sched != nullptr) { + PTO2TaskSlotState *prod_state = + &sched->ring_sched_states[owner.ring()].get_slot_state_by_task_id(owner.local()); + if (!pto2_append_fanin_or_fail( + orch, task_id, i, ptype, prod_state, &fanin_builder, sched, fc, ring_id, "creator retention" + )) { + return result; } + } - PTO2LookupResult lookup_result; - orch->tensor_map.lookup(*tensor, lookup_result); + // Step B: only INPUT/INOUT need modifier dependency lookup. 
+ if (ptype != TensorArgType::INPUT && ptype != TensorArgType::INOUT) { + continue; + } + if (tensor->manual_dep) { + continue; + } - for (int r = 0; r < lookup_result.count; r++) { - PTO2TensorMapEntry &entry = *lookup_result.entries[r].entry; - auto overlap_status = lookup_result.entries[r].overlap_status; - auto prod_ring = entry.producer_task_id.ring(); - auto prod_local = entry.producer_task_id.local(); - PTO2TaskSlotState *prod_state = - &sched->ring_sched_states[prod_ring].get_slot_state_by_task_id(prod_local); - if (!pto2_append_fanin_or_fail( - orch, task_id, i, ptype, prod_state, &fanin_builder, sched, fc, ring_id, "overlap lookup" - )) { - return result; - } - if (ptype == TensorArgType::INOUT && overlap_status == OverlapStatus::COVERED) { - orch->tensor_map.remove_entry(entry); - } + PTO2LookupResult lookup_result; + orch->tensor_map.lookup(*tensor, lookup_result); + + for (int r = 0; r < lookup_result.count; r++) { + PTO2TensorMapEntry &entry = *lookup_result.entries[r].entry; + auto overlap_status = lookup_result.entries[r].overlap_status; + auto prod_ring = entry.producer_task_id.ring(); + auto prod_local = entry.producer_task_id.local(); + PTO2TaskSlotState *prod_state = + &sched->ring_sched_states[prod_ring].get_slot_state_by_task_id(prod_local); + if (!pto2_append_fanin_or_fail( + orch, task_id, i, ptype, prod_state, &fanin_builder, sched, fc, ring_id, "overlap lookup" + )) { + return result; + } + if (ptype == TensorArgType::INOUT && overlap_status == OverlapStatus::COVERED) { + orch->tensor_map.remove_entry(entry); } } } @@ -929,14 +926,17 @@ static TaskOutputTensors pto2_submit_mixed_task_impl( CYCLE_COUNT_LAP_RECORD(g_orch_lookup_cycle, AicpuPhaseId::ORCH_LOOKUP, task_id.raw); // === STEP 4: Register outputs/inouts in TensorMap (must be separate from lookup) === - if constexpr (!kManualSubmit) { - for (int i = 0; i < args.tensor_count(); i++) { - TensorArgType ptype = args.tag(i); - if (ptype == TensorArgType::INOUT || ptype == 
TensorArgType::OUTPUT_EXISTING) { - if (!args.tensor(i).ptr->manual_dep) { - orch->tensor_map.insert(*args.tensor(i).ptr, task_id); + for (int i = 0; i < args.tensor_count(); i++) { + TensorArgType ptype = args.tag(i); + if (ptype == TensorArgType::INOUT || ptype == TensorArgType::OUTPUT_EXISTING) { + if constexpr (kManualSubmit) { + if (task_owned_by_current_manual_scope(orch, args.tensor(i).ptr->owner_task_id)) { + continue; } } + if (!args.tensor(i).ptr->manual_dep) { + orch->tensor_map.insert(*args.tensor(i).ptr, task_id); + } } } @@ -1186,8 +1186,12 @@ pto2_submit_mixed_task_manual(PTO2OrchestratorState *orch, const MixedKernels &m meta.scope_task_index = orch->scope_tasks_size - 1 - current_manual_scope_begin(orch); meta.incoming_edge_head = -1; meta.tensor_count = static_cast(args.tensor_count()); + meta.manual_local_mask = 0; for (int32_t i = 0; i < args.tensor_count(); i++) { meta.tags[i] = static_cast(args.tag(i)); + if (task_owned_by_current_manual_scope(orch, meta.slot_state->payload->tensors[i].owner_task_id)) { + meta.manual_local_mask |= static_cast(1u << i); + } } manual_task_meta_push(orch, meta); return result; diff --git a/src/a2a3/runtime/tensormap_and_ringbuffer/runtime/pto_orchestrator.h b/src/a2a3/runtime/tensormap_and_ringbuffer/runtime/pto_orchestrator.h index 430f86604..0a6090d27 100644 --- a/src/a2a3/runtime/tensormap_and_ringbuffer/runtime/pto_orchestrator.h +++ b/src/a2a3/runtime/tensormap_and_ringbuffer/runtime/pto_orchestrator.h @@ -45,6 +45,7 @@ struct PTO2ManualTaskMeta { int32_t scope_task_index; int32_t incoming_edge_head; uint8_t tensor_count; + uint16_t manual_local_mask; uint8_t tags[MAX_TENSOR_ARGS]; }; diff --git a/src/a2a3/runtime/tensormap_and_ringbuffer/runtime/pto_scheduler.h b/src/a2a3/runtime/tensormap_and_ringbuffer/runtime/pto_scheduler.h index 91c235308..bd451a221 100644 --- a/src/a2a3/runtime/tensormap_and_ringbuffer/runtime/pto_scheduler.h +++ b/src/a2a3/runtime/tensormap_and_ringbuffer/runtime/pto_scheduler.h @@ 
-664,6 +664,33 @@ struct PTO2SchedulerState { } } + void publish_manual_scope_tasks_and_end_scope(PTO2TaskSlotState **task_slot_states, int32_t count) { +#if PTO2_ORCH_PROFILING + extern uint64_t g_orch_scope_end_atomic_count; +#endif + if (count > 0) __builtin_prefetch(task_slot_states[0], 1, 0); + for (int32_t i = 0; i < count; i++) { + if (i + 1 < count) __builtin_prefetch(task_slot_states[i + 1], 1, 0); + PTO2TaskSlotState &slot_state = *task_slot_states[i]; + int32_t new_rc = slot_state.fanin_refcount.fetch_add(1, std::memory_order_acq_rel) + 1; +#if PTO2_ORCH_PROFILING + g_orch_scope_end_atomic_count += 1; // fanin_refcount.fetch_add +#endif + if (new_rc >= slot_state.fanin_count) { + PTO2ResourceShape shape = pto2_active_mask_to_shape(slot_state.active_mask); +#if PTO2_ORCH_PROFILING + g_orch_scope_end_atomic_count += 1; // ready queue push lock/CAS path +#endif + ready_queues[static_cast(shape)].push(&slot_state); + } +#if PTO2_ORCH_PROFILING + release_producer(slot_state, g_orch_scope_end_atomic_count); +#else + release_producer(slot_state); +#endif + } + } + /** * Subtask completion: atomic counter model. * Called when a single subtask (AIC, AIV0, or AIV1) finishes on any block. 
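The fused `publish_manual_scope_tasks_and_end_scope` pass above releases each task's scope-held fanin reference and, when that release satisfies the last pending fanin, pushes the slot straight to a ready queue. A toy single-threaded Python model of that readiness rule — illustrative only; the real scheduler uses an atomic `fetch_add` on `fanin_refcount`, shape-indexed ready queues, and a paired `release_producer` call that this sketch omits:

```python
class SlotState:
    """Minimal stand-in for PTO2TaskSlotState: counts satisfied fanins."""
    def __init__(self, fanin_count, satisfied=0):
        self.fanin_count = fanin_count    # references needed before running
        self.fanin_refcount = satisfied   # references already satisfied

def publish_and_end_scope(slots, ready_queue):
    for slot in slots:
        slot.fanin_refcount += 1          # release the scope-held reference
        if slot.fanin_refcount >= slot.fanin_count:
            ready_queue.append(slot)      # last dependency just satisfied

ready = []
a = SlotState(fanin_count=1)              # only waited on its scope
b = SlotState(fanin_count=3, satisfied=1) # still waits on producer tasks
publish_and_end_scope([a, b], ready)
# a becomes ready immediately; b keeps waiting for its remaining producers
```

The point of fusing the two loops is that the scope-end release and the readiness check touch the same slot state, so one pass (with the prefetch hints in the diff) avoids walking the scope's task list twice.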
diff --git a/tests/st/a2a3/tensormap_and_ringbuffer/paged_attention_partial_manual/kernels/orchestration/paged_attention_orch.cpp b/tests/st/a2a3/tensormap_and_ringbuffer/paged_attention_partial_manual/kernels/orchestration/paged_attention_orch.cpp index b067f2fe8..4a30fe54a 100644 --- a/tests/st/a2a3/tensormap_and_ringbuffer/paged_attention_partial_manual/kernels/orchestration/paged_attention_orch.cpp +++ b/tests/st/a2a3/tensormap_and_ringbuffer/paged_attention_partial_manual/kernels/orchestration/paged_attention_orch.cpp @@ -20,6 +20,7 @@ #define FUNC_ONLINE_UPDATE 3 #define FUNC_AIC_HUB 4 #define FUNC_AIV_HUB 5 +#define N_MANUAL_CHUNK 2 extern "C" { @@ -100,67 +101,80 @@ aicpu_orchestration_entry(const ChipStorageTaskArgs &orch_args, int orch_thread_ const Tensor &li_update = hub_outs.get_ref(1); const Tensor &mi_update = hub_outs.get_ref(2); - for (uint64_t bn = 0; bn < bn_this_batch; bn++) { - uint64_t cur_block_idx = host_block_table[b_idx * block_num + bn]; - uint64_t valid_len = std::min(block_size, cur_seq - bn * block_size); - - uint32_t kv_shapes[2] = {static_cast(block_size), static_cast(head_dim)}; - uint32_t kv_offsets[2] = {static_cast(cur_block_idx * block_size), 0}; - Tensor kj = key_cache.view(kv_shapes, kv_offsets); - Tensor vj = value_cache.view(kv_shapes, kv_offsets); + for (uint64_t bn = 0; bn < bn_this_batch; bn += N_MANUAL_CHUNK) { + uint64_t bn_end = std::min(bn + static_cast(N_MANUAL_CHUNK), bn_this_batch); PTO2_SCOPE(PTO2ScopeMode::MANUAL) { - Arg params_qk; - params_qk.add_input(qi); - params_qk.add_input(kj); - params_qk.add_output(sij_ci); - PTO2ManualSubmitResult qk_outs = pto2_rt_submit_aic_task_manual(FUNC_QK_MATMUL, params_qk); - const Tensor &sij = qk_outs.outputs.get_ref(0); - - uint32_t sij_valid_shapes[2] = { - static_cast(q_tile), static_cast(valid_len) - }; - uint32_t sij_valid_offsets[2] = {0, 0}; - Tensor sij_valid = sij.view(sij_valid_shapes, sij_valid_offsets); - - Arg params_sf; - params_sf.add_input(sij_valid); - 
params_sf.add_output(pij_f16_ci); - params_sf.add_output(scalar_ci); - params_sf.add_output(scalar_ci); - params_sf.add_scalar(scale_value); - PTO2ManualSubmitResult sf_outs = - pto2_rt_submit_aiv_task_manual(FUNC_SOFTMAX_PREPARE, params_sf); - const Tensor &pij_f16 = sf_outs.outputs.get_ref(0); - const Tensor &mi = sf_outs.outputs.get_ref(1); - const Tensor &li = sf_outs.outputs.get_ref(2); - - Arg params_pv; - params_pv.add_input(pij_f16); - params_pv.add_input(vj); - params_pv.add_output(tile2d_ci); - PTO2ManualSubmitResult pv_outs = pto2_rt_submit_aic_task_manual(FUNC_PV_MATMUL, params_pv); - const Tensor &oi_tmp = pv_outs.outputs.get_ref(0); - - uint64_t is_first = (bn == 0) ? 1 : 0; - uint64_t is_last = (bn == bn_this_batch - 1) ? 1 : 0; - - Arg params_up; - params_up.add_input(mi); - params_up.add_input(li); - params_up.add_input(oi_tmp); - params_up.add_inout(mi_update); - params_up.add_inout(li_update); - params_up.add_inout(oi); - params_up.add_inout(out_view); - params_up.add_scalar(is_first); - params_up.add_scalar(is_last); - PTO2ManualSubmitResult up_outs = pto2_rt_submit_aiv_task_manual(FUNC_ONLINE_UPDATE, params_up); - - pto2_rt_add_dependency(qk_outs.task_id, sf_outs.task_id); - pto2_rt_add_dependency(sf_outs.task_id, pv_outs.task_id); - pto2_rt_add_dependency(sf_outs.task_id, up_outs.task_id); - pto2_rt_add_dependency(pv_outs.task_id, up_outs.task_id); + PTO2TaskId prev_update_task = PTO2TaskId::invalid(); + + for (uint64_t bn_local = bn; bn_local < bn_end; bn_local++) { + uint64_t cur_block_idx = host_block_table[b_idx * block_num + bn_local]; + uint64_t valid_len = std::min(block_size, cur_seq - bn_local * block_size); + + uint32_t kv_shapes[2] = { + static_cast(block_size), static_cast(head_dim) + }; + uint32_t kv_offsets[2] = {static_cast(cur_block_idx * block_size), 0}; + Tensor kj = key_cache.view(kv_shapes, kv_offsets); + Tensor vj = value_cache.view(kv_shapes, kv_offsets); + + Arg params_qk; + params_qk.add_input(qi); + 
params_qk.add_input(kj); + params_qk.add_output(sij_ci); + PTO2ManualSubmitResult qk_outs = pto2_rt_submit_aic_task_manual(FUNC_QK_MATMUL, params_qk); + const Tensor &sij = qk_outs.outputs.get_ref(0); + + uint32_t sij_valid_shapes[2] = { + static_cast(q_tile), static_cast(valid_len) + }; + uint32_t sij_valid_offsets[2] = {0, 0}; + Tensor sij_valid = sij.view(sij_valid_shapes, sij_valid_offsets); + + Arg params_sf; + params_sf.add_input(sij_valid); + params_sf.add_output(pij_f16_ci); + params_sf.add_output(scalar_ci); + params_sf.add_output(scalar_ci); + params_sf.add_scalar(scale_value); + PTO2ManualSubmitResult sf_outs = + pto2_rt_submit_aiv_task_manual(FUNC_SOFTMAX_PREPARE, params_sf); + const Tensor &pij_f16 = sf_outs.outputs.get_ref(0); + const Tensor &mi = sf_outs.outputs.get_ref(1); + const Tensor &li = sf_outs.outputs.get_ref(2); + + Arg params_pv; + params_pv.add_input(pij_f16); + params_pv.add_input(vj); + params_pv.add_output(tile2d_ci); + PTO2ManualSubmitResult pv_outs = pto2_rt_submit_aic_task_manual(FUNC_PV_MATMUL, params_pv); + const Tensor &oi_tmp = pv_outs.outputs.get_ref(0); + + uint64_t is_first = (bn_local == 0) ? 1 : 0; + uint64_t is_last = (bn_local == bn_this_batch - 1) ? 
1 : 0; + + Arg params_up; + params_up.add_input(mi); + params_up.add_input(li); + params_up.add_input(oi_tmp); + params_up.add_inout(mi_update); + params_up.add_inout(li_update); + params_up.add_inout(oi); + params_up.add_inout(out_view); + params_up.add_scalar(is_first); + params_up.add_scalar(is_last); + PTO2ManualSubmitResult up_outs = + pto2_rt_submit_aiv_task_manual(FUNC_ONLINE_UPDATE, params_up); + + pto2_rt_add_dependency(qk_outs.task_id, sf_outs.task_id); + pto2_rt_add_dependency(sf_outs.task_id, pv_outs.task_id); + pto2_rt_add_dependency(sf_outs.task_id, up_outs.task_id); + pto2_rt_add_dependency(pv_outs.task_id, up_outs.task_id); + if (prev_update_task.is_valid()) { + pto2_rt_add_dependency(prev_update_task, up_outs.task_id); + } + prev_update_task = up_outs.task_id; + } } } } From bd9a76020bd5802461fcf1bd1da6a03b2a9a14dc Mon Sep 17 00:00:00 2001 From: Youwei Xiao Date: Mon, 6 Apr 2026 01:13:34 +0800 Subject: [PATCH 14/35] Fix: stabilize partial-manual paged attention chunking Reduce the manual-scope chunk size in the heavy paged-attention\nscene and drop the extra cross-update dependency chain. The\nprevious chunking shape can deadlock the dep pool under benchmark\nload, while the smaller chunk keeps the benchmark path stable and\nlowers the manual-scope overhead. 
--- .../kernels/orchestration/paged_attention_orch.cpp | 8 +------- 1 file changed, 1 insertion(+), 7 deletions(-) diff --git a/tests/st/a2a3/tensormap_and_ringbuffer/paged_attention_partial_manual/kernels/orchestration/paged_attention_orch.cpp b/tests/st/a2a3/tensormap_and_ringbuffer/paged_attention_partial_manual/kernels/orchestration/paged_attention_orch.cpp index 4a30fe54a..5a0581cfa 100644 --- a/tests/st/a2a3/tensormap_and_ringbuffer/paged_attention_partial_manual/kernels/orchestration/paged_attention_orch.cpp +++ b/tests/st/a2a3/tensormap_and_ringbuffer/paged_attention_partial_manual/kernels/orchestration/paged_attention_orch.cpp @@ -20,7 +20,7 @@ #define FUNC_ONLINE_UPDATE 3 #define FUNC_AIC_HUB 4 #define FUNC_AIV_HUB 5 -#define N_MANUAL_CHUNK 2 +#define N_MANUAL_CHUNK 4 extern "C" { @@ -105,8 +105,6 @@ aicpu_orchestration_entry(const ChipStorageTaskArgs &orch_args, int orch_thread_ uint64_t bn_end = std::min(bn + static_cast(N_MANUAL_CHUNK), bn_this_batch); PTO2_SCOPE(PTO2ScopeMode::MANUAL) { - PTO2TaskId prev_update_task = PTO2TaskId::invalid(); - for (uint64_t bn_local = bn; bn_local < bn_end; bn_local++) { uint64_t cur_block_idx = host_block_table[b_idx * block_num + bn_local]; uint64_t valid_len = std::min(block_size, cur_seq - bn_local * block_size); @@ -170,10 +168,6 @@ aicpu_orchestration_entry(const ChipStorageTaskArgs &orch_args, int orch_thread_ pto2_rt_add_dependency(sf_outs.task_id, pv_outs.task_id); pto2_rt_add_dependency(sf_outs.task_id, up_outs.task_id); pto2_rt_add_dependency(pv_outs.task_id, up_outs.task_id); - if (prev_update_task.is_valid()) { - pto2_rt_add_dependency(prev_update_task, up_outs.task_id); - } - prev_update_task = up_outs.task_id; } } } From 988eddc86fc6d1c5c18b6cc111c3051e2f534f05 Mon Sep 17 00:00:00 2001 From: Youwei Xiao Date: Mon, 6 Apr 2026 01:18:33 +0800 Subject: [PATCH 15/35] Fix: restore deferred manual submit path Manual submit must stay a cheap metadata-recording path and defer\nTensorMap lookup/insert plus 
dep-pool fanin wiring to manual\nscope_end.\n\nThis reverts the duplicated submit-time work from 0fd6fbc and\nrestores the separate publish/on_scope_end order so manual scopes\ndo not attach fanout edges after releasing their scope reference. --- .../runtime/pto_orchestrator.cpp | 120 +++++++++--------- 1 file changed, 61 insertions(+), 59 deletions(-) diff --git a/src/a2a3/runtime/tensormap_and_ringbuffer/runtime/pto_orchestrator.cpp b/src/a2a3/runtime/tensormap_and_ringbuffer/runtime/pto_orchestrator.cpp index 6930bcdbb..8d4d05cbe 100644 --- a/src/a2a3/runtime/tensormap_and_ringbuffer/runtime/pto_orchestrator.cpp +++ b/src/a2a3/runtime/tensormap_and_ringbuffer/runtime/pto_orchestrator.cpp @@ -752,7 +752,11 @@ void pto2_scope_end(PTO2OrchestratorState *orch) { } } - orch->scheduler->publish_manual_scope_tasks_and_end_scope(&orch->scope_tasks[begin], count); + orch->scheduler->publish_manual_scope_tasks(&orch->scope_tasks[begin], count); + } + + if (orch->scheduler && count > 0) { + orch->scheduler->on_scope_end(&orch->scope_tasks[begin], count); } // Rewind the task buffer — these entries are no longer needed @@ -857,68 +861,69 @@ static TaskOutputTensors pto2_submit_mixed_task_impl( #endif // === STEP 2: Sync TensorMap validity and optional cleanup === - int32_t sm_last_task_alive = fc.last_task_alive.load(std::memory_order_acquire); + // Manual submit defers TensorMap lookup/insert and dep-pool wiring to + // manual scope_end, so syncing/reclaiming here only duplicates work. 
+ if constexpr (!kManualSubmit) { + int32_t sm_last_task_alive = fc.last_task_alive.load(std::memory_order_acquire); - orch->tensor_map.sync_tensormap(ring_id, sm_last_task_alive); + orch->tensor_map.sync_tensormap(ring_id, sm_last_task_alive); - if (sched) { - orch->rings[ring_id].dep_pool.reclaim(*sched, ring_id, sm_last_task_alive); + if (sched) { + orch->rings[ring_id].dep_pool.reclaim(*sched, ring_id, sm_last_task_alive); + } } CYCLE_COUNT_LAP_RECORD(g_orch_sync_cycle, AicpuPhaseId::ORCH_SYNC, task_id.raw); // === STEP 3: Lookup inputs + materialize runtime-created outputs === - for (int i = 0; i < args.tensor_count(); i++) { - TensorArgType ptype = args.tag(i); - if (ptype == TensorArgType::OUTPUT) { - // Runtime-created OUTPUT tensors are not looked up in the TensorMap since they have no dependencies. - continue; - } - - const Tensor *tensor = args.tensor(i).ptr; - if constexpr (kManualSubmit) { - if (task_owned_by_current_manual_scope(orch, tensor->owner_task_id)) { + if constexpr (!kManualSubmit) { + for (int i = 0; i < args.tensor_count(); i++) { + TensorArgType ptype = args.tag(i); + if (ptype == TensorArgType::OUTPUT) { + // Runtime-created OUTPUT tensors are not looked up in the TensorMap since they have no dependencies. continue; } - } - // Step A: creator retention — all existing tensors extend their creator lifetime. - PTO2TaskId owner = tensor->owner_task_id; - if (owner.is_valid() && sched != nullptr) { - PTO2TaskSlotState *prod_state = - &sched->ring_sched_states[owner.ring()].get_slot_state_by_task_id(owner.local()); - if (!pto2_append_fanin_or_fail( - orch, task_id, i, ptype, prod_state, &fanin_builder, sched, fc, ring_id, "creator retention" - )) { - return result; - } - } + const Tensor *tensor = args.tensor(i).ptr; - // Step B: only INPUT/INOUT need modifier dependency lookup. 
- if (ptype != TensorArgType::INPUT && ptype != TensorArgType::INOUT) { - continue; - } - if (tensor->manual_dep) { - continue; - } + // Step A: creator retention — all existing tensors extend their creator lifetime. + PTO2TaskId owner = tensor->owner_task_id; + if (owner.is_valid() && sched != nullptr) { + PTO2TaskSlotState *prod_state = + &sched->ring_sched_states[owner.ring()].get_slot_state_by_task_id(owner.local()); + if (!pto2_append_fanin_or_fail( + orch, task_id, i, ptype, prod_state, &fanin_builder, sched, fc, ring_id, "creator retention" + )) { + return result; + } + } - PTO2LookupResult lookup_result; - orch->tensor_map.lookup(*tensor, lookup_result); - - for (int r = 0; r < lookup_result.count; r++) { - PTO2TensorMapEntry &entry = *lookup_result.entries[r].entry; - auto overlap_status = lookup_result.entries[r].overlap_status; - auto prod_ring = entry.producer_task_id.ring(); - auto prod_local = entry.producer_task_id.local(); - PTO2TaskSlotState *prod_state = - &sched->ring_sched_states[prod_ring].get_slot_state_by_task_id(prod_local); - if (!pto2_append_fanin_or_fail( - orch, task_id, i, ptype, prod_state, &fanin_builder, sched, fc, ring_id, "overlap lookup" - )) { - return result; + // Step B: only INPUT/INOUT need modifier dependency lookup. 
+ if (ptype != TensorArgType::INPUT && ptype != TensorArgType::INOUT) { + continue; } - if (ptype == TensorArgType::INOUT && overlap_status == OverlapStatus::COVERED) { - orch->tensor_map.remove_entry(entry); + if (tensor->manual_dep) { + continue; + } + + PTO2LookupResult lookup_result; + orch->tensor_map.lookup(*tensor, lookup_result); + + for (int r = 0; r < lookup_result.count; r++) { + PTO2TensorMapEntry &entry = *lookup_result.entries[r].entry; + auto overlap_status = lookup_result.entries[r].overlap_status; + auto prod_ring = entry.producer_task_id.ring(); + auto prod_local = entry.producer_task_id.local(); + PTO2TaskSlotState *prod_state = + &sched->ring_sched_states[prod_ring].get_slot_state_by_task_id(prod_local); + if (!pto2_append_fanin_or_fail( + orch, task_id, i, ptype, prod_state, &fanin_builder, sched, fc, ring_id, "overlap lookup" + )) { + return result; + } + if (ptype == TensorArgType::INOUT && overlap_status == OverlapStatus::COVERED) { + orch->tensor_map.remove_entry(entry); + } } } } @@ -926,17 +931,14 @@ static TaskOutputTensors pto2_submit_mixed_task_impl( CYCLE_COUNT_LAP_RECORD(g_orch_lookup_cycle, AicpuPhaseId::ORCH_LOOKUP, task_id.raw); // === STEP 4: Register outputs/inouts in TensorMap (must be separate from lookup) === - for (int i = 0; i < args.tensor_count(); i++) { - TensorArgType ptype = args.tag(i); - if (ptype == TensorArgType::INOUT || ptype == TensorArgType::OUTPUT_EXISTING) { - if constexpr (kManualSubmit) { - if (task_owned_by_current_manual_scope(orch, args.tensor(i).ptr->owner_task_id)) { - continue; + if constexpr (!kManualSubmit) { + for (int i = 0; i < args.tensor_count(); i++) { + TensorArgType ptype = args.tag(i); + if (ptype == TensorArgType::INOUT || ptype == TensorArgType::OUTPUT_EXISTING) { + if (!args.tensor(i).ptr->manual_dep) { + orch->tensor_map.insert(*args.tensor(i).ptr, task_id); } } - if (!args.tensor(i).ptr->manual_dep) { - orch->tensor_map.insert(*args.tensor(i).ptr, task_id); - } } } From 
a247f59341561423ee36ff39708658b4e769efc3 Mon Sep 17 00:00:00 2001 From: Youwei Xiao Date: Tue, 7 Apr 2026 12:59:07 +0800 Subject: [PATCH 16/35] Update: move manual boundary discovery to submit - update the manual dependency design doc to make submit-time boundary discovery and TensorMap-free manual scope_end explicit - cache and retain external producers during manual submit, then merge them with explicit manual edges at publish time - keep the heavy manual paged-attention benchmark on device 3 moving in the right direction without changing example code --- docs/manual-dep-for-tensormap-design.md | 161 +++++++++---- .../runtime/pto_orchestrator.cpp | 223 +++++++----------- 2 files changed, 203 insertions(+), 181 deletions(-) diff --git a/docs/manual-dep-for-tensormap-design.md b/docs/manual-dep-for-tensormap-design.md index 7d8ff82c8..7057f5211 100644 --- a/docs/manual-dep-for-tensormap-design.md +++ b/docs/manual-dep-for-tensormap-design.md @@ -40,8 +40,8 @@ These decisions are already aligned with the requested direction: 2. Manual scope may not contain another manual scope. 3. The design must not simplify away multi-write cases. 4. For an outer-scope tensor written inside a manual scope, readiness is the writer task completion time, not `scope_end`. -5. Therefore, a task inside a manual scope that writes an outer-scope tensor must still publish that tensor to TensorMap by manual `scope_end`. -6. For an outer-scope tensor read inside a manual scope, the dependency must still be forced by TensorMap/owner-based boundary seeding during manual `scope_end`. +5. Therefore, a task inside a manual scope that writes an outer-scope tensor must still publish that tensor frontier through TensorMap before later submissions depend on it. +6. For an outer-scope tensor read inside a manual scope, the dependency must still be forced by TensorMap/owner-based boundary seeding during manual submit. 7. 
Tasks created inside a manual scope should be batch-published to the scheduler at `scope_end`, matching `aicpu_build_graph` semantics for explicit dependency closure inside the scope. ## Change Control Requirements @@ -141,13 +141,14 @@ This is why manual dependency integration should work as follows: - do not bind manual dependencies to tensors - at manual `scope_end`, realize manual dependencies directly as normal producer-consumer scheduler edges -So at manual `scope_end`, for each manual task: +So for a task inside manual scope: 1. Start a local dedup buffer such as `fanin_states[]`. -2. Add producers from recorded manual edges. -3. Add producers from outer-tensor creator retention and TensorMap lookup. -4. Dedup all of them together. -5. Run the normal wiring path into: +2. During submit, add producers from outer-tensor creator retention and TensorMap lookup. +3. Cache that deduped external producer set in the task payload. +4. At `scope_end`, add producers from recorded manual edges. +5. Dedup both sources together before wiring the scheduler edges. +6. 
Run the normal wiring path into: - `payload->fanin_slot_states[]` - `fanin_count` - producer `fanout_head` @@ -159,8 +160,8 @@ Then after publish: Concise conclusion: -- TensorMap discovers tensor-related dependencies -- manual deps bypass discovery +- TensorMap discovers boundary tensor-related dependencies during manual submit +- manual deps bypass discovery and are replayed only at manual `scope_end` - both become the same scheduler edges before publish - execution uses only the scheduler edge machinery, not TensorMap @@ -211,8 +212,8 @@ This is required so all same-scope explicit edges are fully wired before any tas For cross-scope tensors touched by tasks inside `PTO2_SCOPE(PTO2ScopeMode::MANUAL)`: -- outside tasks submitted after the manual scope ends must still be able to discover the internal writer frontier -- therefore the producer frontier for an external tensor written inside the manual scope must become visible to later TensorMap lookups at manual `scope_end` +- outside tasks submitted after a writer task is submitted must still be able to discover that writer frontier through TensorMap +- therefore the producer frontier for an external tensor written inside the manual scope must be updated during manual submit - however the tensor is still not semantically ready until that producer task actually completes So: @@ -224,7 +225,8 @@ The document must not conflate these two mechanisms. 
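As a toy sketch of that separation (illustrative names only, not the runtime's real `PTO2TaskSlotState` machinery): frontier publication makes a writer discoverable at manual submit, while only producer completion satisfies a consumer's fanin slot.

```cpp
#include <cassert>

// Toy model of the two mechanisms the design keeps separate. Names are
// illustrative, not the runtime's real types.
struct ProducerModel {
    bool frontier_published = false;  // TensorMap visibility: set at manual submit
    bool completed = false;           // semantic readiness: set at task completion
};

// Later submissions can discover the writer as soon as its frontier is
// published to TensorMap...
bool discoverable(const ProducerModel &p) { return p.frontier_published; }

// ...but a consumer wired to that writer only satisfies this fanin slot
// once the producer task actually completes.
bool fanin_satisfied(const ProducerModel &p) { return p.completed; }
```

In this model a consumer can be published while `discoverable` is already true but `fanin_satisfied` is still false; that is exactly the published `PENDING` state described below.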
More precisely: -- before manual `scope_end`, the task record already exists but TensorMap boundary wiring may still be deferred +- before manual `scope_end`, the task record already exists and TensorMap boundary discovery/publication has already happened +- before manual `scope_end`, the task is still invisible to the executable scheduler graph - after manual `scope_end`, the task becomes part of the executable published graph - once published, the task may enter `READY` immediately or remain `PENDING` depending on whether its dependencies are already satisfied @@ -234,7 +236,7 @@ The following clarifications are recorded to reduce implementation drift and hal 1. Deferred publish does not mean deferred task allocation. - Manual tasks still allocate task ids, slot state, and payload at submit time. -- What is deferred is dependency realization and ready-queue publication. +- What is deferred is explicit same-scope edge realization and ready-queue publication. 2. Manual dependencies are not tracked by TensorMap. - TensorMap is only used for tensor-related producer discovery. @@ -245,8 +247,8 @@ The following clarifications are recorded to reduce implementation drift and hal - The runtime should not keep a second dependency engine alive after publish. - Once the scope is finalized, all dependencies are handled only by the existing scheduler fanin/fanout path. -4. Deferred boundary wiring does not change tensor readiness semantics. -- Outer writes become TensorMap-visible at manual `scope_end`. +4. Submit-time boundary wiring does not change tensor readiness semantics. +- Outer writes become TensorMap-visible at manual submit. - Their semantic readiness is still producer-task completion. ## Tensor categories @@ -525,10 +527,15 @@ For a task submitted inside manual scope: 3. Initialize `fanin_count = 1` and `fanin_refcount = 0` for deferred publication. 4. Return a stable `task_id` immediately so orchestration can call `add_dependency`. 5. 
Do not realize explicit manual edges into scheduler fanin/fanout yet. -6. Do not realize external TensorMap-derived dependencies yet. -7. Do not publish outer writes into TensorMap yet. +6. Realize outer-boundary producer discovery immediately: + - creator retention from `owner_task_id` + - TensorMap lookup for outer `INPUT` / `INOUT` + - covered-entry removal for outer `INOUT` +7. Publish outer writes to TensorMap immediately for outer `INOUT` / `OUTPUT_EXISTING`. 8. Do not push the task into ready queues during submit. -9. Preserve enough scope-local information so manual `scope_end` can realize all dependencies before publish. +9. Retain every cached external producer strongly enough that it cannot be reclaimed or slot-reused before manual `scope_end`. +10. Cache the deduped external producer set in the task payload so manual `scope_end` can realize the scheduler edges without touching TensorMap. +11. Preserve enough scope-local information so manual `scope_end` can realize explicit same-scope edges before publish. Submit-time task records are still required even though execution is deferred: @@ -568,7 +575,9 @@ That gives a manual pre-publish state: - task records and task ids already exist - explicit edges are only recorded, not yet wired into scheduler fanin/fanout -- external TensorMap-derived edges are also deferred until `scope_end` +- external TensorMap-derived producers are already discovered and cached +- cached external producers are retained so deferred publish cannot attach to a reused slot +- outer writes are already reflected in TensorMap frontier state - the task is still unpublished as executable scheduler work because the publish barrier is not yet released ### What scope_end should do in manual scope @@ -579,10 +588,9 @@ Recommended sequence: 1. 
For every task directly owned by this manual scope: - realize recorded explicit `add_dependency` edges into scheduler fanin/fanout state - - inspect each tensor arg - - if the tensor is manual-local, skip TensorMap logic - - otherwise run the existing TensorMap/owner dependency logic - - if the task writes an outer tensor, insert its producer frontier into TensorMap + - start from the external producer set cached during submit + - dedup explicit same-scope edges against those cached external producers + - realize the final deduped producer set into scheduler fanin/fanout state 2. After all dependency realization is complete for the scope: - release the publish barrier by incrementing `fanin_refcount` - if `fanin_refcount == fanin_count`, transition to `READY` and push to ready queue @@ -601,36 +609,37 @@ This helper should reuse the existing ready-transition logic as much as possible ### How external dependency replay works -Manual `scope_end` should replay tasks in original submit order, using: +Manual submit should discover external dependencies in original submit order, using: - `scope_tasks[]` for task order - `manual_task_meta[]` for packed tags and edge ranges - `PTO2TaskPayload::tensors[]` for actual tensor values -For each task in that order: +For each task in that order during submit: -1. Realize explicit manual edges whose consumer is this task. -2. Decode tensor tags from `packed_tags`. -3. For each tensor arg: +1. Decode tensor tags from `packed_tags`. +2. For each tensor arg: - if `owner_task_id` belongs to the current manual scope's owned task set, treat it as manual-local and skip TensorMap logic - otherwise treat it as outer/external -4. For outer/external tensors: +3. For outer/external tensors: - apply creator-retention logic from `owner_task_id` - apply existing TensorMap overlap lookup for `INPUT` / `INOUT` +4. Cache the deduped external producer set in the task payload. 5. 
After lookup for this task: - apply normal TensorMap insertion for outer writes (`INOUT` / `OUTPUT_EXISTING`) -This replay order matters: +This submit order matters: - it preserves current tensormap behavior for multiple writes to outer tensors - earlier outer writes from the same manual scope become visible to later tasks in the same manual scope during replay - that matches the accepted v1 tradeoff that outer tensors may still induce implicit same-scope TensorMap edges +- it requires the same TensorMap validity synchronization that normal auto submit uses before lookup/insert -The replay must not be implemented as: +The split must not be implemented as: -- all lookups for the whole scope first, then all inserts -- all explicit manual edges first, then a second undeduped TensorMap pass -- per-dependency immediate scheduler mutation without first building a deduped producer set for the consumer +- deferring all lookups and inserts to `scope_end` +- wiring scheduler fanout during manual submit +- counting cached external producers and explicit manual edges independently without one dedup pass at publish time Those variants would diverge from current tensormap semantics and are considered incorrect for this design. 
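A minimal sketch of the required single dedup pass at publish time, using plain ids in place of producer slot-state pointers (names and types here are illustrative, not the runtime's):

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Stand-in for a producer slot-state pointer; a plain id keeps the sketch
// self-contained.
using ProducerId = int;

// One dedup pass over both sources before any fanin_count accounting or
// fanout wiring: cached external producers (discovered at manual submit)
// first, then explicit manual edges (recorded via add_dependency).
std::vector<ProducerId> merge_fanin_sources(std::vector<ProducerId> cached_external,
                                            const std::vector<ProducerId> &manual_edges) {
    std::vector<ProducerId> fanin = std::move(cached_external);
    for (ProducerId p : manual_edges) {
        // A producer reached through both boundary discovery and an explicit
        // edge must contribute exactly one fanin slot; otherwise fanin is
        // double-counted and dependencies can be over-released.
        if (std::find(fanin.begin(), fanin.end(), p) == fanin.end()) {
            fanin.push_back(p);
        }
    }
    return fanin;
}
```

Both forbidden variants above fail this property: separate counting of the two sources double-counts shared producers, and deferring all lookups to `scope_end` makes the cached set unavailable for the merge.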
@@ -638,12 +647,12 @@ Those variants would diverge from current tensormap semantics and are considered For a manual-scope task that reads an outer-scope tensor: -- if the external producer task has already completed when dependency realization happens at manual `scope_end`, that edge should immediately contribute to `fanin_refcount` +- if the external producer task has already completed when scheduler realization happens at manual `scope_end`, that edge should immediately contribute to `fanin_refcount` - then manual `scope_end` releases only the publish barrier, and the task may become `READY` immediately If the external producer has only published its TensorMap frontier but not yet completed: -- the manual-scope consumer is published at manual `scope_end` +- the manual-scope consumer has already cached that producer at submit time and is published at manual `scope_end` - but it remains in published `PENDING` - later producer completion notifies fanout and increments `fanin_refcount` - once `fanin_refcount == fanin_count`, the consumer transitions to `READY` @@ -661,7 +670,7 @@ This is the desired hybrid behavior: - no need for a new scheduler task state - reuse the existing `fanin_count` / `fanin_refcount` / `PENDING -> READY` transition model -The main new behavior is deferred dependency realization plus deferred release of the publish barrier for manual-scope tasks. +The main new behavior is submit-time boundary discovery plus deferred release of explicit same-scope publish for manual-scope tasks. ## Current-Manual-Scope Ownership Without Tensor Changes @@ -722,14 +731,15 @@ The spec needs two explicit rules here. 
A task inside manual scope that reads an outer-scope tensor: - must still collect the external producer through TensorMap/owner logic -- must include that dependency in its fanin during manual `scope_end`, before manual batch publish +- must cache that dependency during manual submit +- must include that cached dependency in its fanin during manual `scope_end`, before manual batch publish - must not require the user to restate that outer dependency manually ### External writes A task inside manual scope that writes an outer-scope tensor: -- must publish its producer frontier to TensorMap during manual `scope_end` +- must publish its producer frontier to TensorMap during manual submit - must not publish same-scope temporary tensors into TensorMap - may still be `PENDING` and unpublished to the scheduler until manual `scope_end` @@ -868,7 +878,7 @@ This is a strict requirement: - outer read boundary dependency is forced by TensorMap/owner metadata - orchestration code inside the manual scope must not be required to recreate that outer dependency manually -- even though the consumer task itself is only batch-published to the scheduler at manual `scope_end`, its fanin accounting must include the external TensorMap-derived dependency before publication +- even though the consumer task itself is only batch-published to the scheduler at manual `scope_end`, its fanin accounting must include the external TensorMap-derived dependency discovered at submit time ## Nesting Rules @@ -1069,30 +1079,83 @@ Units below are `elapsed_us (orch_us)`. 5. Letting cross-scope writer frontier become visible only after producer completion. - This is too late for later outside submissions made after manual `scope_end`. -6. Realizing manual edges incrementally before `scope_end`. -- This can race with already-live external producers and partially built fanin state. +6. Wiring external producers into scheduler fanout during manual submit. 
+- This can let unpublished tasks become runnable before `scope_end`. -7. Missing alias/view inheritance of scope ownership. +7. Publishing external writer frontier later than manual submit. +- This makes later boundary lookups see stale producer state and diverges from current tensormap semantics for multiple writes. + +8. Missing a final dedup pass between cached external producers and explicit manual edges. +- This double-counts fanin and can over-release dependencies. + +9. Missing alias/view inheritance of scope ownership. - This causes wrong same-scope vs cross-scope classification. -8. Turning this feature into a broad runtime refactor. +10. Turning this feature into a broad runtime refactor. - This increases regression risk and violates the required change scope. -9. Allowing blocking cross-layer tensor access inside manual scope. +11. Allowing blocking cross-layer tensor access inside manual scope. - `get_tensor_data` and `set_tensor_data` assume published producer state and should fail fast in manual scope. -10. Replacing the existing scheduler edge machinery with a separate manual execution path. +12. Replacing the existing scheduler edge machinery with a separate manual execution path. - This would duplicate fanin/fanout handling, completion notification, and release traversal. - The design requires one unified post-publish scheduler mechanism. +## Dangerous Risks For The Submit/Scope-End Split + +The implementation should explicitly guard the following failure modes before any +performance tuning claims are accepted. + +1. Early-ready bug from submit-time scheduler mutation. +- Manual submit may discover external producers early, but it must not mutate + producer `fanout_head` or consumer ready state early. +- Required safeguard: manual submit may cache producer slot states only. + +2. Stale frontier bug for outer writes. 
+- If outer `INOUT` / `OUTPUT_EXISTING` writes stay deferred until `scope_end`, + later submissions can miss the newest writer frontier. +- Required safeguard: publish TensorMap frontier at manual submit in original + task order. + +3. Double-accounting bug across cached external fanins and explicit manual edges. +- One producer may be found both through boundary discovery and through an + explicit edge. +- Required safeguard: publish-time fanin construction must run one dedup pass + over both sources before incrementing `fanin_count` or wiring fanout. + +4. Completed-before-publish bug. +- An external producer may already be `COMPLETED` when the manual scope reaches + `scope_end`. +- Required safeguard: publish-time scheduler wiring must detect already-finished + producers and credit `fanin_refcount` exactly once. + +5. Producer-lifetime bug for cached external fanins. +- A cached producer that is not retained may reach `CONSUMED` and have its slot + reused before the manual scope publishes. +- Required safeguard: manual submit must take a real retained reference on each + unique cached producer, and consumer release must drop that same reference. + +6. Scope-abort visibility bug for submit-time outer writes. +- If manual submit mutates TensorMap for outer writes and the scope later fails, + global TensorMap state can point at unpublished internal writers. +- Required safeguard: treat post-submit fatal paths as terminal for the runtime, + and keep the implementation free of late scope validation after submit-time + TensorMap mutation. + +7. Wrong manual-local classification for aliases and views. +- Boundary discovery must skip TensorMap only for tensors whose + `owner_task_id` belongs to the current manual scope, including derived views. +- Required safeguard: keep classification on task provenance, not on a new + tensor-side mode bit. + ## Recommended Implementation Order 1. Add API surface for `add_dependency` and manual scope mode. 2. 
Add manual-submit APIs with `_manual` suffix returning task ids plus outputs. 3. Add scope-frame mode plus scope-local manual-edge storage. -4. Implement deferred explicit edge realization at manual `scope_end`. -5. Implement manual-local tensor classification from `owner_task_id` plus current manual-scope ownership. -6. Realize outer-tensor TensorMap lookup/insert during manual `scope_end`. +4. Implement submit-time outer-tensor TensorMap lookup/insert with cached external fanins. +5. Keep manual `scope_end` TensorMap-free and realize only explicit same-scope edges plus scheduler publish. +6. Implement manual-local tensor classification from `owner_task_id` plus current manual-scope ownership. 7. Add fail-fast nested-scope-in-manual check and block `get_tensor_data` / `set_tensor_data` in manual scope. 8. Add targeted tests for boundary semantics. 9. Migrate one example and validate. diff --git a/src/a2a3/runtime/tensormap_and_ringbuffer/runtime/pto_orchestrator.cpp b/src/a2a3/runtime/tensormap_and_ringbuffer/runtime/pto_orchestrator.cpp index 8d4d05cbe..2c7d3d69e 100644 --- a/src/a2a3/runtime/tensormap_and_ringbuffer/runtime/pto_orchestrator.cpp +++ b/src/a2a3/runtime/tensormap_and_ringbuffer/runtime/pto_orchestrator.cpp @@ -612,13 +612,6 @@ void pto2_scope_end(PTO2OrchestratorState *orch) { return; } - for (int32_t ring = 0; ring <= orch->current_ring_id(); ring++) { - int32_t sm_last_task_alive = - orch->sm_handle->header->rings[ring].fc.last_task_alive.load(std::memory_order_acquire); - orch->tensor_map.sync_tensormap(static_cast(ring), sm_last_task_alive); - orch->rings[ring].dep_pool.reclaim(*orch->scheduler, static_cast(ring), sm_last_task_alive); - } - for (int32_t task_idx = 0; task_idx < count; task_idx++) { PTO2ManualTaskMeta &meta = orch->manual_task_meta[manual_meta_begin + task_idx]; PTO2TaskSlotState *slot_state = orch->scope_tasks[begin + task_idx]; @@ -635,7 +628,15 @@ void pto2_scope_end(PTO2OrchestratorState *orch) { auto &dep_pool = 
orch->rings[ring_id].dep_pool; auto &fc = orch->sm_handle->header->rings[ring_id].fc; PTO2FaninBuilder fanin_builder; - fanin_builder.spill_pool = &orch->rings[ring_id].fanin_pool; + fanin_builder.count = payload->fanin_actual_count; + fanin_builder.spill_start = payload->fanin_spill_start; + fanin_builder.spill_pool = + (payload->fanin_spill_pool != nullptr) ? payload->fanin_spill_pool : &orch->rings[ring_id].fanin_pool; + int32_t cached_external_count = payload->fanin_actual_count; + int32_t cached_inline_count = std::min(cached_external_count, PTO2_FANIN_INLINE_CAP); + for (int32_t i = 0; i < cached_inline_count; i++) { + fanin_builder.inline_slots[i] = payload->fanin_inline_slot_states[i]; + } for (int32_t edge_idx = meta.incoming_edge_head; edge_idx >= 0; edge_idx = orch->manual_edges[edge_idx].next_consumer_edge) { @@ -649,59 +650,6 @@ void pto2_scope_end(PTO2OrchestratorState *orch) { } } - for (int32_t t = 0; t < meta.tensor_count; t++) { - TensorArgType tag = static_cast(meta.tags[t]); - if (tag == TensorArgType::OUTPUT) { - continue; - } - - const Tensor &tensor = payload->tensors[t]; - if ((meta.manual_local_mask & static_cast(1u << t)) != 0) { - continue; - } - - PTO2TaskId owner = tensor.owner_task_id; - if (owner.is_valid()) { - PTO2TaskSlotState *prod_state = - &orch->scheduler->ring_sched_states[owner.ring()].get_slot_state_by_task_id(owner.local()); - if (!pto2_append_fanin_or_fail( - orch, task_id, t, tag, prod_state, &fanin_builder, orch->scheduler, fc, ring_id, - "creator retention" - )) { - return; - } - } - - if (tag != TensorArgType::INPUT && tag != TensorArgType::INOUT) { - continue; - } - if (tensor.manual_dep) { - continue; - } - - PTO2LookupResult lookup_result; - orch->tensor_map.lookup(tensor, lookup_result); - - for (int32_t r = 0; r < lookup_result.count; r++) { - PTO2TensorMapEntry &entry = *lookup_result.entries[r].entry; - auto overlap_status = lookup_result.entries[r].overlap_status; - PTO2TaskId prod_task_id = 
entry.producer_task_id; - PTO2TaskSlotState *prod_state = - &orch->scheduler->ring_sched_states[prod_task_id.ring()].get_slot_state_by_task_id( - prod_task_id.local() - ); - if (!pto2_append_fanin_or_fail( - orch, task_id, t, tag, prod_state, &fanin_builder, orch->scheduler, fc, ring_id, - "overlap lookup" - )) { - return; - } - if (tag == TensorArgType::INOUT && overlap_status == OverlapStatus::COVERED) { - orch->tensor_map.remove_entry(entry); - } - } - } - int32_t fanin_count = fanin_builder.count; int32_t inline_count = std::min(fanin_count, PTO2_FANIN_INLINE_CAP); int32_t spill_count = fanin_count - inline_count; @@ -717,14 +665,19 @@ void pto2_scope_end(PTO2OrchestratorState *orch) { } int32_t early_finished = 0; - pto2_for_each_fanin_slot_state(*payload, [&](PTO2TaskSlotState *producer_slot) { + int32_t producer_index = 0; + fanin_builder.for_each([&](PTO2TaskSlotState *producer_slot) { PTO2TaskSlotState &producer_slot_state = *producer_slot; + bool cached_external = producer_index < cached_external_count; + producer_index++; #if PTO2_ORCH_PROFILING pto2_fanout_lock(producer_slot_state, g_orch_fanin_atomic_count, g_orch_fanin_wait_cycle); #else pto2_fanout_lock(producer_slot_state); #endif - producer_slot_state.fanout_count += 1; + if (!cached_external) { + producer_slot_state.fanout_count += 1; + } int32_t prod_state = producer_slot_state.task_state.load(std::memory_order_acquire); if (prod_state >= PTO2_TASK_COMPLETED) { early_finished++; @@ -738,19 +691,6 @@ void pto2_scope_end(PTO2OrchestratorState *orch) { slot_state->fanin_refcount.fetch_add(early_finished, std::memory_order_acq_rel); } slot_state->dep_pool_mark = dep_pool.top; - - for (int32_t t = 0; t < meta.tensor_count; t++) { - TensorArgType tag = static_cast(meta.tags[t]); - if (tag != TensorArgType::INOUT && tag != TensorArgType::OUTPUT_EXISTING) { - continue; - } - const Tensor &tensor = payload->tensors[t]; - if ((meta.manual_local_mask & static_cast(1u << t)) != 0 || tensor.manual_dep) { - 
continue; - } - orch->tensor_map.insert(tensor, task_id); - } - } orch->scheduler->publish_manual_scope_tasks(&orch->scope_tasks[begin], count); } @@ -861,13 +801,11 @@ static TaskOutputTensors pto2_submit_mixed_task_impl( #endif // === STEP 2: Sync TensorMap validity and optional cleanup === - // Manual submit defers TensorMap lookup/insert and dep-pool wiring to - // manual scope_end, so syncing/reclaiming here only duplicates work. - if constexpr (!kManualSubmit) { - int32_t sm_last_task_alive = fc.last_task_alive.load(std::memory_order_acquire); + int32_t sm_last_task_alive = fc.last_task_alive.load(std::memory_order_acquire); - orch->tensor_map.sync_tensormap(ring_id, sm_last_task_alive); + orch->tensor_map.sync_tensormap(ring_id, sm_last_task_alive); + if constexpr (!kManualSubmit) { if (sched) { orch->rings[ring_id].dep_pool.reclaim(*sched, ring_id, sm_last_task_alive); } @@ -876,54 +814,56 @@ static TaskOutputTensors pto2_submit_mixed_task_impl( CYCLE_COUNT_LAP_RECORD(g_orch_sync_cycle, AicpuPhaseId::ORCH_SYNC, task_id.raw); // === STEP 3: Lookup inputs + materialize runtime-created outputs === - if constexpr (!kManualSubmit) { - for (int i = 0; i < args.tensor_count(); i++) { - TensorArgType ptype = args.tag(i); - if (ptype == TensorArgType::OUTPUT) { - // Runtime-created OUTPUT tensors are not looked up in the TensorMap since they have no dependencies. - continue; - } - - const Tensor *tensor = args.tensor(i).ptr; - - // Step A: creator retention — all existing tensors extend their creator lifetime. 
-        PTO2TaskId owner = tensor->owner_task_id;
-        if (owner.is_valid() && sched != nullptr) {
-            PTO2TaskSlotState *prod_state =
-                &sched->ring_sched_states[owner.ring()].get_slot_state_by_task_id(owner.local());
-            if (!pto2_append_fanin_or_fail(
-                    orch, task_id, i, ptype, prod_state, &fanin_builder, sched, fc, ring_id, "creator retention"
-                )) {
-                return result;
-            }
-        }
+    uint16_t manual_local_mask = 0;
+    for (int i = 0; i < args.tensor_count(); i++) {
+        TensorArgType ptype = args.tag(i);
+        if (ptype == TensorArgType::OUTPUT) {
+            continue;
+        }
 
-        // Step B: only INPUT/INOUT need modifier dependency lookup.
-        if (ptype != TensorArgType::INPUT && ptype != TensorArgType::INOUT) {
+        const Tensor *tensor = args.tensor(i).ptr;
+        if constexpr (kManualSubmit) {
+            if (task_owned_by_current_manual_scope(orch, tensor->owner_task_id)) {
+                manual_local_mask |= static_cast<uint16_t>(1u << i);
                 continue;
             }
-        if (tensor->manual_dep) {
-            continue;
+        }
+
+        PTO2TaskId owner = tensor->owner_task_id;
+        if (owner.is_valid() && sched != nullptr) {
+            PTO2TaskSlotState *prod_state =
+                &sched->ring_sched_states[owner.ring()].get_slot_state_by_task_id(owner.local());
+            if (!pto2_append_fanin_or_fail(
+                    orch, task_id, i, ptype, prod_state, &fanin_builder, sched, fc, ring_id, "creator retention"
+                )) {
+                return result;
             }
+        }
 
-        PTO2LookupResult lookup_result;
-        orch->tensor_map.lookup(*tensor, lookup_result);
+        if (ptype != TensorArgType::INPUT && ptype != TensorArgType::INOUT) {
+            continue;
+        }
+        if (tensor->manual_dep) {
+            continue;
+        }
 
-        for (int r = 0; r < lookup_result.count; r++) {
-            PTO2TensorMapEntry &entry = *lookup_result.entries[r].entry;
-            auto overlap_status = lookup_result.entries[r].overlap_status;
-            auto prod_ring = entry.producer_task_id.ring();
-            auto prod_local = entry.producer_task_id.local();
-            PTO2TaskSlotState *prod_state =
-                &sched->ring_sched_states[prod_ring].get_slot_state_by_task_id(prod_local);
-            if (!pto2_append_fanin_or_fail(
-                    orch, task_id, i, ptype, prod_state, &fanin_builder, sched, fc, ring_id, "overlap lookup"
-                )) {
-                return result;
-            }
-            if (ptype == TensorArgType::INOUT && overlap_status == OverlapStatus::COVERED) {
-                orch->tensor_map.remove_entry(entry);
-            }
+        PTO2LookupResult lookup_result;
+        orch->tensor_map.lookup(*tensor, lookup_result);
+
+        for (int r = 0; r < lookup_result.count; r++) {
+            PTO2TensorMapEntry &entry = *lookup_result.entries[r].entry;
+            auto overlap_status = lookup_result.entries[r].overlap_status;
+            auto prod_ring = entry.producer_task_id.ring();
+            auto prod_local = entry.producer_task_id.local();
+            PTO2TaskSlotState *prod_state =
+                &sched->ring_sched_states[prod_ring].get_slot_state_by_task_id(prod_local);
+            if (!pto2_append_fanin_or_fail(
+                    orch, task_id, i, ptype, prod_state, &fanin_builder, sched, fc, ring_id, "overlap lookup"
+                )) {
+                return result;
+            }
+            if (ptype == TensorArgType::INOUT && overlap_status == OverlapStatus::COVERED) {
+                orch->tensor_map.remove_entry(entry);
             }
         }
     }
@@ -931,14 +871,17 @@ static TaskOutputTensors pto2_submit_mixed_task_impl(
     CYCLE_COUNT_LAP_RECORD(g_orch_lookup_cycle, AicpuPhaseId::ORCH_LOOKUP, task_id.raw);
 
     // === STEP 4: Register outputs/inouts in TensorMap (must be separate from lookup) ===
-    if constexpr (!kManualSubmit) {
-        for (int i = 0; i < args.tensor_count(); i++) {
-            TensorArgType ptype = args.tag(i);
-            if (ptype == TensorArgType::INOUT || ptype == TensorArgType::OUTPUT_EXISTING) {
-                if (!args.tensor(i).ptr->manual_dep) {
-                    orch->tensor_map.insert(*args.tensor(i).ptr, task_id);
+    for (int i = 0; i < args.tensor_count(); i++) {
+        TensorArgType ptype = args.tag(i);
+        if (ptype == TensorArgType::INOUT || ptype == TensorArgType::OUTPUT_EXISTING) {
+            if constexpr (kManualSubmit) {
+                if ((manual_local_mask & static_cast<uint16_t>(1u << i)) != 0) {
+                    continue;
                 }
             }
+            if (!args.tensor(i).ptr->manual_dep) {
+                orch->tensor_map.insert(*args.tensor(i).ptr, task_id);
+            }
         }
     }
 
@@ -997,10 +940,26 @@ static TaskOutputTensors pto2_submit_mixed_task_impl(
     cur_slot_state.next_block_idx = 0;
     if constexpr (kManualSubmit) {
+        int32_t inline_count = std::min(fanin_builder.count, PTO2_FANIN_INLINE_CAP);
+        int32_t spill_count = fanin_builder.count - inline_count;
+        payload->fanin_actual_count = fanin_builder.count;
+        payload->fanin_spill_start = (spill_count > 0) ? fanin_builder.spill_start : 0;
+        payload->fanin_spill_pool = (spill_count > 0) ? fanin_builder.spill_pool : nullptr;
+        for (int i = 0; i < inline_count; i++) {
+            payload->fanin_inline_slot_states[i] = fanin_builder.inline_slots[i];
+        }
+        fanin_builder.for_each([&](PTO2TaskSlotState *producer_slot) {
+            PTO2TaskSlotState &producer_slot_state = *producer_slot;
+#if PTO2_ORCH_PROFILING
+            pto2_fanout_lock(producer_slot_state, g_orch_fanin_atomic_count, g_orch_fanin_wait_cycle);
+#else
+            pto2_fanout_lock(producer_slot_state);
+#endif
+            producer_slot_state.fanout_count += 1;
+            pto2_fanout_unlock(producer_slot_state);
+            return true;
+        });
         cur_slot_state.fanin_count = 1;
-        payload->fanin_actual_count = 0;
-        payload->fanin_spill_start = 0;
-        payload->fanin_spill_pool = nullptr;
         cur_slot_state.dep_pool_mark = orch->rings[ring_id].dep_pool.top;
     } else {
         auto &dep_pool = orch->rings[ring_id].dep_pool;

From 431a9eaaa6f2dcfb5818e211ddf66f9b219e3e5e Mon Sep 17 00:00:00 2001
From: Youwei Xiao
Date: Tue, 7 Apr 2026 14:56:03 +0800
Subject: [PATCH 17/35] Fix: remove manual scope membership scan

---
 .../runtime/pto_orchestrator.cpp | 40 ++++++++++---------
 1 file changed, 21 insertions(+), 19 deletions(-)

diff --git a/src/a2a3/runtime/tensormap_and_ringbuffer/runtime/pto_orchestrator.cpp b/src/a2a3/runtime/tensormap_and_ringbuffer/runtime/pto_orchestrator.cpp
index 2c7d3d69e..332c75f77 100644
--- a/src/a2a3/runtime/tensormap_and_ringbuffer/runtime/pto_orchestrator.cpp
+++ b/src/a2a3/runtime/tensormap_and_ringbuffer/runtime/pto_orchestrator.cpp
@@ -515,27 +515,29 @@ static int32_t find_current_manual_scope_task_index(const PTO2OrchestratorState
     }
 
     PTO2TaskSlotState *first_slot_state = orch->scope_tasks[begin];
-    if (first_slot_state != nullptr) {
-        PTO2TaskId first_task_id = first_slot_state->task->task_id;
-        if (first_task_id.ring() == task_id.ring()) {
-            uint32_t window_size = orch->rings[first_task_id.ring()].task_allocator.window_size();
-            uint32_t first_local = first_task_id.local();
-            uint32_t task_local = task_id.local();
-            uint32_t delta = task_local >= first_local ? task_local - first_local : task_local + window_size - first_local;
-            if (delta < static_cast<uint32_t>(count)) {
-                PTO2TaskSlotState *candidate = orch->scope_tasks[begin + static_cast<int32_t>(delta)];
-                if (candidate != nullptr && candidate->task->task_id == task_id) {
-                    return static_cast<int32_t>(delta);
-                }
-            }
-        }
+    if (first_slot_state == nullptr) {
+        return -1;
     }
-    for (int32_t i = begin; i < orch->scope_tasks_size; i++) {
-        PTO2TaskSlotState *slot_state = orch->scope_tasks[i];
-        if (slot_state != nullptr && slot_state->task->task_id == task_id) {
-            return i - begin;
-        }
+
+    PTO2TaskId first_task_id = first_slot_state->task->task_id;
+    if (first_task_id.ring() != task_id.ring()) {
+        return -1;
+    }
+
+    uint32_t first_local = first_task_id.local();
+    uint32_t task_local = task_id.local();
+    if (task_local < first_local) {
+        return -1;
+    }
+
+    uint32_t delta = task_local - first_local;
+    if (delta >= static_cast<uint32_t>(count)) {
+        return -1;
+    }
+
+    PTO2TaskSlotState *candidate = orch->scope_tasks[begin + static_cast<int32_t>(delta)];
+    if (candidate != nullptr && candidate->task->task_id == task_id) {
+        return static_cast<int32_t>(delta);
     }
     return -1;
 }

From 9f628e8bd221757daeaaf288bd9754f2fe98c380 Mon Sep 17 00:00:00 2001
From: Youwei Xiao
Date: Tue, 7 Apr 2026 16:01:53 +0800
Subject: [PATCH 18/35] Update: speed up manual scope edge replay

- dedupe explicit manual edges when they are recorded and keep an exact
  incoming edge count per consumer
- append local explicit producers directly at scope_end and skip
  lock/task_state checks for unpublished same-scope producers
- keep overflow validation and dep-pool publish ordering unchanged so the
  optimization stays within existing scheduler invariants
---
 .../runtime/pto_orchestrator.cpp | 81 +++++++++++++++----
 .../runtime/pto_orchestrator.h   |  1 +
 2 files changed, 68 insertions(+), 14 deletions(-)

diff --git a/src/a2a3/runtime/tensormap_and_ringbuffer/runtime/pto_orchestrator.cpp b/src/a2a3/runtime/tensormap_and_ringbuffer/runtime/pto_orchestrator.cpp
index 332c75f77..e1a0d9a08 100644
--- a/src/a2a3/runtime/tensormap_and_ringbuffer/runtime/pto_orchestrator.cpp
+++ b/src/a2a3/runtime/tensormap_and_ringbuffer/runtime/pto_orchestrator.cpp
@@ -639,27 +639,65 @@ void pto2_scope_end(PTO2OrchestratorState *orch) {
         for (int32_t i = 0; i < cached_inline_count; i++) {
             fanin_builder.inline_slots[i] = payload->fanin_inline_slot_states[i];
         }
+        int32_t local_edge_count = meta.incoming_edge_count;
+        int32_t fanin_count = cached_external_count + local_edge_count;
+
+        if (fanin_count > PTO2_MAX_INPUTS) {
+            LOG_ERROR("========================================");
+            LOG_ERROR("FATAL: Dependency Overflow Detected!");
+            LOG_ERROR("========================================");
+            LOG_ERROR("Task requires more than PTO2_MAX_INPUTS unique fanin dependencies.");
+            LOG_ERROR("  task_id.raw: %" PRIu64, task_id.raw);
+            LOG_ERROR("  fanin_count: %d / %d", fanin_count, PTO2_MAX_INPUTS);
+            LOG_ERROR("  reason: manual explicit dependency");
+            LOG_ERROR("This is a runtime dependency-tracking limit.");
+            LOG_ERROR("========================================");
+            orch->sm_handle->header->orch_error_code.store(PTO2_ERROR_DEPENDENCY_OVERFLOW, std::memory_order_release);
+            orch->fatal = true;
+            return;
+        }
+        // Explicit manual edges are deduped at record time, and current-scope
+        // producers never appear in cached_external_count because manual submit
+        // skips local owner/TensorMap discovery for those tensors.
+        auto append_local_fanin_or_fail = [&](PTO2TaskSlotState *prod_state) {
+            if (fanin_builder.count < PTO2_FANIN_INLINE_CAP) {
+                fanin_builder.inline_slots[fanin_builder.count++] = prod_state;
+                return true;
+            }
+
+            PTO2FaninPool &fanin_pool = *fanin_builder.spill_pool;
+            fanin_pool.ensure_space(*orch->scheduler, fc, ring_id, 1);
+            int32_t spill_idx = fanin_pool.top;
+            PTO2FaninSpillEntry *entry = fanin_pool.alloc();
+            if (entry == nullptr) {
+                orch->fatal = true;
+                return false;
+            }
+            if (fanin_builder.count == PTO2_FANIN_INLINE_CAP) {
+                fanin_builder.spill_start = spill_idx;
+            }
+            entry->slot_state = prod_state;
+            fanin_builder.count++;
+            return true;
+        };
         for (int32_t edge_idx = meta.incoming_edge_head; edge_idx >= 0;
              edge_idx = orch->manual_edges[edge_idx].next_consumer_edge) {
             const PTO2ManualEdge &edge = orch->manual_edges[edge_idx];
             PTO2TaskSlotState *prod_state = orch->scope_tasks[begin + edge.producer_idx];
-            if (!pto2_append_fanin_or_fail(
-                    orch, task_id, edge.consumer_idx, TensorArgType::INPUT, prod_state, &fanin_builder,
-                    orch->scheduler, fc, ring_id, "manual explicit dependency"
-                )) {
+            if (!append_local_fanin_or_fail(prod_state)) {
                 return;
             }
         }
 
-        int32_t fanin_count = fanin_builder.count;
-        int32_t inline_count = std::min(fanin_count, PTO2_FANIN_INLINE_CAP);
-        int32_t spill_count = fanin_count - inline_count;
-        dep_pool.ensure_space(*orch->scheduler, fc, ring_id, fanin_count + 1);
+        int32_t final_fanin_count = fanin_builder.count;
+        int32_t inline_count = std::min(final_fanin_count, PTO2_FANIN_INLINE_CAP);
+        int32_t spill_count = final_fanin_count - inline_count;
+        dep_pool.ensure_space(*orch->scheduler, fc, ring_id, final_fanin_count + 1);
 
         slot_state->task_state.store(PTO2_TASK_PENDING, std::memory_order_relaxed);
-        slot_state->fanin_count = fanin_count + 1;
-        payload->fanin_actual_count = fanin_count;
+        slot_state->fanin_count = final_fanin_count + 1;
+        payload->fanin_actual_count = final_fanin_count;
         payload->fanin_spill_start = (spill_count > 0) ? fanin_builder.spill_start : 0;
         payload->fanin_spill_pool = (spill_count > 0) ? fanin_builder.spill_pool : nullptr;
         for (int32_t i = 0; i < inline_count; i++) {
@@ -669,17 +707,16 @@ void pto2_scope_end(PTO2OrchestratorState *orch) {
         int32_t early_finished = 0;
         int32_t producer_index = 0;
         fanin_builder.for_each([&](PTO2TaskSlotState *producer_slot) {
+            if (producer_index >= cached_external_count) {
+                return false;
+            }
             PTO2TaskSlotState &producer_slot_state = *producer_slot;
-            bool cached_external = producer_index < cached_external_count;
             producer_index++;
 #if PTO2_ORCH_PROFILING
             pto2_fanout_lock(producer_slot_state, g_orch_fanin_atomic_count, g_orch_fanin_wait_cycle);
 #else
             pto2_fanout_lock(producer_slot_state);
 #endif
-            if (!cached_external) {
-                producer_slot_state.fanout_count += 1;
-            }
             int32_t prod_state = producer_slot_state.task_state.load(std::memory_order_acquire);
             if (prod_state >= PTO2_TASK_COMPLETED) {
                 early_finished++;
@@ -689,6 +726,14 @@ void pto2_scope_end(PTO2OrchestratorState *orch) {
             pto2_fanout_unlock(producer_slot_state);
             return true;
         });
+        for (int32_t edge_idx = meta.incoming_edge_head; edge_idx >= 0;
+             edge_idx = orch->manual_edges[edge_idx].next_consumer_edge) {
+            PTO2TaskSlotState &producer_slot_state = *orch->scope_tasks[begin + orch->manual_edges[edge_idx].producer_idx];
+            // Same-scope explicit producers are unpublished until the batch
+            // publish below, so no scheduler thread can race on fanout state.
+            producer_slot_state.fanout_count += 1;
+            producer_slot_state.fanout_head = dep_pool.prepend(producer_slot_state.fanout_head, slot_state);
+        }
         if (early_finished > 0) {
             slot_state->fanin_refcount.fetch_add(early_finished, std::memory_order_acq_rel);
         }
@@ -1148,6 +1193,7 @@ pto2_submit_mixed_task_manual(PTO2OrchestratorState *orch, const MixedKernels &m
     meta.slot_state = orch->scope_tasks[orch->scope_tasks_size - 1];
     meta.scope_task_index = orch->scope_tasks_size - 1 - current_manual_scope_begin(orch);
     meta.incoming_edge_head = -1;
+    meta.incoming_edge_count = 0;
     meta.tensor_count = static_cast<uint8_t>(args.tensor_count());
     meta.manual_local_mask = 0;
     for (int32_t i = 0; i < args.tensor_count(); i++) {
@@ -1188,6 +1234,12 @@ void pto2_add_dependency(PTO2OrchestratorState *orch, PTO2TaskId producer_id, PT
     int32_t meta_begin = orch->manual_task_meta_begins[orch->scope_stack_top];
     PTO2ManualTaskMeta &consumer_meta = orch->manual_task_meta[meta_begin + consumer_idx];
+    for (int32_t edge_idx = consumer_meta.incoming_edge_head; edge_idx >= 0;
+         edge_idx = orch->manual_edges[edge_idx].next_consumer_edge) {
+        if (orch->manual_edges[edge_idx].producer_idx == producer_idx) {
+            return;
+        }
+    }
     int32_t edge_idx = manual_edge_push(
         orch,
         PTO2ManualEdge{
         }
     );
     consumer_meta.incoming_edge_head = edge_idx;
+    consumer_meta.incoming_edge_count++;
 }
 
 // =============================================================================
diff --git a/src/a2a3/runtime/tensormap_and_ringbuffer/runtime/pto_orchestrator.h b/src/a2a3/runtime/tensormap_and_ringbuffer/runtime/pto_orchestrator.h
index 0a6090d27..833417aab 100644
--- a/src/a2a3/runtime/tensormap_and_ringbuffer/runtime/pto_orchestrator.h
+++ b/src/a2a3/runtime/tensormap_and_ringbuffer/runtime/pto_orchestrator.h
@@ -44,6 +44,7 @@ struct PTO2ManualTaskMeta {
     PTO2TaskSlotState *slot_state;
     int32_t scope_task_index;
     int32_t incoming_edge_head;
+    uint16_t incoming_edge_count;
     uint8_t tensor_count;
     uint16_t manual_local_mask;
     uint8_t tags[MAX_TENSOR_ARGS];

From c3e5951a00b3776400c45857f8038eeefef114ee Mon Sep 17 00:00:00 2001
From: Youwei Xiao
Date: Wed, 8 Apr 2026 15:33:43 +0800
Subject: [PATCH 19/35] Update: streamline partial-manual paged attention

- rewrite the non-unroll partial-manual example to use one manual scope
  per q tile
- move the hub allocation into the manual scope and serialize update
  tasks explicitly
- drop the chunked manual-scope pattern that inflated partial-manual
  orchestration cost
---
 .../orchestration/paged_attention_orch.cpp | 155 +++++++++---------
 1 file changed, 76 insertions(+), 79 deletions(-)

diff --git a/tests/st/a2a3/tensormap_and_ringbuffer/paged_attention_partial_manual/kernels/orchestration/paged_attention_orch.cpp b/tests/st/a2a3/tensormap_and_ringbuffer/paged_attention_partial_manual/kernels/orchestration/paged_attention_orch.cpp
index 5a0581cfa..6c52520b8 100644
--- a/tests/st/a2a3/tensormap_and_ringbuffer/paged_attention_partial_manual/kernels/orchestration/paged_attention_orch.cpp
+++ b/tests/st/a2a3/tensormap_and_ringbuffer/paged_attention_partial_manual/kernels/orchestration/paged_attention_orch.cpp
@@ -20,8 +20,6 @@
 #define FUNC_ONLINE_UPDATE 3
 #define FUNC_AIC_HUB 4
 #define FUNC_AIV_HUB 5
-#define N_MANUAL_CHUNK 4
-
 extern "C" {
 
 __attribute__((visibility("default"))) PTO2OrchestrationConfig
@@ -92,83 +90,82 @@ aicpu_orchestration_entry(const ChipStorageTaskArgs &orch_args, int orch_thread_
         Tensor qi = query.view(tile2d_shapes, qi_offsets);
         Tensor out_view = out.view(tile2d_shapes, out_view_offsets);
 
-        Arg params_inplace;
-        params_inplace.add_output(tile2d_ci);
-        params_inplace.add_output(scalar_ci);
-        params_inplace.add_output(scalar_ci);
-        TaskOutputTensors hub_outs = pto2_rt_submit_aiv_task(FUNC_AIV_HUB, params_inplace);
-        const Tensor &oi = hub_outs.get_ref(0);
-        const Tensor &li_update = hub_outs.get_ref(1);
-        const Tensor &mi_update = hub_outs.get_ref(2);
-
-        for (uint64_t bn = 0; bn < bn_this_batch; bn += N_MANUAL_CHUNK) {
-            uint64_t bn_end = std::min(bn + static_cast<uint64_t>(N_MANUAL_CHUNK), bn_this_batch);
-
-            PTO2_SCOPE(PTO2ScopeMode::MANUAL) {
-                for (uint64_t bn_local = bn; bn_local < bn_end; bn_local++) {
-                    uint64_t cur_block_idx = host_block_table[b_idx * block_num + bn_local];
-                    uint64_t valid_len = std::min(block_size, cur_seq - bn_local * block_size);
-
-                    uint32_t kv_shapes[2] = {
-                        static_cast<uint32_t>(block_size), static_cast<uint32_t>(head_dim)
-                    };
-                    uint32_t kv_offsets[2] = {static_cast<uint32_t>(cur_block_idx * block_size), 0};
-                    Tensor kj = key_cache.view(kv_shapes, kv_offsets);
-                    Tensor vj = value_cache.view(kv_shapes, kv_offsets);
-
-                    Arg params_qk;
-                    params_qk.add_input(qi);
-                    params_qk.add_input(kj);
-                    params_qk.add_output(sij_ci);
-                    PTO2ManualSubmitResult qk_outs = pto2_rt_submit_aic_task_manual(FUNC_QK_MATMUL, params_qk);
-                    const Tensor &sij = qk_outs.outputs.get_ref(0);
-
-                    uint32_t sij_valid_shapes[2] = {
-                        static_cast<uint32_t>(q_tile), static_cast<uint32_t>(valid_len)
-                    };
-                    uint32_t sij_valid_offsets[2] = {0, 0};
-                    Tensor sij_valid = sij.view(sij_valid_shapes, sij_valid_offsets);
-
-                    Arg params_sf;
-                    params_sf.add_input(sij_valid);
-                    params_sf.add_output(pij_f16_ci);
-                    params_sf.add_output(scalar_ci);
-                    params_sf.add_output(scalar_ci);
-                    params_sf.add_scalar(scale_value);
-                    PTO2ManualSubmitResult sf_outs =
-                        pto2_rt_submit_aiv_task_manual(FUNC_SOFTMAX_PREPARE, params_sf);
-                    const Tensor &pij_f16 = sf_outs.outputs.get_ref(0);
-                    const Tensor &mi = sf_outs.outputs.get_ref(1);
-                    const Tensor &li = sf_outs.outputs.get_ref(2);
-
-                    Arg params_pv;
-                    params_pv.add_input(pij_f16);
-                    params_pv.add_input(vj);
-                    params_pv.add_output(tile2d_ci);
-                    PTO2ManualSubmitResult pv_outs = pto2_rt_submit_aic_task_manual(FUNC_PV_MATMUL, params_pv);
-                    const Tensor &oi_tmp = pv_outs.outputs.get_ref(0);
-
-                    uint64_t is_first = (bn_local == 0) ? 1 : 0;
-                    uint64_t is_last = (bn_local == bn_this_batch - 1) ? 1 : 0;
-
-                    Arg params_up;
-                    params_up.add_input(mi);
-                    params_up.add_input(li);
-                    params_up.add_input(oi_tmp);
-                    params_up.add_inout(mi_update);
-                    params_up.add_inout(li_update);
-                    params_up.add_inout(oi);
-                    params_up.add_inout(out_view);
-                    params_up.add_scalar(is_first);
-                    params_up.add_scalar(is_last);
-                    PTO2ManualSubmitResult up_outs =
-                        pto2_rt_submit_aiv_task_manual(FUNC_ONLINE_UPDATE, params_up);
-
-                    pto2_rt_add_dependency(qk_outs.task_id, sf_outs.task_id);
-                    pto2_rt_add_dependency(sf_outs.task_id, pv_outs.task_id);
-                    pto2_rt_add_dependency(sf_outs.task_id, up_outs.task_id);
-                    pto2_rt_add_dependency(pv_outs.task_id, up_outs.task_id);
-                }
+        PTO2_SCOPE(PTO2ScopeMode::MANUAL) {
+            Arg params_inplace;
+            params_inplace.add_output(tile2d_ci);
+            params_inplace.add_output(scalar_ci);
+            params_inplace.add_output(scalar_ci);
+            PTO2ManualSubmitResult hub_outs = pto2_rt_submit_aiv_task_manual(FUNC_AIV_HUB, params_inplace);
+            const Tensor &oi = hub_outs.outputs.get_ref(0);
+            const Tensor &li_update = hub_outs.outputs.get_ref(1);
+            const Tensor &mi_update = hub_outs.outputs.get_ref(2);
+            PTO2TaskId prev_update_task = hub_outs.task_id;
+
+            for (uint64_t bn = 0; bn < bn_this_batch; bn++) {
+                uint64_t cur_block_idx = host_block_table[b_idx * block_num + bn];
+                uint64_t valid_len = std::min(block_size, cur_seq - bn * block_size);
+
+                uint32_t kv_shapes[2] = {
+                    static_cast<uint32_t>(block_size), static_cast<uint32_t>(head_dim)
+                };
+                uint32_t kv_offsets[2] = {static_cast<uint32_t>(cur_block_idx * block_size), 0};
+                Tensor kj = key_cache.view(kv_shapes, kv_offsets);
+                Tensor vj = value_cache.view(kv_shapes, kv_offsets);
+
+                Arg params_qk;
+                params_qk.add_input(qi);
+                params_qk.add_input(kj);
+                params_qk.add_output(sij_ci);
+                PTO2ManualSubmitResult qk_outs = pto2_rt_submit_aic_task_manual(FUNC_QK_MATMUL, params_qk);
+                const Tensor &sij = qk_outs.outputs.get_ref(0);
+
+                uint32_t sij_valid_shapes[2] = {
+                    static_cast<uint32_t>(q_tile), static_cast<uint32_t>(valid_len)
+                };
+                uint32_t sij_valid_offsets[2] = {0, 0};
+                Tensor sij_valid = sij.view(sij_valid_shapes, sij_valid_offsets);
+
+                Arg params_sf;
+                params_sf.add_input(sij_valid);
+                params_sf.add_output(pij_f16_ci);
+                params_sf.add_output(scalar_ci);
+                params_sf.add_output(scalar_ci);
+                params_sf.add_scalar(scale_value);
+                PTO2ManualSubmitResult sf_outs =
+                    pto2_rt_submit_aiv_task_manual(FUNC_SOFTMAX_PREPARE, params_sf);
+                const Tensor &pij_f16 = sf_outs.outputs.get_ref(0);
+                const Tensor &mi = sf_outs.outputs.get_ref(1);
+                const Tensor &li = sf_outs.outputs.get_ref(2);
+
+                Arg params_pv;
+                params_pv.add_input(pij_f16);
+                params_pv.add_input(vj);
+                params_pv.add_output(tile2d_ci);
+                PTO2ManualSubmitResult pv_outs = pto2_rt_submit_aic_task_manual(FUNC_PV_MATMUL, params_pv);
+                const Tensor &oi_tmp = pv_outs.outputs.get_ref(0);
+
+                uint64_t is_first = (bn == 0) ? 1 : 0;
+                uint64_t is_last = (bn == bn_this_batch - 1) ? 1 : 0;
+
+                Arg params_up;
+                params_up.add_input(mi);
+                params_up.add_input(li);
+                params_up.add_input(oi_tmp);
+                params_up.add_inout(mi_update);
+                params_up.add_inout(li_update);
+                params_up.add_inout(oi);
+                params_up.add_inout(out_view);
+                params_up.add_scalar(is_first);
+                params_up.add_scalar(is_last);
+                PTO2ManualSubmitResult up_outs =
+                    pto2_rt_submit_aiv_task_manual(FUNC_ONLINE_UPDATE, params_up);
+
+                pto2_rt_add_dependency(qk_outs.task_id, sf_outs.task_id);
+                pto2_rt_add_dependency(sf_outs.task_id, pv_outs.task_id);
+                pto2_rt_add_dependency(sf_outs.task_id, up_outs.task_id);
+                pto2_rt_add_dependency(pv_outs.task_id, up_outs.task_id);
+                pto2_rt_add_dependency(prev_update_task, up_outs.task_id);
+                prev_update_task = up_outs.task_id;
             }
         }
     }

From 77f548f15b15672589e009acf5dd9a67743a0a58 Mon Sep 17 00:00:00 2001
From: Youwei Xiao
Date: Wed, 8 Apr 2026 15:49:44 +0800
Subject: [PATCH 20/35] Update: mark partial-manual paged attention boundaries

- mark external inputs and outputs as manual-dep boundaries in the
  non-unroll partial-manual paged attention example
- skip repeated overlap tracking for query, kv-cache, and final output
  views that are already ordered by the explicit manual dependency chain
- keep the manual-scope methodology while improving orch time on real
  device for both paged-attention cases
---
 .../orchestration/paged_attention_orch.cpp | 16 ++++++++--------
 1 file changed, 8 insertions(+), 8 deletions(-)

diff --git a/tests/st/a2a3/tensormap_and_ringbuffer/paged_attention_partial_manual/kernels/orchestration/paged_attention_orch.cpp b/tests/st/a2a3/tensormap_and_ringbuffer/paged_attention_partial_manual/kernels/orchestration/paged_attention_orch.cpp
index 6c52520b8..51b2bad7c 100644
--- a/tests/st/a2a3/tensormap_and_ringbuffer/paged_attention_partial_manual/kernels/orchestration/paged_attention_orch.cpp
+++ b/tests/st/a2a3/tensormap_and_ringbuffer/paged_attention_partial_manual/kernels/orchestration/paged_attention_orch.cpp
@@ -62,10 +62,10 @@ aicpu_orchestration_entry(const ChipStorageTaskArgs &orch_args, int orch_thread_
         static_cast<uint32_t>(total_blocks_count * block_size), static_cast<uint32_t>(head_dim)
     };
     uint32_t out_shapes[2] = {static_cast<uint32_t>(batch * num_heads), static_cast<uint32_t>(head_dim)};
-    Tensor query = make_tensor_external(query_ptr, query_shapes, 2, data_type);
-    Tensor key_cache = make_tensor_external(kc_ptr, key_cache_shapes, 2, data_type);
-    Tensor value_cache = make_tensor_external(vc_ptr, value_cache_shapes, 2, data_type);
-    Tensor out = make_tensor_external(out_ptr, out_shapes, 2, DataType::FLOAT32);
+    Tensor query = make_tensor_external(query_ptr, query_shapes, 2, data_type, true);
+    Tensor key_cache = make_tensor_external(kc_ptr, key_cache_shapes, 2, data_type, true);
+    Tensor value_cache = make_tensor_external(vc_ptr, value_cache_shapes, 2, data_type, true);
+    Tensor out = make_tensor_external(out_ptr, out_shapes, 2, DataType::FLOAT32, true);
 
     int *host_block_table = orch_args.tensor(3).data_as<int>();
     int *host_context_lens = orch_args.tensor(4).data_as<int>();
@@ -87,8 +87,8 @@ aicpu_orchestration_entry(const ChipStorageTaskArgs &orch_args, int orch_thread_
         uint32_t qi_offsets[2] = {static_cast<uint32_t>(cur_offset), 0};
         uint32_t out_view_offsets[2] = {static_cast<uint32_t>(cur_offset), 0};
 
-        Tensor qi = query.view(tile2d_shapes, qi_offsets);
-        Tensor out_view = out.view(tile2d_shapes, out_view_offsets);
+        Tensor qi = query.view(tile2d_shapes, qi_offsets, true);
+        Tensor out_view = out.view(tile2d_shapes, out_view_offsets, true);
 
         PTO2_SCOPE(PTO2ScopeMode::MANUAL) {
             Arg params_inplace;
@@ -109,8 +109,8 @@ aicpu_orchestration_entry(const ChipStorageTaskArgs &orch_args, int orch_thread_
                     static_cast<uint32_t>(block_size), static_cast<uint32_t>(head_dim)
                 };
                 uint32_t kv_offsets[2] = {static_cast<uint32_t>(cur_block_idx * block_size), 0};
-                Tensor kj = key_cache.view(kv_shapes, kv_offsets);
-                Tensor vj = value_cache.view(kv_shapes, kv_offsets);
+                Tensor kj = key_cache.view(kv_shapes, kv_offsets, true);
+                Tensor vj = value_cache.view(kv_shapes, kv_offsets, true);
 
                 Arg params_qk;
                 params_qk.add_input(qi);

From c1722c14d22276f83c595c986d8a9548d5c28bf9 Mon Sep 17 00:00:00 2001
From: Youwei Xiao
Date: Wed, 8 Apr 2026 15:52:00 +0800
Subject: [PATCH 21/35] Update: refresh manual-dep benchmark findings

- replace the stale benchmark section with fresh 2026-04-08 device-2
  results
- document how the benchmark wrapper selects new, unmodified, and
  partial-manual variants
- record that partial-manual improved on non-unroll but still misses the
  aicpu_build_graph target
---
 docs/manual-dep-for-tensormap-design.md | 113 ++++++++++++++--------
 1 file changed, 75 insertions(+), 38 deletions(-)

diff --git a/docs/manual-dep-for-tensormap-design.md b/docs/manual-dep-for-tensormap-design.md
index 7057f5211..c01132e0f 100644
--- a/docs/manual-dep-for-tensormap-design.md
+++ b/docs/manual-dep-for-tensormap-design.md
@@ -981,11 +981,11 @@ Only then move to more complex orchestration such as paged attention.
 
 ## Fresh Hardware Benchmark
 
-Fresh benchmark data was rerun on real hardware on 2026-04-06 with:
+Fresh benchmark data was rerun on real hardware on 2026-04-08 with:
 
 - platform: `a2a3`
 - device: `2`
-- rounds: `5`
+- rounds: `10`
 - pinned PTO-ISA commit: `6622890`
 - runner: `tools/benchmark_rounds.sh`
 
@@ -1004,23 +1004,29 @@ through the `_partial_manual` paged-attention scenes.
 
 The benchmark wrapper enables the variants as follows:
 
-- `./tools/benchmark_rounds.sh -r tensormap_and_ringbuffer`
-  - benchmarks `tests/st/a2a3/tensormap_and_ringbuffer/paged_attention`
-  - benchmarks `tests/st/a2a3/tensormap_and_ringbuffer/paged_attention_unroll`
-- `./tools/benchmark_rounds.sh -r tensormap_and_ringbuffer_unmodified`
+- `./tools/benchmark_rounds.sh -d 2 -n 10 -r aicpu_build_graph -c 6622890`
+  - benchmarks `tests/st/a2a3/aicpu_build_graph/paged_attention`
+  - benchmarks `tests/st/a2a3/aicpu_build_graph/paged_attention_unroll`
+- `./tools/benchmark_rounds.sh -d 2 -n 10 -r tensormap_and_ringbuffer_unmodified -c 6622890`
   - benchmarks `tests/st/a2a3/tensormap_and_ringbuffer_unmodified/paged_attention`
   - benchmarks `tests/st/a2a3/tensormap_and_ringbuffer_unmodified/paged_attention_unroll`
-- `./tools/benchmark_rounds.sh -r tensormap_and_ringbuffer_partial_manual`
+- `./tools/benchmark_rounds.sh -d 2 -n 10 -r tensormap_and_ringbuffer -c 6622890`
+  - benchmarks `tests/st/a2a3/tensormap_and_ringbuffer/paged_attention`
+  - benchmarks `tests/st/a2a3/tensormap_and_ringbuffer/paged_attention_unroll`
+- `./tools/benchmark_rounds.sh -d 2 -n 10 -r tensormap_and_ringbuffer_partial_manual -c 6622890`
   - uses the same ST root as `tensormap_and_ringbuffer`
   - benchmarks `tests/st/a2a3/tensormap_and_ringbuffer/paged_attention_partial_manual`
   - benchmarks `tests/st/a2a3/tensormap_and_ringbuffer/paged_attention_unroll_partial_manual`
-- `./tools/benchmark_rounds.sh -r aicpu_build_graph`
-  - benchmarks `tests/st/a2a3/aicpu_build_graph/paged_attention`
-  - benchmarks `tests/st/a2a3/aicpu_build_graph/paged_attention_unroll`
 
-That means the current/new AUTO runtime is already selected directly by `-r
-tensormap_and_ringbuffer`, while the manual-mode comparison is selected by `-r
-tensormap_and_ringbuffer_partial_manual`.
+There is no separate runtime named `partial_manual` in the example `kernel_config.py`.
+The partial-manual scenes still declare `RUNTIME_CONFIG["runtime"] =
+"tensormap_and_ringbuffer"`, and the benchmark wrapper switches to the
+`*_partial_manual` scene directories when `-r tensormap_and_ringbuffer_partial_manual`
+is selected.
+
+Similarly, the current/new AUTO-path runtime is enabled directly by `-r
+tensormap_and_ringbuffer`, while the copied side-by-side baseline is enabled by `-r
+tensormap_and_ringbuffer_unmodified`.
 
 ### Fresh Results
 
@@ -1028,39 +1034,66 @@ Units below are `elapsed_us (orch_us)`.
 
 | Workload | Case | `aicpu_build_graph` | `tensormap_and_ringbuffer_unmodified` | `tensormap_and_ringbuffer` | `tensormap_and_ringbuffer_partial_manual` |
 | --- | --- | --- | --- | --- | --- |
-| `paged_attention` | `Case1` | `31864.8 (-)` | `36218.5 (36217.7)` | `36643.8 (36643.0)` | `41611.6 (41605.3)` |
-| `paged_attention` | `Case2` | `16295.2 (-)` | `18325.8 (18324.7)` | `18873.5 (18872.6)` | `21047.6 (20983.7)` |
-| `paged_attention_unroll` | `Case1` | `1431.7 (-)` | `1321.8 (815.7)` | `1317.2 (833.3)` | `1329.8 (946.2)` |
-| `paged_attention_unroll` | `Case2` | `716.7 (-)` | `629.6 (388.7)` | `633.5 (371.4)` | `649.1 (429.3)` |
+| `paged_attention` | `Case1` | `31719.8 (-)` | `37161.3 (37160.6)` | `36368.0 (36367.3)` | `34989.4 (34830.0)` |
+| `paged_attention` | `Case2` | `16922.9 (-)` | `18913.1 (18912.4)` | `18951.0 (18950.4)` | `18191.2 (17785.0)` |
+| `paged_attention_unroll` | `Case1` | `1403.7 (-)` | `1318.7 (854.1)` | `1320.9 (830.9)` | `1324.0 (878.9)` |
+| `paged_attention_unroll` | `Case2` | `692.0 (-)` | `623.5 (366.2)` | `637.8 (389.2)` | `639.2 (398.8)` |
 
 ### Benchmark Takeaways
 
-1. The current/new AUTO-path runtime is close to the unmodified runtime on three of
-   the four fresh cells.
-   - `paged_attention_unroll/Case1`: slightly faster end-to-end, slightly slower in orch
-   - `paged_attention_unroll/Case2`: slightly slower end-to-end, slightly faster in orch
+1. The example rewrite improved the non-unroll partial-manual scene enough to move it
+   back into the expected ordering on both paged-attention cases.
+   - `paged_attention/Case1`: `aicpu_build_graph` < `partial_manual` < `tensormap*`
+   - `paged_attention/Case2`: `aicpu_build_graph` < `partial_manual` < `tensormap*`
 
-2. The remaining AUTO-path gap is the heavy `paged_attention/Case1` cell.
-   - unmodified: `36218.5 us`
-   - current/new: `36643.8 us`
-   - regression: about `+1.2%`
+2. The non-unroll target is still not met.
+   - target cell: `paged_attention/Case1`
+   - `aicpu_build_graph`: `31719.8 us`
+   - `partial_manual`: `34989.4 us`
+   - remaining gap: about `+10.3%`
 
-3. `paged_attention/Case2` still shows a measurable AUTO-path regression.
-   - unmodified: `18325.8 us`
-   - current/new: `18873.5 us`
-   - regression: about `+3.0%`
+3. Partial-manual is now clearly better than both tensormap AUTO variants on the
+   non-unroll scene, but not yet equal to `aicpu_build_graph`.
+   - `paged_attention/Case1`: about `-5.8%` vs unmodified, about `-3.8%` vs current/new
+   - `paged_attention/Case2`: about `-3.8%` vs unmodified, about `-4.0%` vs current/new
 
-4. Partial-manual mode is still materially slower on the heavy paged-attention scene.
-   - `paged_attention/Case1`: about `+14.9%` vs unmodified
-   - `paged_attention/Case2`: about `+14.8%` vs unmodified
+4. The current/new AUTO runtime no longer looks meaningfully worse than the copied
+   unmodified baseline on the fresh device-2 rerun.
+   - `paged_attention/Case1`: current/new is about `-2.1%` faster
+   - `paged_attention/Case2`: current/new is effectively tied (`+0.2%`)
+   - `paged_attention_unroll`: current/new stays within about `+2%` end-to-end
 
-5. Partial-manual mode also adds visible orch cost on the unroll scene even when
-   elapsed stays close.
-   - `paged_attention_unroll/Case1`: `946.2 us` orch vs `815.7 us` unmodified
-   - `paged_attention_unroll/Case2`: `429.3 us` orch vs `388.7 us` unmodified
+5. On the unroll scene, all three tensormap-family runtimes stay faster than
+   `aicpu_build_graph` end-to-end, but partial-manual still has the highest orch cost
+   among the tensormap variants.
+   - `paged_attention_unroll/Case1`: `878.9 us` orch vs `830.9 us` current/new
+   - `paged_attention_unroll/Case2`: `398.8 us` orch vs `389.2 us` current/new
 
-6. `aicpu_build_graph` remains fastest on the heavy `paged_attention` scene, but it is
-   slower than the tensormap runtimes on both `paged_attention_unroll` cells.
+6. The remaining performance problem is specifically the heavy non-unroll partial-manual
+   `scope_end` path, not a broad collapse of the AUTO-path runtime.
+
+### Boundary Annotation Note
+
+There is still no explicit “scope arguments” API in
+`src/a2a3/runtime/tensormap_and_ringbuffer/orchestration/pto_orchestration_api.h`.
+
+The closest current mechanism is per-tensor `manual_dep=true` on
+`make_tensor_external(...)` and derived `view(...)` objects. That mechanism is not a
+good substitute for scope-boundary declaration:
+
+- it is tensor-local, not scope-local
+- it suppresses TensorMap lookup/insert for that tensor
+- it can easily hide an output frontier that the boundary semantics still need
+
+For the paged-attention partial-manual example, the stable improvement came from the
+orchestration rewrite itself:
+
+- one manual scope per `q_idx`
+- move `AIV_HUB` creation into the manual scope
+- add an explicit `prev_update_task -> up_outs.task_id` chain
+
+The `manual_dep=true` boundary-hint experiments were not kept in the example because
+they did not produce a robust win across the non-unroll scene.
 
 ## Main Risks
 
@@ -1101,6 +1134,10 @@ Units below are `elapsed_us (orch_us)`.
 - This would duplicate fanin/fanout handling, completion notification, and release traversal.
 - The design requires one unified post-publish scheduler mechanism.
 
+13. Using `manual_dep=true` as a fake scope-boundary annotation.
+- This can suppress TensorMap work that is still required for cross-scope correctness.
+- It also creates unstable performance results because the flag is tensor-local, not scope-local.
+
 ## Dangerous Risks For The Submit/Scope-End Split
 
 The implementation should explicitly guard the following failure modes before any

From 3e34a71e27c13945c05857038a46749ab14 Mon Sep 17 00:00:00 2001
From: Youwei Xiao
Date: Wed, 8 Apr 2026 15:55:57 +0800
Subject: [PATCH 22/35] Update: refresh manual-dep benchmark findings

- replace stale benchmark data in the design doc with fresh device-3
  measurements for the four paged-attention runtime lanes
- document how benchmark_rounds.sh selects the partial-manual scenes
- record the non-unroll boundary-hint A/B results and the safety limits
  of using manual_dep=true as an example-level boundary annotation
---
 docs/manual-dep-for-tensormap-design.md | 120 ++++++++++++++----
 1 file changed, 71 insertions(+), 49 deletions(-)

diff --git a/docs/manual-dep-for-tensormap-design.md b/docs/manual-dep-for-tensormap-design.md
index c01132e0f..e96029942 100644
--- a/docs/manual-dep-for-tensormap-design.md
+++ b/docs/manual-dep-for-tensormap-design.md
@@ -984,7 +984,7 @@ Only then move to more complex orchestration such as paged attention.
 Fresh benchmark data was rerun on real hardware on 2026-04-08 with:
 
 - platform: `a2a3`
-- device: `2`
+- device: `3`
 - rounds: `10`
 - pinned PTO-ISA commit: `6622890`
 - runner: `tools/benchmark_rounds.sh`
@@ -1004,16 +1004,16 @@ through the `_partial_manual` paged-attention scenes.
 
 The benchmark wrapper enables the variants as follows:
 
-- `./tools/benchmark_rounds.sh -d 2 -n 10 -r aicpu_build_graph -c 6622890`
+- `./tools/benchmark_rounds.sh -d 3 -n 10 -r aicpu_build_graph -c 6622890`
   - benchmarks `tests/st/a2a3/aicpu_build_graph/paged_attention`
   - benchmarks `tests/st/a2a3/aicpu_build_graph/paged_attention_unroll`
-- `./tools/benchmark_rounds.sh -d 2 -n 10 -r tensormap_and_ringbuffer_unmodified -c 6622890`
+- `./tools/benchmark_rounds.sh -d 3 -n 10 -r tensormap_and_ringbuffer_unmodified -c 6622890`
   - benchmarks `tests/st/a2a3/tensormap_and_ringbuffer_unmodified/paged_attention`
   - benchmarks `tests/st/a2a3/tensormap_and_ringbuffer_unmodified/paged_attention_unroll`
-- `./tools/benchmark_rounds.sh -d 2 -n 10 -r tensormap_and_ringbuffer -c 6622890`
+- `./tools/benchmark_rounds.sh -d 3 -n 10 -r tensormap_and_ringbuffer -c 6622890`
   - benchmarks `tests/st/a2a3/tensormap_and_ringbuffer/paged_attention`
   - benchmarks `tests/st/a2a3/tensormap_and_ringbuffer/paged_attention_unroll`
-- `./tools/benchmark_rounds.sh -d 2 -n 10 -r tensormap_and_ringbuffer_partial_manual -c 6622890`
+- `./tools/benchmark_rounds.sh -d 3 -n 10 -r tensormap_and_ringbuffer_partial_manual -c 6622890`
   - uses the same ST root as `tensormap_and_ringbuffer`
   - benchmarks `tests/st/a2a3/tensormap_and_ringbuffer/paged_attention_partial_manual`
   - benchmarks `tests/st/a2a3/tensormap_and_ringbuffer/paged_attention_unroll_partial_manual`
@@ -1034,43 +1034,41 @@ Units below are `elapsed_us (orch_us)`.
 
 | Workload | Case | `aicpu_build_graph` | `tensormap_and_ringbuffer_unmodified` | `tensormap_and_ringbuffer` | `tensormap_and_ringbuffer_partial_manual` |
 | --- | --- | --- | --- | --- | --- |
-| `paged_attention` | `Case1` | `31719.8 (-)` | `37161.3 (37160.6)` | `36368.0 (36367.3)` | `34989.4 (34830.0)` |
-| `paged_attention` | `Case2` | `16922.9 (-)` | `18913.1 (18912.4)` | `18951.0 (18950.4)` | `18191.2 (17785.0)` |
-| `paged_attention_unroll` | `Case1` | `1403.7 (-)` | `1318.7 (854.1)` | `1320.9 (830.9)` | `1324.0 (878.9)` |
-| `paged_attention_unroll` | `Case2` | `692.0 (-)` | `623.5 (366.2)` | `637.8 (389.2)` | `639.2 (398.8)` |
+| `paged_attention` | `Case1` | `31318.9 (-)` | `35367.3 (35366.7)` | `36996.3 (36995.6)` | `35187.6 (35030.2)` |
+| `paged_attention` | `Case2` | `16844.5 (-)` | `19739.8 (19736.1)` | `19861.8 (19856.8)` | `18685.5 (18274.5)` |
+| `paged_attention_unroll` | `Case1` | `1412.7 (-)` | `1321.7 (841.6)` | `1323.9 (831.3)` | `1321.3 (884.4)` |
+| `paged_attention_unroll` | `Case2` | `705.5 (-)` | `628.1 (381.6)` | `632.5 (378.9)` | `637.5 (406.4)` |
 
 ### Benchmark Takeaways
 
-1. The example rewrite improved the non-unroll partial-manual scene enough to move it
-   back into the expected ordering on both paged-attention cases.
-   - `paged_attention/Case1`: `aicpu_build_graph` < `partial_manual` < `tensormap*`
-   - `paged_attention/Case2`: `aicpu_build_graph` < `partial_manual` < `tensormap*`
-
-2. The non-unroll target is still not met.
+1. The non-unroll target is still not met.
    - target cell: `paged_attention/Case1`
-   - `aicpu_build_graph`: `31719.8 us`
-   - `partial_manual`: `34989.4 us`
-   - remaining gap: about `+10.3%`
-
-3. Partial-manual is now clearly better than both tensormap AUTO variants on the
-   non-unroll scene, but not yet equal to `aicpu_build_graph`.
-   - `paged_attention/Case1`: about `-5.8%` vs unmodified, about `-3.8%` vs current/new
-   - `paged_attention/Case2`: about `-3.8%` vs unmodified, about `-4.0%` vs current/new
-4. 
The current/new AUTO runtime no longer looks meaningfully worse than the copied - unmodified baseline on the fresh device-2 rerun. - - `paged_attention/Case1`: current/new is about `-2.1%` faster - - `paged_attention/Case2`: current/new is effectively tied (`+0.2%`) - - `paged_attention_unroll`: current/new stays within about `+2%` end-to-end - -5. On the unroll scene, all three tensormap-family runtimes stay faster than - `aicpu_build_graph` end-to-end, but partial-manual still has the highest orch cost - among the tensormap variants. - - `paged_attention_unroll/Case1`: `878.9 us` orch vs `830.9 us` current/new - - `paged_attention_unroll/Case2`: `398.8 us` orch vs `389.2 us` current/new - -6. The remaining performance problem is specifically the heavy non-unroll partial-manual - `scope_end` path, not a broad collapse of the AUTO-path runtime. + - `aicpu_build_graph`: `31318.9 us` + - `partial_manual`: `35187.6 us` + - remaining gap: about `+12.4%` + +2. Partial-manual now improves the modified/current tensormap AUTO runtime on both + non-unroll paged-attention cases, but it does not beat `aicpu_build_graph`. + - `paged_attention/Case1`: about `-4.9%` elapsed, about `-5.3%` orch vs current/new + - `paged_attention/Case2`: about `-5.9%` elapsed, about `-8.0%` orch vs current/new + +3. Partial-manual is not yet consistently better than the copied unmodified baseline. + - `paged_attention/Case1`: about `-0.5%` elapsed, about `-1.0%` orch vs unmodified + - `paged_attention/Case2`: about `+5.2%` elapsed, about `+7.8%` orch vs unmodified + +4. The current/new AUTO runtime is still slower than the copied unmodified runtime on + the non-unroll scene. + - `paged_attention/Case1`: about `+4.6%` elapsed, about `+4.6%` orch + - `paged_attention/Case2`: about `+0.6%` elapsed, about `+0.6%` orch + +5. 
On the unroll scene, all three tensormap-family runtimes remain faster than + `aicpu_build_graph` end-to-end, but partial-manual stays slightly worse than both + AUTO tensormap variants in orch time. + - `paged_attention_unroll/Case1`: `884.4 us` orch vs `841.6 us` unmodified and `831.3 us` current/new + - `paged_attention_unroll/Case2`: `406.4 us` orch vs `381.6 us` unmodified and `378.9 us` current/new + +6. The remaining performance problem is still concentrated in the non-unroll + partial-manual path, especially the replay/publish cost paid at manual `scope_end`. ### Boundary Annotation Note @@ -1083,17 +1081,41 @@ good substitute for scope-boundary declaration: - it is tensor-local, not scope-local - it suppresses TensorMap lookup/insert for that tensor -- it can easily hide an output frontier that the boundary semantics still need - -For the paged-attention partial-manual example, the stable improvement came from the -orchestration rewrite itself: - -- one manual scope per `q_idx` -- move `AIV_HUB` creation into the manual scope -- add an explicit `prev_update_task -> up_outs.task_id` chain - -The `manual_dep=true` boundary-hint experiments were not kept in the example because -they did not produce a robust win across the non-unroll scene. 
+- if used carelessly, it can hide an output frontier that boundary semantics still need + +For the committed non-unroll partial-manual paged-attention example, the stable +improvement came from two pieces together: + +- the orchestration rewrite already present in the example: + - one manual scope per `q_idx` + - move `AIV_HUB` creation into the manual scope + - add an explicit `prev_update_task -> up_outs.task_id` chain +- explicit `manual_dep=true` boundary hints on the external inputs and external output + views in the example itself + +Fresh device-3 measurements for the non-unroll partial-manual example were: + +| Variant | `paged_attention/Case1` orch | `paged_attention/Case2` orch | +| --- | --- | --- | +| rewrite only, no boundary hints | `36791.9 us` | `19792.7 us` | +| rewrite + input-side hints | `36752.3 us` | `19296.7 us` | +| rewrite + input/output hints | `35068.7 us` | `17668.2 us` | + +So the important hint is not just the external producers (`query`, `key_cache`, +`value_cache`), but also the external consumer path through `out` / `out_view`. +That is consistent with the current runtime behavior: + +- `manual_dep=true` skips TensorMap overlap lookup/insert but still keeps creator + retention through `owner_task_id` +- the explicit `prev_update_task` chain already serializes same-scope `ONLINE_UPDATE` + writes +- marking `out` / `out_view` `manual_dep=true` avoids paying repeated external-output + overlap tracking on every block update when that ordering is already explicit + +This is still not a general “scope arguments” API. It is an example-local optimization +that is only safe when the manual scope already carries the same-scope write ordering +explicitly and there is no same-scope external consumer that depends on TensorMap +publication before manual `scope_end`. 
## Main Risks From 97e0242c252c0ea45490d4ed7c6d5b9535361472 Mon Sep 17 00:00:00 2001 From: Youwei Xiao Date: Wed, 8 Apr 2026 16:39:28 +0800 Subject: [PATCH 23/35] Update: clarify manual-scope dependency model - explain submit-time versus scope-end work in the manual-scope path - document how in-scope and cross-scope tensors are classified and handled - bind the kept example optimizations to measured orch gains and remove stale scope-end frontier wording --- docs/manual-dep-for-tensormap-design.md | 148 +++++++++++++++++++++--- 1 file changed, 135 insertions(+), 13 deletions(-) diff --git a/docs/manual-dep-for-tensormap-design.md b/docs/manual-dep-for-tensormap-design.md index e96029942..a236cde89 100644 --- a/docs/manual-dep-for-tensormap-design.md +++ b/docs/manual-dep-for-tensormap-design.md @@ -165,6 +165,89 @@ Concise conclusion: - both become the same scheduler edges before publish - execution uses only the scheduler edge machinery, not TensorMap +## Implemented Manual-Scope Algorithm + +The current implementation is a submit/scope-end split: + +- manual submit still allocates task ids, task slots, and payloads immediately +- manual submit still does boundary producer discovery for cross-scope tensors +- manual submit still updates TensorMap frontier for cross-scope writes +- manual submit does not publish tasks to the scheduler ready graph +- manual `scope_end` replays only explicit same-scope edges plus cached external fanins, then batch-publishes tasks + +### How tensor arguments are handled + +The runtime decision is per tensor argument, not per scope: + +| Tensor use in a manual-scope task | How dependency is found | Uses TensorMap? 
| What must be maintained | +| --- | --- | --- | --- | +| tensor created in the current manual scope, then reused in the current manual scope | explicit `add_dependency` only | no | recorded manual edge in scope-local edge buffer | +| outer/external `INPUT` | creator retention plus overlap lookup | yes, at manual submit unless `manual_dep=true` | cached external producer set in task payload | +| outer/external `INOUT` | creator retention plus overlap lookup for incoming state | yes, at manual submit unless `manual_dep=true` | cached external producer set plus updated writer frontier | +| outer/external `OUTPUT_EXISTING` | creator retention only for incoming owner, no overlap lookup | yes for outgoing frontier update unless `manual_dep=true` | updated writer frontier | +| runtime-created `OUTPUT` inside manual scope | no incoming dependency | no immediate lookup | `owner_task_id` on produced tensor so later users can classify it as manual-local | + +`TensorArgType` matters here: + +- `INPUT`: read-only; needs incoming producer discovery, no outgoing frontier update +- `INOUT`: read old value and write new value; needs both incoming producer discovery and outgoing frontier update +- `OUTPUT_EXISTING`: overwrite an existing outer buffer; does not need overlap lookup for an old modifier chain, but still needs outgoing frontier update +- `OUTPUT`: fresh runtime allocation; has no incoming dependency and becomes manual-local to later tasks in the same manual scope + +### What manual submit iterates + +For each submitted task in a manual scope, the runtime iterates tensor args in submit order: + +1. Allocate task id, slot state, and payload immediately. +2. For each tensor arg, classify it as manual-local or outer/external from `owner_task_id` plus current manual-scope ownership. +3. For manual-local tensors: + - skip creator-retention wiring + - skip TensorMap lookup/insert + - rely on explicit `add_dependency` +4. 
For outer/external tensors: + - keep creator-retention from `owner_task_id` + - run TensorMap overlap lookup for `INPUT` and `INOUT` unless `manual_dep=true` + - cache the deduped external producer set in the task payload + - update TensorMap frontier for outer writes in original submit order unless `manual_dep=true` +5. Leave the task unpublished behind one deferred publish barrier. + +### What manual scope_end iterates + +At manual `scope_end`, the runtime iterates tasks in the current manual scope in original submit order: + +1. Read the cached external producer set from each task payload. +2. Replay explicit same-scope edges recorded by `add_dependency`. +3. Merge and dedup: + - cached external producers + - explicit manual producers +4. Realize the final producer set into the normal scheduler fanin/fanout structures. +5. Release the deferred publish barrier and batch-publish the manual-scope tasks. +6. Release the usual scope-held lifetime reference. + +This is why manual `scope_end` is still expensive on non-unroll paged attention: + +- it walks every manual-scope task +- it merges cached external fanins with explicit same-scope edges +- it mutates scheduler fanin/fanout state in one serial finalize step + +### What state manual scope maintains + +The runtime keeps a small amount of scope-local metadata instead of a second execution engine: + +- `scope_tasks[]`: task order for the current scope +- `manual_task_meta[]`: per-task metadata for manual finalize +- `manual_edges[]`: explicit same-scope producer-consumer edges +- `payload->fanin_slot_states[]`: cached external producers discovered at manual submit +- `fanin_count` plus one deferred publish barrier + +This split was chosen because it preserves the normal scheduler after publish: + +- no second execution-time dependency engine +- no change to the ready queue model +- no change to worker dispatch or completion handling +- only boundary discovery stays on the TensorMap path +- only same-scope replay is 
deferred to manual `scope_end` + ## Problem Statement If we simply copy `aicpu_build_graph` semantics into `tensormap_and_ringbuffer`, we get a wrong boundary model: @@ -279,10 +362,10 @@ Everything else stays on the existing TensorMap path. ### 2. Outer-scope tensor, written in place - The internal writer must still publish its producer frontier for TensorMap boundary tracking. -- That boundary frontier becomes visible at manual `scope_end`, so later outside submissions can attach to the correct writer task. +- That boundary frontier must become visible at manual submit, in original submit order, so later submissions can attach to the correct writer task immediately. - Readiness of the written tensor is the completion of that writer task. - Multiple writes inside the same manual scope are allowed. -- TensorMap should continue tracking the latest producer frontier exactly as in auto scope once the manual scope is finalized. +- TensorMap should continue tracking the latest producer frontier exactly as in auto scope while the scope is still unpublished to the scheduler. ### 3. Tensor created inside this manual scope and reused only inside this manual scope @@ -447,7 +530,7 @@ What changes in manual scope is only the task API and the time when dependency l - normal submit APIs perform dependency lookup and TensorMap insert immediately - manual submit APIs only allocate the task, copy payload data, and record compact finalize metadata -- manual `scope_end` replays dependency lookup and TensorMap insert from the recorded payload +- manual `scope_end` replays only explicit same-scope edges and combines them with the external producer set already cached at submit This keeps normal-mode APIs unchanged while avoiding a second tensor representation for manual mode. 
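The submit/scope-end split described above can be sketched as a small runnable model. Everything here is illustrative: `ManualScope`, `submit`, and the dict-based TensorMap stand in for the real slot/payload structures in `pto_orchestrator.cpp`, and for brevity every outer write is treated like `INOUT` (lookup plus frontier update), ignoring the `OUTPUT_EXISTING` and `manual_dep=true` special cases.

```python
# Illustrative model of the manual-scope submit / scope_end split.
# All class and field names are invented for this sketch.

class ManualScope:
    def __init__(self, tensormap):
        self.tensormap = tensormap      # tensor name -> latest external producer task id
        self.scope_tasks = []           # original submit order
        self.manual_edges = []          # explicit same-scope (producer_id, consumer_id)
        self.local_tensors = set()      # tensors created inside this manual scope
        self.cached_fanins = {}         # task id -> external producers found at submit
        self.published = []             # (task id, sorted fanin) pairs, filled at scope_end

    def submit(self, task_id, reads=(), writes=(), creates=()):
        # Submit-time work: boundary discovery and frontier update only.
        externals = set()
        for t in list(reads) + list(writes):
            if t not in self.local_tensors and t in self.tensormap:
                externals.add(self.tensormap[t])   # cross-scope producer via TensorMap
        for t in writes:
            if t not in self.local_tensors:
                self.tensormap[t] = task_id        # publish outer write immediately
        for t in creates:
            self.local_tensors.add(t)              # manual-local for later tasks
        self.cached_fanins[task_id] = externals
        self.scope_tasks.append(task_id)           # not yet published to the scheduler

    def add_dependency(self, producer_id, consumer_id):
        self.manual_edges.append((producer_id, consumer_id))

    def scope_end(self):
        # Finalize: merge cached external fanins with explicit same-scope edges,
        # then batch-publish in original submit order.
        fanin = {t: set(self.cached_fanins[t]) for t in self.scope_tasks}
        for producer, consumer in self.manual_edges:
            fanin[consumer].add(producer)
        for t in self.scope_tasks:
            self.published.append((t, sorted(fanin[t])))
        return self.published
```

Running the multi-write example from this document (`t1` and `t2` both write an outer tensor `C`, with an explicit `t1 -> t2` edge) against this sketch shows the intended behavior: the TensorMap frontier for `C` already points at `t2` before `scope_end`, so an outside reader attaches to the real writer, not to scope closure.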
@@ -631,7 +714,7 @@ For each task in that order during submit: This submit order matters: - it preserves current tensormap behavior for multiple writes to outer tensors -- earlier outer writes from the same manual scope become visible to later tasks in the same manual scope during replay +- earlier outer writes from the same manual scope become visible to later tasks in the same manual scope during submit - that matches the accepted v1 tradeoff that outer tensors may still induce implicit same-scope TensorMap edges - it requires the same TensorMap validity synchronization that normal auto submit uses before lookup/insert @@ -659,7 +742,8 @@ If the external producer has only published its TensorMap frontier but not yet c This is the desired hybrid behavior: -- dependency construction happens at manual `scope_end`, before publish +- explicit same-scope dependency replay happens at manual `scope_end`, before publish +- cross-scope dependency discovery already happened at manual submit - dependency satisfaction is still handled by the normal runtime execution path after publish ### Why this is low-risk @@ -712,7 +796,7 @@ Why this is expected to hold: - tasks created in the current manual scope are still protected by the current manual scope reference until manual `scope_end` - tasks created in an outer still-active scope may complete early, but the outer scope still holds their scope reference until that outer scope ends -- therefore an inner manual scope can still discover those outer producers through `owner_task_id` or TensorMap when it finalizes +- therefore an inner manual scope can still rely on the producer state already discovered and retained during manual submit when it finalizes This does not mean the producer task is still runnable or incomplete. It may already be `COMPLETED`; the manual finalize path should then treat it as an already-satisfied dependency. 
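The "already-satisfied dependency" rule at the end of the nesting discussion can be shown with a minimal helper. The state names below are invented for illustration; the real runtime tracks slot states and a per-task fanin counter rather than a dict of strings.

```python
# Sketch: realizing fanins at manual finalize when some external producers
# may already have completed. State names are hypothetical.

COMPLETED, PUBLISHED = "COMPLETED", "PUBLISHED"

def realize_fanin(producers, task_state):
    """Return (pending_fanin_count, already_satisfied_producers) for one task."""
    pending, satisfied = 0, []
    for producer in producers:
        if task_state[producer] == COMPLETED:
            satisfied.append(producer)   # producer finished early: edge is already met
        else:
            pending += 1                 # scheduler waits for its completion notification
    return pending, satisfied
```

The point of the sketch is only that a `COMPLETED` producer contributes zero pending fanin, so finalize never blocks on a task that can no longer notify.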
@@ -821,7 +905,7 @@ Manual scope does not change lifetime release semantics: Manual scope also does not change cross-scope readiness semantics: - external tensor readiness is still producer-task completion, not `scope_end` -- but external-writer frontier information must be visible to later TensorMap lookups no later than manual `scope_end` +- but external-writer frontier information must already be visible to later TensorMap lookups at manual submit This manual-scope behavior intentionally combines: @@ -845,8 +929,8 @@ outside task reads C Correct behavior: -- at manual `scope_end`, `t1` publishes `C` to TensorMap -- at manual `scope_end`, `t2` publishes `C` again to TensorMap +- at manual submit, `t1` publishes `C` to TensorMap +- at manual submit, `t2` publishes `C` again to TensorMap - outside reader should see `t2` as the latest producer frontier - because `t1 -> t2` is explicit, `t2` completion is a valid readiness frontier for the final visible state - outer tensors may still create implicit same-scope TensorMap edges inside the manual scope; this is an accepted v1 tradeoff and should be called out in the PR description @@ -1039,6 +1123,32 @@ Units below are `elapsed_us (orch_us)`. | `paged_attention_unroll` | `Case1` | `1412.7 (-)` | `1321.7 (841.6)` | `1323.9 (831.3)` | `1321.3 (884.4)` | | `paged_attention_unroll` | `Case2` | `705.5 (-)` | `628.1 (381.6)` | `632.5 (378.9)` | `637.5 (406.4)` | +### Feature-To-Gain Mapping + +The most important question is which change actually moved performance. + +The rerun below isolates the non-unroll partial-manual example on device `3`. All +three rows already include the same rewritten orchestration structure +(`one manual scope per q_idx`, `AIV_HUB` moved inside manual scope, and explicit +`prev_update_task -> up_outs.task_id` chaining). The only thing changing between rows +is boundary annotation. 
+ +| Optimization step | What changed | `Case1` orch delta | `Case2` orch delta | What it means | +| --- | --- | --- | --- | --- | +| Baseline after rewrite | no `manual_dep` boundary hints | baseline `36791.9 us` | baseline `19792.7 us` | structural rewrite alone is the starting point | +| Add input-side hints | `query`, `key_cache`, `value_cache`, and their views use `manual_dep=true` | `-39.6 us` (`-0.1%`) | `-496.0 us` (`-2.5%`) | minor benefit; input-side TensorMap work is not the main bottleneck | +| Add output-side hints on top | `out` and `out_view` also use `manual_dep=true` | `-1683.6 us` (`-4.6%`) | `-1628.5 us` (`-8.4%`) | major gain; repeated external-output overlap tracking was expensive | +| Full boundary hints vs no hints | inputs + output boundary hints together | `-1723.2 us` (`-4.7%`) | `-2124.5 us` (`-10.7%`) | this is the measurable win that was worth keeping | + +Two conclusions matter: + +1. The kept optimization is not “use `manual_dep` everywhere”. + - The measurable gain came mostly from suppressing repeated external-output + TensorMap work on `out` / `out_view`, where same-scope write ordering is already + carried by explicit manual edges. +2. Input-side `manual_dep=true` alone is not enough. + - It helps a little on `Case2`, but almost nothing on `Case1`. + ### Benchmark Takeaways 1. The non-unroll target is still not met. 
@@ -1112,6 +1222,15 @@ That is consistent with the current runtime behavior: - marking `out` / `out_view` `manual_dep=true` avoids paying repeated external-output overlap tracking on every block update when that ordering is already explicit +Why this was kept even though `manual_dep` is not the core semantics: + +- manual scope still uses TensorMap for general cross-scope correctness +- this example already has explicit same-scope write ordering for `ONLINE_UPDATE` +- there is no same-scope external consumer that needs `out` / `out_view` to stay on + TensorMap before manual `scope_end` +- so suppressing repeated output overlap tracking is a valid example-level + optimization, not a change to the runtime's semantic model + This is still not a general “scope arguments” API. It is an example-local optimization that is only safe when the manual scope already carries the same-scope write ordering explicitly and there is no same-scope external consumer that depends on TensorMap @@ -1156,9 +1275,11 @@ publication before manual `scope_end`. - This would duplicate fanin/fanout handling, completion notification, and release traversal. - The design requires one unified post-publish scheduler mechanism. -13. Using `manual_dep=true` as a fake scope-boundary annotation. +13. Using `manual_dep=true` as a blanket scope-boundary annotation. - This can suppress TensorMap work that is still required for cross-scope correctness. -- It also creates unstable performance results because the flag is tensor-local, not scope-local. +- It is only safe as a narrowly-scoped example optimization when the same-scope + ordering is already explicit and no early external consumer needs TensorMap + publication for that tensor. 
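The safety condition in risk 13 can be written as a predicate over plain data. This is a sketch of the reasoning only; the runtime performs no such check, and all parameter names are invented.

```python
# Hypothetical predicate: when is suppressing TensorMap work for one tensor
# (manual_dep=true) safe inside a manual scope?

def manual_dep_hint_is_safe(writes_in_order, explicit_edges, has_early_external_consumer):
    """writes_in_order: same-scope writer task ids for this tensor, in submit order.
    explicit_edges: set of (producer_id, consumer_id) manual edges.
    has_early_external_consumer: True if any consumer outside the scope needs the
    TensorMap frontier for this tensor before manual scope_end."""
    # Every consecutive same-scope write pair must already be ordered explicitly.
    ordered = all(
        (a, b) in explicit_edges
        for a, b in zip(writes_in_order, writes_in_order[1:])
    )
    return ordered and not has_early_external_consumer
```

This matches the committed example: `out` / `out_view` writes are chained through explicit `prev_update_task` edges and no external consumer reads `out` before `scope_end`, so the hint is safe there and only there.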
## Dangerous Risks For The Submit/Scope-End Split @@ -1226,7 +1347,8 @@ This design intentionally resolves the central ambiguity: - `scope_end` controls lifetime release - task completion controls semantic readiness -For outer tensors written inside manual scope, TensorMap frontier publication happens at manual `scope_end`, while semantic readiness is still producer-task completion. +For outer tensors written inside manual scope, TensorMap frontier publication happens at +manual submit, while semantic readiness is still producer-task completion. ## File Areas Expected To Change @@ -1244,7 +1366,7 @@ Implement manual dependency as a scope-local override inside `tensormap_and_ring - tensors created in the current manual scope: explicit `add_dependency` - outer tensors: existing TensorMap path -- TensorMap boundary realization for manual scopes: manual `scope_end` +- TensorMap boundary realization for manual scopes: manual submit - semantic readiness of outer writes: writer completion - lifetime release: `scope_end` From e0aa3c48057a2ffb87bfe6ba3799878362661c7f Mon Sep 17 00:00:00 2001 From: Youwei Xiao Date: Wed, 8 Apr 2026 16:40:02 +0800 Subject: [PATCH 24/35] Update: explain manual-scope design tradeoffs - add a short rationale section tying each manual-scope rule to the incorrect or too-expensive alternative it avoids - keep the design doc aligned with the current implemented split between TensorMap boundary discovery and scope-end explicit-edge replay --- docs/manual-dep-for-tensormap-design.md | 13 +++++++++++++ 1 file changed, 13 insertions(+) diff --git a/docs/manual-dep-for-tensormap-design.md b/docs/manual-dep-for-tensormap-design.md index a236cde89..60a988af8 100644 --- a/docs/manual-dep-for-tensormap-design.md +++ b/docs/manual-dep-for-tensormap-design.md @@ -248,6 +248,19 @@ This split was chosen because it preserves the normal scheduler after publish: - only boundary discovery stays on the TensorMap path - only same-scope replay is deferred to manual 
`scope_end`
+### Why these decisions were made
+
+Each part of the split exists to avoid a specific incorrect or too-expensive alternative:
+
+- keep cross-scope producer discovery on TensorMap
+  - otherwise outer reads and writes would lose the current producer frontier and later submissions could see stale state
+- keep same-scope manual-local edges explicit
+  - otherwise manual mode would still pay repeated TensorMap lookup/insert for the tensors it is trying to optimize
+- defer scheduler publication to manual `scope_end`
+  - otherwise tasks with partially wired explicit edges could become runnable too early
+- keep only one post-publish scheduler mechanism
+  - otherwise the runtime would need a second dependency engine and a second completion path, which is high-risk and unnecessary
+
 ## Problem Statement
 
 If we simply copy `aicpu_build_graph` semantics into `tensormap_and_ringbuffer`, we get a wrong boundary model:

From 29f1eb6676f3f04c1bf72fd714e5ea7bc3c96e1f Mon Sep 17 00:00:00 2001
From: Youwei Xiao
Date: Wed, 8 Apr 2026 17:06:36 +0800
Subject: [PATCH 25/35] Refactor: remove branch-local unmodified runtime support

- delete the copied tensormap_and_ringbuffer_unmodified runtime and ST scenes
- keep branch docs and benchmark helpers limited to supported runtimes
- enforce examples/{arch}/{runtime}/{name} in tracked command docs
- rewrite example and ST path references to use explicit arch prefixes
---
 .claude/commands/perf-example-device.md | 9 +-
 .claude/commands/profile.md | 8 +-
 .claude/commands/test-example-device.md | 11 +-
 .claude/commands/test-example-sim.md | 9 +-
 .claude/rules/architecture.md | 5 +
 CLAUDE.md | 3 +-
 docs/developer-guide.md | 4 +-
 docs/manual-dep-for-tensormap-design.md | 46 +-
 .../docs/INCORE_ORCHESTRATION_GUIDE.md | 8 +-
 .../vector_example/README.md | 4 +-
 .../docs/INCORE_ORCHESTRATION_GUIDE.md | 7 +-
 .../host_build_graph/vector_example/README.md | 20 +-
 .../docs/INCORE_ORCHESTRATION_GUIDE.md | 10 +-
examples/scripts/run_example.py | 8 +- src/a2a3/docs/runtimes.md | 34 +- .../docs/SUBMIT_BY_CLUSTER.md | 4 +- .../aicore/aicore_executor.cpp | 140 - .../aicpu/aicpu_executor.cpp | 2473 ----------------- .../build_config.py | 26 - .../common/intrinsic.h | 141 - .../docs/MULTI_RING.md | 237 -- .../docs/ROADMAP.md | 86 - .../docs/RUNTIME_LOGIC.md | 658 ----- .../docs/SCALAR_DATA_ACCESS.md | 137 - .../docs/SUBMIT_BY_CLUSTER.md | 226 -- .../docs/device_log_profiling.md | 167 -- .../docs/profiling_levels.md | 355 --- .../host/runtime_compile_info.cpp | 18 - .../host/runtime_maker.cpp | 381 --- .../orchestration/common.cpp | 174 -- .../orchestration/pto_orchestration_api.h | 308 -- .../runtime/common.h | 93 - .../runtime/pto2_dispatch_payload.h | 85 - .../runtime/pto_orchestrator.cpp | 759 ----- .../runtime/pto_orchestrator.h | 225 -- .../runtime/pto_ring_buffer.cpp | 78 - .../runtime/pto_ring_buffer.h | 508 ---- .../runtime/pto_runtime2.cpp | 337 --- .../runtime/pto_runtime2.h | 225 -- .../runtime/pto_runtime2_types.h | 557 ---- .../runtime/pto_scheduler.cpp | 220 -- .../runtime/pto_scheduler.h | 819 ------ .../runtime/pto_shared_memory.cpp | 273 -- .../runtime/pto_shared_memory.h | 227 -- .../runtime/pto_submit_types.h | 119 - .../runtime/pto_task_id.h | 50 - .../runtime/pto_tensormap.cpp | 256 -- .../runtime/pto_tensormap.h | 521 ---- .../runtime/pto_types.h | 284 -- .../runtime/runtime.cpp | 144 - .../runtime/runtime.h | 290 -- .../runtime/tensor.h | 493 ---- .../docs/SUBMIT_BY_CLUSTER.md | 6 +- .../paged_attention/README.md | 8 +- .../paged_attention/README.md | 8 +- .../paged_attention/golden.py | 19 - .../paged_attention/kernels/kernel_config.py | 20 - .../paged_attention_unroll/golden.py | 19 - .../kernels/kernel_config.py | 20 - .../paged_attention/README.md | 12 +- tests/ut/py/test_runtime_builder.py | 8 - tools/README.md | 10 +- tools/benchmark_rounds.sh | 17 +- tools/swimlane_converter.py | 2 +- 64 files changed, 114 insertions(+), 12315 deletions(-) delete 
mode 100644 src/a2a3/runtime/tensormap_and_ringbuffer_unmodified/aicore/aicore_executor.cpp
delete mode 100644 src/a2a3/runtime/tensormap_and_ringbuffer_unmodified/aicpu/aicpu_executor.cpp
delete mode 100644 src/a2a3/runtime/tensormap_and_ringbuffer_unmodified/build_config.py
delete mode 100644 src/a2a3/runtime/tensormap_and_ringbuffer_unmodified/common/intrinsic.h
delete mode 100644 src/a2a3/runtime/tensormap_and_ringbuffer_unmodified/docs/MULTI_RING.md
delete mode 100644 src/a2a3/runtime/tensormap_and_ringbuffer_unmodified/docs/ROADMAP.md
delete mode 100644 src/a2a3/runtime/tensormap_and_ringbuffer_unmodified/docs/RUNTIME_LOGIC.md
delete mode 100644 src/a2a3/runtime/tensormap_and_ringbuffer_unmodified/docs/SCALAR_DATA_ACCESS.md
delete mode 100644 src/a2a3/runtime/tensormap_and_ringbuffer_unmodified/docs/SUBMIT_BY_CLUSTER.md
delete mode 100644 src/a2a3/runtime/tensormap_and_ringbuffer_unmodified/docs/device_log_profiling.md
delete mode 100644 src/a2a3/runtime/tensormap_and_ringbuffer_unmodified/docs/profiling_levels.md
delete mode 100644 src/a2a3/runtime/tensormap_and_ringbuffer_unmodified/host/runtime_compile_info.cpp
delete mode 100644 src/a2a3/runtime/tensormap_and_ringbuffer_unmodified/host/runtime_maker.cpp
delete mode 100644 src/a2a3/runtime/tensormap_and_ringbuffer_unmodified/orchestration/common.cpp
delete mode 100644 src/a2a3/runtime/tensormap_and_ringbuffer_unmodified/orchestration/pto_orchestration_api.h
delete mode 100644 src/a2a3/runtime/tensormap_and_ringbuffer_unmodified/runtime/common.h
delete mode 100644 src/a2a3/runtime/tensormap_and_ringbuffer_unmodified/runtime/pto2_dispatch_payload.h
delete mode 100644 src/a2a3/runtime/tensormap_and_ringbuffer_unmodified/runtime/pto_orchestrator.cpp
delete mode 100644 src/a2a3/runtime/tensormap_and_ringbuffer_unmodified/runtime/pto_orchestrator.h
delete mode 100644 src/a2a3/runtime/tensormap_and_ringbuffer_unmodified/runtime/pto_ring_buffer.cpp
delete mode 100644 src/a2a3/runtime/tensormap_and_ringbuffer_unmodified/runtime/pto_ring_buffer.h
delete mode 100644 src/a2a3/runtime/tensormap_and_ringbuffer_unmodified/runtime/pto_runtime2.cpp
delete mode 100644 src/a2a3/runtime/tensormap_and_ringbuffer_unmodified/runtime/pto_runtime2.h
delete mode 100644 src/a2a3/runtime/tensormap_and_ringbuffer_unmodified/runtime/pto_runtime2_types.h
delete mode 100644 src/a2a3/runtime/tensormap_and_ringbuffer_unmodified/runtime/pto_scheduler.cpp
delete mode 100644 src/a2a3/runtime/tensormap_and_ringbuffer_unmodified/runtime/pto_scheduler.h
delete mode 100644 src/a2a3/runtime/tensormap_and_ringbuffer_unmodified/runtime/pto_shared_memory.cpp
delete mode 100644 src/a2a3/runtime/tensormap_and_ringbuffer_unmodified/runtime/pto_shared_memory.h
delete mode 100644 src/a2a3/runtime/tensormap_and_ringbuffer_unmodified/runtime/pto_submit_types.h
delete mode 100644 src/a2a3/runtime/tensormap_and_ringbuffer_unmodified/runtime/pto_task_id.h
delete mode 100644 src/a2a3/runtime/tensormap_and_ringbuffer_unmodified/runtime/pto_tensormap.cpp
delete mode 100644 src/a2a3/runtime/tensormap_and_ringbuffer_unmodified/runtime/pto_tensormap.h
delete mode 100644 src/a2a3/runtime/tensormap_and_ringbuffer_unmodified/runtime/pto_types.h
delete mode 100644 src/a2a3/runtime/tensormap_and_ringbuffer_unmodified/runtime/runtime.cpp
delete mode 100644 src/a2a3/runtime/tensormap_and_ringbuffer_unmodified/runtime/runtime.h
delete mode 100644 src/a2a3/runtime/tensormap_and_ringbuffer_unmodified/runtime/tensor.h
delete mode 100644 tests/st/a2a3/tensormap_and_ringbuffer_unmodified/paged_attention/golden.py
delete mode 100644 tests/st/a2a3/tensormap_and_ringbuffer_unmodified/paged_attention/kernels/kernel_config.py
delete mode 100644 tests/st/a2a3/tensormap_and_ringbuffer_unmodified/paged_attention_unroll/golden.py
delete mode 100644 tests/st/a2a3/tensormap_and_ringbuffer_unmodified/paged_attention_unroll/kernels/kernel_config.py
diff --git a/.claude/commands/perf-example-device.md b/.claude/commands/perf-example-device.md
index 28c1a50ef..c72fef178 100644
--- a/.claude/commands/perf-example-device.md
+++ b/.claude/commands/perf-example-device.md
@@ -3,8 +3,9 @@ Benchmark the hardware performance of a single example at $ARGUMENTS.
 Reference `tools/benchmark_rounds.sh` for the full implementation pattern (device log resolution, timing parsing, reporting format). This skill runs the same logic but for a single example only.
 
 1. Verify `$ARGUMENTS` exists and contains `kernels/kernel_config.py` and `golden.py`
-2. Check `command -v npu-smi` — if not found, tell the user this requires hardware and stop
-3. **Detect platform**: Run `npu-smi info` and parse the chip name. Map `910B`/`910C` → `a2a3`, `950` → `a5`. If unrecognized, warn and default to `a2a3`
-4. Find the lowest-ID idle device (HBM-Usage = 0) from the `npu-smi info` output. If none, stop
-5. Run the example following the same pattern as `run_bench()` in `tools/benchmark_rounds.sh`:
+2. Require the example path to live under `examples/a2a3/` or `examples/a5/`. If it does not, stop and report that root-level `examples/{runtime}/...` paths are invalid.
+3. Check `command -v npu-smi` — if not found, tell the user this requires hardware and stop
+4. **Detect platform**: Infer the architecture from the example path (`examples/a2a3/...` → `a2a3`, `examples/a5/...` → `a5`). Use `npu-smi info` only as a sanity check; if the detected chip family conflicts with the path, report the mismatch and stop instead of silently switching platforms.
+5. Find the lowest-ID idle device (HBM-Usage = 0) from the `npu-smi info` output. If none, stop
+6. Run the example following the same pattern as `run_bench()` in `tools/benchmark_rounds.sh`:
    - Snapshot logs, run `run_example.py` with `-n 10`, find new log, parse timing, report results
diff --git a/.claude/commands/profile.md b/.claude/commands/profile.md
index aafe867f1..546bf3dca 100644
--- a/.claude/commands/profile.md
+++ b/.claude/commands/profile.md
@@ -1,6 +1,8 @@
 Run the example at $ARGUMENTS with profiling enabled on hardware.
 
 1. Verify the directory exists and contains `kernels/kernel_config.py` and `golden.py`
-2. Run: `python examples/scripts/run_example.py -k $ARGUMENTS/kernels -g $ARGUMENTS/golden.py -p a2a3 --enable-profiling`
-3. If the test passes, report the swimlane output file location in `outputs/`
-4. Summarize the task statistics from the console output (per-function timing breakdown)
+2. Require the example path to live under `examples/a2a3/` or `examples/a5/`. If it does not, stop and report that root-level `examples/{runtime}/...` paths are invalid.
+3. Infer the platform from the example path (`examples/a2a3/...` → `a2a3`, `examples/a5/...` → `a5`).
+4. Run: `python examples/scripts/run_example.py -k $ARGUMENTS/kernels -g $ARGUMENTS/golden.py -p --enable-profiling`
+5. If the test passes, report the swimlane output file location in `outputs/`
+6. Summarize the task statistics from the console output (per-function timing breakdown)
diff --git a/.claude/commands/test-example-device.md b/.claude/commands/test-example-device.md
index ac34dd232..f30736419 100644
--- a/.claude/commands/test-example-device.md
+++ b/.claude/commands/test-example-device.md
@@ -1,8 +1,9 @@
 Run the hardware device test for the example at $ARGUMENTS.
 
 1. Verify the directory exists and contains `kernels/kernel_config.py` and `golden.py`
-2. Check `command -v npu-smi` — if not found, tell the user to use `/test-example-sim` instead and stop
-3. **Detect platform**: Run `npu-smi info` and parse the chip name. Map `910B`/`910C` → `a2a3`, `950` → `a5`. If unrecognized, warn and default to `a2a3`
-4. Read `.github/workflows/ci.yml` to extract the current `-c` (pto-isa commit) flag from the `st-onboard-` job's `./ci.sh` invocation
-5. Run: `python examples/scripts/run_example.py -k $ARGUMENTS/kernels -g $ARGUMENTS/golden.py -p -c `
-6. Report pass/fail status with any error output
+2. Require the example path to live under `examples/a2a3/` or `examples/a5/`. If it does not, stop and report that root-level `examples/{runtime}/...` paths are invalid.
+3. Check `command -v npu-smi` — if not found, tell the user to use `/test-example-sim` instead and stop
+4. **Detect platform**: Infer the architecture from the example path (`examples/a2a3/...` → `a2a3`, `examples/a5/...` → `a5`). Use `npu-smi info` only as a sanity check; if the detected chip family conflicts with the path, report the mismatch and stop instead of silently switching platforms.
+5. Read `.github/workflows/ci.yml` to extract the current `-c` (pto-isa commit) flag from the `st-onboard-` job's `./ci.sh` invocation
+6. Run: `python examples/scripts/run_example.py -k $ARGUMENTS/kernels -g $ARGUMENTS/golden.py -p -c `
+7. Report pass/fail status with any error output
diff --git a/.claude/commands/test-example-sim.md b/.claude/commands/test-example-sim.md
index 79deecf50..1f0bbb40c 100644
--- a/.claude/commands/test-example-sim.md
+++ b/.claude/commands/test-example-sim.md
@@ -1,7 +1,8 @@
 Run the simulation test for the example at $ARGUMENTS.
 
 1. Verify the directory exists and contains `kernels/kernel_config.py` and `golden.py`
-2. Read `.github/workflows/ci.yml` to extract the current `-c` (pto-isa commit) flag from the `st-sim-*` jobs' `./ci.sh` invocations
-3. **Detect platform**: Infer the architecture from the example path (e.g., `examples/a2a3/...` → `a2a3sim`, `examples/a5/...` → `a5sim`). If the path doesn't contain an arch prefix, default to `a2a3sim`
-4. Run: `python examples/scripts/run_example.py -k $ARGUMENTS/kernels -g $ARGUMENTS/golden.py -p -c `
-5. Report pass/fail status with any error output
+2. Require the example path to live under `examples/a2a3/` or `examples/a5/`. If it does not, stop and report that root-level `examples/{runtime}/...` paths are invalid.
+3. Read `.github/workflows/ci.yml` to extract the current `-c` (pto-isa commit) flag from the `st-sim-*` jobs' `./ci.sh` invocations
+4. **Detect platform**: Infer the architecture from the example path (`examples/a2a3/...` → `a2a3sim`, `examples/a5/...` → `a5sim`).
+5. Run: `python examples/scripts/run_example.py -k $ARGUMENTS/kernels -g $ARGUMENTS/golden.py -p -c `
+6. Report pass/fail status with any error output
diff --git a/.claude/rules/architecture.md b/.claude/rules/architecture.md
index 0125b1822..abc1f78cb 100644
--- a/.claude/rules/architecture.md
+++ b/.claude/rules/architecture.md
@@ -24,6 +24,11 @@ See [docs/architecture.md](../../docs/architecture.md) for the full diagram, API
 
 ## Example / Test Layout
 
+Examples must live under `examples/{arch}/{runtime}/{name}/`. Valid example roots are
+`examples/a2a3/` and `examples/a5/`. Paths such as
+`examples/host_build_graph//` or `examples/tensormap_and_ringbuffer//`
+directly under `examples/` are invalid.
+
 ```text
 my_example/
   golden.py        # generate_inputs() + compute_golden()
diff --git a/CLAUDE.md b/CLAUDE.md
index 269255928..2ecd700a9 100644
--- a/CLAUDE.md
+++ b/CLAUDE.md
@@ -8,7 +8,7 @@ See [docs/developer-guide.md](docs/developer-guide.md) for full directory struct
 | ---- | ----------------- |
 | Platform Developer | `src/{arch}/platform/` |
 | Runtime Developer | `src/{arch}/runtime/` |
-| Codegen Developer | `examples/` |
+| Codegen Developer | `examples/{arch}/` |
 
 ## Common Commands
 
@@ -37,3 +37,4 @@ clang-format -i
 3. Create new subdirectories under your assigned directory as needed
 4. When in doubt, ask the user before making changes to other areas
 5. **Avoid including private information in documentation or code** such as usernames, absolute paths with usernames, or other personally identifiable information. Use relative paths or generic placeholders instead
+6. **Place examples under `examples/{arch}/{runtime}/{name}/`**. Do not create `examples/{runtime}/...` directly under `examples/`.
diff --git a/docs/developer-guide.md b/docs/developer-guide.md
index 64255dfab..b39823200 100644
--- a/docs/developer-guide.md
+++ b/docs/developer-guide.md
@@ -106,7 +106,9 @@ When preprocessor guards are used to isolate platform code paths, the `__aarch64
 
 ## Example / Test Layout
 
-Every example and device test follows this structure:
+Examples must live under `examples/{arch}/{runtime}/{name}/`, and device scenes must
+live under `tests/st/{arch}/{runtime}/{name}/`. Every example and device test follows
+this structure:
 
 ```text
 my_example/
diff --git a/docs/manual-dep-for-tensormap-design.md b/docs/manual-dep-for-tensormap-design.md
index 60a988af8..e489d6e08 100644
--- a/docs/manual-dep-for-tensormap-design.md
+++ b/docs/manual-dep-for-tensormap-design.md
@@ -1086,10 +1086,9 @@ Fresh benchmark data was rerun on real hardware on 2026-04-08 with:
 - pinned PTO-ISA commit: `6622890`
 - runner: `tools/benchmark_rounds.sh`
 
-The four compared variants are:
+The branch-local compared variants are:
 
 - `aicpu_build_graph`
-- `tensormap_and_ringbuffer_unmodified`
 - `tensormap_and_ringbuffer`
 - `tensormap_and_ringbuffer_partial_manual`
 
@@ -1097,6 +1096,10 @@ The four compared variants are:
 `tensormap_and_ringbuffer_partial_manual` is the same runtime tree, but benchmarked
 through the `_partial_manual` paged-attention scenes.
 
+The untouched PTO2 baseline is no longer kept in this branch. If a comparison against
+an unmodified tensormap runtime is needed, create a temporary worktree from the
+baseline commit and run the same benchmark script there.
+
 ### Benchmark Script Selectors
 
 The benchmark wrapper enables the variants as follows:
 
@@ -1104,9 +1107,6 @@ The benchmark wrapper enables the variants as follows:
 - `./tools/benchmark_rounds.sh -d 3 -n 10 -r aicpu_build_graph -c 6622890`
   - benchmarks `tests/st/a2a3/aicpu_build_graph/paged_attention`
   - benchmarks `tests/st/a2a3/aicpu_build_graph/paged_attention_unroll`
-- `./tools/benchmark_rounds.sh -d 3 -n 10 -r tensormap_and_ringbuffer_unmodified -c 6622890`
-  - benchmarks `tests/st/a2a3/tensormap_and_ringbuffer_unmodified/paged_attention`
-  - benchmarks `tests/st/a2a3/tensormap_and_ringbuffer_unmodified/paged_attention_unroll`
 - `./tools/benchmark_rounds.sh -d 3 -n 10 -r tensormap_and_ringbuffer -c 6622890`
   - benchmarks `tests/st/a2a3/tensormap_and_ringbuffer/paged_attention`
   - benchmarks `tests/st/a2a3/tensormap_and_ringbuffer/paged_attention_unroll`
@@ -1122,19 +1122,18 @@ The partial-manual scenes still declare `RUNTIME_CONFIG["runtime"] =
 is selected.
 
 Similarly, the current/new AUTO-path runtime is enabled directly by `-r
-tensormap_and_ringbuffer`, while the copied side-by-side baseline is enabled by `-r
-tensormap_and_ringbuffer_unmodified`.
+tensormap_and_ringbuffer`.
 
 ### Fresh Results
 
 Units below are `elapsed_us (orch_us)`.
 
-| Workload | Case | `aicpu_build_graph` | `tensormap_and_ringbuffer_unmodified` | `tensormap_and_ringbuffer` | `tensormap_and_ringbuffer_partial_manual` |
-| --- | --- | --- | --- | --- | --- |
-| `paged_attention` | `Case1` | `31318.9 (-)` | `35367.3 (35366.7)` | `36996.3 (36995.6)` | `35187.6 (35030.2)` |
-| `paged_attention` | `Case2` | `16844.5 (-)` | `19739.8 (19736.1)` | `19861.8 (19856.8)` | `18685.5 (18274.5)` |
-| `paged_attention_unroll` | `Case1` | `1412.7 (-)` | `1321.7 (841.6)` | `1323.9 (831.3)` | `1321.3 (884.4)` |
-| `paged_attention_unroll` | `Case2` | `705.5 (-)` | `628.1 (381.6)` | `632.5 (378.9)` | `637.5 (406.4)` |
+| Workload | Case | `aicpu_build_graph` | `tensormap_and_ringbuffer` | `tensormap_and_ringbuffer_partial_manual` |
+| --- | --- | --- | --- | --- |
+| `paged_attention` | `Case1` | `31318.9 (-)` | `36996.3 (36995.6)` | `35187.6 (35030.2)` |
+| `paged_attention` | `Case2` | `16844.5 (-)` | `19861.8 (19856.8)` | `18685.5 (18274.5)` |
+| `paged_attention_unroll` | `Case1` | `1412.7 (-)` | `1323.9 (831.3)` | `1321.3 (884.4)` |
+| `paged_attention_unroll` | `Case2` | `705.5 (-)` | `632.5 (378.9)` | `637.5 (406.4)` |
 
 ### Feature-To-Gain Mapping
 
@@ -1175,22 +1174,13 @@ Two conclusions matter:
   - `paged_attention/Case1`: about `-4.9%` elapsed, about `-5.3%` orch vs current/new
  - `paged_attention/Case2`: about `-5.9%` elapsed, about `-8.0%` orch vs current/new
 
-3. Partial-manual is not yet consistently better than the copied unmodified baseline.
-   - `paged_attention/Case1`: about `-0.5%` elapsed, about `-1.0%` orch vs unmodified
-   - `paged_attention/Case2`: about `+5.2%` elapsed, about `+7.8%` orch vs unmodified
-
-4. The current/new AUTO runtime is still slower than the copied unmodified runtime on
-   the non-unroll scene.
-   - `paged_attention/Case1`: about `+4.6%` elapsed, about `+4.6%` orch
-   - `paged_attention/Case2`: about `+0.6%` elapsed, about `+0.6%` orch
-
-5. On the unroll scene, all three tensormap-family runtimes remain faster than
-   `aicpu_build_graph` end-to-end, but partial-manual stays slightly worse than both
-   AUTO tensormap variants in orch time.
-   - `paged_attention_unroll/Case1`: `884.4 us` orch vs `841.6 us` unmodified and `831.3 us` current/new
-   - `paged_attention_unroll/Case2`: `406.4 us` orch vs `381.6 us` unmodified and `378.9 us` current/new
+3. On the unroll scene, both tensormap-family runtimes remain faster than
+   `aicpu_build_graph` end-to-end, but partial-manual stays slightly worse than the
+   AUTO tensormap path in orch time.
+   - `paged_attention_unroll/Case1`: `884.4 us` orch vs `831.3 us` current/new
+   - `paged_attention_unroll/Case2`: `406.4 us` orch vs `378.9 us` current/new
 
-6. The remaining performance problem is still concentrated in the non-unroll
+4. The remaining performance problem is still concentrated in the non-unroll
    partial-manual path, especially the replay/publish cost paid at manual `scope_end`.
 
 ### Boundary Annotation Note
diff --git a/examples/a2a3/aicpu_build_graph/docs/INCORE_ORCHESTRATION_GUIDE.md b/examples/a2a3/aicpu_build_graph/docs/INCORE_ORCHESTRATION_GUIDE.md
index 3d52c2e12..d43b83dd8 100644
--- a/examples/a2a3/aicpu_build_graph/docs/INCORE_ORCHESTRATION_GUIDE.md
+++ b/examples/a2a3/aicpu_build_graph/docs/INCORE_ORCHESTRATION_GUIDE.md
@@ -4,8 +4,8 @@
 In aicpu_build_graph, the orchestration function runs on AICPU. It reads device pointers from `runtime->orch_args`, allocates intermediate buffers with `device_malloc`, builds the task dependency graph through the `AicpuBuildApi` function-pointer table, and publishes tasks for scheduling.
 
 ## Where To Put Orchestration Code
-- Each example keeps orchestration sources under `examples/aicpu_build_graph//kernels/orchestration/`.
-- `examples/aicpu_build_graph//kernels/kernel_config.py` defines the orchestration entry point. Example: `ORCHESTRATION = {"source": ".../orchestration.cpp", "function_name": "orchestration"}`.
+- Each example keeps orchestration sources under `examples/a2a3/aicpu_build_graph//kernels/orchestration/`.
+- `examples/a2a3/aicpu_build_graph//kernels/kernel_config.py` defines the orchestration entry point. Example: `ORCHESTRATION = {"source": ".../orchestration.cpp", "function_name": "orchestration"}`.
 
 ## Function Signature
 
 Your orchestration entry must be `extern "C"` and match:
@@ -60,5 +60,5 @@ Where `api` is `runtime->aicpu_build_api`.
 - `"0"`: Sequential -- schedulers wait until the builder finishes all tasks.
 
 ## Examples
-- `examples/aicpu_build_graph/vector_example/kernels/orchestration/orchestration.cpp`
-- `examples/aicpu_build_graph/bgemm/kernels/orchestration/bgemm_orch.cpp`
+- `examples/a2a3/aicpu_build_graph/vector_example/kernels/orchestration/orchestration.cpp`
+- `examples/a2a3/aicpu_build_graph/bgemm/kernels/orchestration/bgemm_orch.cpp`
diff --git a/examples/a2a3/aicpu_build_graph/vector_example/README.md b/examples/a2a3/aicpu_build_graph/vector_example/README.md
index 49107b056..5e6c3fcd6 100644
--- a/examples/a2a3/aicpu_build_graph/vector_example/README.md
+++ b/examples/a2a3/aicpu_build_graph/vector_example/README.md
@@ -6,8 +6,8 @@ This example runs the same computation as `host_build_graph_example`, but the ta
 
 ```bash
 python examples/scripts/run_example.py \
-  -k examples/aicpu_build_graph/vector_example/kernels \
-  -g examples/aicpu_build_graph/vector_example/golden.py \
+  -k examples/a2a3/aicpu_build_graph/vector_example/kernels \
+  -g examples/a2a3/aicpu_build_graph/vector_example/golden.py \
   -p a2a3sim
 ```
diff --git a/examples/a2a3/host_build_graph/docs/INCORE_ORCHESTRATION_GUIDE.md b/examples/a2a3/host_build_graph/docs/INCORE_ORCHESTRATION_GUIDE.md
index fc632cc7b..42182f95f 100644
--- a/examples/a2a3/host_build_graph/docs/INCORE_ORCHESTRATION_GUIDE.md
+++ b/examples/a2a3/host_build_graph/docs/INCORE_ORCHESTRATION_GUIDE.md
@@ -5,9 +5,8 @@
 In host_build_graph, the orchestration function runs on the host. It allocates device buffers, builds the task graph by calling `add_task(runtime, ...)`, and wires dependencies with `add_successor(runtime, ...)`.
 
 ## Where To Put Orchestration Code
-
-- Each example keeps orchestration sources under `examples/host_build_graph//kernels/orchestration/`.
-- `examples/host_build_graph//kernels/kernel_config.py` defines the orchestration entry point. Example: `ORCHESTRATION = {"source": ".../example_orch.cpp", "function_name": "build_example_graph"}`.
+- Each example keeps orchestration sources under `examples/a2a3/host_build_graph//kernels/orchestration/`.
+- `examples/a2a3/host_build_graph//kernels/kernel_config.py` defines the orchestration entry point. Example: `ORCHESTRATION = {"source": ".../example_orch.cpp", "function_name": "build_example_graph"}`.
 
 ## Function Signature
 
@@ -37,7 +36,7 @@ A typical host orchestration sequence is:
 4. Create tasks with `add_task(runtime, args, num_args, func_id, core_type)`.
 5. Add dependency edges with `add_successor(runtime, producer, consumer)`.
 
-Example: see `examples/host_build_graph/vector_example/kernels/orchestration/example_orch.cpp`.
+Example: see `examples/a2a3/host_build_graph/vector_example/kernels/orchestration/example_orch.cpp`.
 
 ## Kernel Mapping
diff --git a/examples/a2a3/host_build_graph/vector_example/README.md b/examples/a2a3/host_build_graph/vector_example/README.md
index 20755cfea..974483703 100644
--- a/examples/a2a3/host_build_graph/vector_example/README.md
+++ b/examples/a2a3/host_build_graph/vector_example/README.md
@@ -52,14 +52,14 @@ This example supports two platforms:
 
 ```bash
 # From repository root
 python examples/scripts/run_example.py \
-  -k examples/host_build_graph/vector_example/kernels \
-  -g examples/host_build_graph/vector_example/golden.py \
+  -k examples/a2a3/host_build_graph/vector_example/kernels \
+  -g examples/a2a3/host_build_graph/vector_example/golden.py \
   -p a2a3sim
 
 # With verbose output
 python examples/scripts/run_example.py \
-  -k examples/host_build_graph/vector_example/kernels \
-  -g examples/host_build_graph/vector_example/golden.py \
+  -k examples/a2a3/host_build_graph/vector_example/kernels \
+  -g examples/a2a3/host_build_graph/vector_example/golden.py \
   -p a2a3sim \
   -v
 ```
@@ -69,21 +69,21 @@ python examples/scripts/run_example.py \
 
 ```bash
 # From repository root
 python examples/scripts/run_example.py \
-  -k examples/host_build_graph/vector_example/kernels \
-  -g examples/host_build_graph/vector_example/golden.py \
+  -k examples/a2a3/host_build_graph/vector_example/kernels \
+  -g examples/a2a3/host_build_graph/vector_example/golden.py \
   -p a2a3
 
 # With specific device ID
 python examples/scripts/run_example.py \
-  -k examples/host_build_graph/vector_example/kernels \
-  -g examples/host_build_graph/vector_example/golden.py \
+  -k examples/a2a3/host_build_graph/vector_example/kernels \
+  -g examples/a2a3/host_build_graph/vector_example/golden.py \
   -p a2a3 \
   -d 0
 
 # With verbose output
 python examples/scripts/run_example.py \
-  -k examples/host_build_graph/vector_example/kernels \
-  -g examples/host_build_graph/vector_example/golden.py \
+  -k examples/a2a3/host_build_graph/vector_example/kernels \
+  -g examples/a2a3/host_build_graph/vector_example/golden.py \
   -p a2a3 \
   -v
 ```
diff --git a/examples/a2a3/tensormap_and_ringbuffer/docs/INCORE_ORCHESTRATION_GUIDE.md b/examples/a2a3/tensormap_and_ringbuffer/docs/INCORE_ORCHESTRATION_GUIDE.md
index 2db7dda82..3265db0d9 100644
--- a/examples/a2a3/tensormap_and_ringbuffer/docs/INCORE_ORCHESTRATION_GUIDE.md
+++ b/examples/a2a3/tensormap_and_ringbuffer/docs/INCORE_ORCHESTRATION_GUIDE.md
@@ -5,9 +5,8 @@
 In tensormap_and_ringbuffer, the orchestration function runs on AICPU and builds the graph directly on device. Dependencies are discovered automatically by TensorMap based on tensor overlap, and task memory is allocated from ring buffers.
 
 ## Where To Put Orchestration Code
-
-- Each example keeps orchestration sources under `examples/tensormap_and_ringbuffer//kernels/orchestration/`.
-- `examples/tensormap_and_ringbuffer//kernels/kernel_config.py` selects the orchestration source and the runtime `tensormap_and_ringbuffer`.
+- Each example keeps orchestration sources under `examples/a2a3/tensormap_and_ringbuffer//kernels/orchestration/`.
+- `examples/a2a3/tensormap_and_ringbuffer//kernels/kernel_config.py` selects the orchestration source and the runtime `tensormap_and_ringbuffer`.
 
 ## Required Exports
 
@@ -78,6 +77,5 @@ Dependencies are inferred by TensorMap from input/inout/output tensors, so you d
 Do not call `pto2_rt_orchestration_done` yourself in device mode. The executor wraps the entry call in an outer scope and signals completion after `aicpu_orchestration_entry` returns.
 ## Examples
-
-- `examples/tensormap_and_ringbuffer/vector_example/kernels/orchestration/example_orchestration.cpp` (AIV-only tasks)
-- `examples/tensormap_and_ringbuffer/bgemm/kernels/orchestration/bgemm_orch.cpp` (mixed AIC + AIV tasks)
+- `examples/a2a3/tensormap_and_ringbuffer/vector_example/kernels/orchestration/example_orchestration.cpp` (AIV-only tasks)
+- `examples/a2a3/tensormap_and_ringbuffer/bgemm/kernels/orchestration/bgemm_orch.cpp` (mixed AIC + AIV tasks)
diff --git a/examples/scripts/run_example.py b/examples/scripts/run_example.py
index 89ab84199..76ae16f18 100644
--- a/examples/scripts/run_example.py
+++ b/examples/scripts/run_example.py
@@ -21,12 +21,12 @@ Examples:
 
     # Run hardware example (requires Ascend device)
-    python examples/scripts/run_example.py -k examples/host_build_graph/vector_example/kernels \
-        -g examples/host_build_graph/vector_example/golden.py
+    python examples/scripts/run_example.py -k examples/a2a3/host_build_graph/vector_example/kernels \
+        -g examples/a2a3/host_build_graph/vector_example/golden.py
 
     # Run simulation example (no hardware required)
-    python examples/scripts/run_example.py -k examples/host_build_graph/vector_example/kernels \
-        -g examples/host_build_graph/vector_example/golden.py \
+    python examples/scripts/run_example.py -k examples/a2a3/host_build_graph/vector_example/kernels \
+        -g examples/a2a3/host_build_graph/vector_example/golden.py \
         -p a2a3sim
 
     # Run with specific device
diff --git a/src/a2a3/docs/runtimes.md b/src/a2a3/docs/runtimes.md
index 72248e273..7d4a694bd 100644
--- a/src/a2a3/docs/runtimes.md
+++ b/src/a2a3/docs/runtimes.md
@@ -1,20 +1,20 @@
 # Runtime Variants (a2a3)
 
-Four runtime implementations live under `src/a2a3/runtime/`, each providing a different graph-building strategy. The `RUNTIME_CONFIG.runtime` field in `kernel_config.py` selects which runtime to use.
+Three runtime implementations live under `src/a2a3/runtime/`, each providing a different graph-building strategy. The `RUNTIME_CONFIG.runtime` field in `kernel_config.py` selects which runtime to use.
 
 ## Comparison
 
-| Feature | host_build_graph | aicpu_build_graph | tensormap_and_ringbuffer_unmodified | tensormap_and_ringbuffer |
-| ------- | ---------------- | ----------------- | ----------------------------------- | ------------------------ |
-| Graph built on | Host CPU | AICPU (device) | AICPU (device) | AICPU (device) |
-| Task storage | Fixed `Task[]` array | Fixed `Task[]` array | Ring buffer (`PTO2TaskDescriptor[]`) | Ring buffer (`PTO2TaskDescriptor[]`) |
-| Dependencies | Explicit edges | Explicit edges | Auto-derived via TensorMap | Auto-derived via TensorMap, plus optional manual dependencies |
-| Memory management | Host-side | Host + device malloc | Ring buffer heap (GM) | Ring buffer heap (GM) |
-| Concurrent build+schedule | No | Optional (`build_mode=1`) | Yes (always) | Yes (always) |
-| Profiling support | Basic | Basic | Multi-level hierarchy | Multi-level hierarchy |
-| Batch/streaming | No | No | Yes (flow control, back-pressure) | Yes (flow control, back-pressure) |
-| Thread model | N scheduler threads | 1 builder + N schedulers | 1 orchestrator + 3 schedulers | 1 orchestrator + 3 schedulers |
-| Use case | Development, debugging | Reduced host-device transfer | Baseline PTO2 comparison | Production PTO2 with manual-scope extensions |
+| Feature | host_build_graph | aicpu_build_graph | tensormap_and_ringbuffer |
+| ------- | ---------------- | ----------------- | ------------------------ |
+| Graph built on | Host CPU | AICPU (device) | AICPU (device) |
+| Task storage | Fixed `Task[]` array | Fixed `Task[]` array | Ring buffer (`PTO2TaskDescriptor[]`) |
+| Dependencies | Explicit edges | Explicit edges | Auto-derived via TensorMap, plus optional manual dependencies |
+| Memory management | Host-side | Host + device malloc | Ring buffer heap (GM) |
+| Concurrent build+schedule | No | Optional (`build_mode=1`) | Yes (always) |
+| Profiling support | Basic | Basic | Multi-level hierarchy |
+| Batch/streaming | No | No | Yes (flow control, back-pressure) |
+| Thread model | N scheduler threads | 1 builder + N schedulers | 1 orchestrator + 3 schedulers |
+| Use case | Development, debugging | Reduced host-device transfer | Production PTO2 with manual-scope extensions |
 
 ## host_build_graph
 
@@ -47,16 +47,6 @@ The primary production runtime. Uses ring buffers for task slots and output memo
 - Multi-ring: HeapRing, TaskRing, and DepPool split into 4 independent instances for nested scope isolation
 - Supports streaming, flow control, large batch sizes, and multi-level profiling
 
-## tensormap_and_ringbuffer_unmodified
-
-An unmodified clone of the baseline PTO2 runtime, kept side-by-side for apples-to-apples comparison against the extended `tensormap_and_ringbuffer` runtime.
-
-- Same TensorMap and ring-buffer architecture as the original PTO2 implementation
-- No manual-scope dependency extensions
-- Intended for benchmarking and regression isolation, not new feature development
-
-See [tensormap_and_ringbuffer_unmodified/docs/](../runtime/tensormap_and_ringbuffer_unmodified/docs/) for the baseline runtime logic and profiling notes.
-
 See [tensormap_and_ringbuffer/docs/](../runtime/tensormap_and_ringbuffer/docs/):
 
 - [RUNTIME_LOGIC.md](../runtime/tensormap_and_ringbuffer/docs/RUNTIME_LOGIC.md) — Full system design
diff --git a/src/a2a3/runtime/tensormap_and_ringbuffer/docs/SUBMIT_BY_CLUSTER.md b/src/a2a3/runtime/tensormap_and_ringbuffer/docs/SUBMIT_BY_CLUSTER.md
index c6d0e3ebd..8f954024a 100644
--- a/src/a2a3/runtime/tensormap_and_ringbuffer/docs/SUBMIT_BY_CLUSTER.md
+++ b/src/a2a3/runtime/tensormap_and_ringbuffer/docs/SUBMIT_BY_CLUSTER.md
@@ -204,8 +204,8 @@ Milestone command (device):
 
 ```bash
 python examples/scripts/run_example.py \
-  -k tests/st/tensormap_and_ringbuffer/batch_paged_attention/kernels \
-  -g tests/st/tensormap_and_ringbuffer/batch_paged_attention/golden.py \
+  -k tests/st/a2a3/tensormap_and_ringbuffer/batch_paged_attention/kernels \
+  -g tests/st/a2a3/tensormap_and_ringbuffer/batch_paged_attention/golden.py \
   -p a2a3 -d 9
 ```
diff --git a/src/a2a3/runtime/tensormap_and_ringbuffer_unmodified/aicore/aicore_executor.cpp b/src/a2a3/runtime/tensormap_and_ringbuffer_unmodified/aicore/aicore_executor.cpp
deleted file mode 100644
index 1c03606e4..000000000
--- a/src/a2a3/runtime/tensormap_and_ringbuffer_unmodified/aicore/aicore_executor.cpp
+++ /dev/null
@@ -1,140 +0,0 @@
-/*
- * Copyright (c) PyPTO Contributors.
- * This program is free software, you can redistribute it and/or modify it under the terms and conditions of
- * CANN Open Software License Agreement Version 2.0 (the "License").
- * Please refer to the License for details. You may not use this file except in compliance with the License.
- * THIS SOFTWARE IS PROVIDED ON AN "AS IS" BASIS, WITHOUT WARRANTIES OF ANY KIND, EITHER EXPRESS OR IMPLIED,
- * INCLUDING BUT NOT LIMITED TO NON-INFRINGEMENT, MERCHANTABILITY, OR FITNESS FOR A PARTICULAR PURPOSE.
- * See LICENSE in the root of the software repository for the full text of the License.
- * ----------------------------------------------------------------------------------------------------------- - */ - -#include "aicore/aicore.h" -#include "aicore/performance_collector_aicore.h" -#include "common/perf_profiling.h" -#include "common/platform_config.h" // Register-based communication -#include "pto2_dispatch_payload.h" // NOLINT(build/include_subdir) -#include "runtime.h" // NOLINT(build/include_subdir) - -/** - * Unified function pointer type for kernel dispatch - * - * All kernels follow the same signature: void kernel(__gm__ int64_t* args) - * This enables simple, switch-free dispatch. - */ -typedef void (*UnifiedKernelFunc)(__gm__ int64_t*); - -/** - * Execute task from PTO2DispatchPayload. - * - * Reads function_bin_addr and args from the dispatch payload. - * - * @param payload Pointer to PTO2DispatchPayload in global memory - */ -__aicore__ __attribute__((always_inline)) static void execute_task(__gm__ PTO2DispatchPayload* payload) { - if (payload == nullptr || payload->function_bin_addr == 0) { - return; - } - - UnifiedKernelFunc kernel = (UnifiedKernelFunc)payload->function_bin_addr; - kernel(reinterpret_cast<__gm__ int64_t*>(payload->args)); - OUT_OF_ORDER_STORE_BARRIER(); -} - -/** - * AICore main execution loop - * - * Implements the AICPU-AICore register-based dispatch protocol: - * 1. Wait for AICPU ready signal via handshake buffer - * 2. Report physical core ID and core type, signal AICore ready - * 3. Cache per-core PTO2DispatchPayload pointer from hank->task - * 4. Poll DATA_MAIN_BASE register for task dispatch until exit signal - * - * AICPU writes &s_pto2_payload_per_core[i] to hank->task before setting - * aicpu_ready=1. AICore caches this pointer and reads function_bin_addr + - * args pointer from it on each dispatch. reg_val is a monotonically - * increasing task ID used only for dispatch signaling and ACK/FIN protocol. 
- *
- * @param runtime Pointer to Runtime in global memory
- * @param block_idx Block index (core ID)
- * @param core_type Core type (AIC or AIV)
- */
-__aicore__ __attribute__((weak)) void aicore_execute(__gm__ Runtime* runtime, int block_idx, CoreType core_type) {
-    __gm__ Handshake* my_hank = (__gm__ Handshake*)(&runtime->workers[block_idx]);
-
-    // Phase 1: Wait for AICPU initialization signal
-    while (my_hank->aicpu_ready == 0) {
-        dcci(my_hank, SINGLE_CACHE_LINE);
-    }
-
-    // Phase 2: Report physical core ID, signal ready
-    my_hank->physical_core_id = get_physical_core_id();
-    OUT_OF_ORDER_STORE_BARRIER();
-    my_hank->aicore_regs_ready = 1;
-    dcci(&my_hank->aicore_regs_ready, SINGLE_CACHE_LINE, CACHELINE_OUT);
-    while (my_hank->aicpu_regs_ready == 0) {
-        dcci(&my_hank->aicpu_regs_ready, SINGLE_CACHE_LINE);
-    }
-    // Report initial idle status via register
-    write_reg(RegId::COND, AICORE_IDLE_VALUE);
-
-    // Phase 3: Report core type, signal ready
-    my_hank->core_type = core_type;
-    OUT_OF_ORDER_STORE_BARRIER();
-    my_hank->aicore_done = block_idx + 1;  // Signal ready (use block_idx + 1 to avoid 0)
-
-    dcci(my_hank, SINGLE_CACHE_LINE, CACHELINE_OUT);
-
-    // Cache per-core dispatch payload pointer (set by AICPU before aicpu_ready)
-    __gm__ PTO2DispatchPayload* payload = reinterpret_cast<__gm__ PTO2DispatchPayload*>(my_hank->task);
-
-    bool profiling_enabled = runtime->enable_profiling;
-
-    // Phase 4: Main execution loop - poll register for tasks until exit signal
-    // Register encoding: AICPU_IDLE_TASK_ID=idle, task_id=task, AICORE_EXIT_SIGNAL=exit
-    uint32_t reg_val = AICPU_IDLE_TASK_ID;
-    uint32_t last_reg_val = AICPU_IDLE_TASK_ID;
-
-    while (true) {
-        reg_val = static_cast<uint32_t>(read_reg(RegId::DATA_MAIN_BASE));
-        if (reg_val == AICORE_EXIT_SIGNAL) {
-            // Signal exit acknowledgment to AICPU
-            write_reg(RegId::COND, AICORE_EXITED_VALUE);
-            break;
-        }
-
-        // Execute task if new (reg_val encoding: AICPU_IDLE_TASK_ID=idle, task_id=task)
-        if (reg_val == AICPU_IDLE_TASK_ID || reg_val == last_reg_val) {
-            SPIN_WAIT_HINT();
-            continue;
-        }
-
-        {
-            uint32_t task_id = reg_val;  // Decode: register holds task_id directly
-
-            // Invalidate payload buffer (AICPU updates its content each dispatch)
-            dcci(payload, ENTIRE_DATA_CACHE);
-
-            write_reg(RegId::COND, MAKE_ACK_VALUE(task_id));
-
-            // Performance profiling: record start time
-            uint64_t start_time = get_sys_cnt_aicore();
-
-            // Execute the task
-            execute_task(payload);
-
-            // Performance profiling: record task execution
-            if (profiling_enabled) {
-                uint64_t end_time = get_sys_cnt_aicore();
-                __gm__ PerfBuffer* perf_buf = (__gm__ PerfBuffer*)my_hank->perf_records_addr;
-                perf_aicore_record_task(perf_buf, task_id, start_time, end_time);
-            }
-
-            last_reg_val = reg_val;
-            write_reg(RegId::COND, MAKE_FIN_VALUE(task_id));
-        }
-    }
-
-    // Flush all dirty cache lines to HBM before kernel exit.
-    dcci(my_hank, SINGLE_CACHE_LINE, CACHELINE_OUT);
-}
diff --git a/src/a2a3/runtime/tensormap_and_ringbuffer_unmodified/aicpu/aicpu_executor.cpp b/src/a2a3/runtime/tensormap_and_ringbuffer_unmodified/aicpu/aicpu_executor.cpp
deleted file mode 100644
index 79b440e09..000000000
--- a/src/a2a3/runtime/tensormap_and_ringbuffer_unmodified/aicpu/aicpu_executor.cpp
+++ /dev/null
@@ -1,2473 +0,0 @@
-/*
- * Copyright (c) PyPTO Contributors.
- * This program is free software, you can redistribute it and/or modify it under the terms and conditions of
- * CANN Open Software License Agreement Version 2.0 (the "License").
- * Please refer to the License for details. You may not use this file except in compliance with the License.
- * THIS SOFTWARE IS PROVIDED ON AN "AS IS" BASIS, WITHOUT WARRANTIES OF ANY KIND, EITHER EXPRESS OR IMPLIED,
- * INCLUDING BUT NOT LIMITED TO NON-INFRINGEMENT, MERCHANTABILITY, OR FITNESS FOR A PARTICULAR PURPOSE.
- * See LICENSE in the root of the software repository for the full text of the License.
- * -----------------------------------------------------------------------------------------------------------
- */
-#include
-#include
-#include
-
-#include
-#include
-#include
-#include
-#include
-#include
-#include
-#ifdef __linux__
-#include
-#endif
-
-#include "aicpu/device_log.h"
-#include "aicpu/device_time.h"
-#include "pto2_dispatch_payload.h"
-#include "runtime.h"
-#include "spin_hint.h"
-
-// Runtime headers (full struct definition for create/destroy + PTO2_SCOPE)
-#include "pto_runtime2.h"
-#include "pto_runtime2_types.h"
-#include "pto_shared_memory.h"
-
-// Performance profiling headers
-#include "aicpu/performance_collector_aicpu.h"
-#include "common/memory_barrier.h"
-#include "common/perf_profiling.h"
-#include "common/unified_log.h"
-
-// Register-based communication
-#include "aicpu/platform_regs.h"
-#include "common/platform_config.h"
-
-// Core type definitions
-#include "common/core_type.h"
-
-// CoreCallable for resolved dispatch address
-#include "callable.h"
-
-#if PTO2_PROFILING
-// Accumulated nanoseconds per sub-step
-#define CYCLE_COUNT_START() uint64_t _t0 = get_sys_cnt_aicpu(), _t1
-#define CYCLE_COUNT_LAP(acc) \
-    do { \
-        _t1 = get_sys_cnt_aicpu(); \
-        acc += (_t1 - _t0); \
-        _t0 = _t1; \
-    } while (0)
-#else
-#define CYCLE_COUNT_START()
-#define CYCLE_COUNT_LAP(acc)
-#endif
-
-// Device orchestration function signature (loaded via dlopen).
-// The executor binds the current thread's PTO2Runtime into orchestration TLS
-// before calling the user entry.
-typedef void (*DeviceOrchestrationFunc)(
-    const ChipStorageTaskArgs& orch_args, int32_t orch_thread_num, int32_t orch_thread_index);
-typedef void (*DeviceOrchestrationBindRuntimeFunc)(PTO2Runtime* rt);
-
-// Config function exported by orchestration .so
-typedef PTO2OrchestrationConfig (*DeviceOrchestrationConfigFunc)(const ChipStorageTaskArgs& orch_args);
-
-constexpr int32_t MAX_AICPU_THREADS = PLATFORM_MAX_AICPU_THREADS;
-constexpr int32_t MAX_AIC_PER_THREAD = PLATFORM_MAX_AIC_PER_THREAD;
-constexpr int32_t MAX_AIV_PER_THREAD = PLATFORM_MAX_AIV_PER_THREAD;
-constexpr int32_t MAX_CORES_PER_THREAD = PLATFORM_MAX_CORES_PER_THREAD;
-
-constexpr int32_t MAX_IDLE_ITERATIONS = 800000;       // ~20s idle then scheduler gives up (avoid long hang)
-constexpr int32_t STALL_LOG_INTERVAL = 50000;         // DEV_ALWAYS every N idle iters to debug hang
-constexpr int32_t FATAL_ERROR_CHECK_INTERVAL = 1024;  // Check orchestrator error every N idle iters
-constexpr int32_t STALL_DUMP_READY_MAX = 8;
-constexpr int32_t STALL_DUMP_WAIT_MAX = 4;
-constexpr int32_t STALL_DUMP_CORE_MAX = 8;
-constexpr int32_t PROGRESS_VERBOSE_THRESHOLD = 10;  // log every completion for the first N tasks
-constexpr int32_t PROGRESS_LOG_INTERVAL = 250;      // log every N completions after threshold
-
-static PTO2Runtime* rt{nullptr};
-
-// Per-core dispatch payload storage (one aligned cache line per physical core)
-static PTO2DispatchPayload s_pto2_payload_per_core[RUNTIME_MAX_WORKER];
-
-// Per-core state: one cache line per core to eliminate false sharing
-// and co-locate all hot-path fields for minimal cache misses.
-struct alignas(64) CoreExecState {
-    // --- Hot fields (completion + dispatch, every iteration) ---
-    uint64_t reg_addr;                        // offset 0: register address (set once in handshake)
-    PTO2TaskSlotState* executing_slot_state;  // offset 8: slot state for running task
-    int32_t executing_reg_task_id;            // offset 16: register task ID (AICPU_TASK_INVALID = idle)
-    uint32_t dispatch_seq;                    // offset 20: monotonic dispatch counter
-    PTO2SubtaskSlot executing_subslot;        // offset 24: which subtask slot is running
-    uint8_t pad_[3];                          // offset 25: alignment padding
-#if PTO2_PROFILING
-    // --- Profiling fields (dispatch path, compile-time gated) ---
-    uint32_t dispatch_count;      // offset 28: dispatched task count (buffer mgmt)
-    uint64_t dispatch_timestamp;  // offset 32: AICPU dispatch timestamp
-#endif
-    // --- Cold fields (init/diagnostics only, never in hot path) ---
-    int32_t worker_id;          // index in runtime.workers[]
-    uint32_t physical_core_id;  // hardware physical core ID
-    CoreType core_type;         // AIC or AIV
-};
-static_assert(sizeof(CoreExecState) == 64, "CoreExecState must occupy exactly one cache line");
-
-// core_states_ encodes per-cluster core idle/running in 3 bits per cluster:
-//   bit i*3   = AIC of cluster i (1 = idle, 0 = running)
-//   bit i*3+1 = AIV0 of cluster i
-//   bit i*3+2 = AIV1 of cluster i
-// Max 21 clusters per tracker (63 bits in uint64_t).
-class alignas(64) CoreTracker {
- public:
-    static inline int32_t MAX_CORE_PER_THREAD = 63;
-    static constexpr int32_t MAX_CLUSTERS = 63 / 3;
-
- public:
-    CoreTracker() = default;
-
-    class BitStates {
-       public:  // NOLINT(whitespace/indent)
-        BitStates() = default;
-
-        explicit BitStates(uint64_t states) : states_(states) {}
-        void init() { states_ = 0; }
-
-        BitStates operator~() const { return BitStates(~states_); }
-        BitStates operator&(const BitStates& other) const { return BitStates(states_ & other.states_); }
-        BitStates operator|(const BitStates& other) const { return BitStates(states_ | other.states_); }
-        BitStates operator^(const BitStates& other) const { return BitStates(states_ ^ other.states_); }
-        BitStates operator>>(int32_t offset) const { return BitStates(states_ >> offset); }
-        BitStates operator<<(int32_t offset) const { return BitStates(states_ << offset); }
-        void operator&=(const BitStates& other) { states_ &= other.states_; }
-        void operator|=(const BitStates& other) { states_ |= other.states_; }
-        void operator^=(const BitStates& other) { states_ ^= other.states_; }
-
-        bool has_value() const { return states_ > 0; }
-        int32_t count() const { return __builtin_popcountll(states_); }
-
-        // Extract the lowest set bit from mask, clear it, and return its position.
-        // Returns -1 if mask is empty.
-        int32_t pop_first() {
-            if (states_ == 0) return -1;
-            int32_t pos = __builtin_ctzll(states_);
-            states_ &= states_ - 1;
-            return pos;
-        }
-
-       private:  // NOLINT(whitespace/indent)
-        uint64_t states_{0};
-    };
-
- public:
-    void init(int32_t cluster_count) {
-        cluster_count_ = cluster_count;
-        aic_mask_.init();
-        aiv_mask_.init();
-        for (int32_t i = 0; i < cluster_count; i++) {
-            aic_mask_ |= BitStates(1ULL << (i * 3));
-            aiv_mask_ |= BitStates(6ULL << (i * 3));
-        }
-        core_states_ = aic_mask_ | aiv_mask_;
-    }
-
-    void set_cluster(int32_t cluster_idx, int32_t aic_wid, int32_t aiv0_wid, int32_t aiv1_wid) {
-        core_id_map_[cluster_idx * 3] = aic_wid;
-        core_id_map_[cluster_idx * 3 + 1] = aiv0_wid;
-        core_id_map_[cluster_idx * 3 + 2] = aiv1_wid;
-    }
-
-    int32_t get_cluster_count() const { return cluster_count_; }
-
-    // --- Running core queries ---
-
-    template <CoreType CT>
-    bool has_running_cores() const {
-        if constexpr (CT == CoreType::AIC) {
-            return ((~core_states_) & aic_mask_).has_value();
-        } else {
-            return ((~core_states_) & aiv_mask_).has_value();
-        }
-    }
-
-    bool has_any_running_cores() const { return ((~core_states_) & (aic_mask_ | aiv_mask_)).has_value(); }
-
-    template <CoreType CT>
-    int32_t get_running_count() const {
-        if constexpr (CT == CoreType::AIC) {
-            return ((~core_states_) & aic_mask_).count();
-        } else {
-            return ((~core_states_) & aiv_mask_).count();
-        }
-    }
-
-    // Return an opaque bitmask for iterating running cores of a given type.
-    // Use pop_first() to extract core bit offsets one at a time.
-    template <CoreType CT>
-    BitStates get_running_cores() const {
-        if constexpr (CT == CoreType::AIC) {
-            return (~core_states_) & aic_mask_;
-        } else {
-            return (~core_states_) & aiv_mask_;
-        }
-    }
-
-    BitStates get_all_running_cores() const { return (~core_states_) & (aic_mask_ | aiv_mask_); }
-
-    // --- Cluster matching ---
-
-    BitStates get_valid_cluster_offset_states(PTO2ResourceShape shape) const {
-        switch (shape) {
-            case PTO2ResourceShape::AIC:
-                return core_states_ & aic_mask_;
-            case PTO2ResourceShape::AIV:
-                return ((core_states_ >> 1) | (core_states_ >> 2)) & aic_mask_;
-            case PTO2ResourceShape::MIX:
-                return (core_states_ >> 1) & (core_states_ >> 2) & core_states_ & aic_mask_;
-        }
-        return BitStates(0ULL);
-    }
-
-    int32_t get_aic_core_id(int32_t cluster_offset) const { return core_id_map_[cluster_offset]; }
-    int32_t get_aiv0_core_id(int32_t cluster_offset) const { return core_id_map_[cluster_offset + 1]; }
-    int32_t get_aiv1_core_id(int32_t cluster_offset) const { return core_id_map_[cluster_offset + 2]; }
-
-    int32_t get_aic_core_offset(int32_t cluster_offset) const { return cluster_offset; }
-    int32_t get_aiv0_core_offset(int32_t cluster_offset) const { return cluster_offset + 1; }
-    int32_t get_aiv1_core_offset(int32_t cluster_offset) const { return cluster_offset + 2; }
-
-    bool is_aic_core_idle(int32_t cluster_offset) const {
-        return ((core_states_ >> cluster_offset) & BitStates(1ULL)).has_value();
-    }
-    bool is_aiv0_core_idle(int32_t cluster_offset) const {
-        return ((core_states_ >> (cluster_offset + 1)) & BitStates(1ULL)).has_value();
-    }
-    bool is_aiv1_core_idle(int32_t cluster_offset) const {
-        return ((core_states_ >> (cluster_offset + 2)) & BitStates(1ULL)).has_value();
-    }
-
-    // --- State mutation ---
-
-    // Toggle bit at the given bit offset (running <-> idle)
-    void change_core_state(int32_t bit_offset) { core_states_ ^= BitStates(1ULL << bit_offset); }
-
-    // --- Bit offset <-> worker_id mapping ---
-
-    int32_t get_core_id_by_offset(int32_t offset) const { return core_id_map_[offset]; }
-
- private:
-    int32_t cluster_count_;
-    BitStates aic_mask_;
-    BitStates aiv_mask_;
-    BitStates core_states_;
-    int32_t core_id_map_[63];  // bit_position -> worker_id, max 21 clusters * 3
-};
-
-struct AicpuExecutor {
-    int32_t orch_thread_num_;
-    int32_t sched_thread_num_;
-    bool orch_to_sched_{false};
-
-    // ===== Thread management state =====
-    std::atomic<int32_t> thread_idx_{0};
-    std::atomic<bool> initialized_{false};
-    std::atomic<bool> init_done_{false};
-    std::atomic<bool> init_failed_{false};
-    std::atomic<bool> finished_{false};
-
-    int32_t thread_num_{0};
-    int32_t cores_total_num_{0};
-    int32_t thread_cores_num_{0};  // Cores per scheduler thread (0 for orchestrator when thread_num_==4)
-    int32_t core_count_per_thread_[MAX_AICPU_THREADS];  // Actual core count per thread
-    int32_t core_assignments_[MAX_AICPU_THREADS][MAX_CORES_PER_THREAD];
-
-    // Per-core execution state, indexed by core_id (= worker_id)
-    CoreExecState core_exec_states_[RUNTIME_MAX_WORKER];
-
-    // Cluster-ordered worker_id lists for core assignment (init-only)
-    int32_t aic_worker_ids_[MAX_CORES_PER_THREAD];
-    int32_t aiv_worker_ids_[MAX_CORES_PER_THREAD];
-    int32_t aic_count_{0};
-    int32_t aiv_count_{0};
-
-    // Platform register base address array (set via get_platform_regs())
-    uint64_t regs_{0};
-
-    CoreTracker core_trackers_[MAX_AICPU_THREADS];
-
-    // ===== Task queue state (managed by scheduler ready queues) =====
-
-    // Task execution tracking
-    std::atomic<int32_t> completed_tasks_{0};
-    int32_t total_tasks_{0};
-    std::atomic<int32_t> finished_count_{0};
-    // Device orchestration: set by last orchestrator when graph is built; schedulers poll it.
-    // volatile prevents the compiler from hoisting the load out of spin loops.
-    volatile bool orchestrator_done_{false};
-    std::atomic<bool> pto2_init_done_{false};
-    std::atomic<bool> runtime_init_ready_{false};
-    std::atomic<bool> pto2_init_complete_{false};  // init block finished; others wait for this
-    std::atomic<int32_t> orch_finished_count_{0};  // Number of orchestrator threads that have finished
-
-    // ===== Dynamic core transition state =====
-    std::atomic<bool> transition_requested_{false};
-    std::atomic<int32_t> wait_reassign_{0};
-    std::atomic<bool> reassigned_{false};
-    std::atomic<bool> completed_{false};
-
-    // Orchestration SO handle - defer dlclose until all tasks complete
-    void* orch_so_handle_{nullptr};
-    char orch_so_path_[256]{};  // Path to orchestration SO file for cleanup
-
-    // Shared orchestration function pointer (loaded by first orch thread, used by all)
-    DeviceOrchestrationFunc orch_func_{nullptr};
-    DeviceOrchestrationBindRuntimeFunc orch_bind_runtime_{nullptr};
-    const ChipStorageTaskArgs* orch_args_cached_{nullptr};
-
-    uint64_t* func_id_to_addr_;
-    uint64_t get_function_bin_addr(int func_id) const {
-        if (func_id < 0 || func_id >= RUNTIME_MAX_FUNC_ID) return 0;
-        return func_id_to_addr_[func_id];
-    }
-
-    // ===== Methods =====
-    int32_t init(Runtime* runtime);
-    int32_t handshake_all_cores(Runtime* runtime);
-    bool assign_cores_to_threads();
-    void reassign_cores_for_all_threads();
-    int32_t resolve_and_dispatch_pto2(Runtime* runtime, int32_t thread_idx);
-    int32_t shutdown_aicore(Runtime* runtime, int32_t thread_idx, const int32_t* cur_thread_cores, int32_t core_num);
-    int32_t run(Runtime* runtime);
-    void deinit(Runtime* runtime);
-    void emergency_shutdown(Runtime* runtime);
-    void diagnose_stuck_state(
-        Runtime* runtime, int32_t thread_idx, const int32_t* cur_thread_cores, int32_t core_num, Handshake* hank);
-
-    template <CoreType CT>
-    void check_running_cores_for_completion(int32_t thread_idx,
-                                            Handshake* hank,
-                                            int32_t& completed_this_turn,
-                                            int32_t& cur_thread_completed,
-                                            bool& made_progress,
-                                            PTO2TaskSlotState* deferred_release_slot_states[],
-                                            int32_t& deferred_release_count,
-                                            PTO2LocalReadyBuffer* local_bufs
-#if PTO2_PROFILING
-                                            ,
-                                            bool profiling_enabled,
-                                            uint32_t& phase_complete_count
-#endif
-#if PTO2_SCHED_PROFILING
-                                            ,
-                                            uint64_t& complete_probe_count,
-                                            uint64_t& complete_hit_count,
-                                            uint64_t& notify_edges_total,
-                                            int32_t& notify_max_degree,
-                                            uint64_t& notify_tasks_enqueued,
-                                            uint64_t& fanin_edges_total,
-                                            int32_t& fanin_max_degree,
-                                            uint64_t& sched_complete_perf_cycle
-#endif
-    ) {
-#if !PTO2_PROFILING
-        (void)hank;  // NOLINT(readability/casting)
-#endif
-        CoreTracker& tracker = core_trackers_[thread_idx];
-        auto running_core_states = tracker.get_running_cores<CT>();
-        while (running_core_states.has_value()) {
-            int32_t bit_pos = running_core_states.pop_first();
-            int32_t core_id = tracker.get_core_id_by_offset(bit_pos);
-            CoreExecState& core_exec_state = core_exec_states_[core_id];
-            uint64_t reg_addr = core_exec_state.reg_addr;
-
-            int32_t expected_reg_task_id = core_exec_state.executing_reg_task_id;
-            uint64_t reg_val = read_reg(reg_addr, RegId::COND);
-            int32_t reg_task_id = EXTRACT_TASK_ID(reg_val);
-            int32_t reg_state = EXTRACT_TASK_STATE(reg_val);
-            bool done = reg_task_id == expected_reg_task_id && reg_state == TASK_FIN_STATE;
-#if PTO2_SCHED_PROFILING
-            if (profiling_enabled) {
-                complete_probe_count++;
-                if (done) {
-                    complete_hit_count++;
-                }
-            }
-#endif
-
-            if (done) {
-                core_exec_state.executing_reg_task_id = AICPU_TASK_INVALID;
-                PTO2TaskSlotState& slot_state = *core_exec_state.executing_slot_state;
-
-                // Completion: increment atomic counter, trigger task-level completion on last subtask
-                bool mixed_complete = rt->scheduler.on_subtask_complete(slot_state);
-                if (mixed_complete) {
-#if PTO2_SCHED_PROFILING
-                    PTO2CompletionStats cstats =
-                        rt->scheduler.on_mixed_task_complete(slot_state, thread_idx, local_bufs);
-                    notify_edges_total += cstats.fanout_edges;
-                    if (cstats.fanout_edges > notify_max_degree) notify_max_degree = cstats.fanout_edges;
-                    notify_tasks_enqueued += cstats.tasks_enqueued;
-                    phase_complete_count++;
-#else
-                    rt->scheduler.on_mixed_task_complete(slot_state, local_bufs);
-#if PTO2_PROFILING
-                    phase_complete_count++;
-#endif
-#endif
-                    if (deferred_release_count < 256) {
-                        deferred_release_slot_states[deferred_release_count++] = &slot_state;
-                    } else {
-                        DEV_ALWAYS("Thread %d: release", thread_idx);
-                        while (deferred_release_count > 0) {
-#if PTO2_SCHED_PROFILING
-                            int32_t fe = rt->scheduler.on_task_release(
-                                *deferred_release_slot_states[--deferred_release_count], thread_idx);
-#else
-                            int32_t fe =
-                                rt->scheduler.on_task_release(*deferred_release_slot_states[--deferred_release_count]);
-#endif
-                            (void)fe;  // NOLINT(readability/casting)
-#if PTO2_SCHED_PROFILING
-                            fanin_edges_total += fe;
-                            if (fe > fanin_max_degree) fanin_max_degree = fe;
-#endif
-                        }
-                        deferred_release_slot_states[deferred_release_count++] = &slot_state;
-                    }
-                }
-                tracker.change_core_state(bit_pos);
-#if PTO2_PROFILING
-                if (profiling_enabled) {
-#if PTO2_SCHED_PROFILING
-                    uint64_t t_perf_start = get_sys_cnt_aicpu();
-#endif
-                    Handshake* h = &hank[core_id];
-                    uint64_t finish_ts = get_sys_cnt_aicpu();
-                    PerfBuffer* perf_buf = reinterpret_cast<PerfBuffer*>(h->perf_records_addr);
-
-                    // Pre-extract fanout (platform layer cannot depend on PTO2DepListEntry)
-                    uint64_t fanout_arr[RUNTIME_MAX_FANOUT];
-                    int32_t fanout_n = 0;
-                    PTO2DepListEntry* cur = slot_state.fanout_head;
-                    while (cur != nullptr && fanout_n < RUNTIME_MAX_FANOUT) {
-                        fanout_arr[fanout_n++] = cur->slot_state->task->task_id.raw;
-                        cur = cur->next;
-                    }
-
-                    int32_t perf_slot_idx = static_cast<int32_t>(core_exec_state.executing_subslot);
-                    if (perf_aicpu_complete_record(perf_buf,
-                                                   static_cast<uint32_t>(expected_reg_task_id),
-                                                   slot_state.task->task_id.raw,
-                                                   slot_state.task->kernel_id[perf_slot_idx],
-                                                   CT,
-                                                   core_exec_state.dispatch_timestamp,
-                                                   finish_ts,
-                                                   fanout_arr,
-                                                   fanout_n) != 0) {
-                        DEV_ERROR("Core %d: perf_aicpu_complete_record failed for task 0x%" PRIx64,
-                                  core_id,
-                                  static_cast<uint64_t>(slot_state.task->task_id.raw));
-                    }
-#if PTO2_SCHED_PROFILING
-                    sched_complete_perf_cycle += (get_sys_cnt_aicpu() - t_perf_start);
-#endif
-                }
-#endif
-
-                DEV_DEBUG("Thread %d: %s core %d completed PTO2 task %d (mixed_complete=%d)",
-                          thread_idx,
-                          CT == CoreType::AIC ? "AIC" : "AIV",
-                          core_id,
-                          expected_reg_task_id,
-                          mixed_complete ? 1 : 0);
-                cur_thread_completed++;
-                if (mixed_complete) {
-                    completed_this_turn++;
-                }
-                made_progress = true;
-            }
-        }
-    }
-
-    static const char* shape_name(PTO2ResourceShape shape) {
-        switch (shape) {
-            case PTO2ResourceShape::AIC:
-                return "AIC";
-            case PTO2ResourceShape::AIV:
-                return "AIV";
-            case PTO2ResourceShape::MIX:
-                return "MIX";
-        }
-        return "UNKNOWN";
-    }
-
-    /**
-     * Returns the dispatch probe order for a given scheduler thread.
-     * Widest shapes first to avoid consuming cluster resources with narrow tasks.
-     * Even/odd threads use different fallback orders (AIC-first vs AIV-first)
-     * to reduce contention on the same ready queue across adjacent threads.
-     */
-    static const PTO2ResourceShape* get_dispatch_order(int32_t thread_idx) {
-        // Even threads: AIC-first fallback after widest
-        static constexpr PTO2ResourceShape kEvenOrder[PTO2_NUM_RESOURCE_SHAPES] = {
-            PTO2ResourceShape::MIX,
-            PTO2ResourceShape::AIC,
-            PTO2ResourceShape::AIV,
-        };
-        // Odd threads: AIV-first fallback after widest
-        static constexpr PTO2ResourceShape kOddOrder[PTO2_NUM_RESOURCE_SHAPES] = {
-            PTO2ResourceShape::MIX,
-            PTO2ResourceShape::AIV,
-            PTO2ResourceShape::AIC,
-        };
-        return (thread_idx % 2 == 0) ? kEvenOrder : kOddOrder;
-    }
-
-    int pop_ready_tasks_batch(PTO2ResourceShape shape,
-                              int32_t thread_idx,
-                              PTO2LocalReadyBuffer& local_buf,
-                              PTO2TaskSlotState** out,
-                              int max_count
-#if PTO2_SCHED_PROFILING
-                              ,
-                              uint64_t& pop_hit,
-                              uint64_t& pop_miss,
-                              uint64_t& local_dispatch_count,
-                              uint64_t& sched_dispatch_pop_cycle
-#endif
-    ) {
-        (void)thread_idx;  // NOLINT(readability/casting)
-#if PTO2_SCHED_PROFILING
-        extern uint64_t g_sched_pop_atomic_count[], g_sched_pop_wait_cycle[];
-        uint64_t t_pop_start = get_sys_cnt_aicpu();
-        int count = rt->scheduler.get_ready_tasks_batch(shape,
-                                                        local_buf,
-                                                        out,
-                                                        max_count,
-                                                        g_sched_pop_atomic_count[thread_idx],
-                                                        g_sched_pop_wait_cycle[thread_idx],
-                                                        local_dispatch_count);
-        sched_dispatch_pop_cycle += (get_sys_cnt_aicpu() - t_pop_start);
-        if (count > 0) {
-            pop_hit += count;
-        } else {
-            pop_miss++;
-        }
-#else
-        int count = rt->scheduler.get_ready_tasks_batch(shape, local_buf, out, max_count);
-#endif
-        return count;
-    }
-
-    /**
-     * Build per-core dispatch payload: copy tensor pointers and scalars into
-     * the per-core args[] array, then populate SPMD local context at the tail.
-     *
-     * Reads next_block_idx and block_num directly from the task descriptor
-     * to populate LocalContext. The caller is responsible for incrementing
-     * next_block_idx AFTER dispatch.
-     *
-     * GlobalContext (sub_block_id) is NOT written here — it is initialized once
-     * at runtime startup by init_global_context().
-     */
-    void build_payload(PTO2DispatchPayload& dispatch_payload, PTO2TaskSlotState& slot_state, PTO2SubtaskSlot subslot) {
-        int32_t slot_idx = static_cast<int32_t>(subslot);
-        uint64_t callable_addr = get_function_bin_addr(slot_state.task->kernel_id[slot_idx]);
-        const CoreCallable* callable = reinterpret_cast<const CoreCallable*>(callable_addr);
-        dispatch_payload.function_bin_addr = callable->resolved_addr();
-        auto& payload = *slot_state.payload;
-        int n = 0;
-        for (int32_t i = 0; i < payload.tensor_count; i++) {
-            dispatch_payload.args[n++] = reinterpret_cast<uint64_t>(&payload.tensors[i]);
-        }
-        for (int32_t i = 0; i < payload.scalar_count; i++) {
-            dispatch_payload.args[n++] = payload.scalars[i];
-        }
-        // Per-dispatch local context (read from slot state)
-        dispatch_payload.local_context.block_idx = slot_state.next_block_idx;
-        dispatch_payload.local_context.block_num = slot_state.block_num;
-        // Store context pointers at fixed suffix positions in args[]
-        // (GlobalContext content is already set by init_global_context, but the
-        // pointer must be written each dispatch since args[] is rebuilt entirely)
-        dispatch_payload.args[SPMD_LOCAL_CONTEXT_INDEX] = reinterpret_cast<uint64_t>(&dispatch_payload.local_context);
-        dispatch_payload.args[SPMD_GLOBAL_CONTEXT_INDEX] = reinterpret_cast<uint64_t>(&dispatch_payload.global_context);
-    }
-
-    void dispatch_subtask_to_core(Runtime* runtime,
-                                  int32_t thread_idx,
-                                  int32_t core_offset,
-                                  PTO2TaskSlotState& slot_state,
-                                  PTO2SubtaskSlot subslot
-#if PTO2_PROFILING
-                                  ,
-                                  bool profiling_enabled
-#endif
-    ) {
-        CoreTracker& tracker = core_trackers_[thread_idx];
-        auto core_id = tracker.get_core_id_by_offset(core_offset);
-#if !PTO2_PROFILING
-        (void)runtime;  // NOLINT(readability/casting)
-#endif
-        CoreExecState& core_exec_state = core_exec_states_[core_id];
-        PTO2DispatchPayload& payload = s_pto2_payload_per_core[core_id];
-        build_payload(payload, slot_state, subslot);
-        core_exec_state.executing_subslot = subslot;
-        core_exec_state.executing_slot_state = &slot_state;
-#if PTO2_PROFILING
-        if (profiling_enabled) {
-            core_exec_state.dispatch_timestamp = get_sys_cnt_aicpu();
-            if (core_exec_state.dispatch_count >= PLATFORM_PROF_BUFFER_SIZE) {
-                perf_aicpu_switch_buffer(runtime, core_id, thread_idx);
-                core_exec_state.dispatch_count = 0;
-            }
-            core_exec_state.dispatch_count++;
-        }
-#endif
-        // Per-core monotonic counter for register protocol uniqueness (32-bit).
-        // PTO2 task_id encodes (ring_id << 32 | local_id); truncation to uint32 loses ring_id,
-        // so tasks from different rings with the same local_id would write identical DATA_MAIN_BASE
-        // values. The AICore uses last_reg_val to detect new dispatches and would skip the
-        // duplicate, while the stale COND register from the previous task (same local_id) would
-        // cause a false-positive completion.
-        // PerfRecord.task_id: register token (low 32) until AICPU overwrites with full (ring_id << 32 | local_id).
-        core_exec_state.dispatch_seq++;
-        uint32_t reg_task_id = core_exec_state.dispatch_seq & TASK_ID_MASK;
-        // Skip reserved sentinel range [AICORE_EXIT_SIGNAL, 0x7FFFFFFF]
-        if (reg_task_id >= AICORE_EXIT_SIGNAL) {
-            core_exec_state.dispatch_seq += (TASK_ID_MASK - reg_task_id + 1);
-            reg_task_id = core_exec_state.dispatch_seq & TASK_ID_MASK;
-        }
-        write_reg(core_exec_state.reg_addr, RegId::DATA_MAIN_BASE, static_cast<uint64_t>(reg_task_id));
-
-        tracker.change_core_state(core_offset);
-        core_exec_state.executing_reg_task_id = reg_task_id;
-    }
-};
-
-static AicpuExecutor g_aicpu_executor;
-
-// ===== AicpuExecutor Method Implementations =====
-
-/**
- * Handshake with all cores and discover their types
- * Sets up register addresses for fast dispatch.
- */
-int32_t AicpuExecutor::handshake_all_cores(Runtime* runtime) {
-    Handshake* all_handshakes = reinterpret_cast<Handshake*>(runtime->workers);
-    cores_total_num_ = runtime->worker_count;
-
-    // Validate cores_total_num_ before using as array index
-    if (cores_total_num_ == 0 || cores_total_num_ > MAX_CORES_PER_THREAD) {
-        DEV_ERROR("Invalid cores_total_num %d (expected 1-%d)", cores_total_num_, MAX_CORES_PER_THREAD);
-        return -1;
-    }
-
-    aic_count_ = 0;
-    aiv_count_ = 0;
-
-    DEV_INFO("Handshaking with %d cores", cores_total_num_);
-
-    // Step 1: Write per-core payload addresses and send handshake signal
-    // OUT_OF_ORDER_STORE_BARRIER() ensures task is globally visible before
-    // aicpu_ready=1, so AICore reads the correct payload pointer after waking up.
-    for (int32_t i = 0; i < cores_total_num_; i++) {
-        all_handshakes[i].task = reinterpret_cast<uint64_t>(&s_pto2_payload_per_core[i]);
-        OUT_OF_ORDER_STORE_BARRIER();
-        all_handshakes[i].aicpu_ready = 1;
-    }
-
-    // Get platform physical cores count for validation
-    uint32_t max_physical_cores_count = platform_get_physical_cores_count();
-
-    // Step 2: Wait for all cores to respond, collect core type and register addresses
-    bool handshake_failed = false;
-    for (int32_t i = 0; i < cores_total_num_; i++) {
-        Handshake* hank = &all_handshakes[i];
-
-        while (hank->aicore_regs_ready == 0) {
-        }
-
-        uint32_t physical_core_id = hank->physical_core_id;
-
-        // Validate physical_core_id before using as array index
-        if (physical_core_id >= max_physical_cores_count) {
-            DEV_ERROR("Core %d reported invalid physical_core_id=%u (platform max=%u)",
-                      i,
-                      physical_core_id,
-                      max_physical_cores_count);
-            handshake_failed = true;
-            continue;
-        }
-
-        // Get register address using physical_core_id
-        uint64_t* regs = reinterpret_cast<uint64_t*>(regs_);
-        uint64_t reg_addr = regs[physical_core_id];
-
-        // Initialize AICore registers after discovery (first round)
-        platform_init_aicore_regs(reg_addr);
-        hank->aicpu_regs_ready = 1;
-
-        while (hank->aicore_done == 0) {
-        }
-
-        CoreType type = hank->core_type;
-
-        core_exec_states_[i].reg_addr = reg_addr;
-        core_exec_states_[i].worker_id = i;
-        core_exec_states_[i].physical_core_id = physical_core_id;
-        core_exec_states_[i].core_type = type;
-
-        if (type == CoreType::AIC) {
-            aic_worker_ids_[aic_count_++] = i;
-            DEV_INFO("Core %d: AIC, physical_id=%u, reg_addr=0x%lx", i, physical_core_id, reg_addr);
-        } else {
-            aiv_worker_ids_[aiv_count_++] = i;
-            DEV_INFO("Core %d: AIV, physical_id=%u, reg_addr=0x%lx", i, physical_core_id, reg_addr);
-        }
-    }
-
-    if (handshake_failed) {
-        emergency_shutdown(runtime);
-        return -1;
-    }
-
-    DEV_INFO("Core discovery complete: %d AIC, %d AIV", aic_count_, aiv_count_);
-    return 0;
-}
-
-/**
- * Assign discovered cores to scheduler threads
- * (Aligned with host_build_graph mechanism)
- */
-bool AicpuExecutor::assign_cores_to_threads() {
-    // Cluster-aligned round-robin assignment: cluster ci -> sched thread ci % divisor.
-    // Each cluster = 1 AIC + 2 adjacent AIV; the triple is always kept together.
-    int32_t divisor = (sched_thread_num_ > 0) ? sched_thread_num_ : thread_num_;
-    int32_t cluster_count = aic_count_;
-
-    // Max clusters any single sched thread can hold: ceil(cluster_count / divisor).
-    int32_t max_clusters_per_thread = (cluster_count + divisor - 1) / divisor;
-    thread_cores_num_ = max_clusters_per_thread * 3;
-
-    if (thread_cores_num_ > CoreTracker::MAX_CORE_PER_THREAD) {
-        DEV_ERROR("Can't assign more than %d cores per scheduler thread", CoreTracker::MAX_CORE_PER_THREAD);
-        return false;
-    }
-
-    DEV_INFO("Assigning cores (round-robin): %d clusters across %d sched threads (%d AIC, %d AIV)",
-             cluster_count,
-             divisor,
-             aic_count_,
-             aiv_count_);
-
-    for (int32_t i = 0; i < MAX_CORES_PER_THREAD; i++) {
-        core_exec_states_[i].executing_reg_task_id = AICPU_TASK_INVALID;
-    }
-
-    // Count clusters per thread first (round-robin may distribute unevenly)
-    int32_t clusters_per_thread[MAX_AICPU_THREADS] = {};
-    for (int32_t ci = 0; ci < cluster_count; ci++) {
-        clusters_per_thread[ci % divisor]++;
-    }
-    for (int32_t i = 0; i < divisor; i++) {
-        core_trackers_[i].init(clusters_per_thread[i]);
-        core_count_per_thread_[i] = 0;
-    }
-
-    // Mark orchestrator threads explicitly (no cores).
-    for (int32_t t = divisor; t < thread_num_; t++) {
-        DEV_INFO("Thread %d: orchestrator (0 cores)", t);
-    }
-
-    // Per-sched-thread running core index used while filling core_assignments_.
-    int32_t core_idx[MAX_AICPU_THREADS] = {};
-    int32_t cluster_idx_per_thread[MAX_AICPU_THREADS] = {};
-
-    for (int32_t ci = 0; ci < cluster_count; ci++) {
-        int32_t t = ci % divisor;
-        int32_t& idx = core_idx[t];
-
-        int32_t aic_wid = aic_worker_ids_[ci];
-        int32_t aiv0_wid = aiv_worker_ids_[2 * ci];
-        int32_t aiv1_wid = aiv_worker_ids_[2 * ci + 1];
-
-        core_trackers_[t].set_cluster(cluster_idx_per_thread[t]++, aic_wid, aiv0_wid, aiv1_wid);
-
-        core_assignments_[t][idx++] = aic_wid;
-        core_assignments_[t][idx++] = aiv0_wid;
-        core_assignments_[t][idx++] = aiv1_wid;
-
-        DEV_INFO("Thread %d: cluster %d (AIC=%d, AIV0=%d, AIV1=%d)", t, ci, aic_wid, aiv0_wid, aiv1_wid);
-    }
-
-    for (int32_t t = 0; t < divisor; t++) {
-        core_count_per_thread_[t] = core_idx[t];
-        DEV_INFO("Thread %d: total %d cores (%d clusters)", t, core_idx[t], core_trackers_[t].get_cluster_count());
-    }
-
-    return true;
-}
-
-/**
- * Reassign all cores evenly across all threads (schedulers + orchestrators).
- * Called by the last orchestrator thread when orchestration completes.
- * Writes into new_core_assignments_ / new_core_count_per_thread_.
- */
-void AicpuExecutor::reassign_cores_for_all_threads() {
-    DEV_INFO("Reassigning cores (cluster-aligned) for %d threads: %d AIC, %d AIV", thread_num_, aic_count_, aiv_count_);
-
-    // Collect running worker_ids from all current trackers
-    bool running_cores[MAX_CORES_PER_THREAD] = {};
-    for (int32_t i = 0; i < thread_num_; i++) {
-        auto all_running = core_trackers_[i].get_all_running_cores();
-        int32_t bp;
-        while ((bp = all_running.pop_first()) >= 0) {
-            running_cores[core_trackers_[i].get_core_id_by_offset(bp)] = true;
-        }
-    }
-
-    // Count clusters per thread (round-robin across all threads)
-    int32_t cluster_count = aic_count_;
-    int32_t clusters_per_thread[MAX_AICPU_THREADS] = {};
-    for (int32_t ci = 0; ci < cluster_count; ci++) {
-        clusters_per_thread[ci % thread_num_]++;
-    }
-
-    // Re-init all trackers and reset core counts
-    for (int32_t i = 0; i < thread_num_; i++) {
-        core_trackers_[i].init(clusters_per_thread[i]);
-        core_count_per_thread_[i] = 0;
-    }
-
-    // Assign clusters round-robin and restore running state
-    int32_t cluster_idx_per_thread[MAX_AICPU_THREADS] = {};
-    for (int32_t ci = 0; ci < cluster_count; ci++) {
-        int32_t t = ci % thread_num_;
-
-        int32_t aic_wid = aic_worker_ids_[ci];
-        int32_t aiv0_wid = aiv_worker_ids_[2 * ci];
-        int32_t aiv1_wid = aiv_worker_ids_[2 * ci + 1];
-
-        int32_t cl_idx = cluster_idx_per_thread[t]++;
-        core_trackers_[t].set_cluster(cl_idx, aic_wid, aiv0_wid, aiv1_wid);
-
-        // init() marks all idle; toggle cores that were running
-        if (running_cores[aic_wid]) {
-            core_trackers_[t].change_core_state(cl_idx * 3);
-        }
-        if (running_cores[aiv0_wid]) {
-            core_trackers_[t].change_core_state(cl_idx * 3 + 1);
-        }
-        if (running_cores[aiv1_wid]) {
-            core_trackers_[t].change_core_state(cl_idx * 3 + 2);
-        }
-
-        core_assignments_[t][core_count_per_thread_[t]++] = aic_wid;
-        core_assignments_[t][core_count_per_thread_[t]++] = aiv0_wid;
-        core_assignments_[t][core_count_per_thread_[t]++] = aiv1_wid;
-    }
-
-    // Log final distribution
-    DEV_INFO("Core reassignment complete:");
-    for (int32_t t = 0; t < thread_num_; t++) {
-        int32_t aic_running = core_trackers_[t].get_running_count<CoreType::AIC>();
-        int32_t aiv_running = core_trackers_[t].get_running_count<CoreType::AIV>();
-        DEV_INFO("  Thread %d: %d cores, %d clusters (AIC running=%d, AIV running=%d)",
-                 t,
-                 core_count_per_thread_[t],
-                 core_trackers_[t].get_cluster_count(),
-                 aic_running,
-                 aiv_running);
-    }
-}
-
-int32_t AicpuExecutor::init(Runtime* runtime) {
-    bool expected = false;
-    if (!initialized_.compare_exchange_strong(expected, true, std::memory_order_acq_rel, std::memory_order_acquire)) {
-        return 0;
-    }
-
-    DEV_INFO("AicpuExecutor: Initializing");
-
-    if (runtime == nullptr) {
-        DEV_ERROR("runtime is nullptr");
-        init_failed_.store(true, std::memory_order_release);
-        return -1;
-    }
-
-    func_id_to_addr_ = runtime->func_id_to_addr_;
-
-    // Read execution parameters from runtime
-    thread_num_ = runtime->sche_cpu_num;
-    orch_thread_num_ = runtime->orch_thread_num;
-    sched_thread_num_ = thread_num_ - orch_thread_num_;
-    orch_to_sched_ = runtime->orch_to_sched;
-    if (thread_num_ == 0) thread_num_ = 1;
-
-    if (!orch_to_sched_ && sched_thread_num_ == 0) {
-        DEV_ERROR(
-            "No scheduler threads, and orchestrators will not become schedulers when finished; set env "
-            "PTO2_ORCH_TO_SCHED=1 or scale down the orchestrator thread count.");
-        init_failed_.store(true, std::memory_order_release);
-        return -1;
-    }
-
-    if (thread_num_ < 1 || thread_num_ > MAX_AICPU_THREADS) {
-        DEV_ERROR("Invalid thread_num: %d", thread_num_);
-        init_failed_.store(true, std::memory_order_release);
-        return -1;
-    }
-
-    // Zero all per-core execution state before handshake
-    memset(core_exec_states_, 0, sizeof(core_exec_states_));
-
-    // Use handshake mechanism to discover cores (aligned with host_build_graph)
-    int32_t rc = handshake_all_cores(runtime);
-    if (rc != 0) {
-        DEV_ERROR("handshake_all_cores failed");
-        init_failed_.store(true, std::memory_order_release);
-        return -1;
-    }
-
-    // Dynamically assign cores to threads
-    if (!assign_cores_to_threads()) {
-        return -1;
-    }
-
-    DEV_INFO("Config: threads=%d, cores=%d, cores_per_thread=%d", thread_num_, cores_total_num_, thread_cores_num_);
-
-    // Initialize runtime execution state
-    // Task count comes from PTO2 shared memory
-    if (runtime->get_pto2_gm_sm_ptr()) {
-        auto* header = static_cast(runtime->get_pto2_gm_sm_ptr());
-        int32_t pto2_count = 0;
-        for (int r = 0; r < PTO2_MAX_RING_DEPTH; r++) {
-            pto2_count += header->rings[r].fc.current_task_index.load(std::memory_order_acquire);
-        }
-        total_tasks_ = pto2_count > 0 ? pto2_count : 0;
-    } else {
-        total_tasks_ = 0;
-    }
-    completed_tasks_.store(0, std::memory_order_release);
-    // Host orchestration: graph already built, no wait needed. Device orch: Thread 3 will set this.
-    bool orch_on_host = runtime->get_orch_built_on_host();
-    DEV_INFO("Init: orch_built_on_host=%d", orch_on_host ? 1 : 0);
-    orchestrator_done_ = orch_on_host;
-
-    // Initial ready tasks will be populated via scheduler ready queues
-
-    // Clear per-core dispatch payloads
-    memset(s_pto2_payload_per_core, 0, sizeof(s_pto2_payload_per_core));
-
-    // Initialize per-core GlobalContext (sub_block_id) based on cluster position.
-    // This is done once at startup and never modified afterwards.
- for (int32_t t = 0; t < sched_thread_num_; t++) { - CoreTracker& tracker = core_trackers_[t]; - for (int32_t c = 0; c < tracker.get_cluster_count(); c++) { - int32_t cluster_offset = c * 3; // Each cluster = 1 AIC + 2 AIV - auto aiv0_id = tracker.get_core_id_by_offset(tracker.get_aiv0_core_offset(cluster_offset)); - auto aiv1_id = tracker.get_core_id_by_offset(tracker.get_aiv1_core_offset(cluster_offset)); - s_pto2_payload_per_core[aiv0_id].global_context.sub_block_id = 0; - s_pto2_payload_per_core[aiv1_id].global_context.sub_block_id = 1; - } - } - - DEV_INFO("Init: PTO2 mode, task count from shared memory"); - - finished_count_.store(0, std::memory_order_release); - - init_done_.store(true, std::memory_order_release); - DEV_INFO("AicpuExecutor: Init complete"); - return 0; -} - -/** - * Shutdown AICore - Send exit signal via registers to all AICore kernels - */ -int32_t AicpuExecutor::shutdown_aicore( - Runtime* runtime, int32_t thread_idx, const int32_t* cur_thread_cores, int32_t core_num) { - (void)runtime; // NOLINT(readability/casting) - if (core_num == 0) return 0; - - DEV_INFO("Thread %d: Shutting down %d cores", thread_idx, core_num); - - for (int32_t i = 0; i < core_num; i++) { - int32_t core_id = cur_thread_cores[i]; - uint64_t reg_addr = core_exec_states_[core_id].reg_addr; - if (reg_addr != 0) { - platform_deinit_aicore_regs(reg_addr); - } else { - DEV_ERROR("Thread %d: Core %d has invalid register address", thread_idx, core_id); - } - } - DEV_INFO("Thread %d: Shutdown complete", thread_idx); - return 0; -} - -int32_t AicpuExecutor::resolve_and_dispatch_pto2(Runtime* runtime, int32_t thread_idx) { - int32_t& core_num = core_count_per_thread_[thread_idx]; - CoreTracker& tracker = core_trackers_[thread_idx]; - DEV_INFO("Thread %d: resolve_and_dispatch_pto2 entry", thread_idx); - - void* sm_base = runtime->get_pto2_gm_sm_ptr(); - if (!sm_base) { - DEV_ERROR("PTO2 dispatch: sm_base is null"); - return -1; - } - DEV_INFO("Thread %d: sm_base=%p", 
-             thread_idx, sm_base);
-
-    PTO2SharedMemoryHeader* header = static_cast<PTO2SharedMemoryHeader*>(sm_base);
-    DEV_INFO("Thread %d: header=%p, task_desc_offset[0]=%lu, window_size=%lu",
-             thread_idx,
-             static_cast<void*>(header),
-             static_cast<uint64_t>(header->rings[0].task_descriptors_offset),
-             static_cast<uint64_t>(header->rings[0].task_window_size));
-
-    Handshake* hank = static_cast<Handshake*>(runtime->workers);
-    DEV_INFO("Thread %d: hank=%p, window_size=%lu",
-             thread_idx,
-             static_cast<void*>(hank),
-             static_cast<uint64_t>(header->rings[0].task_window_size));
-
-    // One-time init: assign perf buffers (one thread does it; others wait)
-    if (!pto2_init_done_.exchange(true, std::memory_order_acq_rel)) {
-        DEV_INFO("Thread %d: doing one-time init", thread_idx);
-
-#if PTO2_PROFILING
-        // Assign perf buffers to cores early so profiling captures all tasks
-        // (total_tasks written to header later when orchestrator completes)
-        if (runtime->enable_profiling) {
-            perf_aicpu_init_profiling(runtime);
-            // Initialize phase profiling for scheduler threads + orchestrator threads
-            perf_aicpu_init_phase_profiling(runtime, sched_thread_num_, orch_thread_num_);
-            perf_aicpu_set_orch_thread_idx(sched_thread_num_);
-        }
-#endif
-
-        DEV_INFO("Thread %d: one-time init done", thread_idx);
-        pto2_init_complete_.store(true, std::memory_order_release);
-    } else {
-        while (!pto2_init_complete_.load(std::memory_order_acquire)) {
-            SPIN_WAIT_HINT();
-        }
-    }
-
-    DEV_INFO("Thread %d: PTO2 dispatch starting with %d cores", thread_idx, core_num);
-    int32_t cur_thread_completed = 0;
-    int32_t idle_iterations = 0;
-    int32_t last_progress_count = 0;
-#if PTO2_PROFILING
-    bool profiling_enabled = runtime->enable_profiling;
-#endif
-
-    // Scheduler profiling counters
-#if PTO2_PROFILING
-    uint64_t sched_scan_cycle = 0;
-    uint64_t sched_complete_cycle = 0;
-    uint64_t sched_dispatch_cycle = 0;
-    uint64_t sched_idle_cycle = 0;
-    uint64_t sched_loop_count = 0;
-    uint32_t phase_complete_count = 0;
-    uint32_t phase_dispatch_count = 0;
-#if PTO2_SCHED_PROFILING
-    uint64_t complete_probe_count = 0;
-    uint64_t complete_hit_count = 0;
-    uint64_t notify_edges_total = 0;
-    int32_t notify_max_degree = 0;
-    uint64_t notify_tasks_enqueued = 0;
-    uint64_t fanin_edges_total = 0;
-    int32_t fanin_max_degree = 0;
-    uint64_t pop_hit = 0;
-    uint64_t pop_miss = 0;
-    uint64_t local_dispatch_count = 0;
-    uint64_t local_overflow_count = 0;
-    uint64_t sched_complete_perf_cycle = 0;
-    uint64_t sched_dispatch_pop_cycle = 0;
-    uint64_t sched_dispatch_setup_cycle = 0;
-#endif
-#endif
-
-    // Local-first dispatch buffers (stack-allocated, one per CoreType per scheduling thread).
-    // Initialized once; must be empty at the start of each iteration.
-    constexpr int LOCAL_READY_CAP_PER_TYPE = 64;
-    PTO2TaskSlotState* local_ptrs[PTO2_NUM_RESOURCE_SHAPES][LOCAL_READY_CAP_PER_TYPE];
-    PTO2LocalReadyBuffer local_bufs[PTO2_NUM_RESOURCE_SHAPES];
-    for (int32_t i = 0; i < PTO2_NUM_RESOURCE_SHAPES; i++) {
-        local_bufs[i].reset(local_ptrs[i], LOCAL_READY_CAP_PER_TYPE);
-    }
-    PTO2TaskSlotState* deferred_release_slot_states[256];
-    int32_t deferred_release_count = 0;
-
-    bool cores_released = false;
-
-#if PTO2_PROFILING
-    uint64_t sched_start_ts = get_sys_cnt_aicpu();
-#endif
-
-    while (true) {
-        bool made_progress = false;
-#if PTO2_PROFILING
-        CYCLE_COUNT_START();
-        sched_loop_count++;
-        uint64_t _t0_phase = _t0;
-#endif
-        int32_t task_count = 0;
-        if (!tracker.has_any_running_cores()) {
-            bool orch_done = orchestrator_done_;
-            if (orch_done) {
-                // Check for orchestrator fatal error — exit immediately
-                int32_t orch_err = header->orch_error_code.load(std::memory_order_acquire);
-                if (orch_err != PTO2_ERROR_NONE) {
-                    DEV_ERROR(
-                        "Thread %d: Fatal error (code=%d), sending EXIT_SIGNAL to all cores. "
-                        "completed_tasks=%d, total_tasks=%d",
-                        thread_idx,
-                        orch_err,
-                        completed_tasks_.load(std::memory_order_relaxed),
-                        total_tasks_);
-                    emergency_shutdown(runtime);
-                    completed_.store(true, std::memory_order_release);
-                    break;
-                }
-
-                // Normal exit: all tasks complete
-                task_count = total_tasks_;
-                if (task_count > 0 && completed_tasks_.load(std::memory_order_relaxed) >= task_count) {
-                    completed_.store(true, std::memory_order_release);
-                    DEV_INFO("Thread %d: PTO2 completed tasks %d/%d",
-                             thread_idx,
-                             completed_tasks_.load(std::memory_order_relaxed),
-                             task_count);
-                    break;
-                }
-            }
-        }
-
-        // Check for core transition request (execute once per thread)
-        if (!cores_released && orch_to_sched_ && transition_requested_.load(std::memory_order_acquire)) {
-            if (!reassigned_.load(std::memory_order_acquire)) {
-                wait_reassign_.fetch_add(1, std::memory_order_release);
-                while (!reassigned_.load(std::memory_order_acquire)) {
-                    if (completed_.load(std::memory_order_acquire)) {
-                        break;
-                    }
-                    SPIN_WAIT_HINT();
-                }
-                if (completed_.load(std::memory_order_acquire)) {
-                    break;
-                }
-            }
-            cores_released = true;
-        }
-
-#if PTO2_PROFILING
-        CYCLE_COUNT_LAP(sched_idle_cycle);
-#endif
-
-        // Process completed and dispatch FIRST to minimize Sched (dispatch→finish) latency.
-        // Sched time = finish_ts - dispatch_ts; recording finish_ts here at loop start reduces
-        // tail overhead (time from AICore done to AICPU recording finish).
-
-        // Phase 1: Check running cores for completion, process and move to idle
-        int32_t completed_this_turn = 0;
-
-        // Check AIC running cores
-        bool try_completed = false;
-        if (tracker.has_running_cores()) {
-            try_completed = true;
-            check_running_cores_for_completion(thread_idx,
-                                               hank,
-                                               completed_this_turn,
-                                               cur_thread_completed,
-                                               made_progress,
-                                               deferred_release_slot_states,
-                                               deferred_release_count,
-                                               local_bufs
-#if PTO2_PROFILING
-                                               ,
-                                               profiling_enabled,
-                                               phase_complete_count
-#endif
-#if PTO2_SCHED_PROFILING
-                                               ,
-                                               complete_probe_count,
-                                               complete_hit_count,
-                                               notify_edges_total,
-                                               notify_max_degree,
-                                               notify_tasks_enqueued,
-                                               fanin_edges_total,
-                                               fanin_max_degree,
-                                               sched_complete_perf_cycle
-#endif
-            );
-        }
-
-        // Check AIV running cores
-        if (tracker.has_running_cores()) {
-            try_completed = true;
-            check_running_cores_for_completion(thread_idx,
-                                               hank,
-                                               completed_this_turn,
-                                               cur_thread_completed,
-                                               made_progress,
-                                               deferred_release_slot_states,
-                                               deferred_release_count,
-                                               local_bufs
-#if PTO2_PROFILING
-                                               ,
-                                               profiling_enabled,
-                                               phase_complete_count
-#endif
-#if PTO2_SCHED_PROFILING
-                                               ,
-                                               complete_probe_count,
-                                               complete_hit_count,
-                                               notify_edges_total,
-                                               notify_max_degree,
-                                               notify_tasks_enqueued,
-                                               fanin_edges_total,
-                                               fanin_max_degree,
-                                               sched_complete_perf_cycle
-#endif
-            );
-        }
-        if (completed_this_turn > 0) {
-#if PTO2_SCHED_PROFILING
-            rt->scheduler.tasks_completed.fetch_add(completed_this_turn, std::memory_order_relaxed);
-#endif
-            int32_t prev = completed_tasks_.fetch_add(completed_this_turn, std::memory_order_relaxed);
-            int32_t new_total = prev + completed_this_turn;
-            last_progress_count = new_total;
-            if (thread_idx == 0 && task_count > 0) {
-                if (new_total <= PROGRESS_VERBOSE_THRESHOLD ||
-                    new_total / PROGRESS_LOG_INTERVAL != prev / PROGRESS_LOG_INTERVAL || new_total >= task_count) {
-                    DEV_ALWAYS("PTO2 progress: completed=%d total=%d (%.1f%%)",
-                               new_total,
-                               task_count,
-                               100.0 * new_total / task_count);
-                }
-            }
-        }
-
-#if PTO2_PROFILING
-        if (!try_completed) {
-            CYCLE_COUNT_LAP(sched_idle_cycle);
-        } else {
-            CYCLE_COUNT_LAP(sched_complete_cycle);
-            if (profiling_enabled && phase_complete_count > 0) {
-                perf_aicpu_record_phase(
-                    thread_idx, AicpuPhaseId::SCHED_COMPLETE, _t0_phase, _t1, sched_loop_count, phase_complete_count);
-                _t0_phase = _t1;
-                phase_complete_count = 0;
-            }
-        }
-#endif
-
-        bool try_pushed = false;
-        const PTO2ResourceShape* dispatch_order = get_dispatch_order(thread_idx);
-        for (int32_t si = 0; si < PTO2_NUM_RESOURCE_SHAPES; si++) {
-            PTO2ResourceShape shape = dispatch_order[si];
-            auto valid_cluster_states = tracker.get_valid_cluster_offset_states(shape);
-            if (!valid_cluster_states.has_value()) {
-                continue;
-            }
-            auto& local_buf = local_bufs[static_cast<int32_t>(shape)];
-
-            while (valid_cluster_states.has_value()) {
-                int want = valid_cluster_states.count();
-                PTO2TaskSlotState* batch[CoreTracker::MAX_CLUSTERS];
-                int got = pop_ready_tasks_batch(shape,
-                                                thread_idx,
-                                                local_buf,
-                                                batch,
-                                                want
-#if PTO2_SCHED_PROFILING
-                                                ,
-                                                pop_hit,
-                                                pop_miss,
-                                                local_dispatch_count,
-                                                sched_dispatch_pop_cycle
-#endif
-                );
-                if (got == 0) break;
-
-                for (int bi = 0; bi < got; bi++) {
-                    PTO2TaskSlotState* slot_state = batch[bi];
-                    try_pushed = true;
-#if PTO2_SCHED_PROFILING
-                    uint64_t t_setup_start = get_sys_cnt_aicpu();
-#endif
-                    // Dispatch as many blocks as possible for this task using available clusters.
-                    // For block_num=1 the inner body executes exactly once (no overhead).
-                    do {
-                        auto current_valid_cluster_offset = valid_cluster_states.pop_first();
-                        if (shape == PTO2ResourceShape::MIX) {
-                            // Full-cluster: all active subtasks share the same block_idx.
-                            uint8_t mask = slot_state->active_mask;
-                            if (mask & PTO2_SUBTASK_MASK_AIC) {
-                                dispatch_subtask_to_core(runtime,
-                                                         thread_idx,
-                                                         tracker.get_aic_core_offset(current_valid_cluster_offset),
-                                                         *slot_state,
-                                                         PTO2SubtaskSlot::AIC
-#if PTO2_PROFILING
-                                                         ,
-                                                         profiling_enabled
-#endif
-                                );
-                            }
-                            if (mask & PTO2_SUBTASK_MASK_AIV0) {
-                                dispatch_subtask_to_core(runtime,
-                                                         thread_idx,
-                                                         tracker.get_aiv0_core_offset(current_valid_cluster_offset),
-                                                         *slot_state,
-                                                         PTO2SubtaskSlot::AIV0
-#if PTO2_PROFILING
-                                                         ,
-                                                         profiling_enabled
-#endif
-                                );
-                            }
-                            if (mask & PTO2_SUBTASK_MASK_AIV1) {
-                                dispatch_subtask_to_core(runtime,
-                                                         thread_idx,
-                                                         tracker.get_aiv1_core_offset(current_valid_cluster_offset),
-                                                         *slot_state,
-                                                         PTO2SubtaskSlot::AIV1
-#if PTO2_PROFILING
-                                                         ,
-                                                         profiling_enabled
-#endif
-                                );
-                            }
-                            slot_state->next_block_idx++;
-                        } else if (shape == PTO2ResourceShape::AIC) {
-                            dispatch_subtask_to_core(runtime,
-                                                     thread_idx,
-                                                     tracker.get_aic_core_offset(current_valid_cluster_offset),
-                                                     *slot_state,
-                                                     PTO2SubtaskSlot::AIC
-#if PTO2_PROFILING
-                                                     ,
-                                                     profiling_enabled
-#endif
-                            );
-                            slot_state->next_block_idx++;
-                        } else {  // shape == PTO2ResourceShape::AIV
-                            auto core_offset = tracker.is_aiv0_core_idle(current_valid_cluster_offset)
-                                                   ? tracker.get_aiv0_core_offset(current_valid_cluster_offset)
-                                                   : tracker.get_aiv1_core_offset(current_valid_cluster_offset);
-                            dispatch_subtask_to_core(runtime,
-                                                     thread_idx,
-                                                     core_offset,
-                                                     *slot_state,
-                                                     PTO2SubtaskSlot::AIV0
-#if PTO2_PROFILING
-                                                     ,
-                                                     profiling_enabled
-#endif
-                            );
-                            slot_state->next_block_idx++;
-                            // Refresh idle state so the do-while naturally picks up
-                            // the other AIV in the same cluster on the next iteration.
-                            if (slot_state->next_block_idx < slot_state->block_num) {
-                                valid_cluster_states = tracker.get_valid_cluster_offset_states(shape);
-                            }
-                        }
-#if PTO2_PROFILING
-                        phase_dispatch_count += __builtin_popcount(slot_state->active_mask);
-#endif
-                        DEV_DEBUG("Thread %d: Dispatched %s task %" PRId64 " block %d/%d to cluster_offset %d",
-                                  thread_idx,
-                                  shape_name(shape),
-                                  static_cast<int64_t>(slot_state->task->task_id.raw),
-                                  slot_state->next_block_idx - 1,
-                                  slot_state->block_num,
-                                  current_valid_cluster_offset);
-                    } while (slot_state->next_block_idx < slot_state->block_num && valid_cluster_states.has_value());
-
-                    // Re-enqueue only if blocks remain after exhausting local clusters
-                    if (slot_state->next_block_idx < slot_state->block_num) {
-                        rt->scheduler.ready_queues[static_cast<int32_t>(shape)].push(slot_state);
-                    }
-                    made_progress = true;
-#if PTO2_SCHED_PROFILING
-                    sched_dispatch_setup_cycle += (get_sys_cnt_aicpu() - t_setup_start);
-#endif
-                }
-
-                // lazy update valid_cluster_states
-                if (!valid_cluster_states.has_value()) {
-                    valid_cluster_states = tracker.get_valid_cluster_offset_states(shape);
-                }
-            }
-        }
-
-        // requeue in global ready queue
-        for (int32_t si = 0; si < PTO2_NUM_RESOURCE_SHAPES; si++) {
-            PTO2ResourceShape shape = dispatch_order[si];
-            auto& local_buf = local_bufs[static_cast<int32_t>(shape)];
-            auto& ready_queue = rt->scheduler.ready_queues[static_cast<int32_t>(shape)];
-#if PTO2_SCHED_PROFILING
-            local_overflow_count += local_buf.count;
-#endif
-            if (local_buf.count > 0) {
-                ready_queue.push_batch(local_buf.slot_states, local_buf.count);
-                local_buf.count = 0;
-            }
-        }
-
-#if PTO2_PROFILING
-        if (!try_pushed) {
-            CYCLE_COUNT_LAP(sched_idle_cycle);
-        } else {
-            CYCLE_COUNT_LAP(sched_dispatch_cycle);
-            if (profiling_enabled && phase_dispatch_count > 0) {
-                perf_aicpu_record_phase(
-                    thread_idx, AicpuPhaseId::SCHED_DISPATCH, _t0_phase, _t1, sched_loop_count, phase_dispatch_count);
-                _t0_phase = _t1;
-                phase_dispatch_count = 0;
-            }
-        }
-#endif
-
-#if !PTO2_PROFILING
-        (void)try_completed;  // NOLINT(readability/casting)
-        (void)try_pushed;     // NOLINT(readability/casting)
-#endif
-
-        if (made_progress) {
-            idle_iterations = 0;
-        } else {
-            // Batch deferred fanin releases during idle.
-            // Processing all pending releases at once advances the ring faster,
-            // freeing heap space for the orchestrator without blocking completion polling.
-            while (deferred_release_count > 0) {
-#if PTO2_SCHED_PROFILING
-                int32_t fe =
-                    rt->scheduler.on_task_release(*deferred_release_slot_states[--deferred_release_count], thread_idx);
-#else
-                int32_t fe = rt->scheduler.on_task_release(*deferred_release_slot_states[--deferred_release_count]);
-#endif
-                (void)fe;  // NOLINT(readability/casting)
-#if PTO2_SCHED_PROFILING
-                fanin_edges_total += fe;
-                if (fe > fanin_max_degree) fanin_max_degree = fe;
-#endif
-            }
-            idle_iterations++;
-
-            // Check for orchestrator fatal error during idle (every 1024 iterations)
-            // orch_error_code is set in shared memory by the orchestrator's spin loop
-            // BEFORE orchestrator_done_ is set, so this catches errors earlier.
-            if (idle_iterations % FATAL_ERROR_CHECK_INTERVAL == 0) {
-                int32_t orch_err = header->orch_error_code.load(std::memory_order_acquire);
-                if (orch_err != PTO2_ERROR_NONE) {
-                    DEV_ERROR("Thread %d: Fatal error detected (code=%d), sending EXIT_SIGNAL to all cores",
-                              thread_idx,
-                              orch_err);
-                    emergency_shutdown(runtime);
-                    completed_.store(true, std::memory_order_release);
-                    break;
-                }
-            }
-
-            if (thread_idx == 0 && task_count > 0 && idle_iterations % STALL_LOG_INTERVAL == 0) {
-                int32_t c = completed_tasks_.load(std::memory_order_relaxed);
-                DEV_ALWAYS("PTO2 stall: no progress for %d iterations, completed=%d total=%d (last progress at %d)",
-                           idle_iterations,
-                           c,
-                           task_count,
-                           last_progress_count);
-                // Scan all task slots to find truly stuck tasks using scheduler state
-                PTO2SchedulerState* sched = &rt->scheduler;
-                PTO2SharedMemoryHeader* sm_header_diag = static_cast<PTO2SharedMemoryHeader*>(sm_base);
-                int32_t cnt_ready = 0, cnt_waiting = 0, cnt_inflight = 0;
-                for (int r = 0; r < PTO2_MAX_RING_DEPTH; r++) {
-                    int32_t ring_task_count =
-                        sm_header_diag->rings[r].fc.current_task_index.load(std::memory_order_relaxed);
-                    for (int32_t si = 0; si < ring_task_count; si++) {
-                        PTO2TaskSlotState& slot_state = sched->get_slot_state(r, si);
-                        PTO2TaskState st = slot_state.task_state.load(std::memory_order_relaxed);
-                        int32_t rc = slot_state.fanin_refcount.load(std::memory_order_relaxed);
-                        int32_t fi = slot_state.fanin_count;
-                        int32_t kid = slot_state.task->kernel_id[0];
-                        if (st >= PTO2_TASK_COMPLETED) continue;  // Already done
-                        if (st == PTO2_TASK_READY || st == PTO2_TASK_RUNNING) {
-                            cnt_inflight++;
-                            continue;
-                        }
-                        // PENDING
-                        if (rc >= fi) {
-                            // Ready (all deps satisfied) but not enqueued — this is the real bug
-                            cnt_ready++;
-                            if (cnt_ready <= STALL_DUMP_READY_MAX) {
-                                DEV_ALWAYS(
-                                    "  STUCK-READY ring=%d task_id=%" PRId64
-                                    " kernel_id=%d refcount=%d fanin=%d state=%d",  // NOLINT(whitespace/line_length)
-                                    r,
-                                    static_cast<int64_t>(slot_state.task->task_id.raw),
-                                    kid,
-                                    rc,
-                                    fi,
-                                    static_cast<int32_t>(st));
-                            }
-                        } else {
-                            cnt_waiting++;
-                            if (cnt_waiting <= STALL_DUMP_WAIT_MAX) {
-                                DEV_ALWAYS(
-                                    "  STUCK-WAIT ring=%d task_id=%" PRId64
-                                    " kernel_id=%d refcount=%d fanin=%d state=%d",  // NOLINT(whitespace/line_length)
-                                    r,
-                                    static_cast<int64_t>(slot_state.task->task_id.raw),
-                                    kid,
-                                    rc,
-                                    fi,
-                                    static_cast<int32_t>(st));
-                            }
-                        }
-                    }
-                }
-                DEV_ALWAYS("  scan result: stuck_ready=%d stuck_waiting=%d in_flight=%d",
-                           cnt_ready,
-                           cnt_waiting,
-                           cnt_inflight);
-                // Log this thread's dispatch state
-                int32_t aic_running = tracker.get_running_count();
-                int32_t aiv_running = tracker.get_running_count();
-                int32_t total_running = aic_running + aiv_running;
-                DEV_ALWAYS("  thread=%d running_cores=%d (AIC=%d AIV=%d) core_num=%d",
-                           thread_idx,
-                           total_running,
-                           aic_running,
-                           aiv_running,
-                           core_num);
-                // Dump running cores
-                auto all_running = tracker.get_all_running_cores();
-                int32_t dump_count = 0;
-                int32_t bp;
-                while (dump_count < STALL_DUMP_CORE_MAX && (bp = all_running.pop_first()) >= 0) {
-                    dump_count++;
-                    int32_t cid = tracker.get_core_id_by_offset(bp);
-                    int32_t sw_tid = core_exec_states_[cid].executing_reg_task_id;
-                    int32_t hw_kernel = -1;
-                    if (sw_tid >= 0 && core_exec_states_[cid].executing_slot_state) {
-                        int32_t diag_slot = static_cast<int32_t>(core_exec_states_[cid].executing_subslot);
-                        hw_kernel = core_exec_states_[cid].executing_slot_state->task->kernel_id[diag_slot];
-                    }
-                    uint64_t cond_reg = read_reg(core_exec_states_[cid].reg_addr, RegId::COND);
-                    DEV_ALWAYS("    core=%d cond=0x%x(state=%d,id=%d) exec_id=%d kernel=%d",
-                               cid,
-                               static_cast<uint32_t>(cond_reg),
-                               EXTRACT_TASK_STATE(cond_reg),
-                               EXTRACT_TASK_ID(cond_reg),
-                               sw_tid,
-                               hw_kernel);
-                }
-                // Dump cluster state
-                for (int32_t cli = 0; cli < tracker.get_cluster_count() && cli < STALL_DUMP_CORE_MAX; cli++) {
-                    int32_t offset = cli * 3;
-                    DEV_ALWAYS("    cluster[%d] aic=%d(%s) aiv0=%d(%s) aiv1=%d(%s)",
-                               cli,
-                               tracker.get_aic_core_id(offset),
-                               tracker.is_aic_core_idle(offset) ? "idle" : "busy",
-                               tracker.get_aiv0_core_id(offset),
-                               tracker.is_aiv0_core_idle(offset) ? "idle" : "busy",
-                               tracker.get_aiv1_core_id(offset),
-                               tracker.is_aiv1_core_idle(offset) ? "idle" : "busy");
-                }
-            }
-            if (idle_iterations > MAX_IDLE_ITERATIONS) {
-                DEV_ERROR("Thread %d: PTO2 timeout after %d idle iterations", thread_idx, idle_iterations);
-#if PTO2_PROFILING
-                // Benchmark: scheduler lifetime end timestamp on timeout path
-                uint64_t sched_timeout_ts = get_sys_cnt_aicpu();
-                DEV_ALWAYS("Thread %d: sched_start=%" PRIu64 " sched_end(timeout)=%" PRIu64 " sched_cost=%.3fus",
-                           thread_idx,
-                           static_cast<uint64_t>(sched_start_ts),
-                           static_cast<uint64_t>(sched_timeout_ts),
-                           cycles_to_us(sched_timeout_ts - sched_start_ts));
-#endif
-                return -1;
-            } else {
-                SPIN_WAIT_HINT();
-            }
-#if PTO2_PROFILING
-            CYCLE_COUNT_LAP(sched_idle_cycle);
-            if (profiling_enabled) {
-                perf_aicpu_record_phase(thread_idx, AicpuPhaseId::SCHED_IDLE_WAIT, _t0_phase, _t1, sched_loop_count, 0);
-                _t0_phase = _t1;
-            }
-#endif
-        }
-    }
-
-#if PTO2_PROFILING
-    // Record sched_end before any DEV_ALWAYS to avoid init cost contamination
-    uint64_t sched_end_ts = get_sys_cnt_aicpu();
-    DEV_ALWAYS("Thread %d: sched_start=%" PRIu64 " sched_end=%" PRIu64 " sched_cost=%.3fus",
-               thread_idx,
-               static_cast<uint64_t>(sched_start_ts),
-               static_cast<uint64_t>(sched_end_ts),
-               cycles_to_us(sched_end_ts - sched_start_ts));
-
-    // Scheduler summary logging (always print when PTO2_PROFILING=1)
-    uint64_t sched_total = sched_complete_cycle + sched_scan_cycle + sched_dispatch_cycle + sched_idle_cycle;
-    if (sched_total == 0) sched_total = 1;  // avoid div-by-zero
-
-#if PTO2_SCHED_PROFILING
-    // Two-level tree display: sub-phase breakdown within complete and dispatch
-    {
-        PTO2SchedProfilingData sp = pto2_scheduler_get_profiling(thread_idx);
-        uint64_t otc_total = sp.lock_cycle + sp.fanout_cycle + sp.fanin_cycle + sp.self_consumed_cycle;
-        uint64_t complete_poll = (sched_complete_cycle > otc_total + sched_complete_perf_cycle)
-                                     ?
-                                     (sched_complete_cycle - otc_total - sched_complete_perf_cycle)
-                                     : 0;
-        uint64_t dispatch_poll = (sched_dispatch_cycle > sched_dispatch_pop_cycle + sched_dispatch_setup_cycle)
-                                     ? (sched_dispatch_cycle - sched_dispatch_pop_cycle - sched_dispatch_setup_cycle)
-                                     : 0;
-
-        DEV_ALWAYS("Thread %d: === Scheduler Phase Breakdown: total=%.3fus, %d tasks ===",
-                   thread_idx,
-                   cycles_to_us(sched_total),
-                   cur_thread_completed);
-
-        // Level 1: complete
-        double notify_avg =
-            cur_thread_completed > 0 ? static_cast<double>(notify_edges_total) / cur_thread_completed : 0.0;
-        double fanin_avg =
-            cur_thread_completed > 0 ? static_cast<double>(fanin_edges_total) / cur_thread_completed : 0.0;
-        DEV_ALWAYS("Thread %d: complete     : %.3fus (%.1f%%) [fanout: edges=%" PRIu64
-                   ", max_degree=%d, avg=%.1f] [fanin: "  // NOLINT(whitespace/line_length)
-                   "edges=%" PRIu64 ", max_degree=%d, avg=%.1f]",
-                   thread_idx,
-                   cycles_to_us(sched_complete_cycle),
-                   sched_complete_cycle * 100.0 / sched_total,
-                   static_cast<uint64_t>(notify_edges_total),
-                   notify_max_degree,
-                   notify_avg,
-                   static_cast<uint64_t>(fanin_edges_total),
-                   fanin_max_degree,
-                   fanin_avg);
-
-        // Level 2: complete sub-phases (percentage relative to complete)
-        uint64_t c_parent = sched_complete_cycle > 0 ? sched_complete_cycle : 1;
-        uint64_t complete_miss_count =
-            (complete_probe_count > complete_hit_count) ? (complete_probe_count - complete_hit_count) : 0;
-        double complete_hit_rate =
-            complete_probe_count > 0 ? complete_hit_count * 100.0 / complete_probe_count : 0.0;
-        DEV_ALWAYS("Thread %d:   poll       : %.3fus (%.1f%%) hit=%" PRIu64 ", miss=%" PRIu64 ", hit_rate=%.1f%%",
-                   thread_idx,
-                   cycles_to_us(complete_poll),
-                   complete_poll * 100.0 / c_parent,
-                   static_cast<uint64_t>(complete_hit_count),
-                   static_cast<uint64_t>(complete_miss_count),
-                   complete_hit_rate);
-        DEV_ALWAYS("Thread %d:   otc_lock   : %.3fus (%.1f%%) work=%.3fus wait=%.3fus atomics=%" PRIu64 "",
-                   thread_idx,
-                   cycles_to_us(sp.lock_cycle),
-                   sp.lock_cycle * 100.0 / c_parent,
-                   cycles_to_us(sp.lock_cycle - sp.lock_wait_cycle),
-                   cycles_to_us(sp.lock_wait_cycle),
-                   static_cast<uint64_t>(sp.lock_atomic_count));
-        DEV_ALWAYS("Thread %d:   otc_fanout : %.3fus (%.1f%%) work=%.3fus wait=%.3fus atomics=%" PRIu64 "",
-                   thread_idx,
-                   cycles_to_us(sp.fanout_cycle),
-                   sp.fanout_cycle * 100.0 / c_parent,
-                   cycles_to_us(sp.fanout_cycle - sp.push_wait_cycle),
-                   cycles_to_us(sp.push_wait_cycle),
-                   static_cast<uint64_t>(sp.fanout_atomic_count));
-        DEV_ALWAYS("Thread %d:   otc_fanin  : %.3fus (%.1f%%) atomics=%" PRIu64 "",
-                   thread_idx,
-                   cycles_to_us(sp.fanin_cycle),
-                   sp.fanin_cycle * 100.0 / c_parent,
-                   static_cast<uint64_t>(sp.fanin_atomic_count));
-        DEV_ALWAYS("Thread %d:   otc_self   : %.3fus (%.1f%%) atomics=%" PRIu64 "",
-                   thread_idx,
-                   cycles_to_us(sp.self_consumed_cycle),
-                   sp.self_consumed_cycle * 100.0 / c_parent,
-                   static_cast<uint64_t>(sp.self_atomic_count));
-        DEV_ALWAYS("Thread %d:   perf       : %.3fus (%.1f%%)",
-                   thread_idx,
-                   cycles_to_us(sched_complete_perf_cycle),
-                   sched_complete_perf_cycle * 100.0 / c_parent);
-
-        // Level 1: dispatch
-        uint64_t pop_total = pop_hit + pop_miss;
-        double pop_hit_rate = pop_total > 0 ? pop_hit * 100.0 / pop_total : 0.0;
-        DEV_ALWAYS("Thread %d: dispatch     : %.3fus (%.1f%%) [pop: hit=%" PRIu64 ", miss=%" PRIu64
-                   ", hit_rate=%.1f%%]",  // NOLINT(whitespace/line_length)
-                   thread_idx,
-                   cycles_to_us(sched_dispatch_cycle),
-                   sched_dispatch_cycle * 100.0 / sched_total,
-                   static_cast<uint64_t>(pop_hit),
-                   static_cast<uint64_t>(pop_miss),
-                   pop_hit_rate);
-        uint64_t global_dispatch_count = pop_hit - local_dispatch_count;
-        uint64_t total_dispatched = local_dispatch_count + global_dispatch_count;
-        double local_hit_rate = total_dispatched > 0 ? local_dispatch_count * 100.0 / total_dispatched : 0.0;
-        DEV_ALWAYS("Thread %d:   local_disp : local=%" PRIu64 ", global=%" PRIu64 ", overflow=%" PRIu64
-                   ", local_rate=%.1f%%",  // NOLINT(whitespace/line_length)
-                   thread_idx,
-                   static_cast<uint64_t>(local_dispatch_count),
-                   static_cast<uint64_t>(global_dispatch_count),
-                   static_cast<uint64_t>(local_overflow_count),
-                   local_hit_rate);
-
-        // Level 2: dispatch sub-phases (percentage relative to dispatch)
-        uint64_t d_parent = sched_dispatch_cycle > 0 ? sched_dispatch_cycle : 1;
-        DEV_ALWAYS("Thread %d:   poll       : %.3fus (%.1f%%)",
-                   thread_idx,
-                   cycles_to_us(dispatch_poll),
-                   dispatch_poll * 100.0 / d_parent);
-        DEV_ALWAYS("Thread %d:   pop        : %.3fus (%.1f%%) work=%.3fus wait=%.3fus atomics=%" PRIu64 "",
-                   thread_idx,
-                   cycles_to_us(sched_dispatch_pop_cycle),
-                   sched_dispatch_pop_cycle * 100.0 / d_parent,
-                   cycles_to_us(sched_dispatch_pop_cycle - sp.pop_wait_cycle),
-                   cycles_to_us(sp.pop_wait_cycle),
-                   static_cast<uint64_t>(sp.pop_atomic_count));
-        DEV_ALWAYS("Thread %d:   setup      : %.3fus (%.1f%%)",
-                   thread_idx,
-                   cycles_to_us(sched_dispatch_setup_cycle),
-                   sched_dispatch_setup_cycle * 100.0 / d_parent);
-
-        // Level 1: scan
-        DEV_ALWAYS("Thread %d: scan         : %.3fus (%.1f%%)",
-                   thread_idx,
-                   cycles_to_us(sched_scan_cycle),
-                   sched_scan_cycle * 100.0 / sched_total);
-
-        // Level 1: idle
-        DEV_ALWAYS("Thread %d: idle         : %.3fus (%.1f%%)",
-                   thread_idx,
-                   cycles_to_us(sched_idle_cycle),
-                   sched_idle_cycle * 100.0 / sched_total);
-
-        // Average per completion
-        if (cur_thread_completed > 0) {
-            DEV_ALWAYS("Thread %d: avg/complete : %.3fus",
-                       thread_idx,
-                       cycles_to_us(sched_complete_cycle) / cur_thread_completed);
-        }
-    }
-#endif
-    // Summary line (always print when PTO2_PROFILING=1)
-    DEV_ALWAYS("Thread %d: Scheduler summary: total_time=%.3fus, loops=%" PRIu64 ", tasks_scheduled=%d",
-               thread_idx,
-               cycles_to_us(sched_total),
-               static_cast<uint64_t>(sched_loop_count),
-               cur_thread_completed);
-#endif
-
-#if PTO2_PROFILING
-    // Flush performance buffers for cores managed by this thread
-    if (profiling_enabled) {
-        perf_aicpu_flush_buffers(runtime, thread_idx, core_assignments_[thread_idx], core_num);
-        perf_aicpu_flush_phase_buffers(thread_idx);
-    }
-#endif
-
-    return cur_thread_completed;
-}
-
-int32_t AicpuExecutor::run(Runtime* runtime) {
-    int32_t thread_idx = thread_idx_++;
-    DEV_INFO("Thread %d: Start", thread_idx);
-
-    // Orchestrator check
-    if (thread_idx >= sched_thread_num_) {
-#if PTO2_PROFILING
-        uint64_t orch_cycle_start = 0;
-        int32_t pto2_submitted_tasks = -1;
-#endif
-        int32_t orch_idx = thread_idx - sched_thread_num_;
-        if (runtime->get_orch_built_on_host()) {
-            DEV_INFO("Thread %d: Host orchestration mode, no-op (orch_idx=%d)", thread_idx, orch_idx);
-        } else {
-            // First orchestrator thread (orch_idx == 0): load SO, create runtime
-            if (orch_idx == 0) {
-                DEV_INFO("Thread %d: Primary orchestrator, loading SO via dlopen", thread_idx);
-
-                const void* so_data = runtime->get_device_orch_so_data();
-                size_t so_size = runtime->get_device_orch_so_size();
-
-                if (so_data == nullptr || so_size == 0) {
-                    DEV_ERROR("Thread %d: Device orchestration SO not set", thread_idx);
-                    return -1;
-                }
-
-                // Try multiple paths that may allow execution on AICPU
-                char so_path[256];
-                bool file_created = false;
-                const char* candidate_dirs[] = {
-                    "/usr/lib64/aicpu_kernels/0/aicpu_kernels_device", "/usr/lib64", "/lib64", "/var/tmp", "/tmp"};
-                const int32_t num_candidates = sizeof(candidate_dirs) / sizeof(candidate_dirs[0]);
-
-                for (int32_t i = 0; i < num_candidates && !file_created; i++) {
-                    snprintf(so_path, sizeof(so_path), "%s/libdevice_orch_%d.so", candidate_dirs[i], getpid());
-                    int32_t fd = open(so_path, O_WRONLY | O_CREAT | O_TRUNC, 0755);
-                    if (fd < 0) {
-                        DEV_INFO("Thread %d: Cannot create SO at %s (errno=%d), trying next path",
-                                 thread_idx,
-                                 so_path,
-                                 errno);
-                        continue;
-                    }
-                    ssize_t written = write(fd, so_data, so_size);
-                    close(fd);
-                    if (written != static_cast<ssize_t>(so_size)) {
-                        DEV_INFO("Thread %d: Cannot write SO to %s (errno=%d), trying next path",
-                                 thread_idx,
-                                 so_path,
-                                 errno);
-                        unlink(so_path);
-                        continue;
-                    }
-                    file_created = true;
-                    DEV_INFO("Thread %d: Created SO file at %s (%zu bytes)", thread_idx, so_path, so_size);
-                }
-
-                if (!file_created) {
-                    DEV_ERROR("Thread %d: Failed to create SO file in any candidate path", thread_idx);
-                    return -1;
-                }
-
-                dlerror();
-                void* handle = dlopen(so_path, RTLD_LAZY | RTLD_LOCAL);
-                const char* dlopen_err = dlerror();
-                if (handle == nullptr) {
-                    DEV_ERROR("Thread %d: dlopen failed: %s", thread_idx, dlopen_err ? dlopen_err : "unknown");
-                    unlink(so_path);
-                    return -1;
-                }
-                DEV_INFO("Thread %d: dlopen succeeded, handle=%p", thread_idx, handle);
-
-                dlerror();
-                auto config_func = reinterpret_cast<PTO2OrchestrationConfig (*)(const ChipStorageTaskArgs&)>(
-                    dlsym(handle, "aicpu_orchestration_config"));
-
-                dlerror();
-                DeviceOrchestrationFunc orch_func =
-                    reinterpret_cast<DeviceOrchestrationFunc>(dlsym(handle, "aicpu_orchestration_entry"));
-                const char* dlsym_error = dlerror();
-                if (dlsym_error != nullptr) {
-                    DEV_ERROR("Thread %d: dlsym failed: %s", thread_idx, dlsym_error);
-                    dlclose(handle);
-                    unlink(so_path);
-                    return -1;
-                }
-                if (orch_func == nullptr) {
-                    DEV_ERROR("Thread %d: dlsym returned NULL for aicpu_orchestration_entry", thread_idx);
-                    dlclose(handle);
-                    unlink(so_path);
-                    return -1;
-                }
-
-                dlerror();
-                auto bind_runtime_func =
-                    reinterpret_cast<void (*)(PTO2Runtime*)>(dlsym(handle, "pto2_framework_bind_runtime"));
-                const char* bind_runtime_error = dlerror();
-                if (bind_runtime_error != nullptr) {
-                    DEV_INFO("Thread %d: Optional TLS runtime binder not found: %s", thread_idx, bind_runtime_error);
-                    bind_runtime_func = nullptr;
-                }
-
-                const ChipStorageTaskArgs& args = runtime->get_orch_args();
-                int32_t arg_count = args.tensor_count() + args.scalar_count();
-                DEV_INFO("Thread %d: sm_ptr=%p, arg_count=%d", thread_idx, runtime->get_pto2_gm_sm_ptr(), arg_count);
-                for (int32_t i = 0; i < args.tensor_count() && i < 20; i++) {
-                    const ContinuousTensor& t = args.tensor(i);
-                    DEV_INFO("Thread %d: orch_args[%d] = TENSOR(data=0x%lx, ndims=%u, dtype=%u)",
-                             thread_idx,
-                             i,
-                             static_cast<uint64_t>(t.data),
-                             t.ndims,
-                             static_cast<uint32_t>(t.dtype));
-                }
-                for (int32_t i = 0; i < args.scalar_count() && (args.tensor_count() + i) < 20; i++) {
-                    DEV_INFO("Thread %d: orch_args[%d] = SCALAR(0x%lx)",
-                             thread_idx,
-                             args.tensor_count() + i,
-                             static_cast<uint64_t>(args.scalar(i)));
-                }
-
-                uint64_t task_window_size = PTO2_TASK_WINDOW_SIZE;
-                uint64_t heap_size = PTO2_HEAP_SIZE;
-                int32_t expected_arg_count = 0;
-                if (config_func) {
-                    PTO2OrchestrationConfig cfg = config_func(args);
-                    expected_arg_count = cfg.expected_arg_count;
-                    DEV_INFO("Thread %d: Config: expected_args=%d", thread_idx, expected_arg_count);
-                } else {
-                    DEV_INFO("Thread %d: No config function, using defaults", thread_idx);
-                }
-
-                if (expected_arg_count > 0 && arg_count < expected_arg_count) {
-                    DEV_ERROR("Thread %d: arg_count %d < expected %d", thread_idx, arg_count, expected_arg_count);
-                    dlclose(handle);
-                    unlink(so_path);
-                    return -1;
-                }
-
-                if (runtime->pto2_task_window_size > 0) {
-                    task_window_size = runtime->pto2_task_window_size;
-                }
-                if (runtime->pto2_heap_size > 0) {
-                    heap_size = runtime->pto2_heap_size;
-                }
-                int32_t dep_pool_capacity = PTO2_DEP_LIST_POOL_SIZE;
-                if (runtime->pto2_dep_pool_size > 0) {
-                    dep_pool_capacity = static_cast<int32_t>(runtime->pto2_dep_pool_size);
-                }
-                DEV_INFO("Thread %d: Ring sizes: task_window=%lu, heap=%lu, dep_pool=%d",
-                         thread_idx,
-                         static_cast<uint64_t>(task_window_size),
-                         static_cast<uint64_t>(heap_size),
-                         dep_pool_capacity);
-
-                void* sm_ptr = runtime->get_pto2_gm_sm_ptr();
-                void* gm_heap = runtime->get_pto2_gm_heap_ptr();
-
-                uint64_t sm_size = pto2_sm_calculate_size(task_window_size);
-                PTO2SharedMemoryHandle* sm_handle =
-                    pto2_sm_create_from_buffer(sm_ptr, sm_size, task_window_size, heap_size);
-                if (!sm_handle) {
-                    DEV_ERROR("Thread %d: Failed to create shared memory handle", thread_idx);
-                    dlclose(handle);
-                    unlink(so_path);
-                    return -1;
-                }
-
-                rt = pto2_runtime_create_from_sm(
-                    PTO2_MODE_EXECUTE, sm_handle, gm_heap, heap_size, orch_thread_num_, dep_pool_capacity);
-                if (!rt) {
-                    DEV_ERROR("Thread %d: Failed to create PTO2Runtime", thread_idx);
-                    pto2_sm_destroy(sm_handle);
-                    dlclose(handle);
-                    unlink(so_path);
-                    return -1;
-                }
-
-#if PTO2_PROFILING
-                for (int i = 0; i < orch_thread_num_; i++) {
-                    rt->orchestrators[i].enable_profiling = runtime->enable_profiling;
-                }
-#endif
-
-                // With multi-ring, slot_states are per-ring inside the scheduler.
- runtime->set_pto2_slot_states_ptr(nullptr); - - // Store shared state for other orchestrator threads - orch_func_ = orch_func; - orch_bind_runtime_ = bind_runtime_func; - orch_args_cached_ = &args; - orch_so_handle_ = handle; - snprintf(orch_so_path_, sizeof(orch_so_path_), "%s", so_path); - - // All-orchestrator mode: primary orchestrator does one-time init - if (sched_thread_num_ == 0) { - DEV_INFO("Thread %d: All-orchestrator mode, doing one-time init", thread_idx); - if (runtime->enable_profiling) { - perf_aicpu_init_profiling(runtime); - // After transition, all threads become schedulers - perf_aicpu_init_phase_profiling(runtime, thread_num_, orch_thread_num_); - perf_aicpu_set_orch_thread_idx(0); - } - pto2_init_done_.store(true, std::memory_order_release); - pto2_init_complete_.store(true, std::memory_order_release); - DEV_INFO("Thread %d: One-time init done", thread_idx); - } - - runtime_init_ready_.store(true, std::memory_order_release); - } else { - // Non-primary orchestrator: wait for primary to finish setup - while (!runtime_init_ready_.load(std::memory_order_acquire)) { - SPIN_WAIT_HINT(); - } - } - - // Wait for scheduler's one-time init to complete - // (or primary orchestrator's init in all-orchestrator mode) - while (!pto2_init_complete_.load(std::memory_order_acquire)) { - SPIN_WAIT_HINT(); - } - - pto2_set_orch_thread_idx(orch_idx); - -#if PTO2_PROFILING - // Each orchestrator thread sets its own phase buffer index (thread-local) - if (runtime->enable_profiling) { - perf_aicpu_set_orch_thread_idx(thread_idx); - } -#endif - -#if PTO2_PROFILING - orch_cycle_start = get_sys_cnt_aicpu(); -#endif - if (orch_bind_runtime_ != nullptr) { - orch_bind_runtime_(rt); - } - pto2_rt_scope_begin(rt); - orch_func_(*orch_args_cached_, orch_thread_num_, orch_idx); - pto2_rt_scope_end(rt); -#if PTO2_PROFILING - uint64_t orch_cycle_end = get_sys_cnt_aicpu(); - (void)orch_cycle_end; // NOLINT(readability/casting) -#endif - - // Print orchestrator profiling data 
-#if PTO2_ORCH_PROFILING - PTO2OrchProfilingData p = pto2_orchestrator_get_profiling(); - uint64_t total = - p.sync_cycle + p.alloc_cycle + p.params_cycle + p.lookup_cycle + p.insert_cycle + p.fanin_cycle; - if (total == 0) total = 1; // avoid div-by-zero - DEV_ALWAYS("Thread %d: === Orchestrator Profiling: %" PRId64 " tasks, total=%.3fus ===", - thread_idx, - static_cast(p.submit_count), - cycles_to_us(total)); - DEV_ALWAYS("Thread %d: task+heap_alloc: %.3fus (%.1f%%) work=%.3fus wait=%.3fus atomics=%" PRIu64 "", - thread_idx, - cycles_to_us(p.alloc_cycle), - p.alloc_cycle * 100.0 / total, - cycles_to_us(p.alloc_cycle - p.alloc_wait_cycle), - cycles_to_us(p.alloc_wait_cycle), - static_cast(p.alloc_atomic_count)); - DEV_ALWAYS("Thread %d: sync_tensormap : %.3fus (%.1f%%)", - thread_idx, - cycles_to_us(p.sync_cycle), - p.sync_cycle * 100.0 / total); - DEV_ALWAYS("Thread %d: lookup+dep : %.3fus (%.1f%%)", - thread_idx, - cycles_to_us(p.lookup_cycle), - p.lookup_cycle * 100.0 / total); - DEV_ALWAYS("Thread %d: tensormap_ins : %.3fus (%.1f%%)", - thread_idx, - cycles_to_us(p.insert_cycle), - p.insert_cycle * 100.0 / total); - DEV_ALWAYS("Thread %d: param_copy : %.3fus (%.1f%%) atomics=%" PRIu64 "", - thread_idx, - cycles_to_us(p.params_cycle), - p.params_cycle * 100.0 / total, - static_cast(p.params_atomic_count)); - DEV_ALWAYS("Thread %d: fanin+ready : %.3fus (%.1f%%) work=%.3fus wait=%.3fus atomics=%" PRIu64 "", - thread_idx, - cycles_to_us(p.fanin_cycle), - p.fanin_cycle * 100.0 / total, - cycles_to_us(p.fanin_cycle - p.fanin_wait_cycle), - cycles_to_us(p.fanin_wait_cycle), - static_cast(p.fanin_atomic_count)); - DEV_ALWAYS("Thread %d: avg/task : %.3fus", - thread_idx, - p.submit_count > 0 ? 
cycles_to_us(total) / p.submit_count : 0.0); - -#if PTO2_TENSORMAP_PROFILING - PTO2TensorMapProfilingData tp = pto2_tensormap_get_profiling(); - DEV_ALWAYS("Thread %d: === TensorMap Lookup Stats ===", thread_idx); - DEV_ALWAYS("Thread %d: lookups : %" PRIu64 ", inserts: %" PRIu64 "", - thread_idx, - static_cast(tp.lookup_count), - static_cast(tp.insert_count)); - DEV_ALWAYS("Thread %d: chain walked : total=%" PRIu64 ", avg=%.1f, max=%d", - thread_idx, - static_cast(tp.lookup_chain_total), - tp.lookup_count > 0 ? static_cast(tp.lookup_chain_total) / tp.lookup_count : 0.0, - tp.lookup_chain_max); - DEV_ALWAYS("Thread %d: overlap checks : %" PRIu64 ", hits=%" PRIu64 " (%.1f%%)", - thread_idx, - static_cast(tp.overlap_checks), - static_cast(tp.overlap_hits), - tp.overlap_checks > 0 ? tp.overlap_hits * 100.0 / tp.overlap_checks : 0.0); -#endif - -#if PTO2_PROFILING - // Write orchestrator summary to shared memory for host-side export (only if profiling enabled) - if (runtime->enable_profiling) { - AicpuOrchSummary orch_summary = {}; - orch_summary.start_time = orch_cycle_start; - orch_summary.end_time = orch_cycle_end; - orch_summary.sync_cycle = p.sync_cycle; - orch_summary.alloc_cycle = p.alloc_cycle; - orch_summary.args_cycle = p.args_cycle; - orch_summary.lookup_cycle = p.lookup_cycle; - orch_summary.heap_cycle = 0; // Now included in alloc_cycle - orch_summary.insert_cycle = p.insert_cycle; - orch_summary.fanin_cycle = p.fanin_cycle; - orch_summary.scope_end_cycle = p.scope_end_cycle; - orch_summary.submit_count = p.submit_count; - perf_aicpu_write_orch_summary(&orch_summary); - } -#endif -#endif - -#if PTO2_PROFILING - // Write core-to-thread mapping (one-time, after orchestration) - if (runtime->enable_profiling) { - perf_aicpu_write_core_assignments( - core_assignments_, core_count_per_thread_, sched_thread_num_, cores_total_num_); - // Flush orchestrator's phase record buffer - perf_aicpu_flush_phase_buffers(thread_idx); - } -#endif - - // Coordinate 
orchestrator completion - int32_t finished = orch_finished_count_.fetch_add(1, std::memory_order_acq_rel) + 1; - if (finished == orch_thread_num_) { - // Last orchestrator: signal completion and trigger core transition - pto2_rt_orchestration_done(rt); - - void* sm = runtime->get_pto2_gm_sm_ptr(); - PTO2SharedMemoryHeader* sm_header = static_cast(sm); - int32_t pto2_task_count = 0; - if (sm_header) { - for (int r = 0; r < PTO2_MAX_RING_DEPTH; r++) { - pto2_task_count += sm_header->rings[r].fc.current_task_index.load(std::memory_order_acquire); - } - } -#if PTO2_PROFILING - pto2_submitted_tasks = pto2_task_count; -#endif - total_tasks_ = pto2_task_count; - if (runtime->enable_profiling && pto2_task_count > 0) { - perf_aicpu_update_total_tasks(runtime, static_cast(pto2_task_count)); - } - orchestrator_done_ = true; - { - int32_t orch_err = 0; - void* sm = runtime->get_pto2_gm_sm_ptr(); - if (sm) { - orch_err = - static_cast(sm)->orch_error_code.load(std::memory_order_relaxed); - } - - // Fatal error: shutdown AICore immediately before core transition. 
- if (orch_err != PTO2_ERROR_NONE) { - emergency_shutdown(runtime); - completed_.store(true, std::memory_order_release); - } - } - -#if PTO2_ORCH_PROFILING - uint64_t reassign_cycle_start = get_sys_cnt_aicpu(); -#endif - - // Skip core transition on fatal error — cores already shut down above - if (completed_.load(std::memory_order_acquire)) { - // Signal transition to unblock scheduler threads waiting at core transition - transition_requested_.store(true, std::memory_order_release); - reassigned_.store(true, std::memory_order_release); - } else if (orch_to_sched_) { - // Compute new core assignments for all threads and initialize donated slots - DEV_INFO("Thread %d: Set orchestrator_done=true, requesting core transition", thread_idx); -#if PTO2_PROFILING - uint64_t orch_stage_end_ts = get_sys_cnt_aicpu(); -#endif - transition_requested_.store(true, std::memory_order_release); -#if PTO2_PROFILING - DEV_ALWAYS( - "Thread %d: orch_stage_end=%" PRIu64 "", thread_idx, static_cast(orch_stage_end_ts)); -#endif - - // Wait for scheduler threads to acknowledge transition request - // All-orchestrator mode (sched_thread_num_ == 0): skip the wait - if (sched_thread_num_ > 0) { - while (wait_reassign_.load(std::memory_order_acquire) != sched_thread_num_) { - if (completed_.load(std::memory_order_acquire)) { - break; - } - SPIN_WAIT_HINT(); - } - } - if (!completed_.load(std::memory_order_acquire)) { - reassign_cores_for_all_threads(); - reassigned_.store(true, std::memory_order_release); - } - } - -#if PTO2_ORCH_PROFILING - uint64_t reassign_cycle_end = get_sys_cnt_aicpu(); - DEV_ALWAYS("Thread %d: reassign, cost %.3fus (orch_idx=%d)", - thread_idx, - cycles_to_us(reassign_cycle_end - reassign_cycle_start), - orch_idx); -#endif - } else { - // Non-last orchestrator: wait for last orchestrator to finish setup - if (orch_to_sched_) { - while (!transition_requested_.load(std::memory_order_acquire)) { - SPIN_WAIT_HINT(); - } - while (!reassigned_.load(std::memory_order_acquire)) 
{ - if (completed_.load(std::memory_order_acquire)) { - break; - } - SPIN_WAIT_HINT(); - } - } - } - } -#if PTO2_PROFILING - uint64_t orch_end_ts = get_sys_cnt_aicpu(); - DEV_ALWAYS("Thread %d: orch_start=%" PRIu64 " orch_end=%" PRIu64 " orch_cost=%.3fus", - thread_idx, - static_cast(orch_cycle_start), - static_cast(orch_end_ts), - cycles_to_us(orch_end_ts - orch_cycle_start)); - if (pto2_submitted_tasks >= 0) { - DEV_ALWAYS("PTO2 total submitted tasks = %d, already executed %d tasks", - pto2_submitted_tasks, - completed_tasks_.load(std::memory_order_acquire)); - } -#endif - DEV_INFO("Thread %d: Orchestrator completed (orch_idx=%d)", thread_idx, orch_idx); - } - - // Scheduler thread (orchestrator threads skip dispatch when orch_to_sched_ is false) - if (!completed_.load(std::memory_order_acquire) && (thread_idx < sched_thread_num_ || orch_to_sched_)) { - // Device orchestration: wait for primary orchestrator to initialize SM header - if (!runtime->get_orch_built_on_host()) { - while (!runtime_init_ready_.load(std::memory_order_acquire)) { - SPIN_WAIT_HINT(); - } - } - always_assert(rt != nullptr); - int32_t completed = resolve_and_dispatch_pto2(runtime, thread_idx); - DEV_INFO("Thread %d: Executed %d tasks from runtime", thread_idx, completed); - } - - // Always shutdown AICore — even if completed_ was already true. - // platform_deinit_aicore_regs is idempotent; orchestrator threads have - // core_count_per_thread_ == 0 so they skip the loop harmlessly. 
- { - const int32_t* shutdown_cores = core_assignments_[thread_idx]; - int32_t shutdown_count = core_count_per_thread_[thread_idx]; - if (shutdown_count > 0) { - auto rc = shutdown_aicore(runtime, thread_idx, shutdown_cores, shutdown_count); - if (rc != 0) { - return rc; - } - } - } - - DEV_INFO("Thread %d: Completed", thread_idx); - - // Check if this is the last thread to finish - int32_t prev_finished = finished_count_.fetch_add(1, std::memory_order_acq_rel); - if (prev_finished + 1 == thread_num_) { - finished_.store(true, std::memory_order_release); - // Destroy PTO2 runtime and close orchestration SO (moved from orchestrator path) - if (!runtime->get_orch_built_on_host() && orch_so_handle_ != nullptr) { - // Clear the borrowed pointer in the orchestration SO before destroying - // rt, so g_pto2_current_runtime never points to freed memory. - if (orch_bind_runtime_ != nullptr) { - orch_bind_runtime_(nullptr); - } - pto2_runtime_destroy(rt); - dlclose(orch_so_handle_); - unlink(orch_so_path_); - } - } - - return 0; -} - -void AicpuExecutor::deinit(Runtime* runtime) { - // 1. Invalidate AICPU cache for Runtime address range. - // Next round's Host DMA (rtMemcpy) writes fresh Runtime to HBM but - // bypasses this cache. Invalidating now ensures next round reads from HBM. 
- cache_invalidate_range(runtime, sizeof(Runtime)); - - // Reset all per-core execution state - for (int32_t i = 0; i < RUNTIME_MAX_WORKER; i++) { - core_exec_states_[i] = {}; - core_exec_states_[i].executing_reg_task_id = AICPU_TASK_INVALID; - } - - // Clear per-core dispatch payloads - memset(s_pto2_payload_per_core, 0, sizeof(s_pto2_payload_per_core)); - - completed_tasks_.store(0, std::memory_order_release); - total_tasks_ = 0; - finished_count_.store(0, std::memory_order_release); - orchestrator_done_ = false; - pto2_init_done_.store(false, std::memory_order_release); - pto2_init_complete_.store(false, std::memory_order_release); - runtime_init_ready_.store(false, std::memory_order_release); - - // Reset core transition state - transition_requested_.store(false, std::memory_order_release); - wait_reassign_.store(0, std::memory_order_release); - reassigned_.store(false, std::memory_order_release); - completed_.store(false, std::memory_order_release); - orch_finished_count_.store(0, std::memory_order_release); - - // Reset core discovery state - aic_count_ = 0; - aiv_count_ = 0; - - regs_ = 0; - orch_func_ = nullptr; - orch_bind_runtime_ = nullptr; - orch_args_cached_ = nullptr; - orch_so_handle_ = nullptr; - orch_so_path_[0] = '\0'; - - // Clear file-scope PTO2Runtime pointer (freed by orchestrator thread before deinit) - rt = nullptr; - - DEV_INFO("DeInit: Runtime execution state reset"); - - initialized_.store(false, std::memory_order_release); - init_done_.store(false, std::memory_order_release); - init_failed_.store(false, std::memory_order_release); - thread_idx_.store(0, std::memory_order_release); - finished_.store(false, std::memory_order_release); - - DEV_INFO("DeInit: AicpuExecutor reset complete"); -} - -void AicpuExecutor::emergency_shutdown(Runtime* runtime) { - DEV_WARN("Emergency shutdown: sending exit signal to all initialized cores"); - Handshake* all_handshakes = reinterpret_cast(runtime->workers); - for (int32_t i = 0; i < cores_total_num_; 
i++) { - Handshake* hank = &all_handshakes[i]; - OUT_OF_ORDER_STORE_BARRIER(); - hank->aicpu_regs_ready = 1; - if (core_exec_states_[i].reg_addr != 0) { - platform_deinit_aicore_regs(core_exec_states_[i].reg_addr); - } - } - - DEV_WARN("Emergency shutdown complete"); -} - -void AicpuExecutor::diagnose_stuck_state( - Runtime* runtime, int32_t thread_idx, const int32_t* cur_thread_cores, int32_t core_num, Handshake* hank) { - (void)runtime; // NOLINT(readability/casting) - PTO2SchedulerState* sched = &rt->scheduler; - DEV_ALWAYS("========== DIAGNOSTIC REPORT: Thread %d ==========", thread_idx); - - int32_t completed = completed_tasks_.load(std::memory_order_acquire); - int32_t total = total_tasks_; - DEV_ALWAYS("Progress: %d/%d tasks (%.1f%%)", completed, total, total > 0 ? completed * 100.0 / total : 0.0); - - uint64_t aic_ready = 0, aiv_ready = 0, mix_ready = 0; - if (rt) { - aic_ready = sched->ready_queues[static_cast(PTO2ResourceShape::AIC)].size(); - aiv_ready = sched->ready_queues[static_cast(PTO2ResourceShape::AIV)].size(); - mix_ready = sched->ready_queues[static_cast(PTO2ResourceShape::MIX)].size(); - } - DEV_ALWAYS("Ready Queues: AIC=%lu, AIV=%lu, MIX=%lu", aic_ready, aiv_ready, mix_ready); - - int32_t busy_cores = 0; - int32_t idle_cores = 0; - - DEV_ALWAYS("Core Status:"); - for (int32_t i = 0; i < core_num; i++) { - int32_t core_id = cur_thread_cores[i]; - Handshake* h = &hank[core_id]; - const char* core_type_str = core_type_to_string(h->core_type); - - uint64_t reg_addr = core_exec_states_[core_id].reg_addr; - uint64_t reg_val = read_reg(reg_addr, RegId::COND); - int32_t reg_task_id = EXTRACT_TASK_ID(reg_val); - int32_t reg_state = EXTRACT_TASK_STATE(reg_val); - int32_t task_id = core_exec_states_[core_id].executing_reg_task_id; - - if (reg_state != TASK_FIN_STATE || task_id >= 0) { - busy_cores++; - if (task_id >= 0) { - int32_t kernel_id = -1; - if (rt && rt->sm_handle && core_exec_states_[core_id].executing_slot_state) { - int32_t diag_slot = 
static_cast(core_exec_states_[core_id].executing_subslot); - kernel_id = core_exec_states_[core_id].executing_slot_state->task->kernel_id[diag_slot]; - } - DEV_ALWAYS( - " Core %d [%s, BUSY]: COND=0x%lx (reg_task_id=%d, reg_state=%s), executing_reg_task_id=%d, " - "kernel_id=%d", - core_id, - core_type_str, - reg_val, - reg_task_id, - reg_state == TASK_FIN_STATE ? "FIN" : "ACK", - task_id, - kernel_id); - } else { - DEV_ALWAYS(" Core %d [%s, BUSY]: COND=0x%lx (reg_task_id=%d, reg_state=%s) but task_id not tracked", - core_id, - core_type_str, - reg_val, - reg_task_id, - reg_state == TASK_FIN_STATE ? "FIN" : "ACK"); - } - } else { - idle_cores++; - } - } - - DEV_ALWAYS("Summary: %d busy, %d idle", busy_cores, idle_cores); - - // Diagnose deadlock vs livelock - if (busy_cores == 0 && aic_ready == 0 && aiv_ready == 0 && completed < total) { - DEV_ALWAYS("*** DEADLOCK DETECTED ***"); - DEV_ALWAYS("All cores idle, no ready tasks, but %d tasks incomplete", total - completed); - DEV_ALWAYS("Check PTO2 shared memory for task dependency state"); - } else if (busy_cores > 0) { - DEV_ALWAYS("*** LIVELOCK / HUNG TASK ***"); - DEV_ALWAYS("%d cores executing but no progress", busy_cores); - } - - DEV_ALWAYS("========== END DIAGNOSTIC =========="); -} - -// ===== Public Entry Point ===== - -/** - * aicpu_execute - Main AICPU kernel execution entry point - * - * This is called by DynTileFwkBackendKernelServer in kernel.cpp. - * Orchestrates the complete task runtime execution: - * 1. Initialize executor (thread-safe, first thread only) - * 2. Wait for initialization to complete - * 3. Execute tasks on managed cores - * 4. 
Cleanup when last thread finishes - * - * @param runtime Pointer to Runtime structure - * @return 0 on success, non-zero on error - */ -extern "C" int32_t aicpu_execute(Runtime* runtime) { - if (runtime == nullptr) { - DEV_ERROR("%s", "Invalid argument: null Runtime pointer"); - return -1; - } - - DEV_INFO("%s", "aicpu_execute: Starting AICPU kernel execution"); - - // Get platform register addresses from platform-level global - g_aicpu_executor.regs_ = get_platform_regs(); - - g_aicpu_executor.init(runtime); - - while (!g_aicpu_executor.init_done_.load(std::memory_order_acquire)) { - if (g_aicpu_executor.init_failed_.load(std::memory_order_acquire)) { - DEV_ERROR("%s", "aicpu_execute: Initialization failed, aborting execution"); - return -1; - } - } - - int32_t rc = g_aicpu_executor.run(runtime); - if (rc != 0) { - DEV_ERROR("aicpu_execute: Thread execution failed with rc=%d", rc); - return rc; - } - - // Last thread cleans up - if (g_aicpu_executor.finished_.load(std::memory_order_acquire)) { - DEV_INFO("aicpu_execute: Last thread finished, cleaning up"); - g_aicpu_executor.deinit(runtime); - } - - DEV_INFO("%s", "aicpu_execute: Kernel execution completed successfully"); - return 0; -} diff --git a/src/a2a3/runtime/tensormap_and_ringbuffer_unmodified/build_config.py b/src/a2a3/runtime/tensormap_and_ringbuffer_unmodified/build_config.py deleted file mode 100644 index e0a10c982..000000000 --- a/src/a2a3/runtime/tensormap_and_ringbuffer_unmodified/build_config.py +++ /dev/null @@ -1,26 +0,0 @@ -# Copyright (c) PyPTO Contributors. -# This program is free software, you can redistribute it and/or modify it under the terms and conditions of -# CANN Open Software License Agreement Version 2.0 (the "License"). -# Please refer to the License for details. You may not use this file except in compliance with the License. 
-# THIS SOFTWARE IS PROVIDED ON AN "AS IS" BASIS, WITHOUT WARRANTIES OF ANY KIND, EITHER EXPRESS OR IMPLIED, -# INCLUDING BUT NOT LIMITED TO NON-INFRINGEMENT, MERCHANTABILITY, OR FITNESS FOR A PARTICULAR PURPOSE. -# See LICENSE in the root of the software repository for the full text of the License. -# ----------------------------------------------------------------------------------------------------------- -# Tensormap and Ringbuffer Runtime build configuration -# All paths are relative to this file's directory (src/runtime/tensormap_and_ringbuffer/) -# -# This is a device-orchestration runtime where: -# - AICPU thread 3 runs the orchestrator (builds task graph on device) -# - AICPU threads 0/1/2 run schedulers (dispatch tasks to AICore) -# - AICore executes tasks via an aligned PTO2DispatchPayload + pre-built dispatch_args -# -# The "orchestration" directory contains source files compiled into both -# runtime targets AND the orchestration .so (e.g., tensor methods needed -# by the Tensor constructor's validation logic). - -BUILD_CONFIG = { - "aicore": {"include_dirs": ["runtime", "common"], "source_dirs": ["aicore", "orchestration"]}, - "aicpu": {"include_dirs": ["runtime", "common"], "source_dirs": ["aicpu", "runtime", "orchestration"]}, - "host": {"include_dirs": ["runtime", "common"], "source_dirs": ["host", "runtime", "orchestration"]}, - "orchestration": {"include_dirs": ["runtime", "orchestration", "common"], "source_dirs": ["orchestration"]}, -} diff --git a/src/a2a3/runtime/tensormap_and_ringbuffer_unmodified/common/intrinsic.h b/src/a2a3/runtime/tensormap_and_ringbuffer_unmodified/common/intrinsic.h deleted file mode 100644 index 9ea70625a..000000000 --- a/src/a2a3/runtime/tensormap_and_ringbuffer_unmodified/common/intrinsic.h +++ /dev/null @@ -1,141 +0,0 @@ -/* - * Copyright (c) PyPTO Contributors. 
- * This program is free software, you can redistribute it and/or modify it under the terms and conditions of - * CANN Open Software License Agreement Version 2.0 (the "License"). - * Please refer to the License for details. You may not use this file except in compliance with the License. - * THIS SOFTWARE IS PROVIDED ON AN "AS IS" BASIS, WITHOUT WARRANTIES OF ANY KIND, EITHER EXPRESS OR IMPLIED, - * INCLUDING BUT NOT LIMITED TO NON-INFRINGEMENT, MERCHANTABILITY, OR FITNESS FOR A PARTICULAR PURPOSE. - * See LICENSE in the root of the software repository for the full text of the License. - * ----------------------------------------------------------------------------------------------------------- - */ - -/** - * @file intrinsic.h - * @brief SPMD execution context for AICore user kernels - * - * Topology data exposed to user kernels has two distinct lifetimes: - * - * 1. Global topology (per-core, fixed after runtime init): - * - sub_block_id : identifies the AIV lane within a cluster - * (0 = AIV0/left, 1 = AIV1/right). Initialized once at runtime - * startup based on each core's cluster position; never changes. - * Only meaningful for AIV kernels in MIX tasks. - * - * 2. Local topology (per-dispatch, changes each dispatch): - * - block_idx : which logical block the current worker is executing - * - block_num : total number of blocks in this task (= block_dim) - * Written by build_payload() before each dispatch. - * - * Both categories are injected via two pointer slots appended at the tail - * of the kernel args[] array: - * - * args layout: - * [0 .. tensor_count-1] = tensor GM pointers - * [tensor_count .. +scalar_count-1] = scalar values - * ... - * [SPMD_LOCAL_CONTEXT_INDEX] = (uint64_t)&LocalContext (per-dispatch) - * [SPMD_GLOBAL_CONTEXT_INDEX] = (uint64_t)&GlobalContext (per-core) - * - * The suffix positions are compile-time constants and do not depend on the - * runtime tensor_count or scalar_count. 
- * - * Include this header in AICore kernel source files to use the Get* accessors. - * Do NOT depend on the raw index constants; always use the accessor functions. - * - * On CCEC (real hardware), __gm__ and __aicore__ must be defined before - * including this header (e.g. via or manual #define). - * The #ifndef guards below provide fallbacks for non-kernel builds - * (AICPU, HOST) where these qualifiers are not needed. - */ - -#pragma once - -#include - -#ifndef __gm__ -#define __gm__ -#endif - -#ifndef __aicore__ -#define __aicore__ -#endif - -/** Number of extra pointer slots appended to the args[] tail (LocalContext + GlobalContext). */ -static constexpr int32_t PTO2_EXT_PARAMS_COUNT = 2; - -/** - * Args[] suffix indices for context pointers. - * Derived from MAX_TENSOR_ARGS(16) + MAX_SCALAR_ARGS(128). - * Users should not depend on these values; use the Get* functions below. - */ -static constexpr int32_t SPMD_LOCAL_CONTEXT_INDEX = 144; -static constexpr int32_t SPMD_GLOBAL_CONTEXT_INDEX = 145; - -/** - * Per-core global context, stored in PTO2DispatchPayload. - * Initialized once at runtime startup (init_global_context) based on each - * core's cluster position. Never modified after initialization. - */ -struct GlobalContext { - // AIV lane within cluster: 0=AIV0(left), 1=AIV1(right). - // Used by AIV to select the correct intra-cluster hw instruction. - // Not meaningful for AIC kernels or single-AIV tasks. - int32_t sub_block_id; -}; - -/** - * Per-dispatch local context, stored in PTO2DispatchPayload. - * Written by build_payload() before each dispatch. Different blocks of the - * same task receive different block_idx values but the same block_num. - */ -struct LocalContext { - int32_t block_idx; // Logical block index within the task [0, block_num) - int32_t block_num; // How many logical blocks this task requires. - // Currently fixed to 1 (block_dim > 1 not yet implemented). 
- // NOT the same as RUNTIME_CONFIG.block_dim in kernel_config.py, - // which controls how many physical cores the runtime launches. -}; - -/** - * Return the AIV lane index within the cluster. - * In a MIX 1C2V task: AIV0(left)=0, AIV1(right)=1. - * - * This value is only meaningful for AIV kernels in MIX tasks. It tells - * the AIV whether it is the left lane or the right lane within the cluster, - * which determines the correct hardware instruction for intra-cluster - * communication. - * - * AIC kernels should NOT call this function. - * Single-AIV tasks have no intra-cluster communication, so sub_block_id - * has no meaning and should not be used. - */ -static __aicore__ inline int32_t get_sub_block_id(__gm__ int64_t* args) { - __gm__ GlobalContext* ctx = - reinterpret_cast<__gm__ GlobalContext*>(static_cast(args[SPMD_GLOBAL_CONTEXT_INDEX])); - return ctx->sub_block_id; -} - -/** - * Return the logical block index assigned to the current worker. - * Range: [0, get_block_num(args)). - * Within the same task, different blocks receive different indices. - */ -static __aicore__ inline int32_t get_block_idx(__gm__ int64_t* args) { - __gm__ LocalContext* ctx = - reinterpret_cast<__gm__ LocalContext*>(static_cast(args[SPMD_LOCAL_CONTEXT_INDEX])); - return ctx->block_idx; -} - -/** - * Return how many logical blocks the current task requires. - * All blocks of the same task see the same value. - * Currently always returns 1 (block_dim>1 not yet implemented). - * - * Note: this is NOT the same as RUNTIME_CONFIG.block_dim in - * kernel_config.py, which controls how many physical cores are launched. 
- */ -static __aicore__ inline int32_t get_block_num(__gm__ int64_t* args) { - __gm__ LocalContext* ctx = - reinterpret_cast<__gm__ LocalContext*>(static_cast(args[SPMD_LOCAL_CONTEXT_INDEX])); - return ctx->block_num; -} diff --git a/src/a2a3/runtime/tensormap_and_ringbuffer_unmodified/docs/MULTI_RING.md b/src/a2a3/runtime/tensormap_and_ringbuffer_unmodified/docs/MULTI_RING.md deleted file mode 100644 index 339c1ee5a..000000000 --- a/src/a2a3/runtime/tensormap_and_ringbuffer_unmodified/docs/MULTI_RING.md +++ /dev/null @@ -1,237 +0,0 @@ -# Multi-Ring Buffer Architecture - -> Extension to the PTO2 runtime. For the base architecture, see [RUNTIME_LOGIC.md](RUNTIME_LOGIC.md). - -## 1. Problem - -The single-ring design uses one `last_task_alive` watermark shared by HeapRing, TaskRing, and DepPool. When tasks from an inner scope (e.g., per-block iteration) complete, their resources cannot be reclaimed until **all** prior tasks — including those from the outer scope — also complete. This wastes ring capacity and can trigger deadlocks when ring sizes are small. - -## 2. Solution - -Split HeapRing, TaskRing, and DepPool into arrays of `PTO2_MAX_RING_DEPTH` (4) independent instances. Each scope depth maps to its own ring, with an independent `last_task_alive` watermark. - -``` -Scope depth 0 ──► rings[0] = { HeapRing, TaskRing, DepPool } -Scope depth 1 ──► rings[1] = { HeapRing, TaskRing, DepPool } -Scope depth 2 ──► rings[2] = { HeapRing, TaskRing, DepPool } -Scope depth ≥3 ──► rings[3] = { HeapRing, TaskRing, DepPool } (clamped) -``` - -Inner-scope tasks can now be reclaimed independently without waiting for outer-scope tasks to complete. - -## 3. 
Task ID Encoding - -Task IDs are widened from 32-bit to 64-bit to carry the ring identity: - -``` -task_id.raw = (ring_id << 32) | local_id -``` - -`PTO2TaskId` exposes direct accessors in `pto_runtime2_types.h`: - -| API | Purpose | -|-----|---------| -| `pto2_make_task_id(ring_id, local_id)` | Compose a 64-bit task ID (`PTO2TaskId`) | -| `task_id.ring()` | Extract `ring_id` (bits 63-32) | -| `task_id.local()` | Extract `local_id` (bits 31-0) | -| `task_id.raw` | Access the packed 64-bit encoding | - -Type changes: - -| Field | Before | After | -|-------|--------|-------| -| `PTO2TaskDescriptor.task_id` | `int32_t` | `PTO2TaskId` | -| `PTO2TensorMapEntry.producer_task_id` | `int32_t` | `PTO2TaskId` | -| `PTO2TaskSlotState.ring_id` | N/A | `uint8_t` (new, denormalized for fast access) | - -## 4. Data Structures - -### 4.1 PTO2RingSet (new) - -Bundles the three per-ring resources into a single aggregate (`pto_ring_buffer.h`): - -```cpp -struct PTO2RingSet { - PTO2HeapRing heap_ring; - PTO2TaskRing task_ring; - PTO2DepListPool dep_pool; -}; -``` - -### 4.2 PTO2OrchestratorState (modified) - -```cpp -// Before: single ring -PTO2HeapRing heap_ring; -PTO2TaskRing task_ring; -PTO2DepListPool dep_pool; - -// After: per-ring array -PTO2RingSet rings[PTO2_MAX_RING_DEPTH]; -int32_t dep_pool_last_reclaimed[PTO2_MAX_RING_DEPTH]; -``` - -Ring selection: `current_ring_id() = min(scope_stack_top, PTO2_MAX_RING_DEPTH - 1)`. 
### 4.3 PTO2SharedMemoryHeader (modified)

Per-ring flow control and per-ring layout info are grouped together:

```cpp
struct PTO2RingFlowControl {
    std::atomic<int64_t> current_task_index;  // task ring head
    std::atomic<int64_t> last_task_alive;     // task ring tail
    std::atomic<int64_t> heap_top;            // heap alloc pointer
    std::atomic<int64_t> heap_tail;           // heap reclaim pointer
};

struct PTO2SharedMemoryRingHeader {
    PTO2RingFlowControl fc;
    uint64_t task_window_size;
    uint64_t heap_size;
    uint64_t task_descriptors_offset;
};

// In header:
PTO2SharedMemoryRingHeader rings[PTO2_MAX_RING_DEPTH];
```

The global `heap_tail_gen` ticket counter is removed; each ring's scheduler state serializes ring-advance via a per-ring try-lock.

### 4.4 PTO2SharedMemoryHandle (modified)

Per-ring descriptor and payload arrays:

```cpp
PTO2TaskDescriptor* task_descriptors[PTO2_MAX_RING_DEPTH];
PTO2TaskPayload* task_payloads[PTO2_MAX_RING_DEPTH];
```

### 4.5 PTO2SchedulerState (modified)

```cpp
struct RingSchedState {
    PTO2TaskSlotState* slot_states;
    int32_t task_window_size;
    int32_t task_window_mask;
    std::atomic<bool> advance_lock;
};

RingSchedState ring_sched_states[PTO2_MAX_RING_DEPTH];
```

### 4.6 PTO2TensorMap (modified)

```cpp
PTO2TensorMapEntry** task_entry_heads[PTO2_MAX_RING_DEPTH];
int64_t last_task_alives[PTO2_MAX_RING_DEPTH];
```

Entry validity checks and `cleanup_retired` operate per-ring:

```cpp
bool entry_valid(const PTO2TensorMapEntry& e) {
    int32_t ring = e.producer_task_id.ring();
    int32_t local = e.producer_task_id.local();
    return local >= last_task_alives[ring];
}
```

### 4.7 Unchanged Structures

| Structure | Reason |
|-----------|--------|
| `PTO2DepListEntry` | Stores a `PTO2TaskSlotState*` pointer — naturally crosses ring boundaries |
| `PTO2TaskPayload` | `fanin_slot_states[]` are pointers — no ring coupling |
| `PTO2ReadyQueue` | Global ready queues shared across all rings (tasks ready to dispatch regardless of origin ring) |
| `PTO2DispatchPayload` | Built per-dispatch, no ring state needed |

## 5. Reclamation

### 5.1 Per-Ring Watermark Advancement

Each ring's `last_task_alive` advances independently:

```
advance_ring_pointers(ring_id):
    la = rings[ring_id].fc.last_task_alive
    while task_state[la & mask] >= CONSUMED:
        advance heap_tail from packed_buffer_end
        reset fanin_refcount
        CAS(last_task_alive, la, la+1)
        la++
```

Per-ring try-locks in the scheduler state prevent concurrent scheduler threads from interleaving `heap_tail` writes within the same ring.

### 5.2 Cross-Ring Dependencies

Dependency edges use `PTO2TaskSlotState*` pointers, which naturally span rings:

- A ring 1 task depends on a ring 0 producer → ring 0's `fanout_head` linked list contains a ring 1 `PTO2TaskSlotState*`
- When the ring 0 task completes, it walks its fanout list and updates the ring 1 consumers' `fanin_refcount`
- No special cross-ring logic is needed — the pointer-based design is ring-agnostic

### 5.3 DepPool Reclamation

```
pto2_dep_pool_reclaim(ring_id):
    la = rings[ring_id].fc.last_task_alive
    newest_consumed = la - 1
    mark = task_payloads[ring_id][slot(newest_consumed)].dep_pool_mark
    if mark > 0:
        rings[ring_id].dep_pool.advance_tail(mark)
```

Note: dep entries from ring N's pool may appear in ring M's fanout lists. Reclamation is safe because the entries are accessed during fanout traversal (completion time), which always happens before the consumer task — and therefore the dep entry — becomes eligible for reclamation.

## 6. AICPU Register Protocol Fix

The AICore dispatch protocol uses 32-bit registers. With multi-ring, truncating `task_id` to 32 bits loses the `ring_id`, causing collisions:

```
Ring 0, local_id=0 → DATA_MAIN_BASE = 0 + 1 = 1
Ring 1, local_id=0 → DATA_MAIN_BASE = 0 + 1 = 1   (collision!)
```

AICore uses `last_reg_val` to detect new dispatches — identical values cause skipped tasks and false completions from stale COND registers.
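The truncation collision described above can be reproduced in isolation. This is a self-contained sketch; `pack` and `reg_value` are hypothetical stand-ins for the task-ID encoding and the register-write path:

```cpp
#include <cassert>
#include <cstdint>

// Pack ring_id (high 32 bits) and local_id (low 32 bits).
uint64_t pack(uint32_t ring_id, uint32_t local_id) {
    return (static_cast<uint64_t>(ring_id) << 32) | local_id;
}

// A 32-bit register write keeps only the low half, so the ring is lost.
// (The "+ 1" mirrors the DATA_MAIN_BASE = local_id + 1 example above.)
uint32_t reg_value(uint64_t task_id_raw) {
    return static_cast<uint32_t>(task_id_raw) + 1;
}
```

Two distinct 64-bit task IDs thus produce identical register values, which is exactly what defeats the `last_reg_val` change detection.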
**Fix**: A per-core monotonic dispatch counter `s_dispatch_seq[core_id]` replaces `task_id` in register writes, guaranteeing unique `DATA_MAIN_BASE` values per core regardless of ring origin.

## 7. Configuration

### 7.1 Compile-Time Defaults (per ring)

| Constant | Default | Total (×4 rings) |
|----------|---------|-------------------|
| `PTO2_TASK_WINDOW_SIZE` | 16384 | 65536 |
| `PTO2_HEAP_SIZE` | 256 MB | 1 GB |
| `PTO2_DEP_LIST_POOL_SIZE` | 16384 | 65536 |

### 7.2 Runtime Environment Overrides

Uniform (applies to all rings):

```
PTO2_RING_TASK_WINDOW=1024
PTO2_RING_HEAP=1048576
PTO2_RING_DEP_POOL=1024
```

In `kernel_config.py`:

```python
RUNTIME_ENV = {
    "PTO2_RING_TASK_WINDOW": "128",
    "PTO2_RING_HEAP": "262144",
    "PTO2_RING_DEP_POOL": "256",
}
```

### 7.3 Sizing Guidelines

- `task_window` must be ≥ the max tasks in any single scope, plus headroom for concurrent scopes
- `heap` must accommodate peak output buffer allocation across all in-flight tasks on that ring
- `dep_pool` must be ≥ the total dependency entries for all in-flight tasks on that ring
- On hardware, back-pressure latency is higher than in simulation — size conservatively
- Adding an inner `PTO2_SCOPE` reduces peak per-ring usage, enabling smaller sizes
diff --git a/src/a2a3/runtime/tensormap_and_ringbuffer_unmodified/docs/ROADMAP.md b/src/a2a3/runtime/tensormap_and_ringbuffer_unmodified/docs/ROADMAP.md
deleted file mode 100644
index 0dae7b8c8..000000000
--- a/src/a2a3/runtime/tensormap_and_ringbuffer_unmodified/docs/ROADMAP.md
+++ /dev/null
@@ -1,86 +0,0 @@
# PTO2 Runtime Roadmap: Advanced Scheduling Features

This document outlines the planned features and architectural changes for the PTO2 runtime system, specifically focusing on advanced cluster-aware and block-level scheduling semantics.

## 1. In-Cluster Function Group Scheduling

**Goal:** Enable co-scheduling of multiple tasks onto the same physical hardware cluster to leverage local interconnects and optimize data locality.

### Concept

An **in-cluster function group** consists of all incore functions submitted between an `allocate_cluster()` and a `free_cluster()` call (or within a managed scope). The runtime treats this group as a co-scheduled unit: every task in the group executes on the **same physical cluster** (identified by a `clusterID`).

### Required Architectural Changes

#### 1. Task Descriptor Extension

The `PTO2TaskDescriptor` will be extended to record function group membership:

- `cluster_id` (int32_t): ID of the allocated cluster (-1 = unconstrained).
- `group_id` (int32_t): Function group identifier.

#### 2. Orchestration API Additions

```cpp
// Allocate a cluster. Blocks if no cluster is available.
int32_t pto2_rt_allocate_cluster(PTO2Runtime* rt);

// Release a cluster back to the free pool.
void pto2_rt_free_cluster(PTO2Runtime* rt, int32_t cluster_id);

// Submit a task constrained to a specific cluster.
void pto2_rt_submit_task_clustered(PTO2Runtime* rt, int kernel_id,
                                   int worker_type, Arg* args,
                                   int n, int32_t cluster_id);
```

#### 3. Scheduler Enhancements

- **Cluster ↔ Core mapping**: A static, platform-specific mapping from `cluster_id` to the set of physical cores (e.g., cluster 0 = {AIC0, AIV0, AIV1}).
- **Cluster-Aware Dispatch**: When popping a task, if `cluster_id >= 0`, the scheduler dispatches it *only* to a core belonging to that specific cluster.
- **Cluster Free Pool**: A ring or bitset tracking free clusters to handle allocation and release.
- **Back-Pressure**: `pto2_rt_allocate_cluster` will implement a spin-wait pattern with deadlock detection, similar to the existing task and heap rings.

---

## 2. `block_incore` (SPMD → MPMD) Task Submission

**Goal:** Support executing a single logical SPMD block function as multiple independent MPMD tasks across available cores.

### Execution Model

At the runtime level, the orchestration layer will **expand** a single `block_incore` call (with a specified `block_dim`) into `block_dim` individual tasks, each with a distinct `block_id`.

```cpp
// Orchestration expansion logic
PTO2_SCOPE(rt) {
    for (int bid = 0; bid < block_dim; bid++) {
        // ... build args with make_scalar_param(bid) ...
        pto2_rt_submit_task(rt, KERNEL_FUNC_ID, PTO2_WORKER_VECTOR, args, 4);
    }
}
```

### Future Optimization Path

While the initial implementation will use O(N) expansion (submitting N individual task descriptors), future optimizations may include:

- **Batch Descriptors**: A single descriptor containing a `block_dim` field.
- **Group-Aware Dispatch**: The scheduler scans one descriptor and expands it into `block_dim` hardware dispatches.
- **Shared-Tensor Optimization**: Reducing TensorMap entries by keeping one entry per logical tensor instead of one per per-block tensor.

---

## 3. `block_incore` as InCore Function (Cube + Vector)

**Goal:** Allow a `block_incore` function to be a composite subgraph requiring both AIC (Cube) and AIV (Vector) cores working cooperatively on the same data block.

### Execution Model

When combined with cluster allocation, both the cube and vector tasks of each block are pinned to the **same cluster**. This ensures they execute on co-located cores and can utilize local interconnects (e.g., `PIPE_IN`/`PIPE_OUT`) without round-tripping to Global Memory.
```cpp
// Each block runs its cube and vector kernels on the same cluster
int32_t cid = pto2_rt_allocate_cluster(rt);
PTO2_SCOPE(rt) {
    pto2_rt_submit_task_clustered(rt, CUBE_KERNEL, PTO2_WORKER_CUBE, ..., cid);
    pto2_rt_submit_task_clustered(rt, VEC_KERNEL, PTO2_WORKER_VECTOR, ..., cid);
}
pto2_rt_free_cluster(rt, cid);
```

### Data Structure Impact Summary

- `PTO2TaskDescriptor`: Add `cluster_id`, `group_id`, `block_id`, `block_dim`.
- `PTO2SharedMemoryHeader`: Add cluster free pool tracking.
- **Scheduler**: Cluster-aware dispatch logic and cluster-to-core mapping tables.
diff --git a/src/a2a3/runtime/tensormap_and_ringbuffer_unmodified/docs/RUNTIME_LOGIC.md b/src/a2a3/runtime/tensormap_and_ringbuffer_unmodified/docs/RUNTIME_LOGIC.md
deleted file mode 100644
index ecadc65c5..000000000
--- a/src/a2a3/runtime/tensormap_and_ringbuffer_unmodified/docs/RUNTIME_LOGIC.md
+++ /dev/null
@@ -1,658 +0,0 @@
# PTO2 Runtime System Design

## Overview

PTO2 (Parallel Task Orchestration v2) is a runtime system for executing task graphs on Ascend AI processors. It coordinates four layers of execution:

- **Host** (x86/ARM CPU): compiles kernels, allocates device memory, initializes the Runtime, and launches AICPU/AICore threads.
- **AICPU** (device ARM cores): runs the orchestrator (task graph builder) and scheduler threads.
- **AICore** (AI compute cores): executes kernel functions dispatched by the scheduler.
- **Shared Memory** (Global Memory): ring buffers, task descriptors, heap, and TensorMap shared between orchestrator and schedulers.
```
┌────────────────────────────────────────────────────────────────────┐
│ Host (CPU)                                                         │
│  golden.py → code_runner.py → compile kernels → init Runtime       │
│  → upload binaries → launch AICPU/AICore → collect results         │
└───────────────────────────┬────────────────────────────────────────┘
                            │ device memory / GM
┌───────────────────────────▼────────────────────────────────────────┐
│ AICPU (4 threads)                                                  │
│  Thread 3: Orchestrator (builds task graph)                        │
│  Threads 0-2: Schedulers (dispatch tasks to AICore)                │
│                                                                    │
│  ┌──────────────────────────────────────────────────────────────┐  │
│  │ Shared Memory (GM)                                           │  │
│  │  SharedMemoryHeader │ TaskDescriptors[] │ DepListPool        │  │
│  │  GM Heap (output buffers)                                    │  │
│  └──────────────────────────────────────────────────────────────┘  │
│                                                                    │
│  Scheduler ──Handshake/Registers──► AICore workers (AIC + AIV)     │
└────────────────────────────────────────────────────────────────────┘
```

---

## 1. Runtime Variants

Three runtime backends exist under `src/runtime/`, each representing a different orchestration and scheduling strategy.

### 1.1 host_build_graph

The host builds the complete task graph before launching device execution. The orchestration SO is loaded and executed on the host CPU.

- **Task storage**: fixed `Task[]` array (up to 131072 tasks)
- **Scheduling**: AICPU receives the pre-built graph and dispatches tasks by traversing dependencies
- **Use case**: development and debugging; no device-side orchestration overhead

### 1.2 aicpu_build_graph

The orchestration runs on an AICPU thread, building the task graph on device. Supports concurrent build + schedule (`build_mode=1`).
- **Task storage**: same `Task[]` array as host_build_graph
- **AicpuBuildApi**: `add_task`, `add_successor_conditional`, `publish_task`, `device_malloc`
- **Use case**: reduced host→device data transfer; the graph can depend on device-side data

### 1.3 tensormap_and_ringbuffer (PTO2)

The primary production runtime. Uses ring buffers for task slots and output memory, with a TensorMap for automatic dependency tracking.

- **Task storage**: `PTO2TaskDescriptor[]` in a shared memory ring buffer
- **Memory**: GM Heap ring for output buffer allocation
- **Dependencies**: automatically derived from tensor read/write patterns via TensorMap
- **Thread model**: 3 scheduler threads + 1 orchestrator thread on AICPU
- **Multi-ring**: HeapRing, TaskRing, and DepPool are split into `PTO2_MAX_RING_DEPTH` (4) independent instances for nested scope isolation. See [MULTI_RING.md](MULTI_RING.md) for details.
- **Use case**: production workloads; supports streaming, flow control, and large batch sizes

---

## 2. Platform Abstraction

Two platform implementations exist under `src/platform/`, sharing a common interface.
### 2.1 a2a3 (Real Ascend Hardware)

| Component | Description |
|-----------|-------------|
| `device_runner.cpp` | Uses CANN APIs: `rtMalloc`, `rtMemcpy`, `rtLaunchKernel` |
| `memory_allocator.cpp` | Wraps `rtMalloc`/`rtFree` with allocation tracking |
| `aicore/kernel.cpp` | `KERNEL_ENTRY(aicore_kernel)` → `aicore_execute` |
| `aicpu/kernel.cpp` | `DynTileFwkBackendKernelServer` entry → `aicpu_execute` |
| `spin_hint.h` | ARM `wfe`/`yield` instructions for efficient spinning |

### 2.2 a2a3sim (Thread Simulation)

| Component | Description |
|-----------|-------------|
| `device_runner.cpp` | Uses `std::thread` to simulate AICPU/AICore |
| `memory_allocator.cpp` | Wraps `malloc`/`free` |
| `aicore/kernel.cpp` | `aicore_execute_wrapper` sets `g_sim_reg_base` per core |
| `upload_kernel_binary` | `dlopen` the kernel SO, `dlsym` the entry point |

### 2.3 Platform Constants (`platform_config.h`)

| Constant | Value | Description |
|----------|-------|-------------|
| `PLATFORM_MAX_BLOCKDIM` | 24 | Maximum blocks (each = 1 AIC + 2 AIV) |
| `PLATFORM_MAX_AICPU_THREADS` | 4 | AICPU thread count (3 schedulers + 1 orchestrator) |
| `PLATFORM_MAX_AIC_PER_THREAD` | 24 | Max AIC cores per scheduler thread |
| `PLATFORM_MAX_AIV_PER_THREAD` | 48 | Max AIV cores per scheduler thread |
| `PLATFORM_PROF_SYS_CNT_FREQ` | 50 MHz | System counter frequency for profiling |

---

## 3. Shared Memory Layout

The orchestrator and schedulers communicate through a contiguous shared memory region in Global Memory (GM). Each ring level has its own TaskDescriptor and DepListPool sections. See [MULTI_RING.md §4.3–4.4](MULTI_RING.md) for the per-ring shared memory header and handle layout.
```
┌─────────────────────────────┐ offset 0
│ PTO2SharedMemoryHeader      │ (flow control, config, sync flags)
├─────────────────────────────┤ aligned
│ PTO2TaskDescriptor[N]       │ N = task_window_size (default 65536)
├─────────────────────────────┤ aligned
│ PTO2DepListEntry[M+1]       │ M = dep_list_pool_size (entry 0 = NULL sentinel)
└─────────────────────────────┘
```

### 3.1 SharedMemoryHeader Fields

| Field | Writer | Reader | Purpose |
|-------|--------|--------|---------|
| `current_task_index` | Orchestrator | Scheduler | Next task ID to allocate (task ring head) |
| `last_task_alive` | Scheduler | Orchestrator | Oldest still-active task (task ring tail) |
| `heap_top` | Orchestrator | Scheduler | Heap ring allocation pointer |
| `heap_tail` | Scheduler | Orchestrator | Heap ring reclamation pointer |
| `heap_tail_gen` | Scheduler | Scheduler | Ticket counter for serialized `heap_tail` writes |
| `orchestrator_done` | Orchestrator | Scheduler | Signals orchestration completion |
| `task_window_size` | Init | Both | Number of task slots |
| `heap_size` | Init | Both | Heap total size |
| `dep_list_pool_size` | Init | Both | Dependency list pool size |
| `task_descriptors_offset` | Init | Both | Offset to TaskDescriptor array in SM |
| `dep_list_pool_offset` | Init | Both | Offset to DepListPool in SM |
| `total_size` | Init | Both | Total shared memory size |
| `graph_output_ptr` | Orchestrator | Host | Address of final output (packed buffer) |
| `graph_output_size` | Orchestrator | Host | Size of final output in bytes |

### 3.2 Size Calculation

```
total = ALIGN(Header) + ALIGN(window_size * sizeof(TaskDescriptor))
      + ALIGN((dep_pool_size + 1) * sizeof(DepListEntry))
```

Alignment is 64 bytes (`PTO2_ALIGN_SIZE`).

---

## 4. Ring Buffer Mechanisms

> **Multi-ring extension**: All three ring buffers (TaskRing, HeapRing, DepPool) are replicated per scope depth. Each ring level has independent watermarks and reclamation. See [MULTI_RING.md](MULTI_RING.md) for details.

### 4.1 Task Ring

The task ring manages task slot allocation with back-pressure flow control.

**Structure** (`PTO2TaskRing`):

- `descriptors`: pointer to `TaskDescriptor[]` in shared memory
- `window_size`: number of slots (power of 2)
- `current_index`: next task ID to allocate (monotonically increasing)
- `last_alive_ptr`: pointer to `header->last_task_alive`

**Slot mapping**: `slot = task_id & (window_size - 1)`

**Allocation** (`pto2_task_ring_alloc`):

```
active_count = current_index - *last_alive_ptr
if active_count < window_size - 1:
    allocate slot, advance current_index
else:
    spin-wait (back-pressure from scheduler)
```

**Reclamation**: Scheduler threads advance `last_task_alive` via lock-free CAS when the oldest task reaches state CONSUMED (4). This frees slots for reuse.

**Flow control**: When the ring is full, the orchestrator blocks until the scheduler advances `last_task_alive`. With `PTO2_RING_TASK_WINDOW=16` and 208 tasks, slots are recycled ~13 times each.

### 4.2 Heap Ring

The heap ring manages output buffer allocation from a circular GM heap.

**Structure** (`PTO2HeapRing`):

- `base`: GM heap base address
- `size`: total heap size (default 1 GB)
- `top`: allocation pointer (local to the orchestrator)
- `tail_ptr`: pointer to `header->heap_tail` (updated by the scheduler)

**Allocation**: Buffers are allocated contiguously from `top`. When reaching the end, allocation wraps to the beginning if `tail` has advanced far enough. Buffers never straddle the wrap-around boundary.

**Reclamation**: When `last_task_alive` advances past a task, its `packed_buffer_end` is used to advance `heap_tail`, freeing the memory region.

### 4.3 Dependency List Pool

A simple bump allocator for `PTO2DepListEntry` nodes used in fanin/fanout linked lists.
- **Entry 0**: NULL sentinel (`task_id=-1, next_offset=0`)
- **Allocation**: `pool->top++`, wrapping around when full
- **Reclamation**: implicit — old entries become unreachable as `last_task_alive` advances

### 4.4 Flow Control and Back-Pressure

The ring buffer mechanism provides **flow control** between the orchestrator (producer) and the scheduler (consumer). When a ring is exhausted, the orchestrator **blocks** — it cannot submit new tasks or allocate more output memory until the scheduler reclaims slots/space by advancing the watermarks.

**Task Ring back-pressure**: When `active_count = current_index - last_task_alive >= window_size - 1`, `pto2_task_ring_alloc` spin-waits until the scheduler completes tasks and advances `last_task_alive`.

**Heap Ring back-pressure**: When the heap has insufficient contiguous space, `pto2_heap_ring_alloc` spin-waits until the scheduler advances `heap_tail` past completed tasks' output buffers.

**TensorMap pool back-pressure**: When the entry pool is exhausted, `new_entry()` spin-waits on `pto2_orchestrator_sync_tensormap(force=true)` until cleanup frees entries (see Section 5.4).

This back-pressure is essential for correctness with small ring sizes — for example, with `PTO2_RING_TASK_WINDOW=16` and 208 tasks, the orchestrator blocks ~192 times, each time waiting for the scheduler to drain completed tasks before continuing.

### 4.5 Deadlock Detection

A ring that is **too small** can cause a **deadlock**. The root cause is the scope mechanism: each task's `fanout_count` includes a reference from its owning scope. The scope reference is only released when `scope_end()` runs — but `scope_end()` is called by the orchestrator, which is blocked waiting for ring space. This creates a circular dependency:

```
Orchestrator blocked on task_ring_alloc (ring full)
  → needs scheduler to advance last_task_alive
  → needs tasks to reach CONSUMED state (fanout_count == 0)
  → needs scope_end() to release scope reference
  → needs orchestrator to continue
  → DEADLOCK
```

The runtime detects this automatically by counting spin iterations in the allocation functions:

**Periodic BLOCKED warnings** (every 10,000 spins):

```
[TaskRing] BLOCKED (Flow Control): current=208, last_alive=192, active=16/16 (100.0%), spins=10000
[HeapRing] BLOCKED: requesting 4096 bytes, available=0, top=65536, tail=0, spins=10000
```

**Deadlock detection** (after 100,000 spins with no progress):

```
FATAL: Flow Control Deadlock Detected!
Task Ring is FULL and no progress after 100000 spins.
  - Active tasks: 16
  - Window size: 16
Root Cause:
  Tasks cannot transition to CONSUMED state because fanout_count
  includes 1 for the owning scope, and scope_end() requires the
  orchestrator to continue — creating a circular dependency.
Solution:
  Recommended: 32 (at least 2x current active tasks)
```

The FATAL message is logged to the device log and the process exits. The solution is to increase the ring size so that it can hold at least all tasks within the largest parallel scope. For example, if a scope submits 13 tasks, `task_window >= 14` is required (13 + 1 to distinguish full from empty).

**Sizing guideline**: `task_window_size` must be larger than the maximum number of tasks in any single `PTO2_SCOPE`. A safe choice is `2 × max_tasks_per_scope`, or simply the default 65536 for production.

---

## 5. TensorMap and Automatic Dependency Tracking

### 5.1 Purpose

TensorMap maintains a mapping from tensor memory regions to their producer task IDs. When a new task reads a tensor (INPUT/INOUT), TensorMap automatically discovers the producer and establishes a dependency edge.
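The region-to-producer mapping can be illustrated with a deliberately naive sketch — a linear scan standing in for the real hash table, with all names hypothetical:

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Toy model of the TensorMap idea: remember which task produced each
// tensor region, and find producers whose region overlaps a reader's
// region. Illustrative only — the real runtime buckets by base address.
struct Region { uint64_t addr; uint64_t size; int64_t producer; };

struct ToyTensorMap {
    std::vector<Region> regions;

    void insert(uint64_t addr, uint64_t size, int64_t producer) {
        regions.push_back({addr, size, producer});
    }
    // Producers of all regions overlapping the half-open [addr, addr+size).
    std::vector<int64_t> lookup(uint64_t addr, uint64_t size) const {
        std::vector<int64_t> out;
        for (const Region& r : regions)
            if (addr < r.addr + r.size && r.addr < addr + size)
                out.push_back(r.producer);
        return out;
    }
};
```

Every producer returned by `lookup` becomes a dependency edge for the reading task, which is the behavior the sections below refine with hashing, overlap classes, and stale-entry cleanup.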
### 5.2 Hash Table Design

- **Key**: tensor base address (`buffer.addr`)
- **Value**: producer task ID, with overlap detection for sub-regions
- **Overlap**: `COVERED` (new region fully contains old) or `OTHER` (partial overlap)
- Sub-tensors of the same base tensor hash to the same bucket, enabling overlap detection

### 5.3 Entry Pool Management

Unlike the Task Ring and Heap Ring, TensorMap entries are **not** managed by a ring buffer. Instead, a **fixed-size pool + free list** is used:

1. **Free list first**: `free_entry_list[]` stores pointers to released entries. Allocation pops from here (O(1)).
2. **Bump allocation**: if the free list is empty, `entry_pool[next_entry_idx++]` allocates from the end of the pool.
3. **Blocking reclaim**: if the pool is fully exhausted, `pto2_orchestrator_sync_tensormap(force=true)` reads the latest `last_task_alive` and calls `cleanup_retired` to batch-free all entries belonging to retired tasks, returning them to the free list.

This design avoids the complexity of ring-based wrapping while still being bounded by `PTO2_TENSORMAP_POOL_SIZE` (default 65536 entries).

### 5.4 Stale Entry Cleanup: Three-Layer Defense

TensorMap must ensure entries for retired tasks (`producer_task_id < last_task_alive`) are removed, so that:

- The pool does not grow unboundedly (capacity is finite)
- Lookup performance does not degrade as stale entries accumulate in bucket chains

Three complementary mechanisms achieve this:

**Layer 1 — Chain Truncation during Lookup** (lazy, per-bucket):

Since `insert` always prepends to the bucket head, entries in each bucket chain are in **descending task_id order**. When `pto2_tensormap_lookup` encounters the first stale entry (`producer_task_id < last_task_alive`), all subsequent entries in the chain are guaranteed stale too. The entire tail is truncated in one operation using `prev_in_bucket` pointers for O(1) unlinking.

This guarantees lookup only traverses valid entries — O(valid_entries_in_bucket), not O(total_entries).

**Layer 2 — Periodic Batch Cleanup** (`cleanup_retired`, per-task):

Every time the orchestrator submits a task (Step 0 of `pto2_submit_task`), it calls `pto2_orchestrator_sync_tensormap`. When `last_task_alive` has advanced by more than `PTO2_TENSORMAP_CLEANUP_INTERVAL` (default 64) tasks since the last cleanup, `pto2_tensormap_cleanup_retired` runs.

This uses the **per-task entry chain** (`task_entry_head[task_slot]`) — each task's entries are doubly linked together at insert time via `next_in_task`/`prev_in_task`, allowing O(entries_per_task) cleanup without scanning the entire pool or all buckets. Freed entries are returned to `free_entry_list` for immediate reuse.

**Layer 3 — Back-Pressure on Pool Exhaustion** (blocking):

If both the free list and the bump region are depleted, `new_entry()` blocks until `pto2_orchestrator_sync_tensormap(force=true)` frees entries by advancing `last_task_alive` through `cleanup_retired`.

This forms a back-pressure mechanism analogous to the Task Ring's flow control.

**Summary**:

| Layer | Trigger | Method | Guarantees |
|-------|---------|--------|------------|
| Chain Truncation | Every lookup | Truncate stale tail of bucket chain | Lookup only visits valid entries |
| Periodic Cleanup | Every 64 retired tasks | Walk per-task chains, free entries | Pool capacity reclaimed in bounded time |
| Pool Back-Pressure | Pool exhausted | Block until scheduler advances watermark | Hard capacity bound, no OOM |

In steady state, the number of valid TensorMap entries ≈ `active_tasks × avg_outputs_per_task`. With the default `task_window=65536` and `pool_size=65536`, this is well within bounds. With small windows (e.g., `task_window=16`), active entries are even fewer (~16 × a few), and cleanup runs frequently.

### 5.5 Dependency Discovery Flow

When `pto2_submit_task` processes parameters:

1. **INPUT/INOUT**: `pto2_tensormap_lookup` searches for overlapping producers (with chain truncation)
2. For each producer found: `pto2_add_consumer_to_producer` adds the dependency
3. **OUTPUT/INOUT**: `pto2_tensormap_insert` registers the current task as the new producer at the bucket head
4. Stale entries are pruned lazily during lookup (Layer 1) and periodically by cleanup (Layer 2)

---

## 6. Task Descriptor and States

### 6.1 PTO2TaskDescriptor (Hot Path)

| Field | Description |
|-------|-------------|
| `task_id` | Canonical mixed-task ID (64-bit: `ring_id << 32 \| local_id`). See [MULTI_RING.md §3](MULTI_RING.md). |
| `kernel_id[3]` | Per-slot kernel IDs: `[AIC, AIV0, AIV1]`; `INVALID_KERNEL_ID` = inactive |
| `active_mask` | Bitmask of active subtask slots: `bit0=AIC`, `bit1=AIV0`, `bit2=AIV1` |
| `subtask_done_mask` | Atomic bitmask; each subtask sets its done bit on completion |
| `fanin_count` | Number of producer dependencies |
| `fanout_lock` | Per-task spinlock for concurrent fanout modification |
| `fanout_head` | Head of fanout consumer list (pointer, protected by `fanout_lock`) |
| `fanout_count` | 1 (scope ref) + number of consumers |
| `packed_buffer_base` | Start of packed buffer in GM Heap |
| `packed_buffer_end` | End of packed buffer (for heap reclamation) |

### 6.1b PTO2TaskPayload (Cold Path)

| Field | Description |
|-------|-------------|
| `tensors[16]` | Tensor descriptors for parameters |
| `scalar_value[16]` | Scalar parameter values |
| `is_tensor[16]` | Whether each parameter is a tensor or a scalar |
| `param_count` | Number of valid parameters |
| `fanin_slot_states[]` | Producer slot state pointers (used by `on_task_release`) |
| `fanin_actual_count` | Actual fanin count |

### 6.2 Task State Machine

```
 [0] PENDING ──fanin satisfied──► [1] READY ──dispatch──► [2] RUNNING
      ▲                                                        │
      │                                                        ▼
 slot recycled ◄── [4] CONSUMED ◄──fanout done── [3] COMPLETED
```

In the scheduler's `task_state[]` array (`std::atomic`):

- **0 (PENDING)**: waiting for dependencies (`fanin_refcount < fanin_count`)
- **1 (READY)**: all dependencies satisfied, waiting in the ready queue
- **2 (RUNNING)**: currently executing on a worker
- **3 (COMPLETED)**: hardware execution complete, output may still be in use
- **4 (CONSUMED)**: output fully consumed, buffers can be released

---

## 7. Orchestrator

### 7.1 PTO2OrchestratorState

The orchestrator runs on AICPU Thread 3 and builds the task graph by calling the user-provided orchestration function.

Key members:

- `rings[PTO2_MAX_RING_DEPTH]`: per-ring `PTO2RingSet` (HeapRing + TaskRing + DepPool). See [MULTI_RING.md §4.2](MULTI_RING.md).
- `tensor_map`, `tensor_pool`: dependency tracking
- `scope_tasks[]`, `scope_begins[]`, `scope_stack_top`: scope nesting stack (flat buffer partitioned by level)
- `scheduler`: pointer to scheduler state (for simulated mode or `init_task_on_submit`)
- `gm_heap_base`, `gm_heap_size`: GM heap for output buffers

### 7.2 Task Submission Flow (`pto2_submit_task`)

| Step | Operation |
|------|-----------|
| 0 | `pto2_orchestrator_sync_tensormap` — prune stale TensorMap entries |
| 1 | `pto2_task_ring_alloc` — allocate a task slot (may block on flow control) |
| 2 | Initialize the task descriptor, copy parameters |
| 3 | **Lookup**: for each INPUT/INOUT param, search TensorMap for producers |
| 4 | **Dependency**: `pto2_add_consumer_to_producer` for each producer found |
| 5 | **Heap alloc**: `pto2_alloc_packed_buffer` for OUTPUT args (addr=0) |
| 6 | **Insert**: register OUTPUT/INOUT args in TensorMap |
| 7 | **Fanin**: finalize `fanin_count`; if `init_task_on_submit`, call the scheduler's `init_task` |
| 8 | **Publish**: `STORE_RELEASE(current_task_index)` makes the task visible to scanners |

### 7.3 Lock Protocol for Concurrent Dependency Setup

The orchestrator and scheduler run concurrently. When adding a consumer to a producer's fanout list:

1. **Orchestrator acquires** the producer's `fanout_lock` via `pto2_fanout_lock(task)` (CAS spin-lock)
2. **Normal path**: prepend the consumer to the producer's fanout list, increment `fanout_count`
3. **Release** `fanout_lock`

The scheduler's completion handler mirrors this:

1. Mark `task_state[slot] = COMPLETED`
2. **Acquire** `fanout_lock`, read `fanout_head`, **release** the lock
3. Traverse the fanout list, incrementing each consumer's `fanin_refcount`
4. Mark `task_state[slot] = CONSUMED` when `fanout_refcount` reaches `fanout_count`

This lock protocol guarantees every consumer is accounted for exactly once.

### 7.4 Scope Mechanism (`PTO2_SCOPE`)

Scopes control the lifetime of intermediate buffers. Each scope:

- Tracks tasks submitted within it via a flat `scope_tasks[]` buffer partitioned by `scope_begins[]`
- On `scope_end`: increments `fanout_refcount` for scope tasks; when it reaches `fanout_count`, the task's packed buffer can be reclaimed

```cpp
PTO2_SCOPE(rt) {
    // Tasks submitted here belong to this scope
    pto2_rt_submit_aic_task(FUNC_QK, args);
    pto2_rt_submit_aiv_task(FUNC_SF, args);
}
// scope_end: the scope reference is released from all tasks above
```

---

## 8. Scheduler

### 8.1 Thread Model

With `aicpu_thread_num=4`, the AICPU runs 4 threads:

| Thread | Role | Cores |
|--------|------|-------|
| 0 | Scheduler | 6 AIC + ~13 AIV |
| 1 | Scheduler | 6 AIC + ~13 AIV |
| 2 | Scheduler | 6 AIC + ~13 AIV |
| 3 | Orchestrator | none |

Core assignment: AICs and AIVs are divided equally among the 3 scheduler threads.
- -### 8.2 Scheduler Main Loop - -Each scheduler thread runs a tight loop with two main phases: - -**Phase 1 — Completion Handling**: -- Poll register `COND` on each managed core -- When `TASK_FIN_STATE` detected: record completion timestamps, call `on_subtask_complete(task_id, subslot)` to set the done bit; when `subtask_done_mask == active_mask`, trigger `on_mixed_task_complete(task_id)` which marks `task_state[slot] = COMPLETED`, acquires fanout lock, traverses fanout list (incrementing consumers' `fanin_refcount`), marks `task_state[slot] = CONSUMED`, and advances `last_task_alive` watermark - -**Phase 2 — Dispatch**: -- For each idle core: pop a task from the matching shape-based ready queue (lock-free MPMC Vyukov queue, one per resource shape) -- Build `PTO2DispatchPayload` from `TaskDescriptor` with `task_id`, `subslot`, `kernel_id`, and `core_type` -- Write task pointer to `Handshake.task`, signal AICore via register `DATA_MAIN_BASE` - -After these phases, the scheduler updates profiling headers and checks for termination (all tasks completed and orchestrator done). 
- -### 8.3 Ready Queue Design - -Ready queues use a lock-free bounded MPMC (Vyukov) design: - -- One `PTO2ReadyQueue` per resource shape (5 shapes: `AIC_ONLY`, `AIV_X1`, `AIV_X2`, `AIC_AIV_X1`, `AIC_AIV_X2`) -- **Push**: any thread (orchestrator via `init_task`, or scheduler on completion) pushes newly-ready tasks to the queue matching `pto2_active_mask_to_shape(task->active_mask)` -- **Pop**: scheduler threads pop from the queue matching the idle core's resource shape -- Per-slot sequence counters prevent ABA problems -- `enqueue_pos` and `dequeue_pos` are on separate cache lines to avoid false sharing - -### 8.4 Watermark Advancement (last_task_alive) - -After a task reaches state CONSUMED (4), the scheduler tries to advance `last_task_alive`: - -``` -while la < current_task_index: - if task_state[la & mask] < CONSUMED: break - reset fanin_refcount[la & mask] = 0 - CAS(last_task_alive, la, la+1) - advance heap_tail from task's packed_buffer_end - la++ -``` - -This is lock-free (CAS-based) and multiple scheduler threads can attempt it concurrently. The `heap_tail_gen` ticket counter serializes `heap_tail` writes to ensure tasks' buffer regions are freed in order. - ---- - -## 9. AICore Worker Interaction - -### 9.1 Handshake Protocol - -Each AICore worker has a `Handshake` struct in shared memory: - -| Field | Direction | Purpose | -|-------|-----------|---------| -| `task` | AICPU→AICore | Pointer to `PTO2DispatchPayload` | -| `control` | AICPU→AICore | 0=normal, 1=shutdown | -| `perf_records_addr` | AICPU→AICore | Performance buffer address | - -### 9.2 Register-Based Dispatch - -Instead of polling `Handshake.task_status`, the production protocol uses hardware registers. - -> **Multi-ring note**: `task_id` is 64-bit but registers are 32-bit. A per-core monotonic dispatch counter (`s_dispatch_seq`) replaces `task_id` in register writes to prevent collisions. See [MULTI_RING.md §6](MULTI_RING.md). 
- -| Register | Direction | Usage | -|----------|-----------|-------| -| `DATA_MAIN_BASE` | AICPU→AICore | Write `task_id` to dispatch (idle=0x7FFFFFFD); `EXIT_SIGNAL` to shut down | -| `COND` | AICore→AICPU | `[bit31=state, bits30:0=task_id]`: ACK (state=0) or FIN (state=1) | - -**AICore execution loop**: -1. Poll `DATA_MAIN_BASE` for value != AICPU_IDLE_TASK_ID -2. Read payload from `Handshake.task` -3. Write ACK to `COND` -4. Execute kernel function via `func_id_to_addr` lookup -5. Write FIN to `COND` - -### 9.3 PTO2DispatchPayload - -Built by the scheduler from `PTO2TaskDescriptor`: - -| Field | Description | -|-------|-------------| -| `task_id` | Mixed-task identifier (for completion aggregation) | -| `subslot` | Which subtask slot this dispatch represents (`AIC`, `AIV0`, or `AIV1`) | -| `kernel_id` | Function ID for this subtask slot | -| `core_type` | AIC or AIV | -| `function_bin_addr` | GM address of compiled kernel binary | -| `num_args` | Number of arguments | -| `args[]` | Tensor addresses and scalar values | - ---- - -## 10. Kernel and Orchestration Loading - -### 10.1 Kernel Binary Loading - -1. **Host** compiles each kernel source (`.cpp`) into a binary (`.o` or `.so`) -2. `host_api.upload_kernel_binary(func_id, binary, size)` uploads to GM -3. The returned GM address is stored in `Runtime.func_id_to_addr_[func_id]` -4. When dispatching, the scheduler copies this address into `PTO2DispatchPayload.function_bin_addr` - -### 10.2 Orchestration SO Loading - -1. **Host** compiles the orchestration source into a shared library (`.so`) -2. The SO binary is embedded into `Runtime.device_orch_so_storage_[]` and copied to device -3. **AICPU Thread 3** writes the SO to a temp file, calls `dlopen` -4. `dlsym("aicpu_orchestration_config")` returns configuration (expected arg count) -5. `dlsym("aicpu_orchestration_entry")` returns the orchestration function pointer -6. Thread 3 creates a `PTO2Runtime`, calls the orchestration function within a `PTO2_SCOPE` -7. 
After orchestration completes: `dlclose`, delete temp file - -### 10.3 Thread Startup Synchronization - -| Flag | Set by | Waited by | Purpose | -|------|--------|-----------|---------| -| `runtime_init_ready_` | Thread 3 | Threads 0-2 | Runtime and SM handle initialized | -| `pto2_init_done_` | First init thread | Others | One-time memset of arrays started (exchange guard) | -| `pto2_init_complete_` | Init thread | Thread 3 + others | One-time init of per-task arrays done | - -Startup sequence: -1. Thread 3: create SM handle + runtime → set `runtime_init_ready_` -2. Scheduler threads: wait for `runtime_init_ready_` → one thread wins `pto2_init_done_` exchange → memset per-task arrays → set `pto2_init_complete_`; other threads wait for `pto2_init_complete_` -3. Thread 3: wait for `pto2_init_complete_` → configure orchestrator-scheduler pointers -4. Scheduler threads: enter main loop -5. Thread 3: call orchestration function → set `orchestrator_done_` - ---- - -## 11. PTO2 Orchestration API - -The orchestration API is defined in `pto_orchestration_api.h`. Orchestration code depends only on this header. - -### 11.1 Core API - -| Function/Macro | Purpose | -|----------------|---------| -| `pto2_rt_submit_task(mixed_kernels, args)` | Submit a mixed task with `MixedKernels` struct | -| `pto2_rt_submit_aic_task(kernel_id, args)` | Convenience: submit AIC-only task | -| `pto2_rt_submit_aiv_task(kernel_id, args)` | Convenience: submit AIV-only task | -| `PTO2_SCOPE() { ... 
}` | RAII scope for buffer lifetime | -| `pto2_rt_orchestration_done()` | Signal orchestration complete | - -### 11.2 Parameter Construction - -| Function | Description | -|----------|-------------| -| `make_tensor_external(ptr, shapes, ndim, dtype)` | Wrap an existing device pointer as a tensor | -| `TensorCreateInfo(shapes, ndim, dtype)` | Describe a runtime-created output buffer | -| `Arg::add_input(tensor)` | INPUT parameter — read by the task | -| `Arg::add_output(create_info)` | OUTPUT parameter — runtime allocates and returns a Tensor | -| `Arg::add_inout(tensor)` | INOUT parameter — existing tensor read then written | -| `Arg::add_scalar(value)` | 64-bit scalar parameter | - -### 11.3 Resource Shapes - -Tasks are queued by resource shape, which is derived from the `active_mask` in the `MixedKernels` struct: - -| Shape | Active Mask | Description | -|-------|-------------|-------------| -| `AIC_ONLY` | AIC only | AIC cores (matrix multiplication) | -| `AIV_X1` | AIV0 or AIV1 only | Single AIV core (vector operations) | -| `AIV_X2` | AIV0 + AIV1 | Two AIV cores | -| `AIC_AIV_X1` | AIC + one AIV | AIC + single AIV core | -| `AIC_AIV_X2` | AIC + AIV0 + AIV1 | Full cluster (AIC + two AIV cores) | - -### 11.4 Orchestration Export Interface - -Each orchestration `.so` must export: - -```cpp -extern "C" PTO2OrchestrationConfig aicpu_orchestration_config(uint64_t* args, int arg_count); -extern "C" void aicpu_orchestration_entry(uint64_t* args, int arg_count, int orch_thread_num, int orch_thread_index); -``` - ---- - -## 12. 
Example: Batch Paged Attention - -### 12.1 Kernel Configuration (`kernel_config.py`) - -```python -KERNELS = [ - {"func_id": 0, "name": "QK", "source": "aic/aic_qk_matmul.cpp", "core_type": "aic"}, - {"func_id": 1, "name": "SF", "source": "aiv/aiv_softmax_prepare.cpp", "core_type": "aiv"}, - {"func_id": 2, "name": "PV", "source": "aic/aic_pv_matmul.cpp", "core_type": "aic"}, - {"func_id": 3, "name": "UP", "source": "aiv/aiv_online_update.cpp", "core_type": "aiv"}, - {"func_id": 5, "name": "AIV_HUB", "source": "aiv/aiv_hub.cpp", "core_type": "aiv"}, -] - -ORCHESTRATION = { - "source": "orchestration/paged_attention_orch.cpp", - "function_name": "aicpu_orchestration_entry", -} - -RUNTIME_CONFIG = { - "runtime": "tensormap_and_ringbuffer", - "aicpu_thread_num": 4, - "block_dim": 24, -} -``` - -### 12.2 Orchestration Structure - -```cpp -void aicpu_orchestration_entry(uint64_t* args, int arg_count, int orch_thread_num, int orch_thread_index) { - // Unpack args: query, key_cache, value_cache, block_table, context_lens, out, config - for (q_idx = 0; q_idx < q_loop; q_idx++) { - for (batch_start = 0; batch_start < batch; batch_start += IN_CORE_BATCH) { - PTO2_SCOPE() { - // Describe accumulator tensors (oi, li, mi) with TensorCreateInfo - // Submit AIV_HUB to initialize accumulators - for (bn = 0; bn < max_bn; bn++) { - // Allocate intermediate tensors (sij, pij, mij, lij, oi_new) - // Submit QK (CUBE) → SF (VECTOR) → PV (CUBE) → UP (VECTOR) - } - } - } - } -} -``` diff --git a/src/a2a3/runtime/tensormap_and_ringbuffer_unmodified/docs/SCALAR_DATA_ACCESS.md b/src/a2a3/runtime/tensormap_and_ringbuffer_unmodified/docs/SCALAR_DATA_ACCESS.md deleted file mode 100644 index 98a4893ea..000000000 --- a/src/a2a3/runtime/tensormap_and_ringbuffer_unmodified/docs/SCALAR_DATA_ACCESS.md +++ /dev/null @@ -1,137 +0,0 @@ -# Scalar Data Access — get/set_tensor_data Design - -## 1. 
Overview - -During task graph construction, orchestration sometimes needs to read InCore kernel results (for control-flow decisions) or write initial values into tensors. `get_tensor_data` / `set_tensor_data` provide **blocking** cross-layer data access, allowing orchestration to safely read and write tensor data. - -**Core design principle**: Reuse the existing TensorMap dependency tracking mechanism — no new synchronization infrastructure. - -## 2. API - -```cpp -// Blocking read: returns value at the given indices (default: raw uint64_t bits) -// Specify T for typed read: float val = get_tensor_data<float>(tensor, 1, idx); -template <typename T = uint64_t> -T get_tensor_data(const Tensor& tensor, uint32_t ndims, const uint32_t indices[]); - -// Blocking write: stores value at the given indices (type deduced from argument) -// Typed write: set_tensor_data(tensor, 1, idx, 42.0f); -template <typename T> -void set_tensor_data(Tensor& tensor, uint32_t ndims, const uint32_t indices[], T value); -``` - -Both call into the runtime through the ops table — orchestration .so needs no runtime symbol linkage. - -## 3. Blocking Interface Design - -### 3.1 get_tensor_data Flow - -```text -addr null-check → TensorMap lookup → spin-wait producer COMPLETED → compute flat offset → memcpy read -``` - -- **addr null-check**: `buffer.addr == 0` means unallocated — log error, return 0 -- **TensorMap lookup**: find producer task by `buffer.addr` -- **spin-wait**: wait until producer `task_state >= PTO2_TASK_COMPLETED` -- **No producer** (`lookup_result.count == 0`): skip waiting, read immediately - -### 3.2 set_tensor_data Flow - -```text -addr null-check → TensorMap lookup → spin-wait producer COMPLETED → spin-wait consumers done → memcpy write -``` - -One extra step versus get_tensor_data: wait for all consumers to finish (`fanout_refcount >= fanout_count - 1`, excluding the scope reference). 
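The §3.1 flow can be modeled in a few lines. Here the producer handle stands in for the TensorMap lookup result, the flat-offset computation is collapsed to offset 0, and the timeout check from §3.3 is omitted; all names and state values are illustrative, not the runtime's:

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <cstring>

// Sketch of the blocking-read flow: null-check, spin-wait on the
// producer's state, then a plain memcpy read. Illustrative names only.
enum { TOY_COMPLETED = 3 };

struct ToyProducer { std::atomic<int> state{0}; };

uint64_t toy_get_tensor_data(uint64_t addr, ToyProducer* producer) {
    if (addr == 0) return 0;         // unallocated buffer: log error, return 0
    if (producer != nullptr)         // no producer entry -> read immediately
        while (producer->state.load(std::memory_order_acquire) < TOY_COMPLETED) {
            // spin-wait until the producer task reaches COMPLETED
        }
    uint64_t value = 0;              // "compute flat offset" collapsed to offset 0
    std::memcpy(&value, reinterpret_cast<const void*>(static_cast<uintptr_t>(addr)),
                sizeof(value));
    return value;
}
```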
- -### 3.3 Timeout - -- Uses cycle counter (`get_sys_cnt_aicpu()`), checked every 1024 spins -- Threshold: `PTO2_TENSOR_DATA_TIMEOUT_CYCLES` (~10 s at 1.5 GHz) -- On timeout: sets `orch.fatal = true`, preventing further task submission - -## 4. add_output with Initial Value - -```cpp -TensorCreateInfo ci(shapes, ndims, dtype); -ci.set_initial_value(initial_value); -args.add_output(ci); -``` - -**Mechanism**: - -1. `ci.set_initial_value(value)` marks the create-info with an initial value before submission -2. `add_output(ci)` stores a pointer to `ci` in `Arg` (the original must remain valid until submit) -3. During payload init, the output tensor is materialized via `init_from_create_info()` which triggers the fill -4. Fill strategy: - - Small buffer (< 64 B): element-by-element memcpy directly into dst - - Large buffer (≥ 64 B): fill the first 64 bytes as a template block, then bulk-memcpy in 64 B chunks; partial tail copy for remainder - -**Constraint**: existing tensors are write targets only through `add_inout()`. - -## 5. Scalar Dependencies via 1-Element Tensors - -Traditional scalars (`Arg::add_scalar`) are one-way inputs with no TensorMap tracking. For cross-task scalar values, use a 1-element tensor as the carrier: - -```cpp -uint32_t shapes[1] = {1}; -TensorCreateInfo scalar_ci(shapes, 1, DataType::FLOAT32); - -// Submit with initial value and keep the returned tensor -scalar_ci.set_initial_value(float_to_u64(77.0f)); -Arg args; -args.add_output(scalar_ci); -TaskOutputTensors outs = pto2_rt_submit_aiv_task(FUNC_NOOP, args); -const Tensor& scalar_tensor = outs.get_ref(0); - -// Orchestration-side blocking read (waits for kernel completion) -uint32_t idx[1] = {0}; -float val = get_tensor_data(scalar_tensor, 1, idx); -``` - -**Advantage**: Fully reuses existing TensorMap (producer tracking, fanin/fanout dependencies) — no new infrastructure needed. - -## 6. 
Data Hazard Analysis - -Three actors: - -- **Kernel**: InCore task submitted via add_input/add_output/add_inout (asynchronous execution) -- **Orch Read**: orchestration calls `get_tensor_data` (blocking read) -- **Orch Write**: orchestration calls `set_tensor_data` (blocking write) - -### Hazard Matrix (earlier operation → later operation) - -| # | Earlier Op | Later Op | Hazard | Guarantee | Safe? | -| - | ---------- | -------- | ------ | --------- | ----- | -| 1 | Kernel write (OUTPUT) | Orch Read | RAW | spin-wait producer COMPLETED | Yes | -| 2 | Kernel write (OUTPUT) | Orch Write | WAW | spin-wait producer COMPLETED | Yes | -| 3 | Kernel read (INPUT) | Orch Write | WAR | spin-wait fanout_refcount | **Needs INOUT** | -| 4 | Kernel read-write (INOUT) | Orch Read | RAW | spin-wait producer COMPLETED | Yes | -| 5 | Kernel read-write (INOUT) | Orch Write | WAW+WAR | spin-wait producer + consumers | Yes | -| 6 | Orch Write | Kernel read (INPUT) | RAW | blocking completes before next submit | Yes | -| 7 | Orch Write | Kernel write (OUTPUT) | WAW | same — serial guarantee | Yes | -| 8 | Orch Read | Kernel write (OUTPUT) | WAR | same — serial guarantee | Yes | -| 9–12 | Orch ↔ Orch | — | — | same-thread serial execution | Yes | - -### Key Design Points - -**Scenario #3 is the only case requiring special attention**: - -TensorMap tracks only producers (OUTPUT/INOUT), not pure INPUT consumers. If a tensor is only registered via `add_input()`, TensorMap has no producer entry for it. `set_tensor_data`'s `wait_for_tensor_ready()` sees `lookup_result.count == 0` and returns immediately — but the kernel may still be reading → **WAR data race**. - -**Solution**: For tensors that may later be written via `set_tensor_data`, use `add_inout()` instead of `add_input()`. INOUT registers a producer entry in TensorMap, enabling `set_tensor_data` to track all consumers through `fanout_refcount`. 
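The consumer-wait predicate behind this fix is small enough to show directly. In this hedged model (names invented), a missing producer entry is exactly the dangerous case: the check trivially passes even while a kernel may still be reading, which is why INPUT-only tensors must be upgraded to INOUT before `set_tensor_data`:

```cpp
#include <atomic>
#include <cassert>

// Hedged model of set_tensor_data's consumer wait (invented names). An
// INOUT-registered producer entry exposes fanout counts; an INPUT-only
// tensor has no entry at all, so the check degenerates to "safe" — the
// WAR race of scenario #3.
struct ToyEntry {
    std::atomic<int> fanout_refcount{0};
    int fanout_count = 0;
};

bool toy_safe_to_write(const ToyEntry* entry) {
    if (entry == nullptr) return true;  // no producer entry: unchecked -> WAR risk
    return entry->fanout_refcount.load(std::memory_order_acquire)
           >= entry->fanout_count - 1;  // all consumers done, excluding scope ref
}
```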
- -**Scenarios #6–8 serial guarantee**: - -get/set_tensor_data are blocking calls, and orchestration is single-threaded serial submission. After a blocking operation completes, subsequent code (including task submissions) executes strictly afterward. - -## 7. External Tensor Behavior - -`make_tensor_external()` creates tensors with a pre-set `buffer.addr` (pointing to host-allocated device memory). - -| Scenario | Behavior | -| -------- | -------- | -| External tensor never submitted as OUTPUT/INOUT | No TensorMap entry — get/set execute immediately | -| External tensor previously submitted as OUTPUT/INOUT | TensorMap has producer entry — get/set spin-wait | -| External tensor submitted as INPUT, then set_tensor_data | **WAR risk** — must use INOUT instead (same as scenario #3) | - -**Key rule**: If an external tensor will later be written via `set_tensor_data`, all prior kernel accesses must use `add_inout()`, not `add_input()`. diff --git a/src/a2a3/runtime/tensormap_and_ringbuffer_unmodified/docs/SUBMIT_BY_CLUSTER.md b/src/a2a3/runtime/tensormap_and_ringbuffer_unmodified/docs/SUBMIT_BY_CLUSTER.md deleted file mode 100644 index 54e0f4196..000000000 --- a/src/a2a3/runtime/tensormap_and_ringbuffer_unmodified/docs/SUBMIT_BY_CLUSTER.md +++ /dev/null @@ -1,226 +0,0 @@ -# Submit by Cluster - Requirements and Main-Branch-Aligned Design - -## 1. Goal - -Define a single, main-branch-aligned specification for PTO2 cluster submission that combines: - -1. Product requirements (what must be true). -2. Runtime design (how it is implemented on current main baseline). - -The target model is: one submitted graph node is one `MixedTask`, and dispatch/completion is mixed-task-granular. - -## 2. Background and Motivation - -Future Ascend hardware is expected to provide stronger locality within an AICore cluster (`1 AIC + 2 AIV`). -The runtime therefore needs a "submit together, run together" model for related AIC/AIV kernels. 
- -Legacy per-task submit (`kernel_id + worker_type`) cannot express atomic co-dispatch of multiple kernels to one cluster. - -## 3. Scope - -### In Scope - -1. New orchestration-facing submit API for cluster-aware mixed submission. -2. Runtime/backend scheduler and executor changes to treat a mixed submit as one atomic scheduling unit. -3. Dependency gating, readiness, dispatch, completion, and reclamation at mixed-task granularity. -4. AIV slot equivalence (`AIV0` and `AIV1` are equivalent execution targets). - -### Out of Scope - -1. User-facing cluster pinning (`allocate_cluster/free_cluster`-style APIs). -2. New worker types beyond AIC/AIV. -3. Cross-cluster user placement policies. -4. Hardware topology changes beyond `1 AIC + 2 AIV` per cluster. - -## 4. Main-Branch Baseline Constraints - -Design must preserve the current main runtime architecture: - -1. Multi-orchestrator runtime wiring (`orchestrators[]`, `orch_count`, thread-local `pto2_current_orch_idx`). -2. Executor threading split (orchestrator threads vs scheduler threads), and post-orchestrator transition (`transition_requested_` + `reassign_cores_for_all_threads()`). -3. Shared-memory hot/cold split (`PTO2TaskDescriptor` hot + `PTO2TaskPayload` cold). - -## 5. Terminology - -1. `cluster`: one physical unit with `1 AIC + 2 AIV`. -2. `MixedKernels`: 3 submit slots (`AIC`, `AIV0`, `AIV1`) with `INVALID_KERNEL_ID` for inactive slots. -3. `MixedTask`: one runtime graph node created by one submit call. -4. `active_mask`: bitmask of active subtask slots. -5. `resource shape`: normalized lane demand class of a mixed task. - -## 6. 
API Contract - -```cpp -inline constexpr int32_t INVALID_KERNEL_ID = -1; - -struct MixedKernels { - int32_t aic_kernel_id{INVALID_KERNEL_ID}; - int32_t aiv0_kernel_id{INVALID_KERNEL_ID}; - int32_t aiv1_kernel_id{INVALID_KERNEL_ID}; -}; - -static inline void pto2_rt_submit_task(PTO2Runtime* rt, - const MixedKernels& mixed_kernels, - Arg* args, - int32_t num_args); - -static inline void pto2_rt_submit_aic_task(PTO2Runtime* rt, - int32_t kernel_id, - Arg* args, - int32_t num_args); - -static inline void pto2_rt_submit_aiv_task(PTO2Runtime* rt, - int32_t kernel_id, - Arg* args, - int32_t num_args); -``` - -Rules: - -1. One submit call creates one `MixedTask`. -2. All active slots share the same `args` and `num_args`. -3. At least one slot must be active. -4. `aiv0_kernel_id` and `aiv1_kernel_id` are semantically equivalent. -5. Wrappers are orchestration sugar only (inline in orchestration API); no dedicated runtime ops entries. -6. Submit-contract types are defined once in a shared header-only submit-types surface consumed by orchestration and runtime headers. -7. Invalid submits follow existing PTO2 behavior (`always_assert`), not a new recoverable return-code API. - -## 7. Data Model (Requirements + Design) - -`PTO2TaskDescriptor` (hot path) carries mixed-task identity/state: - -1. `task_id` -2. `active_mask` -3. `subtask_done_mask` -4. `kernel_id[3]` for `(AIC, AIV0, AIV1)` -5. dependency heads/counters and packed-buffer metadata - -`PTO2TaskPayload` (cold path) carries: - -1. shared args/tensors/scalars copied once per mixed submit -2. fanin mixed-task IDs -3. other cold-path submit metadata - -Producer identity in TensorMap is mixed-task ID end-to-end. - -## 8. Scheduling Model - -### 8.1 Resource Shapes - -Runtime uses shape-based ready queues (not worker-type queues): - -1. `AIC_ONLY` -2. `AIV_X1` -3. `AIV_X2` -4. `AIC_AIV_X1` -5. `AIC_AIV_X2` - -Queueing key is normalized resource shape (not raw slot label). - -### 8.2 Atomic Cluster Dispatch - -1. 
Dispatch decision unit is one mixed task. -2. For multi-slot mixed tasks, partial launch is forbidden. -3. A mixed task is dispatchable only when one local owned cluster can satisfy all required lanes. -4. Compatible mixed tasks may co-reside over time if they use disjoint free lanes. - -### 8.3 Dependency and Completion - -1. Fanin release/readiness remains dependency-correct and graph-level. -2. Two-stage completion: - - `on_subtask_complete(task_id, subslot)` - - `on_mixed_task_complete(task_id)` only when `subtask_done_mask == active_mask` -3. Downstream release is triggered once per mixed task completion, not once per subslot. - -## 9. Executor Ownership and Numbering - -### 9.1 Canonical Flattened Numbering (Unchanged) - -Given `block_dim` clusters: - -1. AIC IDs: `[0, block_dim)` -2. AIV IDs: `[block_dim, 3 * block_dim)` -3. Cluster `i`: `{i, block_dim + i, 2 * block_dim + i}` - -This project-defined flattened numbering is kept unchanged. - -### 9.2 Cluster Ownership - -1. One cluster must be owned by one scheduler domain/thread at a time. -2. No split-cluster ownership in either: - - initial `assign_cores_to_threads()` - - post-orchestrator `reassign_cores_for_all_threads()` -3. Lane occupancy bookkeeping must remain consistent with ownership after reassignment. - -## 10. Functional Requirements - -### 10.1 Valid Mixed Shapes - -1. AIC only -2. AIV only (1 or 2 AIV lanes) -3. AIC + 1 AIV -4. AIC + 2 AIV - -### 10.2 Runtime Behavior per Submit - -1. Validate submit arguments. -2. Allocate mixed-task ID and initialize descriptor/payload once. -3. Build fanin/fanout at mixed-task granularity. -4. Enqueue by shape when ready. -5. Dispatch all active lanes atomically when resources allow. -6. Aggregate completion and release downstream once. - -## 11. Non-Functional Requirements - -1. Correctness: no dependency violation, no partial mixed-task dispatch. -2. 
Determinism: dependency-correct ordering preserved; AIV lane choice may vary but remains semantically equivalent. -3. Fairness: resource-aware polling heuristic is allowed; strict starvation-free guarantee across all shapes is not required. -4. Performance: no obvious regression for non-cluster workflows. -5. Observability: lifecycle visibility for submit/ready/dispatch/block/complete. - -## 12. Acceptance Criteria - -Feature is accepted when: - -1. Orchestration compiles and submits via `MixedKernels` API/wrappers. -2. Scheduler dispatches each mixed task as one cluster scheduling decision. -3. Dependencies gate mixed-task readiness correctly. -4. AIV execution remains cluster-local and semantically equivalent across lanes. -5. Existing non-cluster workflows continue to pass without behavior regression. -6. Cluster ownership is never split across scheduler domains before/after transition. - -## 13. Verification Matrix - -Recommended validation coverage: - -1. Mapping correctness for cluster-to-core ID relation. -2. Atomic dispatch for multi-slot shapes. -3. Dependency gating and completion aggregation (`done_mask == active_mask`). -4. Lane-occupancy co-residency behavior for compatible shapes. -5. Multi-orchestrator and core-transition ownership stability. -6. Invalid submit handling (`always_assert` path). -7. Regression coverage for existing examples/tests. - -Milestone command (device): - -```bash -python examples/scripts/run_example.py \ - -k tests/st/tensormap_and_ringbuffer/batch_paged_attention/kernels \ - -g tests/st/tensormap_and_ringbuffer/batch_paged_attention/golden.py \ - -p a2a3 -d 9 -``` - -Final validation: - -```bash -./ci.sh -``` - -## 14. Resolved Decisions - -1. Legacy orchestration-facing single-task submit is replaced by mixed submit contract. -2. Invalid mixed submits fail with existing submit-time assert behavior. -3. Per-cluster concurrent capacity is lane-occupancy-driven, not a fixed constant. -4. 
Submit-contract types live in one shared header-only surface. -5. Resource-aware dispatch heuristics are allowed without a strict starvation-free guarantee. - diff --git a/src/a2a3/runtime/tensormap_and_ringbuffer_unmodified/docs/device_log_profiling.md b/src/a2a3/runtime/tensormap_and_ringbuffer_unmodified/docs/device_log_profiling.md deleted file mode 100644 index 010e6c682..000000000 --- a/src/a2a3/runtime/tensormap_and_ringbuffer_unmodified/docs/device_log_profiling.md +++ /dev/null @@ -1,167 +0,0 @@ -# PTO2 Device Log Profiling Guide - -## How to Find Device Logs - -AICPU logs (via `DEV_ALWAYS`) are written by CANN's **dlog** subsystem and do **not** appear in the `run_example.py` terminal output. They are written to CANN's device log directory: - -```text -$HOME/ascend/log/debug/device-<device_id>/device-<device_id>_<timestamp>.log -``` - -Each run produces a new log file (or appends to an existing one). Find the most recent file by modification time: - -```bash -ls -lt $HOME/ascend/log/debug/device-<device_id>/ | head -5 -``` - -## Log Structure Overview - -A single run produces two profiling blocks in the device log: - -| Block | Emitted by | Function | Content | -| ----- | ---------- | -------- | ------- | -| **Orchestrator Profiling** | Thread 3 (orchestrator) | `aicpu_orchestration_entry` | Time breakdown of graph construction on device | -| **PTO2 Scheduler Summary** | Threads 0/1/2 (schedulers) | `resolve_and_dispatch_pto2` | Per-thread scheduling statistics, phase timing, and lock contention | - -All timing values are in microseconds (us), converted from AICPU cycle counters. - ---- - -## Block 1: Orchestrator Profiling - -Thread 3 loads the orchestration `.so` via `dlopen`, calls `aicpu_orchestration_entry`, and prints a profiling summary after it returns. 
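The microsecond figures in both blocks are derived from raw cycle deltas. A conversion along these lines is all that is involved; the 100 MHz counter frequency below is an assumption for illustration, not the device's documented rate:

```cpp
#include <cassert>
#include <cstdint>

// Sketch of the cycle-to-microsecond conversion behind the profiling
// output. kToyCounterFreqHz is an assumed 100 MHz rate for illustration
// only; the real conversion uses the AICPU counter's actual frequency.
constexpr uint64_t kToyCounterFreqHz = 100ULL * 1000 * 1000;

double toy_cycles_to_us(uint64_t cycles) {
    return static_cast<double>(cycles) * 1e6 / static_cast<double>(kToyCounterFreqHz);
}
```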
- -### Example (from a real run: batch=64, 16704 tasks) - -```text -Thread 3: Calling aicpu_orchestration_entry from SO -aicpu_orchestration_entry ">>>>>> batch = 64" -Thread 3: aicpu_orchestration_entry returned, cost 20943.940us -Thread 3: === Orchestrator Profiling: 16704 tasks, total=14601.580us === -Thread 3: sync_tensormap : 286.300us (2.0%) -Thread 3: task_ring_alloc: 380.400us (2.6%) -Thread 3: param_copy : 2147.800us (14.7%) -Thread 3: lookup+dep : 7290.300us (49.9%) -Thread 3: heap_alloc : 701.500us (4.8%) -Thread 3: tensormap_ins : 1890.380us (12.9%) -Thread 3: fanin+ready : 1207.400us (8.3%) -Thread 3: finalize+SM : 697.500us (4.8%) -Thread 3: scope_end : 364.080us -Thread 3: avg/task : 0.874us -Thread 3: PTO2 total submitted tasks = 16704 -``` - -### Field Reference - -| Field | Source (`pto_orchestrator.cpp`) | Description | -| ----- | ------------------------------- | ----------- | -| **cost** | Wall-clock around `orch_func()` call | Total time including orchestration logic + scope overhead | -| **total** | Sum of all sub-steps below | Accumulated time inside `pto2_submit_task` across all tasks | -| **sync_tensormap** | `g_orch_sync_cycle` | TensorMap validity sync and optional cleanup before each submission | -| **task_ring_alloc** | `g_orch_alloc_cycle` | Allocating a task slot from the task ring buffer | -| **param_copy** | `g_orch_args_cycle` | Copying param descriptors + tensor descriptor copies into task-owned storage | -| **lookup+dep** | `g_orch_lookup_cycle` | TensorMap lookup for inputs/inouts + building fanin/fanout dependency edges | -| **heap_alloc** | `g_orch_heap_cycle` | Allocating packed output buffers from the heap ring | -| **tensormap_ins** | `g_orch_insert_cycle` | Inserting output/inout tensors into the TensorMap | -| **fanin+ready** | `g_orch_fanin_cycle` | Building the fanin list + checking if task is already ready (Step 5/5b) | -| **scope_end** | `g_orch_scope_end_cycle` | `pto2_scope_end` overhead (notifying scheduler of 
scope completion) | -| **avg/task** | `total / submit_count` | Average orchestrator time per task submission | - -### Interpreting the Numbers - -- **cost > total**: The difference is overhead outside `pto2_submit_task` (the orchestration user code itself, scope_begin/end, TensorCreateInfo construction, etc.). -- **lookup+dep** is typically the dominant cost (~50%) because it involves TensorMap hash lookups and building dependency edges with spinlock-protected fanout list insertions. -- **param_copy** scales with the number of parameters per task. -- **avg/task < 1us** indicates efficient graph construction. - ---- - -## Block 2: PTO2 Scheduler Summary - -Each of the 3 scheduler threads (Thread 0, 1, 2) prints its own summary after completing all tasks. The output has two sub-sections: **summary** and **phase breakdown**. - -### Example (Thread 0, from a different run: batch=1, 1044 tasks) - -```text -Thread 0: completed=352 tasks in 3477.420us (147 loops, 2.4 tasks/loop) -Thread 0: --- Phase Breakdown --- -Thread 0: complete: 1485.020us (42.7%) [fanout: edges=432, max_degree=2, avg=1.2] [fanin: edges=320, max_degree=3, avg=0.9] -Thread 0: scan: 14.400us (0.4%) -Thread 0: dispatch: 1973.060us (56.7%) [pop: hit=352, miss=3043, hit_rate=10.4%] -Thread 0: idle: 4.940us (0.1%) -``` - -### Summary Line - -```text -Thread N: completed=X tasks in Yus (Z loops, W tasks/loop) -``` - -| Field | Description | -| ----- | ----------- | -| **completed** | Number of tasks this thread processed to completion | -| **Y us** | Total scheduler loop time (sum of all phase cycles) | -| **Z loops** | Number of scheduler loop iterations | -| **W tasks/loop** | Average tasks completed per loop iteration; higher = better throughput | - -### Phase Breakdown - -The scheduler loop runs four phases each iteration. Each phase's time is accumulated across all loop iterations. 
- -| Phase | What it does | Inline stats | -| ----- | ------------ | ------------ | -| **complete** | Polls handshake on each managed core; when a core completes, calls `on_subtask_complete(task_id, subslot)` to set the done bit; when `subtask_done_mask == active_mask`, triggers `on_mixed_task_complete` which traverses fanout list (notify consumers) and fanin list (release producers) | `fanout`: edges/max_degree/avg for consumer notification; `fanin`: edges/max_degree/avg for producer release | -| **scan** | Updates the perf profiling header with latest scheduler state | — | -| **dispatch** | For each idle core, pops a task from the shape-based ready queue via `get_ready_task(shape)`, builds the dispatch payload, and writes the task to the core's handshake register | `pop`: `hit` = successful pops (task dispatched), `miss` = empty queue pops, `hit_rate` = hit/(hit+miss) | -| **idle** | Scheduler loop iteration where no progress was made (no completions, no dispatches) | — | - -**Interpreting phase percentages:** - -- **dispatch** is typically the largest (~55-60%) because it includes ready-queue pops (with spinlock), payload construction, and cache flush (`dc cvac` + `dsb sy`). -- **complete** is the second largest (~40-45%) because it traverses both fanout (CAS-based fanin decrement, conditional ready-queue push) and fanin (release_producer, check_consumed, ring pointer advancement). -- **scan** is small (<1%) — only updates the perf header. -- **idle** is negligible when tasks are flowing; high idle% indicates the scheduler is starved. - -**Interpreting pop hit_rate:** - -- **High hit_rate (>50%)**: Ready queue is well-supplied; dispatch is efficient. -- **Low hit_rate (<10%)**: Ready queue is mostly empty when cores become idle. The bottleneck is upstream (orchestrator submission speed or fanout resolution latency), not dispatch itself. 
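As a quick sanity check of the Thread 0 example above, the reported 10.4% hit_rate falls out of hit/(hit+miss):

```cpp
#include <cassert>

// hit_rate (%) = 100 * hit / (hit + miss), as printed in the dispatch line.
double pop_hit_rate(int hit, int miss) {
    return 100.0 * hit / (hit + miss);
}
```

With hit=352 and miss=3043 this evaluates to roughly 10.37%, which the log rounds to 10.4%.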
- -### Per-Task Averages - -Divide each thread's phase times by its `completed` count to get per-task scheduling cost: - -| Metric | Formula | Typical value | -| ------ | ------- | ------------- | -| Scheduling overhead per task | total_time / completed | ~5-10 us/task | -| Dispatch per task | dispatch_time / completed | ~3-6 us/task | -| Complete per task | complete_time / completed | ~2-4 us/task | - ---- - -## Cross-Referencing with Host Profiling - -When `--enable-profiling` is used, the host terminal prints a **Task Statistics by Function** table with `Total_Exec` (total AICore kernel execution time). Combined with device log data: - -| Metric | Source | Description | -| ------ | ------ | ----------- | -| Avg kernel exec time | `Total_Exec / total_tasks` (host) | Time AICore spends executing each kernel | -| Avg scheduling overhead | `sum(thread_total) / total_tasks` (device log) | Time AICPU spends scheduling each task | -| Sched/Exec ratio | scheduling / execution | Scheduling overhead relative to kernel execution | - -A high sched/exec ratio (e.g., >3x) indicates that scheduling overhead dominates, and optimizations should target the scheduler's dispatch hot path (cache flush, payload construction) or upstream task flow. 
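The cross-referencing arithmetic can be sketched in shell. The totals below are made-up illustration values (not measured data); `total_exec_us` stands in for the host table's `Total_Exec` and `sched_total_us` for the sum of the scheduler threads' `total_time`:

```bash
# Hypothetical totals for illustration only.
total_exec_us=52000
sched_total_us=10400
total_tasks=1044
awk -v e="$total_exec_us" -v s="$sched_total_us" -v n="$total_tasks" \
    'BEGIN { printf "exec/task=%.1fus sched/task=%.1fus sched/exec=%.2f\n", e/n, s/n, s/e }'
```

With these numbers the sched/exec ratio comes out at 0.20, i.e. scheduling overhead is small relative to kernel execution; a ratio above ~3x is the signal to start optimizing the scheduler hot path.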
-
----
-
-## Quick Reference: Extracting Profiling Data
-
-```bash
-# Find the latest device log for device 2
-LOG=$(ls -t $HOME/ascend/log/debug/device-2/device-*.log | head -1)
-
-# Extract orchestrator profiling (Thread 3)
-grep "Thread 3:" "$LOG"
-
-# Extract scheduler profiling (Threads 0/1/2)
-grep -E "Thread [012]:" "$LOG"
-```
diff --git a/src/a2a3/runtime/tensormap_and_ringbuffer_unmodified/docs/profiling_levels.md b/src/a2a3/runtime/tensormap_and_ringbuffer_unmodified/docs/profiling_levels.md
deleted file mode 100644
index 0578b5327..000000000
--- a/src/a2a3/runtime/tensormap_and_ringbuffer_unmodified/docs/profiling_levels.md
+++ /dev/null
@@ -1,355 +0,0 @@
-# PTO Runtime2 Profiling Levels
-
-This document describes the profiling macro hierarchy and logging control in the PTO Runtime2 system.
-
-## Overview
-
-PTO Runtime2 uses a hierarchical profiling system with compile-time macros to control profiling code compilation and log output. The `enable_profiling` runtime flag controls data collection (performance buffers, shared memory writes) but does NOT control log output.
- -## Profiling Macro Hierarchy - -``` -PTO2_PROFILING (base level, default=1) -├── PTO2_ORCH_PROFILING (orchestrator, default=0, requires PTO2_PROFILING=1) -| └──PTO2_TENSORMAP_PROFILING (tensormap, default=0, requires PTO2_ORCH_PROFILING=1) -├── PTO2_SCHED_PROFILING (scheduler, default=0, requires PTO2_PROFILING=1) -└── --enable-profiling (Dump profiling merged swimlane json file for visualization, requires PTO2_PROFILING=1) - -``` - -### Compile-Time Validation - -Each sub-level macro requires `PTO2_PROFILING=1`: - -```cpp -#if PTO2_ORCH_PROFILING && !PTO2_PROFILING -#error "PTO2_ORCH_PROFILING requires PTO2_PROFILING=1" -#endif - -#if PTO2_SCHED_PROFILING && !PTO2_PROFILING -#error "PTO2_SCHED_PROFILING requires PTO2_PROFILING=1" -#endif - -#if PTO2_TENSORMAP_PROFILING && !PTO2_ORCH_PROFILING -#error "PTO2_TENSORMAP_PROFILING requires PTO2_ORCH_PROFILING=1" -#endif -``` - -## Profiling Levels - -### Level 0: No Profiling (PTO2_PROFILING=0) - -**What's compiled:** -- Debug/diagnostic logs (always present) -- Progress tracking (`PTO2 progress: completed=...`) -- Stall detection and dump (triggered only after `MAX_IDLE_ITERATIONS` idle loops) -- Deadlock/livelock detection (`diagnose_stuck_state`, called on stall) - -**What's NOT compiled:** -- All `CYCLE_COUNT_*` timing counters (`sched_*_cycle`, orchestrator cost counters) -- Scheduler/Orchestrator profiling summary logs guarded by `#if PTO2_PROFILING` -- Performance data collection paths (`enable_profiling` runtime flag becomes ineffective because profiling code is not compiled) - -**Log output (normal run, no stall):** -- No `sched_start/sched_end/sched_cost` timestamps -- No `orch_start/orch_end/orch_cost` timestamps -- No `Scheduler summary: total_time=...` -- No `PTO2 total submitted tasks` log -- `PTO2 progress: completed=... 
total=...` may appear (thread 0 only, at task completion milestones) - - ---- - -### Level 1: Basic Profiling (PTO2_PROFILING=1) - -**What's compiled:** -- Base timing counters for scheduler loop (`sched_complete/dispatch/idle/scan`) -- Per-thread orchestration timing (`orch_start`, `orch_end`, `orch_cost`) -- Stage-level orchestration end timestamp (`orch_stage_end`, printed by last orch thread only, marks the moment all orch threads have finished and core transition is about to be requested; only when `orch_to_sched_` is true) -- PTO2 total submitted tasks count (printed by last orch thread, after orch timing line) -- Scheduler summary output (`total_time`, `loops`, `tasks_scheduled`) -- Scheduler lifetime timestamps and cost (`sched_start`, `sched_end`, `sched_cost` — captured inside `resolve_and_dispatch_pto2()`, printed before Scheduler summary) - -**What's NOT compiled:** -- Detailed phase breakdowns -- TensorMap statistics - -**Log output (additional lines vs Level 0, per normal run):** -- `Thread %d: orch_start=%llu orch_end=%llu orch_cost=%.3fus` — each orch thread, after orchestration fully complete -- `PTO2 total submitted tasks = %d, already executed %d tasks` — last orch thread only (×1), after orch timing line -- `Thread %d: orch_stage_end=%llu` — last orch thread only (×1), only when `orch_to_sched_=true` -- `Thread %d: sched_start=%llu sched_end=%llu sched_cost=%.3fus` — each sched thread, printed before Scheduler summary -- `Thread %d: Scheduler summary: total_time=%.3fus, loops=%llu, tasks_scheduled=%d` — each sched thread -- `Thread %d: sched_start=%llu sched_end(timeout)=%llu sched_cost=%.3fus` — timeout path only (replaces normal `sched_end`) - -**DEV_ALWAYS count (normal run):** -- `orch_to_sched_=false` (default): `N_sched*2 + N_orch*1 + 1` (orch_timing + PTO2_total + sched_timing + Scheduler_summary) -- `orch_to_sched_=true` (`PTO2_ORCH_TO_SCHED=1`): adds 1 (`orch_stage_end`) - -> See the table at the end for concrete counts based on the 
`paged_attention` example. - -**Example log output — `orch_to_sched_=false`** (from `paged_attention`, device 10): -``` -Thread 2: orch_start=48214752948321 orch_end=48214752959379 orch_cost=230.000us -Thread 3: orch_start=48214752948316 orch_end=48214752961505 orch_cost=275.000us -PTO2 total submitted tasks = 13, already executed 13 tasks -Thread 1: sched_start=48214752948235 sched_end=48214752962379 sched_cost=295.000us -Thread 1: Scheduler summary: total_time=159.560us, loops=3782, tasks_scheduled=6 -Thread 0: sched_start=48214752948200 sched_end=48214752963571 sched_cost=320.000us -Thread 0: Scheduler summary: total_time=183.180us, loops=4611, tasks_scheduled=7 -``` - -**Example log output — `orch_to_sched_=true`** (`PTO2_ORCH_TO_SCHED=1`, from `paged_attention`, device 11): -``` -Thread 3: orch_stage_end=48236915058307 -Thread 3: orch_start=48236915044001 orch_end=48236915058781 orch_cost=308.000us -Thread 2: orch_start=48236915044003 orch_end=48236915058782 orch_cost=308.000us -PTO2 total submitted tasks = 13, already executed 13 tasks -Thread 0: sched_start=48236915043911 sched_end=48236915059191 sched_cost=318.000us -Thread 0: Scheduler summary: total_time=187.920us, loops=4561, tasks_scheduled=4 -Thread 1: sched_start=48236915043947 sched_end=48236915061881 sched_cost=372.000us -Thread 1: Scheduler summary: total_time=168.620us, loops=3880, tasks_scheduled=9 -``` - -> With `orch_to_sched_=true`, orch threads transition to schedulers after orchestration. They print `orch_end` but do NOT print `Scheduler summary` or `sched_end` (they have no cores assigned at shutdown time). - -**Note:** -- All logs above are controlled by compile-time macro `PTO2_PROFILING`, not by `enable_profiling`. -- `enable_profiling` only controls shared-memory data collection / swimlane export. -- Enable `orch_to_sched_` via environment variable: `PTO2_ORCH_TO_SCHED=1`. 
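The count formula can be sanity-checked with shell arithmetic for the 2-scheduler + 2-orchestrator configuration used in the example runs:

```bash
# Level 1 DEV_ALWAYS count: N_sched*2 + N_orch*1 + 1,
# plus one extra line (orch_stage_end) when orch_to_sched_=true.
N_sched=2
N_orch=2
echo "orch_to_sched_=false: $(( N_sched * 2 + N_orch * 1 + 1 ))"
echo "orch_to_sched_=true:  $(( N_sched * 2 + N_orch * 1 + 1 + 1 ))"
```

This reproduces the counts of 7 and 8 shown for Level 1 in the Log Output Summary table at the end of this document.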
- ---- - -### Level 2: Scheduler Detailed Profiling (PTO2_SCHED_PROFILING=1) - -**Requires:** `PTO2_PROFILING=1` - -**What's compiled:** -- All Level 1 features -- Detailed scheduler phase counters -- Phase-specific statistics (complete, scan, dispatch, idle) -- Hit rate tracking (complete poll, ready queue pop) - -**Log output:** 18 DEV_ALWAYS logs (11 debug + 2 basic + 7 scheduler detailed - 2 replaced) -- Replaces scheduler summary with detailed breakdown - -**Scheduler output:** -``` -Thread X: === Scheduler Phase Breakdown: total=XXXus, XXX tasks === -Thread X: complete : XXXus (XX.X%) [fanout: edges=XXX, max_degree=X, avg=X.X] [fanin: edges=XXX, max_degree=X, avg=X.X] -Thread X: poll : XXXus (XX.X%) hit=XXX, miss=XXX, hit_rate=XX.X% -Thread X: otc_lock : XXXus (XX.X%) work=XXXus wait=XXXus atomics=XXX -Thread X: otc_fanout : XXXus (XX.X%) work=XXXus wait=XXXus atomics=XXX -Thread X: otc_fanin : XXXus (XX.X%) atomics=XXX -Thread X: otc_self : XXXus (XX.X%) atomics=XXX -Thread X: perf : XXXus (XX.X%) -Thread X: dispatch : XXXus (XX.X%) [pop: hit=XXX, miss=XXX, hit_rate=XX.X%] -Thread X: poll : XXXus (XX.X%) -Thread X: pop : XXXus (XX.X%) work=XXXus wait=XXXus atomics=XXX -Thread X: setup : XXXus (XX.X%) -Thread X: scan : XXXus (XX.X%) -Thread X: idle : XXXus (XX.X%) -Thread X: avg/complete : XXXus -Thread X: Scheduler summary: total_time=XXXus, loops=XXX, tasks_scheduled=XXX -``` - ---- - -### Level 3: Orchestrator Detailed Profiling (PTO2_ORCH_PROFILING=1) - -**Requires:** `PTO2_PROFILING=1` - -**What's compiled:** -- All Level 1 features -- Detailed orchestrator phase counters -- Per-phase cycle tracking -- Atomic operation counters -- Wait time tracking - -**Log output:** 30 DEV_ALWAYS logs (11 debug + 2 basic + 1 scheduler summary + 17 orchestrator detailed - 1 replaced) -- Replaces basic orchestration completion with detailed breakdown - -**Orchestrator output:** -``` -Thread X: === Orchestrator Profiling: XXX tasks, total=XXXus === -Thread X: 
sync_tensormap : XXXus (XX.X%) -Thread X: task_ring_alloc: XXXus (XX.X%) work=XXXus wait=XXXus atomics=XXX -Thread X: param_copy : XXXus (XX.X%) atomics=XXX -Thread X: lookup+dep : XXXus (XX.X%) -Thread X: heap_alloc : XXXus (XX.X%) work=XXXus wait=XXXus atomics=XXX -Thread X: tensormap_ins : XXXus (XX.X%) -Thread X: fanin+ready : XXXus (XX.X%) work=XXXus wait=XXXus atomics=XXX -Thread X: finalize+SM : XXXus (XX.X%) work=XXXus wait=XXXus atomics=XXX -Thread X: scope_end : XXXus atomics=XXX -Thread X: avg/task : XXXus -``` - -**Note:** Orchestrator logs always print when `PTO2_ORCH_PROFILING=1`, regardless of `enable_profiling` flag. - ---- - -### Level 4: TensorMap Profiling (PTO2_TENSORMAP_PROFILING=1) - -**Requires:** `PTO2_PROFILING=1` AND `PTO2_ORCH_PROFILING=1` - -**What's compiled:** -- All Level 3 features -- TensorMap lookup statistics -- Hash chain walk tracking -- Overlap check counters - -**Log output:** 34 DEV_ALWAYS logs (30 from Level 3 + 4 tensormap) - -**TensorMap output:** -``` -Thread X: === TensorMap Lookup Stats === -Thread X: lookups : XXX, inserts: XXX -Thread X: chain walked : total=XXX, avg=X.X, max=X -Thread X: overlap checks : XXX, hits=XXX (XX.X%) -``` - ---- - -## Runtime Flag: enable_profiling - -The `runtime->enable_profiling` flag controls **data collection**, NOT log output. 
- -### When enable_profiling=true: -- Performance buffers are allocated and written -- Per-task timing data is collected -- Phase profiling data is recorded -- Orchestrator summary is written to shared memory - -### When enable_profiling=false: -- No performance data collection -- No shared memory writes -- Logs still print (controlled by macros only) - -### Usage: -```cpp -// Initialize runtime with profiling enabled -runtime->enable_profiling = true; -``` - ---- - -## Common Profiling Configurations - -### Development (minimal overhead) -```bash -# No profiling overhead -PTO2_PROFILING=0 -``` - -### Basic Performance Monitoring -```bash -# Minimal overhead, summary logs only -PTO2_PROFILING=1 -PTO2_ORCH_PROFILING=0 -PTO2_SCHED_PROFILING=0 -``` - -### Scheduler Performance Analysis -```bash -# Detailed scheduler breakdown -PTO2_PROFILING=1 -PTO2_ORCH_PROFILING=0 -PTO2_SCHED_PROFILING=1 -``` - -### Orchestrator Performance Analysis -```bash -# Detailed orchestrator breakdown -PTO2_PROFILING=1 -PTO2_ORCH_PROFILING=1 -PTO2_SCHED_PROFILING=0 -``` - -### Full Profiling (maximum overhead) -```bash -# All profiling features enabled -PTO2_PROFILING=1 -PTO2_ORCH_PROFILING=1 -PTO2_SCHED_PROFILING=1 -PTO2_TENSORMAP_PROFILING=1 -``` - ---- - -## Setting Profiling Macros - -### At compile time: -```bash -# In CMakeLists.txt or build command -add_definitions(-DPTO2_PROFILING=1) -add_definitions(-DPTO2_ORCH_PROFILING=1) -``` - -### In source code (before including headers): -```cpp -#define PTO2_PROFILING 1 -#define PTO2_ORCH_PROFILING 1 -#include "pto_runtime2_types.h" -``` - ---- - -## Log Output Summary - -> Example: `paged_attention` on Ascend hardware, 2 sched threads + 2 orch threads, normal run (no stall/timeout). 
- -| Level | Macro Settings | DEV_ALWAYS Count (`orch_to_sched_=false`) | DEV_ALWAYS Count (`orch_to_sched_=true`) | Description | -|-------|---------------|------------------------------------------|------------------------------------------|-------------| -| 0 | `PTO2_PROFILING=0` | 0 | 0 | No timing output | -| 1 | `PTO2_PROFILING=1` | 7 | 8 | Timing timestamps + scheduler summary | -| 2 | `+PTO2_SCHED_PROFILING=1` | — | — | Scheduler detailed phase breakdown | -| 3 | `+PTO2_ORCH_PROFILING=1` | — | — | Orchestrator detailed phase breakdown | -| 4 | `+PTO2_TENSORMAP_PROFILING=1` | — | — | TensorMap lookup stats | - ---- - -## Implementation Notes - -### Key Principles - -1. **Macros control compilation and logging** - - `#if PTO2_PROFILING` controls whether profiling code is compiled - - Logs print when macro is enabled, regardless of runtime flag - -2. **Runtime flag controls data collection** - - `enable_profiling` controls performance buffer allocation - - Controls shared memory writes for host-side export - - Does NOT control log output - -3. 
**Consistent behavior across components** - - Scheduler logs: macro-controlled only - - Orchestrator logs: macro-controlled only - - Data collection: runtime flag controlled - -### Code Locations - -- Macro definitions: `src/a2a3/runtime/tensormap_and_ringbuffer/runtime/pto_runtime2_types.h` -- Scheduler profiling: `src/a2a3/runtime/tensormap_and_ringbuffer/aicpu/aicpu_executor.cpp` (lines 770-835) -- Orchestrator profiling: `src/a2a3/runtime/tensormap_and_ringbuffer/aicpu/aicpu_executor.cpp` (lines 1035-1105) -- TensorMap profiling: `src/a2a3/runtime/tensormap_and_ringbuffer/runtime/pto_tensormap.h` - ---- - -## Performance Impact - -### Compilation overhead: -- Level 0: No overhead -- Level 1: Minimal (counter increments, basic arithmetic) -- Level 2-4: Low to moderate (additional counters, cycle measurements) - -### Runtime overhead: -- Logging: Negligible (device logs are asynchronous) -- Data collection (`enable_profiling=true`): Low to moderate - - Performance buffer writes - - Shared memory updates - - Per-task timing measurements - -### Recommendation: -- Use Level 0 for production -- Use Level 1-2 for performance monitoring -- Use Level 3-4 for detailed performance analysis only diff --git a/src/a2a3/runtime/tensormap_and_ringbuffer_unmodified/host/runtime_compile_info.cpp b/src/a2a3/runtime/tensormap_and_ringbuffer_unmodified/host/runtime_compile_info.cpp deleted file mode 100644 index 76c0e8a74..000000000 --- a/src/a2a3/runtime/tensormap_and_ringbuffer_unmodified/host/runtime_compile_info.cpp +++ /dev/null @@ -1,18 +0,0 @@ -#include "host/platform_compile_info.h" -#include "host/runtime_compile_info.h" -#include - -extern "C" { - -ToolchainType get_incore_compiler(void) { - if (strcmp(get_platform(), "a2a3") == 0) return TOOLCHAIN_CCEC; - return TOOLCHAIN_HOST_GXX_15; -} - -ToolchainType get_orchestration_compiler(void) { - // tensormap_and_ringbuffer: a2a3 needs aarch64 cross-compile (AICPU is aarch64) - if (strcmp(get_platform(), "a2a3") == 0) return 
TOOLCHAIN_AARCH64_GXX; - return TOOLCHAIN_HOST_GXX; -} - -} diff --git a/src/a2a3/runtime/tensormap_and_ringbuffer_unmodified/host/runtime_maker.cpp b/src/a2a3/runtime/tensormap_and_ringbuffer_unmodified/host/runtime_maker.cpp deleted file mode 100644 index e29bee245..000000000 --- a/src/a2a3/runtime/tensormap_and_ringbuffer_unmodified/host/runtime_maker.cpp +++ /dev/null @@ -1,381 +0,0 @@ -/* - * Copyright (c) PyPTO Contributors. - * This program is free software, you can redistribute it and/or modify it under the terms and conditions of - * CANN Open Software License Agreement Version 2.0 (the "License"). - * Please refer to the License for details. You may not use this file except in compliance with the License. - * THIS SOFTWARE IS PROVIDED ON AN "AS IS" BASIS, WITHOUT WARRANTIES OF ANY KIND, EITHER EXPRESS OR IMPLIED, - * INCLUDING BUT NOT LIMITED TO NON-INFRINGEMENT, MERCHANTABILITY, OR FITNESS FOR A PARTICULAR PURPOSE. - * See LICENSE in the root of the software repository for the full text of the License. - * ----------------------------------------------------------------------------------------------------------- - */ -/** - * Runtime Builder - rt2 Implementation (Device Orchestration) - * - * Provides init_runtime_impl and validate_runtime_impl functions for rt2 runtime. - * Supports device orchestration where AICPU thread 3 runs the orchestrator. 
- * - * init_runtime_impl: - * - Converts host tensor pointers to device pointers (all tensors copied both directions) - * - Copies orchestration SO to device memory - * - Sets up runtime state for device orchestration - * - * validate_runtime_impl: - * - Copies recorded tensors back from device to host - * - Frees device memory - */ - -#include -#include -#include - -#include -#include -#include -#include -#include -#include - -#include "../runtime/pto_shared_memory.h" -#include "../runtime/runtime.h" -#include "callable.h" -#include "common/platform_config.h" -#include "common/unified_log.h" - -// Helper: return current time in milliseconds -static int64_t _now_ms() { - struct timeval tv; - gettimeofday(&tv, nullptr); - return static_cast(tv.tv_sec) * 1000 + tv.tv_usec / 1000; -} - -/** - * Parse an environment variable as uint64_t with optional power-of-2 constraint. - * Returns the parsed value on success, or 0 if unset or validation fails. - */ -static uint64_t parse_env_uint64(const char* name, uint64_t min_val, bool require_power_of_2) { - const char* env = std::getenv(name); - if (!env) return 0; - char* endptr; - errno = 0; - uint64_t val = strtoull(env, &endptr, 10); - if (errno == ERANGE || endptr == env || *endptr != '\0' || val < min_val) { - LOG_WARN("%s=%s invalid (must be a valid integer >= %" PRIu64 "), ignored", name, env, min_val); - return 0; - } - if (require_power_of_2 && (val & (val - 1)) != 0) { - LOG_WARN("%s=%s invalid (must be a power of 2, >= %" PRIu64 "), ignored", name, env, min_val); - return 0; - } - return static_cast(val); -} - -/** - * Initialize a pre-allocated runtime for device orchestration. - * - * For rt2 runtime, orchestration runs on AICPU thread 3 (device-side). 
- * This function: - * - Copies tensor metadata and replaces host pointers with device pointers - * - Copies all tensor data to device - * - Records all tensors for copy-back - * - Copies orchestration SO to device memory - * - Sets up runtime state for device orchestration - * - * @param runtime Pointer to pre-constructed Runtime - * @param callable ChipCallable containing orch binary, func_name, and child kernels - * @param orch_args Separated tensor/scalar arguments - * @return 0 on success, -1 on failure - */ -extern "C" int init_runtime_impl(Runtime* runtime, const ChipCallable* callable, const ChipStorageTaskArgs* orch_args) { - // Validate inputs - if (runtime == nullptr) { - LOG_ERROR("Runtime pointer is null"); - return -1; - } - - // Register kernel binaries from ChipCallable children - if (callable->child_count() > 0) { - LOG_INFO("Registering %d kernel(s) in init_runtime_impl", callable->child_count()); - for (int32_t i = 0; i < callable->child_count(); i++) { - int func_id = callable->child_func_id(i); - const auto& kernel = callable->child(i); - uint64_t addr = runtime->host_api.upload_kernel_binary(func_id, - reinterpret_cast(&kernel), - CoreCallable::binary_data_offset() + kernel.binary_size()); - if (addr == 0) { - LOG_ERROR("Failed to upload kernel binary for func_id=%d", func_id); - return -1; - } - runtime->set_function_bin_addr(func_id, addr); - } - } - - const uint8_t* orch_so_binary = static_cast(callable->binary_data()); - size_t orch_so_size = callable->binary_size(); - - if (orch_so_binary == nullptr || orch_so_size == 0) { - LOG_ERROR("Orchestration SO binary is required for device orchestration"); - return -1; - } - - if (orch_args == nullptr) { - LOG_ERROR("orch_args pointer is null"); - return -1; - } - - int tensor_count = orch_args->tensor_count(); - int scalar_count = orch_args->scalar_count(); - LOG_INFO("RT2 init: %d tensors + %d scalars, device orchestration mode", tensor_count, scalar_count); - - int64_t t_total_start = 
_now_ms(); - - // Build device args: copy from input, replace host tensor pointers with device pointers - ChipStorageTaskArgs device_args; - - int64_t t_args_start = _now_ms(); - for (int i = 0; i < tensor_count; i++) { - ContinuousTensor t = orch_args->tensor(i); - - void* host_ptr = reinterpret_cast(static_cast(t.data)); - size_t size = static_cast(t.nbytes()); - - void* dev_ptr = runtime->host_api.device_malloc(size); - if (dev_ptr == nullptr) { - LOG_ERROR("Failed to allocate device memory for tensor %d", i); - return -1; - } - - int rc = runtime->host_api.copy_to_device(dev_ptr, host_ptr, size); - if (rc != 0) { - LOG_ERROR("Failed to copy tensor %d to device", i); - runtime->host_api.device_free(dev_ptr); - return -1; - } - runtime->record_tensor_pair(host_ptr, dev_ptr, size); - LOG_INFO(" Tensor %d: %zu bytes at %p", i, size, dev_ptr); - - t.data = reinterpret_cast(dev_ptr); - device_args.add_tensor(t); - } - for (int i = 0; i < scalar_count; i++) { - device_args.add_scalar(orch_args->scalar(i)); - } - int64_t t_args_end = _now_ms(); - - // Copy orchestration SO to device memory (AICPU cannot access host memory) - int64_t t_so_start = _now_ms(); - void* dev_so = runtime->host_api.device_malloc(orch_so_size); - if (dev_so == nullptr) { - LOG_ERROR("Failed to allocate device memory for orchestration SO"); - return -1; - } - int rc = runtime->host_api.copy_to_device(dev_so, orch_so_binary, orch_so_size); - if (rc != 0) { - LOG_ERROR("Failed to copy orchestration SO to device"); - runtime->host_api.device_free(dev_so); - return -1; - } - // Copy SO binary into Runtime's internal storage (device_orch_so_storage_) - // Pass the HOST pointer (orch_so_binary), not the device pointer (dev_so) - // AICPU Thread 3 will read from get_device_orch_so_data() which returns this storage - runtime->set_device_orch_so(orch_so_binary, orch_so_size); - runtime->record_tensor_pair(nullptr, dev_so, orch_so_size); - LOG_INFO("Orchestration SO: %zu bytes copied to device", 
orch_so_size); - int64_t t_so_end = _now_ms(); - - // Read ready queue shard count from environment for AICPU scheduler - { - const char* env_shards = std::getenv("PTO2_READY_QUEUE_SHARDS"); - if (env_shards) { - char* endptr; - int64_t val = strtol(env_shards, &endptr, 10); - if (endptr != env_shards && *endptr == '\0' && val >= 1 && val <= PLATFORM_MAX_AICPU_THREADS) { - runtime->ready_queue_shards = static_cast(val); - } else { - LOG_WARN("PTO2_READY_QUEUE_SHARDS=%s is invalid or out of range [1,%d], using default %d", - env_shards, - PLATFORM_MAX_AICPU_THREADS, - RUNTIME_DEFAULT_READY_QUEUE_SHARDS); - runtime->ready_queue_shards = RUNTIME_DEFAULT_READY_QUEUE_SHARDS; - } - } - LOG_INFO("Ready queue shards: %d", runtime->ready_queue_shards); - } - - // Read orchestrator-to-scheduler transition flag from environment - { - const char* env_val = std::getenv("PTO2_ORCH_TO_SCHED"); - if (env_val && (env_val[0] == '1' || env_val[0] == 't' || env_val[0] == 'T')) { - runtime->orch_to_sched = true; - } - LOG_INFO("Orchestrator-to-scheduler transition: %s", runtime->orch_to_sched ? "enabled" : "disabled"); - } - - // Read ring buffer size overrides from environment - { - runtime->pto2_task_window_size = parse_env_uint64("PTO2_RING_TASK_WINDOW", 4, true); - runtime->pto2_heap_size = parse_env_uint64("PTO2_RING_HEAP", 1024, true); - runtime->pto2_dep_pool_size = parse_env_uint64("PTO2_RING_DEP_POOL", 4, false); - if (runtime->pto2_task_window_size || runtime->pto2_heap_size || runtime->pto2_dep_pool_size) { - LOG_INFO("Ring buffer overrides: task_window=%" PRIu64 " heap=%" PRIu64 " dep_pool=%" PRIu64, - (uint64_t)(runtime->pto2_task_window_size ? runtime->pto2_task_window_size : PTO2_TASK_WINDOW_SIZE), - (uint64_t)(runtime->pto2_heap_size ? runtime->pto2_heap_size : PTO2_HEAP_SIZE), - (uint64_t)(runtime->pto2_dep_pool_size ? 
runtime->pto2_dep_pool_size : PTO2_DEP_LIST_POOL_SIZE)); - } - } - - // Resolve effective sizes (env override or compile-time default) - uint64_t eff_heap_size = runtime->pto2_heap_size ? runtime->pto2_heap_size : PTO2_HEAP_SIZE; - uint64_t eff_task_window_size = - runtime->pto2_task_window_size ? runtime->pto2_task_window_size : PTO2_TASK_WINDOW_SIZE; - - // Allocate GM heap for orchestrator output buffers (all rings combined) - uint64_t total_heap_size = eff_heap_size * PTO2_MAX_RING_DEPTH; - int64_t t_heap_start = _now_ms(); - void* gm_heap = runtime->host_api.device_malloc(total_heap_size); - int64_t t_heap_end = _now_ms(); - if (gm_heap == nullptr) { - LOG_ERROR("Failed to allocate GM heap"); - return -1; - } - runtime->record_tensor_pair(nullptr, gm_heap, total_heap_size); - runtime->set_pto2_gm_heap(gm_heap); - - // Allocate PTO2 shared memory - int64_t t_sm_start = _now_ms(); - uint64_t sm_size = pto2_sm_calculate_size(eff_task_window_size); - void* sm_ptr = runtime->host_api.device_malloc(sm_size); - int64_t t_sm_end = _now_ms(); - if (sm_ptr == nullptr) { - LOG_ERROR("Failed to allocate PTO2 shared memory"); - return -1; - } - runtime->set_pto2_gm_sm_ptr(sm_ptr); - runtime->record_tensor_pair(nullptr, sm_ptr, static_cast(sm_size)); - - // Set up device orchestration state - runtime->set_orch_built_on_host(false); - runtime->set_orch_args(device_args); - - LOG_INFO("Device orchestration ready: %d tensors + %d scalars", tensor_count, scalar_count); - - int64_t t_total_end = _now_ms(); - LOG_INFO("TIMING: args_malloc_copy = %" PRId64 "ms", t_args_end - t_args_start); - LOG_INFO("TIMING: orch_so_copy = %" PRId64 "ms", t_so_end - t_so_start); - LOG_INFO("TIMING: gm_heap_alloc(1GB) = %" PRId64 "ms", t_heap_end - t_heap_start); - LOG_INFO("TIMING: shared_mem_alloc = %" PRId64 "ms", t_sm_end - t_sm_start); - LOG_INFO("TIMING: total_init_runtime_impl = %" PRId64 "ms", t_total_end - t_total_start); - - return 0; -} - -/** - * Validate runtime results and cleanup. 
- * - * This function: - * 1. Copies recorded tensors from device back to host - * 2. Frees device memory for recorded tensors - * 3. Clears tensor pair state - * - * @param runtime Pointer to Runtime - * @return 0 on success, -1 on failure - */ -extern "C" int validate_runtime_impl(Runtime* runtime) { - if (runtime == nullptr) { - LOG_ERROR("Runtime pointer is null"); - return -1; - } - - int rc = 0; - - LOG_INFO("=== Copying Results Back to Host ==="); - - // Copy all recorded tensors from device back to host - TensorPair* tensor_pairs = runtime->get_tensor_pairs(); - int tensor_pair_count = runtime->get_tensor_pair_count(); - - LOG_INFO("Tensor pairs to process: %d", tensor_pair_count); - - // PTO2 (device orchestration): graph output may be in packed buffer - void* pto2_sm = runtime->get_pto2_gm_sm_ptr(); - uint64_t graph_out_ptr = 0; - uint64_t graph_out_size = 0; - - if (pto2_sm != nullptr) { - // Copy header from device to host to read graph_output_ptr/size - PTO2SharedMemoryHeader host_header; - int hdr_rc = runtime->host_api.copy_from_device(&host_header, pto2_sm, sizeof(PTO2SharedMemoryHeader)); - if (hdr_rc == 0) { - graph_out_ptr = host_header.graph_output_ptr; - graph_out_size = host_header.graph_output_size; - if (graph_out_ptr != 0) { - LOG_INFO("Graph output buffer: ptr=0x%" PRIx64 ", size=%" PRIu64, graph_out_ptr, graph_out_size); - } - } else { - LOG_WARN("Failed to copy PTO2 header from device"); - } - } - - bool first_output_tensor = true; - for (int i = 0; i < tensor_pair_count; i++) { - const TensorPair& pair = tensor_pairs[i]; - - // Skip if device pointer is null - if (pair.dev_ptr == nullptr) { - LOG_WARN("Tensor %d has null device pointer, skipping", i); - continue; - } - - // If host pointer is null, this is a device-only allocation (no copy-back) - if (pair.host_ptr == nullptr) { - LOG_INFO("Tensor %d: device-only allocation (no copy-back)", i); - continue; - } - - void* src_ptr = pair.dev_ptr; - size_t copy_size = pair.size; - - // Use 
graph_output_ptr for the first output tensor if available - if (first_output_tensor && graph_out_ptr != 0 && graph_out_size > 0) { - src_ptr = reinterpret_cast(static_cast(graph_out_ptr)); - copy_size = static_cast(graph_out_size); - LOG_INFO("Using packed output buffer for tensor %d", i); - first_output_tensor = false; - } - - int copy_rc = runtime->host_api.copy_from_device(pair.host_ptr, src_ptr, copy_size); - if (copy_rc != 0) { - LOG_ERROR("Failed to copy tensor %d from device: %d", i, copy_rc); - rc = copy_rc; - } else { - LOG_INFO("Tensor %d: %zu bytes copied to host", i, pair.size); - } - } - - // Cleanup device tensors - LOG_INFO("=== Cleaning Up ==="); - for (int i = 0; i < tensor_pair_count; i++) { - if (tensor_pairs[i].dev_ptr != nullptr) { - runtime->host_api.device_free(tensor_pairs[i].dev_ptr); - } - } - LOG_INFO("Freed %d device allocations", tensor_pair_count); - - // Cleanup kernel binaries - int kernel_count = runtime->get_registered_kernel_count(); - for (int i = 0; i < kernel_count; i++) { - int func_id = runtime->get_registered_kernel_func_id(i); - runtime->host_api.remove_kernel_binary(func_id); - runtime->set_function_bin_addr(func_id, 0); - } - if (kernel_count > 0) { - LOG_INFO("Freed %d kernel binaries", kernel_count); - } - runtime->clear_registered_kernels(); - - // Clear tensor pairs - runtime->clear_tensor_pairs(); - - LOG_INFO("=== Finalize Complete ==="); - - return rc; -} diff --git a/src/a2a3/runtime/tensormap_and_ringbuffer_unmodified/orchestration/common.cpp b/src/a2a3/runtime/tensormap_and_ringbuffer_unmodified/orchestration/common.cpp deleted file mode 100644 index f0c666908..000000000 --- a/src/a2a3/runtime/tensormap_and_ringbuffer_unmodified/orchestration/common.cpp +++ /dev/null @@ -1,174 +0,0 @@ -#include "common.h" -#include "pto_orchestration_api.h" - -#ifdef __linux__ -#include -#include -#include -#include - -#include -#include -#include -#endif - -struct PTO2Runtime; - -namespace { -// Plain global (not thread_local) 
to avoid glibc TLSDESC stale-resolution
-// crash (BZ #32412) when the orchestration SO is dlclose'd/re-dlopen'd
-// between execution rounds. All orchestrator threads bind the same rt
-// value, so per-thread storage is unnecessary.
-PTO2Runtime* g_pto2_current_runtime = nullptr;
-}
-
-extern "C" __attribute__((visibility("default"))) void pto2_framework_bind_runtime(PTO2Runtime* rt) {
-    g_pto2_current_runtime = rt;
-}
-
-// Keep current_runtime local to this .so so orchestration helpers do not
-// accidentally bind to the AICPU binary's same-named symbol.
-extern "C" __attribute__((visibility("hidden"))) PTO2Runtime* pto2_framework_current_runtime() {
-    return g_pto2_current_runtime;
-}
-
-/**
- * Convert an address to file:line information via addr2line.
- * The -i flag expands inlining; the first returned line is the innermost
- * actual code location. If inlining is present, the outer call chain is
- * also returned through inline_chain.
- */
-#ifdef __linux__
-static std::string addr_to_line(const char* executable, void* addr,
-                                std::string* inline_chain = nullptr) {
-    char cmd[512];
-    snprintf(cmd, sizeof(cmd), "addr2line -e %s -f -C -p -i %p 2>/dev/null", executable, addr);
-
-    std::array<char, 256> buffer;
-    std::string raw_output;
-
-    FILE* pipe = popen(cmd, "r");
-    if (pipe) {
-        while (fgets(buffer.data(), buffer.size(), pipe) != nullptr) {
-            raw_output += buffer.data();
-        }
-        pclose(pipe);
-    }
-
-    if (raw_output.empty() || raw_output.find("??") != std::string::npos) {
-        return "";
-    }
-
-    // Split the output into lines
-    std::vector<std::string> lines;
-    size_t pos = 0;
-    while (pos < raw_output.size()) {
-        size_t nl = raw_output.find('\n', pos);
-        if (nl == std::string::npos) nl = raw_output.size();
-        std::string line = raw_output.substr(pos, nl - pos);
-        while (!line.empty() && line.back() == '\r') line.pop_back();
-        if (!line.empty()) lines.push_back(line);
-        pos = nl + 1;
-    }
-
-    if (lines.empty()) return "";
-
-    // The first line is the innermost actual code location; any following
-    // lines are the outer inline callers.
-    if (inline_chain && lines.size() > 1) {
-        *inline_chain = "";
-        for (size_t j = 1; j < lines.size(); j++) {
-            *inline_chain += " [inlined by] " + lines[j] + "\n";
-        }
-    }
-
-    return lines.front();
-}
-#endif
-
-/**
- * Get the current call stack (including file paths and line numbers).
- * Uses dladdr to locate the shared object containing each frame, then
- * calls addr2line with the module-relative address.
- */
-std::string get_stacktrace(int skip_frames) {
-    (void)skip_frames;  // May be unused on non-Linux platforms
-    std::string result;
-#ifdef __linux__
-    const int max_frames = 64;
-    void* buffer[max_frames];
-    int nframes = backtrace(buffer, max_frames);
-    char** symbols = backtrace_symbols(buffer, nframes);
-
-    if (symbols) {
-        result = "Stack trace:\n";
-        for (int i = skip_frames; i < nframes; i++) {
-            std::string frame_info;
-
-            void* addr = (void*)((char*)buffer[i] - 1);
-
-            Dl_info dl_info;
-            std::string inline_chain;
-            if (dladdr(addr, &dl_info) && dl_info.dli_fname) {
-                void* rel_addr = (void*)((char*)addr - (char*)dl_info.dli_fbase);
-                std::string addr2line_result = addr_to_line(dl_info.dli_fname, rel_addr, &inline_chain);
-
-                if (addr2line_result.empty()) {
-                    addr2line_result = addr_to_line(dl_info.dli_fname, addr, &inline_chain);
-                }
-
-                if (!addr2line_result.empty()) {
-                    frame_info = std::string(dl_info.dli_fname) + ": " + addr2line_result;
-                }
-            }
-
-            if (frame_info.empty()) {
-                std::string frame(symbols[i]);
-
-                size_t start = frame.find('(');
-                size_t end = frame.find('+', start);
-                if (start != std::string::npos && end != std::string::npos) {
-                    std::string mangled = frame.substr(start + 1, end - start - 1);
-                    int status;
-                    char* demangled = abi::__cxa_demangle(mangled.c_str(), nullptr, nullptr, &status);
-                    if (status == 0 && demangled) {
-                        frame = frame.substr(0, start + 1) + demangled + frame.substr(end);
-                        free(demangled);
-                    }
-                }
-                frame_info = frame;
-            }
-
-            char buf[16];
-            snprintf(buf, sizeof(buf), " #%d ", i - skip_frames);
-            result += buf + frame_info + "\n";
-            if (!inline_chain.empty()) {
-                result += inline_chain;
-            }
-        }
-        free(symbols);
-    }
-#else
-    result = "(stack traces are only available on Linux)\n";
-#endif
-    return result;
-}
-
-// AssertionError constructor helper
-static std::string build_assert_message(const char* condition, const char* file, int line) {
-    std::string msg = "Assertion failed: " + std::string(condition) + "\n";
-    msg += " Location: " + std::string(file) + ":" + std::to_string(line) + "\n";
-    msg += get_stacktrace(3);
-    return msg;
-}
-
-AssertionError::AssertionError(const char* condition, const char* file, int line)
-    : std::runtime_error(build_assert_message(condition, file, line)),
-      condition_(condition), file_(file), line_(line) {}
-
-[[noreturn]] void assert_impl(const char* condition, const char* file, int line) {
-    LOG_ERROR("\n========================================");
-    LOG_ERROR("Assertion failed: %s", condition);
-    LOG_ERROR("Location: %s:%d", file, line);
-    LOG_ERROR("%s", get_stacktrace(2).c_str());
-    LOG_ERROR("========================================\n");
-
-    throw AssertionError(condition, file, line);
-}
diff --git a/src/a2a3/runtime/tensormap_and_ringbuffer_unmodified/orchestration/pto_orchestration_api.h b/src/a2a3/runtime/tensormap_and_ringbuffer_unmodified/orchestration/pto_orchestration_api.h
deleted file mode 100644
index 00b4899cb..000000000
--- a/src/a2a3/runtime/tensormap_and_ringbuffer_unmodified/orchestration/pto_orchestration_api.h
+++ /dev/null
@@ -1,308 +0,0 @@
-/*
- * Copyright (c) PyPTO Contributors.
- * This program is free software, you can redistribute it and/or modify it under the terms and conditions of
- * CANN Open Software License Agreement Version 2.0 (the "License").
- * Please refer to the License for details. You may not use this file except in compliance with the License.
- * THIS SOFTWARE IS PROVIDED ON AN "AS IS" BASIS, WITHOUT WARRANTIES OF ANY KIND, EITHER EXPRESS OR IMPLIED,
- * INCLUDING BUT NOT LIMITED TO NON-INFRINGEMENT, MERCHANTABILITY, OR FITNESS FOR A PARTICULAR PURPOSE.
- * See LICENSE in the root of the software repository for the full text of the License.
- * -----------------------------------------------------------------------------------------------------------
- */
-/**
- * PTO Orchestration API - Slim header for orchestration .so files
- *
- * This header provides everything an orchestration source needs without
- * pulling in runtime implementation headers. The orchestration .so has
- * zero link dependencies on runtime .cpp files; all runtime calls go
- * through the PTO2RuntimeOps function-pointer table embedded in
- * PTO2Runtime.
- *
- * Orchestration sources include ONLY this header:
- *     #include "pto_orchestration_api.h"
- *
- * Runtime sources continue to use pto_runtime2.h (which defines the
- * full PTO2Runtime struct with all internal fields).
- */
-
-#ifndef SRC_A2A3_RUNTIME_TENSORMAP_AND_RINGBUFFER_ORCHESTRATION_PTO_ORCHESTRATION_API_H_
-#define SRC_A2A3_RUNTIME_TENSORMAP_AND_RINGBUFFER_ORCHESTRATION_PTO_ORCHESTRATION_API_H_
-
-#include
-#include
-#include
-
-// Type headers needed by orchestration
-#include "pto_submit_types.h"  // MixedKernels, INVALID_KERNEL_ID, subtask slots // NOLINT(build/include_subdir)
-#include "pto_types.h"         // Arg, TaskOutputTensors, TensorArgType // NOLINT(build/include_subdir)
-#include "task_args.h"         // ChipStorageTaskArgs, ContinuousTensor // NOLINT(build/include_subdir)
-#include "tensor.h"            // Tensor, TensorCreateInfo // NOLINT(build/include_subdir)
-
-// =============================================================================
-// Tensor Factory Helpers
-// =============================================================================
-
-/**
- * Create a Tensor for pre-allocated external memory.
- */
-inline Tensor make_tensor_external(void* addr,
-                                   const uint32_t shapes[],
-                                   uint32_t ndims,
-                                   DataType dtype = DataType::FLOAT32,
-                                   bool manual_dep = false,
-                                   int32_t version = 0) {
-    static uint32_t zero_offsets[RUNTIME_MAX_TENSOR_DIMS] = {};
-    uint64_t total = 1;
-    for (uint32_t i = 0; i < ndims; i++) {
-        total *= shapes[i];
-    }
-    return Tensor(addr,
-                  total * get_element_size(dtype),
-                  shapes,
-                  shapes,
-                  zero_offsets,
-                  ndims,
-                  dtype,
-                  version,
-                  /*is_all_offset_zero=*/true,
-                  /*is_raw_eq_shapes=*/true,
-                  manual_dep);
-}
-
-// Convert ContinuousTensor to Tensor
-static_assert(
-    CONTINUOUS_TENSOR_MAX_DIMS == RUNTIME_MAX_TENSOR_DIMS, "ContinuousTensor and runtime max dims must match");
-inline Tensor from_tensor_arg(const ContinuousTensor& t, bool manual_dep = false, int32_t version = 0) {
-    return make_tensor_external(
-        reinterpret_cast<void*>(static_cast<uintptr_t>(t.data)), t.shapes, t.ndims, t.dtype, manual_dep, version);
-}
-
-// =============================================================================
-// Ops Table and Opaque Runtime
-// =============================================================================
-
-/**
- * Forward declaration — the orchestration sees PTO2Runtime as a partial
- * struct whose first field is the ops pointer. The full definition
- * lives in pto_runtime2.h (used only by runtime .cpp files).
- */
-typedef struct PTO2Runtime PTO2Runtime;
-
-#ifdef __cplusplus
-extern "C" {
-#endif
-
-/**
- * Framework-internal TLS bridge.
- *
- * The executor binds the current thread's runtime before invoking
- * aicpu_orchestration_entry(), so orchestration helpers can fetch the
- * current PTO2Runtime without explicit parameter threading.
- */
-PTO2Runtime* pto2_framework_current_runtime(void);
-void pto2_framework_bind_runtime(PTO2Runtime* rt);
-
-#ifdef __cplusplus
-}
-#endif
-
-/**
- * Function-pointer table for runtime operations.
- * Populated by the runtime; called by orchestration through inline wrappers.
- */
-typedef struct PTO2RuntimeOps {
-    TaskOutputTensors (*submit_task)(PTO2Runtime* rt, const MixedKernels& mixed_kernels, const Arg& args);
-    void (*scope_begin)(PTO2Runtime* rt);
-    void (*scope_end)(PTO2Runtime* rt);
-    void (*orchestration_done)(PTO2Runtime* rt);
-    bool (*is_fatal)(PTO2Runtime* rt);
-
-    // Logging (populated by runtime, called by orchestration)
-    void (*log_error)(const char* func, const char* fmt, ...);
-    void (*log_warn)(const char* func, const char* fmt, ...);
-    void (*log_info)(const char* func, const char* fmt, ...);
-    void (*log_debug)(const char* func, const char* fmt, ...);
-    void (*log_always)(const char* func, const char* fmt, ...);
-
-    // Cross-layer data access (orchestration reads/writes tensor values via runtime)
-    // Placed after logging to avoid shifting hot-path field offsets.
-    uint64_t (*get_tensor_data)(PTO2Runtime* rt, const Tensor& tensor, uint32_t ndims, const uint32_t indices[]);
-    void (*set_tensor_data)(
-        PTO2Runtime* rt, const Tensor& tensor, uint32_t ndims, const uint32_t indices[], uint64_t value);
-} PTO2RuntimeOps;
-
-/**
- * Partial PTO2Runtime definition for orchestration.
- *
- * Only the ops pointer is visible. The real struct (in pto_runtime2.h)
- * has the same first field, so accessing rt->ops through this definition
- * is well-defined (C struct layout guarantee).
- */
-struct PTO2Runtime {
-    const PTO2RuntimeOps* ops;
-};
-
-// =============================================================================
-// Inline Convenience Wrappers (call through ops table)
-// =============================================================================
-
-static inline PTO2Runtime* pto2_current_runtime() { return pto2_framework_current_runtime(); }
-
-static inline TaskOutputTensors pto2_rt_submit_task(const MixedKernels& mixed_kernels, const Arg& args) {
-    PTO2Runtime* rt = pto2_current_runtime();
-    return rt->ops->submit_task(rt, mixed_kernels, args);
-}
-
-/**
- * Convenience wrapper: submit an AIC-only task.
- */
-static inline TaskOutputTensors pto2_rt_submit_aic_task(int32_t kernel_id, const Arg& args) {
-    PTO2Runtime* rt = pto2_current_runtime();
-    MixedKernels mk;
-    mk.aic_kernel_id = kernel_id;
-    return rt->ops->submit_task(rt, mk, args);
-}
-
-/**
- * Convenience wrapper: submit an AIV-only task (uses AIV0 slot).
- */
-static inline TaskOutputTensors pto2_rt_submit_aiv_task(int32_t kernel_id, const Arg& args) {
-    PTO2Runtime* rt = pto2_current_runtime();
-    MixedKernels mk;
-    mk.aiv0_kernel_id = kernel_id;
-    return rt->ops->submit_task(rt, mk, args);
-}
-
-static inline void pto2_rt_scope_begin() {
-    PTO2Runtime* rt = pto2_current_runtime();
-    rt->ops->scope_begin(rt);
-}
-
-static inline void pto2_rt_scope_end() {
-    PTO2Runtime* rt = pto2_current_runtime();
-    rt->ops->scope_end(rt);
-}
-
-static inline void pto2_rt_orchestration_done() {
-    PTO2Runtime* rt = pto2_current_runtime();
-    rt->ops->orchestration_done(rt);
-}
-
-static inline bool pto2_rt_is_fatal() {
-    PTO2Runtime* rt = pto2_current_runtime();
-    return rt->ops->is_fatal(rt);
-}
-
-// =============================================================================
-// Logging Macros for Orchestration (call through ops table)
-// =============================================================================
-
-#define LOG_ERROR(fmt, ...) pto2_current_runtime()->ops->log_error(__FUNCTION__, fmt, ##__VA_ARGS__)
-#define LOG_WARN(fmt, ...) pto2_current_runtime()->ops->log_warn(__FUNCTION__, fmt, ##__VA_ARGS__)
-#define LOG_INFO(fmt, ...) pto2_current_runtime()->ops->log_info(__FUNCTION__, fmt, ##__VA_ARGS__)
-#define LOG_DEBUG(fmt, ...) pto2_current_runtime()->ops->log_debug(__FUNCTION__, fmt, ##__VA_ARGS__)
-#define LOG_ALWAYS(fmt, ...) pto2_current_runtime()->ops->log_always(__FUNCTION__, fmt, ##__VA_ARGS__)
-
-// =============================================================================
-// Cross-Layer Data Access
-// =============================================================================
-
-/**
- * Read a value from a tensor at the given multi-dimensional indices.
- *
- * Default T = uint64_t preserves old behavior (raw bits).
- * Specify T to get automatic type conversion:
- *
- *     uint64_t raw = get_tensor_data(tensor, 1, idx);       // old usage unchanged
- *     float val = get_tensor_data<float>(tensor, 1, idx);   // typed read
- *
- * If the tensor has a producer in TensorMap, spin-waits until the producer
- * task completes before reading. External tensors (make_tensor_external)
- * are read immediately without waiting.
- */
-template <typename T = uint64_t>
-static inline T get_tensor_data(const Tensor& tensor, uint32_t ndims, const uint32_t indices[]) {
-    PTO2Runtime* rt = pto2_current_runtime();
-    return from_u64<T>(rt->ops->get_tensor_data(rt, tensor, ndims, indices));
-}
-
-/**
- * Write a value to a tensor at the given multi-dimensional indices.
- *
- * Type is deduced from the value argument; uint64_t by default:
- *
- *     set_tensor_data(tensor, 1, idx, raw_u64);   // old usage unchanged
- *     set_tensor_data(tensor, 1, idx, 42.0f);     // typed write (T = float)
- *
- * If the tensor has a producer in TensorMap, spin-waits until the producer
- * and all its consumers complete before writing (WAW + WAR safety).
- * External tensors (make_tensor_external) with no TensorMap entry are
- * written immediately without waiting.
- *
- * Limitation: TensorMap only tracks producers (OUTPUT/INOUT), not consumers
- * that used the tensor as INPUT. If a kernel reads this tensor as INPUT
- * (not INOUT) and the tensor has no TensorMap producer entry, set_tensor_data
- * cannot detect the reader and may cause a data race.
- *
- * To ensure WAR safety for all access patterns, use add_inout() instead of
- * add_input() for kernel parameters that may later be written via
- * set_tensor_data. INOUT creates a TensorMap entry that enables automatic
- * consumer tracking via fanout_refcount.
- *
- * The tensor must already have an allocated buffer (addr != 0).
- * For runtime-created outputs, call this only on the Tensor returned by
- * add_output(TensorCreateInfo) after submit returns.
- */
-template <typename T>
-static inline void set_tensor_data(const Tensor& tensor, uint32_t ndims, const uint32_t indices[], T value) {
-    PTO2Runtime* rt = pto2_current_runtime();
-    rt->ops->set_tensor_data(rt, tensor, ndims, indices, to_u64(value));
-}
-
-// =============================================================================
-// C++ Scope Guards and Macros
-// =============================================================================
-
-/**
- * RAII Scope Guard (calls through ops table)
- */
-class PTO2ScopeGuard {
-public:  // NOLINT(whitespace/indent)
-    PTO2ScopeGuard() : rt_(pto2_current_runtime()) { rt_->ops->scope_begin(rt_); }
-    ~PTO2ScopeGuard() { rt_->ops->scope_end(rt_); }
-
-private:  // NOLINT(whitespace/indent)
-    PTO2Runtime* rt_;
-};
-
-#define _PTO2_CONCATENATE_IMPL(x, y) x##y
-#define _PTO2_CONCATENATE(x, y) _PTO2_CONCATENATE_IMPL(x, y)
-
-#define PTO2_SCOPE_GUARD() [[maybe_unused]] PTO2ScopeGuard _PTO2_CONCATENATE(scope_guard_, __COUNTER__)
-
-/**
- * Scoped block macro:
- *     PTO2_SCOPE() {
- *         pto2_rt_submit_task(...);
- *     }
- */
-#define PTO2_SCOPE() if (PTO2_SCOPE_GUARD(); true)
-
-// =============================================================================
-// Orchestration Config
-// =============================================================================
-
-/**
- * Configuration exported by orchestration .so via aicpu_orchestration_config().
- * The executor reads these values to set up shared memory and runtime.
- *
- * This struct is defined identically in pto_runtime2.h (with an include
- * guard) so the executor can use the same type without including this header.
- */
-#ifndef PTO2_ORCHESTRATION_CONFIG_DEFINED
-#define PTO2_ORCHESTRATION_CONFIG_DEFINED
-struct PTO2OrchestrationConfig {
-    int expected_arg_count;
-};
-#endif
-
-#endif  // SRC_A2A3_RUNTIME_TENSORMAP_AND_RINGBUFFER_ORCHESTRATION_PTO_ORCHESTRATION_API_H_
diff --git a/src/a2a3/runtime/tensormap_and_ringbuffer_unmodified/runtime/common.h b/src/a2a3/runtime/tensormap_and_ringbuffer_unmodified/runtime/common.h
deleted file mode 100644
index 1a5af9de3..000000000
--- a/src/a2a3/runtime/tensormap_and_ringbuffer_unmodified/runtime/common.h
+++ /dev/null
@@ -1,93 +0,0 @@
-/*
- * Copyright (c) PyPTO Contributors.
- * This program is free software, you can redistribute it and/or modify it under the terms and conditions of
- * CANN Open Software License Agreement Version 2.0 (the "License").
- * Please refer to the License for details. You may not use this file except in compliance with the License.
- * THIS SOFTWARE IS PROVIDED ON AN "AS IS" BASIS, WITHOUT WARRANTIES OF ANY KIND, EITHER EXPRESS OR IMPLIED,
- * INCLUDING BUT NOT LIMITED TO NON-INFRINGEMENT, MERCHANTABILITY, OR FITNESS FOR A PARTICULAR PURPOSE.
- * See LICENSE in the root of the software repository for the full text of the License.
- * -----------------------------------------------------------------------------------------------------------
- */
-
-#pragma once
-
-#include
-#include
-
-#include
-#include
-
-/**
- * Get the current stack trace, including file paths and line numbers.
- * Implemented in common.cpp.
- */
-std::string get_stacktrace(int skip_frames = 1);
-
-/**
- * Assertion failure exception with condition, file, line, and stack trace.
- */
-class AssertionError : public std::runtime_error {
- public:
-    AssertionError(const char* condition, const char* file, int line);
-
-    const char* condition() const { return condition_; }
-    const char* file() const { return file_; }
-    int line() const { return line_; }
-
- private:
-    const char* condition_;
-    const char* file_;
-    int line_;
-};
-
-/**
- * Assertion failure handler.
- * Implemented in common.cpp.
- */
-[[noreturn]] void assert_impl(const char* condition, const char* file, int line);
-
-/**
- * debug_assert macro:
- * checks the condition in debug builds and throws with a stack trace on failure.
- * It is a no-op in release builds (NDEBUG).
- */
-#ifdef NDEBUG
-#define debug_assert(cond) ((void)0)
-#else
-#define debug_assert(cond) \
-    do { \
-        if (!(cond)) { \
-            assert_impl(#cond, __FILE__, __LINE__); \
-        } \
-    } while (0)
-#endif
-
-/**
- * always_assert macro:
- * checks the condition in both debug and release builds.
- */
-#define always_assert(cond) \
-    do { \
-        if (!(cond)) { \
-            assert_impl(#cond, __FILE__, __LINE__); \
-        } \
-    } while (0)
-
-#define PTO_PRAGMA(x) _Pragma(#x)
-
-#if defined(__clang__)
-#define MAYBE_UNINITIALIZED_BEGIN \
-    PTO_PRAGMA(clang diagnostic push) \
-    PTO_PRAGMA(clang diagnostic ignored "-Wuninitialized") \
-    PTO_PRAGMA(clang diagnostic ignored "-Wsometimes-uninitialized")
-#define MAYBE_UNINITIALIZED_END PTO_PRAGMA(clang diagnostic pop)
-#elif defined(__GNUC__)
-#define MAYBE_UNINITIALIZED_BEGIN \
-    PTO_PRAGMA(GCC diagnostic push) \
-    PTO_PRAGMA(GCC diagnostic ignored "-Wuninitialized") \
-    PTO_PRAGMA(GCC diagnostic ignored "-Wmaybe-uninitialized")
-#define MAYBE_UNINITIALIZED_END PTO_PRAGMA(GCC diagnostic pop)
-#else
-#define MAYBE_UNINITIALIZED_BEGIN
-#define MAYBE_UNINITIALIZED_END
-#endif
diff --git a/src/a2a3/runtime/tensormap_and_ringbuffer_unmodified/runtime/pto2_dispatch_payload.h b/src/a2a3/runtime/tensormap_and_ringbuffer_unmodified/runtime/pto2_dispatch_payload.h
deleted file mode 100644
index 914a3f92a..000000000
--- a/src/a2a3/runtime/tensormap_and_ringbuffer_unmodified/runtime/pto2_dispatch_payload.h
+++ /dev/null
@@ -1,85 +0,0 @@
-/*
- * Copyright (c) PyPTO Contributors.
- * This program is free software, you can redistribute it and/or modify it under the terms and conditions of
- * CANN Open Software License Agreement Version 2.0 (the "License").
- * Please refer to the License for details. You may not use this file except in compliance with the License.
- * THIS SOFTWARE IS PROVIDED ON AN "AS IS" BASIS, WITHOUT WARRANTIES OF ANY KIND, EITHER EXPRESS OR IMPLIED,
- * INCLUDING BUT NOT LIMITED TO NON-INFRINGEMENT, MERCHANTABILITY, OR FITNESS FOR A PARTICULAR PURPOSE.
- * See LICENSE in the root of the software repository for the full text of the License.
- * -----------------------------------------------------------------------------------------------------------
- */
-
-/**
- * @file pto2_dispatch_payload.h
- * @brief Per-core dispatch payload for AICore kernel execution
- *
- * PTO2DispatchPayload holds the kernel function address, a per-core args[]
- * array, and embedded SPMD context (LocalContext + GlobalContext). AICPU
- * maintains a static array of these (one per core).
- *
- * GlobalContext (sub_block_id) is initialized once at runtime startup via
- * init_global_context() and never modified afterwards.
- *
- * LocalContext (block_idx, block_num) and args[] are rebuilt by
- * build_payload() before each dispatch. Both context struct pointers are
- * written into the args[] suffix on every dispatch (since args[] is rebuilt
- * entirely each time).
- *
- * AICore caches a pointer to its per-core slot at startup and reads from
- * it on each dispatch. The struct is cache-line aligned to avoid false
- * sharing across concurrently dispatched cores.
- *
- * The DATA_MAIN_BASE register protocol is unchanged from the base runtime:
- * a monotonically increasing reg_task_id signals new work to AICore.
- */
-
-#pragma once
-
-#include
-
-#include "intrinsic.h"
-#include "pto_types.h"
-
-/** Max dispatch arguments: 128 scalars + up to 16 tensor pointers + ext params */
-#ifndef PTO2_DISPATCH_MAX_ARGS
-#define PTO2_DISPATCH_MAX_ARGS (MAX_TENSOR_ARGS + MAX_SCALAR_ARGS + PTO2_EXT_PARAMS_COUNT)
-#endif
-
-#ifndef PTO2_ALIGN_UP
-#define PTO2_ALIGN_UP(x, align) (((x) + (align) - 1) & ~((align) - 1))
-#endif
-
-// Verify hardcoded indices in intrinsic.h match the computed values.
-static_assert((MAX_TENSOR_ARGS + MAX_SCALAR_ARGS) == SPMD_LOCAL_CONTEXT_INDEX,
-              "LOCAL_CONTEXT_INDEX out of sync with intrinsic.h");
-static_assert((MAX_TENSOR_ARGS + MAX_SCALAR_ARGS + 1) == SPMD_GLOBAL_CONTEXT_INDEX,
-              "GLOBAL_CONTEXT_INDEX out of sync with intrinsic.h");
-
-/**
- * Per-core dispatch payload: function address + args[] + SPMD context.
- *
- * AICPU maintains a static array s_pto2_payload_per_core[RUNTIME_MAX_WORKER].
- * AICore caches a pointer to its per-core slot at startup (via Handshake.task)
- * and reads from it on each dispatch.
- *
- * The struct is cache-line aligned to prevent false sharing across
- * concurrently dispatched cores.
- */
-struct alignas(64) PTO2DispatchPayload {
-    uint64_t function_bin_addr;            /**< Kernel entry address in GM (set by Scheduler) */
-    uint64_t args[PTO2_DISPATCH_MAX_ARGS]; /**< Kernel arguments (GM pointers + scalars + ext params) */
-
-    /** Per-dispatch context: block_idx and block_num.
-     *  Written by build_payload() before each dispatch.
-     *  args[SPMD_LOCAL_CONTEXT_INDEX] points here. */
-    LocalContext local_context;
-
-    /** Per-core global context: sub_block_id (AIV lane identity).
-     *  Initialized once by init_global_context() at runtime startup.
-     *  args[SPMD_GLOBAL_CONTEXT_INDEX] points here. */
-    GlobalContext global_context;
-
-    static_assert(sizeof(args[0]) == 8);
-    static_assert(PTO2_ALIGN_UP((MAX_TENSOR_ARGS + MAX_SCALAR_ARGS) * sizeof(args[0]), 64) ==
-                  (MAX_TENSOR_ARGS + MAX_SCALAR_ARGS) * sizeof(args[0]));
-};
diff --git a/src/a2a3/runtime/tensormap_and_ringbuffer_unmodified/runtime/pto_orchestrator.cpp b/src/a2a3/runtime/tensormap_and_ringbuffer_unmodified/runtime/pto_orchestrator.cpp
deleted file mode 100644
index 9a6b5fad8..000000000
--- a/src/a2a3/runtime/tensormap_and_ringbuffer_unmodified/runtime/pto_orchestrator.cpp
+++ /dev/null
@@ -1,759 +0,0 @@
-/*
- * Copyright (c) PyPTO Contributors.
- * This program is free software, you can redistribute it and/or modify it under the terms and conditions of
- * CANN Open Software License Agreement Version 2.0 (the "License").
- * Please refer to the License for details. You may not use this file except in compliance with the License.
- * THIS SOFTWARE IS PROVIDED ON AN "AS IS" BASIS, WITHOUT WARRANTIES OF ANY KIND, EITHER EXPRESS OR IMPLIED,
- * INCLUDING BUT NOT LIMITED TO NON-INFRINGEMENT, MERCHANTABILITY, OR FITNESS FOR A PARTICULAR PURPOSE.
- * See LICENSE in the root of the software repository for the full text of the License.
- * -----------------------------------------------------------------------------------------------------------
- */
-
-/**
- * PTO Runtime2 - Orchestrator Implementation
- *
- * Implements orchestrator state management, scope handling, and task submission.
- *
- * Based on: docs/RUNTIME_LOGIC.md
- */
-
-#include "pto_orchestrator.h"
-
-#include
-#include
-#include
-#include
-#include
-
-#include "common/unified_log.h"
-#include "pto_runtime2_types.h"
-#include "pto_shared_memory.h"
-#include "pto_tensormap.h"
-#include "pto_types.h"
-#include "tensor.h"
-
-// =============================================================================
-// Orchestrator Profiling (compile-time toggle)
-// =============================================================================
-#if PTO2_ORCH_PROFILING
-#include "aicpu/device_time.h"
-#include "aicpu/performance_collector_aicpu.h"
-// Weak fallback for builds that don't link device_time.cpp (e.g. host).
-// The strong symbol from platform/.../device_time.cpp wins in the AICPU build.
-//
-// IMPORTANT: visibility("hidden") is required to prevent the HOST .so from
-// exporting this weak fallback into the global dynamic symbol table via
-// RTLD_GLOBAL. Without it, when the AICPU .so is loaded and its PLT entry
-// for get_sys_cnt_aicpu is resolved, the dynamic linker finds the HOST .so's
-// weak definition first (already in global table) and uses it — returning 0.
-// With hidden visibility, the HOST .so does not export this symbol globally,
-// so the AICPU .so's PLT resolves to its own strong definition from
-// device_time.cpp.
-__attribute__((weak, visibility("hidden"))) uint64_t get_sys_cnt_aicpu() { return 0; }
-// Weak fallback for builds that don't link performance_collector_aicpu.cpp.
-// The strong symbol from the AICPU build wins when profiling is available.
-// Also hidden to prevent HOST .so from polluting the global symbol table.
-__attribute__((weak, visibility("hidden"))) void perf_aicpu_record_orch_phase(
-    AicpuPhaseId, uint64_t, uint64_t, uint32_t, uint64_t) {}
-// Accumulated cycles per sub-step (only needed for ORCH_PROFILING export)
-static uint64_t g_orch_sync_cycle = 0;       // tensormap sync
-static uint64_t g_orch_alloc_cycle = 0;      // unified task+heap alloc
-static uint64_t g_orch_args_cycle = 0;       // param copy
-static uint64_t g_orch_lookup_cycle = 0;     // tensormap lookup + dep building
-static uint64_t g_orch_insert_cycle = 0;     // tensormap insert
-static uint64_t g_orch_fanin_cycle = 0;      // fanin list + early-return check
-static uint64_t g_orch_scope_end_cycle = 0;  // scope_end overhead
-static int64_t g_orch_submit_count = 0;
-static uint32_t g_orch_submit_idx = 0;
-uint64_t g_orch_alloc_wait_cycle = 0;
-uint64_t g_orch_fanin_wait_cycle = 0;
-uint64_t g_orch_alloc_atomic_count = 0;
-uint64_t g_orch_args_atomic_count = 0;
-uint64_t g_orch_fanin_atomic_count = 0;
-uint64_t g_orch_finalize_atomic_count = 0;
-uint64_t g_orch_scope_end_atomic_count = 0;
-#define CYCLE_COUNT_START() uint64_t _t0 = get_sys_cnt_aicpu(), _t1
-#define CYCLE_COUNT_LAP(acc) \
-    do { \
-        _t1 = get_sys_cnt_aicpu(); \
-        acc += (_t1 - _t0); \
-        _t0 = _t1; \
-    } while (0)
-#define CYCLE_COUNT_LAP_RECORD(acc, phase_id, tid) \
-    do { \
-        _t1 = get_sys_cnt_aicpu(); \
-        acc += (_t1 - _t0); \
-        perf_aicpu_record_orch_phase((phase_id), _t0, _t1, g_orch_submit_idx, (tid)); \
-        _t0 = _t1; \
-    } while (0)
-#elif PTO2_PROFILING
-#include "aicpu/device_time.h"
-#include "aicpu/performance_collector_aicpu.h"
-__attribute__((weak, visibility("hidden"))) uint64_t get_sys_cnt_aicpu() { return 0; }
-__attribute__((weak, visibility("hidden"))) void perf_aicpu_record_orch_phase(
-    AicpuPhaseId, uint64_t, uint64_t, uint32_t, uint64_t) {}
-// submit_idx needed for swimlane task_id tagging (no cycle accumulation at this level)
-static uint32_t g_orch_submit_idx = 0;
-#define CYCLE_COUNT_START() \
-    bool _prof_active = orch->enable_profiling; \
-    uint64_t _t0 = _prof_active ? get_sys_cnt_aicpu() : 0, _t1 = 0
-#define CYCLE_COUNT_LAP(acc) \
-    do { \
-    } while (0)
-#define CYCLE_COUNT_LAP_RECORD(acc, phase_id, tid) \
-    do { \
-        if (_prof_active) { \
-            _t1 = get_sys_cnt_aicpu(); \
-            perf_aicpu_record_orch_phase((phase_id), _t0, _t1, g_orch_submit_idx, (tid)); \
-            _t0 = _t1; \
-        } \
-    } while (0)
-#else
-#define CYCLE_COUNT_START()
-#define CYCLE_COUNT_LAP(acc)
-#define CYCLE_COUNT_LAP_RECORD(acc, phase_id, tid)
-#endif
-
-static bool pto2_append_fanin_or_fail(PTO2OrchestratorState* orch,
-                                      PTO2TaskId task_id,
-                                      int32_t tensor_arg_index,
-                                      TensorArgType ptype,
-                                      PTO2TaskSlotState* prod_state,
-                                      PTO2TaskSlotState* fanin_states[],
-                                      int32_t* fanin_count,
-                                      const char* reason) {
-    for (int32_t j = 0; j < *fanin_count; j++) {
-        if (fanin_states[j] == prod_state) {
-            return true;
-        }
-    }
-
-    if (*fanin_count >= PTO2_MAX_INPUTS) {
-        LOG_ERROR("========================================");
-        LOG_ERROR("FATAL: Dependency Overflow Detected!");
-        LOG_ERROR("========================================");
-        LOG_ERROR("Task requires more than PTO2_MAX_INPUTS unique fanin dependencies.");
-        LOG_ERROR("  task_id.raw: %" PRIu64, task_id.raw);
-        LOG_ERROR("  tensor_arg_index: %d", tensor_arg_index);
-        LOG_ERROR("  tensor_arg_type: %d", static_cast<int>(ptype));
-        LOG_ERROR("  fanin_count: %d / %d", *fanin_count, PTO2_MAX_INPUTS);
-        LOG_ERROR("  reason: %s", reason);
-        LOG_ERROR("This is a runtime dependency-tracking limit.");
-        LOG_ERROR("========================================");
-        orch->sm_handle->header->orch_error_code.store(PTO2_ERROR_DEPENDENCY_OVERFLOW, std::memory_order_release);
-        orch->fatal = true;
-        return false;
-    }
-
-    fanin_states[(*fanin_count)++] = prod_state;
-    return true;
-}
-
-// =============================================================================
-// Orchestrator Initialization
-// =============================================================================
-
-bool
pto2_orchestrator_init(PTO2OrchestratorState* orch,
-                       PTO2SharedMemoryHandle* sm_handle,
-                       void* gm_heap,
-                       uint64_t heap_size,
-                       int32_t dep_pool_capacity) {
-    *orch = PTO2OrchestratorState{};
-
-    orch->sm_handle = sm_handle;
-    orch->gm_heap_base = gm_heap;
-    orch->gm_heap_size = heap_size * PTO2_MAX_RING_DEPTH;
-    orch->fatal = false;
-
-    // Initialize per-ring resources
-    for (int r = 0; r < PTO2_MAX_RING_DEPTH; r++) {
-        void* ring_heap_base = reinterpret_cast<char*>(gm_heap) + r * heap_size;
-        auto& fc = sm_handle->header->rings[r].fc;
-
-        // Initialize unified task allocator
-        orch->rings[r].task_allocator.init(sm_handle->task_descriptors[r],
-                                           sm_handle->header->rings[r].task_window_size,
-                                           &fc.current_task_index,
-                                           &fc.last_task_alive,
-                                           ring_heap_base,
-                                           heap_size,
-                                           &sm_handle->header->orch_error_code);
-
-        // Allocate and initialize dependency list pool (per-ring)
-        PTO2DepListEntry* dep_entries =
-            reinterpret_cast<PTO2DepListEntry*>(calloc(dep_pool_capacity, sizeof(PTO2DepListEntry)));
-        if (!dep_entries) {
-            // Cleanup previously allocated rings
-            for (int j = 0; j < r; j++) {
-                free(orch->rings[j].dep_pool.base);
-            }
-            return false;
-        }
-        orch->rings[r].dep_pool.init(dep_entries, dep_pool_capacity, &sm_handle->header->orch_error_code);
-    }
-
-    // Initialize TensorMap with per-ring task window sizes
-    int32_t task_window_sizes[PTO2_MAX_RING_DEPTH];
-    for (int r = 0; r < PTO2_MAX_RING_DEPTH; r++) {
-        task_window_sizes[r] = sm_handle->header->rings[r].task_window_size;
-    }
-    if (!orch->tensor_map.init_default(task_window_sizes)) {
-        for (int r = 0; r < PTO2_MAX_RING_DEPTH; r++) {
-            free(orch->rings[r].dep_pool.base);
-        }
-        return false;
-    }
-    orch->tensor_map.orch = orch;
-
-    // Initialize scope stack: one flat buffer for task IDs + one array for begin offsets
-    uint64_t max_depth = PTO2_MAX_SCOPE_DEPTH;
-    int32_t init_cap = PTO2_SCOPE_TASKS_INIT_CAP;
-    orch->scope_tasks = reinterpret_cast<PTO2TaskSlotState**>(malloc(init_cap * sizeof(PTO2TaskSlotState*)));
-    orch->scope_begins =
        reinterpret_cast<int32_t*>(malloc(max_depth * sizeof(int32_t)));
-    if (!orch->scope_tasks || !orch->scope_begins) {
-        free(orch->scope_tasks);
-        free(orch->scope_begins);
-        for (int r = 0; r < PTO2_MAX_RING_DEPTH; r++) {
-            free(orch->rings[r].dep_pool.base);
-        }
-        orch->tensor_map.destroy();
-        return false;
-    }
-    orch->scope_tasks_size = 0;
-    orch->scope_tasks_capacity = init_cap;
-    orch->scope_stack_top = -1;
-    orch->scope_stack_capacity = max_depth;
-
-    return true;
-}
-
-void pto2_orchestrator_destroy(PTO2OrchestratorState* orch) {
-    orch->tensor_map.destroy();
-
-    for (int r = 0; r < PTO2_MAX_RING_DEPTH; r++) {
-        free(orch->rings[r].dep_pool.base);
-        orch->rings[r].dep_pool.base = NULL;
-    }
-
-    free(orch->scope_tasks);
-    orch->scope_tasks = NULL;
-    free(orch->scope_begins);
-    orch->scope_begins = NULL;
-}
-
-void pto2_orchestrator_set_scheduler(PTO2OrchestratorState* orch, PTO2SchedulerState* scheduler) {
-    orch->scheduler = scheduler;
-}
-
-// =============================================================================
-// Scope Management
-// =============================================================================
-
-static void scope_tasks_push(PTO2OrchestratorState* orch, PTO2TaskSlotState* task_slot_state) {
-    if (orch->scope_tasks_size >= orch->scope_tasks_capacity) {
-        int32_t new_cap = orch->scope_tasks_capacity * 2;
-        PTO2TaskSlotState** new_buf =
-            reinterpret_cast<PTO2TaskSlotState**>(realloc(orch->scope_tasks, new_cap * sizeof(PTO2TaskSlotState*)));
-        assert(new_buf && "Failed to grow scope task buffer");
-        orch->scope_tasks = new_buf;
-        orch->scope_tasks_capacity = new_cap;
-    }
-    orch->scope_tasks[orch->scope_tasks_size++] = task_slot_state;
-}
-
-void pto2_scope_begin(PTO2OrchestratorState* orch) {
-    if (orch->fatal) {
-        return;
-    }
-    assert(orch->scope_stack_top < static_cast<int32_t>(orch->scope_stack_capacity - 1) && "Scope stack overflow");
-
-    ++orch->scope_stack_top;
-    orch->scope_begins[orch->scope_stack_top] = orch->scope_tasks_size;
-}
-
-void
pto2_scope_end(PTO2OrchestratorState* orch) { - if (orch->fatal) { - return; - } - assert(orch->scope_stack_top >= 0 && "Scope stack underflow"); - -#if PTO2_ORCH_PROFILING - uint64_t _se0 = get_sys_cnt_aicpu(); -#endif - - int32_t begin = orch->scope_begins[orch->scope_stack_top--]; - int32_t count = orch->scope_tasks_size - begin; - - if (orch->scheduler && count > 0) { - orch->scheduler->on_scope_end(&orch->scope_tasks[begin], count); - } - - // Rewind the task buffer — these entries are no longer needed - orch->scope_tasks_size = begin; - -#if PTO2_ORCH_PROFILING - uint64_t _se1 = get_sys_cnt_aicpu(); - g_orch_scope_end_cycle += (_se1 - _se0); - // perf_aicpu_record_orch_phase(AicpuPhaseId::ORCH_SCOPE_END, _se0, _se1, g_orch_submit_idx, -1); -#endif -} - -// ============================================================================= -// Task Submission -// ============================================================================= -TaskOutputTensors pto2_submit_mixed_task( - PTO2OrchestratorState* orch, const MixedKernels& mixed_kernels, const Arg& args) { - CYCLE_COUNT_START(); - - TaskOutputTensors result; - - // Fast path after fatal error — all subsequent submits are no-ops - if (orch->fatal) { - return result; - } - - // Validate Arg construction (errors recorded by add_input/add_output/etc.) - if (args.has_error) { - LOG_ERROR("========================================"); - LOG_ERROR("FATAL: Invalid Arg Detected!"); - LOG_ERROR("========================================"); - LOG_ERROR("Error: %s", args.error_msg ? 
args.error_msg : "(unknown)"); - LOG_ERROR(" tensor_count: %d, scalar_count: %d", args.tensor_count(), args.scalar_count()); - LOG_ERROR("This is a bug in the orchestration code."); - LOG_ERROR("========================================"); - orch->sm_handle->header->orch_error_code.store(PTO2_ERROR_INVALID_ARGS, std::memory_order_release); - orch->fatal = true; - return result; - } - - // Determine which ring this task belongs to - uint8_t ring_id = orch->current_ring_id(); - auto& allocator = orch->rings[ring_id].task_allocator; - PTO2SchedulerState* sched = orch->scheduler; - PTO2RingFlowControl& fc = orch->sm_handle->header->rings[ring_id].fc; - - // === Validate submit inputs === - uint8_t active_mask = pto2_mixed_kernels_to_active_mask(mixed_kernels); - always_assert(active_mask != 0 && "MixedKernels must have at least one active slot"); - - int16_t block_num = args.launch_spec.block_num(); - always_assert(block_num >= 1 && "block_num must be >= 1"); - - // Normalize single-AIV tasks: if only aiv1 is set (no aic, no aiv0), move - // it to the aiv0 slot. This guarantees the dispatch path can always use - // PTO2SubtaskSlot::AIV0 for single-AIV shapes without inspecting active_mask. - // Mixed tasks (AIC+AIV) keep their original AIV identity so the correct - // hardware channel (AIV0→AIC vs AIV1→AIC) is used at dispatch time. 
-    MixedKernels normalized = mixed_kernels;
-    bool has_aic = (active_mask & PTO2_SUBTASK_MASK_AIC) != 0;
-    bool has_aiv0 = (active_mask & PTO2_SUBTASK_MASK_AIV0) != 0;
-    bool has_aiv1 = (active_mask & PTO2_SUBTASK_MASK_AIV1) != 0;
-    if (!has_aic && has_aiv1 && !has_aiv0) {
-        normalized.aiv0_kernel_id = normalized.aiv1_kernel_id;
-        normalized.aiv1_kernel_id = INVALID_KERNEL_ID;
-        active_mask = pto2_mixed_kernels_to_active_mask(normalized);
-    }
-
-    // Submission without an open scope is illegal
-    always_assert(orch->scope_stack_top >= 0 && "Cannot submit task outside a scope");
-
-    // === Scope deadlock pre-check ===
-    // Tasks within a scope hold a fanout_count reference released only at scope_end.
-    // If scope task count >= window_size, no slots can ever be reclaimed → deadlock.
-    {
-        int32_t scope_task_count = orch->scope_tasks_size - orch->scope_begins[orch->scope_stack_top];
-        if (scope_task_count >= allocator.window_size() - 1) {
-            int32_t active_count = allocator.active_count();
-
-            LOG_ERROR("========================================");
-            LOG_ERROR("FATAL: Scope Deadlock Detected! (ring %d)", ring_id);
-            LOG_ERROR("========================================");
-            LOG_ERROR(
-                "Tasks in current scope (%d) >= task_window_size (%d).", scope_task_count, allocator.window_size());
-            LOG_ERROR("  scope_depth: %d", orch->scope_stack_top + 1);
-            LOG_ERROR("  ring_id: %d", ring_id);
-            LOG_ERROR("  scope_task_count: %d", scope_task_count);
-            LOG_ERROR("  active_tasks: %d / %d", active_count, allocator.window_size());
-            LOG_ERROR("Root Cause:");
-            LOG_ERROR("  Tasks within a scope hold a fanout_count reference that is only");
-            LOG_ERROR("  released at scope_end. When scope task count >= window_size,");
-            LOG_ERROR("  no slots can be reclaimed -> deadlock.");
-            LOG_ERROR("Solution:");
-            LOG_ERROR("  1. Reduce tasks per scope (use batching/unroll)");
-            LOG_ERROR("  2. Increase task window (current: %d)", allocator.window_size());
-            LOG_ERROR("     Compile-time: PTO2_TASK_WINDOW_SIZE in pto_runtime2_types.h");
-            LOG_ERROR("     Runtime env: PTO2_RING_TASK_WINDOW=");
-            LOG_ERROR("  3. Split work across multiple scopes");
-            LOG_ERROR("========================================");
-            orch->sm_handle->header->orch_error_code.store(PTO2_ERROR_SCOPE_DEADLOCK, std::memory_order_release);
-            orch->fatal = true;
-            return result;
-        }
-    }
-
-    // === Calculate output size (from runtime-created OUTPUT args) ===
-    uint64_t offsets[MAX_TENSOR_ARGS] = {};
-    uint64_t buffer_sizes[MAX_TENSOR_ARGS] = {};
-    int32_t total_output_size = 0;
-    for (int i = 0; i < args.tensor_count(); i++) {
-        if (args.tag(i) == TensorArgType::OUTPUT) {
-            offsets[i] = total_output_size;
-            buffer_sizes[i] = PTO2_ALIGN_UP(args.tensor(i).create_info->buffer_size_bytes(), PTO2_PACKED_OUTPUT_ALIGN);
-            total_output_size += buffer_sizes[i];
-        }
-    }
-
-    // === STEP 1: Unified alloc — task slot + packed output buffer (blocks until available) ===
-    PTO2TaskAllocResult alloc_result = allocator.alloc(total_output_size);
-    if (alloc_result.failed()) {
-        orch->fatal = true;
-        return result;
-    }
-
-    int32_t local_id = alloc_result.task_id;
-    int32_t slot = alloc_result.slot;
-    PTO2TaskId task_id = PTO2TaskId::make(ring_id, static_cast<int32_t>(local_id));
-
-    PTO2TaskDescriptor& task = allocator.task_by_slot(slot);
-    PTO2TaskPayload* payload = &orch->sm_handle->task_payloads[ring_id][slot];
-
-    // Early write-prefetch payload GM cache lines to issue RFO in background.
-    // ~130 lines of computation (lookup, insert) follow before
-    // param_copy writes, giving ample time for prefetch to complete.
-    // Use locality=3 (PSTL1KEEP) so prefetched CLs survive lookup/insert eviction.
-    for (int32_t i = 0; i < args.tensor_count(); i++) {
-        __builtin_prefetch(&payload->tensors[i], 1, 3);
-        __builtin_prefetch(reinterpret_cast<char*>(&payload->tensors[i]) + 64, 1, 3);
-    }
-    for (int32_t i = 0; i < args.scalar_count(); i += 8) {
-        __builtin_prefetch(&payload->scalars[i], 1, 3);
-    }
-    __builtin_prefetch(payload, 1, 3);
-    __builtin_prefetch(reinterpret_cast<char*>(payload) + 64, 1, 3);
-    __builtin_prefetch(reinterpret_cast<char*>(payload) + 128, 1, 3);
-
-    // Initialize slot state (scheduler-private)
-    if (sched) {
-        auto& rs = sched->ring_sched_states[ring_id];
-        PTO2TaskSlotState& slot_state = rs.get_slot_state_by_slot(slot);
-        slot_state.fanin_count = 0;
-        slot_state.fanout_head = nullptr;
-        slot_state.fanout_lock.store(0, std::memory_order_relaxed);
-        // Initial fanout_count = 1 (the owning scope holds one reference)
-        slot_state.fanout_count = 1;
-        slot_state.fanout_refcount.store(0, std::memory_order_release);
-        slot_state.fanin_refcount.store(0, std::memory_order_release);
-        slot_state.payload = payload;
-        slot_state.task = &task;
-        slot_state.active_mask = active_mask;
-        slot_state.subtask_done_mask.store(0, std::memory_order_relaxed);
-        slot_state.ring_id = ring_id;
-        scope_tasks_push(orch, &slot_state);
-    } else {
-        scope_tasks_push(orch, nullptr);
-    }
-
-    // Temporary storage for fanin (cached slot state pointers, avoids repeated ring/slot lookups)
-    PTO2TaskSlotState* fanin_states[PTO2_MAX_INPUTS];
-    int32_t fanin_count = 0;
-
-    CYCLE_COUNT_LAP_RECORD(g_orch_alloc_cycle, AicpuPhaseId::ORCH_ALLOC, task_id.raw);
-
-#if PTO2_PROFILING
-    if (total_output_size > 0) {
-        orch->buffers_allocated++;
-        orch->bytes_allocated += total_output_size;
-    }
-#endif
-
-    // === STEP 2: Sync TensorMap validity and optional cleanup ===
-    // Read current last_task_alive from shared memory for this ring
-    int32_t sm_last_task_alive = fc.last_task_alive.load(std::memory_order_acquire);
-
-    orch->tensor_map.sync_tensormap(ring_id, sm_last_task_alive);
-
-    if (sched) {
-        orch->rings[ring_id].dep_pool.reclaim(*sched, ring_id, sm_last_task_alive);
-    }
-
-    CYCLE_COUNT_LAP_RECORD(g_orch_sync_cycle, AicpuPhaseId::ORCH_SYNC, task_id.raw);
-
-    // === STEP 3: Lookup inputs + materialize runtime-created outputs ===
-    for (int i = 0; i < args.tensor_count(); i++) {
-        TensorArgType ptype = args.tag(i);
-        if (ptype == TensorArgType::OUTPUT) {
-            // Runtime-created OUTPUT tensors are not looked up in the TensorMap since they have no dependencies.
-            continue;
-        }
-
-        const Tensor* tensor = args.tensor(i).ptr;
-
-        // Step A: creator retention — all existing tensors extend their creator lifetime.
-        PTO2TaskId owner = tensor->owner_task_id;
-        if (owner.is_valid() && sched != nullptr) {
-            PTO2TaskSlotState* prod_state =
-                &sched->ring_sched_states[owner.ring()].get_slot_state_by_task_id(owner.local());
-            if (!pto2_append_fanin_or_fail(
-                    orch, task_id, i, ptype, prod_state, fanin_states, &fanin_count, "creator retention")) {
-                return result;
-            }
-        }
-
-        // Step B: only INPUT/INOUT need modifier dependency lookup.
-        if (ptype != TensorArgType::INPUT && ptype != TensorArgType::INOUT) {
-            continue;
-        }
-        if (tensor->manual_dep) {
-            continue;
-        }
-
-        PTO2LookupResult lookup_result;
-        orch->tensor_map.lookup(*tensor, lookup_result);
-
-        for (int r = 0; r < lookup_result.count; r++) {
-            PTO2TensorMapEntry& entry = *lookup_result.entries[r].entry;
-            auto overlap_status = lookup_result.entries[r].overlap_status;
-            auto prod_ring = entry.producer_task_id.ring();
-            auto prod_local = entry.producer_task_id.local();
-            PTO2TaskSlotState* prod_state = &sched->ring_sched_states[prod_ring].get_slot_state_by_task_id(prod_local);
-            if (!pto2_append_fanin_or_fail(
-                    orch, task_id, i, ptype, prod_state, fanin_states, &fanin_count, "overlap lookup")) {
-                return result;
-            }
-            if (ptype == TensorArgType::INOUT && overlap_status == OverlapStatus::COVERED) {
-                orch->tensor_map.remove_entry(entry);
-            }
-        }
-    }
-
-    CYCLE_COUNT_LAP_RECORD(g_orch_lookup_cycle, AicpuPhaseId::ORCH_LOOKUP, task_id.raw);
-
-    // === STEP 4: Register outputs/inouts in TensorMap (must be separate from lookup) ===
-    {
-        for (int i = 0; i < args.tensor_count(); i++) {
-            TensorArgType ptype = args.tag(i);
-            if (ptype == TensorArgType::INOUT || ptype == TensorArgType::OUTPUT_EXISTING) {
-                if (!args.tensor(i).ptr->manual_dep) {
-                    orch->tensor_map.insert(*args.tensor(i).ptr, task_id);
-                }
-            }
-        }
-    }
-
-    CYCLE_COUNT_LAP_RECORD(g_orch_insert_cycle, AicpuPhaseId::ORCH_INSERT, task_id.raw);
-
-    // === STEP 5: Batch-write to GM (single cache line burst) ===
-    // Deferred from allocation phase to avoid scattered GM writes that get
-    // evicted by TensorMap lookup/insert cache pressure.
-    __builtin_prefetch(&task, 1, 1);
-    task.task_id = task_id;
-    task.kernel_id[static_cast<int>(PTO2SubtaskSlot::AIC)] = normalized.aic_kernel_id;
-    task.kernel_id[static_cast<int>(PTO2SubtaskSlot::AIV0)] = normalized.aiv0_kernel_id;
-    task.kernel_id[static_cast<int>(PTO2SubtaskSlot::AIV1)] = normalized.aiv1_kernel_id;
-    task.packed_buffer_base = alloc_result.packed_base;
-    task.packed_buffer_end = alloc_result.packed_end;
-
-    // Prefetch producer slot_states and cur_slot_state (written at init but likely
-    // evicted by lookup/insert/heap). param_copy below provides hide time.
-    if (sched) {
-        auto& rs = sched->ring_sched_states[ring_id];
-        __builtin_prefetch(&rs.get_slot_state_by_slot(slot), 1, 0);
-        for (int i = 0; i < fanin_count; i++) {
-            __builtin_prefetch(fanin_states[i], 1, 0);
-        }
-    }
-
-    payload->init(args, result, alloc_result.packed_base, offsets, buffer_sizes);
-
-    // Write owner_task_id into materialized OUTPUT tensors so creator-only dependency
-    // tracking remains available even when manual_dep skips OverlapMap publication.
-    for (int i = 0; i < args.tensor_count(); i++) {
-        if (args.tag(i) == TensorArgType::OUTPUT) {
-            payload->tensors[i].owner_task_id = task_id;
-        }
-    }
-
-    CYCLE_COUNT_LAP_RECORD(g_orch_args_cycle, AicpuPhaseId::ORCH_PARAMS, task_id.raw);
-#if PTO2_ORCH_PROFILING
-    g_orch_args_atomic_count += 2;  // fanout_lock.store + fanout_count.store
-#endif
-
-    // === STEP 6: Finalize fanin list ===
-    // First build the fanin list
-    if (sched) {
-        auto& rs = sched->ring_sched_states[ring_id];
-        PTO2TaskSlotState& cur_slot_state = rs.get_slot_state_by_slot(slot);
-        // Initialize scheduler state BEFORE adding to producer fanout lists,
-        // so concurrent on_mixed_task_complete can safely access task_state/fanout_refcount.
-        cur_slot_state.task_state.store(PTO2_TASK_PENDING, std::memory_order_relaxed);
-        cur_slot_state.fanout_refcount.store(0, std::memory_order_relaxed);
-        cur_slot_state.completed_subtasks.store(0, std::memory_order_relaxed);
-        cur_slot_state.total_required_subtasks = static_cast<int16_t>(block_num * __builtin_popcount(active_mask));
-        cur_slot_state.block_num = block_num;
-        cur_slot_state.next_block_idx = 0;
-
-        auto& dep_pool = orch->rings[ring_id].dep_pool;
-        // Ensure dep pool has space: fanin_count entries + 1 pre-alloc
-        dep_pool.ensure_space(*sched, fc, ring_id, fanin_count + 1);
-
-        int32_t early_finished = 0;
-        cur_slot_state.fanin_count = fanin_count + 1;  // +1 redundance for not being ready too early
-        payload->fanin_actual_count = fanin_count;
-        for (int i = 0; i < fanin_count; i++) {
-            payload->fanin_slot_states[i] = fanin_states[i];
-        }
-        for (int i = 0; i < fanin_count; i++) {
-            PTO2TaskSlotState& producer_slot_state = *fanin_states[i];
-#if PTO2_ORCH_PROFILING
-            pto2_fanout_lock(producer_slot_state, g_orch_fanin_atomic_count, g_orch_fanin_wait_cycle);
-#else
-            pto2_fanout_lock(producer_slot_state);
-#endif
-            // Normal path: prepend consumer to producer's fanout list
-            producer_slot_state.fanout_count += 1;
-            int32_t prod_state = producer_slot_state.task_state.load(std::memory_order_acquire);
-            if (prod_state >= PTO2_TASK_COMPLETED) {
-                // Early return optimization: if producer already completed, we can skip adding dependency and directly
-                // decrement fanin_count
-                early_finished++;
-            } else {
-                producer_slot_state.fanout_head = dep_pool.prepend(producer_slot_state.fanout_head, &cur_slot_state);
-            }
-            pto2_fanout_unlock(producer_slot_state);
-        }
-        // Combined release: merge early_finished batch with the +1 init release
-        // into a single atomic fetch_add (saves one acq_rel cache-line bounce per task).
-        int32_t initial_refcount = early_finished + 1;  // +1 for the init release
-        int32_t new_rc =
-            cur_slot_state.fanin_refcount.fetch_add(initial_refcount, std::memory_order_acq_rel) + initial_refcount;
-        if (new_rc >= fanin_count + 1) {
-            PTO2ResourceShape shape = pto2_active_mask_to_shape(active_mask);
-            sched->ready_queues[static_cast<int>(shape)].push(&cur_slot_state);
-        }
-        // Record dep pool watermark in local slot state (used by tail reclamation)
-        cur_slot_state.dep_pool_mark = orch->rings[ring_id].dep_pool.top;
-#if PTO2_ORCH_PROFILING
-        // Per producer: fetch_add(fanout_count) + load(task_state) + store(unlock) = 3 atomics
-        // Lock atomics (loads + CAS) are counted inside pto2_fanout_lock
-        g_orch_fanin_atomic_count += fanin_count * 3;
-        if (early_finished > 0) {
-            g_orch_fanin_atomic_count += 1;  // fanin_refcount.fetch_add
-        }
-#endif
-    }
-
-    CYCLE_COUNT_LAP_RECORD(g_orch_fanin_cycle, AicpuPhaseId::ORCH_FANIN, task_id.raw);
-
-#if PTO2_PROFILING
-    orch->tasks_submitted++;
-#if PTO2_ORCH_PROFILING
-    g_orch_submit_count++;
-#endif
-    g_orch_submit_idx++;
-#endif
-    return result;
-}
-
-// =============================================================================
-// Flow Control
-// =============================================================================
-
-void pto2_orchestrator_done(PTO2OrchestratorState* orch) {
-    for (int r = 0; r < PTO2_MAX_RING_DEPTH; r++) {
-        int32_t total_tasks = orch->rings[r].task_allocator.active_count();
-        if (total_tasks > 0) {
-            LOG_INFO("=== [Orchestrator] ring %d: total_tasks=%d ===", r, total_tasks);
-        }
-        auto& pool = orch->rings[r].dep_pool;
-        if (pool.top > 0) {
-            LOG_INFO("=== [DepPool %d] top=%d tail=%d used=%d high_water=%d capacity=%d ===",
-                     r,
-                     pool.top,
-                     pool.tail,
-                     pool.top - pool.tail,
-                     pool.high_water,
-                     pool.capacity);
-        }
-    }
-    orch->sm_handle->header->orchestrator_done.store(1, std::memory_order_release);
-#if !PTO2_ORCH_PROFILING && PTO2_PROFILING
-    g_orch_submit_idx = 0;
-#endif
-}
-
-// =============================================================================
-// Debug Utilities
-// =============================================================================
-
-void pto2_orchestrator_print_stats(PTO2OrchestratorState* orch) {
-    LOG_INFO("=== Orchestrator Statistics ===");
-#if PTO2_PROFILING
-    LOG_INFO("Tasks submitted: %" PRId64, orch->tasks_submitted);
-    LOG_INFO("Buffers allocated: %" PRId64, orch->buffers_allocated);
-    LOG_INFO("Bytes allocated: %" PRId64, orch->bytes_allocated);
-#endif
-    LOG_INFO("Current scope depth: %d", orch->scope_stack_top + 1);
-    for (int r = 0; r < PTO2_MAX_RING_DEPTH; r++) {
-        int32_t active = orch->rings[r].task_allocator.active_count();
-        if (active > 0) {
-            LOG_INFO("Ring %d task active: %d", r, active);
-            LOG_INFO("Ring %d heap used: %" PRIu64 " / %" PRIu64,
-                     r,
-                     orch->rings[r].task_allocator.heap_top(),
-                     orch->rings[r].task_allocator.heap_capacity());
-            LOG_INFO(
-                "Ring %d dep pool: %d / %d", r, orch->rings[r].dep_pool.used(), orch->rings[r].dep_pool.capacity);
-        }
-    }
-    LOG_INFO("TensorMap valid: %d", orch->tensor_map.valid_count());
-    LOG_INFO("===============================");
-}
-
-void pto2_orchestrator_print_scope_stack(PTO2OrchestratorState* orch) {
-    LOG_INFO("=== Scope Stack ===");
-    LOG_INFO("Depth: %d", orch->scope_stack_top + 1);
-
-    for (int i = 0; i <= orch->scope_stack_top; i++) {
-        int32_t begin = orch->scope_begins[i];
-        int32_t end = (i < orch->scope_stack_top) ? orch->scope_begins[i + 1] : orch->scope_tasks_size;
-        LOG_INFO("  [%d] tasks_owned = %d", i, end - begin);
-    }
-
-    LOG_INFO("==================");
-}
-
-#if PTO2_ORCH_PROFILING
-PTO2OrchProfilingData pto2_orchestrator_get_profiling() {
-    PTO2OrchProfilingData d;
-    d.sync_cycle = g_orch_sync_cycle;
-    d.alloc_cycle = g_orch_alloc_cycle;
-    d.args_cycle = g_orch_args_cycle;
-    d.lookup_cycle = g_orch_lookup_cycle;
-    d.insert_cycle = g_orch_insert_cycle;
-    d.fanin_cycle = g_orch_fanin_cycle;
-    d.scope_end_cycle = g_orch_scope_end_cycle;
-    d.submit_count = g_orch_submit_count;
-    d.alloc_wait_cycle = g_orch_alloc_wait_cycle;
-    d.fanin_wait_cycle = g_orch_fanin_wait_cycle;
-    d.alloc_atomic_count = g_orch_alloc_atomic_count;
-    d.args_atomic_count = g_orch_args_atomic_count;
-    d.fanin_atomic_count = g_orch_fanin_atomic_count;
-    d.finalize_atomic_count = g_orch_finalize_atomic_count;
-    d.scope_end_atomic_count = g_orch_scope_end_atomic_count;
-
-    // Reset
-    g_orch_sync_cycle = g_orch_alloc_cycle = g_orch_args_cycle = 0;
-    g_orch_lookup_cycle = g_orch_insert_cycle = 0;
-    g_orch_fanin_cycle = g_orch_scope_end_cycle = 0;
-    g_orch_submit_count = 0;
-    g_orch_submit_idx = 0;
-    g_orch_alloc_wait_cycle = 0;
-    g_orch_fanin_wait_cycle = 0;
-    g_orch_alloc_atomic_count = 0;
-    g_orch_args_atomic_count = 0;
-    g_orch_fanin_atomic_count = 0;
-    g_orch_finalize_atomic_count = 0;
-    g_orch_scope_end_atomic_count = 0;
-    return d;
-}
-#endif
diff --git a/src/a2a3/runtime/tensormap_and_ringbuffer_unmodified/runtime/pto_orchestrator.h b/src/a2a3/runtime/tensormap_and_ringbuffer_unmodified/runtime/pto_orchestrator.h
deleted file mode 100644
index 80d33e4f2..000000000
--- a/src/a2a3/runtime/tensormap_and_ringbuffer_unmodified/runtime/pto_orchestrator.h
+++ /dev/null
@@ -1,225 +0,0 @@
-/**
- * PTO Runtime2 - Orchestrator Interface
- *
- * The Orchestrator is responsible for:
- * 1. Executing the orchestration function (Turing-complete control flow)
- * 2. Allocating intermediate buffers from the heap
- * 3. Submitting tasks via async InCore function calls
- * 4. Building the dependency graph using TensorMap
- * 5. Managing buffer scopes for lifecycle control
- *
- * The Orchestrator can run on either:
- * - Host CPU (lower latency for complex control, easier debugging)
- * - Device AI_CPU (lower latency for task submission)
- *
- * Based on: docs/RUNTIME_LOGIC.md
- */
-
-#ifndef PTO_ORCHESTRATOR_H
-#define PTO_ORCHESTRATOR_H
-
-#include "pto_ring_buffer.h"
-#include "pto_runtime2_types.h"
-#include "pto_submit_types.h"
-#include "pto_scheduler.h"
-#include "pto_shared_memory.h"
-#include "pto_tensormap.h"
-#include "pto_types.h"
-
-// =============================================================================
-// Orchestrator State
-// =============================================================================
-
-/**
- * Orchestrator state structure (private to Orchestrator)
- *
- * Contains all state needed for task graph construction and buffer management.
- */
-struct PTO2OrchestratorState {
-    // === SHARED MEMORY ACCESS ===
-    PTO2SharedMemoryHandle* sm_handle;
-
-    // === PER-RING RESOURCES ===
-    PTO2RingSet rings[PTO2_MAX_RING_DEPTH];
-
-    // === TENSOR MAP (Private) ===
-    PTO2TensorMap tensor_map;  // Producer lookup
-
-    // === SCOPE STACK (Private) ===
-    // Single contiguous buffer of task IDs, partitioned by scope level.
-    // scope_begins[i] is the index into scope_tasks where scope i starts.
-    // Tasks for the top scope occupy [scope_begins[top], scope_tasks_size).
-    PTO2TaskSlotState** scope_tasks;  // Flat buffer of taskSlotState (all scopes concatenated)
-    int32_t scope_tasks_size;         // Number of task IDs currently in the buffer
-    int32_t scope_tasks_capacity;     // Allocated capacity of scope_tasks
-    int32_t* scope_begins;            // scope_begins[i] = start index of scope i in scope_tasks
-    int32_t scope_stack_top;          // Current top of stack (-1 = no scope open)
-    uint64_t scope_stack_capacity;    // Max nesting depth (PTO2_MAX_SCOPE_DEPTH)
-
-    // === SCHEDULER REFERENCE ===
-    // Note: In simulated mode, orchestrator and scheduler share address space
-    // In real mode, they communicate via shared memory only
-    PTO2SchedulerState* scheduler;  // For simulated mode only
-#if PTO2_PROFILING
-    // Runtime profiling switch copied from Runtime::enable_profiling.
-    bool enable_profiling;
-#endif
-
-    // === GM HEAP (for output buffers) ===
-    void* gm_heap_base;     // Base address of GM heap
-    uint64_t gm_heap_size;  // Total size of GM heap (all rings)
-
-    // === FATAL ERROR ===
-    // Fatal error flag (single-thread access by orchestrator, no atomic needed)
-    // Cross-thread notification uses shared memory orch_error_code (atomic)
-    bool fatal;
-
-    // === STATISTICS ===
-#if PTO2_PROFILING
-    int64_t tasks_submitted;
-    int64_t buffers_allocated;
-    int64_t bytes_allocated;
-#endif
-
-    /**
-     * Get current ring index from scope depth.
-     * Maps scope depth to ring_id: min(scope_depth, PTO2_MAX_RING_DEPTH - 1)
-     */
-    uint8_t current_ring_id() const {
-        int32_t depth = scope_stack_top;
-        if (depth < 0) depth = 0;
-        return depth < PTO2_MAX_RING_DEPTH ? static_cast<uint8_t>(depth) : PTO2_MAX_RING_DEPTH - 1;
-    }
-
-};
-
-// =============================================================================
-// Orchestrator API
-// =============================================================================
-
-/**
- * Initialize orchestrator state
- *
- * @param orch Orchestrator state to initialize
- * @param sm_handle Shared memory handle
- * @param gm_heap GM heap memory for output buffers
- * @param heap_size Size of GM heap
- * @return true on success
- */
-bool pto2_orchestrator_init(
-    PTO2OrchestratorState* orch, PTO2SharedMemoryHandle* sm_handle, void* gm_heap, uint64_t heap_size,
-    int32_t dep_pool_capacity = PTO2_DEP_LIST_POOL_SIZE);
-
-/**
- * Destroy orchestrator state and free resources
- */
-void pto2_orchestrator_destroy(PTO2OrchestratorState* orch);
-
-/**
- * Set scheduler reference (for simulated mode)
- */
-void pto2_orchestrator_set_scheduler(PTO2OrchestratorState* orch, PTO2SchedulerState* scheduler);
-
-
-// =============================================================================
-// Scope Management
-// =============================================================================
-
-/**
- * Begin a new scope
- *
- * Pushes a new empty task list onto the scope stack.
- * Tasks submitted while this scope is at the top of the stack are
- * owned by it and have their fanout_count initialized to 1.
- */
-void pto2_scope_begin(PTO2OrchestratorState* orch);
-
-/**
- * End current scope
- *
- * Pops the top scope and increments fanout_refcount for each task
- * directly owned by that scope.
- * May trigger buffer release for tasks that are now fully consumed.
- */
-void pto2_scope_end(PTO2OrchestratorState* orch);
-
-// =============================================================================
-// Task Submission
-// =============================================================================
-
-/**
- * Submit a task with InCore function and parameters
- *
- * This is the main API for building the task graph:
- * 1. Allocates task slot + packed output buffer via TaskAllocator (blocks until available)
- * 2. Looks up inputs in TensorMap to find dependencies
- * 3. Updates producer's fanout_count/list (with spinlock)
- * 4. Registers outputs in TensorMap
- * 5. Initializes task state in scheduler
- *
- * @param orch Orchestrator state
- * @param mixed_kernels Kernel IDs for AIC/AIV0/AIV1 slots
- * @param args Aggregated tensor and scalar parameters
- */
-TaskOutputTensors pto2_submit_mixed_task(PTO2OrchestratorState* orch,
-                                         const MixedKernels& mixed_kernels,
-                                         const Arg& args);
-
-// =============================================================================
-// Flow Control
-// =============================================================================
-
-/**
- * Mark orchestration as complete
- *
- * Signals to scheduler that no more tasks will be submitted.
- */
-void pto2_orchestrator_done(PTO2OrchestratorState* orch);
-
-// =============================================================================
-// Debug Utilities
-// =============================================================================
-
-/**
- * Print orchestrator statistics
- */
-void pto2_orchestrator_print_stats(PTO2OrchestratorState* orch);
-
-/**
- * Print scope stack state
- */
-void pto2_orchestrator_print_scope_stack(PTO2OrchestratorState* orch);
-
-// =============================================================================
-// Orchestrator Profiling Data
-// =============================================================================
-
-#if PTO2_ORCH_PROFILING
-struct PTO2OrchProfilingData {
-    uint64_t sync_cycle;
-    uint64_t alloc_cycle;  // Combined task slot + heap allocation
-    uint64_t args_cycle;
-    uint64_t lookup_cycle;
-    uint64_t insert_cycle;
-    uint64_t fanin_cycle;
-    uint64_t scope_end_cycle;
-    int64_t submit_count;
-    // Wait time tracking for blocking phases
-    uint64_t alloc_wait_cycle;  // Cycles spent waiting in unified alloc
-    uint64_t fanin_wait_cycle;  // Cycles spent waiting in fanout_lock
-    // Atomic operation counts per phase
-    uint64_t alloc_atomic_count;
-    uint64_t args_atomic_count;
-    uint64_t fanin_atomic_count;
-    uint64_t finalize_atomic_count;
-    uint64_t scope_end_atomic_count;
-};
-
-/**
- * Get and reset orchestrator profiling data.
- * Returns accumulated profiling data and resets counters.
- */
-PTO2OrchProfilingData pto2_orchestrator_get_profiling();
-#endif
-
-#endif  // PTO_ORCHESTRATOR_H
diff --git a/src/a2a3/runtime/tensormap_and_ringbuffer_unmodified/runtime/pto_ring_buffer.cpp b/src/a2a3/runtime/tensormap_and_ringbuffer_unmodified/runtime/pto_ring_buffer.cpp
deleted file mode 100644
index 47a0ec1a6..000000000
--- a/src/a2a3/runtime/tensormap_and_ringbuffer_unmodified/runtime/pto_ring_buffer.cpp
+++ /dev/null
@@ -1,78 +0,0 @@
-/**
- * PTO Runtime2 - Ring Buffer Implementation
- *
- * Implements DepListPool ring buffer for zero-overhead dependency management.
- * TaskAllocator methods are defined inline in pto_ring_buffer.h.
- *
- * Based on: docs/RUNTIME_LOGIC.md
- */
-
-#include "pto_ring_buffer.h"
-#include
-#include
-#include <cstdlib>  // for exit()
-#include "common/unified_log.h"
-#include "pto_scheduler.h"
-
-// =============================================================================
-// Dependency List Pool Implementation
-// =============================================================================
-void PTO2DepListPool::reclaim(PTO2SchedulerState& sched, uint8_t ring_id, int32_t sm_last_task_alive) {
-    if (sm_last_task_alive >= last_reclaimed + PTO2_DEP_POOL_CLEANUP_INTERVAL && sm_last_task_alive > 0) {
-        int32_t mark = sched.ring_sched_states[ring_id].get_slot_state_by_task_id(sm_last_task_alive - 1).dep_pool_mark;
-        if (mark > 0) {
-            advance_tail(mark);
-        }
-        last_reclaimed = sm_last_task_alive;
-    }
-}
-
-void PTO2DepListPool::ensure_space(
-    PTO2SchedulerState& sched, PTO2RingFlowControl& fc, uint8_t ring_id, int32_t needed) {
-    if (available() >= needed) return;
-
-    int spin_count = 0;
-    int32_t prev_last_alive = fc.last_task_alive.load(std::memory_order_acquire);
-    while (available() < needed) {
-        reclaim(sched, ring_id, prev_last_alive);
-        if (available() >= needed) return;
-
-        spin_count++;
-
-        // Progress detection: reset spin counter if last_task_alive advances
-        int32_t cur_last_alive = fc.last_task_alive.load(std::memory_order_acquire);
-        if (cur_last_alive > prev_last_alive) {
-            spin_count = 0;
-            prev_last_alive = cur_last_alive;
-        }
-
-        if (spin_count >= PTO2_DEP_POOL_SPIN_LIMIT) {
-            int32_t current = fc.current_task_index.load(std::memory_order_acquire);
-            LOG_ERROR("========================================");
-            LOG_ERROR("FATAL: Dependency Pool Deadlock Detected! (ring %d)", ring_id);
-            LOG_ERROR("========================================");
-            LOG_ERROR("DepListPool cannot reclaim space after %d spins (no progress).", spin_count);
-            LOG_ERROR("  - Pool used: %d / %d (%.1f%%)",
-                      used(),
-                      capacity,
-                      (capacity > 0) ? (100.0 * used() / capacity) : 0.0);
-            LOG_ERROR("  - Pool top: %d (linear)", top);
-            LOG_ERROR("  - Pool tail: %d (linear)", tail);
-            LOG_ERROR("  - High water: %d", high_water);
-            LOG_ERROR("  - Needed: %d entries", needed);
-            LOG_ERROR("  - last_task_alive: %d (stuck here)", cur_last_alive);
-            LOG_ERROR("  - current_task: %d", current);
-            LOG_ERROR("  - In-flight tasks: %d", current - cur_last_alive);
-            LOG_ERROR("Diagnosis:");
-            LOG_ERROR("  last_task_alive is not advancing, so dep pool tail");
-            LOG_ERROR("  cannot reclaim. Check TaskRing diagnostics for root cause.");
-            LOG_ERROR("Solution:");
-            LOG_ERROR("  Increase dep pool capacity (current: %d, recommended: %d)", capacity, high_water * 2);
-            LOG_ERROR("  Compile-time: PTO2_DEP_LIST_POOL_SIZE in pto_runtime2_types.h");
-            LOG_ERROR("  Runtime env: PTO2_RING_DEP_POOL=%d", high_water * 2);
-            LOG_ERROR("========================================");
-            exit(1);
-        }
-        SPIN_WAIT_HINT();
-    }
-}
diff --git a/src/a2a3/runtime/tensormap_and_ringbuffer_unmodified/runtime/pto_ring_buffer.h b/src/a2a3/runtime/tensormap_and_ringbuffer_unmodified/runtime/pto_ring_buffer.h
deleted file mode 100644
index 6f9f655ba..000000000
--- a/src/a2a3/runtime/tensormap_and_ringbuffer_unmodified/runtime/pto_ring_buffer.h
+++ /dev/null
@@ -1,508 +0,0 @@
-/**
- * PTO Runtime2 - Ring Buffer Data Structures
- *
- * Implements ring buffer designs for zero-overhead memory management:
- *
- * 1. TaskAllocator - Unified task slot + output buffer allocation
- * - Combines task ring (slot allocation) and heap ring (output buffer allocation)
- * - Single spin-wait loop with unified back-pressure and deadlock detection
- * - O(1) bump allocation for both task slots and heap buffers
- *
- * 2. DepListPool - Dependency list entry allocation
- * - Ring buffer for linked list entries
- * - O(1) prepend operation
- * - Implicit reclamation with task ring
- *
- * Based on: docs/RUNTIME_LOGIC.md
- */
-
-#ifndef PTO_RING_BUFFER_H
-#define PTO_RING_BUFFER_H
-
-#include <atomic>
-
-#include "pto_runtime2_types.h"
-#include "pto_shared_memory.h"
-#include "common/unified_log.h"
-
-struct PTO2SchedulerState;  // Forward declaration for dep_pool reclaim
-
-// Set to 1 to enable periodic BLOCKED/Unblocked messages during spin-wait.
-#ifndef PTO2_SPIN_VERBOSE_LOGGING -#define PTO2_SPIN_VERBOSE_LOGGING 1 -#endif - -// Block notification interval (in spin counts) -#define PTO2_BLOCK_NOTIFY_INTERVAL 10000 -// Alloc spin limit - after this, report deadlock and exit -#define PTO2_ALLOC_SPIN_LIMIT 100000 - -// Dep pool spin limit - if exceeded, dep pool capacity too small for workload -#define PTO2_DEP_POOL_SPIN_LIMIT 100000 - -// ============================================================================= -// Task Allocator (unified task slot + heap buffer allocation) -// ============================================================================= - -/** - * Result of a unified task allocation. - */ -struct PTO2TaskAllocResult { - int32_t task_id; // Absolute task ID (not wrapped), -1 on failure - int32_t slot; // task_id & (window_size - 1) - void* packed_base; // Heap allocation result (nullptr if output_size == 0) - void* packed_end; // packed_base + aligned output_size - - bool failed() const { return task_id < 0; } -}; - -/** - * Unified task slot + heap buffer allocator. - * - * Since task and heap are always allocated together and the orchestrator is - * single-threaded, both pointers (task index, heap top) are tracked locally - * and published to shared memory via plain store — no fetch_add or CAS needed. - * - * The alloc() method checks both resources BEFORE committing to either, - * eliminating the need for rollback on partial failure. - */ -class PTO2TaskAllocator { -public: - /** - * Initialize the allocator with task ring and heap ring resources. 
- */ - void init(PTO2TaskDescriptor* descriptors, int32_t window_size, - std::atomic* current_index_ptr, - std::atomic* last_alive_ptr, - void* heap_base, uint64_t heap_size, - std::atomic* error_code_ptr) { - descriptors_ = descriptors; - window_size_ = window_size; - window_mask_ = window_size - 1; - current_index_ptr_ = current_index_ptr; - last_alive_ptr_ = last_alive_ptr; - heap_base_ = heap_base; - heap_size_ = heap_size; - error_code_ptr_ = error_code_ptr; - local_task_id_ = current_index_ptr->load(std::memory_order_relaxed); - heap_top_ = 0; - heap_tail_ = 0; - last_alive_seen_ = 0; - } - - /** - * Allocate a task slot and its associated output buffer in one call. - * - * Both task index and heap top are maintained as local counters and - * published to shared memory only on success. Since the orchestrator is - * single-threaded, no CAS or fetch_add is needed — just check-then-commit. - * - * @param output_size Total packed output size in bytes (0 = no heap needed) - * @return Allocation result; check failed() for errors - */ - PTO2TaskAllocResult alloc(int32_t output_size) { - uint64_t aligned_size = output_size > 0 - ? 
PTO2_ALIGN_UP(static_cast(output_size), PTO2_ALIGN_SIZE) : 0; - - int spin_count = 0; - int32_t prev_last_alive = last_alive_ptr_->load(std::memory_order_acquire); - int32_t last_alive = prev_last_alive; - update_heap_tail(last_alive); - bool blocked_on_heap = false; -#if PTO2_ORCH_PROFILING - uint64_t wait_start = 0; - bool waiting = false; -#endif - - while (true) { - // Check both resources; commit only if both available - if (local_task_id_ - last_alive + 1 < window_size_) { - void* heap_ptr = try_bump_heap(aligned_size); - if (heap_ptr) { - int32_t task_id = commit_task(); -#if PTO2_ORCH_PROFILING - record_wait(spin_count, wait_start, waiting); -#endif - return {task_id, task_id & window_mask_, - heap_ptr, static_cast(heap_ptr) + aligned_size}; - } - blocked_on_heap = true; - } else { - blocked_on_heap = false; - } - - // Spin: wait for scheduler to advance last_task_alive - spin_count++; -#if PTO2_ORCH_PROFILING - if (!waiting) { wait_start = get_sys_cnt_aicpu(); waiting = true; } -#endif - last_alive = last_alive_ptr_->load(std::memory_order_acquire); - update_heap_tail(last_alive); - if (last_alive > prev_last_alive) { - spin_count = 0; - prev_last_alive = last_alive; - } else { -#if PTO2_SPIN_VERBOSE_LOGGING - if (spin_count % PTO2_BLOCK_NOTIFY_INTERVAL == 0) { - LOG_WARN("[TaskAllocator] BLOCKED: tasks=%d/%d, heap=%" PRIu64 "/%" PRIu64 ", on=%s, spins=%d", - local_task_id_ - last_alive, - window_size_, - heap_top_, - heap_size_, - blocked_on_heap ? 
"heap" : "task", - spin_count); - } -#endif - if (spin_count >= PTO2_ALLOC_SPIN_LIMIT) { - report_deadlock(output_size, blocked_on_heap); - return {-1, -1, nullptr, nullptr}; - } - } - SPIN_WAIT_HINT(); - } - } - - // ========================================================================= - // Task descriptor accessors - // ========================================================================= - - PTO2TaskDescriptor& task(int32_t task_id) const { - return descriptors_[task_id & window_mask_]; - } - - PTO2TaskDescriptor& task_by_slot(int32_t slot) const { - return descriptors_[slot]; - } - - // ========================================================================= - // State queries - // ========================================================================= - - int32_t active_count() const { - int32_t last_alive = last_alive_ptr_->load(std::memory_order_acquire); - return local_task_id_ - last_alive; - } - - int32_t window_size() const { return window_size_; } - - uint64_t heap_available() const { - uint64_t tail = heap_tail_; - if (heap_top_ >= tail) { - uint64_t at_end = heap_size_ - heap_top_; - uint64_t at_begin = tail; - return at_end > at_begin ? 
at_end : at_begin; - } - return tail - heap_top_; - } - - uint64_t heap_top() const { return heap_top_; } - uint64_t heap_capacity() const { return heap_size_; } - -private: - // --- Task Ring --- - PTO2TaskDescriptor* descriptors_ = nullptr; - int32_t window_size_ = 0; - int32_t window_mask_ = 0; - std::atomic* current_index_ptr_ = nullptr; - std::atomic* last_alive_ptr_ = nullptr; - - // --- Heap --- - void* heap_base_ = nullptr; - uint64_t heap_size_ = 0; - - // --- Local state (single-writer, no atomics needed) --- - int32_t local_task_id_ = 0; // Next task ID to allocate - uint64_t heap_top_ = 0; // Current heap allocation pointer - uint64_t heap_tail_ = 0; // Heap reclamation pointer (derived from consumed tasks) - int32_t last_alive_seen_ = 0; // last_task_alive at last heap_tail derivation - - // --- Shared --- - std::atomic* error_code_ptr_ = nullptr; - - // ========================================================================= - // Internal helpers - // ========================================================================= - - /** - * Commit a task slot: bump local counter and publish to shared memory. - * Must only be called after space check has passed. - */ - int32_t commit_task() { - int32_t task_id = local_task_id_++; - current_index_ptr_->store(local_task_id_, std::memory_order_release); - return task_id; - } - - /** - * Derive heap_tail_ from the last consumed task's packed_buffer_end. - * - * Every task has a valid packed_buffer_end (equal to packed_buffer_base - * for zero-size allocations), so the last consumed task always determines - * the correct heap_tail — no backward scan needed. 
- */ - void update_heap_tail(int32_t last_alive) { - if (last_alive <= last_alive_seen_) return; - last_alive_seen_ = last_alive; - - PTO2TaskDescriptor& desc = descriptors_[(last_alive - 1) & window_mask_]; - heap_tail_ = static_cast( - static_cast(desc.packed_buffer_end) - static_cast(heap_base_)); - } - - /** - * Bump the heap pointer for the given allocation size. - * Returns the allocated pointer, or nullptr if insufficient space. - * When alloc_size == 0, returns current position without advancing. - */ - void* try_bump_heap(uint64_t alloc_size) { - uint64_t top = heap_top_; - if (alloc_size == 0) { - return static_cast(heap_base_) + top; - } - uint64_t tail = heap_tail_; - void* result; - - if (top >= tail) { - uint64_t space_at_end = heap_size_ - top; - if (space_at_end >= alloc_size) { - result = static_cast(heap_base_) + top; - heap_top_ = top + alloc_size; - } else if (tail > alloc_size) { - result = heap_base_; - heap_top_ = alloc_size; - } else { - return nullptr; - } - } else { - if (tail - top >= alloc_size) { - result = static_cast(heap_base_) + top; - heap_top_ = top + alloc_size; - } else { - return nullptr; - } - } - - return result; - } - -#if PTO2_ORCH_PROFILING - void record_wait(int spin_count, uint64_t wait_start, bool waiting) { - if (waiting) { - extern uint64_t g_orch_alloc_wait_cycle; - g_orch_alloc_wait_cycle += (get_sys_cnt_aicpu() - wait_start); - } - { - extern uint64_t g_orch_alloc_atomic_count; - g_orch_alloc_atomic_count += spin_count + 1; - } - } -#endif - - /** - * Report deadlock with targeted diagnostics. 
- */ - void report_deadlock(int32_t requested_output_size, bool heap_blocked) { - int32_t last_alive = last_alive_ptr_->load(std::memory_order_acquire); - int32_t active_tasks = local_task_id_ - last_alive; - uint64_t htail = heap_tail_; - - LOG_ERROR("========================================"); - if (heap_blocked) { - LOG_ERROR("FATAL: Task Allocator Deadlock - Heap Exhausted!"); - } else { - LOG_ERROR("FATAL: Task Allocator Deadlock - Task Ring Full!"); - } - LOG_ERROR("========================================"); - LOG_ERROR("No progress after %d spins.", PTO2_ALLOC_SPIN_LIMIT); - LOG_ERROR(" Task ring: current=%d, last_alive=%d, active=%d/%d (%.1f%%)", - local_task_id_, last_alive, active_tasks, window_size_, - 100.0 * active_tasks / window_size_); - LOG_ERROR(" Heap ring: top=%" PRIu64 ", tail=%" PRIu64 ", size=%" PRIu64 - ", available=%" PRIu64, - heap_top_, htail, heap_size_, heap_available()); - if (heap_blocked) { - LOG_ERROR(" Requested: %d bytes", requested_output_size); - } - LOG_ERROR("Diagnosis:"); - LOG_ERROR(" last_task_alive is stuck at %d, meaning task %d", - last_alive, last_alive); - LOG_ERROR(" cannot transition to CONSUMED. Possible causes:"); - LOG_ERROR(" 1. Task %d still executing (subtasks not complete)", last_alive); - LOG_ERROR(" 2. Task %d fanout not fully released (downstream not done)", last_alive); - LOG_ERROR(" 3. Scope reference not released (scope_end not called)"); - LOG_ERROR(" 4. Orchestrator blocked here -> can't call scope_end -> circular wait"); - LOG_ERROR("Solution:"); - if (heap_blocked) { - LOG_ERROR(" Increase heap size (current: %" PRIu64 ", recommended: %" PRIu64 ")", - heap_size_, heap_size_ * 2); - LOG_ERROR(" Compile-time: PTO2_HEAP_SIZE in pto_runtime2_types.h"); - LOG_ERROR(" Runtime env: PTO2_RING_HEAP= (e.g. 
%" PRIu64 ")", - heap_size_ * 2); - } else { - LOG_ERROR(" Increase task window size (current: %d, recommended: %d)", - window_size_, active_tasks * 2); - LOG_ERROR(" Compile-time: PTO2_TASK_WINDOW_SIZE in pto_runtime2_types.h"); - LOG_ERROR(" Runtime env: PTO2_RING_TASK_WINDOW= (e.g. %d)", - active_tasks * 2); - } - LOG_ERROR("========================================"); - if (error_code_ptr_) { - int32_t code = heap_blocked ? PTO2_ERROR_HEAP_RING_DEADLOCK - : PTO2_ERROR_FLOW_CONTROL_DEADLOCK; - error_code_ptr_->store(code, std::memory_order_release); - } - } -}; - -// ============================================================================= -// Dependency List Pool -// ============================================================================= - -/** - * Dependency list pool structure - * - * True ring buffer for allocating linked list entries. - * Entries are reclaimed when their producer tasks become CONSUMED, - * as tracked by the orchestrator via dep_pool_mark per task. - * - * Linear counters (top, tail) grow monotonically; the physical index - * is obtained via modulo: base[linear_index % capacity]. 
- */ -struct PTO2DepListPool { - PTO2DepListEntry* base; // Pool base address - int32_t capacity; // Total number of entries - int32_t top; // Linear next-allocation counter (starts from 1) - int32_t tail; // Linear first-alive counter (entries before this are dead) - int32_t high_water; // Peak concurrent usage (top - tail) - int32_t last_reclaimed{0}; // last_task_alive at last successful reclamation - - // Error code pointer for fatal error reporting (→ sm_header->orch_error_code) - std::atomic* error_code_ptr = nullptr; - - /** - * Initialize dependency list pool - * - * @param base Pool base address from shared memory - * @param capacity Total number of entries - */ - void init(PTO2DepListEntry* in_base, int32_t in_capacity, std::atomic* in_error_code_ptr) { - base = in_base; - capacity = in_capacity; - top = 1; // Start from 1, 0 means NULL/empty - tail = 1; // Match initial top (no reclaimable entries yet) - high_water = 0; - last_reclaimed = 0; - - // Initialize entry 0 as NULL marker - base[0].slot_state = nullptr; - base[0].next = nullptr; - - error_code_ptr = in_error_code_ptr; - } - - /** - * Reclaim dead entries based on scheduler's slot state dep_pool_mark. - * Safe to call multiple times — only advances tail forward. - * - * @param sched Scheduler state (for reading slot dep_pool_mark) - * @param ring_id Ring layer index - * @param sm_last_task_alive Current last_task_alive from shared memory - */ - void reclaim(PTO2SchedulerState& sched, uint8_t ring_id, int32_t sm_last_task_alive); - - /** - * Ensure dep pool for a specific ring has at least `needed` entries available. - * Spin-waits for reclamation if under pressure. Detects deadlock if no progress. 
- */ - void ensure_space(PTO2SchedulerState& sched, PTO2RingFlowControl &fc, uint8_t ring_id, int32_t needed); - - /** - * Allocate a single entry from the pool (single-thread per pool instance) - * - * @return Pointer to allocated entry, or nullptr on fatal error - */ - PTO2DepListEntry* alloc() { - int32_t used = top - tail; - if (used >= capacity) { - LOG_ERROR("========================================"); - LOG_ERROR("FATAL: Dependency Pool Overflow!"); - LOG_ERROR("========================================"); - LOG_ERROR("DepListPool exhausted: %d entries alive (capacity=%d).", used, capacity); - LOG_ERROR(" - Pool top: %d (linear)", top); - LOG_ERROR(" - Pool tail: %d (linear)", tail); - LOG_ERROR(" - High water: %d", high_water); - LOG_ERROR("Solution:"); - LOG_ERROR(" Increase dep pool capacity (current: %d, recommended: %d).", capacity, capacity * 2); - LOG_ERROR(" Compile-time: PTO2_DEP_LIST_POOL_SIZE in pto_runtime2_types.h"); - LOG_ERROR(" Runtime env: PTO2_RING_DEP_POOL=%d", capacity * 2); - LOG_ERROR("========================================"); - if (error_code_ptr) { - error_code_ptr->store(PTO2_ERROR_DEP_POOL_OVERFLOW, std::memory_order_release); - } - return nullptr; - } - int32_t idx = top % capacity; - top++; - used++; - if (used > high_water) high_water = used; - return &base[idx]; - } - - /** - * Advance the tail pointer, reclaiming dead entries. - * Called by the orchestrator based on last_task_alive advancement. - */ - void advance_tail(int32_t new_tail) { - if (new_tail > tail) { - tail = new_tail; - } - } - - /** - * Prepend a task ID to a dependency list - * - * O(1) operation: allocates new entry and links to current head. 
- * - * @param current_head Current list head offset (0 = empty list) - * @param task_slot Task slot to prepend - * @return New head offset - */ - PTO2DepListEntry* prepend(PTO2DepListEntry* cur, PTO2TaskSlotState* slot_state) { - PTO2DepListEntry* new_entry = alloc(); - if (!new_entry) return nullptr; - new_entry->slot_state = slot_state; - new_entry->next = cur; - return new_entry; - } - - /** - * Get entry by offset - */ - PTO2DepListEntry* pto2_dep_pool_get(int32_t offset) { - if (offset <= 0) return NULL; - return &base[offset]; - } - - int32_t used() const { - return top - tail; - } - - int32_t available() const { - return capacity - used(); - } -}; - -// ============================================================================= -// Ring Set (per-depth aggregate) -// ============================================================================= - -/** - * Groups a TaskAllocator and DepPool into one per-depth unit. - * PTO2_MAX_RING_DEPTH instances provide independent reclamation per scope depth. - */ -struct PTO2RingSet { - PTO2TaskAllocator task_allocator; - PTO2DepListPool dep_pool; -}; - -#endif // PTO_RING_BUFFER_H diff --git a/src/a2a3/runtime/tensormap_and_ringbuffer_unmodified/runtime/pto_runtime2.cpp b/src/a2a3/runtime/tensormap_and_ringbuffer_unmodified/runtime/pto_runtime2.cpp deleted file mode 100644 index bd049ecf8..000000000 --- a/src/a2a3/runtime/tensormap_and_ringbuffer_unmodified/runtime/pto_runtime2.cpp +++ /dev/null @@ -1,337 +0,0 @@ -/* - * Copyright (c) PyPTO Contributors. - * This program is free software, you can redistribute it and/or modify it under the terms and conditions of - * CANN Open Software License Agreement Version 2.0 (the "License"). - * Please refer to the License for details. You may not use this file except in compliance with the License. 
- * THIS SOFTWARE IS PROVIDED ON AN "AS IS" BASIS, WITHOUT WARRANTIES OF ANY KIND, EITHER EXPRESS OR IMPLIED, - * INCLUDING BUT NOT LIMITED TO NON-INFRINGEMENT, MERCHANTABILITY, OR FITNESS FOR A PARTICULAR PURPOSE. - * See LICENSE in the root of the software repository for the full text of the License. - * ----------------------------------------------------------------------------------------------------------- - */ - -/** - * PTO Runtime2 - Main Implementation - * - * Implements the unified runtime API that combines orchestrator and scheduler. - * - * Based on: docs/RUNTIME_LOGIC.md - */ - -#include "pto_runtime2.h" - -#include -#include -#include - -#include - -#include "aicpu/device_time.h" -#include "common/unified_log.h" - -// Weak fallback for HOST .so builds (never called, but satisfies linker). -// The AICPU build links the strong symbol from platform/.../device_time.cpp. -// Hidden visibility prevents HOST .so from polluting global symbol table. -__attribute__((weak, visibility("hidden"))) uint64_t get_sys_cnt_aicpu() { return 0; } - -// ============================================================================= -// Thread-local orchestrator index for multi-orchestrator dispatch -// ============================================================================= - -thread_local int pto2_current_orch_idx = 0; - -void pto2_set_orch_thread_idx(int idx) { pto2_current_orch_idx = idx; } - -// ============================================================================= -// Orchestration Ops Table (function-pointer dispatch for orchestration .so) -// ============================================================================= - -static TaskOutputTensors submit_task_impl(PTO2Runtime* rt, const MixedKernels& mixed_kernels, const Arg& args) { - return pto2_submit_mixed_task(&rt->orchestrators[pto2_current_orch_idx], mixed_kernels, args); -} - -void pto2_rt_scope_begin(PTO2Runtime* rt) { pto2_scope_begin(&rt->orchestrators[pto2_current_orch_idx]); } - -void 
pto2_rt_scope_end(PTO2Runtime* rt) { pto2_scope_end(&rt->orchestrators[pto2_current_orch_idx]); } - -void pto2_rt_orchestration_done(PTO2Runtime* rt) { pto2_orchestrator_done(&rt->orchestrators[pto2_current_orch_idx]); } - -static bool is_fatal_impl(PTO2Runtime* rt) { return rt->orchestrators[pto2_current_orch_idx].fatal; } - -// Wait for all producers of this tensor to be safe for data access. -// Checks owner metadata (lifecycle anchor) and OverlapMap (modifier writers). -// For reads: wait until each producer COMPLETED (done writing). -// For writes: also wait until all consumers done reading -// (fanout_refcount >= fanout_count - 1, excluding scope reference). -// Uses cycle-based timeout (checked every 1024 spins). -// Returns false on timeout (sets orch.fatal). -MAYBE_UNINITIALIZED_BEGIN -static bool wait_for_tensor_ready(PTO2Runtime* rt, const Tensor& tensor, bool wait_for_consumers, const char* caller) { - PTO2OrchestratorState& orch = rt->orchestrators[pto2_current_orch_idx]; - - // Collect producer slot states from both maps, deduplicated by pointer. - // +1: one creator slot + up to PTO2_LOOKUP_MAX_RESULTS modifier slots. 
- constexpr int kMaxWait = PTO2_LOOKUP_MAX_RESULTS + 1; - PTO2TaskSlotState* slots[kMaxWait]; - int slot_count = 0; - - // Step A: creator retention — read owner directly from tensor metadata - PTO2TaskId owner = tensor.owner_task_id; - if (owner.is_valid()) { - slots[slot_count++] = &rt->scheduler.ring_sched_states[owner.ring()].get_slot_state_by_task_id(owner.local()); - } - - // Step B: modifier writer lookup (OverlapMap) - PTO2LookupResult lookup_result; - orch.tensor_map.lookup(tensor, lookup_result); - for (int r = 0; r < lookup_result.count; r++) { - PTO2TaskId pid = lookup_result.entries[r].entry->producer_task_id; - PTO2TaskSlotState* s = &rt->scheduler.ring_sched_states[pid.ring()].get_slot_state_by_task_id(pid.local()); - bool already = false; - for (int j = 0; j < slot_count; j++) { - if (slots[j] == s) { - already = true; - break; - } - } - if (!already && slot_count < kMaxWait) { - slots[slot_count++] = s; - } - } - - // Wait for each producer - for (int p = 0; p < slot_count; p++) { - PTO2TaskSlotState& slot = *slots[p]; - uint8_t ring_id = slot.ring_id; - int32_t local_id = static_cast(slot.task->task_id.local()); - - uint64_t t0 = get_sys_cnt_aicpu(); - int32_t spin_count = 0; - while (slot.task_state.load(std::memory_order_acquire) < PTO2_TASK_COMPLETED) { - SPIN_WAIT_HINT(); - if ((++spin_count & 1023) == 0 && get_sys_cnt_aicpu() - t0 > PTO2_TENSOR_DATA_TIMEOUT_CYCLES) { - orch.fatal = true; - unified_log_error(caller, - "Timeout (%llu cycles): producer (ring=%d, local=%d) not completed", - (unsigned long long)PTO2_TENSOR_DATA_TIMEOUT_CYCLES, // NOLINT(runtime/int) - ring_id, - local_id); - return false; - } - } - - if (wait_for_consumers) { - t0 = get_sys_cnt_aicpu(); - spin_count = 0; - while (slot.fanout_refcount.load(std::memory_order_acquire) < slot.fanout_count - 1) { - SPIN_WAIT_HINT(); - if ((++spin_count & 1023) == 0 && get_sys_cnt_aicpu() - t0 > PTO2_TENSOR_DATA_TIMEOUT_CYCLES) { - orch.fatal = true; - unified_log_error(caller, - 
"Timeout (%llu cycles): consumers of producer (ring=%d, local=%d) not done", - (unsigned long long)PTO2_TENSOR_DATA_TIMEOUT_CYCLES, // NOLINT(runtime/int) - ring_id, - local_id); - return false; - } - } - } - } - return true; -} -MAYBE_UNINITIALIZED_END - -uint64_t pto2_get_tensor_data(PTO2Runtime* rt, const Tensor& tensor, uint32_t ndims, const uint32_t indices[]) { - if (tensor.buffer.addr == 0) { - unified_log_error(__FUNCTION__, - "get_tensor_data: buffer not allocated (addr=0). " - "Use the Tensor returned by add_output(TensorCreateInfo) after submit returns."); - return 0; - } - - if (!wait_for_tensor_ready(rt, tensor, false, __FUNCTION__)) { - return 0; - } - - uint64_t flat_offset = tensor.compute_flat_offset(indices, ndims); - uint64_t elem_size = get_element_size(tensor.dtype); - const void* ptr = reinterpret_cast(tensor.buffer.addr + flat_offset * elem_size); - uint64_t result = 0; - memcpy(&result, ptr, elem_size); - return result; -} - -void pto2_set_tensor_data( - PTO2Runtime* rt, const Tensor& tensor, uint32_t ndims, const uint32_t indices[], uint64_t value) { - if (tensor.buffer.addr == 0) { - unified_log_error(__FUNCTION__, - "set_tensor_data: buffer not allocated (addr=0). 
" - "Use the Tensor returned by add_output(TensorCreateInfo) after submit returns."); - return; - } - - // Wait for producer + all consumers before writing (WAW + WAR safety) - if (!wait_for_tensor_ready(rt, tensor, true, __FUNCTION__)) { - return; - } - - uint64_t flat_offset = tensor.compute_flat_offset(indices, ndims); - uint64_t elem_size = get_element_size(tensor.dtype); - void* ptr = reinterpret_cast(tensor.buffer.addr + flat_offset * elem_size); - memcpy(ptr, &value, elem_size); -} - -static const PTO2RuntimeOps s_runtime_ops = { - .submit_task = submit_task_impl, - .scope_begin = pto2_rt_scope_begin, - .scope_end = pto2_rt_scope_end, - .orchestration_done = pto2_rt_orchestration_done, - .is_fatal = is_fatal_impl, - .log_error = unified_log_error, - .log_warn = unified_log_warn, - .log_info = unified_log_info, - .log_debug = unified_log_debug, - .log_always = unified_log_always, - .get_tensor_data = pto2_get_tensor_data, - .set_tensor_data = pto2_set_tensor_data, -}; - -// ============================================================================= -// Runtime Creation and Destruction -// ============================================================================= - -PTO2Runtime* pto2_runtime_create(PTO2RuntimeMode mode) { - return pto2_runtime_create_custom(mode, PTO2_TASK_WINDOW_SIZE, PTO2_HEAP_SIZE); -} - -PTO2Runtime* pto2_runtime_create_custom( - PTO2RuntimeMode mode, uint64_t task_window_size, uint64_t heap_size, int32_t dep_pool_capacity) { - // Allocate runtime context - PTO2Runtime* rt = static_cast(calloc(1, sizeof(PTO2Runtime))); - if (!rt) { - return NULL; - } - - rt->ops = &s_runtime_ops; - rt->mode = mode; - rt->orch_count = 1; - rt->sm_handle = pto2_sm_create(task_window_size, heap_size); - if (!rt->sm_handle) { - free(rt); - return NULL; - } - - // Allocate GM heap for output buffers (all rings combined) - uint64_t total_heap_size = heap_size * PTO2_MAX_RING_DEPTH; - rt->gm_heap_size = total_heap_size; -#if defined(_POSIX_C_SOURCE) && 
_POSIX_C_SOURCE >= 200112L - if (posix_memalign(&rt->gm_heap, PTO2_ALIGN_SIZE, total_heap_size) != 0) { - pto2_sm_destroy(rt->sm_handle); - free(rt); - return NULL; - } -#else - rt->gm_heap = aligned_alloc(PTO2_ALIGN_SIZE, total_heap_size); - if (!rt->gm_heap) { - pto2_sm_destroy(rt->sm_handle); - free(rt); - return NULL; - } -#endif - rt->gm_heap_owned = true; - - // Initialize first orchestrator - if (!pto2_orchestrator_init(&rt->orchestrators[0], rt->sm_handle, rt->gm_heap, heap_size, dep_pool_capacity)) { - free(rt->gm_heap); - pto2_sm_destroy(rt->sm_handle); - free(rt); - return NULL; - } - - // Initialize scheduler (heap_size = per-ring heap size) - if (!pto2_scheduler_init(&rt->scheduler, rt->sm_handle)) { - pto2_orchestrator_destroy(&rt->orchestrators[0]); - free(rt->gm_heap); - pto2_sm_destroy(rt->sm_handle); - free(rt); - return NULL; - } - - // Connect orchestrator to scheduler (for simulated mode) - pto2_orchestrator_set_scheduler(&rt->orchestrators[0], &rt->scheduler); - - return rt; -} - -PTO2Runtime* pto2_runtime_create_from_sm(PTO2RuntimeMode mode, - PTO2SharedMemoryHandle* sm_handle, - void* gm_heap, - uint64_t heap_size, - int orch_count, - int32_t dep_pool_capacity) { - if (!sm_handle) return NULL; - if (orch_count < 1) orch_count = 1; - if (orch_count > PTO2_MAX_ORCH_THREADS) orch_count = PTO2_MAX_ORCH_THREADS; - - PTO2Runtime* rt = static_cast(calloc(1, sizeof(PTO2Runtime))); - if (!rt) return NULL; - - rt->ops = &s_runtime_ops; - rt->mode = mode; - rt->sm_handle = sm_handle; - rt->gm_heap = gm_heap; - rt->gm_heap_size = heap_size > 0 ? 
heap_size * PTO2_MAX_RING_DEPTH : 0; - rt->gm_heap_owned = false; - rt->orch_count = orch_count; - - // Initialize all orchestrator states - for (int i = 0; i < orch_count; i++) { - if (!pto2_orchestrator_init(&rt->orchestrators[i], rt->sm_handle, rt->gm_heap, heap_size, dep_pool_capacity)) { - for (int j = 0; j < i; j++) { - pto2_orchestrator_destroy(&rt->orchestrators[j]); - } - free(rt); - return NULL; - } - } - - // Initialize scheduler (heap_size = per-ring heap size) - if (!pto2_scheduler_init(&rt->scheduler, rt->sm_handle)) { - for (int i = 0; i < orch_count; i++) { - pto2_orchestrator_destroy(&rt->orchestrators[i]); - } - free(rt); - return NULL; - } - - // Connect all orchestrators to scheduler - for (int i = 0; i < orch_count; i++) { - pto2_orchestrator_set_scheduler(&rt->orchestrators[i], &rt->scheduler); - } - - return rt; -} - -void pto2_runtime_destroy(PTO2Runtime* rt) { - if (!rt) return; - - pto2_scheduler_destroy(&rt->scheduler); - for (int i = 0; i < rt->orch_count; i++) { - pto2_orchestrator_destroy(&rt->orchestrators[i]); - } - - if (rt->gm_heap_owned && rt->gm_heap) { - free(rt->gm_heap); - } - - if (rt->sm_handle) { - pto2_sm_destroy(rt->sm_handle); - } - - free(rt); -} - -void pto2_runtime_set_mode(PTO2Runtime* rt, PTO2RuntimeMode mode) { - if (rt) { - rt->mode = mode; - } -} diff --git a/src/a2a3/runtime/tensormap_and_ringbuffer_unmodified/runtime/pto_runtime2.h b/src/a2a3/runtime/tensormap_and_ringbuffer_unmodified/runtime/pto_runtime2.h deleted file mode 100644 index 31b28acd4..000000000 --- a/src/a2a3/runtime/tensormap_and_ringbuffer_unmodified/runtime/pto_runtime2.h +++ /dev/null @@ -1,225 +0,0 @@ -/** - * PTO Runtime2 - Main Interface - * - * This is the main header for the PTO Runtime2 system. - * It provides a unified API for task graph construction and execution. 
- * - * Key Features: - * - Ring buffer based memory management (zero allocation overhead) - * - Lazy invalidation TensorMap for dependency discovery - * - Scope-based buffer lifecycle management - * - Per-task spinlocks for concurrent fanout updates - * - Orchestrator-Scheduler decoupling via shared memory - * - * Usage: - * 1. Create runtime: pto2_runtime_create() - * 2. Build task graph in orchestration function: - * - pto2_scope_begin() / pto2_scope_end() - * - pto2_submit_task() - * 3. Mark orchestration complete: pto2_orchestrator_done() - * 4. Destroy runtime: pto2_runtime_destroy() - * - * Based on: docs/RUNTIME_LOGIC.md - */ - -#ifndef PTO_RUNTIME2_H -#define PTO_RUNTIME2_H - -#include "pto_runtime2_types.h" -#include "pto_submit_types.h" -#include "pto_shared_memory.h" -#include "pto_ring_buffer.h" -#include "pto_tensormap.h" -#include "pto_scheduler.h" -#include "pto_orchestrator.h" - -// Maximum number of orchestrator threads supported -constexpr int PTO2_MAX_ORCH_THREADS = 4; - -// ============================================================================= -// Runtime Context -// ============================================================================= - -/** - * Runtime execution mode - */ -enum PTO2RuntimeMode { - PTO2_MODE_EXECUTE = 0, // Execute tasks on workers - PTO2_MODE_SIMULATE = 1, // Simulate task execution with cycle counting - PTO2_MODE_GRAPH_ONLY = 2 // Build graph only, no execution -}; - -/** - * Function-pointer ops table for runtime operations. - * - * The orchestration .so calls runtime functions through this table - * (via pto_orchestration_api.h inline wrappers), so it has zero link - * dependencies on runtime .cpp files. 
- */ -typedef struct PTO2Runtime PTO2Runtime; // forward declare for ops signatures - -struct PTO2RuntimeOps { - TaskOutputTensors (*submit_task)(PTO2Runtime* rt, const MixedKernels& mixed_kernels, - const Arg& args); - void (*scope_begin)(PTO2Runtime* rt); - void (*scope_end)(PTO2Runtime* rt); - void (*orchestration_done)(PTO2Runtime* rt); - bool (*is_fatal)(PTO2Runtime* rt); - - // Logging (populated by runtime, called by orchestration) - void (*log_error)(const char* func, const char* fmt, ...); - void (*log_warn)(const char* func, const char* fmt, ...); - void (*log_info)(const char* func, const char* fmt, ...); - void (*log_debug)(const char* func, const char* fmt, ...); - void (*log_always)(const char* func, const char* fmt, ...); - - // Cross-layer data access (orchestration reads/writes tensor values via runtime) - // Placed after logging to avoid shifting hot-path field offsets. - uint64_t (*get_tensor_data)(PTO2Runtime* rt, const Tensor& tensor, - uint32_t ndims, const uint32_t indices[]); - void (*set_tensor_data)(PTO2Runtime* rt, const Tensor& tensor, - uint32_t ndims, const uint32_t indices[], - uint64_t value); -}; - -/** - * PTO Runtime2 context - * - * Contains all state for orchestration and scheduling. - * In simulated mode, runs in single process with shared address space. 
- */ -struct PTO2Runtime { - // Ops table (first field — used by orchestration .so via function pointers) - const PTO2RuntimeOps* ops; - - // Components - PTO2SharedMemoryHandle* sm_handle; - PTO2OrchestratorState orchestrators[PTO2_MAX_ORCH_THREADS]; - int orch_count; // Number of active orchestrator states - PTO2SchedulerState scheduler; - - // GM Heap for output buffers - void* gm_heap; - uint64_t gm_heap_size; - bool gm_heap_owned; // True if we allocated it - - // Mode - PTO2RuntimeMode mode; - - // Statistics - int64_t total_cycles; -}; - -// ============================================================================= -// Runtime Lifecycle API -// ============================================================================= - -/** - * Create a new runtime instance - * - * @param mode Execution mode - * @return Runtime context, or NULL on failure - */ -PTO2Runtime* pto2_runtime_create(PTO2RuntimeMode mode); - -/** - * Create runtime with custom sizes - * - * @param mode Execution mode - * @param task_window_size Number of task slots - * @param heap_size Size of GM heap - * @return Runtime context, or NULL on failure - */ -PTO2Runtime* pto2_runtime_create_custom(PTO2RuntimeMode mode, - uint64_t task_window_size, - uint64_t heap_size, - int32_t dep_pool_capacity = PTO2_DEP_LIST_POOL_SIZE); - -/** - * Create runtime from existing shared memory and GM heap (e.g. on device). - * Does not allocate sm_handle or gm_heap; caller owns them. - * - * @param mode Execution mode - * @param sm_handle Pre-created shared memory handle (e.g. 
from pto2_sm_create_from_buffer) - * @param gm_heap GM heap base for output buffers (or NULL if not used) - * @param heap_size GM heap size in bytes - * @return Runtime context, or NULL on failure - */ -PTO2Runtime* pto2_runtime_create_from_sm(PTO2RuntimeMode mode, - PTO2SharedMemoryHandle* sm_handle, - void* gm_heap, - uint64_t heap_size, - int orch_count = 1, - int32_t dep_pool_capacity = PTO2_DEP_LIST_POOL_SIZE); - -/** - * Destroy runtime and free all resources - */ -void pto2_runtime_destroy(PTO2Runtime* rt); - -/** - * Set execution mode - */ -void pto2_runtime_set_mode(PTO2Runtime* rt, PTO2RuntimeMode mode); - -/** - * Set the orchestrator index for the current thread. - * Must be called before any orchestration API calls on a given thread. - */ -void pto2_set_orch_thread_idx(int idx); - -// ============================================================================= -// Orchestration API (called by orchestration function) -// ============================================================================= - -/** - * Begin a new scope - * - * All tasks submitted within this scope will have their lifetime - * bounded by the scope. When scope_end() is called, the scope - * releases its reference to all enclosed tasks. - */ -void pto2_rt_scope_begin(PTO2Runtime* rt); - -/** - * End current scope - * - * Releases scope reference for all tasks submitted since scope_begin(). - * Tasks whose refcount reaches zero will have their buffers released. - */ -void pto2_rt_scope_end(PTO2Runtime* rt); - -/** - * Mark orchestration as complete - * - * Signals that no more tasks will be submitted. - */ -void pto2_rt_orchestration_done(PTO2Runtime* rt); - -/** - * Cross-layer data access: read a tensor value by waiting for its producer. - */ -uint64_t pto2_get_tensor_data(PTO2Runtime* rt, const Tensor& tensor, - uint32_t ndims, const uint32_t indices[]); - -/** - * Cross-layer data access: write a value to a tensor at given indices. 
- * Waits for producer completion (WAW) and all consumers (WAR) via TensorMap. - * See set_tensor_data in pto_orchestration_api.h for full documentation. - */ -void pto2_set_tensor_data(PTO2Runtime* rt, const Tensor& tensor, - uint32_t ndims, const uint32_t indices[], - uint64_t value); - -/** - * Slim config struct exported by orchestration .so via aicpu_orchestration_config(). - * Shared definition with pto_orchestration_api.h (same layout, guarded). - */ -#ifndef PTO2_ORCHESTRATION_CONFIG_DEFINED -#define PTO2_ORCHESTRATION_CONFIG_DEFINED -struct PTO2OrchestrationConfig { - int expected_arg_count; -}; -#endif - -#endif // PTO_RUNTIME2_H diff --git a/src/a2a3/runtime/tensormap_and_ringbuffer_unmodified/runtime/pto_runtime2_types.h b/src/a2a3/runtime/tensormap_and_ringbuffer_unmodified/runtime/pto_runtime2_types.h deleted file mode 100644 index 8f89afb25..000000000 --- a/src/a2a3/runtime/tensormap_and_ringbuffer_unmodified/runtime/pto_runtime2_types.h +++ /dev/null @@ -1,557 +0,0 @@ -/* - * Copyright (c) PyPTO Contributors. - * This program is free software, you can redistribute it and/or modify it under the terms and conditions of - * CANN Open Software License Agreement Version 2.0 (the "License"). - * Please refer to the License for details. You may not use this file except in compliance with the License. - * THIS SOFTWARE IS PROVIDED ON AN "AS IS" BASIS, WITHOUT WARRANTIES OF ANY KIND, EITHER EXPRESS OR IMPLIED, - * INCLUDING BUT NOT LIMITED TO NON-INFRINGEMENT, MERCHANTABILITY, OR FITNESS FOR A PARTICULAR PURPOSE. - * See LICENSE in the root of the software repository for the full text of the License. 
- * ----------------------------------------------------------------------------------------------------------- - */ - -/** - * PTO Runtime2 - Core Type Definitions - * - * This header defines all fundamental types used by the PTO Runtime2 system: - * - Configuration constants - * - Worker types and task states - * - Tensor regions and task parameters - * - Task descriptors with fanin/fanout tracking - * - Dependency list entries - * - * Based on: docs/RUNTIME_LOGIC.md - */ - -#ifndef SRC_A2A3_RUNTIME_TENSORMAP_AND_RINGBUFFER_RUNTIME_PTO_RUNTIME2_TYPES_H_ -#define SRC_A2A3_RUNTIME_TENSORMAP_AND_RINGBUFFER_RUNTIME_PTO_RUNTIME2_TYPES_H_ - -#include -#include -#include - -#include - -#include "pto2_dispatch_payload.h" -#include "pto_submit_types.h" -#include "pto_task_id.h" -#include "pto_types.h" - -// ============================================================================= -// Profiling Configuration -// ============================================================================= - -#ifndef PTO2_PROFILING -#define PTO2_PROFILING 1 -#endif - -#ifndef PTO2_ORCH_PROFILING -#define PTO2_ORCH_PROFILING 0 -#endif - -#ifndef PTO2_SCHED_PROFILING -#define PTO2_SCHED_PROFILING 0 -#endif - -#ifndef PTO2_TENSORMAP_PROFILING -#define PTO2_TENSORMAP_PROFILING 0 -#endif - -#if PTO2_ORCH_PROFILING && !PTO2_PROFILING -#error "PTO2_ORCH_PROFILING requires PTO2_PROFILING=1" -#endif - -#if PTO2_SCHED_PROFILING && !PTO2_PROFILING -#error "PTO2_SCHED_PROFILING requires PTO2_PROFILING=1" -#endif - -#if PTO2_TENSORMAP_PROFILING && !PTO2_ORCH_PROFILING -#error "PTO2_TENSORMAP_PROFILING requires PTO2_ORCH_PROFILING=1" -#endif - -// ============================================================================= -// AICPU Error Codes (written to shared memory for Host-side diagnosis) -// ============================================================================= - -// Orchestrator errors (1-99): detected in orchestrator thread -#define PTO2_ERROR_NONE 0 -#define 
PTO2_ERROR_SCOPE_DEADLOCK 1 -#define PTO2_ERROR_HEAP_RING_DEADLOCK 2 -#define PTO2_ERROR_FLOW_CONTROL_DEADLOCK 3 -#define PTO2_ERROR_DEP_POOL_OVERFLOW 4 -#define PTO2_ERROR_INVALID_ARGS 5 // Arg construction error (invalid args) -#define PTO2_ERROR_DEPENDENCY_OVERFLOW 6 // Too many unique fanin dependencies for one task - -// Scheduler errors (100+): detected in scheduler threads -#define PTO2_ERROR_SCHEDULER_TIMEOUT 100 - -// ============================================================================= -// Configuration Constants -// ============================================================================= - -// Task management -// NOTE: PTO2_TASK_WINDOW_SIZE is now a per-ring default value. -// Actual window size is passed at runtime to pto2_runtime_create_threaded_custom(). -// Use pto2_task_slot(sched, task_id) for slot calculation. -#define PTO2_TASK_WINDOW_SIZE 16384 // Default per-ring task window size (power of 2) - -// Multi-ring: number of independent ring layers (HeapRing + TaskRing + DepPool per layer) -// Scope depth maps to ring index via: min(scope_depth, PTO2_MAX_RING_DEPTH - 1) -#define PTO2_MAX_RING_DEPTH 4 - -// Memory pools (per-ring defaults; total = value × PTO2_MAX_RING_DEPTH) -#define PTO2_HEAP_SIZE (256 * 1024 * 1024) // 256MB per ring (1GB total) -#define PTO2_DEP_LIST_POOL_SIZE 16384 // Per-ring dependency list pool entries -#define PTO2_TENSORMAP_POOL_SIZE (65536) // TensorMap entry pool -#define PTO2_TENSORMAP_NUM_BUCKETS 4096 // Power of 2 for fast hash (4096×8B=32KB fits L1) - -// Scope management -#define PTO2_MAX_SCOPE_DEPTH 64 // Maximum nesting depth -#define PTO2_SCOPE_TASKS_INIT_CAP 65536 // Initial capacity for scope task buffer - -// Ready queue -#define PTO2_READY_QUEUE_SIZE 65536 // Per-shape queue size - -// Memory alignment -#define PTO2_ALIGN_SIZE 64 // Cache line alignment -#define PTO2_PACKED_OUTPUT_ALIGN 1024 // Each output in packed buffer aligned to 1024B; gap is padding -#define PTO2_ALIGN_UP(x, align) (((x) + 
(align) - 1) & ~((align) - 1)) - -// TensorMap cleanup interval -#define PTO2_TENSORMAP_CLEANUP_INTERVAL 64 // Cleanup every N retired tasks -#define PTO2_DEP_POOL_CLEANUP_INTERVAL 64 // Cleanup every N retired tasks - -// get_tensor_data/set_tensor_data spin wait timeout in cycles. -// ~10s on hardware (1.5 GHz counter), ~10s on simulation (chrono-based). -constexpr uint64_t PTO2_TENSOR_DATA_TIMEOUT_CYCLES = 15 * 1000 * 1000 * 1000ULL; - -// ============================================================================= -// Multi-Ring task_id Encoding -// ============================================================================= - -/** - * TaskId: defined in pto_task_id.h (included above). - */ - -// ============================================================================= -// Worker Types -// ============================================================================= - -/** - * Worker type enumeration - * Each worker type has its own ready queue for load balancing - */ -typedef enum { - PTO2_WORKER_CUBE = 0, // AICore CUBE unit (matrix ops) - PTO2_WORKER_VECTOR = 1, // AICore VECTOR unit (element-wise ops) - PTO2_WORKER_AI_CPU = 2, // AI_CPU (scalar ops, control flow) - PTO2_WORKER_ACCELERATOR = 3, // Fixed-function accelerators (DMA, etc.) 
- PTO2_NUM_WORKER_TYPES = 4 -} PTO2WorkerType; - -// ============================================================================= -// Task States -// ============================================================================= - -/** - * Task state enumeration - * - * State transitions: - * PENDING -> READY -> RUNNING -> COMPLETED -> CONSUMED - * - * Conditions: - * PENDING->READY: fanin_refcount == fanin_count - * COMPLETED->CONSUMED: fanout_refcount == fanout_count && state == COMPLETED - */ -typedef enum { - PTO2_TASK_PENDING = 0, // Waiting for dependencies (fanin_refcount < fanin_count) - PTO2_TASK_READY = 1, // All dependencies satisfied, waiting in ready queue - PTO2_TASK_RUNNING = 2, // Currently executing on a worker - PTO2_TASK_COMPLETED = 3, // Execution finished, output may still be in use - PTO2_TASK_CONSUMED = 4 // Output fully consumed, buffers can be released -} PTO2TaskState; - -// ============================================================================= -// Logical Tensor (for view/reshape/transpose operations) -// ============================================================================= - -/** - * Maximum dimensions supported for logical tensors - */ -#define PTO2_MAX_TENSOR_DIM 8 - -/** - * Maximum depth of layout history for HBB overlap detection - * Simple (contiguous) tensor has depth=1, non-contiguous has depth>1 - */ -#define PTO2_MAX_LAYOUT_DEPTH 8 - -/** - * Layout operation type for HBB - */ -typedef enum { - PTO2_LAYOUT_VIEW = 0, // View/slice: records bounding box - PTO2_LAYOUT_RESHAPE = 1, // Reshape: records new shape - PTO2_LAYOUT_TRANSPOSE = 2 // Transpose: records permutation -} PTO2LayoutOpType; - -/** - * Layout operation entry for HBB - * Each entry records one derivation step from the parent tensor. 
- */ -typedef struct { - PTO2LayoutOpType type; - union { - struct { // PTO2_LAYOUT_VIEW - int64_t bbox_min; // First byte accessed - int64_t bbox_max; // Last byte accessed - } view; - struct { // PTO2_LAYOUT_RESHAPE - int32_t ndim; - int64_t shape[PTO2_MAX_TENSOR_DIM]; - } reshape; - struct { // PTO2_LAYOUT_TRANSPOSE - int32_t ndim; - int32_t perm[PTO2_MAX_TENSOR_DIM]; - } transpose; - }; -} PTO2LayoutOp; - -/** - * Tensor extraction type (for tracking how tensor was created) - */ -typedef enum { - PTO2_TENSOR_RAW = 0, // Original raw tensor (owns storage) - PTO2_TENSOR_VIEW = 1, // view() - subset selection, shared storage - PTO2_TENSOR_RESHAPE = 2, // reshape() - shape change, shared storage - PTO2_TENSOR_TRANSPOSE = 3, // transpose() - dimension permute, shared storage - PTO2_TENSOR_DEEP_VIEW = 4, // deep_view() - copied subset, new storage - PTO2_TENSOR_DEEP_RESHAPE = 5, // deep_reshape() - copied reshape, new storage - PTO2_TENSOR_DEEP_TRANSPOSE = 6 // deep_transpose() - copied transpose, new storage -} PTO2TensorExtractionType; - -/** - * Raw tensor (storage provider) - * - * The raw tensor owns the actual memory allocation. - * Multiple logical tensors can share the same raw tensor (aliasing). - */ -typedef struct { - void* base_ptr; // Base pointer of allocated memory - int64_t total_size; // Total size in bytes - int32_t refcount; // Number of logical tensors referencing this storage - // (for memory management, 0 = can be freed) -} PTO2RawTensor; - -/** - * Logical tensor structure - * - * A "view" into raw tensor storage with specific layout. - * Supports multi-dimensional tensors with strides (for view/reshape/transpose). 
- * - * Memory footprint is determined by: - * - storage_offset: byte offset from raw_base to first element - * - shape[d]: number of elements in dimension d - * - strides[d]: byte offset between consecutive elements in dimension d - * - * For element at indices [i0, i1, ..., i_{n-1}]: - * byte_offset = storage_offset + sum(i_d * strides[d]) - * - * Examples: - * - Contiguous row-major (3,4): shape=[3,4], strides=[4*elem_size, elem_size] - * - Transposed (4,3): shape=[4,3], strides=[elem_size, 4*elem_size] - * - Sliced [1:3, 1:3]: offset adjusted, shape=[2,2], strides unchanged - */ -typedef struct { - // === Raw tensor reference (shared storage) === - void* raw_base; // Pointer to raw tensor's base (for aliasing check) - int64_t raw_total_size; // Total size of raw tensor in bytes - - // === Storage offset === - int64_t storage_offset; // Byte offset from raw_base to first element - - // === Shape and strides === - int64_t shape[PTO2_MAX_TENSOR_DIM]; // Size in each dimension - int64_t strides[PTO2_MAX_TENSOR_DIM]; // Byte stride in each dimension - int32_t ndim; // Number of dimensions (0 = scalar) - - // === Precomputed bounding box (for fast overlap detection) === - int64_t min_byte_offset; // First byte accessed (relative to raw_base) - int64_t max_byte_offset; // Last byte accessed (relative to raw_base) - - // === Element info === - int64_t elem_size; // Size of each element in bytes - int64_t numel; // Total number of elements - - // === Extraction tracking === - PTO2TensorExtractionType extraction_type; // How this tensor was created - bool is_contiguous; // True if memory is contiguous (no gaps) - // Equivalent to layout_depth == 1 - - // === Layout history for HBB overlap detection === - int32_t layout_depth; // Number of layout ops (1=simple) - PTO2LayoutOp layout_ops[PTO2_MAX_LAYOUT_DEPTH]; // Derivation history -} PTO2LogicalTensor; - -// ============================================================================= -// Dependency List Entry -// 
============================================================================= - -/** - * Dependency list entry (singly-linked list node) - * Stored in DepListPool ring buffer - * - * Used for both fanin_list and fanout_list - */ -struct PTO2TaskSlotState; // Forward declaration -struct PTO2DepListEntry { - PTO2TaskSlotState* slot_state; // Consumer slot state (direct pointer) - PTO2DepListEntry* next; // next entry -}; - -// ============================================================================= -// Task Descriptor -// ============================================================================= - -/** - * Task descriptor structure (shared memory) - * - * Stored in the TaskDescriptor ring buffer in shared memory. - * Contains static identification and buffer pointers only. - * Dynamic scheduling state (fanin/fanout/task_state) is in PTO2TaskSlotState. - * - * Fields set by Orchestrator at submission, read by Scheduler for dispatch. - */ -struct PTO2TaskDescriptor { - // Mixed-task identification (encodes ring_id in upper 32 bits) - PTO2TaskId task_id; // raw: (ring_id << 32) | local_id - - // Per-slot kernel IDs (INVALID_KERNEL_ID = inactive) - int32_t kernel_id[PTO2_SUBTASK_SLOT_COUNT]; - - // Packed output buffer (all outputs packed into single contiguous buffer) - void* packed_buffer_base; // Start of packed buffer in GM Heap - void* packed_buffer_end; // End of packed buffer (for heap reclamation) -}; - -// ============================================================================= -// Per-Slot Scheduling State -// ============================================================================= - -/** - * Task payload data (cold path - only accessed during orchestration and dispatch) - * - * Layout: metadata (counts, fanin pointers) packed in the first 3 cache lines, - * followed by bulk tensor and scalar data. 
This gives sequential write access - * during orchestration and groups scheduler-hot fields (fanin_actual_count + - * fanin_slot_states) together for on_task_release. - */ -struct PTO2TaskPayload { - // === Cache lines 0-2 (192B) — metadata === - int32_t tensor_count{0}; - int32_t scalar_count{0}; - int32_t fanin_actual_count{0}; // Actual fanin count (without the +1 redundance) - int32_t _reserved{0}; // Reserved (dep_pool_mark moved to SlotState for local access) - PTO2TaskSlotState* fanin_slot_states[PTO2_MAX_INPUTS]; // Producer slot states (used by on_task_release) - // === Cache lines 3-34 (2048B) — tensors (alignas(64) forces alignment) === - Tensor tensors[MAX_TENSOR_ARGS]; - // === Cache lines 35-50 (1024B) — scalars === - uint64_t scalars[MAX_SCALAR_ARGS]; - - // Layout verification (size checks that don't need offsetof). - static_assert(sizeof(Tensor) == 128, "Tensor must be 2 cache lines"); - static_assert(MAX_SCALAR_ARGS * sizeof(uint64_t) == 1024, "scalar region must be 1024B (16 cache lines)"); - - /** - * Initialize payload: copy tensors, store scalars. 
- *
- * For each param slot, the tensor source is determined by TensorArgType:
- * - OUTPUT -> use materialized_outputs.output_ptr(out_idx++)
- * - INPUT / INOUT -> use refs[i].tensor
- *
- * @param args Task arguments (tensors + scalars)
- * @param result Materialized output tensors (from TensorCreateInfo path)
- */
- void init(
-     const Arg& args, TaskOutputTensors& result, void* base_addr, uint64_t offsets[], uint64_t buffer_sizes[]) {
-     tensor_count = args.tensor_count();
-     scalar_count = args.scalar_count();
-
-     // int32_t out_idx = 0;
-     for (int32_t i = 0; i < args.tensor_count(); i++) {
-         if (args.tag(i) != TensorArgType::OUTPUT) {
-             tensors[i].copy(*args.tensor(i).ptr);
-         } else {
-             tensors[i].init_from_create_info(*args.tensor(i).create_info,
-                 reinterpret_cast<void*>(reinterpret_cast<uint64_t>(base_addr) + offsets[i]),
-                 buffer_sizes[i]);
-             result.materialize_output(tensors[i]);
-         }
-         tensors[i].update_start_offset();
-     }
-     // Round up to cache line boundary. Both arrays are 1024B so no overrun.
-     // Eliminates branches; extra bytes within the same CL have zero additional cost.
-     memcpy(scalars, args.scalars(), PTO2_ALIGN_UP(args.scalar_count() * sizeof(uint64_t), 64));
- }
-};
-
-// PTO2TaskPayload layout verification (offsetof requires complete type).
-static_assert(offsetof(PTO2TaskPayload, tensors) == 192, "tensors must start at byte 192 (cache line 3)");
-static_assert(offsetof(PTO2TaskPayload, scalars) == 192 + MAX_TENSOR_ARGS * sizeof(Tensor),
-    "scalars must immediately follow tensors");
-
-/**
- * Per-task slot scheduling state (scheduler-private, NOT in shared memory)
- *
- * Consolidates all hot-path scheduling fields into a single cache-friendly
- * structure (64 bytes = one cache line). Accessing any field of a task's
- * slot state brings all related fields into the same cache line.
- *
- * Concurrency notes:
- * - fanout_head, fanout_count protected by fanout_lock (per-task spinlock)
- * - fanin_count set once at submission, read-only after (hot path for ready check)
- * - task_state, fanin_refcount, fanout_refcount updated atomically
- */
-struct alignas(64) PTO2TaskSlotState {
-    // Fanout lock + list (accessed together under lock in on_task_complete)
-    std::atomic<int32_t> fanout_lock;       // Per-task spinlock (0=unlocked, 1=locked)
-    int32_t fanout_count;                   // 1 (owning scope) + number of consumers
-
-    PTO2DepListEntry* fanout_head;          // Pointer to first fanout entry (nullptr = empty)
-
-    // Task state (completion, consumed check, ready check)
-    std::atomic<PTO2TaskState> task_state;  // PENDING/READY/RUNNING/COMPLETED/CONSUMED
-
-    // Fanin (accessed together in release_fanin_and_check_ready)
-    std::atomic<int32_t> fanin_refcount;    // Dynamic: counts completed producers
-    int32_t fanin_count;                    // Number of producer dependencies (set once)
-
-    // Fanout refcount (accessed with fanout_count in check_and_handle_consumed)
-    std::atomic<int32_t> fanout_refcount;   // Dynamic: counts released references
-
-    PTO2TaskPayload* payload;
-
-    PTO2TaskDescriptor* task;
-
-    // Hot-path completion fields (moved from TaskDescriptor to avoid cross-struct access)
-    uint8_t active_mask;                    // Bitmask of active subtask slots (set once)
-    std::atomic<uint8_t> subtask_done_mask; // Deprecated: superseded by completed_subtasks
-    uint8_t ring_id;                        // Ring layer this task belongs to (for per-ring reclamation)
-    int32_t dep_pool_mark{0};               // Dep pool top after this task's submission (orchestrator-only, local memory)
-
-    // SPMD multi-block (occupies the 8 tail bytes previously implicit padding)
-    std::atomic<int16_t> completed_subtasks{0}; // Each core completion increments by 1
-    int16_t total_required_subtasks{0};     // = block_num * popcount(active_mask)
-    int16_t block_num{1};                   // Total logical blocks (set by orchestrator)
-    int16_t next_block_idx{0};              // Next block to dispatch (scheduler state)
-};
-
-static_assert(sizeof(PTO2TaskSlotState)
== 64); - -// ============================================================================= -// Cycle Cost Function Type -// ============================================================================= - -/** - * Cycle cost function pointer type - * Returns estimated cycle count for the InCore function - */ -typedef int64_t (*PTO2CycleCostFunc)(void** args, int32_t num_args); - -// ============================================================================= -// InCore Function Type -// ============================================================================= - -/** - * InCore function signature - * All InCore functions must match this signature - */ -typedef void (*PTO2InCoreFunc)(void** args, int32_t num_args); - -// ============================================================================= -// Utility Macros -// ============================================================================= - -/** - * Memory barrier macros for different architectures - */ -#if defined(__aarch64__) -#define PTO2_MEMORY_BARRIER() __asm__ __volatile__("dmb sy" ::: "memory") -#elif defined(__x86_64__) -#define PTO2_MEMORY_BARRIER() __asm__ __volatile__("mfence" ::: "memory") -#else -#define PTO2_MEMORY_BARRIER() __sync_synchronize() -#endif - -// Spin-wait hint for AICPU threads. On real hardware the AICPU has dedicated -// ARM A55 cores — no OS yield is needed, so the hint is a no-op. In simulation -// all threads share host CPU cores, so we yield to prevent starvation. -// This header is also compiled into the Host .so (for struct definitions only), -// where the hint is never called — the fallback no-op keeps Host builds clean. -#if __has_include("spin_hint.h") -#include "spin_hint.h" -#else -#define SPIN_WAIT_HINT() ((void)0) -#endif - -// ============================================================================= -// Per-task fanout spinlock helpers -// -// Used by BOTH the orchestrator (pto_orchestrator.cpp) and the scheduler -// (aicpu_executor.cpp). 
Placing them here ensures both translation units use -// identical acquire/release semantics. -// -// The fanout_lock MUST be held whenever reading or writing fanout_head / -// fanout_count, because the orchestrator adds consumers concurrently with the -// scheduler traversing the list after task completion. -// ============================================================================= - -#if PTO2_ORCH_PROFILING || PTO2_SCHED_PROFILING -#include "aicpu/device_time.h" -#endif - -#if PTO2_ORCH_PROFILING || PTO2_SCHED_PROFILING -static inline void pto2_fanout_lock(PTO2TaskSlotState& slot_state, uint64_t& atomic_count, uint64_t& wait_cycle) { - uint64_t t0 = get_sys_cnt_aicpu(); - bool contended = false; - uint32_t atomic_ops = 0; - - for (;;) { - while (slot_state.fanout_lock.load(std::memory_order_acquire) != 0) { - contended = true; - atomic_ops++; // each load = 1 atomic - SPIN_WAIT_HINT(); - } - int32_t expected = 0; - if (slot_state.fanout_lock.compare_exchange_weak( - expected, 1, std::memory_order_acquire, std::memory_order_relaxed)) { - atomic_ops++; // successful CAS = 1 atomic - atomic_count += atomic_ops; - if (contended) { - wait_cycle += (get_sys_cnt_aicpu() - t0); - } - return; - } - contended = true; - atomic_ops++; // failed CAS = 1 atomic - } -} -#endif - -static inline void pto2_fanout_lock(PTO2TaskSlotState& slot_state) { - for (;;) { - while (slot_state.fanout_lock.load(std::memory_order_acquire) != 0) { - SPIN_WAIT_HINT(); - } - int32_t expected = 0; - if (slot_state.fanout_lock.compare_exchange_weak( - expected, 1, std::memory_order_acquire, std::memory_order_relaxed)) { - return; - } - } -} - -static inline void pto2_fanout_unlock(PTO2TaskSlotState& slot_state) { - slot_state.fanout_lock.store(0, std::memory_order_release); -} - -#endif // SRC_A2A3_RUNTIME_TENSORMAP_AND_RINGBUFFER_RUNTIME_PTO_RUNTIME2_TYPES_H_ diff --git a/src/a2a3/runtime/tensormap_and_ringbuffer_unmodified/runtime/pto_scheduler.cpp 
b/src/a2a3/runtime/tensormap_and_ringbuffer_unmodified/runtime/pto_scheduler.cpp deleted file mode 100644 index 2fa7b0cd3..000000000 --- a/src/a2a3/runtime/tensormap_and_ringbuffer_unmodified/runtime/pto_scheduler.cpp +++ /dev/null @@ -1,220 +0,0 @@ -/** - * PTO Runtime2 - Scheduler Implementation - * - * Implements scheduler state management, ready queues, and task lifecycle. - * - * Based on: docs/RUNTIME_LOGIC.md - */ - -#include "pto_scheduler.h" -#include -#include -#include -#include -#include "common/unified_log.h" - -// ============================================================================= -// Scheduler Profiling Counters -// ============================================================================= - -#if PTO2_SCHED_PROFILING -#include "common/platform_config.h" - -uint64_t g_sched_lock_cycle[PLATFORM_MAX_AICPU_THREADS] = {}; -uint64_t g_sched_fanout_cycle[PLATFORM_MAX_AICPU_THREADS] = {}; -uint64_t g_sched_fanin_cycle[PLATFORM_MAX_AICPU_THREADS] = {}; -uint64_t g_sched_self_consumed_cycle[PLATFORM_MAX_AICPU_THREADS] = {}; -uint64_t g_sched_lock_wait_cycle[PLATFORM_MAX_AICPU_THREADS] = {}; -uint64_t g_sched_push_wait_cycle[PLATFORM_MAX_AICPU_THREADS] = {}; -uint64_t g_sched_pop_wait_cycle[PLATFORM_MAX_AICPU_THREADS] = {}; -uint64_t g_sched_lock_atomic_count[PLATFORM_MAX_AICPU_THREADS] = {}; -uint64_t g_sched_fanout_atomic_count[PLATFORM_MAX_AICPU_THREADS] = {}; -uint64_t g_sched_fanin_atomic_count[PLATFORM_MAX_AICPU_THREADS] = {}; -uint64_t g_sched_self_atomic_count[PLATFORM_MAX_AICPU_THREADS] = {}; -uint64_t g_sched_pop_atomic_count[PLATFORM_MAX_AICPU_THREADS] = {}; -uint64_t g_sched_complete_count[PLATFORM_MAX_AICPU_THREADS] = {}; - -PTO2SchedProfilingData pto2_scheduler_get_profiling(int thread_idx) { - PTO2SchedProfilingData d; - d.lock_cycle = std::exchange(g_sched_lock_cycle[thread_idx], 0); - d.fanout_cycle = std::exchange(g_sched_fanout_cycle[thread_idx], 0); - d.fanin_cycle = std::exchange(g_sched_fanin_cycle[thread_idx], 0); - 
d.self_consumed_cycle = std::exchange(g_sched_self_consumed_cycle[thread_idx], 0); - d.lock_wait_cycle = std::exchange(g_sched_lock_wait_cycle[thread_idx], 0); - d.push_wait_cycle = std::exchange(g_sched_push_wait_cycle[thread_idx], 0); - d.pop_wait_cycle = std::exchange(g_sched_pop_wait_cycle[thread_idx], 0); - d.lock_atomic_count = std::exchange(g_sched_lock_atomic_count[thread_idx], 0); - d.fanout_atomic_count = std::exchange(g_sched_fanout_atomic_count[thread_idx], 0); - d.fanin_atomic_count = std::exchange(g_sched_fanin_atomic_count[thread_idx], 0); - d.self_atomic_count = std::exchange(g_sched_self_atomic_count[thread_idx], 0); - d.pop_atomic_count = std::exchange(g_sched_pop_atomic_count[thread_idx], 0); - d.complete_count = std::exchange(g_sched_complete_count[thread_idx], 0); - return d; -} -#endif - -// ============================================================================= -// Task State Names -// ============================================================================= - -const char* pto2_task_state_name(PTO2TaskState state) { - switch (state) { - case PTO2_TASK_PENDING: return "PENDING"; - case PTO2_TASK_READY: return "READY"; - case PTO2_TASK_RUNNING: return "RUNNING"; - case PTO2_TASK_COMPLETED: return "COMPLETED"; - case PTO2_TASK_CONSUMED: return "CONSUMED"; - default: return "UNKNOWN"; - } -} - -// ============================================================================= -// Ready Queue Implementation -// ============================================================================= - -bool pto2_ready_queue_init(PTO2ReadyQueue* queue, uint64_t capacity) { - queue->slots = (PTO2ReadyQueueSlot*)malloc(capacity * sizeof(PTO2ReadyQueueSlot)); - if (!queue->slots) { - return false; - } - - queue->capacity = capacity; - queue->mask = capacity - 1; - queue->enqueue_pos.store(0, std::memory_order_relaxed); - queue->dequeue_pos.store(0, std::memory_order_relaxed); - - for (uint64_t i = 0; i < capacity; i++) { - 
queue->slots[i].sequence.store((int64_t)i, std::memory_order_relaxed); - queue->slots[i].slot_state = nullptr; - } - - return true; -} - -void pto2_ready_queue_destroy(PTO2ReadyQueue* queue) { - if (queue->slots) { - free(queue->slots); - queue->slots = NULL; - } -} - -// ============================================================================= -// Scheduler Initialization -// ============================================================================= - -bool PTO2SchedulerState::RingSchedState::init( - PTO2SharedMemoryHandle* sm_handle, int32_t ring_id) { - task_descriptors = sm_handle->task_descriptors[ring_id]; - task_window_size = sm_handle->header->rings[ring_id].task_window_size; - task_window_mask = static_cast(task_window_size - 1); - last_task_alive = 0; - slot_states = nullptr; - advance_lock.store(0, std::memory_order_relaxed); - - // Allocate per-task slot state array (dynamically sized based on runtime window_size) - slot_states = new (std::nothrow) PTO2TaskSlotState[task_window_size]; - if (!slot_states) { - return false; - } - - // Zero-initialize all per-task slot state fields. 
- for (uint64_t i = 0; i < task_window_size; i++) {
-     slot_states[i].fanout_lock.store(0, std::memory_order_relaxed);
-     slot_states[i].fanout_count = 0;
-     slot_states[i].fanout_head = nullptr;
-     slot_states[i].task_state.store(static_cast<PTO2TaskState>(0), std::memory_order_relaxed);
-     slot_states[i].fanin_refcount.store(0, std::memory_order_relaxed);
-     slot_states[i].fanin_count = 0;
-     slot_states[i].fanout_refcount.store(0, std::memory_order_relaxed);
-     slot_states[i].payload = nullptr;
-     slot_states[i].task = nullptr;
-     slot_states[i].active_mask = 0;
-     slot_states[i].subtask_done_mask.store(0, std::memory_order_relaxed);
-     slot_states[i].ring_id = 0;
- }
-
- return true;
-}
-
-void PTO2SchedulerState::RingSchedState::destroy() {
- if (!slot_states) return;
- delete[] slot_states;
- slot_states = nullptr;
-}
-
-bool pto2_scheduler_init(PTO2SchedulerState* sched,
-                         PTO2SharedMemoryHandle* sm_handle) {
- sched->sm_handle = sm_handle;
-#if PTO2_SCHED_PROFILING
- sched->tasks_completed.store(0, std::memory_order_relaxed);
- sched->tasks_consumed.store(0, std::memory_order_relaxed);
-#endif
-
- // Initialize per-ring state
- for (int r = 0; r < PTO2_MAX_RING_DEPTH; r++) {
-     if (!sched->ring_sched_states[r].init(sm_handle, r)) {
-         for (int j = 0; j < r; j++) {
-             sched->ring_sched_states[j].destroy();
-         }
-         return false;
-     }
- }
-
- // Initialize ready queues (one per resource shape, global)
- for (int i = 0; i < PTO2_NUM_RESOURCE_SHAPES; i++) {
-     if (!pto2_ready_queue_init(&sched->ready_queues[i], PTO2_READY_QUEUE_SIZE)) {
-         // Cleanup on failure
-         for (int j = 0; j < i; j++) {
-             pto2_ready_queue_destroy(&sched->ready_queues[j]);
-         }
-         for (int r = 0; r < PTO2_MAX_RING_DEPTH; r++) {
-             sched->ring_sched_states[r].destroy();
-         }
-         return false;
-     }
- }
-
- return true;
-}
-
-void pto2_scheduler_destroy(PTO2SchedulerState* sched) {
- for (int r = 0; r < PTO2_MAX_RING_DEPTH; r++) {
-     sched->ring_sched_states[r].destroy();
- }
-
- for (int i = 0; i < PTO2_NUM_RESOURCE_SHAPES; i++) {
-    pto2_ready_queue_destroy(&sched->ready_queues[i]);
-  }
-}
-
-// =============================================================================
-// Debug Utilities
-// =============================================================================
-
-void pto2_scheduler_print_stats(PTO2SchedulerState* sched) {
-  LOG_INFO("=== Scheduler Statistics ===");
-  for (int r = 0; r < PTO2_MAX_RING_DEPTH; r++) {
-    if (sched->ring_sched_states[r].last_task_alive > 0) {
-      LOG_INFO("Ring %d:", r);
-      LOG_INFO("  last_task_alive: %d", sched->ring_sched_states[r].last_task_alive);
-    }
-  }
-#if PTO2_SCHED_PROFILING
-  LOG_INFO("tasks_completed: %lld", (long long)sched->tasks_completed.load(std::memory_order_relaxed));
-  LOG_INFO("tasks_consumed: %lld", (long long)sched->tasks_consumed.load(std::memory_order_relaxed));
-#endif
-  LOG_INFO("============================");
-}
-
-void pto2_scheduler_print_queues(PTO2SchedulerState* sched) {
-  LOG_INFO("=== Ready Queues ===");
-
-  const char* shape_names[] = {"AIC", "AIV", "MIX"};
-
-  for (int i = 0; i < PTO2_NUM_RESOURCE_SHAPES; i++) {
-    LOG_INFO("  %s: count=%" PRIu64, shape_names[i],
-             sched->ready_queues[i].size());
-  }
-
-  LOG_INFO("====================");
-}
diff --git a/src/a2a3/runtime/tensormap_and_ringbuffer_unmodified/runtime/pto_scheduler.h b/src/a2a3/runtime/tensormap_and_ringbuffer_unmodified/runtime/pto_scheduler.h
deleted file mode 100644
index d9b984ce0..000000000
--- a/src/a2a3/runtime/tensormap_and_ringbuffer_unmodified/runtime/pto_scheduler.h
+++ /dev/null
@@ -1,819 +0,0 @@
-/*
- * Copyright (c) PyPTO Contributors.
- * This program is free software, you can redistribute it and/or modify it under the terms and conditions of
- * CANN Open Software License Agreement Version 2.0 (the "License").
- * Please refer to the License for details. You may not use this file except in compliance with the License.
- * THIS SOFTWARE IS PROVIDED ON AN "AS IS" BASIS, WITHOUT WARRANTIES OF ANY KIND, EITHER EXPRESS OR IMPLIED,
- * INCLUDING BUT NOT LIMITED TO NON-INFRINGEMENT, MERCHANTABILITY, OR FITNESS FOR A PARTICULAR PURPOSE.
- * See LICENSE in the root of the software repository for the full text of the License.
- * -----------------------------------------------------------------------------------------------------------
- */
-
-/**
- * PTO Runtime2 - Scheduler Interface
- *
- * The Scheduler is responsible for:
- * 1. Maintaining per-resource-shape ready queues
- * 2. Tracking task state (PENDING -> READY -> RUNNING -> COMPLETED -> CONSUMED)
- * 3. Managing fanin/fanout refcounts for dependency resolution
- * 4. Advancing last_task_alive for heap reclamation
- * 5. Two-stage mixed-task completion (subtask done bits → mixed-task complete)
- *
- * The Scheduler runs on Device AI_CPU and processes:
- * - Task state transitions based on fanin_refcount
- * - Buffer lifecycle based on fanout_refcount
- * - Ring pointer advancement for flow control
- *
- * Based on: docs/RUNTIME_LOGIC.md
- */
-
-#pragma once
-
-#include <atomic>
-
-#include "common/core_type.h"
-#include "pto_ring_buffer.h"
-#include "pto_runtime2_types.h"
-#include "pto_shared_memory.h"
-
-#if PTO2_SCHED_PROFILING
-#include "aicpu/device_time.h"
-#define PTO2_SCHED_CYCLE_START() uint64_t _st0 = get_sys_cnt_aicpu(), _st1
-#define PTO2_SCHED_CYCLE_LAP(acc) \
-  do {                            \
-    _st1 = get_sys_cnt_aicpu();   \
-    acc += (_st1 - _st0);         \
-    _st0 = _st1;                  \
-  } while (0)
-#endif
-
-// =============================================================================
-// Ready Queue (Lock-free bounded MPMC — Vyukov design)
-// =============================================================================
-
-/**
- * Per-slot entry: sequence counter for ABA safety + task payload
- */
-struct PTO2ReadyQueueSlot {
-  std::atomic<int64_t> sequence;
-  PTO2TaskSlotState* slot_state;
-};
-
-/**
- * Thread-local ready buffer for local-first dispatch optimization.
- *
- * Two buffers per scheduling thread, one per CoreType (AIC=0, AIV=1).
- * Initialized once before the scheduling loop; must be empty at
- * the start of each iteration (verified by always_assert).
- *
- * Phase 1 fills per-CoreType buffers via on_task_complete().
- * dispatch_ready_tasks_to_idle_cores drains them: local-first via
- * get_ready_task_batch, then remaining tasks pushed to global readyQ.
- */
-// Number of CoreType values eligible for local dispatch (AIC=0, AIV=1)
-static constexpr int PTO2_LOCAL_DISPATCH_TYPE_NUM = 2;
-
-struct PTO2LocalReadyBuffer {
-  PTO2TaskSlotState** slot_states = nullptr;
-  int count = 0;
-  int capacity = 0;
-
-  void reset(PTO2TaskSlotState** buf, int cap) {
-    slot_states = buf;
-    count = 0;
-    capacity = cap;
-  }
-
-  bool try_push(PTO2TaskSlotState* s) {
-    if (slot_states && count < capacity) {
-      slot_states[count++] = s;
-      return true;
-    }
-    return false;
-  }
-
-  PTO2TaskSlotState* pop() { return (count > 0) ? slot_states[--count] : nullptr; }
-};
-
-/**
- * Lock-free bounded MPMC queue (Dmitry Vyukov design)
- *
- * Key properties:
- * - enqueue_pos and dequeue_pos on separate cache lines (no false sharing)
- * - Per-slot sequence counter prevents ABA problem
- * - Empty queue pop returns immediately (single atomic load, no lock)
- * - CAS contention is split: producers only touch enqueue_pos,
- *   consumers only touch dequeue_pos
- */
-struct alignas(64) PTO2ReadyQueue {
-  PTO2ReadyQueueSlot* slots;
-  uint64_t capacity;
-  uint64_t mask;        // capacity - 1
-  char _pad0[64 - 24];  // Pad to own cache line
-
-  std::atomic<uint64_t> enqueue_pos;
-  char _pad1[64 - sizeof(std::atomic<uint64_t>)];  // Own cache line
-
-  std::atomic<uint64_t> dequeue_pos;
-  char _pad2[64 - sizeof(std::atomic<uint64_t>)];  // Own cache line
-
-  uint64_t size() {
-    uint64_t e = enqueue_pos.load(std::memory_order_relaxed);
-    uint64_t d = dequeue_pos.load(std::memory_order_relaxed);
-    return (e >= d) ? (e - d) : 0;
-  }
-
-  bool push(PTO2TaskSlotState* slot_state) {
-    uint64_t pos;
-    PTO2ReadyQueueSlot* slot;
-    while (true) {
-      pos = enqueue_pos.load(std::memory_order_relaxed);
-      slot = &slots[pos & mask];
-      int64_t seq = slot->sequence.load(std::memory_order_acquire);
-      int64_t diff = seq - static_cast<int64_t>(pos);
-      if (diff == 0) {
-        if (enqueue_pos.compare_exchange_weak(
-                pos, pos + 1, std::memory_order_relaxed, std::memory_order_relaxed)) {
-          break;
-        }
-      } else if (diff < 0) {
-        return false;  // Queue full
-      }
-    }
-
-    slot->slot_state = slot_state;
-    slot->sequence.store(static_cast<int64_t>(pos + 1), std::memory_order_release);
-    return true;
-  }
-
-  // Batch push: reserve count slots with a single CAS after confirming
-  // every target slot is available under the usual Vyukov sequence check.
-  void push_batch(PTO2TaskSlotState** items, int count) {
-    if (count == 0) return;
-
-    uint64_t pos;
-    while (true) {
-      pos = enqueue_pos.load(std::memory_order_relaxed);
-      bool ready = true;
-      for (int i = 0; i < count; i++) {
-        PTO2ReadyQueueSlot* slot = &slots[(pos + i) & mask];
-        int64_t seq = slot->sequence.load(std::memory_order_acquire);
-        int64_t diff = seq - static_cast<int64_t>(pos + i);
-        if (diff != 0) {
-          ready = false;
-          break;
-        }
-      }
-      if (!ready) {
-        continue;
-      }
-      if (enqueue_pos.compare_exchange_weak(
-              pos, pos + count, std::memory_order_relaxed, std::memory_order_relaxed)) {
-        break;
-      }
-    }
-
-    for (int i = 0; i < count; i++) {
-      PTO2ReadyQueueSlot* slot = &slots[(pos + i) & mask];
-      slot->slot_state = items[i];
-      slot->sequence.store(static_cast<int64_t>(pos + i + 1), std::memory_order_release);
-    }
-  }
-
-#if PTO2_ORCH_PROFILING || PTO2_SCHED_PROFILING
-  bool push(PTO2TaskSlotState* slot_state, uint64_t& atomic_count, uint64_t& wait_cycle) {
-    uint64_t pos;
-    PTO2ReadyQueueSlot* slot;
-    uint64_t t0 = get_sys_cnt_aicpu();
-    bool contended = false;
-    uint32_t atomic_ops = 0;
-    while (true) {
-      pos = enqueue_pos.load(std::memory_order_relaxed);
-      slot = &slots[pos & mask];
-      int64_t seq = slot->sequence.load(std::memory_order_acquire);
-      int64_t diff = seq - static_cast<int64_t>(pos);
-      atomic_ops += 2;  // enqueue_pos.load + sequence.load
-      if (diff == 0) {
-        if (enqueue_pos.compare_exchange_weak(
-                pos, pos + 1, std::memory_order_relaxed, std::memory_order_relaxed)) {
-          atomic_ops++;  // successful CAS
-          break;
-        }
-        contended = true;
-        atomic_ops++;  // failed CAS
-      } else if (diff < 0) {
-        return false;  // Queue full
-      } else {
-        contended = true;  // diff > 0: slot not yet released, spin
-      }
-    }
-    atomic_ops++;  // final sequence.store
-    atomic_count += atomic_ops;
-    if (contended) {
-      wait_cycle += (get_sys_cnt_aicpu() - t0);
-    }
-
-    slot->slot_state = slot_state;
-    slot->sequence.store(static_cast<int64_t>(pos + 1), std::memory_order_release);
-    return true;
-  }
-#endif
-
-  PTO2TaskSlotState* pop() {
-    // Fast-path: skip slot load when queue is clearly empty
-    uint64_t d = dequeue_pos.load(std::memory_order_relaxed);
-    uint64_t e = enqueue_pos.load(std::memory_order_relaxed);
-    if (d >= e) {
-      return nullptr;
-    }
-
-    uint64_t pos;
-    PTO2ReadyQueueSlot* slot;
-    while (true) {
-      pos = dequeue_pos.load(std::memory_order_relaxed);
-      slot = &slots[pos & mask];
-      int64_t seq = slot->sequence.load(std::memory_order_acquire);
-      int64_t diff = seq - static_cast<int64_t>(pos + 1);
-      if (diff == 0) {
-        if (dequeue_pos.compare_exchange_weak(
-                pos, pos + 1, std::memory_order_relaxed, std::memory_order_relaxed))
-          break;
-      } else if (diff < 0) {
-        return nullptr;  // Queue empty
-      }
-    }
-
-    PTO2TaskSlotState* result = slot->slot_state;
-    slot->sequence.store(static_cast<int64_t>(pos + mask + 1), std::memory_order_release);
-    return result;
-  }
-
-#if PTO2_SCHED_PROFILING
-  PTO2TaskSlotState* pop(uint64_t& atomic_count, uint64_t& wait_cycle) {
-    // Fast-path: skip slot load when queue is clearly empty
-    uint64_t d = dequeue_pos.load(std::memory_order_relaxed);
-    uint64_t e = enqueue_pos.load(std::memory_order_relaxed);
-    atomic_count += 2;  // dequeue_pos.load + enqueue_pos.load
-    if (d >= e) {
-      return nullptr;
-    }
-
-    uint64_t pos;
-    PTO2ReadyQueueSlot* slot;
-    uint64_t t0 = get_sys_cnt_aicpu();
-    bool contended = false;
-    uint32_t atomic_ops = 0;
-    while (true) {
-      pos = dequeue_pos.load(std::memory_order_relaxed);
-      slot = &slots[pos & mask];
-      int64_t seq = slot->sequence.load(std::memory_order_acquire);
-      int64_t diff = seq - static_cast<int64_t>(pos + 1);
-      atomic_ops += 2;  // dequeue_pos.load + sequence.load
-      if (diff == 0) {
-        if (dequeue_pos.compare_exchange_weak(
-                pos, pos + 1, std::memory_order_relaxed, std::memory_order_relaxed)) {
-          atomic_ops++;  // successful CAS
-          break;
-        }
-        contended = true;
-        atomic_ops++;  // failed CAS
-      } else if (diff < 0) {
-        atomic_count += atomic_ops;
-        return nullptr;  // Queue empty
-      } else {
-        contended = true;
-      }
-    }
-    atomic_ops++;  // final sequence.store
-    atomic_count += atomic_ops;
-    if (contended) {
-      wait_cycle += (get_sys_cnt_aicpu() - t0);
-    }
-
-    PTO2TaskSlotState* result = slot->slot_state;
-    slot->sequence.store(static_cast<int64_t>(pos + mask + 1), std::memory_order_release);
-    return result;
-  }
-#endif
-
-  // Batch pop: reserve a contiguous run of ready slots with a single CAS.
-  // Returns actual number of items popped (may be less than max_count).
- int pop_batch(PTO2TaskSlotState** out, int max_count) { - uint64_t pos; - int count; - while (true) { - pos = dequeue_pos.load(std::memory_order_relaxed); - count = 0; - while (count < max_count) { - PTO2ReadyQueueSlot* slot = &slots[(pos + count) & mask]; - int64_t seq = slot->sequence.load(std::memory_order_acquire); - int64_t diff = seq - static_cast(pos + count + 1); - if (diff == 0) { - count++; - continue; - } - if (diff < 0) { - break; - } - count = -1; - break; - } - if (count == 0) return 0; - if (count < 0) continue; - if (dequeue_pos.compare_exchange_weak( - pos, pos + count, std::memory_order_relaxed, std::memory_order_relaxed)) { - break; - } - } - - for (int i = 0; i < count; i++) { - PTO2ReadyQueueSlot* slot = &slots[(pos + i) & mask]; - out[i] = slot->slot_state; - slot->sequence.store(static_cast(pos + i + mask + 1), std::memory_order_release); - } - return count; - } - -#if PTO2_SCHED_PROFILING - int pop_batch(PTO2TaskSlotState** out, int max_count, uint64_t& atomic_count, uint64_t& wait_cycle) { - uint64_t pos; - int count; - uint64_t t0 = get_sys_cnt_aicpu(); - bool contended = false; - uint32_t atomic_ops = 0; - while (true) { - pos = dequeue_pos.load(std::memory_order_relaxed); - atomic_ops++; // dequeue_pos.load - count = 0; - while (count < max_count) { - PTO2ReadyQueueSlot* slot = &slots[(pos + count) & mask]; - int64_t seq = slot->sequence.load(std::memory_order_acquire); - int64_t diff = seq - static_cast(pos + count + 1); - atomic_ops++; // sequence.load - if (diff == 0) { - count++; - continue; - } - if (diff < 0) { - break; - } - contended = true; - count = -1; - break; - } - if (count == 0) { - atomic_count += atomic_ops; - return 0; - } - if (count < 0) { - continue; - } - if (dequeue_pos.compare_exchange_weak( - pos, pos + count, std::memory_order_relaxed, std::memory_order_relaxed)) { - atomic_ops++; // successful CAS - break; - } - contended = true; - atomic_ops++; // failed CAS - } - - for (int i = 0; i < count; i++) { - 
PTO2ReadyQueueSlot* slot = &slots[(pos + i) & mask]; - out[i] = slot->slot_state; - slot->sequence.store(static_cast(pos + i + mask + 1), std::memory_order_release); - atomic_ops++; // sequence.store - } - atomic_count += atomic_ops; - if (contended) { - wait_cycle += (get_sys_cnt_aicpu() - t0); - } - return count; - } -#endif -}; - -// Cold-path ready queue operations (defined in pto_scheduler.cpp) -bool pto2_ready_queue_init(PTO2ReadyQueue* queue, uint64_t capacity); -void pto2_ready_queue_destroy(PTO2ReadyQueue* queue); - -// ============================================================================= -// Scheduler State -// ============================================================================= - -/** - * Statistics returned by mixed-task completion processing - */ -struct PTO2CompletionStats { - int32_t fanout_edges; // Number of fanout edges traversed (notify consumers) - int32_t tasks_enqueued; // Number of consumers that became READY - int32_t fanin_edges; // Number of fanin edges traversed (release producers) - bool mixed_task_completed; // True only when this callback completed a mixed task -}; - -/** - * Scheduler state structure - * - * Contains dynamic state updated during task execution. - * Separated from shared memory for cache efficiency. - * Hot-path methods are defined inline (implicitly inline as member functions). - */ -struct PTO2SchedulerState { - // Shared memory access - PTO2SharedMemoryHandle* sm_handle; - - // Per-ring state - struct RingSchedState { - PTO2TaskDescriptor* task_descriptors; - PTO2TaskSlotState* slot_states; - int32_t last_task_alive; - int32_t task_window_mask; - uint64_t task_window_size; - // Try-lock used to advance this ring's last_task_alive pointer. 
- std::atomic advance_lock; - - bool init(PTO2SharedMemoryHandle* sm_handle, int32_t ring_id); - void destroy(); - - PTO2TaskSlotState& get_slot_state_by_task_id(int32_t local_id) { - return slot_states[local_id & task_window_mask]; - } - PTO2TaskSlotState& get_slot_state_by_slot(int32_t slot) { return slot_states[slot]; } - - void sync_to_sm(PTO2SharedMemoryRingHeader& ring) { - ring.fc.last_task_alive.store(last_task_alive, std::memory_order_release); - } - - void advance_ring_pointers(PTO2SharedMemoryRingHeader& ring) { - int32_t current_task_index = ring.fc.current_task_index.load(std::memory_order_acquire); - - while (last_task_alive < current_task_index) { - PTO2TaskSlotState& slot_state = get_slot_state_by_task_id(last_task_alive); - if (slot_state.task_state.load(std::memory_order_acquire) != PTO2_TASK_CONSUMED) { - break; - } - last_task_alive++; - } - - sync_to_sm(ring); - } - } ring_sched_states[PTO2_MAX_RING_DEPTH]; - - // Ready queues remain global (scheduling is ring-agnostic) - PTO2ReadyQueue ready_queues[PTO2_NUM_RESOURCE_SHAPES]; - - // Statistics -#if PTO2_SCHED_PROFILING - std::atomic tasks_completed; - std::atomic tasks_consumed; -#endif - // ========================================================================= - // Inline hot-path methods - // ========================================================================= - PTO2TaskSlotState& get_slot_state(int32_t ring_id, int32_t local_id) { - return ring_sched_states[ring_id].get_slot_state_by_task_id(local_id); - } - PTO2TaskSlotState& get_slot_state_by_slot(int32_t ring_id, int32_t slot) { - return ring_sched_states[ring_id].get_slot_state_by_slot(slot); - } - - void check_and_handle_consumed(PTO2TaskSlotState& slot_state) { - if (slot_state.fanout_refcount.load(std::memory_order_acquire) != slot_state.fanout_count) return; - - PTO2TaskState expected = PTO2_TASK_COMPLETED; - if (!slot_state.task_state.compare_exchange_strong( - expected, PTO2_TASK_CONSUMED, std::memory_order_acq_rel, 
std::memory_order_acquire)) { - return; - } - -#if PTO2_SCHED_PROFILING - tasks_consumed.fetch_add(1, std::memory_order_relaxed); -#endif - - int32_t ring_id = slot_state.ring_id; - // Try-lock — if another thread is advancing this ring, it will scan our CONSUMED task - int32_t expected_lock = 0; - if (ring_sched_states[ring_id].advance_lock.compare_exchange_strong( - expected_lock, 1, std::memory_order_acquire, std::memory_order_relaxed)) { - ring_sched_states[ring_id].advance_ring_pointers(sm_handle->header->rings[ring_id]); - ring_sched_states[ring_id].advance_lock.store(0, std::memory_order_release); - } - } - -#if PTO2_ORCH_PROFILING || PTO2_SCHED_PROFILING - void check_and_handle_consumed(PTO2TaskSlotState& slot_state, uint64_t& atomic_count) { - int32_t fc = slot_state.fanout_count; - int32_t rc = slot_state.fanout_refcount.load(std::memory_order_acquire); - - atomic_count += 2; // fanout_count.load + fanout_refcount.load - - if (rc != fc) return; - - PTO2TaskState expected = PTO2_TASK_COMPLETED; - if (!slot_state.task_state.compare_exchange_strong( - expected, PTO2_TASK_CONSUMED, std::memory_order_acq_rel, std::memory_order_acquire)) { - atomic_count += 1; // failed CAS - return; - } - - atomic_count += 1; // successful CAS - -#if PTO2_SCHED_PROFILING - tasks_consumed.fetch_add(1, std::memory_order_relaxed); -#endif - - int32_t ring_id = slot_state.ring_id; - // Try-lock — if another thread is advancing this ring, it will scan our CONSUMED task - int32_t expected_lock = 0; - if (ring_sched_states[ring_id].advance_lock.compare_exchange_strong( - expected_lock, 1, std::memory_order_acquire, std::memory_order_relaxed)) { - ring_sched_states[ring_id].advance_ring_pointers(sm_handle->header->rings[ring_id]); - ring_sched_states[ring_id].advance_lock.store(0, std::memory_order_release); - atomic_count += 2; // try-lock CAS + unlock store - } else { - atomic_count += 1; // failed try-lock CAS - } - } -#endif - - void release_producer(PTO2TaskSlotState& slot_state) 
{ - slot_state.fanout_refcount.fetch_add(1, std::memory_order_acq_rel); - check_and_handle_consumed(slot_state); - } - -#if PTO2_ORCH_PROFILING || PTO2_SCHED_PROFILING - void release_producer(PTO2TaskSlotState& slot_state, uint64_t& atomic_count) { - slot_state.fanout_refcount.fetch_add(1, std::memory_order_acq_rel); - atomic_count += 1; // fanout_refcount.fetch_add - check_and_handle_consumed(slot_state, atomic_count); - } -#endif - - bool release_fanin_and_check_ready(PTO2TaskSlotState& slot_state, PTO2LocalReadyBuffer* local_bufs = nullptr) { - // Atomically increment fanin_refcount and check if all producers are done - // ACQ_REL on fanin_refcount already synchronizes with the orchestrator's - // init release, making fanin_count visible — plain load suffices. - int32_t new_refcount = slot_state.fanin_refcount.fetch_add(1, std::memory_order_acq_rel) + 1; - - if (new_refcount == slot_state.fanin_count) { - // Local-first: try per-CoreType thread-local buffer before global queue - // Route by active_mask: AIC-containing tasks → buf[0], AIV-only → buf[1] - PTO2ResourceShape shape = pto2_active_mask_to_shape(slot_state.active_mask); - if (!local_bufs || !local_bufs[static_cast(shape)].try_push(&slot_state)) { - ready_queues[static_cast(shape)].push(&slot_state); - } - return true; - } - return false; - } - -#if PTO2_ORCH_PROFILING || PTO2_SCHED_PROFILING - bool release_fanin_and_check_ready(PTO2TaskSlotState& slot_state, - uint64_t& atomic_count, - uint64_t& push_wait, - PTO2LocalReadyBuffer* local_bufs = nullptr) { - int32_t new_refcount = slot_state.fanin_refcount.fetch_add(1, std::memory_order_acq_rel) + 1; - atomic_count += 1; // fanin_refcount.fetch_add - - if (new_refcount == slot_state.fanin_count) { - PTO2TaskState expected = PTO2_TASK_PENDING; - if (slot_state.task_state.compare_exchange_strong( - expected, PTO2_TASK_READY, std::memory_order_acq_rel, std::memory_order_acquire)) { - atomic_count += 1; // CAS(task_state PENDING→READY) - // Local-first: try 
per-CoreType thread-local buffer before global queue - PTO2ResourceShape shape = pto2_active_mask_to_shape(slot_state.active_mask); - if (!local_bufs || !local_bufs[static_cast(shape)].try_push(&slot_state)) { - ready_queues[static_cast(shape)].push(&slot_state, atomic_count, push_wait); - } - return true; - } - } - return false; - } -#endif - - int get_ready_tasks_batch( - PTO2ResourceShape shape, PTO2LocalReadyBuffer& local_buf, PTO2TaskSlotState** out, int max_count) { - int count = 0; - while (count < max_count && local_buf.count > 0) { - out[count++] = local_buf.slot_states[--local_buf.count]; - } - int remaining = max_count - count; - if (remaining > 0) { - count += ready_queues[static_cast(shape)].pop_batch(out + count, remaining); - } - return count; - } - -#if PTO2_SCHED_PROFILING - int get_ready_tasks_batch(PTO2ResourceShape shape, - PTO2LocalReadyBuffer& local_buf, - PTO2TaskSlotState** out, - int max_count, - uint64_t& atomic_count, - uint64_t& wait_cycle, - uint64_t& local_dispatch_count) { - int count = 0; - while (count < max_count && local_buf.count > 0) { - local_dispatch_count++; - out[count++] = local_buf.slot_states[--local_buf.count]; - } - int remaining = max_count - count; - if (remaining > 0) { - count += - ready_queues[static_cast(shape)].pop_batch(out + count, remaining, atomic_count, wait_cycle); - } - return count; - } -#endif - - void on_scope_end(PTO2TaskSlotState** task_slot_states, int32_t count) { -#if PTO2_ORCH_PROFILING - extern uint64_t g_orch_scope_end_atomic_count; - if (count > 0) __builtin_prefetch(task_slot_states[0], 1, 0); - for (int32_t i = 0; i < count; i++) { - if (i + 1 < count) __builtin_prefetch(task_slot_states[i + 1], 1, 0); - release_producer(*task_slot_states[i], g_orch_scope_end_atomic_count); - } -#else - if (count > 0) __builtin_prefetch(task_slot_states[0], 1, 0); - for (int32_t i = 0; i < count; i++) { - if (i + 1 < count) __builtin_prefetch(task_slot_states[i + 1], 1, 0); - 
release_producer(*task_slot_states[i]); - } -#endif - } - - /** - * Subtask completion: atomic counter model. - * Called when a single subtask (AIC, AIV0, or AIV1) finishes on any block. - * Atomically increments completed_subtasks and checks whether all subtasks - * across all blocks are done. - * - * @return true if this was the last subtask, completing the entire task. - */ - bool on_subtask_complete(PTO2TaskSlotState& slot_state) { - int16_t prev = slot_state.completed_subtasks.fetch_add(1, std::memory_order_acq_rel); - return (prev + 1) == slot_state.total_required_subtasks; - } - - /** - * Two-stage completion: second stage. - * Called exactly once when all subtasks of a mixed task are done - * (i.e., on_subtask_complete returned true). - * Handles fanout notification, fanin release, and self-consumption check. - */ -#if PTO2_SCHED_PROFILING - PTO2CompletionStats -#else - void -#endif - on_mixed_task_complete(PTO2TaskSlotState& slot_state, -#if PTO2_SCHED_PROFILING - int thread_idx, -#endif - - PTO2LocalReadyBuffer* local_bufs = nullptr) { -#if PTO2_SCHED_PROFILING - PTO2CompletionStats stats = {0, 0, 0, true}; -#endif -#if PTO2_SCHED_PROFILING - extern uint64_t g_sched_lock_cycle[], g_sched_fanout_cycle[]; - extern uint64_t g_sched_lock_atomic_count[], g_sched_lock_wait_cycle[]; - extern uint64_t g_sched_fanout_atomic_count[], g_sched_push_wait_cycle[]; - uint64_t lock_atomics = 0, lock_wait = 0; - PTO2_SCHED_CYCLE_START(); -#endif - -#if PTO2_SCHED_PROFILING - pto2_fanout_lock(slot_state, lock_atomics, lock_wait); -#else - pto2_fanout_lock(slot_state); -#endif - slot_state.task_state.store(PTO2_TASK_COMPLETED, std::memory_order_release); - PTO2DepListEntry* current = slot_state.fanout_head; // Protected by fanout_lock - pto2_fanout_unlock(slot_state); - -#if PTO2_SCHED_PROFILING - lock_atomics += 2; // state.store + unlock.store - g_sched_lock_atomic_count[thread_idx] += lock_atomics; - g_sched_lock_wait_cycle[thread_idx] += lock_wait; - 
PTO2_SCHED_CYCLE_LAP(g_sched_lock_cycle[thread_idx]); -#endif - - // Fanout: notify consumers -#if PTO2_SCHED_PROFILING - uint64_t fanout_atomics = 0, push_wait = 0; -#endif - while (current != nullptr) { - PTO2TaskSlotState& consumer_slot = *current->slot_state; -#if PTO2_SCHED_PROFILING - stats.fanout_edges++; - if (release_fanin_and_check_ready(consumer_slot, fanout_atomics, push_wait, local_bufs)) { - stats.tasks_enqueued++; - } -#else - release_fanin_and_check_ready(consumer_slot, local_bufs); -#endif - current = current->next; - } - -#if PTO2_SCHED_PROFILING - g_sched_fanout_atomic_count[thread_idx] += fanout_atomics; - g_sched_push_wait_cycle[thread_idx] += push_wait; - PTO2_SCHED_CYCLE_LAP(g_sched_fanout_cycle[thread_idx]); - return stats; -#endif - } - - /** - * Cold path: release producers (fanin traversal) + check self for CONSUMED. - * Returns fanin edge count for profiling. - */ - -#if PTO2_SCHED_PROFILING - int32_t on_task_release(PTO2TaskSlotState& slot_state, int32_t thread_idx) { - PTO2_SCHED_CYCLE_START(); - extern uint64_t g_sched_fanin_cycle[], g_sched_fanin_atomic_count[]; - extern uint64_t g_sched_self_atomic_count[]; - extern uint64_t g_sched_self_consumed_cycle[]; - extern uint64_t g_sched_complete_count[]; - uint64_t fanin_atomics = 0; -#else - int32_t on_task_release(PTO2TaskSlotState& slot_state) { -#endif - PTO2TaskPayload* payload = slot_state.payload; - int32_t fanin_edges = payload->fanin_actual_count; - for (int32_t i = 0; i < fanin_edges; i++) { -#if PTO2_SCHED_PROFILING - release_producer(*payload->fanin_slot_states[i], fanin_atomics); -#else - release_producer(*payload->fanin_slot_states[i]); -#endif - } -#if PTO2_SCHED_PROFILING - g_sched_fanin_atomic_count[thread_idx] += fanin_atomics; - PTO2_SCHED_CYCLE_LAP(g_sched_fanin_cycle[thread_idx]); -#endif - - // Self consumed check -#if PTO2_SCHED_PROFILING - uint64_t self_atomics = 0; - check_and_handle_consumed(slot_state, self_atomics); - g_sched_self_atomic_count[thread_idx] += 
self_atomics; - PTO2_SCHED_CYCLE_LAP(g_sched_self_consumed_cycle[thread_idx]); - g_sched_complete_count[thread_idx]++; -#else - check_and_handle_consumed(slot_state); -#endif - return fanin_edges; - } -}; // NOLINT(readability/braces) - -// ============================================================================= -// Scheduler API (cold path, defined in pto_scheduler.cpp) -// ============================================================================= - -bool pto2_scheduler_init(PTO2SchedulerState* sched, PTO2SharedMemoryHandle* sm_handle); -void pto2_scheduler_destroy(PTO2SchedulerState* sched); - -// ============================================================================= -// Debug Utilities (cold path, defined in pto_scheduler.cpp) -// ============================================================================= - -void pto2_scheduler_print_stats(PTO2SchedulerState* sched); -void pto2_scheduler_print_queues(PTO2SchedulerState* sched); -const char* pto2_task_state_name(PTO2TaskState state); - -// ============================================================================= -// Scheduler Profiling Data -// ============================================================================= - -#if PTO2_SCHED_PROFILING -struct PTO2SchedProfilingData { - // Sub-phase cycle breakdown within on_mixed_task_complete - uint64_t lock_cycle; // pto2_fanout_lock + state store + unlock - uint64_t fanout_cycle; // fanout traversal - uint64_t fanin_cycle; // fanin traversal - uint64_t self_consumed_cycle; // self check_and_handle_consumed - - // Wait times - uint64_t lock_wait_cycle; // spin-wait in fanout_lock - uint64_t push_wait_cycle; // CAS contention in push() - uint64_t pop_wait_cycle; // CAS contention in pop() - - // Atomic counts per sub-phase - uint64_t lock_atomic_count; - uint64_t fanout_atomic_count; - uint64_t fanin_atomic_count; - uint64_t self_atomic_count; - uint64_t pop_atomic_count; - - int64_t complete_count; -}; - -/** - * Get and reset scheduler 
profiling data for a specific thread. - * Returns accumulated profiling data and resets counters. - */ -PTO2SchedProfilingData pto2_scheduler_get_profiling(int thread_idx); -#endif diff --git a/src/a2a3/runtime/tensormap_and_ringbuffer_unmodified/runtime/pto_shared_memory.cpp b/src/a2a3/runtime/tensormap_and_ringbuffer_unmodified/runtime/pto_shared_memory.cpp deleted file mode 100644 index 633caa048..000000000 --- a/src/a2a3/runtime/tensormap_and_ringbuffer_unmodified/runtime/pto_shared_memory.cpp +++ /dev/null @@ -1,273 +0,0 @@ -/** - * PTO Runtime2 - Shared Memory Implementation - * - * Implements shared memory allocation, initialization, and management - * for Orchestrator-Scheduler communication. - * - * Based on: docs/RUNTIME_LOGIC.md - */ - -#include "pto_shared_memory.h" -#include -#include -#include -#include "common/unified_log.h" - -// ============================================================================= -// Size Calculation -// ============================================================================= - -uint64_t pto2_sm_calculate_size(uint64_t task_window_size) { - uint64_t task_window_sizes[PTO2_MAX_RING_DEPTH]; - for (int r = 0; r < PTO2_MAX_RING_DEPTH; r++) { - task_window_sizes[r] = task_window_size; - } - return pto2_sm_calculate_size_per_ring(task_window_sizes); -} - -uint64_t pto2_sm_calculate_size_per_ring(const uint64_t task_window_sizes[PTO2_MAX_RING_DEPTH]) { - uint64_t size = 0; - - // Header (aligned to cache line) - size += PTO2_ALIGN_UP(sizeof(PTO2SharedMemoryHeader), PTO2_ALIGN_SIZE); - - // Per-ring task descriptors and payloads - for (int r = 0; r < PTO2_MAX_RING_DEPTH; r++) { - size += PTO2_ALIGN_UP(task_window_sizes[r] * sizeof(PTO2TaskDescriptor), PTO2_ALIGN_SIZE); - size += PTO2_ALIGN_UP(task_window_sizes[r] * sizeof(PTO2TaskPayload), PTO2_ALIGN_SIZE); - } - - return size; -} - -// ============================================================================= -// Creation and Destruction -// 
============================================================================= - -static void pto2_sm_setup_pointers_per_ring( - PTO2SharedMemoryHandle* handle, - const uint64_t task_window_sizes[PTO2_MAX_RING_DEPTH]) { - char* ptr = (char*)handle->sm_base; - - // Header - handle->header = (PTO2SharedMemoryHeader*)ptr; - ptr += PTO2_ALIGN_UP(sizeof(PTO2SharedMemoryHeader), PTO2_ALIGN_SIZE); - - // Per-ring task descriptors and payloads - for (int r = 0; r < PTO2_MAX_RING_DEPTH; r++) { - handle->task_descriptors[r] = (PTO2TaskDescriptor*)ptr; - ptr += PTO2_ALIGN_UP(task_window_sizes[r] * sizeof(PTO2TaskDescriptor), PTO2_ALIGN_SIZE); - - handle->task_payloads[r] = (PTO2TaskPayload*)ptr; - ptr += PTO2_ALIGN_UP(task_window_sizes[r] * sizeof(PTO2TaskPayload), PTO2_ALIGN_SIZE); - } -} - -static void pto2_sm_setup_pointers(PTO2SharedMemoryHandle* handle, uint64_t task_window_size) { - uint64_t task_window_sizes[PTO2_MAX_RING_DEPTH]; - for (int r = 0; r < PTO2_MAX_RING_DEPTH; r++) { - task_window_sizes[r] = task_window_size; - } - pto2_sm_setup_pointers_per_ring(handle, task_window_sizes); -} - -PTO2SharedMemoryHandle* pto2_sm_create(uint64_t task_window_size, - uint64_t heap_size) { - // Allocate handle - PTO2SharedMemoryHandle* handle = (PTO2SharedMemoryHandle*)calloc(1, sizeof(PTO2SharedMemoryHandle)); - if (!handle) { - return NULL; - } - - // Calculate total size - uint64_t sm_size = pto2_sm_calculate_size(task_window_size); - - // Allocate shared memory (aligned for DMA efficiency) - #if defined(_POSIX_C_SOURCE) && _POSIX_C_SOURCE >= 200112L - if (posix_memalign(&handle->sm_base, PTO2_ALIGN_SIZE, static_cast(sm_size)) != 0) { - free(handle); - return NULL; - } - #else - handle->sm_base = aligned_alloc(PTO2_ALIGN_SIZE, static_cast(sm_size)); - if (!handle->sm_base) { - free(handle); - return NULL; - } - #endif - - handle->sm_size = sm_size; - handle->is_owner = true; - - // Initialize to zero - memset(handle->sm_base, 0, static_cast(sm_size)); - - // Set up pointers - 
pto2_sm_setup_pointers(handle, task_window_size); - - // Initialize header - pto2_sm_init_header(handle, task_window_size, heap_size); - - return handle; -} - -PTO2SharedMemoryHandle* pto2_sm_create_default(void) { - return pto2_sm_create(PTO2_TASK_WINDOW_SIZE, - PTO2_HEAP_SIZE); -} - -PTO2SharedMemoryHandle* pto2_sm_create_from_buffer(void* sm_base, - uint64_t sm_size, - uint64_t task_window_size, - uint64_t heap_size) { - if (!sm_base || sm_size == 0) return NULL; - - uint64_t required = pto2_sm_calculate_size(task_window_size); - if (sm_size < required) return NULL; - - PTO2SharedMemoryHandle* handle = (PTO2SharedMemoryHandle*)calloc(1, sizeof(PTO2SharedMemoryHandle)); - if (!handle) return NULL; - - handle->sm_base = sm_base; - handle->sm_size = sm_size; - handle->is_owner = false; - - pto2_sm_setup_pointers(handle, task_window_size); - pto2_sm_init_header(handle, task_window_size, heap_size); - - return handle; -} - -void pto2_sm_destroy(PTO2SharedMemoryHandle* handle) { - if (!handle) return; - - if (handle->is_owner && handle->sm_base) { - free(handle->sm_base); - } - - free(handle); -} - -// ============================================================================= -// Initialization -// ============================================================================= -// -// no need init data in pool, init pool data when used -void pto2_sm_init_header(PTO2SharedMemoryHandle* handle, - uint64_t task_window_size, - uint64_t heap_size) { - uint64_t task_window_sizes[PTO2_MAX_RING_DEPTH]; - uint64_t heap_sizes[PTO2_MAX_RING_DEPTH]; - for (int r = 0; r < PTO2_MAX_RING_DEPTH; r++) { - task_window_sizes[r] = task_window_size; - heap_sizes[r] = heap_size; - } - pto2_sm_init_header_per_ring(handle, task_window_sizes, heap_sizes); -} - -void pto2_sm_init_header_per_ring( - PTO2SharedMemoryHandle* handle, - const uint64_t task_window_sizes[PTO2_MAX_RING_DEPTH], - const uint64_t heap_sizes[PTO2_MAX_RING_DEPTH]) { - PTO2SharedMemoryHeader* header = handle->header; - - 
// Per-ring flow control (start at 0) - for (int r = 0; r < PTO2_MAX_RING_DEPTH; r++) { - header->rings[r].fc.init(); - } - - header->orchestrator_done.store(0, std::memory_order_relaxed); - - // Per-ring layout info - uint64_t offset = PTO2_ALIGN_UP(sizeof(PTO2SharedMemoryHeader), PTO2_ALIGN_SIZE); - for (int r = 0; r < PTO2_MAX_RING_DEPTH; r++) { - header->rings[r].task_window_size = task_window_sizes[r]; - header->rings[r].heap_size = heap_sizes[r]; - header->rings[r].task_descriptors_offset = offset; - offset += PTO2_ALIGN_UP(task_window_sizes[r] * sizeof(PTO2TaskDescriptor), PTO2_ALIGN_SIZE); - offset += PTO2_ALIGN_UP(task_window_sizes[r] * sizeof(PTO2TaskPayload), PTO2_ALIGN_SIZE); - } - - header->total_size = handle->sm_size; - header->graph_output_ptr.store(0, std::memory_order_relaxed); - header->graph_output_size.store(0, std::memory_order_relaxed); - - // Error reporting - header->orch_error_code.store(PTO2_ERROR_NONE, std::memory_order_relaxed); - header->sched_error_bitmap.store(0, std::memory_order_relaxed); - header->sched_error_code.store(PTO2_ERROR_NONE, std::memory_order_relaxed); - header->sched_error_thread.store(-1, std::memory_order_relaxed); -} - -// ============================================================================= -// Debug Utilities -// ============================================================================= - -void pto2_sm_print_layout(PTO2SharedMemoryHandle* handle) { - if (!handle || !handle->header) return; - - PTO2SharedMemoryHeader* h = handle->header; - - LOG_INFO("=== PTO2 Shared Memory Layout ==="); - LOG_INFO("Base address: %p", handle->sm_base); - LOG_INFO("Total size: %" PRIu64 " bytes", h->total_size); - LOG_INFO("Ring depth: %d", PTO2_MAX_RING_DEPTH); - for (int r = 0; r < PTO2_MAX_RING_DEPTH; r++) { - LOG_INFO("Ring %d:", r); - LOG_INFO(" task_window_size: %" PRIu64, h->rings[r].task_window_size); - LOG_INFO(" heap_size: %" PRIu64 " bytes", h->rings[r].heap_size); - LOG_INFO(" descriptors_off: %" PRIu64 " 
(0x%" PRIx64 ")", - h->rings[r].task_descriptors_offset, h->rings[r].task_descriptors_offset); - LOG_INFO(" heap_top: %" PRIu64, h->rings[r].fc.heap_top.load(std::memory_order_acquire)); - LOG_INFO(" heap_tail: %" PRIu64, h->rings[r].fc.heap_tail.load(std::memory_order_acquire)); - LOG_INFO(" current_task_idx: %d", h->rings[r].fc.current_task_index.load(std::memory_order_acquire)); - LOG_INFO(" last_task_alive: %d", h->rings[r].fc.last_task_alive.load(std::memory_order_acquire)); - } - LOG_INFO("orchestrator_done: %d", h->orchestrator_done.load(std::memory_order_acquire)); - LOG_INFO("Error state:"); - LOG_INFO(" orch_error_code: %d", h->orch_error_code.load(std::memory_order_relaxed)); - LOG_INFO(" sched_error_bitmap: 0x%x", h->sched_error_bitmap.load(std::memory_order_relaxed)); - LOG_INFO(" sched_error_code: %d", h->sched_error_code.load(std::memory_order_relaxed)); - LOG_INFO(" sched_error_thread: %d", h->sched_error_thread.load(std::memory_order_relaxed)); - LOG_INFO("================================"); -} - -bool pto2_sm_validate(PTO2SharedMemoryHandle* handle) { - if (!handle) return false; - if (!handle->sm_base) return false; - if (!handle->header) return false; - - PTO2SharedMemoryHeader* h = handle->header; - - for (int r = 0; r < PTO2_MAX_RING_DEPTH; r++) { - if (!h->rings[r].fc.validate(handle, r)) return false; - } - - return true; -} - -bool PTO2RingFlowControl::validate(PTO2SharedMemoryHandle* handle, int32_t ring_id) const { - if (!handle) return false; - if (!handle->header) return false; - if (ring_id < 0 || ring_id >= PTO2_MAX_RING_DEPTH) return false; - - const PTO2SharedMemoryHeader* h = handle->header; - - // Check that offsets are within bounds - if (h->rings[ring_id].task_descriptors_offset >= h->total_size) return false; - - // Check pointer alignment - if ((uintptr_t)handle->task_descriptors[ring_id] % PTO2_ALIGN_SIZE != 0) return false; - - // Check flow control pointer sanity - int32_t current = 
current_task_index.load(std::memory_order_acquire); - int32_t last_alive = last_task_alive.load(std::memory_order_acquire); - uint64_t top = heap_top.load(std::memory_order_acquire); - uint64_t tail = heap_tail.load(std::memory_order_acquire); - if (current < 0) return false; - if (last_alive < 0) return false; - if (top > h->rings[ring_id].heap_size) return false; - if (tail > h->rings[ring_id].heap_size) return false; - - return true; -} diff --git a/src/a2a3/runtime/tensormap_and_ringbuffer_unmodified/runtime/pto_shared_memory.h b/src/a2a3/runtime/tensormap_and_ringbuffer_unmodified/runtime/pto_shared_memory.h deleted file mode 100644 index b0200da4d..000000000 --- a/src/a2a3/runtime/tensormap_and_ringbuffer_unmodified/runtime/pto_shared_memory.h +++ /dev/null @@ -1,227 +0,0 @@ -/** - * PTO Runtime2 - Shared Memory Layout - * - * Defines the shared memory structure for Orchestrator-Scheduler communication. - * - * Memory Layout (per-ring sections repeat for each ring 0..PTO2_MAX_RING_DEPTH-1): - * +---------------------------+ - * | SharedMemoryHeader | (per-ring flow control + sync) - * +---------------------------+ - * | Ring 0: TaskDescriptor[] | - * | Ring 0: TaskPayload[] | - * +---------------------------+ - * | Ring 1: TaskDescriptor[] | - * | Ring 1: TaskPayload[] | - * +---------------------------+ - * | ... 
| - * +---------------------------+ - * - * Design principles: - * - Only data needed for Orchestrator<->Scheduler communication is here - * - TensorMap, scope_stack, ready_queues, dep_pool are in private memory - * - Flow control via atomic counters/flags (no locks needed for single-word R/W) - * - * Based on: docs/RUNTIME_LOGIC.md - */ - -#ifndef PTO_SHARED_MEMORY_H -#define PTO_SHARED_MEMORY_H - -#include "pto_runtime2_types.h" - -#ifdef __cplusplus -extern "C" { -#endif - -// ============================================================================= -// Shared Memory Header -// ============================================================================= - -struct PTO2SharedMemoryHandle; - -/** - * Per-ring flow control state in shared memory. - * Written/read by Orchestrator and Scheduler for synchronization. - */ -struct PTO2RingFlowControl { - // Written by Orchestrator, Read by Scheduler - std::atomic<uint64_t> heap_top; // Heap ring allocation pointer - std::atomic<int32_t> current_task_index; // Task ring head (next to allocate) - int32_t _pad0; // Alignment padding - - // Written by Scheduler, Read by Orchestrator (for back-pressure) - std::atomic<uint64_t> heap_tail; // Heap ring free pointer - std::atomic<int32_t> last_task_alive; // Task ring tail (oldest active task) - int32_t _pad1; // Alignment padding - - void init() { - heap_top.store(0, std::memory_order_relaxed); - current_task_index.store(0, std::memory_order_relaxed); - heap_tail.store(0, std::memory_order_relaxed); - last_task_alive.store(0, std::memory_order_relaxed); - } - - bool validate(PTO2SharedMemoryHandle* handle, int32_t ring_id) const; -}; - -/** - * Per-ring shared memory header section. - * - * Groups flow-control and layout info for a single ring to avoid parallel arrays. 
- */ -struct PTO2SharedMemoryRingHeader { - PTO2RingFlowControl fc; - uint64_t task_window_size; - uint64_t heap_size; - uint64_t task_descriptors_offset; // Offset from SM base, in bytes -}; - -/** - * Shared memory header structure - * - * Contains per-ring flow control and global layout information. - */ -struct alignas(PTO2_ALIGN_SIZE) PTO2SharedMemoryHeader { - // === PER-RING FLOW CONTROL + LAYOUT INFO (set once at init) === - PTO2SharedMemoryRingHeader rings[PTO2_MAX_RING_DEPTH]; - - // === GLOBAL FIELDS === - std::atomic<int32_t> orchestrator_done; // Flag: orchestration complete - - // Total shared memory size (for validation) - uint64_t total_size; - - // Graph output for copy-back (set by orchestrator when using packed buffer) - // Host finalize copies from this address instead of dev_ptr when non-zero - std::atomic<uint64_t> graph_output_ptr; // Address where final output was written (packed buffer) - std::atomic<uint64_t> graph_output_size; // Size in bytes - - // === ERROR REPORTING === - - // Orchestrator fatal error code (Orchestrator → Scheduler, AICPU → Host) - // Non-zero signals fatal error. Written by orchestrator, read by scheduler and host. - std::atomic<int32_t> orch_error_code; - - // Scheduler error state (Scheduler → Host, independent of orchestrator) - // Written by scheduler threads on timeout; read by orchestrator and host. 
- std::atomic<uint32_t> sched_error_bitmap; // Bit X set = thread X had error - std::atomic<int32_t> sched_error_code; // Last scheduler error code (last-writer-wins) - std::atomic<int32_t> sched_error_thread; // Thread index of last error writer -}; - -static_assert(sizeof(PTO2SharedMemoryHeader) % PTO2_ALIGN_SIZE == 0, - "PTO2SharedMemoryHeader must be aligned to cache line (PTO2_ALIGN_SIZE)"); - -// ============================================================================= -// Shared Memory Handle -// ============================================================================= - -/** - * Handle for shared memory access - * Provides both Orchestrator and Scheduler views of the same memory - */ -struct PTO2SharedMemoryHandle { - void* sm_base; // Base address of shared memory - uint64_t sm_size; // Total size of shared memory - - // Quick pointers into shared memory regions (per-ring) - PTO2SharedMemoryHeader* header; - PTO2TaskDescriptor* task_descriptors[PTO2_MAX_RING_DEPTH]; - PTO2TaskPayload* task_payloads[PTO2_MAX_RING_DEPTH]; - - // Ownership flag - bool is_owner; // True if this handle allocated the memory - -}; - -// ============================================================================= -// Shared Memory API -// ============================================================================= - -/** - * Calculate required shared memory size - * - * @param task_window_size Number of task slots per ring - * @return Total bytes required - */ -uint64_t pto2_sm_calculate_size(uint64_t task_window_size); - -/** - * Calculate required shared memory size for per-ring task windows. 
- * - * @param task_window_sizes Array of window sizes per ring - * @return Total bytes required - */ -uint64_t pto2_sm_calculate_size_per_ring(const uint64_t task_window_sizes[PTO2_MAX_RING_DEPTH]); - -/** - * Create shared memory for Orchestrator and Scheduler - * - * @param task_window_size Number of task slots per ring - * @param heap_size Heap size per ring for output buffers - * @return Handle with both views, or NULL on failure - */ -PTO2SharedMemoryHandle* pto2_sm_create(uint64_t task_window_size, - uint64_t heap_size); - -/** - * Create shared memory with default sizes - */ -PTO2SharedMemoryHandle* pto2_sm_create_default(void); - -/** - * Wrap an existing buffer as shared memory (e.g. device GM buffer). - * Caller owns the buffer; handle will not free sm_base. - * - * @param sm_base Base address of pre-allocated buffer - * @param sm_size Total size in bytes - * @param task_window_size Number of task slots per ring (must match buffer layout) - * @param heap_size Heap size per ring (for layout; buffer has no heap region) - * @return Handle, or NULL on failure - */ -PTO2SharedMemoryHandle* pto2_sm_create_from_buffer(void* sm_base, - uint64_t sm_size, - uint64_t task_window_size, - uint64_t heap_size); - -/** - * Destroy shared memory and free resources - */ -void pto2_sm_destroy(PTO2SharedMemoryHandle* handle); - -/** - * Initialize shared memory header with layout information - * Called after memory is allocated - */ -void pto2_sm_init_header(PTO2SharedMemoryHandle* handle, - uint64_t task_window_size, - uint64_t heap_size); - -/** - * Initialize shared memory header with per-ring layout information. 
- */ -void pto2_sm_init_header_per_ring( - PTO2SharedMemoryHandle* handle, - const uint64_t task_window_sizes[PTO2_MAX_RING_DEPTH], - const uint64_t heap_sizes[PTO2_MAX_RING_DEPTH]); - -// ============================================================================= -// Debug Utilities -// ============================================================================= - -/** - * Print shared memory layout info - */ -void pto2_sm_print_layout(PTO2SharedMemoryHandle* handle); - -/** - * Validate shared memory integrity - * @return true if valid, false if corrupted - */ -bool pto2_sm_validate(PTO2SharedMemoryHandle* handle); - -#ifdef __cplusplus -} -#endif - -#endif // PTO_SHARED_MEMORY_H diff --git a/src/a2a3/runtime/tensormap_and_ringbuffer_unmodified/runtime/pto_submit_types.h b/src/a2a3/runtime/tensormap_and_ringbuffer_unmodified/runtime/pto_submit_types.h deleted file mode 100644 index a0df3c4a6..000000000 --- a/src/a2a3/runtime/tensormap_and_ringbuffer_unmodified/runtime/pto_submit_types.h +++ /dev/null @@ -1,119 +0,0 @@ -/* - * Copyright (c) PyPTO Contributors. - * This program is free software, you can redistribute it and/or modify it under the terms and conditions of - * CANN Open Software License Agreement Version 2.0 (the "License"). - * Please refer to the License for details. You may not use this file except in compliance with the License. - * THIS SOFTWARE IS PROVIDED ON AN "AS IS" BASIS, WITHOUT WARRANTIES OF ANY KIND, EITHER EXPRESS OR IMPLIED, - * INCLUDING BUT NOT LIMITED TO NON-INFRINGEMENT, MERCHANTABILITY, OR FITNESS FOR A PARTICULAR PURPOSE. - * See LICENSE in the root of the software repository for the full text of the License. - * ----------------------------------------------------------------------------------------------------------- - */ - -/** - * PTO Submit Types - Shared submit-contract definitions - * - * Header-only definitions shared by orchestration-facing and runtime-facing - * headers. 
Keeps orchestration slim (no dependency on pto_runtime2_types.h). - */ - -#pragma once - -#include <cstdint> - -inline constexpr int32_t INVALID_KERNEL_ID = -1; - -/** - * Subtask slot count: AIC, AIV0, AIV1 - */ -inline constexpr int32_t PTO2_SUBTASK_SLOT_COUNT = 3; - -/** - * Subtask slot indices - */ -enum class PTO2SubtaskSlot : uint8_t { - AIC = 0, - AIV0 = 1, - AIV1 = 2, -}; - -/** - * Subtask mask bits (for active_mask / subtask_done_mask) - */ -inline constexpr uint8_t PTO2_SUBTASK_MASK_AIC = (1u << 0); // 0x1 -inline constexpr uint8_t PTO2_SUBTASK_MASK_AIV0 = (1u << 1); // 0x2 -inline constexpr uint8_t PTO2_SUBTASK_MASK_AIV1 = (1u << 2); // 0x4 - -/** - * Test whether a subtask slot is active in a given mask - */ -static inline bool pto2_subtask_active(uint8_t mask, PTO2SubtaskSlot slot) { - return (mask & (1u << static_cast<uint8_t>(slot))) != 0; -} - -/** - * Mixed-task submit contract. - * - * Each field holds either a valid kernel ID or INVALID_KERNEL_ID (inactive). - * At least one slot must be valid. - */ -struct MixedKernels { - int32_t aic_kernel_id{INVALID_KERNEL_ID}; - int32_t aiv0_kernel_id{INVALID_KERNEL_ID}; - int32_t aiv1_kernel_id{INVALID_KERNEL_ID}; -}; - -/** - * Resource shape — classifies a MixedKernels into one of 3 scheduling buckets. - * - * Multi-subtask tasks (2+ active slots) are all scheduled as MIX, which - * requires a fully-idle cluster (1 AIC + 2 AIV). The actual cores used - * are determined at dispatch time by active_mask — unused cores in the - * cluster remain idle and available for single-core tasks. - */ -enum class PTO2ResourceShape : uint8_t { - AIC = 0, // Single AIC - AIV = 1, // Single AIV - MIX = 2, // Full cluster (dispatch uses active_mask) -}; - -inline constexpr int32_t PTO2_NUM_RESOURCE_SHAPES = 3; - -/** - * Derive resource shape from active_mask. - * Caller must ensure active_mask is valid (at least one bit set). 
- */ -static inline PTO2ResourceShape pto2_active_mask_to_shape(uint8_t active_mask) { - int bit_count = __builtin_popcount(active_mask); - if (bit_count >= 2) return PTO2ResourceShape::MIX; - if (active_mask & PTO2_SUBTASK_MASK_AIC) return PTO2ResourceShape::AIC; - return PTO2ResourceShape::AIV; -} - -/** - * Compute active_mask from MixedKernels. - */ -static inline uint8_t pto2_mixed_kernels_to_active_mask(const MixedKernels& mk) { - uint8_t mask = 0; - if (mk.aic_kernel_id != INVALID_KERNEL_ID) mask |= PTO2_SUBTASK_MASK_AIC; - if (mk.aiv0_kernel_id != INVALID_KERNEL_ID) mask |= PTO2_SUBTASK_MASK_AIV0; - if (mk.aiv1_kernel_id != INVALID_KERNEL_ID) mask |= PTO2_SUBTASK_MASK_AIV1; - return mask; -} - -/** - * SPMD launch parameters carried inside Arg. - * - * Controls how many logical blocks (SPMD dimension) a single task - * is expanded into at dispatch time. Each block receives a unique - * block_idx in [0, block_num) via the per-dispatch LocalContext. - */ -class PTO2LaunchSpec { - public: - constexpr PTO2LaunchSpec() = default; - - int16_t block_num() const { return block_num_; } - void set_block_num(int16_t n) { block_num_ = n; } - - private: - int16_t block_num_{1}; -}; diff --git a/src/a2a3/runtime/tensormap_and_ringbuffer_unmodified/runtime/pto_task_id.h b/src/a2a3/runtime/tensormap_and_ringbuffer_unmodified/runtime/pto_task_id.h deleted file mode 100644 index 595372f90..000000000 --- a/src/a2a3/runtime/tensormap_and_ringbuffer_unmodified/runtime/pto_task_id.h +++ /dev/null @@ -1,50 +0,0 @@ -/* - * Copyright (c) PyPTO Contributors. - * This program is free software, you can redistribute it and/or modify it under the terms and conditions of - * CANN Open Software License Agreement Version 2.0 (the "License"). - * Please refer to the License for details. You may not use this file except in compliance with the License. 
- * THIS SOFTWARE IS PROVIDED ON AN "AS IS" BASIS, WITHOUT WARRANTIES OF ANY KIND, EITHER EXPRESS OR IMPLIED, - * INCLUDING BUT NOT LIMITED TO NON-INFRINGEMENT, MERCHANTABILITY, OR FITNESS FOR A PARTICULAR PURPOSE. - * See LICENSE in the root of the software repository for the full text of the License. - * ----------------------------------------------------------------------------------------------------------- - */ - -/** - * PTO2TaskId — minimal standalone header. - * - * Factored out of pto_runtime2_types.h so that tensor.h can include it - * without pulling in scheduler-internal constants (heap sizes, timeouts, etc.). - */ - -#pragma once - -#include <cstdint> - -/** - * TaskId: 64-bit encoding used across Runtime2. - * - * raw encoding: (ring_id << 32) | local_id - * - * ring_id: which ring layer (0..PTO2_MAX_RING_DEPTH-1) - * local_id: per-ring monotonic counter - * - * Invalid sentinel: raw == UINT64_MAX (no valid task has this encoding). - */ -struct PTO2TaskId { - uint64_t raw; - - static constexpr PTO2TaskId make(uint8_t ring_id, uint32_t local_id) { - return PTO2TaskId{(static_cast<uint64_t>(ring_id) << 32) | static_cast<uint64_t>(local_id)}; - } - - static constexpr PTO2TaskId invalid() { return PTO2TaskId{UINT64_MAX}; } - - constexpr uint8_t ring() const { return static_cast<uint8_t>(raw >> 32); } - constexpr uint32_t local() const { return static_cast<uint32_t>(raw & 0xFFFFFFFFu); } - constexpr bool is_valid() const { return raw != UINT64_MAX; } - - constexpr bool operator==(const PTO2TaskId& other) const { return raw == other.raw; } - constexpr bool operator!=(const PTO2TaskId& other) const { return raw != other.raw; } -}; - -static_assert(sizeof(PTO2TaskId) == 8, "PTO2TaskId must stay 8 bytes (shared memory ABI)"); diff --git a/src/a2a3/runtime/tensormap_and_ringbuffer_unmodified/runtime/pto_tensormap.cpp b/src/a2a3/runtime/tensormap_and_ringbuffer_unmodified/runtime/pto_tensormap.cpp deleted file mode 100644 index 794636e3a..000000000 --- 
a/src/a2a3/runtime/tensormap_and_ringbuffer_unmodified/runtime/pto_tensormap.cpp +++ /dev/null @@ -1,256 +0,0 @@ -/** - * PTO Runtime2 - TensorMap Implementation - * - * Implements TensorMap with ring buffer pool, lazy invalidation, - * and chain truncation optimization. - * - * Key features: - * 1. O(1) insert at bucket head - * 2. O(valid_entries) lookup with chain truncation - * 3. Automatic stale entry cleanup during lookup - * 4. Periodic explicit cleanup for long chains - * - * Based on: docs/RUNTIME_LOGIC.md - */ - -#include "pto_tensormap.h" - -#include -#include - -#include "common.h" -#include "common/unified_log.h" -#include "pto_orchestrator.h" - -// ============================================================================= -// TensorMap Lookup Chain Length Statistics (compile-time toggle) -// ============================================================================= -#if PTO2_TENSORMAP_PROFILING -uint64_t g_lookup_chain_total = 0; -uint64_t g_lookup_count = 0; -int32_t g_lookup_chain_max = 0; -uint64_t g_lookup_overlap_checks = 0; -uint64_t g_lookup_overlap_hits = 0; -uint64_t g_insert_count = 0; -#endif - -// ============================================================================= -// Initialization and Destruction -// ============================================================================= - -bool PTO2TensorMap::init(int32_t new_num_buckets, int32_t new_pool_size, const int32_t new_task_window_sizes[PTO2_MAX_RING_DEPTH]) { - // Validate power of 2 for fast modulo - if ((new_num_buckets & (new_num_buckets - 1)) != 0) { - return false; // num_buckets must be power of 2 - } - - // Allocate buckets - buckets = (PTO2TensorMapEntry**)malloc(new_num_buckets * sizeof(PTO2TensorMapEntry*)); - if (!buckets) { - return false; - } - - // Initialize all buckets to empty (-1) - for (int32_t i = 0; i < new_num_buckets; i++) { - buckets[i] = nullptr; - } - - num_buckets = new_num_buckets; - - // Allocate entry pool (64-byte aligned for 
cache-line-aligned entries) - entry_pool = (PTO2TensorMapEntry*)aligned_alloc(alignof(PTO2TensorMapEntry), new_pool_size * sizeof(PTO2TensorMapEntry)); - if (!entry_pool) { - free(buckets); - buckets = NULL; - return false; - } - memset(entry_pool, 0, new_pool_size * sizeof(PTO2TensorMapEntry)); - - // Allocate free entry list - free_entry_list = (PTO2TensorMapEntry**)calloc(new_pool_size, sizeof(PTO2TensorMapEntry*)); - if (!free_entry_list) { - free(buckets); - free(entry_pool); - buckets = NULL; - entry_pool = NULL; - return false; - } - - pool_size = new_pool_size; - next_entry_idx = 0; - free_num = 0; - - // Initialize all entries as not in bucket - for (int32_t i = 0; i < pool_size; i++) { - entry_pool[i].bucket_index = -1; - entry_pool[i].next_in_bucket = nullptr; - entry_pool[i].prev_in_bucket = nullptr; - entry_pool[i].next_in_task = nullptr; - entry_pool[i].prev_in_task = nullptr; - entry_pool[i].producer_task_id = PTO2TaskId{}; - } - - // Allocate per-ring per-task entry tracking (each ring has its own window size) - for (int r = 0; r < PTO2_MAX_RING_DEPTH; r++) { - task_entry_heads[r] = (PTO2TensorMapEntry**)malloc(new_task_window_sizes[r] * sizeof(PTO2TensorMapEntry*)); - if (!task_entry_heads[r]) { - // Cleanup previously allocated rings - for (int j = 0; j < r; j++) { - free(task_entry_heads[j]); - task_entry_heads[j] = NULL; - } - free(entry_pool); - free(buckets); - free(free_entry_list); - entry_pool = NULL; - buckets = NULL; - free_entry_list = NULL; - return false; - } - for (int32_t i = 0; i < new_task_window_sizes[r]; i++) { - task_entry_heads[r][i] = nullptr; - } - task_window_sizes[r] = new_task_window_sizes[r]; - } - - for (int r = 0; r < PTO2_MAX_RING_DEPTH; r++) { - last_task_alives[r] = 0; - last_cleanup[r] = 0; - } - - return true; -} - -bool PTO2TensorMap::init_default(const int32_t new_task_window_sizes[PTO2_MAX_RING_DEPTH]) { - return init(PTO2_TENSORMAP_NUM_BUCKETS, PTO2_TENSORMAP_POOL_SIZE, new_task_window_sizes); -} - -void 
PTO2TensorMap::destroy() { - if (buckets) { - free(buckets); - buckets = NULL; - } - - if (entry_pool) { - free(entry_pool); - entry_pool = NULL; - } - - for (int r = 0; r < PTO2_MAX_RING_DEPTH; r++) { - if (task_entry_heads[r]) { - free(task_entry_heads[r]); - task_entry_heads[r] = NULL; - } - } - - if (free_entry_list) { - free(free_entry_list); - free_entry_list = NULL; - } -} - -// ============================================================================= -// Debug Utilities -// ============================================================================= - -void PTO2TensorMap::print_stats() { - int32_t valid = 0; - int32_t stale = 0; - int32_t empty_buckets = 0; - int32_t max_chain = 0; - int64_t total_chain = 0; - int32_t non_empty_buckets = 0; - - // Count entries - for (int32_t i = 0; i < pool_size; i++) { - if (entry_pool[i].bucket_index != -1) { - if (entry_valid(entry_pool[i])) { - valid++; - } else { - stale++; - } - } - } - - // Count bucket stats - for (int32_t b = 0; b < num_buckets; b++) { - int32_t chain_len = 0; - auto cur_entry = buckets[b]; - - while (cur_entry != nullptr) { - chain_len++; - cur_entry = cur_entry->next_in_bucket; - } - - if (chain_len == 0) { - empty_buckets++; - } else { - non_empty_buckets++; - total_chain += chain_len; - if (chain_len > max_chain) { - max_chain = chain_len; - } - } - } - - LOG_INFO("=== TensorMap Statistics ==="); - LOG_INFO("Pool size: %d", pool_size); - LOG_INFO("Pool next entry idx: %d", next_entry_idx); - LOG_INFO("Pool free_num: %d", free_num); - LOG_INFO("Num buckets: %d", num_buckets); - LOG_INFO("Valid entries: %d", valid); - LOG_INFO("Stale entries: %d", stale); - LOG_INFO("Empty buckets: %d", empty_buckets); - LOG_INFO("Max chain len: %d", max_chain); - LOG_INFO("Avg chain len: %.2f", non_empty_buckets > 0 ? 
(float)total_chain / non_empty_buckets : 0); - for (int r = 0; r < PTO2_MAX_RING_DEPTH; r++) { - LOG_INFO("Last task alive[%d]: %d", r, last_task_alives[r]); - } - LOG_INFO("============================"); -} - -int32_t PTO2TensorMap::valid_count() { - int32_t count = 0; - - for (int32_t i = 0; i < pool_size; i++) { - if (entry_pool[i].bucket_index != -1 && entry_valid(entry_pool[i])) { - count++; - } - } - - return count; -} - -void PTO2TensorMap::sync_tensormap(uint8_t ring_id, int32_t sm_last_task_alive) { - sync_validity(ring_id, sm_last_task_alive); - // Only attempt cleanup when last_task_alive has actually advanced; - // otherwise cleanup_retired would empty-loop and we'd spin forever. - if (sm_last_task_alive - last_cleanup[ring_id] >= PTO2_TENSORMAP_CLEANUP_INTERVAL) { - cleanup_retired(ring_id, last_cleanup[ring_id], sm_last_task_alive); - last_cleanup[ring_id] = sm_last_task_alive; - } -} - -// ============================================================================= -// TensorMap Lookup Profiling -// ============================================================================= -#if PTO2_TENSORMAP_PROFILING -PTO2TensorMapProfilingData pto2_tensormap_get_profiling() { - PTO2TensorMapProfilingData d; - d.lookup_chain_total = g_lookup_chain_total; - d.lookup_count = g_lookup_count; - d.lookup_chain_max = g_lookup_chain_max; - d.overlap_checks = g_lookup_overlap_checks; - d.overlap_hits = g_lookup_overlap_hits; - d.insert_count = g_insert_count; - - // Reset - g_lookup_chain_total = 0; - g_lookup_count = 0; - g_lookup_chain_max = 0; - g_lookup_overlap_checks = 0; - g_lookup_overlap_hits = 0; - g_insert_count = 0; - return d; -} -#endif diff --git a/src/a2a3/runtime/tensormap_and_ringbuffer_unmodified/runtime/pto_tensormap.h b/src/a2a3/runtime/tensormap_and_ringbuffer_unmodified/runtime/pto_tensormap.h deleted file mode 100644 index 0916d96da..000000000 --- a/src/a2a3/runtime/tensormap_and_ringbuffer_unmodified/runtime/pto_tensormap.h +++ /dev/null @@ 
-1,521 +0,0 @@ -/* - * Copyright (c) PyPTO Contributors. - * This program is free software, you can redistribute it and/or modify it under the terms and conditions of - * CANN Open Software License Agreement Version 2.0 (the "License"). - * Please refer to the License for details. You may not use this file except in compliance with the License. - * THIS SOFTWARE IS PROVIDED ON AN "AS IS" BASIS, WITHOUT WARRANTIES OF ANY KIND, EITHER EXPRESS OR IMPLIED, - * INCLUDING BUT NOT LIMITED TO NON-INFRINGEMENT, MERCHANTABILITY, OR FITNESS FOR A PARTICULAR PURPOSE. - * See LICENSE in the root of the software repository for the full text of the License. - * ----------------------------------------------------------------------------------------------------------- - */ - -/** - * PTO Runtime2 - TensorMap Interface - * - * TensorMap provides producer lookup for dependency discovery: - * - Maps Tensor -> producer task ID - * - Used by pto_submit_task() to find dependencies - * - * Key design features: - * 1. Ring buffer pool for entries (no malloc/free) - * 2. Lazy invalidation (entries become stale when producer retires) - * 3. Per-task per-ring entry tracking for efficient cleanup - * 4. OVERLAP DETECTION: Detects dependencies for overlapping sub-regions - * - * Hash table with chaining: - * - buckets[] array of head offsets - * - Entries linked via next_in_bucket - * - Insert at head (newest first) for sorted chains - * - * CRITICAL: Hash only by base_ptr - * ============================== - * For overlap detection to work, ALL sub-regions of the same base tensor - * MUST be in the SAME hash bucket. This allows lookup to compare all - * potentially overlapping regions. - * - * Overlap detection: Two regions create a dependency if: - * 1. Same base_ptr (raw tensor pointer) - * 2. 
Byte ranges [offset, offset+size) intersect - * - * Based on: docs/RUNTIME_LOGIC.md - */ - -#pragma once - -#include "common.h" // NOLINT(build/include_subdir) -#include "pto_runtime2_types.h" // NOLINT(build/include_subdir) -#include "tensor.h" // NOLINT(build/include_subdir) - -struct PTO2OrchestratorState; // forward declare - -// ============================================================================= -// TensorMap Lookup Profiling (must precede inline lookup/insert methods) -// ============================================================================= -#ifndef PTO2_TENSORMAP_PROFILING -#define PTO2_TENSORMAP_PROFILING 0 -#endif - -#if PTO2_TENSORMAP_PROFILING -extern uint64_t g_lookup_chain_total; -extern uint64_t g_lookup_count; -extern int32_t g_lookup_chain_max; -extern uint64_t g_lookup_overlap_checks; -extern uint64_t g_lookup_overlap_hits; -extern uint64_t g_insert_count; -#endif - -// ============================================================================= -// TensorMap Structure -// ============================================================================= - -/** - * TensorMap entry structure — cache-line optimized for lookup - * - * Cache line 1 (64B, lookup hot path): - * next_in_bucket, producer_task_id, buffer_addr — chain traversal + validity + hash match - * version, ndims, is_all_offset_zero, bucket_index — overlap fast path - * shapes[5] — overlap comparison - * - * Cache line 2 (64B, insert/remove/slow-path only): - * prev_in_bucket, next_in_task, prev_in_task — chain manipulation - * offsets[5] — only read when !is_all_offset_zero - * - * When is_all_offset_zero is true, lookup touches only cache line 1. - * Entry size: 128B (2 cache lines) vs previous 192B (3 cache lines with embedded Tensor). 
- */ -struct alignas(64) PTO2TensorMapEntry { - // === Cache line 1 (64B) — lookup hot path === - uint64_t buffer_addr; // 8B: tensor base address (hash key) - PTO2TensorMapEntry* next_in_bucket; // 8B: next entry in hash bucket chain - PTO2TaskId producer_task_id; // 8B: raw (ring_id << 32) | local_id - int32_t bucket_index; // 4B: bucket index (-1 if unlinked) - uint32_t __padding0__; // 4B: occupies Tensor::start_offset high half - int32_t version; // 4B: tensor version for overlap detection - uint32_t ndims; // 4B: number of dimensions - DataType __padding_dtype__; // 1B: occupies Tensor::dtype - bool is_all_offset_zero; // 1B: fast-path flag - uint8_t __padding1__[2]; - uint32_t shapes[RUNTIME_MAX_TENSOR_DIMS]; // 20B: shape per dimension - - // === Cache line 2 (64B) — insert/remove/slow-path === - PTO2TensorMapEntry* prev_in_bucket; // 8B: prev in hash bucket chain - PTO2TensorMapEntry* next_in_task; // 8B: next entry for same task - PTO2TensorMapEntry* prev_in_task; // 8B: prev entry for same task - uint32_t offsets[RUNTIME_MAX_TENSOR_DIMS]; // 20B: only when !is_all_offset_zero - // padding: 20B to fill 64B - - /** - * Copy overlap-relevant fields from a Tensor into this entry. - */ - void copy_from_tensor(const Tensor& tensor) { - memcpy(this, &tensor, 64); - if (!tensor.is_all_offset_zero) { - for (uint32_t i = 0; i < tensor.ndims; i++) { - offsets[i] = tensor.offsets[i]; - } - } - } - - void copy_tensor_create_info(const TensorCreateInfo& tensor_create_info, uint64_t addr) { - memcpy(this, &tensor_create_info, 64); - buffer_addr = addr; - } - - /** - * Check overlap between input tensor and this entry (the producer output). - * Mirrors Tensor::is_overlap() logic but operates on entry fields directly. 
- */ - OverlapStatus check_overlap(const Tensor& input) const { - debug_assert(input.buffer.addr == buffer_addr); - debug_assert(input.version >= version); - if (input.version > version) { - return OverlapStatus::OTHER; - } - // Fast path: both have zero offsets → ranges are [0, shape[i]) - if (input.is_all_offset_zero && is_all_offset_zero) { - bool contains = true; - for (uint32_t i = 0; i < ndims; i++) { - if (input.shapes[i] < shapes[i]) { - contains = false; - break; - } - } - return contains ? OverlapStatus::COVERED : OverlapStatus::OTHER; - } - // Slow path: at least one has non-zero offsets - bool contains = true; - for (uint32_t i = 0; i < ndims; i++) { - uint64_t in_off = input.is_all_offset_zero ? 0 : input.offsets[i]; - uint64_t ent_off = is_all_offset_zero ? 0 : offsets[i]; - Segment in_range{in_off, in_off + static_cast<uint64_t>(input.shapes[i])}; - Segment ent_range{ent_off, ent_off + static_cast<uint64_t>(shapes[i])}; - if (!in_range.line_segment_intersection(ent_range)) { - return OverlapStatus::NO_OVERLAP; - } else if (!in_range.contains(ent_range)) { - contains = false; - } - } - return contains ?
OverlapStatus::COVERED : OverlapStatus::OTHER; - } -}; - -static_assert(sizeof(PTO2TensorMapEntry) == 128, "TensorMapEntry must be exactly 2 cache lines (128 bytes)"); -static_assert(offsetof(PTO2TensorMapEntry, buffer_addr) == offsetof(Tensor, buffer.addr)); -static_assert(offsetof(PTO2TensorMapEntry, version) == offsetof(Tensor, version)); -static_assert(offsetof(PTO2TensorMapEntry, ndims) == offsetof(Tensor, ndims)); -static_assert(offsetof(PTO2TensorMapEntry, is_all_offset_zero) == offsetof(Tensor, is_all_offset_zero)); -static_assert(offsetof(PTO2TensorMapEntry, shapes) == offsetof(Tensor, shapes)); -static_assert( - offsetof(PTO2TensorMapEntry, prev_in_bucket) == 64, "TensorMapEntry must be exactly 2 cache lines (128 bytes)"); - -/** - * Stack-allocated lookup result (avoids heap allocation per lookup) - */ -#define PTO2_LOOKUP_MAX_RESULTS 16 -// ============================================================================= -// TensorMap Lookup Chain Length Statistics (compile-time toggle) -// ============================================================================= -struct PTO2LookupResult { - struct Entry { - PTO2TensorMapEntry* entry; - OverlapStatus overlap_status; - }; - Entry entries[PTO2_LOOKUP_MAX_RESULTS]; - int32_t count{0}; - - void push(PTO2TensorMapEntry* entry, OverlapStatus s) { - if (count < PTO2_LOOKUP_MAX_RESULTS) { - entries[count++] = {entry, s}; - } - } -}; - -/** - * TensorMap structure - * - * Hash table with ring buffer entry pool and lazy invalidation. 
- */ -struct PTO2TensorMap { - // Hash table buckets (fixed size, power of 2) - PTO2TensorMapEntry** buckets; // Array of bucket head pointers (nullptr = empty) - int32_t num_buckets; // Must be power of 2 for fast modulo - - // Entry pool as ring buffer - PTO2TensorMapEntry* entry_pool; // Ring buffer of entries - PTO2TensorMapEntry** free_entry_list; // Stack of freed entries available for reuse - int32_t pool_size; // Total pool capacity - int32_t next_entry_idx; // Index of the next never-used pool entry - int32_t free_num; // Number of entries on the free list - - // Per-ring per-task entry tracking (for efficient bucket cleanup) - // Indexed by [ring_id][local_id & (task_window_sizes[ring_id] - 1)] - PTO2TensorMapEntry** task_entry_heads[PTO2_MAX_RING_DEPTH]; - int32_t task_window_sizes[PTO2_MAX_RING_DEPTH]; // Per-ring task window size (for slot masking) - - // Per-ring validity threshold (for lazy invalidation) - int32_t last_task_alives[PTO2_MAX_RING_DEPTH]; // Cached from shared memory per ring - - // Per-ring cleanup progress (for periodic cleanup_retired) - int32_t last_cleanup[PTO2_MAX_RING_DEPTH]{}; - - PTO2OrchestratorState* orch{nullptr}; - - // new_entry only allocates memory, does not assign attributes - PTO2TensorMapEntry* new_entry() { - if (free_num > 0) { - PTO2TensorMapEntry* res = free_entry_list[--free_num]; - debug_assert(res->bucket_index == -1); - return res; - } - always_assert(next_entry_idx < pool_size); - PTO2TensorMapEntry* res = &entry_pool[next_entry_idx++]; - debug_assert(res->bucket_index == -1); - return res; - } - - void free_entry(PTO2TensorMapEntry& entry) { - always_assert(entry.bucket_index != -1); // must still be in a bucket - - // Update predecessor's next pointer (O(1) via prev_in_bucket) - if (entry.prev_in_bucket == nullptr) { - // Entry is the head of its bucket chain, update bucket head - // (bucket_index was recorded at link time, so no rehash is needed) - buckets[entry.bucket_index] = entry.next_in_bucket; - } else { - entry.prev_in_bucket->next_in_bucket = entry.next_in_bucket; - }
- - // Update successor's prev pointer - if (entry.next_in_bucket != nullptr) { - entry.next_in_bucket->prev_in_bucket = entry.prev_in_bucket; - } - - free_entry_list[free_num++] = &entry; - entry.bucket_index = -1; - entry.next_in_bucket = nullptr; - entry.prev_in_bucket = nullptr; - entry.next_in_task = nullptr; - entry.prev_in_task = nullptr; - } - - // ============================================================================= - // TensorMap API - // ============================================================================= - - /** - * Initialize TensorMap - * - * @param num_buckets Number of hash buckets (must be power of 2) - * @param pool_size Size of entry pool - * @return true on success, false on allocation failure - */ - bool init(int32_t num_buckets, int32_t pool_size, const int32_t task_window_sizes[PTO2_MAX_RING_DEPTH]); - - /** - * Initialize TensorMap with default sizes - */ - bool init_default(const int32_t task_window_sizes[PTO2_MAX_RING_DEPTH]); - - /** - * Destroy TensorMap and free resources - */ - void destroy(); - - /** - * Update validity threshold from shared memory - * Called periodically to refresh the lazy invalidation threshold. - * - * @param last_task_alive Current value from shared memory - */ - void sync_validity(int32_t ring_id, int32_t last_task_alive) { this->last_task_alives[ring_id] = last_task_alive; } - - /** - * Lookup producer for a tensor region - * - * Searches the hash table for a matching region. - * Returns producer entry if found and valid. - * Stale entries from different rings are skipped (not truncated). 
- * - * @param tensor Tensor to look up - * @param result Output: stack-allocated result buffer - */ - void lookup(const Tensor& tensor, PTO2LookupResult& result) { - uint32_t bucket_index = hash(tensor.buffer.addr); - PTO2TensorMapEntry* cur_entry = buckets[bucket_index]; - - result.count = 0; -#if PTO2_TENSORMAP_PROFILING - g_lookup_count++; - int32_t chain_len = 0; -#endif - - while (cur_entry != nullptr) { - PTO2TensorMapEntry* next_entry = cur_entry->next_in_bucket; - -#if PTO2_TENSORMAP_PROFILING - chain_len++; -#endif - // Skip stale entries (no chain truncation — entries from different - // rings can be interleaved, so a stale entry from one ring does NOT - // imply subsequent entries from other rings are also stale) - if (!entry_valid(*cur_entry)) { - cur_entry = next_entry; - continue; - } - - // Entry is valid - check if regions OVERLAP (not just exact match) - // Since we hash only by base_ptr, all entries in this bucket have - // potential to overlap. We must check actual byte-range overlap. - if (tensor.buffer.addr == cur_entry->buffer_addr) { -#if PTO2_TENSORMAP_PROFILING - g_lookup_overlap_checks++; -#endif - auto overlap_status = cur_entry->check_overlap(tensor); - if (overlap_status != OverlapStatus::NO_OVERLAP) { - result.push(cur_entry, overlap_status); -#if PTO2_TENSORMAP_PROFILING - g_lookup_overlap_hits++; -#endif - } - } - - // Move to next entry - cur_entry = next_entry; - } -#if PTO2_TENSORMAP_PROFILING - g_lookup_chain_total += chain_len; - if (chain_len > g_lookup_chain_max) g_lookup_chain_max = chain_len; -#endif - } - - /** - * Insert a new entry (called when task produces output) - * - * Allocates from ring buffer pool, may overwrite stale entries. - * Inserts at head of hash bucket chain (maintains task_id ordering). 
- * - * @param tensor Tensor produced - * @param producer_task_id Task ID of producer - */ - void insert(const Tensor& tensor, PTO2TaskId producer_task_id) { - PTO2TensorMapEntry* entry = new_entry(); - entry->copy_from_tensor(tensor); - link_entry(entry, tensor.buffer.addr, producer_task_id); - } - - /** - * Cleanup stale entries for retired tasks - * - * Called periodically by Orchestrator when last_task_alive advances. - * Removes entries from bucket chains for tasks in [old, new) range. - * - * @param old_last_task_alive Previous threshold - * @param new_last_task_alive New threshold - */ - void cleanup_retired(int32_t ring_id, int32_t old_last_task_alive, int32_t new_last_task_alive) { - // Iterate through retired tasks on this ring and remove their entries - for (int32_t local_id = old_last_task_alive; local_id < new_last_task_alive; local_id++) { - int32_t task_slot = local_id & (task_window_sizes[ring_id] - 1); - PTO2TensorMapEntry* cur_entry = task_entry_heads[ring_id][task_slot]; - - while (cur_entry != nullptr) { - PTO2TensorMapEntry* next_entry = cur_entry->next_in_task; // Save before clearing - // Every entry on this slot chain belongs to the retiring task - // (the chain is cleared here, before the slot is reused by a newer task) - debug_assert(cur_entry->producer_task_id == - PTO2TaskId::make(static_cast<uint32_t>(ring_id), static_cast<uint32_t>(local_id))); - free_entry(*cur_entry); - cur_entry = next_entry; - } - - // Clear task's entry head (slot will be reused by local_id + task_window_sizes[ring_id]) - task_entry_heads[ring_id][task_slot] = nullptr; - } - } - - // ============================================================================= - // Internal Helpers (exposed for testing) - // ============================================================================= - - /** - * Compute hash for tensor addr - * - * Multiplicative hash using the golden-ratio constant.
Multiplication - * mixes ALL input bits into the high bits of the product, so aligned - * addresses (low bits all-zero) still distribute evenly. We extract - * the top log2(num_buckets) bits which carry the most entropy. - */ - uint32_t hash(uint64_t key) { - key *= 0x9E3779B97F4A7C15ULL; - return static_cast<uint32_t>(key >> (64 - __builtin_ctz(num_buckets))); - } - - /** - * Link an initialized entry into bucket and task chains. - */ - void link_entry(PTO2TensorMapEntry* entry, uint64_t addr, PTO2TaskId producer_task_id) { -#if PTO2_TENSORMAP_PROFILING - g_insert_count++; -#endif - uint32_t bucket_index = hash(addr); - auto ring_id = producer_task_id.ring(); - auto local_id = producer_task_id.local(); - int32_t task_slot = local_id & (task_window_sizes[ring_id] - 1); - - entry->producer_task_id = producer_task_id; - - // Insert at head of hash bucket - entry->bucket_index = bucket_index; - entry->next_in_bucket = buckets[bucket_index]; - if (entry->next_in_bucket != nullptr) { - entry->next_in_bucket->prev_in_bucket = entry; - } - buckets[bucket_index] = entry; - entry->prev_in_bucket = nullptr; - - // Link to task's entry list - entry->next_in_task = task_entry_heads[ring_id][task_slot]; - entry->prev_in_task = nullptr; - if (entry->next_in_task != nullptr) { - entry->next_in_task->prev_in_task = entry; - } - task_entry_heads[ring_id][task_slot] = entry; - } - - /** - * Check if entry is valid (producer has not retired) - */ - bool entry_valid(const PTO2TensorMapEntry& entry) const { - return static_cast<int32_t>(entry.producer_task_id.local()) >= last_task_alives[entry.producer_task_id.ring()]; - } - - void remove_entry(PTO2TensorMapEntry& entry) { - remove_from_task(entry); - free_entry(entry); - } - - /** - * Remove entry from its task chain (O(1) with prev pointer) - * Called during pool wrap-around to unlink reused entries.
- */ - void remove_from_task(PTO2TensorMapEntry& entry) { - always_assert(entry.bucket_index != -1); // must still be in a bucket - // Update predecessor's next pointer (O(1) via prev_in_task) - if (entry.prev_in_task == nullptr) { - // Entry is the head of its task chain, update task_entry_heads - int32_t ring_id = entry.producer_task_id.ring(); - int32_t local_id = static_cast<int32_t>(entry.producer_task_id.local()); - int32_t task_slot = local_id & (task_window_sizes[ring_id] - 1); - task_entry_heads[ring_id][task_slot] = entry.next_in_task; - } else { - entry.prev_in_task->next_in_task = entry.next_in_task; - } - - // Update successor's prev pointer - if (entry.next_in_task != nullptr) { - entry.next_in_task->prev_in_task = entry.prev_in_task; - } - - entry.next_in_task = nullptr; - entry.prev_in_task = nullptr; - } - - // ============================================================================= - // Debug Utilities - // ============================================================================= - - /** - * Print TensorMap statistics - */ - void print_stats(); - - /** - * Get count of valid entries - */ - int32_t valid_count(); - - // ============================================================================= - // TensorMap Synchronization - // ============================================================================= - - /** - * Sync TensorMap validity threshold from shared memory - * - * Called periodically to refresh the lazy invalidation threshold. - * Also triggers cleanup if threshold has advanced significantly.
- */ - void sync_tensormap(uint8_t ring_id, int32_t sm_last_task_alive); -}; - -#if PTO2_TENSORMAP_PROFILING -struct PTO2TensorMapProfilingData { - uint64_t lookup_chain_total; - uint64_t lookup_count; - int32_t lookup_chain_max; - uint64_t overlap_checks; - uint64_t overlap_hits; - uint64_t insert_count; -}; - -PTO2TensorMapProfilingData pto2_tensormap_get_profiling(); -#endif diff --git a/src/a2a3/runtime/tensormap_and_ringbuffer_unmodified/runtime/pto_types.h b/src/a2a3/runtime/tensormap_and_ringbuffer_unmodified/runtime/pto_types.h deleted file mode 100644 index 6c3eb3acc..000000000 --- a/src/a2a3/runtime/tensormap_and_ringbuffer_unmodified/runtime/pto_types.h +++ /dev/null @@ -1,284 +0,0 @@ -/* - * Copyright (c) PyPTO Contributors. - * This program is free software, you can redistribute it and/or modify it under the terms and conditions of - * CANN Open Software License Agreement Version 2.0 (the "License"). - * Please refer to the License for details. You may not use this file except in compliance with the License. - * THIS SOFTWARE IS PROVIDED ON AN "AS IS" BASIS, WITHOUT WARRANTIES OF ANY KIND, EITHER EXPRESS OR IMPLIED, - * INCLUDING BUT NOT LIMITED TO NON-INFRINGEMENT, MERCHANTABILITY, OR FITNESS FOR A PARTICULAR PURPOSE. - * See LICENSE in the root of the software repository for the full text of the License. - * ----------------------------------------------------------------------------------------------------------- - */ -/** - * Orchestration Build Graph Types - Data structures for orchestration runtime extensions - * - * Standalone header defining orchestration-specific types for: - * - TaskOutputTensors: Return value from submit containing materialized output Tensors - * - Arg: Aggregated argument container for pto_submit_task API - * - * Tensor descriptor types (Tensor, PTOBufferHandle, TensorCreateInfo) are - * defined in tensor.h. 
- * - * This header is independent of orch_build_graph_runtime.h to allow inclusion from runtime.h - * without type conflicts (Handshake, TensorPair, HostApi). - */ - -#ifndef SRC_A2A3_RUNTIME_TENSORMAP_AND_RINGBUFFER_RUNTIME_PTO_TYPES_H_ -#define SRC_A2A3_RUNTIME_TENSORMAP_AND_RINGBUFFER_RUNTIME_PTO_TYPES_H_ - -#include <cstdint> -#include <cstring> - -#if defined(__aarch64__) -#include <arm_neon.h> -#endif - -#include "pto_submit_types.h" // NOLINT(build/include_subdir) -- PTO2LaunchSpec -#include "task_args.h" // NOLINT(build/include_subdir) -- TaskArgs base class -#include "tensor.h" // NOLINT(build/include_subdir) -#include "tensor_arg.h" // NOLINT(build/include_subdir) -- canonical TensorArgType definition - -// Task arguments -#define MAX_TENSOR_ARGS 16 // Maximum tensor arguments per task -#define MAX_SCALAR_ARGS 128 // Maximum scalar arguments per task -#define PTO2_MAX_OUTPUTS 16 // Maximum outputs per task -#define PTO2_MAX_INPUTS 16 // Maximum inputs per task -#define PTO2_MAX_INOUTS 8 // Maximum in-out args per task - -// ============================================================================= -// Task Output Tensors (return value from submit) -// ============================================================================= - -/** - * TaskOutputTensors — returned by submit, holds materialized output Tensors. - * - * Only runtime-created outputs are stored here, indexed in add_output order. - * - * The underlying storage is uninitialized; only output_count elements are - * valid after submit returns. This avoids default-constructing Tensor[] - * on the hot path (2 KB of unnecessary zeroing per submit). - * - * Users must hold a named TaskOutputTensors variable and borrow via get_ref(); - * binding get_ref() on an rvalue is compile-time rejected to prevent dangling.
- */ -class TaskOutputTensors { - public: // NOLINT(whitespace/indent) - TaskOutputTensors() : output_count_(0) {} - - bool empty() const { return output_count_ == 0; } - uint32_t size() const { return output_count_; } - - /// Borrow a materialized output tensor by index (lvalue only). - const Tensor& get_ref(uint32_t index) const& { - always_assert(index < output_count_); - return *tensors_[index]; - } - const Tensor& get_ref(uint32_t index) const&& = delete; - - /// Runtime-internal: append one materialized output Tensor. - void materialize_output(const Tensor& tensor) { - always_assert(output_count_ < PTO2_MAX_OUTPUTS); - tensors_[output_count_++] = &tensor; - } - - private: // NOLINT(whitespace/indent) - uint32_t output_count_; - const Tensor* tensors_[PTO2_MAX_OUTPUTS]; -}; - -// ============================================================================= -// Argument Types (for pto_submit_task API) -// ============================================================================= - -// TensorArgType is defined in tensor_arg.h (included above) - -/** - * Tagged union for a single Arg slot — either a Tensor* or a TensorCreateInfo value. - * The active member is determined by TensorArgType (OUTPUT → create_info, else → ptr). - */ -union TensorRef { - const Tensor* ptr; - const TensorCreateInfo* create_info; - TensorRef() : ptr(nullptr) {} -}; - -/** - * Aggregated argument container for pto_submit_task - * - * Inherits storage from TaskArgs. - * Each tensor slot stores a TensorRef union (Tensor* or TensorCreateInfo) - * discriminated by the corresponding tag(). - * Tensors are dispatched first in kernel args, followed by scalars. - * - * Output arguments follow two distinct ownership models: - * - add_output(const TensorCreateInfo&): OUTPUT — runtime allocates buffer - * and materializes a new Tensor, returned via TaskOutputTensors. - * - add_inout(const Tensor&): INOUT — reuses an existing Tensor as the write target. 
- * - * Example: - * Tensor x = make_tensor_external(dev_a, shapes, 2); - * TensorCreateInfo ci(shapes, 2); // must outlive submit - * Arg args; - * args.add_input(x); - * args.add_output(ci); - * args.add_scalar(some_value); - * TaskOutputTensors outs = pto2_rt_submit_aic_task(kernel_id, args); - * const Tensor& y = outs.get_ref(0); - */ -struct Arg : TaskArgs { - bool has_error{false}; - const char* error_msg{nullptr}; - PTO2LaunchSpec launch_spec; // SPMD launch parameters (block_num, etc.) - - void reset() { - clear(); - has_error = false; - error_msg = nullptr; - } - - void set_error(const char* msg) { - if (!has_error) { - has_error = true; - error_msg = msg; - } - } - - bool check_add_tensor_valid() { - if (scalar_count_ != 0) { - set_error( - "add_input/add_output/add_inout called after add_scalar: " - "all tensors must be added before any scalars"); - return false; - } - if (tensor_count_ >= MAX_TENSOR_ARGS) { - set_error("Too many tensor args (exceeds MAX_TENSOR_ARGS=16)"); - return false; - } - return true; - } - - void add_input(const Tensor& t) { - if (!check_add_tensor_valid()) { - return; - } - tensors_[tensor_count_].ptr = &t; - tags_[tensor_count_] = TensorArgType::INPUT; - tensor_count_++; - } - - /// Standard future-output path: runtime allocates buffer from heap, - /// materializes Tensor into TaskOutputTensors. - /// The TensorCreateInfo must outlive the submit call (pointer is stored). - void add_output(const TensorCreateInfo& ci) { - if (!check_add_tensor_valid()) { - return; - } - tensors_[tensor_count_].create_info = &ci; - tags_[tensor_count_] = TensorArgType::OUTPUT; - tensor_count_++; - } - - /// Prevent passing temporaries — the pointer would dangle before submit. 
- void add_output(TensorCreateInfo&&) = delete; - - void add_inout(const Tensor& t) { - if (!check_add_tensor_valid()) { - return; - } - tensors_[tensor_count_].ptr = &t; - tags_[tensor_count_] = TensorArgType::INOUT; - tensor_count_++; - } - - /// Write-only existing tensor: skips OverlapMap lookup, depends on creator. - void add_output(const Tensor& t) { - if (!check_add_tensor_valid()) return; - tensors_[tensor_count_].ptr = &t; - tags_[tensor_count_] = TensorArgType::OUTPUT_EXISTING; - tensor_count_++; - } - - /// No-dependency existing tensor: skips OverlapMap lookup, depends on creator only. - void add_no_dep(const Tensor& t) { - if (!check_add_tensor_valid()) return; - tensors_[tensor_count_].ptr = &t; - tags_[tensor_count_] = TensorArgType::NO_DEP; - tensor_count_++; - } - - /** - * Add a scalar value. Type is deduced from the argument; - * the value is bit-cast to uint64_t for storage. - * - * args.add_scalar(uint64_val); // existing usage unchanged - * args.add_scalar(3.14f); // float, auto bit-cast - * args.add_scalar(int32_t(42)); // int32, auto bit-cast - */ - template <typename T> - void add_scalar(T value) { - if (scalar_count_ >= MAX_SCALAR_ARGS) { - set_error("Too many scalar args (exceeds MAX_SCALAR_ARGS=128)"); - return; - } - scalars_[scalar_count_++] = to_u64(value); - } - - void add_scalars(const uint64_t* values, int count) { - if (scalar_count_ + count > MAX_SCALAR_ARGS) { - set_error("Too many scalar args (exceeds MAX_SCALAR_ARGS=128)"); - return; - } - memcpy(&scalars_[scalar_count_], values, count * sizeof(uint64_t)); - scalar_count_ += count; - } - - /** - * Zero-extend int32 bit patterns into uint64 scalar slots. - * Negative values are treated as their unsigned 32-bit representation - * (e.g., -1 → 0x00000000FFFFFFFF, not 0xFFFFFFFFFFFFFFFF). - * Uses NEON to process 4 elements per iteration on aarch64.
- */ - void add_scalars_i32(const int32_t* values, int count) { - if (scalar_count_ + count > MAX_SCALAR_ARGS) { - set_error("Too many scalar args (exceeds MAX_SCALAR_ARGS=128)"); - return; - } - uint64_t* dst = &scalars_[scalar_count_]; -#if defined(__aarch64__) - int i = 0; - for (; i + 4 <= count; i += 4) { - uint32x4_t v = vld1q_u32(reinterpret_cast<const uint32_t*>(values + i)); - uint64x2_t lo = vmovl_u32(vget_low_u32(v)); - uint64x2_t hi = vmovl_u32(vget_high_u32(v)); - vst1q_u64(dst + i, lo); - vst1q_u64(dst + i + 2, hi); - } - for (; i < count; i++) { - dst[i] = static_cast<uint64_t>(static_cast<uint32_t>(values[i])); - } -#else - for (int i = 0; i < count; i++) { - dst[i] = static_cast<uint64_t>(static_cast<uint32_t>(values[i])); - } -#endif - scalar_count_ += count; - } - - /** - * Copy scalars from another Arg's scalar array. - * Useful when multiple tasks share the same scalar data (e.g., block indices). - */ - void copy_scalars_from(const Arg& src, int src_offset, int count) { - if (src_offset + count > src.scalar_count_) { - set_error("Source scalar range out of bounds in copy_scalars_from"); - return; - } - if (scalar_count_ + count > MAX_SCALAR_ARGS) { - set_error("Too many scalar args (exceeds MAX_SCALAR_ARGS=128)"); - return; - } - memcpy(&scalars_[scalar_count_], &src.scalars_[src_offset], count * sizeof(uint64_t)); - scalar_count_ += count; - } -}; - -#endif // SRC_A2A3_RUNTIME_TENSORMAP_AND_RINGBUFFER_RUNTIME_PTO_TYPES_H_ diff --git a/src/a2a3/runtime/tensormap_and_ringbuffer_unmodified/runtime/runtime.cpp b/src/a2a3/runtime/tensormap_and_ringbuffer_unmodified/runtime/runtime.cpp deleted file mode 100644 index 38f74b8bc..000000000 --- a/src/a2a3/runtime/tensormap_and_ringbuffer_unmodified/runtime/runtime.cpp +++ /dev/null @@ -1,144 +0,0 @@ -/* - * Copyright (c) PyPTO Contributors. - * This program is free software, you can redistribute it and/or modify it under the terms and conditions of - * CANN Open Software License Agreement Version 2.0 (the "License").
- * Please refer to the License for details. You may not use this file except in compliance with the License. - * THIS SOFTWARE IS PROVIDED ON AN "AS IS" BASIS, WITHOUT WARRANTIES OF ANY KIND, EITHER EXPRESS OR IMPLIED, - * INCLUDING BUT NOT LIMITED TO NON-INFRINGEMENT, MERCHANTABILITY, OR FITNESS FOR A PARTICULAR PURPOSE. - * See LICENSE in the root of the software repository for the full text of the License. - * ----------------------------------------------------------------------------------------------------------- - */ -/** - * Runtime Class - Implementation - * - * Device execution and handshake control. - * Task graph construction is handled by PTO2Runtime. - */ - -#include "runtime.h" // NOLINT(build/include_subdir) - -#include "common/unified_log.h" -#include "pto_runtime2_types.h" // NOLINT(build/include_subdir) -#include "pto_shared_memory.h" // NOLINT(build/include_subdir) - -// ============================================================================= -// Constructor -// ============================================================================= - -Runtime::Runtime() { - // NOTE: host_api is initialized in InitRuntime() (host-only code) - // because the CApi functions don't exist when compiled for device. 
- - // Initialize handshake buffers - memset(workers, 0, sizeof(workers)); - worker_count = 0; - sche_cpu_num = 1; - orch_thread_num = 1; - ready_queue_shards = RUNTIME_DEFAULT_READY_QUEUE_SHARDS; - pto2_task_window_size = 0; - pto2_heap_size = 0; - pto2_dep_pool_size = 0; - orch_to_sched = false; - - // Initialize tensor pairs - tensor_pair_count = 0; - - // Initialize device orchestration state - orch_built_on_host_ = true; - pto2_gm_sm_ptr_ = nullptr; - pto2_gm_heap_ptr_ = nullptr; - pto2_slot_states_ptr_ = nullptr; - orch_args_storage_.clear(); - - // Initialize device orchestration SO binary - device_orch_so_size_ = 0; - - // Initialize kernel binary tracking - registered_kernel_count_ = 0; - - // Initialize function address mapping - for (int i = 0; i < RUNTIME_MAX_FUNC_ID; i++) { - func_id_to_addr_[i] = 0; - } -} - -// ============================================================================= -// Tensor Pair Management -// ============================================================================= - -void Runtime::record_tensor_pair(void* host_ptr, void* dev_ptr, size_t size) { - if (tensor_pair_count >= RUNTIME_MAX_TENSOR_PAIRS) { - LOG_ERROR("[Runtime] Tensor pairs full (max=%d)", RUNTIME_MAX_TENSOR_PAIRS); - return; - } - tensor_pairs[tensor_pair_count].host_ptr = host_ptr; - tensor_pairs[tensor_pair_count].dev_ptr = dev_ptr; - tensor_pairs[tensor_pair_count].size = size; - tensor_pair_count++; - LOG_INFO("Recorded tensor pair: host=%p dev=%p size=%zu", host_ptr, dev_ptr, size); -} - -TensorPair* Runtime::get_tensor_pairs() { return tensor_pairs; } - -int Runtime::get_tensor_pair_count() const { return tensor_pair_count; } - -void Runtime::clear_tensor_pairs() { tensor_pair_count = 0; } - -// ============================================================================= -// Device orchestration -// ============================================================================= - -bool Runtime::get_orch_built_on_host() const { return orch_built_on_host_; 
}
-void* Runtime::get_pto2_gm_sm_ptr() const { return pto2_gm_sm_ptr_; }
-void* Runtime::get_pto2_gm_heap_ptr() const { return pto2_gm_heap_ptr_; }
-const ChipStorageTaskArgs& Runtime::get_orch_args() const { return orch_args_storage_; }
-void Runtime::set_orch_built_on_host(bool v) { orch_built_on_host_ = v; }
-void Runtime::set_pto2_gm_sm_ptr(void* p) { pto2_gm_sm_ptr_ = p; }
-void Runtime::set_pto2_gm_heap(void* p) { pto2_gm_heap_ptr_ = p; }
-void Runtime::set_pto2_slot_states_ptr(void* p) { pto2_slot_states_ptr_ = p; }
-void Runtime::set_orch_args(const ChipStorageTaskArgs& args) { orch_args_storage_ = args; }
-
-// Device orchestration SO binary (for dlopen on AICPU thread 3)
-// Copies data to internal storage to avoid lifetime issues with Python ctypes arrays
-void Runtime::set_device_orch_so(const void* data, size_t size) {
-    if (data == nullptr || size == 0) {
-        device_orch_so_size_ = 0;
-        return;
-    }
-    if (size > RUNTIME_MAX_ORCH_SO_SIZE) {
-        LOG_ERROR("[Runtime] Orchestration SO too large (%zu > %d)", size, RUNTIME_MAX_ORCH_SO_SIZE);
-        device_orch_so_size_ = 0;
-        return;
-    }
-    memcpy(device_orch_so_storage_, data, size);
-    device_orch_so_size_ = size;
-}
-
-const void* Runtime::get_device_orch_so_data() const {
-    return device_orch_so_size_ > 0 ? device_orch_so_storage_ : nullptr;
-}
-
-size_t Runtime::get_device_orch_so_size() const { return device_orch_so_size_; }
-
-uint64_t Runtime::get_function_bin_addr(int func_id) const {
-    if (func_id < 0 || func_id >= RUNTIME_MAX_FUNC_ID) return 0;
-    return func_id_to_addr_[func_id];
-}
-
-void Runtime::set_function_bin_addr(int func_id, uint64_t addr) {
-    if (func_id >= 0 && func_id < RUNTIME_MAX_FUNC_ID) {
-        func_id_to_addr_[func_id] = addr;
-        if (addr != 0 && registered_kernel_count_ < RUNTIME_MAX_FUNC_ID) {
-            registered_kernel_func_ids_[registered_kernel_count_++] = func_id;
-        }
-    }
-}
-
-int Runtime::get_registered_kernel_count() const { return registered_kernel_count_; }
-
-int Runtime::get_registered_kernel_func_id(int index) const {
-    if (index < 0 || index >= registered_kernel_count_) return -1;
-    return registered_kernel_func_ids_[index];
-}
-
-void Runtime::clear_registered_kernels() { registered_kernel_count_ = 0; }
diff --git a/src/a2a3/runtime/tensormap_and_ringbuffer_unmodified/runtime/runtime.h b/src/a2a3/runtime/tensormap_and_ringbuffer_unmodified/runtime/runtime.h
deleted file mode 100644
index 208a6b13a..000000000
--- a/src/a2a3/runtime/tensormap_and_ringbuffer_unmodified/runtime/runtime.h
+++ /dev/null
@@ -1,290 +0,0 @@
-/*
- * Copyright (c) PyPTO Contributors.
- * This program is free software, you can redistribute it and/or modify it under the terms and conditions of
- * CANN Open Software License Agreement Version 2.0 (the "License").
- * Please refer to the License for details. You may not use this file except in compliance with the License.
- * THIS SOFTWARE IS PROVIDED ON AN "AS IS" BASIS, WITHOUT WARRANTIES OF ANY KIND, EITHER EXPRESS OR IMPLIED,
- * INCLUDING BUT NOT LIMITED TO NON-INFRINGEMENT, MERCHANTABILITY, OR FITNESS FOR A PARTICULAR PURPOSE.
- * See LICENSE in the root of the software repository for the full text of the License.
- * -----------------------------------------------------------------------------------------------------------
- */
-/**
- * Runtime Class - Device Execution and Handshake Control
- *
- * This class manages device-side execution through AICPU-AICore handshake
- * protocol. Task graph construction is handled by PTO2Runtime; this class
- * only handles:
- *   - Handshake buffers for AICPU-AICore communication
- *   - Execution parameters (block_dim, sche_cpu_num)
- *   - Tensor pair management for host-device memory tracking
- *   - Device orchestration state (pto2_gm_sm_ptr_, orch_args_)
- *   - Function address mapping (func_id_to_addr_)
- *
- * Task dispatch uses a per-core PTO2DispatchPayload written by the scheduler.
- * At dispatch time, build_payload() copies tensor pointers and scalars from
- * the task payload into the per-core args[], populates SPMD context, then
- * signals AICore via DATA_MAIN_BASE.
- */
-
-#ifndef SRC_A2A3_RUNTIME_TENSORMAP_AND_RINGBUFFER_RUNTIME_RUNTIME_H_
-#define SRC_A2A3_RUNTIME_TENSORMAP_AND_RINGBUFFER_RUNTIME_RUNTIME_H_
-
-#include <cstddef>
-#include <cstdint>
-#include <cstdio>   // for fprintf, printf
-#include <cstring>  // for memset
-
-#include "common/core_type.h"
-#include "common/perf_profiling.h"
-#include "common/platform_config.h"
-#include "pto2_dispatch_payload.h"
-#include "task_args.h"
-
-// =============================================================================
-// Configuration Macros
-// =============================================================================
-
-#define RUNTIME_MAX_ARGS 128
-#define RUNTIME_MAX_WORKER 72  // 24 AIC + 48 AIV cores
-#define RUNTIME_MAX_TENSOR_PAIRS 64
-#define RUNTIME_MAX_FUNC_ID 32
-#define RUNTIME_MAX_ORCH_SO_SIZE (4 * 1024 * 1024)  // 1MB max for orchestration SO
-
-// Default ready queue shards: one shard per worker thread (total minus orchestrator)
-constexpr int RUNTIME_DEFAULT_READY_QUEUE_SHARDS = PLATFORM_MAX_AICPU_THREADS - 1;
-
-// =============================================================================
-// Data Structures
-// =============================================================================
-
-/**
- * Handshake Structure - Shared between Host, AICPU, and AICore
- *
- * This structure facilitates communication and synchronization between
- * AICPU and AICore during task execution.
- *
- * Protocol State Machine:
- *   1. Initialization: AICPU sets aicpu_ready=1
- *   2. Acknowledgment: AICore sets aicore_done=core_id+1
- *   3. Task Dispatch: AICPU writes DATA_MAIN_BASE after updating the per-core payload
- *   4. Task Execution: AICore reads the cached PTO2DispatchPayload and executes
- *   5. Task Completion: AICore writes FIN to COND; AICPU observes completion
- *   6. Shutdown: AICPU sets control=1, AICore exits
- *
- * Each AICore instance has its own handshake buffer to enable concurrent
- * task execution across multiple cores.
- */
-
-/**
- * Handshake buffer for AICPU-AICore communication
- *
- * Each AICore has its own handshake buffer for synchronization with AICPU.
- * The structure is cache-line aligned (64 bytes) to prevent false sharing
- * between cores and optimize cache coherency operations.
- *
- * Field Access Patterns:
- *   - aicpu_ready: Written by AICPU, read by AICore
- *   - aicore_done: Written by AICore, read by AICPU
- *   - task: Written by AICPU, read by AICore (0 = not ready, non-zero = PTO2DispatchPayload*)
- *   - task_status: Written by both (AICPU=1 on dispatch, AICore=0 on completion)
- *   - control: Written by AICPU, read by AICore (0 = continue, 1 = quit)
- *   - core_type: Written by AICPU, read by AICore (CoreType::AIC or CoreType::AIV)
- */
-struct Handshake {
-    volatile uint32_t aicpu_ready;        // AICPU ready signal: 0=not ready, 1=ready
-    volatile uint32_t aicore_done;        // AICore ready signal: 0=not ready, core_id+1=ready
-    volatile uint64_t task;               // Init: PTO2DispatchPayload* (set before aicpu_ready); runtime: unused
-    volatile int32_t task_status;         // Task execution status: 0=idle, 1=busy
-    volatile int32_t control;             // Control signal: 0=execute, 1=quit
-    volatile CoreType core_type;          // Core type: CoreType::AIC or CoreType::AIV
-    volatile uint64_t perf_records_addr;  // Performance records address
-    volatile uint32_t perf_buffer_status; // 0 = not full, 1 == full
-    volatile uint32_t physical_core_id;   // Physical core ID
-    volatile uint32_t aicpu_regs_ready;   // AICPU register init done: 0=pending, 1=done
-    volatile uint32_t aicore_regs_ready;  // AICore ID reported: 0=pending, 1=done
-} __attribute__((aligned(64)));
-
-/**
- * Tensor pair for tracking host-device memory mappings.
- * Used for copy-back during finalize.
- */
-struct TensorPair {
-    void* host_ptr;
-    void* dev_ptr;
-    size_t size;
-};
-
-/**
- * Host API function pointers for device memory operations.
- * Allows runtime to use pluggable device memory backends.
- */
-struct HostApi {
-    void* (*device_malloc)(size_t size);
-    void (*device_free)(void* dev_ptr);
-    int (*copy_to_device)(void* dev_ptr, const void* host_ptr, size_t size);
-    int (*copy_from_device)(void* host_ptr, const void* dev_ptr, size_t size);
-    uint64_t (*upload_kernel_binary)(int func_id, const uint8_t* bin_data, size_t bin_size);
-    void (*remove_kernel_binary)(int func_id);
-};
-
-/**
- * Task structure - Compatibility stub for platform layer
- *
- * RT2 uses PTO2DispatchPayload instead of Task for task dispatch.
- * This stub exists only for API compatibility with device_runner.cpp.
- * Since get_task_count() returns 0, this struct is never actually used.
- */
-struct Task {
-    int func_id;
-    uint64_t function_bin_addr;
-};
-
-// =============================================================================
-// Runtime Class
-// =============================================================================
-
-/**
- * Runtime class for device execution and handshake control
- *
- * This class manages AICPU-AICore communication through handshake buffers.
- * Task graph construction is handled by PTO2Runtime; this class only handles
- * execution control and device orchestration state.
- */
-class Runtime {
- public:  // NOLINT(whitespace/indent)
-    // Handshake buffers for AICPU-AICore communication
-    Handshake workers[RUNTIME_MAX_WORKER];  // Worker (AICore) handshake buffers
-    int worker_count;                       // Number of active workers
-
-    // Execution parameters for AICPU scheduling
-    int sche_cpu_num;        // Number of AICPU threads for scheduling
-    int orch_thread_num;     // Number of orchestrator threads (default 1)
-    int ready_queue_shards;  // Number of ready queue shards (1..MAX_AICPU_THREADS, default MAX-1)
-
-    // Ring buffer size overrides (0 = use compile-time defaults)
-    uint64_t pto2_task_window_size;
-    uint64_t pto2_heap_size;
-    uint64_t pto2_dep_pool_size;
-
-    // PTO2 integration: kernel_id -> GM function_bin_addr mapping
-    // NOTE: Made public for direct access from aicore code
-    uint64_t func_id_to_addr_[RUNTIME_MAX_FUNC_ID];
-
-    // Profiling support
-    bool enable_profiling;  // Enable profiling flag
-
-    // Orchestrator-to-scheduler transition control
-    // When true, orchestrator threads convert to scheduler threads after orchestration completes.
-    // When false (default), orchestrator threads exit after orchestration without dispatching tasks.
-    // Controlled via PTO2_ORCH_TO_SCHED environment variable.
-    bool orch_to_sched;
-    uint64_t perf_data_base;  // Performance data shared memory base address (device-side)
-
- private:  // NOLINT(whitespace/indent)
-    // Tensor pairs for host-device memory tracking
-    TensorPair tensor_pairs[RUNTIME_MAX_TENSOR_PAIRS];
-    int tensor_pair_count;
-
-    // Kernel binary tracking for cleanup
-    int registered_kernel_func_ids_[RUNTIME_MAX_FUNC_ID];
-    int registered_kernel_count_;
-
-    // Device orchestration: when false, orchestration runs on device (thread 3)
-    bool orch_built_on_host_;
-    void* pto2_gm_sm_ptr_;        // GM pointer to PTO2 shared memory (device)
-    void* pto2_gm_heap_ptr_;      // GM heap for orchestrator output buffers (device)
-    void* pto2_slot_states_ptr_;  // Pointer to PTO2TaskSlotState array (scheduler-private, for profiling)
-    ChipStorageTaskArgs orch_args_storage_;  // Copy of args for device
-
-    // Device orchestration SO binary (for dlopen on AICPU thread 3)
-    // Stored as a copy to avoid lifetime issues with Python ctypes arrays
-    uint8_t device_orch_so_storage_[RUNTIME_MAX_ORCH_SO_SIZE];
-    size_t device_orch_so_size_;
-
- public:  // NOLINT(whitespace/indent)
-    /**
-     * Constructor - zero-initialize all arrays
-     */
-    Runtime();
-
-    // =========================================================================
-    // Tensor Pair Management
-    // =========================================================================
-
-    /**
-     * Record a host-device tensor pair for copy-back during finalize.
-     */
-    void record_tensor_pair(void* host_ptr, void* dev_ptr, size_t size);
-
-    /**
-     * Get pointer to tensor pairs array.
-     */
-    TensorPair* get_tensor_pairs();
-
-    /**
-     * Get number of recorded tensor pairs.
-     */
-    int get_tensor_pair_count() const;
-
-    /**
-     * Clear all recorded tensor pairs.
-     */
-    void clear_tensor_pairs();
-
-    // =========================================================================
-    // Performance Profiling
-    // =========================================================================
-
-    // =========================================================================
-    // Device orchestration (for AICPU thread 3)
-    // =========================================================================
-
-    bool get_orch_built_on_host() const;
-    void* get_pto2_gm_sm_ptr() const;
-    void* get_pto2_gm_heap_ptr() const;
-    const ChipStorageTaskArgs& get_orch_args() const;
-    void set_orch_built_on_host(bool v);
-    void set_pto2_gm_sm_ptr(void* p);
-    void set_pto2_gm_heap(void* p);
-    void set_pto2_slot_states_ptr(void* p);
-    void set_orch_args(const ChipStorageTaskArgs& args);
-
-    // Device orchestration SO binary (for dlopen on AICPU thread 3)
-    void set_device_orch_so(const void* data, size_t size);
-    const void* get_device_orch_so_data() const;
-    size_t get_device_orch_so_size() const;
-
-    uint64_t get_function_bin_addr(int func_id) const;
-    void set_function_bin_addr(int func_id, uint64_t addr);
-
-    int get_registered_kernel_count() const;
-    int get_registered_kernel_func_id(int index) const;
-    void clear_registered_kernels();
-
-    // =========================================================================
-    // Deprecated API (for platform compatibility, always returns 0/nullptr)
-    // Task graph is now managed by PTO2Runtime, not Runtime
-    // =========================================================================
-
-    /** @deprecated Task count is now in PTO2 shared memory */
-    int get_task_count() const { return 0; }
-
-    /** @deprecated RT2 uses PTO2DispatchPayload, not Task. Always returns nullptr. */
-    Task* get_task(int) { return nullptr; }
-
-    /** @deprecated Use PTO2 dispatch mode */
-    bool get_use_pto2_dispatch() const { return true; }
-
-    /** @deprecated Use PTO2 dispatch mode */
-    void set_use_pto2_dispatch(bool) {}
-
-    // =========================================================================
-    // Host API (host-only, not copied to device)
-    // =========================================================================
-
-    // Host API function pointers for device memory operations
-    // NOTE: Placed at end of class to avoid affecting device memory layout
-    HostApi host_api;
-};
-
-#endif  // SRC_A2A3_RUNTIME_TENSORMAP_AND_RINGBUFFER_RUNTIME_RUNTIME_H_
diff --git a/src/a2a3/runtime/tensormap_and_ringbuffer_unmodified/runtime/tensor.h b/src/a2a3/runtime/tensormap_and_ringbuffer_unmodified/runtime/tensor.h
deleted file mode 100644
index ae836df47..000000000
--- a/src/a2a3/runtime/tensormap_and_ringbuffer_unmodified/runtime/tensor.h
+++ /dev/null
@@ -1,493 +0,0 @@
-/*
- * Copyright (c) PyPTO Contributors.
- * This program is free software, you can redistribute it and/or modify it under the terms and conditions of
- * CANN Open Software License Agreement Version 2.0 (the "License").
- * Please refer to the License for details. You may not use this file except in compliance with the License.
- * THIS SOFTWARE IS PROVIDED ON AN "AS IS" BASIS, WITHOUT WARRANTIES OF ANY KIND, EITHER EXPRESS OR IMPLIED,
- * INCLUDING BUT NOT LIMITED TO NON-INFRINGEMENT, MERCHANTABILITY, OR FITNESS FOR A PARTICULAR PURPOSE.
- * See LICENSE in the root of the software repository for the full text of the License.
- * -----------------------------------------------------------------------------------------------------------
- */
-#pragma once
-
-#include <cstdint>
-#include <cstring>
-
-#include <algorithm>
-#include <sstream>
-#include <string>
-#include <utility>
-
-#include "common.h"       // NOLINT(build/include_subdir)
-#include "data_type.h"    // NOLINT(build/include_subdir)
-#include "pto_task_id.h"  // NOLINT(build/include_subdir)
-
-constexpr int RUNTIME_MAX_TENSOR_DIMS = 5;
-
-/**
- * Buffer Handle
- *
- * Represents a device memory buffer with address and total size in bytes.
- * This is the underlying memory allocation that a Tensor describes access patterns for.
- */
-struct PTOBufferHandle {
-    uint64_t addr;  // Device memory address (bytes)
-    uint64_t size;  // Total buffer size in bytes
-};
-
-enum class OverlapStatus {
-    NO_OVERLAP,
-    COVERED,
-    OTHER,
-};
-
-struct Segment {
-    uint64_t begin;
-    uint64_t end;
-
-    bool line_segment_intersection(const Segment& other) const { return end > other.begin && other.end > begin; }
-    bool contains(const Segment& other) const { return begin <= other.begin && other.end <= end; }
-};
-
-/**
- * TensorCreateInfo — submit-time create-info for runtime-allocated outputs.
- *
- * Carries the metadata required to materialize a fresh contiguous output:
- * dtype, ndims, raw_shapes (== shapes), manual_dep, and an optional
- * initial value fill.
- *
- * Layout (64B) is aligned with Tensor cacheline 1 so that
- * init_from_create_info() can copy the entire cacheline with a single memcpy,
- * then overwrite buffer/owner metadata and refresh start_offset later.
- *
- * Arg::add_output() stores a pointer to this object, so the original
- * must remain valid (not a temporary) until after the submit call.
- */
-class alignas(64) TensorCreateInfo {
- public:  // NOLINT(whitespace/indent)
-    TensorCreateInfo(
-        const uint32_t shapes[], uint32_t ndims, DataType dtype = DataType::FLOAT32, bool manual_dep = false)
-        : initial_value(0),
-          has_initial_value(false),
-          version(0),
-          ndims(ndims),
-          dtype(dtype),
-          is_all_offset_zero(true),
-          is_raw_eq_shapes(true),
-          manual_dep(manual_dep) {
-        for (uint32_t i = 0; i < ndims; i++) {
-            raw_shapes[i] = shapes[i];
-        }
-    }
-
-    void copy(const TensorCreateInfo& other) { memcpy(this, &other, sizeof(other)); }
-
-    template <typename T>
-    void set_initial_value(T value) {
-        has_initial_value = true;
-        initial_value = to_u64(value);
-    }
-
-    uint64_t buffer_size_bytes() const {
-        uint64_t total = 1;
-        for (uint32_t i = 0; i < ndims; i++) {
-            total *= raw_shapes[i];
-        }
-        return total * get_element_size(dtype);
-    }
-
- public:  // NOLINT(whitespace/indent)
-    // --- Bytes [0, 32): TensorCreateInfo-only fields ---
-    // These occupy the same positions as Tensor::buffer, Tensor::owner_task_id,
-    // and Tensor::start_offset. The runtime overwrites owner metadata after the
-    // memcpy and refreshes start_offset during payload materialization.
-    uint64_t initial_value;
-    bool has_initial_value;
-    uint8_t __pad1__[7];
-    uint64_t __pad2__;  // → Tensor::owner_task_id
-    uint64_t __pad3__;  // → Tensor::start_offset (zeroed)
-
-    // --- Bytes [32, 64): Matches Tensor cacheline 1 layout ---
-    int32_t version;  // Always 0 for create-info outputs
-    uint32_t ndims;
-    DataType dtype;
-    bool is_all_offset_zero;  // Always true for create-info outputs
-    bool is_raw_eq_shapes;    // Always true for create-info outputs
-    bool manual_dep;
-    uint32_t raw_shapes[RUNTIME_MAX_TENSOR_DIMS];  // → Tensor::shapes
-
-    TensorCreateInfo() = default;
-
-    friend struct Arg;
-};
-
-static_assert(sizeof(TensorCreateInfo) == 64);
-
-/**
- * Tensor descriptor for Task input/output (128B = 2 cache lines)
- *
- * Describes a memory access pattern on Global Memory (GM) using
- * raw_shapes (underlying buffer dimensions), shapes (current view dimensions),
- * and offsets (multi-dimensional offset into the buffer).
- *
- *   - `buffer` contains the underlying memory allocation (addr in bytes, size in bytes)
- *   - `raw_shapes[]`, `shapes[]`, `offsets[]` are in ELEMENTS
- *   - `dtype` specifies element type for interpreting buffer contents
- *
- * Fast-path flags (all on cache line 1):
- *   - is_all_offset_zero: when true, offsets[] are implicitly zero — skip offset read/write
- *   - is_raw_eq_shapes: when true, raw_shapes[] == shapes[] — skip raw_shapes read/write,
- *     use shapes[] wherever raw_shapes would be needed
- *   - manual_dep: when true, keep creator retention only and skip OverlapMap dependency tracking
- *
- * When BOTH flags are true, cache line 2 is never accessed.
- *
- * Layout: cache line 1 holds hot-path fields (buffer, owner_task_id, start_offset,
- * version, ndims, dtype, flags, shapes); cache line 2 holds warm-path fields (raw_shapes, offsets).
- *
- * Construction:
- *   Users cannot default-construct or directly construct a Tensor.
- *   Valid Tensors are obtained only through controlled entry points:
- *     - make_tensor_external(...)
- *     - from_tensor_arg(...)
- *     - TaskOutputTensors returned by submit(...)
- *     - Tensor::view() / reshape() / transpose() on an existing valid Tensor
- */
-struct alignas(64) Tensor {
-    // === Cache line 1 (64B) — hot path ===
-    PTOBufferHandle buffer;    // Underlying memory buffer (addr in bytes, size in bytes)
-    PTO2TaskId owner_task_id;  // Creator task; PTO2TaskId::invalid() for external tensors
-    uint64_t start_offset;     // Cached 1D element offset (precomputed from raw_shapes + offsets)
-    int32_t version;           // Tensor version for overlap detection
-    uint32_t ndims;            // Number of dimensions used
-    DataType dtype;            // Data type of tensor elements
-    bool is_all_offset_zero;   // True when all offsets[] are zero (skip offset read/write)
-    bool is_raw_eq_shapes;     // True when raw_shapes[] == shapes[] (skip raw_shapes read/write)
-    bool manual_dep;           // True when dependency tracking is creator-only (skip OverlapMap lookup/insert)
-    uint32_t shapes[RUNTIME_MAX_TENSOR_DIMS];  // Current view shape per dimension
-
-    // === Cache line 2 (64B) — warm path ===
-    uint32_t raw_shapes[RUNTIME_MAX_TENSOR_DIMS];  // Underlying buffer shape per dimension
-    uint32_t offsets[RUNTIME_MAX_TENSOR_DIMS];     // Multi-dimensional offset per dimension
-    uint8_t _pad_cl2[24];                          // Tail padding (bytes 104–127)
-
-    // --- Copy / move / destroy are public (valid tensors can be freely copied) ---
-    Tensor(const Tensor&) = default;
-    Tensor& operator=(const Tensor&) = default;
-    Tensor(Tensor&&) = default;
-    Tensor& operator=(Tensor&&) = default;
-    ~Tensor() = default;
-
-    /// Return the effective raw_shapes pointer (shapes[] when is_raw_eq_shapes).
-    /// Avoids cache line 2 access for the common case.
-    const uint32_t* get_raw_shapes() const { return is_raw_eq_shapes ? shapes : raw_shapes; }
-
-    // --- Initialization (operates on already-constructed Tensor) ---
-    void init(void* addr,
-              uint64_t buffer_size_bytes,
-              const uint32_t in_raw_shapes[],
-              const uint32_t in_shapes[],
-              const uint32_t in_offsets[],
-              uint32_t in_ndims,
-              DataType in_dtype,
-              int32_t in_version,
-              bool in_is_all_offset_zero = false,
-              bool in_is_raw_eq_shapes = false,
-              bool in_manual_dep = false) {
-        buffer = {reinterpret_cast<uint64_t>(addr), buffer_size_bytes};
-        ndims = in_ndims;
-        dtype = in_dtype;
-        version = in_version;
-        is_all_offset_zero = in_is_all_offset_zero;
-        is_raw_eq_shapes = in_is_raw_eq_shapes;
-        manual_dep = in_manual_dep;
-        for (uint32_t i = 0; i < in_ndims; i++) {
-            shapes[i] = in_shapes[i];
-        }
-        if (!in_is_raw_eq_shapes) {
-            for (uint32_t i = 0; i < in_ndims; i++) {
-                raw_shapes[i] = in_raw_shapes[i];
-            }
-        }
-        if (!in_is_all_offset_zero) {
-            for (uint32_t i = 0; i < in_ndims; i++) {
-                offsets[i] = in_offsets[i];
-            }
-        }
-        owner_task_id = PTO2TaskId::invalid();
-    }
-
-    void init(const Tensor& other) {
-        memcpy(this, &other, 64);  // fast copy cache line 1
-        if (!other.is_raw_eq_shapes) {
-            for (uint32_t i = 0; i < other.ndims; i++) {
-                raw_shapes[i] = other.raw_shapes[i];
-            }
-        }
-        if (!other.is_all_offset_zero) {
-            for (uint32_t i = 0; i < other.ndims; i++) {
-                offsets[i] = other.offsets[i];
-            }
-        }
-    }
-
-    void init_with_view(
-        const Tensor& other, const uint32_t view_shapes[], const uint32_t view_offsets[], bool in_manual_dep = false) {
-        buffer = other.buffer;
-        ndims = other.ndims;
-        dtype = other.dtype;
-        version = other.version;
-        manual_dep = in_manual_dep;
-        // view always diverges shapes from raw_shapes, so is_raw_eq_shapes = false.
-        // Read parent's effective raw_shapes (avoids parent cache line 2 when parent is_raw_eq_shapes).
-        is_raw_eq_shapes = false;
-        const uint32_t* parent_raw = other.get_raw_shapes();
-        for (uint32_t i = 0; i < ndims; i++) {
-            raw_shapes[i] = parent_raw[i];
-            shapes[i] = view_shapes[i];
-        }
-        // Compute offsets and zero-flag
-        bool all_zero = true;
-        if (other.is_all_offset_zero) {
-            for (uint32_t i = 0; i < ndims; i++) {
-                if (view_offsets[i] != 0) {
-                    all_zero = false;
-                    break;
-                }
-            }
-            if (!all_zero) {
-                for (uint32_t i = 0; i < ndims; i++) {
-                    offsets[i] = view_offsets[i];
-                }
-            }
-        } else {
-            all_zero = false;
-            for (uint32_t i = 0; i < ndims; i++) {
-                offsets[i] = other.offsets[i] + view_offsets[i];
-            }
-        }
-        is_all_offset_zero = all_zero;
-        owner_task_id = other.owner_task_id;
-    }
-
-    /// Compute 1D flat element offset from multi-dimensional indices.
-    /// Uses Horner's method (forward traversal, no stride variable).
-    uint64_t compute_flat_offset(const uint32_t indices[], uint32_t in_ndims) const {
-        if (in_ndims == 0) return 0;
-        const uint32_t* rs = get_raw_shapes();
-        uint64_t offset = 0;
-        if (is_all_offset_zero) {
-            for (uint32_t d = 0; d < in_ndims; d++) offset = offset * rs[d] + indices[d];
-        } else {
-            for (uint32_t d = 0; d < in_ndims; d++) offset = offset * rs[d] + indices[d] + offsets[d];
-        }
-        return offset;
-    }
-
-    /// Materialize a TensorCreateInfo into this Tensor (fresh contiguous output).
-    /// Single 64B memcpy covers the entire cacheline 1, then buffer is overwritten.
-    void init_from_create_info(const TensorCreateInfo& ci, void* addr, uint64_t buffer_size) {
-        memcpy(this, &ci, 64);
-        buffer = {reinterpret_cast<uint64_t>(addr), buffer_size};
-        owner_task_id = PTO2TaskId::invalid();  // caller (orchestrator) overwrites with actual task_id
-        if (ci.has_initial_value) {
-            fill_initial_value(ci.initial_value);
-        }
-    }
-
-    void fill_initial_value(uint64_t initial_value) {
-        always_assert(reinterpret_cast<void*>(buffer.addr) != nullptr);
-        uint64_t elem_size = get_element_size(dtype);
-        char* dst = reinterpret_cast<char*>(buffer.addr);
-        constexpr uint64_t BLK = 64;
-        uint64_t blk = (buffer.size < BLK) ? buffer.size : BLK;
-        for (uint64_t b = 0; b < blk; b += elem_size) {
-            memcpy(dst + b, &initial_value, elem_size);
-        }
-        uint64_t filled = blk;
-        while (filled < buffer.size) {
-            uint64_t copy_size = ((buffer.size - filled) < filled) ? (buffer.size - filled) : filled;
-            memcpy(dst + filled, dst, copy_size);
-            filled += copy_size;
-        }
-    }
-
-    // --- Operations ---
-    void update_start_offset() {
-        if (is_all_offset_zero) {
-            start_offset = 0;
-            return;
-        }
-        const uint32_t* rs = get_raw_shapes();
-        uint64_t result = 0;
-        uint64_t stride = 1;
-        for (int i = static_cast<int>(ndims) - 1; i >= 0; i--) {
-            result += offsets[i] * stride;
-            stride *= rs[i];
-        }
-        start_offset = result;
-    }
-
-    void copy(const Tensor& other) { init(other); }
-
-    Tensor view(const uint32_t view_shapes[], const uint32_t view_offsets[], bool manual_dep = false) const {
-        Tensor result;
-        result.init_with_view(*this, view_shapes, view_offsets, manual_dep);
-        return result;
-    }
-
-    bool is_contiguous() const {
-        if (is_raw_eq_shapes || ndims == 0) {
-            return true;
-        }
-        for (uint32_t i = 1; i < ndims; i++) {
-            if (shapes[i] != raw_shapes[i]) {
-                return false;
-            }
-        }
-        return true;
-    }
-
-    bool valid_reshape(const uint32_t new_shapes[], uint32_t new_ndims) const {
-        uint64_t x = numel();
-        uint64_t y = 1;
-        for (uint32_t i = 0; i < new_ndims; i++) {
-            y *= new_shapes[i];
-        }
-        return x == y;
-    }
-
-    Tensor reshape(const uint32_t new_shapes[], uint32_t new_ndims, bool manual_dep = false) const {
-        debug_assert(valid_reshape(new_shapes, new_ndims));
-        always_assert(is_contiguous());
-        Tensor result;
-        result.copy(*this);
-        result.ndims = new_ndims;
-        result.is_all_offset_zero = true;
-        result.is_raw_eq_shapes = true;
-        result.manual_dep = manual_dep;
-        for (uint32_t i = 0; i < new_ndims; i++) {
-            result.shapes[i] = new_shapes[i];
-        }
-        return result;
-    }
-
-    bool valid_transpose(uint32_t x, uint32_t y) const { return x < ndims && y < ndims; }
-
-    Tensor transpose(uint32_t x, uint32_t y, bool manual_dep = false) const {
-        debug_assert(valid_transpose(x, y));
-        Tensor result;
-        result.copy(*this);
-        result.manual_dep = manual_dep;
-        // transpose swaps the same dims in both arrays, so equality is preserved
-        std::swap(result.shapes[x], result.shapes[y]);
-        if (!result.is_raw_eq_shapes) {
-            std::swap(result.raw_shapes[x], result.raw_shapes[y]);
-        }
-        if (!result.is_all_offset_zero) {
-            std::swap(result.offsets[x], result.offsets[y]);
-        }
-        return result;
-    }
-
-    uint64_t numel() const {
-        if (ndims == 0) {
-            return 0;
-        }
-        uint64_t total = 1;
-        for (uint32_t i = 0; i < ndims; i++) {
-            total *= shapes[i];
-        }
-        return total;
-    }
-
-    bool is_same_memref(const Tensor& other) const { return buffer.addr == other.buffer.addr; }
-
-    std::string dump() const {
-        std::stringstream ss;
-        std::string indent = "    ";
-        ss << "{" << std::endl;
-        ss << indent << "buffer.addr: " << buffer.addr << std::endl;
-        ss << indent << "buffer.size: " << buffer.size << " bytes" << std::endl;
-        ss << indent << "dtype: " << get_dtype_name(dtype) << std::endl;
-        ss << indent << "ndims: " << ndims << std::endl;
-        ss << indent << "version: " << version << std::endl;
-
-        const uint32_t* rs = get_raw_shapes();
-        ss << indent << "raw_shapes: [";
-        for (uint32_t i = 0; i < ndims; i++) {
-            if (i > 0) {
-                ss << ", ";
-            }
-            ss << rs[i];
-        }
-        ss << "]" << std::endl;
-        ss << indent << "shapes: [";
-        for (uint32_t i = 0; i < ndims; i++) {
-            if (i > 0) {
-                ss << ", ";
-            }
-            ss << shapes[i];
-        }
-        ss << "]" << std::endl;
-        ss << indent << "offsets: [";
-        for (uint32_t i = 0; i < ndims; i++) {
-            if (i > 0) {
-                ss << ", ";
-            }
-            ss << (is_all_offset_zero ? 0u : offsets[i]);
-        }
-        ss << "]" << std::endl;
-        ss << "}" << std::endl;
-        return ss.str();
-    }
-
- private:
-    // Default and parameterized constructors are private.
-    // Valid Tensors come only from controlled entry points.
-    Tensor() = default;
-
-    Tensor(void* addr,
-           uint64_t buffer_size_bytes,
-           const uint32_t raw_shapes[],
-           const uint32_t shapes[],
-           const uint32_t offsets[],
-           uint32_t ndims,
-           DataType dtype,
-           int32_t version,
-           bool is_all_offset_zero = false,
-           bool is_raw_eq_shapes = false,
-           bool manual_dep = false) {
-        init(addr,
-             buffer_size_bytes,
-             raw_shapes,
-             shapes,
-             offsets,
-             ndims,
-             dtype,
-             version,
-             is_all_offset_zero,
-             is_raw_eq_shapes,
-             manual_dep);
-    }
-
-    // Friends that need to construct Tensors
-    friend struct PTO2TaskPayload;
-    friend inline Tensor make_tensor_external(
-        void* addr, const uint32_t shapes[], uint32_t ndims, DataType dtype, bool manual_dep, int32_t version);
-};
-
-static_assert(sizeof(Tensor) == 128, "Tensor must be exactly 2 cache lines (128 bytes)");
-static_assert(offsetof(Tensor, raw_shapes) == 64);
-static_assert(offsetof(Tensor, owner_task_id) == 16, "owner_task_id must be at bytes 16-23 (cacheline 1)");
-static_assert(offsetof(Tensor, start_offset) == 24, "start_offset must be at bytes 24-31 (cacheline 1)");
-
-// TensorCreateInfo layout must match Tensor cacheline 1 for memcpy optimization
-static_assert(sizeof(TensorCreateInfo) == 64, "TensorCreateInfo must match Tensor cacheline 1 size (64 bytes)");
-static_assert(offsetof(TensorCreateInfo, version) == offsetof(Tensor, version));
-static_assert(offsetof(TensorCreateInfo, ndims) == offsetof(Tensor, ndims));
-static_assert(offsetof(TensorCreateInfo, dtype) == offsetof(Tensor, dtype));
-static_assert(offsetof(TensorCreateInfo, is_all_offset_zero) == offsetof(Tensor, is_all_offset_zero));
-static_assert(offsetof(TensorCreateInfo, is_raw_eq_shapes) == offsetof(Tensor, is_raw_eq_shapes));
-static_assert(offsetof(TensorCreateInfo, manual_dep) == offsetof(Tensor, manual_dep));
-static_assert(offsetof(TensorCreateInfo, raw_shapes) == offsetof(Tensor, shapes));
diff --git a/src/a5/runtime/tensormap_and_ringbuffer/docs/SUBMIT_BY_CLUSTER.md b/src/a5/runtime/tensormap_and_ringbuffer/docs/SUBMIT_BY_CLUSTER.md
index c6d0e3ebd..9f7d1f68c 100644
--- a/src/a5/runtime/tensormap_and_ringbuffer/docs/SUBMIT_BY_CLUSTER.md
+++ b/src/a5/runtime/tensormap_and_ringbuffer/docs/SUBMIT_BY_CLUSTER.md
@@ -204,9 +204,9 @@ Milestone command (device):
 
 ```bash
 python examples/scripts/run_example.py \
-  -k tests/st/tensormap_and_ringbuffer/batch_paged_attention/kernels \
-  -g tests/st/tensormap_and_ringbuffer/batch_paged_attention/golden.py \
-  -p a2a3 -d 9
+  -k tests/st/a5/tensormap_and_ringbuffer/batch_paged_attention/kernels \
+  -g tests/st/a5/tensormap_and_ringbuffer/batch_paged_attention/golden.py \
+  -p a5 -d 9
 ```
 
 Final validation:
diff --git a/tests/st/a2a3/aicpu_build_graph/paged_attention/README.md b/tests/st/a2a3/aicpu_build_graph/paged_attention/README.md
index b56d9774d..b3dd362ce 100644
--- a/tests/st/a2a3/aicpu_build_graph/paged_attention/README.md
+++ b/tests/st/a2a3/aicpu_build_graph/paged_attention/README.md
@@ -58,14 +58,14 @@ Block n: QK -> SF -> PV --+
 
 ```bash
 # Run on hardware (specify device ID)
 python examples/scripts/run_example.py \
-  -k tests/st/aicpu_build_graph/paged_attention/kernels \
-  -g tests/st/aicpu_build_graph/paged_attention/golden.py \
+  -k tests/st/a2a3/aicpu_build_graph/paged_attention/kernels \
+  -g tests/st/a2a3/aicpu_build_graph/paged_attention/golden.py \
   -p a2a3 -d 0
 
 # Run multi-block test case
 PA_CASE=Case2 python examples/scripts/run_example.py \
-  -k tests/st/aicpu_build_graph/paged_attention/kernels \
-  -g tests/st/aicpu_build_graph/paged_attention/golden.py \
+  -k tests/st/a2a3/aicpu_build_graph/paged_attention/kernels \
+  -g tests/st/a2a3/aicpu_build_graph/paged_attention/golden.py \
   -p a2a3 -d 0
 ```
 
diff --git a/tests/st/a2a3/host_build_graph/paged_attention/README.md b/tests/st/a2a3/host_build_graph/paged_attention/README.md
index bb280c331..3524c7e75 100644
--- a/tests/st/a2a3/host_build_graph/paged_attention/README.md
+++ b/tests/st/a2a3/host_build_graph/paged_attention/README.md
@@ -63,14 +63,14 @@ Block n: QK -> SF -> PV --+
 
 ```bash
 # Run on hardware (specify device ID)
 python examples/scripts/run_example.py \
-  -k tests/st/host_build_graph/paged_attention/kernels \
-  -g tests/st/host_build_graph/paged_attention/golden.py \
+  -k tests/st/a2a3/host_build_graph/paged_attention/kernels \
+  -g tests/st/a2a3/host_build_graph/paged_attention/golden.py \
   -p a2a3 -d 0
 
 # Run multi-block test case
 PA_CASE=Case2 python examples/scripts/run_example.py \
-  -k tests/st/host_build_graph/paged_attention/kernels \
-  -g tests/st/host_build_graph/paged_attention/golden.py \
+  -k tests/st/a2a3/host_build_graph/paged_attention/kernels \
+  -g tests/st/a2a3/host_build_graph/paged_attention/golden.py \
   -p a2a3 -d 0
 ```
 
diff --git a/tests/st/a2a3/tensormap_and_ringbuffer_unmodified/paged_attention/golden.py b/tests/st/a2a3/tensormap_and_ringbuffer_unmodified/paged_attention/golden.py
deleted file mode 100644
index d97a6b9fe..000000000
--- a/tests/st/a2a3/tensormap_and_ringbuffer_unmodified/paged_attention/golden.py
+++ /dev/null
@@ -1,19 +0,0 @@
-from importlib.util import module_from_spec, spec_from_file_location
-from pathlib import Path
-
-
-_BASE_GOLDEN = (
-    Path(__file__).resolve().parents[2] / "tensormap_and_ringbuffer" / "paged_attention" / "golden.py"
-)
-_SPEC = spec_from_file_location("tmr_unmodified_paged_attention_golden", _BASE_GOLDEN)
-_MODULE = module_from_spec(_SPEC)
-assert _SPEC is not None and _SPEC.loader is not None
-_SPEC.loader.exec_module(_MODULE)
-
-ALL_CASES = _MODULE.ALL_CASES
-ATOL = _MODULE.ATOL
-DEFAULT_CASE = _MODULE.DEFAULT_CASE
-RTOL = _MODULE.RTOL
-__outputs__ = _MODULE.__outputs__
-compute_golden = _MODULE.compute_golden
-generate_inputs = _MODULE.generate_inputs
diff --git a/tests/st/a2a3/tensormap_and_ringbuffer_unmodified/paged_attention/kernels/kernel_config.py b/tests/st/a2a3/tensormap_and_ringbuffer_unmodified/paged_attention/kernels/kernel_config.py
deleted file mode 100644
index 91c09945c..000000000
--- a/tests/st/a2a3/tensormap_and_ringbuffer_unmodified/paged_attention/kernels/kernel_config.py
+++ /dev/null
@@ -1,20 +0,0 @@
-from copy import deepcopy
-from importlib.util import module_from_spec, spec_from_file_location
-from pathlib import Path
-
-
-_BASE_KERNEL_CONFIG = (
-    Path(__file__).resolve().parents[3]
-    / "tensormap_and_ringbuffer"
-    / "paged_attention"
-    / "kernels"
-    / "kernel_config.py"
-)
-_SPEC = spec_from_file_location("tmr_unmodified_paged_attention_kernel_config", _BASE_KERNEL_CONFIG)
-_MODULE = module_from_spec(_SPEC)
-assert _SPEC is not None and _SPEC.loader is not None
-_SPEC.loader.exec_module(_MODULE)
-
-ORCHESTRATION = deepcopy(_MODULE.ORCHESTRATION)
-KERNELS = deepcopy(_MODULE.KERNELS)
-RUNTIME_CONFIG = {**_MODULE.RUNTIME_CONFIG, "runtime": "tensormap_and_ringbuffer_unmodified"}
diff --git a/tests/st/a2a3/tensormap_and_ringbuffer_unmodified/paged_attention_unroll/golden.py b/tests/st/a2a3/tensormap_and_ringbuffer_unmodified/paged_attention_unroll/golden.py
deleted file mode 100644
index 74aa2506c..000000000
--- a/tests/st/a2a3/tensormap_and_ringbuffer_unmodified/paged_attention_unroll/golden.py
+++ /dev/null
@@ -1,19 +0,0 @@
-from importlib.util import module_from_spec, spec_from_file_location
-from pathlib import Path
-
-
-_BASE_GOLDEN = (
-    Path(__file__).resolve().parents[2] / "tensormap_and_ringbuffer" / "paged_attention_unroll" / "golden.py"
-)
-_SPEC = spec_from_file_location("tmr_unmodified_paged_attention_unroll_golden", _BASE_GOLDEN)
-_MODULE = module_from_spec(_SPEC)
-assert _SPEC is not None and _SPEC.loader is not None
-_SPEC.loader.exec_module(_MODULE)
-
-ALL_CASES = _MODULE.ALL_CASES
-ATOL = _MODULE.ATOL
-DEFAULT_CASE = _MODULE.DEFAULT_CASE
-RTOL = _MODULE.RTOL
-__outputs__ = _MODULE.__outputs__
-compute_golden = _MODULE.compute_golden
-generate_inputs = _MODULE.generate_inputs
diff --git a/tests/st/a2a3/tensormap_and_ringbuffer_unmodified/paged_attention_unroll/kernels/kernel_config.py b/tests/st/a2a3/tensormap_and_ringbuffer_unmodified/paged_attention_unroll/kernels/kernel_config.py
deleted file mode 100644
index bdf8846cd..000000000
--- a/tests/st/a2a3/tensormap_and_ringbuffer_unmodified/paged_attention_unroll/kernels/kernel_config.py
+++ /dev/null
@@ -1,20 +0,0 @@
-from copy import deepcopy
-from importlib.util import module_from_spec, spec_from_file_location
-from pathlib import Path
-
-
-_BASE_KERNEL_CONFIG = (
-    Path(__file__).resolve().parents[3]
-    / "tensormap_and_ringbuffer"
-    / "paged_attention_unroll"
-    / "kernels"
-    / "kernel_config.py"
-)
-_SPEC = spec_from_file_location("tmr_unmodified_paged_attention_unroll_kernel_config", _BASE_KERNEL_CONFIG)
-_MODULE = module_from_spec(_SPEC)
-assert _SPEC is not None and _SPEC.loader is not None
-_SPEC.loader.exec_module(_MODULE)
-
-ORCHESTRATION = deepcopy(_MODULE.ORCHESTRATION)
-KERNELS = deepcopy(_MODULE.KERNELS)
-RUNTIME_CONFIG = {**_MODULE.RUNTIME_CONFIG, "runtime": "tensormap_and_ringbuffer_unmodified"}
diff --git a/tests/st/a5/host_build_graph/paged_attention/README.md b/tests/st/a5/host_build_graph/paged_attention/README.md
index bb280c331..a9edfb51e 100644
--- a/tests/st/a5/host_build_graph/paged_attention/README.md
+++ b/tests/st/a5/host_build_graph/paged_attention/README.md
@@ -63,15 +63,15 @@ Block n: QK -> SF -> PV --+
 
 ```bash
 # Run on hardware (specify device ID)
 python examples/scripts/run_example.py \
-  -k tests/st/host_build_graph/paged_attention/kernels \
-  -g tests/st/host_build_graph/paged_attention/golden.py \
-  -p a2a3 -d 0
+  
-k tests/st/a5/host_build_graph/paged_attention/kernels \ + -g tests/st/a5/host_build_graph/paged_attention/golden.py \ + -p a5 -d 0 # Run multi-block test case PA_CASE=Case2 python examples/scripts/run_example.py \ - -k tests/st/host_build_graph/paged_attention/kernels \ - -g tests/st/host_build_graph/paged_attention/golden.py \ - -p a2a3 -d 0 + -k tests/st/a5/host_build_graph/paged_attention/kernels \ + -g tests/st/a5/host_build_graph/paged_attention/golden.py \ + -p a5 -d 0 ``` ## Directory Structure diff --git a/tests/ut/py/test_runtime_builder.py b/tests/ut/py/test_runtime_builder.py index 1852accb5..648f48c05 100644 --- a/tests/ut/py/test_runtime_builder.py +++ b/tests/ut/py/test_runtime_builder.py @@ -43,14 +43,6 @@ def test_discovers_aicpu_build_graph(self, default_test_platform): runtimes = builder.list_runtimes() assert "aicpu_build_graph" in runtimes - def test_discovers_unmodified_tensormap_runtime(self, default_test_platform): - """RuntimeBuilder discovers the unmodified tensormap runtime clone.""" - from runtime_builder import RuntimeBuilder # noqa: PLC0415 - - builder = RuntimeBuilder(platform=default_test_platform) - runtimes = builder.list_runtimes() - assert "tensormap_and_ringbuffer_unmodified" in runtimes - def test_runtime_dir_resolves_to_project_root(self, default_test_platform, test_arch): """runtime_dir resolves to src/{arch}/runtime/ under the project root.""" from runtime_builder import RuntimeBuilder # noqa: PLC0415 diff --git a/tools/README.md b/tools/README.md index 2d807e53c..6d691fbbd 100644 --- a/tools/README.md +++ b/tools/README.md @@ -34,7 +34,7 @@ python3 tools/swimlane_converter.py outputs/perf_swimlane_20260210_143526.json - # 从 kernel_config.py 加载函数名映射 python3 tools/swimlane_converter.py outputs/perf_swimlane_20260210_143526.json \ - -k examples/host_build_graph/paged_attention/kernels/kernel_config.py + -k examples/a2a3/host_build_graph/paged_attention/kernels/kernel_config.py # 使用指定 device id 自动选择 device log(device-) python3 
tools/swimlane_converter.py outputs/perf_swimlane_20260210_143526.json -d 0 @@ -102,8 +102,8 @@ log root 解析顺序: ```bash # 运行测试并启用性能分析 - 测试通过后自动生成 merged_swimlane.json python examples/scripts/run_example.py \ - -k examples/host_build_graph/vector_example/kernels \ - -g examples/host_build_graph/vector_example/golden.py \ + -k examples/a2a3/host_build_graph/vector_example/kernels \ + -g examples/a2a3/host_build_graph/vector_example/golden.py \ --enable-profiling ``` @@ -190,7 +190,7 @@ python3 tools/perf_to_mermaid.py outputs/perf_swimlane_20260210_143526.json -o d # 从 kernel_config.py 加载函数名映射 python3 tools/perf_to_mermaid.py outputs/perf_swimlane_20260210_143526.json \ - -k examples/host_build_graph/paged_attention/kernels/kernel_config.py + -k examples/a2a3/host_build_graph/paged_attention/kernels/kernel_config.py # 使用紧凑样式(仅显示任务ID和函数名) python3 tools/perf_to_mermaid.py outputs/perf_swimlane_20260210_143526.json --style compact @@ -270,7 +270,7 @@ flowchart TD ### 功能概述 -`benchmark_rounds.sh` 遍历 `EXAMPLES` 数组中配置的测试用例(位于 `tests/st/tensormap_and_ringbuffer/` 下),依次调用 `run_example.py` 运行每个 example,然后从生成的 device log 中提取 `orch_start` / `orch_end` / `sched_end` 时间戳计算每轮 elapsed 时间。 +`benchmark_rounds.sh` 遍历脚本顶部为对应 runtime 配置的测试用例(位于 `tests/st/{arch}/{runtime}/` 下),依次调用 `run_example.py` 运行每个 example,然后从生成的 device log 中提取 `orch_start` / `orch_end` / `sched_end` 时间戳计算每轮 elapsed 时间。 当前预配置的 examples: - `alternating_matmul_add` diff --git a/tools/benchmark_rounds.sh b/tools/benchmark_rounds.sh index f3a19b9a1..a3949a2ee 100755 --- a/tools/benchmark_rounds.sh +++ b/tools/benchmark_rounds.sh @@ -39,16 +39,6 @@ ABG_EXAMPLE_ORDER=( paged_attention_unroll ) -# --- tensormap_and_ringbuffer_unmodified --- -declare -A TMR_UNMODIFIED_EXAMPLE_CASES=( - [paged_attention]="Case1,Case2" - [paged_attention_unroll]="Case1,Case2" -) -TMR_UNMODIFIED_EXAMPLE_ORDER=( - paged_attention - paged_attention_unroll -) - # --- tensormap_and_ringbuffer_partial_manual --- declare -A 
TMR_PARTIAL_MANUAL_EXAMPLE_CASES=( [paged_attention_partial_manual]="Case1,Case2" @@ -108,7 +98,6 @@ Options: -d, --device Device ID (default: 0) -n, --rounds Override number of rounds for each example (default: 100) -r, --runtime Runtime to benchmark: tensormap_and_ringbuffer (default), - tensormap_and_ringbuffer_unmodified, tensormap_and_ringbuffer_partial_manual, aicpu_build_graph -e, --examples Comma-separated example names to run (default: runtime-specific full list) @@ -166,10 +155,6 @@ case "$RUNTIME" in declare -n EXAMPLE_CASES=TMR_EXAMPLE_CASES EXAMPLE_ORDER=("${TMR_EXAMPLE_ORDER[@]}") ;; - tensormap_and_ringbuffer_unmodified) - declare -n EXAMPLE_CASES=TMR_UNMODIFIED_EXAMPLE_CASES - EXAMPLE_ORDER=("${TMR_UNMODIFIED_EXAMPLE_ORDER[@]}") - ;; tensormap_and_ringbuffer_partial_manual) TESTS_RUNTIME_DIR="tensormap_and_ringbuffer" declare -n EXAMPLE_CASES=TMR_PARTIAL_MANUAL_EXAMPLE_CASES @@ -180,7 +165,7 @@ case "$RUNTIME" in EXAMPLE_ORDER=("${ABG_EXAMPLE_ORDER[@]}") ;; *) - echo "ERROR: unknown runtime '$RUNTIME'. Use tensormap_and_ringbuffer, tensormap_and_ringbuffer_unmodified, tensormap_and_ringbuffer_partial_manual, or aicpu_build_graph." + echo "ERROR: unknown runtime '$RUNTIME'. Use tensormap_and_ringbuffer, tensormap_and_ringbuffer_partial_manual, or aicpu_build_graph." 
exit 1 ;; esac diff --git a/tools/swimlane_converter.py b/tools/swimlane_converter.py index f47ae9b47..e17ffe4a7 100644 --- a/tools/swimlane_converter.py +++ b/tools/swimlane_converter.py @@ -1043,7 +1043,7 @@ def _build_parser(): %(prog)s # Use latest .json in outputs/, output to outputs/ %(prog)s perf_swimlane_20260210_143526.json # Output: outputs/merged_swimlane_20260210_143526.json %(prog)s perf_swimlane_20260210_143526.json -o custom_output.json - %(prog)s perf_swimlane_20260210_143526.json -k examples/host_build_graph/paged_attention/kernels/kernel_config.py + %(prog)s perf_swimlane_20260210_143526.json -k examples/a2a3/host_build_graph/paged_attention/kernels/kernel_config.py %(prog)s perf_swimlane_20260210_143526.json -d 0 %(prog)s perf_swimlane_20260210_143526.json -v """, From c72fa9d1c556cbfb808bcce72b14650d53cb46e1 Mon Sep 17 00:00:00 2001 From: Youwei Xiao Date: Wed, 8 Apr 2026 19:00:30 +0800 Subject: [PATCH 26/35] Fix: auto-pick free NPU for manual-scope tests - add a small hardware test helper that respects PTO_TEST_DEVICE_ID - fall back to the lowest NPU with no running processes from npu-smi - avoid blocking manual-scope hardware tests behind busy device 0 on shared machines --- tests/ut/hardware_test_utils.py | 34 ++++++++++++++++++++++++ tests/ut/test_manual_scope_boundary.py | 4 ++- tests/ut/test_manual_scope_guards.py | 4 ++- 3 files changed, 40 insertions(+), 2 deletions(-) create mode 100644 tests/ut/hardware_test_utils.py diff --git a/tests/ut/hardware_test_utils.py b/tests/ut/hardware_test_utils.py new file mode 100644 index 000000000..82d49b91e --- /dev/null +++ b/tests/ut/hardware_test_utils.py @@ -0,0 +1,34 @@ +import os +import re +import subprocess + + +def get_test_device_id(default: str = "0") -> str: + """Pick a hardware test device. + + Respect PTO_TEST_DEVICE_ID when explicitly provided.
Otherwise prefer the + lowest-ID NPU that reports no running processes in `npu-smi info`, which is + more stable than blindly defaulting to device 0 on shared machines. + """ + + configured = os.environ.get("PTO_TEST_DEVICE_ID") + if configured: + return configured + + try: + result = subprocess.run( + ["npu-smi", "info"], + capture_output=True, + text=True, + check=False, + ) + except FileNotFoundError: + return default + + if result.returncode != 0: + return default + + free_devices = sorted({int(match) for match in re.findall(r"No running processes found in NPU (\d+)", result.stdout)}) + if free_devices: + return str(free_devices[0]) + return default diff --git a/tests/ut/test_manual_scope_boundary.py b/tests/ut/test_manual_scope_boundary.py index a7028178d..aa043b313 100644 --- a/tests/ut/test_manual_scope_boundary.py +++ b/tests/ut/test_manual_scope_boundary.py @@ -5,6 +5,8 @@ import pytest +from hardware_test_utils import get_test_device_id + PROJECT_ROOT = Path(__file__).parent.parent.parent RUN_EXAMPLE = PROJECT_ROOT / "examples" / "scripts" / "run_example.py" @@ -18,7 +20,7 @@ @pytest.mark.requires_hardware @pytest.mark.skipif(not os.getenv("ASCEND_HOME_PATH"), reason="ASCEND_HOME_PATH not set; Ascend toolkit required") def test_manual_scope_outer_multiwrite_boundary(): - device_id = os.environ.get("PTO_TEST_DEVICE_ID", "0") + device_id = get_test_device_id() command = ( f"source {os.environ['ASCEND_HOME_PATH']}/bin/setenv.bash >/dev/null 2>&1 && " f"{sys.executable} {RUN_EXAMPLE} --build --silent " diff --git a/tests/ut/test_manual_scope_guards.py b/tests/ut/test_manual_scope_guards.py index 11bb6c175..82f5632df 100644 --- a/tests/ut/test_manual_scope_guards.py +++ b/tests/ut/test_manual_scope_guards.py @@ -6,6 +6,8 @@ import pytest +from hardware_test_utils import get_test_device_id + PROJECT_ROOT = Path(__file__).parent.parent.parent RUN_EXAMPLE = PROJECT_ROOT / "examples" / "scripts" / "run_example.py" @@ -38,7 +40,7 @@ ], ) def 
test_manual_scope_guard_failures(case_name, expected_message): - device_id = os.environ.get("PTO_TEST_DEVICE_ID", "0") + device_id = get_test_device_id() log_dir = Path.home() / "ascend" / "log" / "debug" / f"device-{device_id}" if os.getenv("ASCEND_WORK_PATH"): work_log_dir = Path(os.environ["ASCEND_WORK_PATH"]).expanduser() / "log" / "debug" / f"device-{device_id}" From af357ad2099f128d723ef1d90ca72c7f4df17937 Mon Sep 17 00:00:00 2001 From: Youwei Xiao Date: Wed, 8 Apr 2026 21:25:11 +0800 Subject: [PATCH 27/35] Fix: harden manual-scope metadata growth - widen manual-local masks to 64 bits so manual submit handles all MAX_TENSOR_ARGS entries without truncation - keep realloc failures from dropping live metadata buffers by setting a fatal out-of-memory runtime error before returning - simplify the partial-manual paged-attention valid_len calculation with std::min --- .../orchestration/paged_attention_orch.cpp | 3 +- .../runtime/pto_orchestrator.cpp | 51 +++++++++++++++---- .../runtime/pto_orchestrator.h | 2 +- .../runtime/pto_runtime2_types.h | 1 + 4 files changed, 44 insertions(+), 13 deletions(-) diff --git a/examples/a2a3/tensormap_and_ringbuffer/paged_attention_partial_manual/kernels/orchestration/paged_attention_orch.cpp b/examples/a2a3/tensormap_and_ringbuffer/paged_attention_partial_manual/kernels/orchestration/paged_attention_orch.cpp index 8a7476953..c5487525d 100644 --- a/examples/a2a3/tensormap_and_ringbuffer/paged_attention_partial_manual/kernels/orchestration/paged_attention_orch.cpp +++ b/examples/a2a3/tensormap_and_ringbuffer/paged_attention_partial_manual/kernels/orchestration/paged_attention_orch.cpp @@ -101,8 +101,7 @@ aicpu_orchestration_entry(const ChipStorageTaskArgs &orch_args, int orch_thread_ for (uint64_t bn = 0; bn < bn_this_batch; bn++) { uint64_t cur_block_idx = host_block_table[b_idx * block_num + bn]; - uint64_t valid_len = - block_size < (cur_seq - bn * block_size) ?
block_size : (cur_seq - bn * block_size); + uint64_t valid_len = std::min(block_size, cur_seq - bn * block_size); uint32_t kv_shapes[2] = {static_cast(block_size), static_cast(head_dim)}; uint32_t kv_offsets[2] = {static_cast(cur_block_idx * block_size), 0}; Tensor kj = key_cache.view(kv_shapes, kv_offsets); diff --git a/src/a2a3/runtime/tensormap_and_ringbuffer/runtime/pto_orchestrator.cpp b/src/a2a3/runtime/tensormap_and_ringbuffer/runtime/pto_orchestrator.cpp index e1a0d9a08..9f31475a1 100644 --- a/src/a2a3/runtime/tensormap_and_ringbuffer/runtime/pto_orchestrator.cpp +++ b/src/a2a3/runtime/tensormap_and_ringbuffer/runtime/pto_orchestrator.cpp @@ -456,37 +456,63 @@ void pto2_orchestrator_set_scheduler(PTO2OrchestratorState *orch, PTO2SchedulerS // Scope Management // ============================================================================= -static void scope_tasks_push(PTO2OrchestratorState *orch, PTO2TaskSlotState *task_slot_state) { +static bool scope_tasks_push(PTO2OrchestratorState *orch, PTO2TaskSlotState *task_slot_state) { + if (orch->fatal) { + return false; + } if (orch->scope_tasks_size >= orch->scope_tasks_capacity) { int32_t new_cap = orch->scope_tasks_capacity * 2; PTO2TaskSlotState **new_buf = reinterpret_cast(realloc(orch->scope_tasks, new_cap * sizeof(PTO2TaskSlotState *))); - assert(new_buf && "Failed to grow scope task buffer"); + if (new_buf == nullptr) { + LOG_ERROR("Failed to grow scope task buffer to %d entries", new_cap); + orch->sm_handle->header->orch_error_code.store(PTO2_ERROR_OUT_OF_MEMORY, std::memory_order_release); + orch->fatal = true; + return false; + } orch->scope_tasks = new_buf; orch->scope_tasks_capacity = new_cap; } orch->scope_tasks[orch->scope_tasks_size++] = task_slot_state; + return true; } -static void manual_task_meta_push(PTO2OrchestratorState *orch, const PTO2ManualTaskMeta &meta) { +static bool manual_task_meta_push(PTO2OrchestratorState *orch, const PTO2ManualTaskMeta &meta) { + if (orch->fatal) { + return 
false; + } if (orch->manual_task_meta_size >= orch->manual_task_meta_capacity) { int32_t new_cap = orch->manual_task_meta_capacity * 2; PTO2ManualTaskMeta *new_buf = reinterpret_cast( realloc(orch->manual_task_meta, new_cap * sizeof(PTO2ManualTaskMeta)) ); - assert(new_buf && "Failed to grow manual task meta buffer"); + if (new_buf == nullptr) { + LOG_ERROR("Failed to grow manual task metadata buffer to %d entries", new_cap); + orch->sm_handle->header->orch_error_code.store(PTO2_ERROR_OUT_OF_MEMORY, std::memory_order_release); + orch->fatal = true; + return false; + } orch->manual_task_meta = new_buf; orch->manual_task_meta_capacity = new_cap; } orch->manual_task_meta[orch->manual_task_meta_size++] = meta; + return true; } static int32_t manual_edge_push(PTO2OrchestratorState *orch, const PTO2ManualEdge &edge) { + if (orch->fatal) { + return -1; + } if (orch->manual_edges_size >= orch->manual_edges_capacity) { int32_t new_cap = orch->manual_edges_capacity * 2; PTO2ManualEdge *new_buf = reinterpret_cast(realloc(orch->manual_edges, new_cap * sizeof(PTO2ManualEdge))); - assert(new_buf && "Failed to grow manual edge buffer"); + if (new_buf == nullptr) { + LOG_ERROR("Failed to grow manual edge buffer to %d entries", new_cap); + orch->sm_handle->header->orch_error_code.store(PTO2_ERROR_OUT_OF_MEMORY, std::memory_order_release); + orch->fatal = true; + return -1; + } orch->manual_edges = new_buf; orch->manual_edges_capacity = new_cap; } @@ -861,7 +887,7 @@ static TaskOutputTensors pto2_submit_mixed_task_impl( CYCLE_COUNT_LAP_RECORD(g_orch_sync_cycle, AicpuPhaseId::ORCH_SYNC, task_id.raw); // === STEP 3: Lookup inputs + materialize runtime-created outputs === - uint16_t manual_local_mask = 0; + uint64_t manual_local_mask = 0; for (int i = 0; i < args.tensor_count(); i++) { TensorArgType ptype = args.tag(i); if (ptype == TensorArgType::OUTPUT) { @@ -871,7 +897,7 @@ static TaskOutputTensors pto2_submit_mixed_task_impl( const Tensor *tensor = args.tensor(i).ptr; if constexpr 
(kManualSubmit) { if (task_owned_by_current_manual_scope(orch, tensor->owner_task_id)) { - manual_local_mask |= static_cast(1u << i); + manual_local_mask |= static_cast(1ULL << i); continue; } } @@ -922,7 +948,7 @@ static TaskOutputTensors pto2_submit_mixed_task_impl( TensorArgType ptype = args.tag(i); if (ptype == TensorArgType::INOUT || ptype == TensorArgType::OUTPUT_EXISTING) { if constexpr (kManualSubmit) { - if ((manual_local_mask & static_cast(1u << i)) != 0) { + if ((manual_local_mask & static_cast(1ULL << i)) != 0) { continue; } } @@ -1199,10 +1225,12 @@ pto2_submit_mixed_task_manual(PTO2OrchestratorState *orch, const MixedKernels &m for (int32_t i = 0; i < args.tensor_count(); i++) { meta.tags[i] = static_cast(args.tag(i)); if (task_owned_by_current_manual_scope(orch, meta.slot_state->payload->tensors[i].owner_task_id)) { - meta.manual_local_mask |= static_cast(1u << i); + meta.manual_local_mask |= static_cast(1ULL << i); } } - manual_task_meta_push(orch, meta); + if (!manual_task_meta_push(orch, meta)) { + return result; + } return result; } @@ -1248,6 +1276,9 @@ void pto2_add_dependency(PTO2OrchestratorState *orch, PTO2TaskId producer_id, PT .next_consumer_edge = consumer_meta.incoming_edge_head, } ); + if (edge_idx < 0) { + return; + } consumer_meta.incoming_edge_head = edge_idx; consumer_meta.incoming_edge_count++; } diff --git a/src/a2a3/runtime/tensormap_and_ringbuffer/runtime/pto_orchestrator.h b/src/a2a3/runtime/tensormap_and_ringbuffer/runtime/pto_orchestrator.h index 833417aab..023b60de8 100644 --- a/src/a2a3/runtime/tensormap_and_ringbuffer/runtime/pto_orchestrator.h +++ b/src/a2a3/runtime/tensormap_and_ringbuffer/runtime/pto_orchestrator.h @@ -46,7 +46,7 @@ struct PTO2ManualTaskMeta { int32_t incoming_edge_head; uint16_t incoming_edge_count; uint8_t tensor_count; - uint16_t manual_local_mask; + uint64_t manual_local_mask; uint8_t tags[MAX_TENSOR_ARGS]; }; diff --git a/src/a2a3/runtime/tensormap_and_ringbuffer/runtime/pto_runtime2_types.h 
b/src/a2a3/runtime/tensormap_and_ringbuffer/runtime/pto_runtime2_types.h index 7705e0e64..4654656c1 100644 --- a/src/a2a3/runtime/tensormap_and_ringbuffer/runtime/pto_runtime2_types.h +++ b/src/a2a3/runtime/tensormap_and_ringbuffer/runtime/pto_runtime2_types.h @@ -80,6 +80,7 @@ #define PTO2_ERROR_DEP_POOL_OVERFLOW 4 #define PTO2_ERROR_INVALID_ARGS 5 // Arg construction error (invalid args) #define PTO2_ERROR_DEPENDENCY_OVERFLOW 6 // Too many unique fanin dependencies for one task +#define PTO2_ERROR_OUT_OF_MEMORY 7 // Runtime metadata buffer growth failed // Scheduler errors (100+): detected in scheduler threads #define PTO2_ERROR_SCHEDULER_TIMEOUT 100 From abace346a9cb01d7ce672481ccf273f82b648778 Mon Sep 17 00:00:00 2001 From: Youwei Xiao Date: Wed, 8 Apr 2026 22:18:57 +0800 Subject: [PATCH 28/35] Update: move manual external wiring to submit - realize external producer fanout during manual submit while keeping the publish barrier at scope_end - shrink manual scope_end to replay only same-scope explicit edges - keep manual-scope validation and boundary semantics unchanged --- .../runtime/pto_orchestrator.cpp | 150 ++++++++++-------- 1 file changed, 87 insertions(+), 63 deletions(-) diff --git a/src/a2a3/runtime/tensormap_and_ringbuffer/runtime/pto_orchestrator.cpp b/src/a2a3/runtime/tensormap_and_ringbuffer/runtime/pto_orchestrator.cpp index 9f31475a1..3ccace809 100644 --- a/src/a2a3/runtime/tensormap_and_ringbuffer/runtime/pto_orchestrator.cpp +++ b/src/a2a3/runtime/tensormap_and_ringbuffer/runtime/pto_orchestrator.cpp @@ -640,6 +640,7 @@ void pto2_scope_end(PTO2OrchestratorState *orch) { return; } + int32_t dep_pool_needed[PTO2_MAX_RING_DEPTH] = {0}; for (int32_t task_idx = 0; task_idx < count; task_idx++) { PTO2ManualTaskMeta &meta = orch->manual_task_meta[manual_meta_begin + task_idx]; PTO2TaskSlotState *slot_state = orch->scope_tasks[begin + task_idx]; @@ -652,19 +653,7 @@ void pto2_scope_end(PTO2OrchestratorState *orch) { PTO2TaskPayload *payload = 
slot_state->payload; PTO2TaskId task_id = slot_state->task->task_id; - uint8_t ring_id = slot_state->ring_id; - auto &dep_pool = orch->rings[ring_id].dep_pool; - auto &fc = orch->sm_handle->header->rings[ring_id].fc; - PTO2FaninBuilder fanin_builder; - fanin_builder.count = payload->fanin_actual_count; - fanin_builder.spill_start = payload->fanin_spill_start; - fanin_builder.spill_pool = - (payload->fanin_spill_pool != nullptr) ? payload->fanin_spill_pool : &orch->rings[ring_id].fanin_pool; int32_t cached_external_count = payload->fanin_actual_count; - int32_t cached_inline_count = std::min(cached_external_count, PTO2_FANIN_INLINE_CAP); - for (int32_t i = 0; i < cached_inline_count; i++) { - fanin_builder.inline_slots[i] = payload->fanin_inline_slot_states[i]; - } int32_t local_edge_count = meta.incoming_edge_count; int32_t fanin_count = cached_external_count + local_edge_count; @@ -683,9 +672,42 @@ void pto2_scope_end(PTO2OrchestratorState *orch) { return; } - // Explicit manual edges are deduped at record time, and current-scope - // producers never appear in cached_external_count because manual submit - // skips local owner/TensorMap discovery for those tensors. 
+ dep_pool_needed[slot_state->ring_id] += local_edge_count; + } + + for (int32_t ring_id = 0; ring_id < PTO2_MAX_RING_DEPTH; ring_id++) { + int32_t needed = dep_pool_needed[ring_id]; + if (needed <= 0) { + continue; + } + auto &dep_pool = orch->rings[ring_id].dep_pool; + auto &fc = orch->sm_handle->header->rings[ring_id].fc; + dep_pool.ensure_space(*orch->scheduler, fc, ring_id, needed); + } + + for (int32_t task_idx = 0; task_idx < count; task_idx++) { + PTO2ManualTaskMeta &meta = orch->manual_task_meta[manual_meta_begin + task_idx]; + PTO2TaskSlotState *slot_state = orch->scope_tasks[begin + task_idx]; + PTO2TaskPayload *payload = slot_state->payload; + int32_t cached_external_count = payload->fanin_actual_count; + int32_t local_edge_count = meta.incoming_edge_count; + if (local_edge_count == 0) { + continue; + } + + uint8_t ring_id = slot_state->ring_id; + auto &dep_pool = orch->rings[ring_id].dep_pool; + auto &fc = orch->sm_handle->header->rings[ring_id].fc; + PTO2FaninBuilder fanin_builder; + fanin_builder.count = cached_external_count; + fanin_builder.spill_start = payload->fanin_spill_start; + fanin_builder.spill_pool = + (payload->fanin_spill_pool != nullptr) ? 
payload->fanin_spill_pool : &orch->rings[ring_id].fanin_pool; + int32_t cached_inline_count = std::min(cached_external_count, PTO2_FANIN_INLINE_CAP); + for (int32_t i = 0; i < cached_inline_count; i++) { + fanin_builder.inline_slots[i] = payload->fanin_inline_slot_states[i]; + } + auto append_local_fanin_or_fail = [&](PTO2TaskSlotState *prod_state) { if (fanin_builder.count < PTO2_FANIN_INLINE_CAP) { fanin_builder.inline_slots[fanin_builder.count++] = prod_state; @@ -707,69 +729,41 @@ void pto2_scope_end(PTO2OrchestratorState *orch) { fanin_builder.count++; return true; }; + for (int32_t edge_idx = meta.incoming_edge_head; edge_idx >= 0; edge_idx = orch->manual_edges[edge_idx].next_consumer_edge) { - const PTO2ManualEdge &edge = orch->manual_edges[edge_idx]; - PTO2TaskSlotState *prod_state = orch->scope_tasks[begin + edge.producer_idx]; + PTO2TaskSlotState *prod_state = orch->scope_tasks[begin + orch->manual_edges[edge_idx].producer_idx]; if (!append_local_fanin_or_fail(prod_state)) { return; } } - int32_t final_fanin_count = fanin_builder.count; - int32_t inline_count = std::min(final_fanin_count, PTO2_FANIN_INLINE_CAP); - int32_t spill_count = final_fanin_count - inline_count; - dep_pool.ensure_space(*orch->scheduler, fc, ring_id, final_fanin_count + 1); - - slot_state->task_state.store(PTO2_TASK_PENDING, std::memory_order_relaxed); - slot_state->fanin_count = final_fanin_count + 1; - payload->fanin_actual_count = final_fanin_count; + int32_t fanin_count = fanin_builder.count; + int32_t inline_count = std::min(fanin_count, PTO2_FANIN_INLINE_CAP); + int32_t spill_count = fanin_count - inline_count; + payload->fanin_actual_count = fanin_count; payload->fanin_spill_start = (spill_count > 0) ? fanin_builder.spill_start : 0; payload->fanin_spill_pool = (spill_count > 0) ? 
fanin_builder.spill_pool : nullptr; for (int32_t i = 0; i < inline_count; i++) { payload->fanin_inline_slot_states[i] = fanin_builder.inline_slots[i]; } - int32_t early_finished = 0; - int32_t producer_index = 0; - fanin_builder.for_each([&](PTO2TaskSlotState *producer_slot) { - if (producer_index >= cached_external_count) { - return false; - } - PTO2TaskSlotState &producer_slot_state = *producer_slot; - producer_index++; -#if PTO2_ORCH_PROFILING - pto2_fanout_lock(producer_slot_state, g_orch_fanin_atomic_count, g_orch_fanin_wait_cycle); -#else - pto2_fanout_lock(producer_slot_state); -#endif - int32_t prod_state = producer_slot_state.task_state.load(std::memory_order_acquire); - if (prod_state >= PTO2_TASK_COMPLETED) { - early_finished++; - } else { - producer_slot_state.fanout_head = dep_pool.prepend(producer_slot_state.fanout_head, slot_state); - } - pto2_fanout_unlock(producer_slot_state); - return true; - }); + // External producers were already realized during manual submit. + // At manual scope_end, only same-scope explicit edges remain to wire. for (int32_t edge_idx = meta.incoming_edge_head; edge_idx >= 0; edge_idx = orch->manual_edges[edge_idx].next_consumer_edge) { PTO2TaskSlotState &producer_slot_state = *orch->scope_tasks[begin + orch->manual_edges[edge_idx].producer_idx]; - // Same-scope explicit producers are unpublished until the batch - // publish below, so no scheduler thread can race on fanout state. 
producer_slot_state.fanout_count += 1; producer_slot_state.fanout_head = dep_pool.prepend(producer_slot_state.fanout_head, slot_state); + if (producer_slot_state.fanout_head == nullptr) { + orch->fatal = true; + return; + } } - if (early_finished > 0) { - slot_state->fanin_refcount.fetch_add(early_finished, std::memory_order_acq_rel); - } + slot_state->fanin_count += local_edge_count; slot_state->dep_pool_mark = dep_pool.top; } - orch->scheduler->publish_manual_scope_tasks(&orch->scope_tasks[begin], count); - } - - if (orch->scheduler && count > 0) { - orch->scheduler->on_scope_end(&orch->scope_tasks[begin], count); + orch->scheduler->publish_manual_scope_tasks_and_end_scope(&orch->scope_tasks[begin], count); } // Rewind the task buffer — these entries are no longer needed @@ -1013,15 +1007,21 @@ static TaskOutputTensors pto2_submit_mixed_task_impl( cur_slot_state.next_block_idx = 0; if constexpr (kManualSubmit) { - int32_t inline_count = std::min(fanin_builder.count, PTO2_FANIN_INLINE_CAP); - int32_t spill_count = fanin_builder.count - inline_count; - payload->fanin_actual_count = fanin_builder.count; + int32_t fanin_count = fanin_builder.count; + int32_t inline_count = std::min(fanin_count, PTO2_FANIN_INLINE_CAP); + int32_t spill_count = fanin_count - inline_count; + payload->fanin_actual_count = fanin_count; payload->fanin_spill_start = (spill_count > 0) ? fanin_builder.spill_start : 0; payload->fanin_spill_pool = (spill_count > 0) ? 
fanin_builder.spill_pool : nullptr; for (int i = 0; i < inline_count; i++) { payload->fanin_inline_slot_states[i] = fanin_builder.inline_slots[i]; } - fanin_builder.for_each([&](PTO2TaskSlotState *producer_slot) { + + auto &dep_pool = orch->rings[ring_id].dep_pool; + dep_pool.ensure_space(*sched, fc, ring_id, fanin_count); + + int32_t early_finished = 0; + bool fanout_ok = fanin_builder.for_each([&](PTO2TaskSlotState *producer_slot) { PTO2TaskSlotState &producer_slot_state = *producer_slot; #if PTO2_ORCH_PROFILING pto2_fanout_lock(producer_slot_state, g_orch_fanin_atomic_count, g_orch_fanin_wait_cycle); @@ -1029,11 +1029,35 @@ static TaskOutputTensors pto2_submit_mixed_task_impl( pto2_fanout_lock(producer_slot_state); #endif producer_slot_state.fanout_count += 1; + int32_t prod_state = producer_slot_state.task_state.load(std::memory_order_acquire); + if (prod_state >= PTO2_TASK_COMPLETED) { + early_finished++; + } else { + producer_slot_state.fanout_head = + dep_pool.prepend(producer_slot_state.fanout_head, &cur_slot_state); + if (producer_slot_state.fanout_head == nullptr) { + pto2_fanout_unlock(producer_slot_state); + orch->fatal = true; + return false; + } + } pto2_fanout_unlock(producer_slot_state); return true; }); - cur_slot_state.fanin_count = 1; - cur_slot_state.dep_pool_mark = orch->rings[ring_id].dep_pool.top; + if (!fanout_ok) { + return result; + } + cur_slot_state.fanin_count = fanin_count + 1; + if (early_finished > 0) { + cur_slot_state.fanin_refcount.fetch_add(early_finished, std::memory_order_acq_rel); + } + cur_slot_state.dep_pool_mark = dep_pool.top; +#if PTO2_ORCH_PROFILING + g_orch_fanin_atomic_count += fanin_count * 3; + if (early_finished > 0) { + g_orch_fanin_atomic_count += 1; // fanin_refcount.fetch_add + } +#endif } else { auto &dep_pool = orch->rings[ring_id].dep_pool; int32_t fanin_count = fanin_builder.count; From 4530e81d6204c656a1ab8fa19b70a29f3522dc2d Mon Sep 17 00:00:00 2001 From: Youwei Xiao Date: Wed, 8 Apr 2026 23:48:02 +0800 
Subject: [PATCH 29/35] Update: collapse manual scope bookkeeping - remove dead manual replay metadata from the manual-scope path - skip tensormap sync when a manual submit stays fully in-scope - keep only the publish barrier and dep-pool watermark fixup at scope_end On device 4 with 5 rounds, paged_attention_partial_manual improved from 35.27 ms / 35.12 ms orch to 31.60 ms / 31.44 ms orch for Case1, and from 19.80 ms / 19.38 ms orch to 17.93 ms / 17.49 ms orch for Case2. --- .../runtime/pto_orchestrator.cpp | 341 ++++++------------ .../runtime/pto_orchestrator.h | 27 +- 2 files changed, 119 insertions(+), 249 deletions(-) diff --git a/src/a2a3/runtime/tensormap_and_ringbuffer/runtime/pto_orchestrator.cpp b/src/a2a3/runtime/tensormap_and_ringbuffer/runtime/pto_orchestrator.cpp index 3ccace809..1260180c7 100644 --- a/src/a2a3/runtime/tensormap_and_ringbuffer/runtime/pto_orchestrator.cpp +++ b/src/a2a3/runtime/tensormap_and_ringbuffer/runtime/pto_orchestrator.cpp @@ -392,18 +392,9 @@ bool pto2_orchestrator_init( int32_t init_cap = PTO2_SCOPE_TASKS_INIT_CAP; orch->scope_tasks = reinterpret_cast<PTO2TaskSlotState **>(malloc(init_cap * sizeof(PTO2TaskSlotState *))); orch->scope_begins = reinterpret_cast<int32_t *>(malloc(max_depth * sizeof(int32_t))); - orch->manual_task_meta_begins = reinterpret_cast<int32_t *>(malloc(max_depth * sizeof(int32_t))); - orch->manual_edge_begins = reinterpret_cast<int32_t *>(malloc(max_depth * sizeof(int32_t))); - orch->manual_task_meta = reinterpret_cast<PTO2ManualTaskMeta *>(malloc(init_cap * sizeof(PTO2ManualTaskMeta))); - orch->manual_edges = reinterpret_cast<PTO2ManualEdge *>(malloc(init_cap * sizeof(PTO2ManualEdge))); - if (!orch->scope_tasks || !orch->scope_begins || !orch->manual_task_meta_begins || !orch->manual_edge_begins || - !orch->manual_task_meta || !orch->manual_edges) { + if (!orch->scope_tasks || !orch->scope_begins) { free(orch->scope_tasks); free(orch->scope_begins); - free(orch->manual_task_meta_begins); - free(orch->manual_edge_begins); - free(orch->manual_task_meta); - free(orch->manual_edges); for (int r = 0;
r < PTO2_MAX_RING_DEPTH; r++) { free(orch->rings[r].fanin_pool.base); free(orch->rings[r].dep_pool.base); @@ -416,10 +407,7 @@ bool pto2_orchestrator_init( orch->scope_stack_top = -1; orch->scope_stack_capacity = max_depth; orch->manual_scope_active = false; - orch->manual_task_meta_size = 0; - orch->manual_task_meta_capacity = init_cap; - orch->manual_edges_size = 0; - orch->manual_edges_capacity = init_cap; + memset(orch->manual_dep_pool_reserve, 0, sizeof(orch->manual_dep_pool_reserve)); return true; } @@ -438,14 +426,6 @@ void pto2_orchestrator_destroy(PTO2OrchestratorState *orch) { orch->scope_tasks = NULL; free(orch->scope_begins); orch->scope_begins = NULL; - free(orch->manual_task_meta_begins); - orch->manual_task_meta_begins = NULL; - free(orch->manual_edge_begins); - orch->manual_edge_begins = NULL; - free(orch->manual_task_meta); - orch->manual_task_meta = NULL; - free(orch->manual_edges); - orch->manual_edges = NULL; } void pto2_orchestrator_set_scheduler(PTO2OrchestratorState *orch, PTO2SchedulerState *scheduler) { @@ -477,50 +457,6 @@ static bool scope_tasks_push(PTO2OrchestratorState *orch, PTO2TaskSlotState *tas return true; } -static bool manual_task_meta_push(PTO2OrchestratorState *orch, const PTO2ManualTaskMeta &meta) { - if (orch->fatal) { - return false; - } - if (orch->manual_task_meta_size >= orch->manual_task_meta_capacity) { - int32_t new_cap = orch->manual_task_meta_capacity * 2; - PTO2ManualTaskMeta *new_buf = reinterpret_cast<PTO2ManualTaskMeta *>( - realloc(orch->manual_task_meta, new_cap * sizeof(PTO2ManualTaskMeta)) - ); - if (new_buf == nullptr) { - LOG_ERROR("Failed to grow manual task metadata buffer to %d entries", new_cap); - orch->sm_handle->header->orch_error_code.store(PTO2_ERROR_OUT_OF_MEMORY, std::memory_order_release); - orch->fatal = true; - return false; - } - orch->manual_task_meta = new_buf; - orch->manual_task_meta_capacity = new_cap; - } - orch->manual_task_meta[orch->manual_task_meta_size++] = meta; - return true; -} - -static int32_t manual_edge_push(PTO2OrchestratorState *orch, const PTO2ManualEdge &edge) { - if (orch->fatal) { - return -1; - } - if (orch->manual_edges_size >= orch->manual_edges_capacity) { - int32_t new_cap = orch->manual_edges_capacity * 2; - PTO2ManualEdge *new_buf = - reinterpret_cast<PTO2ManualEdge *>(realloc(orch->manual_edges, new_cap * sizeof(PTO2ManualEdge))); - if (new_buf == nullptr) { - LOG_ERROR("Failed to grow manual edge buffer to %d entries", new_cap); - orch->sm_handle->header->orch_error_code.store(PTO2_ERROR_OUT_OF_MEMORY, std::memory_order_release); - orch->fatal = true; - return -1; - } - orch->manual_edges = new_buf; - orch->manual_edges_capacity = new_cap; - } - int32_t edge_idx = orch->manual_edges_size; - orch->manual_edges[orch->manual_edges_size++] = edge; - return edge_idx; -} - static bool in_manual_scope(const PTO2OrchestratorState *orch) { return orch->manual_scope_active; } @@ -590,10 +526,6 @@ void pto2_scope_begin(PTO2OrchestratorState *orch, PTO2ScopeMode mode) { ++orch->scope_stack_top; orch->scope_begins[orch->scope_stack_top] = orch->scope_tasks_size; orch->manual_scope_active = (mode == PTO2ScopeMode::MANUAL); - if (mode == PTO2ScopeMode::MANUAL) { - orch->manual_task_meta_begins[orch->scope_stack_top] = orch->manual_task_meta_size; - orch->manual_edge_begins[orch->scope_stack_top] = orch->manual_edges_size; - } } void pto2_scope_end(PTO2OrchestratorState *orch) { @@ -628,148 +560,43 @@ void pto2_scope_end(PTO2OrchestratorState *orch) { return; } - int32_t manual_meta_begin = orch->manual_task_meta_begins[top]; - int32_t manual_edge_begin = orch->manual_edge_begins[top]; - if (orch->scheduler && count > 0) { - int32_t manual_task_count = orch->manual_task_meta_size - manual_meta_begin; - if (manual_task_count != count) { - LOG_ERROR("manual scope requires pto2_rt_submit_*_manual for every submitted task"); - orch->sm_handle->header->orch_error_code.store(PTO2_ERROR_INVALID_ARGS, std::memory_order_release); - orch->fatal = true; - return; - } - - int32_t
dep_pool_needed[PTO2_MAX_RING_DEPTH] = {0}; for (int32_t task_idx = 0; task_idx < count; task_idx++) { - PTO2ManualTaskMeta &meta = orch->manual_task_meta[manual_meta_begin + task_idx]; PTO2TaskSlotState *slot_state = orch->scope_tasks[begin + task_idx]; - if (meta.slot_state != slot_state || meta.scope_task_index != task_idx) { - LOG_ERROR("manual scope task metadata does not match submit order"); - orch->sm_handle->header->orch_error_code.store(PTO2_ERROR_INVALID_ARGS, std::memory_order_release); - orch->fatal = true; - return; - } - PTO2TaskPayload *payload = slot_state->payload; - PTO2TaskId task_id = slot_state->task->task_id; - int32_t cached_external_count = payload->fanin_actual_count; - int32_t local_edge_count = meta.incoming_edge_count; - int32_t fanin_count = cached_external_count + local_edge_count; - - if (fanin_count > PTO2_MAX_INPUTS) { + if (payload->fanin_actual_count > PTO2_MAX_INPUTS) { LOG_ERROR("========================================"); LOG_ERROR("FATAL: Dependency Overflow Detected!"); LOG_ERROR("========================================"); LOG_ERROR("Task requires more than PTO2_MAX_INPUTS unique fanin dependencies."); - LOG_ERROR(" task_id.raw: %" PRIu64, task_id.raw); - LOG_ERROR(" fanin_count: %d / %d", fanin_count, PTO2_MAX_INPUTS); - LOG_ERROR(" reason: manual explicit dependency"); + LOG_ERROR(" task_id.raw: %" PRIu64, slot_state->task->task_id.raw); + LOG_ERROR(" fanin_count: %d / %d", payload->fanin_actual_count, PTO2_MAX_INPUTS); + LOG_ERROR(" reason: manual dependency bookkeeping"); LOG_ERROR("This is a runtime dependency-tracking limit."); LOG_ERROR("========================================"); orch->sm_handle->header->orch_error_code.store(PTO2_ERROR_DEPENDENCY_OVERFLOW, std::memory_order_release); orch->fatal = true; return; } - - dep_pool_needed[slot_state->ring_id] += local_edge_count; - } - - for (int32_t ring_id = 0; ring_id < PTO2_MAX_RING_DEPTH; ring_id++) { - int32_t needed = dep_pool_needed[ring_id]; - if (needed <= 0) { 
- continue; - } - auto &dep_pool = orch->rings[ring_id].dep_pool; - auto &fc = orch->sm_handle->header->rings[ring_id].fc; - dep_pool.ensure_space(*orch->scheduler, fc, ring_id, needed); } + int32_t dep_pool_mark_prefix = 0; for (int32_t task_idx = 0; task_idx < count; task_idx++) { - PTO2ManualTaskMeta &meta = orch->manual_task_meta[manual_meta_begin + task_idx]; PTO2TaskSlotState *slot_state = orch->scope_tasks[begin + task_idx]; - PTO2TaskPayload *payload = slot_state->payload; - int32_t cached_external_count = payload->fanin_actual_count; - int32_t local_edge_count = meta.incoming_edge_count; - if (local_edge_count == 0) { - continue; - } - - uint8_t ring_id = slot_state->ring_id; - auto &dep_pool = orch->rings[ring_id].dep_pool; - auto &fc = orch->sm_handle->header->rings[ring_id].fc; - PTO2FaninBuilder fanin_builder; - fanin_builder.count = cached_external_count; - fanin_builder.spill_start = payload->fanin_spill_start; - fanin_builder.spill_pool = - (payload->fanin_spill_pool != nullptr) ? 
payload->fanin_spill_pool : &orch->rings[ring_id].fanin_pool; - int32_t cached_inline_count = std::min(cached_external_count, PTO2_FANIN_INLINE_CAP); - for (int32_t i = 0; i < cached_inline_count; i++) { - fanin_builder.inline_slots[i] = payload->fanin_inline_slot_states[i]; - } - - auto append_local_fanin_or_fail = [&](PTO2TaskSlotState *prod_state) { - if (fanin_builder.count < PTO2_FANIN_INLINE_CAP) { - fanin_builder.inline_slots[fanin_builder.count++] = prod_state; - return true; - } - - PTO2FaninPool &fanin_pool = *fanin_builder.spill_pool; - fanin_pool.ensure_space(*orch->scheduler, fc, ring_id, 1); - int32_t spill_idx = fanin_pool.top; - PTO2FaninSpillEntry *entry = fanin_pool.alloc(); - if (entry == nullptr) { - orch->fatal = true; - return false; - } - if (fanin_builder.count == PTO2_FANIN_INLINE_CAP) { - fanin_builder.spill_start = spill_idx; - } - entry->slot_state = prod_state; - fanin_builder.count++; - return true; - }; - - for (int32_t edge_idx = meta.incoming_edge_head; edge_idx >= 0; - edge_idx = orch->manual_edges[edge_idx].next_consumer_edge) { - PTO2TaskSlotState *prod_state = orch->scope_tasks[begin + orch->manual_edges[edge_idx].producer_idx]; - if (!append_local_fanin_or_fail(prod_state)) { - return; - } - } - - int32_t fanin_count = fanin_builder.count; - int32_t inline_count = std::min(fanin_count, PTO2_FANIN_INLINE_CAP); - int32_t spill_count = fanin_count - inline_count; - payload->fanin_actual_count = fanin_count; - payload->fanin_spill_start = (spill_count > 0) ? fanin_builder.spill_start : 0; - payload->fanin_spill_pool = (spill_count > 0) ? fanin_builder.spill_pool : nullptr; - for (int32_t i = 0; i < inline_count; i++) { - payload->fanin_inline_slot_states[i] = fanin_builder.inline_slots[i]; + // add_dependency may allocate dep entries for an older consumer after + // newer tasks were already submitted. Recompute a monotonic dep-pool + // watermark at publish time so tail reclamation still advances safely. 
+ if (slot_state->dep_pool_mark < dep_pool_mark_prefix) { + slot_state->dep_pool_mark = dep_pool_mark_prefix; + } else { + dep_pool_mark_prefix = slot_state->dep_pool_mark; } - - // External producers were already realized during manual submit. - // At manual scope_end, only same-scope explicit edges remain to wire. - for (int32_t edge_idx = meta.incoming_edge_head; edge_idx >= 0; - edge_idx = orch->manual_edges[edge_idx].next_consumer_edge) { - PTO2TaskSlotState &producer_slot_state = *orch->scope_tasks[begin + orch->manual_edges[edge_idx].producer_idx]; - producer_slot_state.fanout_count += 1; - producer_slot_state.fanout_head = dep_pool.prepend(producer_slot_state.fanout_head, slot_state); - if (producer_slot_state.fanout_head == nullptr) { - orch->fatal = true; - return; - } - } - slot_state->fanin_count += local_edge_count; - slot_state->dep_pool_mark = dep_pool.top; } orch->scheduler->publish_manual_scope_tasks_and_end_scope(&orch->scope_tasks[begin], count); } // Rewind the task buffer — these entries are no longer needed orch->scope_tasks_size = begin; - orch->manual_task_meta_size = manual_meta_begin; - orch->manual_edges_size = manual_edge_begin; orch->scope_stack_top--; orch->manual_scope_active = false; @@ -857,6 +684,30 @@ static TaskOutputTensors pto2_submit_mixed_task_impl( PTO2FaninBuilder fanin_builder; fanin_builder.spill_pool = &orch->rings[ring_id].fanin_pool; + uint64_t manual_local_mask = 0; + bool needs_tensormap_sync = true; + if constexpr (kManualSubmit) { + needs_tensormap_sync = false; + for (int i = 0; i < args.tensor_count(); i++) { + TensorArgType ptype = args.tag(i); + if (ptype == TensorArgType::OUTPUT) { + continue; + } + + const Tensor *tensor = args.tensor(i).ptr; + if (task_owned_by_current_manual_scope(orch, tensor->owner_task_id)) { + manual_local_mask |= static_cast<uint64_t>(1ULL << i); + continue; + } + + bool needs_lookup = (ptype == TensorArgType::INPUT || ptype == TensorArgType::INOUT) && !tensor->manual_dep; + bool needs_insert = + (ptype == TensorArgType::INOUT || ptype == TensorArgType::OUTPUT_EXISTING) && !tensor->manual_dep; + if (needs_lookup || needs_insert) { + needs_tensormap_sync = true; + } + } + } CYCLE_COUNT_LAP_RECORD(g_orch_alloc_cycle, AicpuPhaseId::ORCH_ALLOC, task_id.raw); @@ -870,7 +721,9 @@ static TaskOutputTensors pto2_submit_mixed_task_impl( // === STEP 2: Sync TensorMap validity and optional cleanup === int32_t sm_last_task_alive = fc.last_task_alive.load(std::memory_order_acquire); - orch->tensor_map.sync_tensormap(ring_id, sm_last_task_alive); + if (!kManualSubmit || needs_tensormap_sync) { + orch->tensor_map.sync_tensormap(ring_id, sm_last_task_alive); + } if constexpr (!kManualSubmit) { if (sched) { @@ -881,7 +734,6 @@ static TaskOutputTensors pto2_submit_mixed_task_impl( CYCLE_COUNT_LAP_RECORD(g_orch_sync_cycle, AicpuPhaseId::ORCH_SYNC, task_id.raw); // === STEP 3: Lookup inputs + materialize runtime-created outputs === - uint64_t manual_local_mask = 0; for (int i = 0; i < args.tensor_count(); i++) { TensorArgType ptype = args.tag(i); if (ptype == TensorArgType::OUTPUT) { @@ -890,8 +742,7 @@ static TaskOutputTensors pto2_submit_mixed_task_impl( const Tensor *tensor = args.tensor(i).ptr; if constexpr (kManualSubmit) { - if (task_owned_by_current_manual_scope(orch, tensor->owner_task_id)) { - manual_local_mask |= static_cast<uint64_t>(1ULL << i); + if ((manual_local_mask & static_cast<uint64_t>(1ULL << i)) != 0) { continue; } } @@ -1238,23 +1089,6 @@ pto2_submit_mixed_task_manual(PTO2OrchestratorState *orch, const MixedKernels &m } result.task_id = task_id; result.outputs = outputs; - - PTO2ManualTaskMeta meta{}; - meta.slot_state = orch->scope_tasks[orch->scope_tasks_size - 1]; - meta.scope_task_index = orch->scope_tasks_size - 1 - current_manual_scope_begin(orch); - meta.incoming_edge_head = -1; - meta.incoming_edge_count = 0; - meta.tensor_count = static_cast<uint8_t>(args.tensor_count()); - meta.manual_local_mask = 0; - for (int32_t i = 0; i < args.tensor_count(); i++) { - meta.tags[i] = static_cast<uint8_t>(args.tag(i)); - if (task_owned_by_current_manual_scope(orch, meta.slot_state->payload->tensors[i].owner_task_id)) { - meta.manual_local_mask |= static_cast<uint64_t>(1ULL << i); - } - } - if (!manual_task_meta_push(orch, meta)) { - return result; - } return result; } @@ -1284,27 +1118,86 @@ void pto2_add_dependency(PTO2OrchestratorState *orch, PTO2TaskId producer_id, PT return; } - int32_t meta_begin = orch->manual_task_meta_begins[orch->scope_stack_top]; - PTO2ManualTaskMeta &consumer_meta = orch->manual_task_meta[meta_begin + consumer_idx]; - for (int32_t edge_idx = consumer_meta.incoming_edge_head; edge_idx >= 0; - edge_idx = orch->manual_edges[edge_idx].next_consumer_edge) { - if (orch->manual_edges[edge_idx].producer_idx == producer_idx) { - return; + PTO2TaskSlotState *producer_slot_state = orch->scope_tasks[current_manual_scope_begin(orch) + producer_idx]; + PTO2TaskSlotState *consumer_slot_state = orch->scope_tasks[current_manual_scope_begin(orch) + consumer_idx]; + PTO2TaskPayload *consumer_payload = consumer_slot_state->payload; + + bool duplicate = false; + pto2_for_each_fanin_slot_state(*consumer_payload, [&](PTO2TaskSlotState *fanin_slot_state) { + if (fanin_slot_state == producer_slot_state) { + duplicate = true; + return false; } + return true; + }); + if (duplicate) { + return; } - int32_t edge_idx = manual_edge_push( - orch, - PTO2ManualEdge{ - .producer_idx = producer_idx, - .consumer_idx = consumer_idx, - .next_consumer_edge = consumer_meta.incoming_edge_head, + + if (consumer_payload->fanin_actual_count >= PTO2_MAX_INPUTS) { + LOG_ERROR("========================================"); + LOG_ERROR("FATAL: Dependency Overflow Detected!"); + LOG_ERROR("========================================"); + LOG_ERROR("Task requires more than PTO2_MAX_INPUTS unique fanin dependencies."); + LOG_ERROR(" consumer_id.raw: %" PRIu64, consumer_id.raw); + LOG_ERROR(" fanin_count: %d / %d", consumer_payload->fanin_actual_count + 1, PTO2_MAX_INPUTS); + LOG_ERROR("
reason: explicit add_dependency"); + LOG_ERROR("========================================"); + orch->sm_handle->header->orch_error_code.store(PTO2_ERROR_DEPENDENCY_OVERFLOW, std::memory_order_release); + orch->fatal = true; + return; + } + + auto &dep_pool = orch->rings[producer_slot_state->ring_id].dep_pool; + auto &fc = orch->sm_handle->header->rings[producer_slot_state->ring_id].fc; + dep_pool.ensure_space(*orch->scheduler, fc, producer_slot_state->ring_id, 1); + + PTO2FaninBuilder fanin_builder; + fanin_builder.count = consumer_payload->fanin_actual_count; + fanin_builder.spill_start = consumer_payload->fanin_spill_start; + fanin_builder.spill_pool = + (consumer_payload->fanin_spill_pool != nullptr) ? consumer_payload->fanin_spill_pool + : &orch->rings[consumer_slot_state->ring_id].fanin_pool; + int32_t cached_inline_count = std::min(fanin_builder.count, PTO2_FANIN_INLINE_CAP); + for (int32_t i = 0; i < cached_inline_count; i++) { + fanin_builder.inline_slots[i] = consumer_payload->fanin_inline_slot_states[i]; + } + + if (fanin_builder.count < PTO2_FANIN_INLINE_CAP) { + fanin_builder.inline_slots[fanin_builder.count++] = producer_slot_state; + } else { + PTO2FaninPool &fanin_pool = *fanin_builder.spill_pool; + fanin_pool.ensure_space(*orch->scheduler, fc, consumer_slot_state->ring_id, 1); + int32_t spill_idx = fanin_pool.top; + PTO2FaninSpillEntry *entry = fanin_pool.alloc(); + if (entry == nullptr) { + orch->fatal = true; + return; + } + if (fanin_builder.count == PTO2_FANIN_INLINE_CAP) { + fanin_builder.spill_start = spill_idx; } - ); - if (edge_idx < 0) { + entry->slot_state = producer_slot_state; + fanin_builder.count++; + } + + producer_slot_state->fanout_count += 1; + producer_slot_state->fanout_head = dep_pool.prepend(producer_slot_state->fanout_head, consumer_slot_state); + if (producer_slot_state->fanout_head == nullptr) { + orch->fatal = true; return; } - consumer_meta.incoming_edge_head = edge_idx; - consumer_meta.incoming_edge_count++; + + int32_t 
inline_count = std::min(fanin_builder.count, PTO2_FANIN_INLINE_CAP); + int32_t spill_count = fanin_builder.count - inline_count; + consumer_payload->fanin_actual_count = fanin_builder.count; + consumer_payload->fanin_spill_start = (spill_count > 0) ? fanin_builder.spill_start : 0; + consumer_payload->fanin_spill_pool = (spill_count > 0) ? fanin_builder.spill_pool : nullptr; + for (int32_t i = 0; i < inline_count; i++) { + consumer_payload->fanin_inline_slot_states[i] = fanin_builder.inline_slots[i]; + } + consumer_slot_state->fanin_count += 1; + consumer_slot_state->dep_pool_mark = dep_pool.top; } // ============================================================================= diff --git a/src/a2a3/runtime/tensormap_and_ringbuffer/runtime/pto_orchestrator.h b/src/a2a3/runtime/tensormap_and_ringbuffer/runtime/pto_orchestrator.h index 023b60de8..9f70d8ed3 100644 --- a/src/a2a3/runtime/tensormap_and_ringbuffer/runtime/pto_orchestrator.h +++ b/src/a2a3/runtime/tensormap_and_ringbuffer/runtime/pto_orchestrator.h @@ -40,22 +40,6 @@ // Orchestrator State // ============================================================================= -struct PTO2ManualTaskMeta { - PTO2TaskSlotState *slot_state; - int32_t scope_task_index; - int32_t incoming_edge_head; - uint16_t incoming_edge_count; - uint8_t tensor_count; - uint64_t manual_local_mask; - uint8_t tags[MAX_TENSOR_ARGS]; -}; - -struct PTO2ManualEdge { - int32_t producer_idx; - int32_t consumer_idx; - int32_t next_consumer_edge; -}; - /** * Orchestrator state structure (private to Orchestrator) * @@ -105,15 +89,8 @@ struct PTO2OrchestratorState { // Cross-thread notification uses shared memory orch_error_code (atomic) bool fatal; - // === MANUAL-SCOPE METADATA === - int32_t *manual_task_meta_begins; // start index in manual_task_meta for each scope - int32_t *manual_edge_begins; // start index in manual_edges for each scope - PTO2ManualTaskMeta *manual_task_meta; - int32_t manual_task_meta_size; - int32_t 
manual_task_meta_capacity; - PTO2ManualEdge *manual_edges; - int32_t manual_edges_size; - int32_t manual_edges_capacity; + // === MANUAL-SCOPE STATE === + int32_t manual_dep_pool_reserve[PTO2_MAX_RING_DEPTH]; // Hidden alloc tasks complete synchronously inside the orchestrator and // therefore bypass the executor's normal worker-completion counter path. From 9a9974d583e5f690b2a6f02be43533b0d70f43f7 Mon Sep 17 00:00:00 2001 From: Youwei Xiao Date: Thu, 9 Apr 2026 00:18:57 +0800 Subject: [PATCH 30/35] Update: refresh manual dependency design note - rewrite the design note to match the current manual submit and publish-only scope_end implementation - record the fresh four-way paged-attention comparison and benchmark entrypoints, including the detached worktree flow for the old runtime - remove the dead manual_dep_pool_reserve state from the orchestrator --- docs/manual-dep-for-tensormap-design.md | 1496 +++-------------- .../runtime/pto_orchestrator.cpp | 1 - .../runtime/pto_orchestrator.h | 4 - 3 files changed, 238 insertions(+), 1263 deletions(-) diff --git a/docs/manual-dep-for-tensormap-design.md b/docs/manual-dep-for-tensormap-design.md index e489d6e08..5c5c3f619 100644 --- a/docs/manual-dep-for-tensormap-design.md +++ b/docs/manual-dep-for-tensormap-design.md @@ -2,1375 +2,355 @@ ## Goal -Bring the human-created dependency workflow from `aicpu_build_graph` into `tensormap_and_ringbuffer` in a scoped way: +Add a scoped manual-dependency mode to `tensormap_and_ringbuffer` without +regressing the default automatic path: -- `PTO2_SCOPE(PTO2ScopeMode::MANUAL) { ...
}` -- Tensors crossing scope boundaries use TensorMap semantics -- Tensors used entirely inside the manual scope use explicit `add_dependency` +- `PTO2_SCOPE()` keeps the existing automatic mode +- `PTO2_SCOPE(PTO2ScopeMode::MANUAL)` enables scoped manual dependency wiring +- same-manual-scope edges use explicit `pto2_rt_add_dependency(...)` +- cross-scope edges still use `owner_task_id` and TensorMap discovery -This is not a port of `aicpu_build_graph`'s fully-explicit runtime model. The target is a hybrid model inside `tensormap_and_ringbuffer`: +This is a hybrid model, not a port of `aicpu_build_graph`. -- same-scope dependency tracking: explicit -- cross-scope dependency tracking: TensorMap -- scope-local lifetime management: unchanged ring/scope ownership model +## API Surface -## Code-Checked Baseline - -This draft is reviewed against the current implementations in: - -- `src/a2a3/runtime/aicpu_build_graph/runtime/pto_orchestrator.{h,cpp}` -- `src/a2a3/runtime/aicpu_build_graph/orchestration/pto_orchestration_api.h` -- `src/a2a3/runtime/tensormap_and_ringbuffer/runtime/pto_orchestrator.{h,cpp}` -- `src/a2a3/runtime/tensormap_and_ringbuffer/runtime/pto_runtime2.{h,cpp}` -- `src/a2a3/runtime/tensormap_and_ringbuffer/orchestration/pto_orchestration_api.h` -- `src/a2a3/runtime/tensormap_and_ringbuffer/runtime/tensor.h` - -The important current-code facts are: - -- `aicpu_build_graph` already has explicit `add_dependency`, returns `SubmitResult { task_id, outputs }`, and batch-publishes tasks at `scope_end`. -- `tensormap_and_ringbuffer` currently derives dependencies during submit, returns only `TaskOutputTensors`, and uses `scope_end` only for lifetime release. -- `tensormap_and_ringbuffer` orchestration is TLS-based today: `PTO2_SCOPE()` and `pto2_rt_submit_*()` do not take an explicit `PTO2Runtime*`. 
-- In `tensormap_and_ringbuffer`, `Tensor::manual_dep` is creator-retention-only mode: it skips OverlapMap lookup/insert, but `owner_task_id` retention still applies. - -## Confirmed Decisions - -These decisions are already aligned with the requested direction: - -1. `tensormap` scope may contain a manual scope. -2. Manual scope may not contain another manual scope. -3. The design must not simplify away multi-write cases. -4. For an outer-scope tensor written inside a manual scope, readiness is the writer task completion time, not `scope_end`. -5. Therefore, a task inside a manual scope that writes an outer-scope tensor must still publish that tensor frontier through TensorMap before later submissions depend on it. -6. For an outer-scope tensor read inside a manual scope, the dependency must still be forced by TensorMap/owner-based boundary seeding during manual submit. -7. Tasks created inside a manual scope should be batch-published to the scheduler at `scope_end`, matching `aicpu_build_graph` semantics for explicit dependency closure inside the scope. - -## Change Control Requirements - -The implementation PR must follow these rules: - -- Keep the change strictly scoped to manual dependency support in `tensormap_and_ringbuffer`. -- Do not refactor unrelated runtime behavior while doing this work. -- Do not change existing auto-scope TensorMap semantics. -- Do not change scope lifetime semantics. -- Prefer the smallest invasive write set that cleanly supports the feature. -- Preserve existing examples/tests unless a targeted update is required to cover the new feature. -- Any behavior change outside manual-scope execution must be treated as a regression. 
- -## Repository Rule Requirements - -The implementation must carefully follow the repository's coding rules and conventions: - -- obey `CLAUDE.md` directory ownership and workflow rules -- obey `.claude/rules/architecture.md` -- obey `.claude/rules/codestyle.md` -- keep platform-isolation preprocessor ordering consistent with repo rules -- avoid comment styles that encode plan phases or temporary implementation notes -- preserve current behavior unless this spec explicitly requires otherwise -- avoid adding new tensor metadata unless it is strictly necessary for correctness -- prefer provenance on task-side state over changing hot-path `Tensor` layout - -## Tooling Requirements - -The implementation and follow-up PRs must also respect the current repository tooling state: - -- PR #424 has already aligned C and C++ sources with `clang-format`. -- Local development should use `clang-format` `v21.1.0`, matching `.pre-commit-config.yaml`. -- Developers should configure local save-time auto-formatting with that exact `clang-format` version to avoid unnecessary AI-driven formatting churn. -- The feature PR should not include unrelated bulk reformatting. -- `.clang-tidy` is now part of the repository toolchain, but many checks are still intentionally disabled in the config file. -- This feature PR must satisfy the currently active `clang-tidy` expectations for touched code. -- Gradually enabling additional `clang-tidy` checks and fixing old violations is a separate ongoing stream of work, not something this feature should broaden into unless directly required for touched code. - -## Non-Goals - -- Do not replace `tensormap_and_ringbuffer` with a fully explicit runtime. -- Do not require explicit export/import APIs at scope boundaries. -- Do not constrain v1 to single-writer exported tensors. -- Do not change the existing rule that inner-scope temporary tensors do not outlive their owning scope unless already represented by an outer-scope tensor. 
- -## Current Runtime Behavior Relevant To This Design - -## Scope lifetime - -In `tensormap_and_ringbuffer`, each submitted task starts with one scope-held fanout reference. On `scope_end`, the scheduler releases that reference. When fanout is otherwise exhausted, the task can become `CONSUMED` and its slot/buffer can be reclaimed. - -This means: - -- outer-scope tensors may flow into inner scopes -- inner-scope temporaries are scope-local by default -- `scope_end` affects lifetime ownership, not semantic readiness of a cross-scope tensor write - -## Current dependency model - -Today the runtime derives dependencies in `pto_orchestrator.cpp` using: - -- creator retention through `owner_task_id` -- modifier lookup through TensorMap overlap search -- TensorMap insert for `INOUT` and `OUTPUT_EXISTING` - -There is already a `Tensor::manual_dep` bit, but in current code it is only a per-tensor creator-retention mode: it skips TensorMap overlap lookup/insert while still keeping `owner_task_id` retention. That is not sufficient for scoped hybrid semantics because the scope, not the tensor alone, decides whether a use is same-scope or cross-scope. - -## Discovery vs Execution Separation - -This distinction is central to the frozen design. - -TensorMap is not the execution-time dependency engine. It is only a producer-discovery mechanism. - -The scheduler's fanin/fanout graph is the execution-time dependency engine. - -In current `tensormap_and_ringbuffer`, submit does two different things: - -1. Discover producers from tensors. -- creator retention from `owner_task_id` in `src/a2a3/runtime/tensormap_and_ringbuffer/runtime/pto_orchestrator.cpp` -- overlap lookup from TensorMap in `src/a2a3/runtime/tensormap_and_ringbuffer/runtime/pto_orchestrator.cpp` - -2. Convert discovered producers into scheduler edges. 
-- accumulate unique producer slot states in local `fanin_states[]`
-- wire producer `fanout_head`, consumer `fanin_count`, and consumer `fanin_refcount`
-
-After that conversion, execution no longer cares how the dependency was found.
-
-The scheduler only sees:
-
-- producer fanout list
-- consumer fanin counters
-
-This is why manual dependency integration should work as follows:
-
-- do not put manual dependencies into TensorMap
-- do not bind manual dependencies to tensors
-- at manual `scope_end`, realize manual dependencies directly as normal producer-consumer scheduler edges
-
-So for a task inside manual scope:
-
-1. Start a local dedup buffer such as `fanin_states[]`.
-2. During submit, add producers from outer-tensor creator retention and TensorMap lookup.
-3. Cache that deduped external producer set in the task payload.
-4. At `scope_end`, add producers from recorded manual edges.
-5. Dedup both sources together before wiring the scheduler edges.
-6. Run the normal wiring path into:
-   - `payload->fanin_slot_states[]`
-   - `fanin_count`
-   - producer `fanout_head`
-
-Then after publish:
-
-- manual deps and TensorMap-derived deps are indistinguishable
-- both are handled by the existing scheduler readiness and completion fanout path
-
-Concise conclusion:
-
-- TensorMap discovers boundary tensor-related dependencies during manual submit
-- manual deps bypass discovery and are replayed only at manual `scope_end`
-- both become the same scheduler edges before publish
-- execution uses only the scheduler edge machinery, not TensorMap
-
-## Implemented Manual-Scope Algorithm
-
-The current implementation is a submit/scope-end split:
-
-- manual submit still allocates task ids, task slots, and payloads immediately
-- manual submit still does boundary producer discovery for cross-scope tensors
-- manual submit still updates TensorMap frontier for cross-scope writes
-- manual submit does not publish tasks to the scheduler ready graph
-- manual `scope_end` replays only explicit same-scope edges plus cached external fanins, then batch-publishes tasks
-
-### How tensor arguments are handled
-
-The runtime decision is per tensor argument, not per scope:
-
-| Tensor use in a manual-scope task | How dependency is found | Uses TensorMap? | What must be maintained |
-| --- | --- | --- | --- |
-| tensor created in the current manual scope, then reused in the current manual scope | explicit `add_dependency` only | no | recorded manual edge in scope-local edge buffer |
-| outer/external `INPUT` | creator retention plus overlap lookup | yes, at manual submit unless `manual_dep=true` | cached external producer set in task payload |
-| outer/external `INOUT` | creator retention plus overlap lookup for incoming state | yes, at manual submit unless `manual_dep=true` | cached external producer set plus updated writer frontier |
-| outer/external `OUTPUT_EXISTING` | creator retention only for incoming owner, no overlap lookup | yes for outgoing frontier update unless `manual_dep=true` | updated writer frontier |
-| runtime-created `OUTPUT` inside manual scope | no incoming dependency | no immediate lookup | `owner_task_id` on produced tensor so later users can classify it as manual-local |
-
-`TensorArgType` matters here:
-
-- `INPUT`: read-only; needs incoming producer discovery, no outgoing frontier update
-- `INOUT`: read old value and write new value; needs both incoming producer discovery and outgoing frontier update
-- `OUTPUT_EXISTING`: overwrite an existing outer buffer; does not need overlap lookup for an old modifier chain, but still needs outgoing frontier update
-- `OUTPUT`: fresh runtime allocation; has no incoming dependency and becomes manual-local to later tasks in the same manual scope
-
-### What manual submit iterates
-
-For each submitted task in a manual scope, the runtime iterates tensor args in submit order:
-
-1. Allocate task id, slot state, and payload immediately.
-2. For each tensor arg, classify it as manual-local or outer/external from `owner_task_id` plus current manual-scope ownership.
-3. For manual-local tensors:
-   - skip creator-retention wiring
-   - skip TensorMap lookup/insert
-   - rely on explicit `add_dependency`
-4. For outer/external tensors:
-   - keep creator-retention from `owner_task_id`
-   - run TensorMap overlap lookup for `INPUT` and `INOUT` unless `manual_dep=true`
-   - cache the deduped external producer set in the task payload
-   - update TensorMap frontier for outer writes in original submit order unless `manual_dep=true`
-5. Leave the task unpublished behind one deferred publish barrier.
-
-### What manual scope_end iterates
-
-At manual `scope_end`, the runtime iterates tasks in the current manual scope in original submit order:
-
-1. Read the cached external producer set from each task payload.
-2. Replay explicit same-scope edges recorded by `add_dependency`.
-3. Merge and dedup:
-   - cached external producers
-   - explicit manual producers
-4. Realize the final producer set into the normal scheduler fanin/fanout structures.
-5. Release the deferred publish barrier and batch-publish the manual-scope tasks.
-6. Release the usual scope-held lifetime reference.
-
-This is why manual `scope_end` is still expensive on non-unroll paged attention:
-
-- it walks every manual-scope task
-- it merges cached external fanins with explicit same-scope edges
-- it mutates scheduler fanin/fanout state in one serial finalize step
-
-### What state manual scope maintains
-
-The runtime keeps a small amount of scope-local metadata instead of a second execution engine:
-
-- `scope_tasks[]`: task order for the current scope
-- `manual_task_meta[]`: per-task metadata for manual finalize
-- `manual_edges[]`: explicit same-scope producer-consumer edges
-- `payload->fanin_slot_states[]`: cached external producers discovered at manual submit
-- `fanin_count` plus one deferred publish barrier
-
-This split was chosen because it preserves the normal scheduler after publish:
-
-- no second execution-time dependency engine
-- no change to the ready queue model
-- no change to worker dispatch or completion handling
-- only boundary discovery stays on the TensorMap path
-- only same-scope replay is deferred to manual `scope_end`
-
-### Why these decisions were made
-
-Each part of the split exists to avoid a specific incorrect or too-expensive alternative:
-
-- keep cross-scope producer discovery on TensorMap
-  - otherwise outer reads and writes would lose the current producer frontier and later submissions could see stale state
-- keep same-scope manual-local edges explicit
-  - otherwise manual mode would still pay repeated TensorMap lookup/insert for the tensors it is trying to optimize
-- defer scheduler publication to manual `scope_end`
-  - otherwise tasks with partially wired explicit edges could become runnable too early
-- keep only one post-publish scheduler mechanism
-  - otherwise the runtime would need a second dependency engine and a second completion path, which is high-risk and unnecessary
-
-## Problem Statement
-
-If we simply copy `aicpu_build_graph` semantics into `tensormap_and_ringbuffer`, we get a wrong boundary model:
-
-- suppressing TensorMap for all tensors inside `PTO2_SCOPE(PTO2ScopeMode::MANUAL)` is incorrect
-- delaying publication of an outer tensor until `scope_end` is incorrect
-
-The reason is that cross-scope tensors must become visible at the actual writer frontier. Outside consumers should depend on the task that really produced the latest visible state, not on scope closure.
-
-So the correct split is:
-
-- same-scope tensor relations inside the manual scope: explicit edges only
-- cross-scope tensor relations: preserve TensorMap behavior
-
-## Required Semantics
-
-## Core rule
-
-`PTO2_SCOPE(PTO2ScopeMode::MANUAL)` means:
-
-- if a tensor was created inside this manual scope and is reused inside this manual scope, the dependency must be established by explicit `add_dependency`
-- all outer-scope tensors still use existing TensorMap/owner metadata
-- tasks submitted inside the manual scope remain invisible to the scheduler until `scope_end`
-
-This rule applies per tensor use site, not as a global on/off switch for the whole submit.
-
-## Two Different Publication Semantics
-
-The design must distinguish two different kinds of publication:
-
-1. Scheduler publication
-2. TensorMap boundary publication
-
-### Scheduler publication
-
-For tasks inside `PTO2_SCOPE(PTO2ScopeMode::MANUAL)`:
-
-- submit builds the internal task records and records explicit dependencies
-- those tasks are not yet published as executable scheduler work
-- `scope_end` batch-publishes them to the scheduler
-
-This is required so all same-scope explicit edges are fully wired before any task in the manual scope can start execution.
-
-### TensorMap boundary publication
-
-For cross-scope tensors touched by tasks inside `PTO2_SCOPE(PTO2ScopeMode::MANUAL)`:
-
-- outside tasks submitted after a writer task is submitted must still be able to discover that writer frontier through TensorMap
-- therefore the producer frontier for an external tensor written inside the manual scope must be updated during manual submit
-- however the tensor is still not semantically ready until that producer task actually completes
-
-So:
-
-- scheduler visibility of the task is controlled by manual `scope_end`
-- dependency readiness for later consumers is still enforced by waiting on producer task completion
-
-The document must not conflate these two mechanisms.
-
-More precisely:
-
-- before manual `scope_end`, the task record already exists and TensorMap boundary discovery/publication has already happened
-- before manual `scope_end`, the task is still invisible to the executable scheduler graph
-- after manual `scope_end`, the task becomes part of the executable published graph
-- once published, the task may enter `READY` immediately or remain `PENDING` depending on whether its dependencies are already satisfied
-
-## Discussion Guardrails
-
-The following clarifications are recorded to reduce implementation drift and hallucination risk:
-
-1. Deferred publish does not mean deferred task allocation.
-- Manual tasks still allocate task ids, slot state, and payload at submit time.
-- What is deferred is explicit same-scope edge realization and ready-queue publication.
-
-2. Manual dependencies are not tracked by TensorMap.
-- TensorMap is only used for tensor-related producer discovery.
-- Manual dependencies are explicit producer-consumer edges recorded by orchestration.
-- At manual `scope_end`, both kinds of dependencies are converted into the same scheduler fanin/fanout structures.
-
-3. After manual `scope_end`, there is no special execution-time manual mechanism.
-- The runtime should not keep a second dependency engine alive after publish.
-- Once the scope is finalized, all dependencies are handled only by the existing scheduler fanin/fanout path.
-
-4. Submit-time boundary wiring does not change tensor readiness semantics.
-- Outer writes become TensorMap-visible at manual submit.
-- Their semantic readiness is still producer-task completion.
-
-## Tensor categories
-
-For a task submitted inside a manual scope, every tensor argument falls into one of these categories:
-
-1. Outer-scope tensor, read only
-2. Outer-scope tensor, written in place
-3. Tensor created inside this manual scope, used again inside this manual scope
-4. Outer-scope tensor accessed through a derived view/reshape/transpose inside the manual scope
-5. External tensor with no owner task
-
-The runtime only needs one special classification for v1:
-
-- tensor created in the current manual scope
-
-Everything else stays on the existing TensorMap path.
-
-## Expected behavior by category
-
-### 1. Outer-scope tensor, read only
-
-- The first internal consumer must still get its dependency from TensorMap/owner-based boundary seeding.
-- This is not optional and must not be delegated to explicit manual edges inside the scope.
-- Manual scope does not remove the need to wait for the outer producer frontier.
-- In other words, outer-read boundary correctness is still forced by TensorMap-side logic.
-
-### 2. Outer-scope tensor, written in place
-
-- The internal writer must still publish its producer frontier for TensorMap boundary tracking.
-- That boundary frontier must become visible at manual submit, in original submit order, so later submissions can attach to the correct writer task immediately.
-- Readiness of the written tensor is the completion of that writer task.
-- Multiple writes inside the same manual scope are allowed.
-- TensorMap should continue tracking the latest producer frontier exactly as in auto scope while the scope is still unpublished to the scheduler.
-
-### 3. Tensor created inside this manual scope and reused only inside this manual scope
-
-- No TensorMap lookup/insert.
-- No automatic same-scope dependency derivation.
-- Orchestration must call `add_dependency` explicitly for correctness.
-
-### 4. Outer-scope tensor accessed through a derived view/reshape/transpose inside the manual scope
-
-This is the real aliasing case that matters for the design. It must be handled by ownership classification, not by raw pointer equality.
-
-An outer-scope tensor may be sliced or reshaped inside the manual scope, but it is still outer-scope.
-
-If orchestration is reading or mutating an outer-scope tensor through a derived view that inherits the outer owner/scope identity, that is still cross-scope and should keep TensorMap behavior.
-
-A tensor created inside the manual scope should not later become an outer-scope alias. That would violate the existing scope lifetime model rather than define a supported boundary case.
-
-### 5. External tensor with no owner task
-
-- There is no creator dependency.
-- Reads need no dependency unless TensorMap contains a producer entry.
-- Writes to such a tensor should still publish to TensorMap if the tensor is cross-scope visible.
-
-## Recommended API Shape
-
-## Orchestration API
-
-Keep the existing `tensormap_and_ringbuffer` orchestration style: TLS-based helpers with no explicit runtime argument in user orchestration code. Do not make the public surface look like `aicpu_build_graph`'s `PTO2_SCOPE(rt)` family just to add manual mode.
-
-Add explicit edge wiring to `tensormap_and_ringbuffer` orchestration API:
-
-```cpp
-void pto2_rt_add_dependency(PTO2TaskId producer, PTO2TaskId consumer);
-```
-
-Use an explicit scope mode enum for the scoped API:
+The orchestration API now uses an enum instead of the old boolean-style scope
+switch:
 
 ```cpp
-enum class PTO2ScopeMode : uint8_t {
-  AUTO = 0,
-  MANUAL = 1,
-};
-
-PTO2_SCOPE(PTO2ScopeMode::MANUAL) {
-  ...
+PTO2_SCOPE() {
+  // default: PTO2ScopeMode::AUTO
 }
 ```
-
-`PTO2_SCOPE()` remains the auto-scope form by default. `PTO2_SCOPE(PTO2ScopeMode::MANUAL)` enters manual mode explicitly.
-
-Do not change `TaskOutputTensors`.
-
-Add new manual-submit APIs with `_manual` suffix so orchestration code can get task ids without changing existing normal submit call sites. This mirrors the role of `aicpu_build_graph`'s `SubmitResult`, but keeps the existing `tensormap_and_ringbuffer` submit APIs intact:
-
-```cpp
-struct PTO2ManualSubmitResult {
-  PTO2TaskId task_id;
-  TaskOutputTensors outputs;
-};
-
-PTO2ManualSubmitResult pto2_rt_submit_task_manual(const MixedKernels& mixed_kernels, const Arg& args);
-PTO2ManualSubmitResult pto2_rt_submit_aic_task_manual(int32_t kernel_id, const Arg& args);
-PTO2ManualSubmitResult pto2_rt_submit_aiv_task_manual(int32_t kernel_id, const Arg& args);
-```
-
-These APIs are intended for use inside `PTO2_SCOPE(PTO2ScopeMode::MANUAL)` where explicit dependency wiring is required.
-
-This design intentionally splits task APIs, not tensor storage APIs:
-
-- auto scope uses existing `pto2_rt_submit_*` task APIs
-- manual scope uses `pto2_rt_submit_*_manual` task APIs
-- both modes continue using the same `Tensor`, `Arg`, and `TensorArgType` model
-
-So manual mode changes how tasks are recorded and finalized, not how tensors are represented.
-
-## Rejected API Alternatives
-
-The following alternatives were considered and rejected for v1:
-
-1. Add a new user-facing “external tensor” API for manual scope.
-- Rejected because manual mode only needs to identify manual-local tensors.
-- Everything else can be treated as outer/external and handled by the existing TensorMap path.
-- Adding a second tensor annotation API would increase surface area without adding necessary information.
-
-2. Change `TaskOutputTensors` to carry task ids.
-- Rejected to avoid broad churn in existing orchestration code.
-- Manual mode gets separate `_manual` submit APIs instead.
-
-3. Create a second tensor representation for manual mode.
-- Rejected because payload already stores the copied tensor/scalar data needed for deferred finalize.
-- The task API split is enough; tensor storage stays unified.
-
-## Runtime API
-
-Add runtime ops support:
-
-```cpp
-void (*add_dependency)(PTO2Runtime* rt, PTO2TaskId producer, PTO2TaskId consumer);
-void (*scope_begin)(PTO2Runtime* rt, PTO2ScopeMode mode);
-```
-
-The orchestration-facing helper can stay TLS-style and hide the runtime pointer, for example by plumbing the mode through the existing `pto2_rt_scope_begin()` / `PTO2ScopeGuard` path.
-
-Add manual-scope entry/exit plumbing by extending the existing runtime entry point with a mode flag:
-
-```cpp
-void pto2_rt_scope_begin(PTO2Runtime* rt, PTO2ScopeMode mode = PTO2ScopeMode::AUTO);
-```
-
-Recommendation: extend scope state with a mode flag and keep one scope stack. Avoid separate manual/non-manual stacks.
-
-## Internal Design
-
-## Scope state
-
-Each scope frame needs:
-- `begin_index` into `scope_tasks`
-- scope mode: normal or manual
-- `begin_index` into a manual-edge buffer when the scope is manual
-- `begin_index` into a manual-task-meta buffer when the scope is manual
-
-Manual scope needs a local edge registry because `add_dependency` should record edges during orchestration but should not mutate scheduler fanin/fanout state until manual `scope_end`.
-
-Manual scope also needs a compact per-task metadata stream so `scope_end` can replay the deferred dependency logic without copying full `Arg` objects.
-
-## Tensor metadata
-
-Current `Tensor` already stores:
-
-- `owner_task_id`
-- `manual_dep`
-
-Recommendation: do not add new tensor metadata in v1.
-
-The narrowed v1 rule only needs to identify tensors created in the current manual scope. That can be derived from:
-
-- `tensor.owner_task_id`
-- the set of task ids created in the current manual scope
-
-So the preferred approach is:
-
-- keep `Tensor` layout unchanged
-- keep `owner_task_id` as the provenance pointer
-- track the current manual scope's owned task membership in scope-local orchestrator state
-
-`manual_dep` should not become the primary mechanism for scoped semantics. It may remain as a per-tensor escape hatch for existing behavior, but the manual-scope design should be driven by:
-
-- current scope mode
-- tensor owner task id
-- whether that owner belongs to the current manual scope
-
-## Shared Tensor Path, Split Task APIs
-
-The design should keep one shared tensor recording path across normal and manual scope:
-
-- `Arg` remains the user-facing container for tensor refs, tensor create-info, scalars, and `TensorArgType`
-- `PTO2TaskPayload` remains the destination for copied tensors and scalars
-- runtime-created outputs still receive `owner_task_id` during submit
-
-What changes in manual scope is only the task API and the time when dependency logic runs:
-
-- normal submit APIs perform dependency lookup and TensorMap insert immediately
-- manual submit APIs only allocate the task, copy payload data, and record compact finalize metadata
-- manual `scope_end` replays only explicit same-scope edges and combines them with the external producer set already cached at submit
-
-This keeps normal-mode APIs unchanged while avoiding a second tensor representation for manual mode.
-
-## Classification rule
-
-In manual scope, the runtime only needs one special classification:
-
-```cpp
-manual_local_tensor =
-    tensor.owner_task_id.is_valid() &&
-    current_manual_scope_owns(tensor.owner_task_id);
-```
-
-Then:
-
 ```cpp
-if (!in_manual_scope) {
-    use existing tensormap behavior;
-} else if (manual_local_tensor) {
-    use explicit add_dependency only;
-} else {
-    use existing TensorMap/owner behavior;
+PTO2_SCOPE(PTO2ScopeMode::MANUAL) {
+  auto qk = pto2_rt_submit_aic_task_manual(...);
+  auto sf = pto2_rt_submit_aiv_task_manual(...);
+  pto2_rt_add_dependency(qk.task_id, sf.task_id);
 }
 ```
-
-Important nuance:
-
-- tensors created in the current manual scope use explicit same-scope dependencies
-- all outer tensors stay on the existing TensorMap path, even if two tasks inside the manual scope both access them
-- this means outer tensors may still create implicit same-scope edges through TensorMap inside a manual scope
-- this is an accepted v1 decision and should be documented in the PR description as a deliberate tradeoff
-
-This is why a separate user-facing “external tensor” API is not required for v1:
-
-- manual mode only needs to identify manual-local tensors
-- everything else is treated as outer/external and goes through the existing TensorMap path
-- that decision can be derived from `owner_task_id` plus the current manual scope's owned-task membership check
-
-## Scheduler-Safe Hybrid Design
+
+Current modes:
-
-The scheduler changes should be localized and should not disturb existing auto-scope behavior.
+- `PTO2ScopeMode::AUTO`
+- `PTO2ScopeMode::MANUAL`
-### Design principle
+
+Current restrictions:
-
-Keep two execution paths:
+- manual scope cannot be nested inside another manual scope
+- manual submit APIs are only valid inside `PTO2_SCOPE(PTO2ScopeMode::MANUAL)`
+- `pto2_rt_add_dependency(...)` requires both tasks to belong to the current
+  manual scope
-
-- auto scope path: existing `tensormap_and_ringbuffer` behavior
-- manual scope path: deferred dependency realization and deferred scheduler publication
+
+## Current Design
-
-The auto path should remain unchanged as much as possible.
+
+### High-level rule
-
-### What a manual-scope task must count as dependencies
+
+Manual mode only changes same-scope dependency discovery.
-
-For a task inside `PTO2_SCOPE(PTO2ScopeMode::MANUAL)`, total fanin is:
+
+- in-scope tensors: use explicit task edges
+- cross-scope tensors: keep TensorMap semantics
+- scope-local lifetime model: unchanged
+- scheduler execution model after publish: unchanged
-
-- explicit manual dependencies added by `add_dependency`
-- external dependencies derived from TensorMap/owner logic for outer-scope reads
-- one extra publish barrier released only at manual `scope_end`
+
+### Why this split exists
-
-In other words:
+
+We need two properties at the same time:
-
-```cpp
-fanin_count =
-    manual_dep_edges +
-    external_tensor_deps +
-    1;  // publish barrier
-```
+
+1. Manual scopes must be able to skip TensorMap work for same-scope chains.
+2. Outer producers and outer consumers must still see the correct frontier.
-
-This is the key mechanism that lets the scheduler ignore manual-local TensorMap lookup while still respecting out-of-scope dependencies.
+
+If we disabled TensorMap for everything inside a manual scope, cross-scope
+reads and writes would become incorrect. If we kept TensorMap for everything,
+manual mode would not remove the overhead we care about.
-### What submit should do in manual scope
+The chosen split is:
-
-For a task submitted inside manual scope:
+
+- same-scope manual-local traffic: explicit edges only
+- boundary traffic: existing creator-retention + TensorMap lookup/insert
-
-1. Allocate slot and payload exactly as today.
-2. Initialize `task_state = PENDING`.
-3. Initialize `fanin_count = 1` and `fanin_refcount = 0` for deferred publication.
-4. Return a stable `task_id` immediately so orchestration can call `add_dependency`.
-5. Do not realize explicit manual edges into scheduler fanin/fanout yet.
-6. Realize outer-boundary producer discovery immediately:
-   - creator retention from `owner_task_id`
-   - TensorMap lookup for outer `INPUT` / `INOUT`
-   - covered-entry removal for outer `INOUT`
-7. Publish outer writes to TensorMap immediately for outer `INOUT` / `OUTPUT_EXISTING`.
-8. Do not push the task into ready queues during submit.
-9. Retain every cached external producer strongly enough that it cannot be reclaimed or slot-reused before manual `scope_end`.
-10. Cache the deduped external producer set in the task payload so manual `scope_end` can realize the scheduler edges without touching TensorMap.
-11. Preserve enough scope-local information so manual `scope_end` can realize explicit same-scope edges before publish.
+
+## Dependency Semantics
-
-Submit-time task records are still required even though execution is deferred:
+
+### Tensor origin matters first
-
-- manual submit APIs must return stable task ids immediately
-- runtime-created outputs need `owner_task_id` immediately so later scope-local tensors and their derived views can be recognized
-- the scheduler only sees these tasks after manual `scope_end`
+
+Each tensor argument is classified at submit time:
-
-Manual mode should also record a compact per-task finalize descriptor rather than a second full copy of `Arg`.
+- `manual-local`: the tensor owner was created inside the current manual scope
+- `boundary`: anything else, including external tensors and tensors produced by
+  tasks outside the current manual scope
-
-Recommended shape:
+
+Manual-local tensors skip TensorMap entirely. Boundary tensors stay on the
+normal TensorMap path unless `manual_dep=true`.
-
-```cpp
-struct PTO2ManualTaskMeta {
-  uint64_t packed_tags;   // compact encoding of TensorArgType for this task
-  uint16_t tensor_count;
-  uint16_t edge_begin;    // range in manual_edges[]
-  uint16_t edge_count;
-  uint16_t _pad;
-};
-
-struct PTO2ManualEdge {
-  uint16_t producer_idx;  // index in current manual scope's task slice
-  uint16_t consumer_idx;
-};
-```
+
+### What `INPUT`, `OUTPUT`, `INOUT`, and friends mean
-
-Why this is low-overhead:
+
+`TensorArgType` in the runtime:
-
-- tensor values are already copied into `PTO2TaskPayload`
-- scalars are already copied into `PTO2TaskPayload`
-- tags are much smaller than copying `Arg` again
-- the edge list is dense, append-only, and scope-local
+
+- `INPUT`: read-only existing tensor
+- `OUTPUT`: fresh runtime-allocated tensor
+- `INOUT`: read existing state and publish a new state
+- `OUTPUT_EXISTING`: write-only existing tensor
+- `NO_DEP`: existing tensor with no TensorMap dependency work and no publish
-
-The design should prefer a packed tag stream plus a dense edge stream over storing duplicated tensor refs or explicit user-marked external tensors.
+### Behavior matrix
-
-That gives a manual pre-publish state:
+
+| Arg kind | Manual-local tensor | Boundary tensor |
+| --- | --- | --- |
+| `INPUT` | no creator retention, no TensorMap lookup, requires explicit manual edge | creator retention; TensorMap lookup unless `manual_dep=true` |
+| `OUTPUT` | fresh local tensor; later same-scope uses rely on explicit manual edges | not applicable |
+| `INOUT` | no TensorMap lookup/insert, requires explicit manual edge | creator retention; TensorMap lookup for incoming state; TensorMap insert for outgoing state unless `manual_dep=true` |
+| `OUTPUT_EXISTING` | no TensorMap insert, requires explicit manual edge if later reused in scope | creator retention; TensorMap insert for outgoing state unless `manual_dep=true` |
+| `NO_DEP` | creator-only object passing, no publish | same |
-
-- task records and task ids already exist
-- explicit edges are only recorded, not yet wired into scheduler fanin/fanout
-- external TensorMap-derived producers are already discovered and cached
-- cached external producers are retained so deferred publish cannot attach to a reused slot
-- outer writes are already reflected in TensorMap frontier state
-- the task is still unpublished as executable scheduler work because the publish barrier is not yet released
+
+### `manual_dep=true` still matters
-
-### What scope_end should do in manual scope
+
+`Tensor::manual_dep` keeps its existing meaning:
-
-Manual `scope_end` needs one additional finalize-and-publish step before the existing lifetime-release step completes.
+
+- skip TensorMap lookup/insert
+- keep creator-only retention via `owner_task_id`
-
-Recommended sequence:
+
+This is orthogonal to manual scope mode. It is a per-tensor override, not a
+replacement for scoped manual dependency wiring.
-
-1. For every task directly owned by this manual scope:
-   - realize recorded explicit `add_dependency` edges into scheduler fanin/fanout state
-   - start from the external producer set cached during submit
-   - dedup explicit same-scope edges against those cached external producers
-   - realize the final deduped producer set into scheduler fanin/fanout state
-2. After all dependency realization is complete for the scope:
-   - release the publish barrier by incrementing `fanin_refcount`
-   - if `fanin_refcount == fanin_count`, transition to `READY` and push to ready queue
-   - otherwise keep the task in published `PENDING` state so later producer completion can resolve it
-3. Release the scope lifetime reference exactly as current `on_scope_end` does
+
+## Submit-Time Algorithm
-
-This can be implemented as a manual-scope finalize path in the orchestrator plus a small scheduler helper for the publish-barrier release.
+
+Current implementation is in
+`src/a2a3/runtime/tensormap_and_ringbuffer/runtime/pto_orchestrator.cpp`.
-
-Example helper shape:
+
+For a manual submit:
-
-```cpp
-void publish_manual_scope_tasks(PTO2TaskSlotState** task_slot_states, int32_t count);
-```
+
+1. Allocate the task slot, payload, and task id immediately.
+2. Classify each tensor arg as manual-local or boundary.
+3. Build `manual_local_mask` for same-scope tensors.
+4. Decide whether TensorMap sync is needed at all:
+   - if every relevant arg is manual-local or `manual_dep=true`, skip sync
+   - otherwise run the normal TensorMap sync
+5. For each non-`OUTPUT` arg that is not manual-local:
+   - always do creator retention from `owner_task_id`
+   - for `INPUT` and `INOUT`, do TensorMap lookup unless `manual_dep=true`
+6. For `INOUT` and `OUTPUT_EXISTING` boundary args:
+   - update TensorMap frontier unless `manual_dep=true`
+7. Initialize scheduler state, but keep the task unpublished behind a deferred
+   publish barrier.
-
-This helper should reuse the existing ready-transition logic as much as possible.
+Important consequence: -### How external dependency replay works +- cross-scope dependency discovery is paid at submit time +- same-scope dependency discovery is not replayed from tensors later -Manual submit should discover external dependencies in original submit order, using: +## `scope_end` Algorithm -- `scope_tasks[]` for task order -- `manual_task_meta[]` for packed tags and edge ranges -- `PTO2TaskPayload::tensors[]` for actual tensor values +Manual `scope_end` is now intentionally small. -For each task in that order during submit: +It does not replay explicit same-scope edges from a separate side buffer +anymore. Those edges are already realized when `pto2_rt_add_dependency(...)` +is called. -1. Decode tensor tags from `packed_tags`. -2. For each tensor arg: - - if `owner_task_id` belongs to the current manual scope's owned task set, treat it as manual-local and skip TensorMap logic - - otherwise treat it as outer/external -3. For outer/external tensors: - - apply creator-retention logic from `owner_task_id` - - apply existing TensorMap overlap lookup for `INPUT` / `INOUT` -4. Cache the deduped external producer set in the task payload. -5. After lookup for this task: - - apply normal TensorMap insertion for outer writes (`INOUT` / `OUTPUT_EXISTING`) +Current manual `scope_end` does: -This submit order matters: +1. validate `fanin_actual_count` +2. repair a monotonic `dep_pool_mark` prefix +3. batch-publish the scope tasks to the scheduler +4. perform the normal scope lifetime release -- it preserves current tensormap behavior for multiple writes to outer tensors -- earlier outer writes from the same manual scope become visible to later tasks in the same manual scope during submit -- that matches the accepted v1 tradeoff that outer tensors may still induce implicit same-scope TensorMap edges -- it requires the same TensorMap validity synchronization that normal auto submit uses before lookup/insert +That is the key change from the older draft design. 
The old replay-heavy model +is gone. -The split must not be implemented as: +## What Is Maintained -- deferring all lookups and inserts to `scope_end` -- wiring scheduler fanout during manual submit -- counting cached external producers and explicit manual edges independently without one dedup pass at publish time +Current manual mode keeps only the state that is still needed: -Those variants would diverge from current tensormap semantics and are considered incorrect for this design. +- `scope_tasks[]`: ordered list of tasks in the current scope +- `manual_scope_active`: current scope mode +- per-task `fanin_slot_states[]` / `fanin_actual_count` +- normal scheduler `fanin_count`, `fanin_refcount`, `fanout_head` +- `dep_pool_mark` for tail reclamation -### Important case: external dependency already produced before manual publish +Removed from the active design: -For a manual-scope task that reads an outer-scope tensor: +- manual edge replay buffers +- manual task metadata used only for finalize-time dependency reconstruction +- manual-scope dependency re-materialization at `scope_end` -- if the external producer task has already completed when scheduler realization happens at manual `scope_end`, that edge should immediately contribute to `fanin_refcount` -- then manual `scope_end` releases only the publish barrier, and the task may become `READY` immediately +## Why Partial-Manual Was Slow Before -If the external producer has only published its TensorMap frontier but not yet completed: +The bad version paid two costs at once: -- the manual-scope consumer has already cached that producer at submit time and is published at manual `scope_end` -- but it remains in published `PENDING` -- later producer completion notifies fanout and increments `fanin_refcount` -- once `fanin_refcount == fanin_count`, the consumer transitions to `READY` +1. It still did TensorMap-like work for the same-scope region. +2. 
It also paid a serial `scope_end` replay barrier to rebuild dependencies and + publish tasks. -This is the desired hybrid behavior: +That was the worst possible combination: extra submit cost plus extra finalize +cost. -- explicit same-scope dependency replay happens at manual `scope_end`, before publish -- cross-scope dependency discovery already happened at manual submit -- dependency satisfaction is still handled by the normal runtime execution path after publish +The current design removes that double payment: -### Why this is low-risk +- same-scope edges are explicit and immediate +- boundary discovery stays on the TensorMap path +- manual `scope_end` is only a publish barrier, not a dependency replay pass -- no change to ready queue implementation -- no change to worker dispatch loop -- no change to normal TensorMap scope behavior -- no need for a new scheduler task state -- reuse the existing `fanin_count` / `fanin_refcount` / `PENDING -> READY` transition model +## Zero-Overhead Target -The main new behavior is submit-time boundary discovery plus deferred release of explicit same-scope publish for manual-scope tasks. +The zero-overhead target here means: -## Current-Manual-Scope Ownership Without Tensor Changes +- no extra cost on `PTO2ScopeMode::AUTO` +- no extra TensorMap work for manual-local traffic +- no second dependency engine after publish -To decide whether a tensor is manual-local or outer-visible, the orchestrator only needs to know whether its `owner_task_id` belongs to the current manual scope. 
+What manual mode is allowed to cost: -Recommended minimal design: +- explicit dependency calls that the example asked for +- one deferred publish barrier at `scope_end` +- boundary TensorMap work only when the task actually crosses scope boundaries -- keep `Tensor` unchanged -- use `Tensor.owner_task_id` as the provenance link -- keep a scope-local registry of task ids created in the current manual scope +## Example Requirements -A good low-risk implementation is to reuse the existing flat `scope_tasks` buffer plus a parallel manual-edge buffer, rather than widening hot structs unnecessarily. +Manual mode only helps when the example exposes a same-scope producer/consumer +chain that TensorMap would otherwise rediscover. -Classification then becomes: +For paged attention, the profitable chain is: -```cpp -if (!tensor.owner_task_id.is_valid()) { - // external tensor with no producer task -} else { - manual_local = current_manual_scope_owns(tensor.owner_task_id); -} +```text +qk_matmul -> softmax_prepare -> pv_matmul -> online_update ``` -## Lifecycle Clarification - -This design needs precise task-lifecycle terms: +Inside a manual scope: -- `COMPLETED`: task execution has finished; produced tensor data is semantically ready -- `CONSUMED` / reclaimed: all fanout references and the owning-scope reference have been released, so the task slot may be reused and `last_task_alive` may advance -- tensor readiness: data-level concept, typically tied to producer task completion +- intermediate tensors in that chain should stay manual-local +- explicit edges should connect those tasks directly +- outer tensors such as the external KV cache and the final output still keep + boundary semantics -This matters for deferred manual `scope_end` wiring: +If an example keeps using boundary tensors everywhere, manual mode cannot +remove much runtime work. 
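The manual-local versus boundary classification described in this document can be fleshed out as a minimal, self-contained sketch. `ManualScopeFrame`, `classify`, and the linear membership scan are illustrative assumptions, not the runtime's real structures; the real implementation would key off the flat `scope_tasks` buffer and `Tensor.owner_task_id` exactly as described above.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Illustrative stand-ins; the real runtime keys off Tensor::owner_task_id
// and the flat scope_tasks buffer described in this document.
using TaskId = int32_t;
constexpr TaskId kInvalidTask = -1;

struct ManualScopeFrame {
    std::vector<TaskId> scope_tasks;  // task ids created in this manual scope

    bool owns(TaskId id) const {
        return std::find(scope_tasks.begin(), scope_tasks.end(), id) !=
               scope_tasks.end();
    }
};

enum class TensorClass { EXTERNAL_NO_PRODUCER, MANUAL_LOCAL, BOUNDARY };

// Classification is driven by task provenance, not by a tensor-side flag.
TensorClass classify(TaskId owner_task_id, const ManualScopeFrame& scope) {
    if (owner_task_id == kInvalidTask) return TensorClass::EXTERNAL_NO_PRODUCER;
    return scope.owns(owner_task_id) ? TensorClass::MANUAL_LOCAL
                                     : TensorClass::BOUNDARY;
}
```

Keeping the check on task provenance means views and aliases classify correctly for free, as long as they inherit `owner_task_id`.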
-- an outer-scope producer task may already be `COMPLETED` before the inner manual scope ends -- that is fine, and the manual finalize path should treat it as an already-satisfied dependency -- what must remain true is that the producer task has not yet reached the reclaimed / slot-reusable state before the inner manual `scope_end` +## Benchmark Enablement -Why this is expected to hold: +Current branch benchmark entrypoints: -- tasks created in the current manual scope are still protected by the current manual scope reference until manual `scope_end` -- tasks created in an outer still-active scope may complete early, but the outer scope still holds their scope reference until that outer scope ends -- therefore an inner manual scope can still rely on the producer state already discovered and retained during manual submit when it finalizes - -This does not mean the producer task is still runnable or incomplete. -It may already be `COMPLETED`; the manual finalize path should then treat it as an already-satisfied dependency. - -So the safety argument is not "outer producers cannot complete early". The correct statement is: - -- outer producers may complete before inner manual `scope_end` -- they should still remain alive enough to be discoverable until the deferred boundary wiring for that inner manual scope has finished - -## External Dependency Publication In Manual Scope - -The spec needs two explicit rules here. 
- -### External reads - -A task inside manual scope that reads an outer-scope tensor: - -- must still collect the external producer through TensorMap/owner logic -- must cache that dependency during manual submit -- must include that cached dependency in its fanin during manual `scope_end`, before manual batch publish -- must not require the user to restate that outer dependency manually - -### External writes - -A task inside manual scope that writes an outer-scope tensor: - -- must publish its producer frontier to TensorMap during manual submit -- must not publish same-scope temporary tensors into TensorMap -- may still be `PENDING` and unpublished to the scheduler until manual `scope_end` - -This is safe because later outside submissions only need to identify the producer task and wire dependency to it. Actual execution readiness is still controlled by task completion and the scheduler's normal completion path. - -## Manual Dependencies And External Dependencies On The Same Task - -A single task inside manual scope may simultaneously depend on: - -- explicit same-scope manual producers -- external TensorMap-derived producers - -This is supported by the same fanin accounting model. 
- -Example: - -```cpp -PTO2_SCOPE(PTO2ScopeMode::MANUAL) { - t0 = in-scope producer of tmp - t1 = consumer of tmp and outer tensor X - add_dependency(t0, t1) -} +```bash +./tools/benchmark_rounds.sh -d 4 -n 5 -c 6622890 -r aicpu_build_graph +./tools/benchmark_rounds.sh -d 4 -n 5 -c 6622890 -r tensormap_and_ringbuffer +./tools/benchmark_rounds.sh -d 4 -n 5 -c 6622890 -r tensormap_and_ringbuffer_partial_manual ``` -At manual `scope_end`, for `t1`: - -- `t0 -> t1` contributes one explicit manual fanin edge -- outer tensor `X` contributes boundary-derived external fanin edges -- publish barrier contributes one extra deferred fanin unit - -`t1` becomes READY only after: - -- explicit in-scope producers complete -- external producers complete -- manual `scope_end` releases the publish barrier - -That is the intended scheduler behavior. - -## Explicit edge wiring - -`pto2_add_dependency` from `aicpu_build_graph` can be reused conceptually, but manual scope should not wire scheduler fanin/fanout immediately. - -Recommended behavior inside manual scope: - -- validate that both task ids belong to the current manual scope -- record the edge in a scope-local manual-edge buffer -- do not increment `fanin_count` yet -- do not mutate producer `fanout_head` yet - -Then at manual `scope_end`: - -- realize each recorded edge into the scheduler's existing fanin/fanout structures -- increment `fanin_count` -- record producer in consumer payload for release traversal -- handle the already-completed producer case exactly once, during realization - -This avoids racing live external completion against partially built manual dependency state. - -Important discussion note: - -- the deduped producer set for one consumer must include all sources together: - - explicit manual edges - - creator retention from `owner_task_id` - - TensorMap overlap lookup - -The implementation must not count these sources independently and then wire fanout multiple times for the same producer-consumer pair. 
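The required dedup pass can be sketched as one merge over all three producer sources. The `PendingFanin` struct and function names here are hypothetical; the point is only that each unique producer contributes exactly one fanin unit, no matter how many sources discovered it.

```cpp
#include <cassert>
#include <set>
#include <vector>

using TaskId = int;

// Hypothetical container for one consumer's producer sources at publish time.
struct PendingFanin {
    std::vector<TaskId> explicit_manual_edges;  // from add_dependency
    std::vector<TaskId> creator_retention;      // from owner_task_id
    std::vector<TaskId> tensormap_overlaps;     // boundary discovery at submit
};

// One dedup pass over every source: a producer found both explicitly and via
// TensorMap contributes exactly one fanin unit and one fanout wiring.
std::set<TaskId> deduped_producers(const PendingFanin& f) {
    std::set<TaskId> out(f.explicit_manual_edges.begin(),
                         f.explicit_manual_edges.end());
    out.insert(f.creator_retention.begin(), f.creator_retention.end());
    out.insert(f.tensormap_overlaps.begin(), f.tensormap_overlaps.end());
    return out;
}
```

`fanin_count` would then be incremented once per element of the deduped set, never once per source list.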
- -## Scope-end behavior - -Manual scope changes scheduler publication semantics for tasks inside that scope: - -- tasks in manual scope are batch-published to the scheduler at `scope_end` -- same-scope explicit edges must be fully wired before that publish happens - -Manual scope does not change lifetime release semantics: - -- `scope_end` still releases the owning-scope fanout reference +The old unmodified runtime is intentionally not kept on this branch. To rerun +it side-by-side: -Manual scope also does not change cross-scope readiness semantics: - -- external tensor readiness is still producer-task completion, not `scope_end` -- but external-writer frontier information must already be visible to later TensorMap lookups at manual submit - -This manual-scope behavior intentionally combines: - -- `aicpu_build_graph`-style scope-end batch publish for explicit same-scope dependencies -- `tensormap_and_ringbuffer`-style TensorMap boundary tracking for cross-scope tensors - -## Multiple Writes To Outer Tensors - -This case must be supported in v1. 
- -Example: - -```cpp -PTO2_SCOPE(PTO2ScopeMode::MANUAL) { - t1 writes outer C - t2 writes outer C - add_dependency(t1, t2) -} -outside task reads C +```bash +git worktree add tmp/worktree_unmodified a71ba16 +cd tmp/worktree_unmodified +PTO_ISA_ROOT=../../examples/scripts/_deps/pto-isa \ + ./tools/benchmark_rounds.sh -d 4 -n 5 -c 6622890 -r tensormap_and_ringbuffer_unmodified ``` -Correct behavior: - -- at manual submit, `t1` publishes `C` to TensorMap -- at manual submit, `t2` publishes `C` again to TensorMap -- outside reader should see `t2` as the latest producer frontier -- because `t1 -> t2` is explicit, `t2` completion is a valid readiness frontier for the final visible state -- outer tensors may still create implicit same-scope TensorMap edges inside the manual scope; this is an accepted v1 tradeoff and should be called out in the PR description - -Potential invalid user pattern: - -```cpp -PTO2_SCOPE(PTO2ScopeMode::MANUAL) { - t1 writes outer C - t2 also writes outer C - // missing add_dependency(t1, t2) -} -``` - -This is a user error. The runtime should not try to reconstruct same-scope writer ordering automatically in manual mode. - -## Reads Of Outer Tensors Inside Manual Scope - -Outer tensors read inside manual scope must still seed internal dependencies from existing producer state through TensorMap/owner logic. - -Otherwise: - -- a task inside manual scope may run before the outer producer of its input -- explicit edges inside the scope are insufficient to protect the outer-to-inner boundary - -So manual mode disables only same-scope auto-derivation, not boundary seeding. 
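Submit-time boundary seeding under these rules can be sketched as follows. `TinyTensorMap` and `SubmitArg` are illustrative stand-ins, not the runtime's real overlap search: boundary arguments are seeded from the producer frontier, while manual-local arguments are skipped and rely on explicit edges.

```cpp
#include <cassert>
#include <map>
#include <vector>

using TaskId = int;

// Minimal TensorMap stand-in keyed by a flat tensor id. All names here are
// illustrative assumptions, not the runtime's real overlap search.
struct TinyTensorMap {
    std::map<int, TaskId> frontier;  // tensor id -> latest writer task
};

struct SubmitArg {
    int tensor_id;
    bool manual_local;  // owner task belongs to the current manual scope
};

// Manual-mode submit: boundary arguments are seeded from the TensorMap
// frontier; manual-local arguments rely on explicit edges and are skipped.
std::vector<TaskId> seed_boundary_fanin(const std::vector<SubmitArg>& args,
                                        const TinyTensorMap& tm) {
    std::vector<TaskId> cached;
    for (const SubmitArg& a : args) {
        if (a.manual_local) continue;  // same-scope auto-derivation disabled
        auto it = tm.frontier.find(a.tensor_id);
        if (it != tm.frontier.end()) cached.push_back(it->second);
    }
    return cached;
}
```

The cached producers are only recorded here; scheduler fanin/fanout wiring for them is deferred to manual `scope_end`.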
- -This is a strict requirement: - -- outer read boundary dependency is forced by TensorMap/owner metadata -- orchestration code inside the manual scope must not be required to recreate that outer dependency manually -- even though the consumer task itself is only batch-published to the scheduler at manual `scope_end`, its fanin accounting must include the external TensorMap-derived dependency discovered at submit time - -## Nesting Rules - -Supported: - -- auto scope contains manual scope -- auto scope contains auto scope - -Not supported in v1: - -- manual scope contains manual scope -- manual scope contains any nested scope with its own publish boundary - -Reason: - -- current ring selection depends on scope depth -- the top scope frame is also the publication and lifetime-release boundary -- allowing a child scope inside `PTO2_SCOPE(PTO2ScopeMode::MANUAL)` would split one manual region across multiple scope/ring boundaries unless extra machinery is added -- rejecting nested scopes inside manual mode keeps `current_manual_scope_owns(...)` a simple membership check over one manual frame - -Recommendation: - -- detect this at `scope_begin` -- fail fast with a clear orchestrator error - -Required error text quality: - -- the message must explicitly say that nested scope inside `PTO2_SCOPE(PTO2ScopeMode::MANUAL)` is not supported in v1 -- the message must explicitly say that `manual scope inside manual scope is not supported` -- the message must identify the offending operation as nested `PTO2_SCOPE(PTO2ScopeMode::MANUAL)` -- the message must not use vague wording such as only `invalid scope state` - -## Blocking Cross-Layer Tensor Access - -`get_tensor_data` and `set_tensor_data` are blocking cross-layer access APIs. Their current contract assumes producer state is already published through TensorMap/owner metadata. - -That assumption does not hold inside manual scope because tasks remain unpublished until manual `scope_end`. 
- -So v1 should fail fast: +In this work, direct `run_example.py` reruns were more reliable than the +wrapper for collecting fresh device logs, especially for the old runtime and +for cases where the wrapper suppressed useful failure output. -- `get_tensor_data` inside `PTO2_SCOPE(PTO2ScopeMode::MANUAL)` is an error -- `set_tensor_data` inside `PTO2_SCOPE(PTO2ScopeMode::MANUAL)` is an error +## Fresh Performance Snapshot -Required error text quality: +Device and settings used for the rerun set: -- the message must explicitly say that blocking tensor data access is not supported inside `PTO2_SCOPE(PTO2ScopeMode::MANUAL)` -- the message should tell the user to exit the manual scope first +- device: `4` +- rounds: `5` +- ISA commit: `6622890` -## Diagnostics +### End-to-end / scheduler-side comparison -The runtime should detect and report: +These numbers are the freshest rerun values used in this document. -1. nested scope inside manual mode not supported, with an explicit error message -2. `add_dependency` used with invalid task ids -3. dependency overflow from explicit wiring -4. `get_tensor_data` or `set_tensor_data` called inside manual scope +| Example | Case | `aicpu_build_graph` | old `tensormap*` | new `tensormap*` | `tensormap* + partial_manual` | +| --- | --- | ---: | ---: | ---: | ---: | +| `paged_attention` | `Case1` | `31312.6 us` | `37061.0 us` | `36585.4 us` | `31814.5 us` | +| `paged_attention` | `Case2` | `16474.4 us` | `18589.4 us` | `19348.8 us` | `16221.1 us` | +| `paged_attention_unroll` | `Case1` | `1426.8 us` | `1383.6 us`* | `1322.1 us` | `1327.9 us` | +| `paged_attention_unroll` | `Case2` | `728.5 us` | `668.6 us`* | `623.8 us` | `639.1 us` | -Nice-to-have diagnostics: +`*` The old unmodified unroll runtime logs use the older +`Scheduler summary: total_time=...` format instead of the newer full elapsed +markers. Those two rows therefore use scheduler-summary averages for the old +baseline. 
-- count of explicit edges added in manual scope -- count of cross-scope TensorMap lookups/inserts preserved inside manual scope +### Orchestrator comparison -These are not required for correctness, but will make profiling and debugging practical. +| Example | Case | old `tensormap*` | new `tensormap*` | `tensormap* + partial_manual` | +| --- | --- | ---: | ---: | ---: | +| `paged_attention` | `Case1` | `37060.3 us` | `36584.7 us` | `31657.9 us` | +| `paged_attention` | `Case2` | `18588.6 us` | `19348.2 us` | `15799.4 us` | +| `paged_attention_unroll` | `Case1` | `716.2 us`* | `826.3 us` | `826.0 us` | +| `paged_attention_unroll` | `Case2` | `336.7 us`* | `368.8 us` | `387.6 us` | -## Testing Strategy +`*` Old unmodified unroll uses `orch_func_cost`, not the newer +`orch_end - orch_start` marker. It is still useful for side-by-side direction, +but it is not byte-for-byte the same logging mode. -Add focused coverage before broad workload migration. +Old-baseline rerun logs used in this table: -### Unit-style runtime cases +- `device-1533354_20260409000304311.log` +- `device-1546617_20260409000317313.log` +- `device-1536746_20260409000312312.log` +- `device-1568129_20260409000326314.log` -1. Manual scope diamond on scope-local outputs -- all same-scope edges explicit -- no TensorMap dependence required +## What Helped and What Mattered -2. Manual scope reads outer tensor -- internal first task waits on outer producer frontier +### 1. Collapse manual `scope_end` into publish-only work -3. Manual scope writes outer tensor once -- outside consumer waits on inner writer completion, not `scope_end` - -4. Manual scope writes outer tensor multiple times -- latest writer becomes TensorMap frontier -- correctness depends on explicit same-scope edge wiring -- accepted implicit TensorMap edges on outer tensors are documented - -5. Normal scope containing manual scope -- outer to inner and inner to outer boundary cases both work - -6. 
Nested scope inside manual mode -- rejected with deterministic error - -### Example-level migration - -Use a small example first, such as vector-style or BGEMM-style, to validate: - -- scope-local temp tensors use explicit edges -- outer tensors still behave through TensorMap - -Only then move to more complex orchestration such as paged attention. - -## Fresh Hardware Benchmark - -Fresh benchmark data was rerun on real hardware on 2026-04-08 with: - -- platform: `a2a3` -- device: `3` -- rounds: `10` -- pinned PTO-ISA commit: `6622890` -- runner: `tools/benchmark_rounds.sh` - -The branch-local compared variants are: - -- `aicpu_build_graph` -- `tensormap_and_ringbuffer` -- `tensormap_and_ringbuffer_partial_manual` - -`tensormap_and_ringbuffer` is the current/new AUTO-path runtime under evaluation. -`tensormap_and_ringbuffer_partial_manual` is the same runtime tree, but benchmarked -through the `_partial_manual` paged-attention scenes. - -The untouched PTO2 baseline is no longer kept in this branch. If a comparison against -an unmodified tensormap runtime is needed, create a temporary worktree from the -baseline commit and run the same benchmark script there. 
- -### Benchmark Script Selectors - -The benchmark wrapper enables the variants as follows: - -- `./tools/benchmark_rounds.sh -d 3 -n 10 -r aicpu_build_graph -c 6622890` - - benchmarks `tests/st/a2a3/aicpu_build_graph/paged_attention` - - benchmarks `tests/st/a2a3/aicpu_build_graph/paged_attention_unroll` -- `./tools/benchmark_rounds.sh -d 3 -n 10 -r tensormap_and_ringbuffer -c 6622890` - - benchmarks `tests/st/a2a3/tensormap_and_ringbuffer/paged_attention` - - benchmarks `tests/st/a2a3/tensormap_and_ringbuffer/paged_attention_unroll` -- `./tools/benchmark_rounds.sh -d 3 -n 10 -r tensormap_and_ringbuffer_partial_manual -c 6622890` - - uses the same ST root as `tensormap_and_ringbuffer` - - benchmarks `tests/st/a2a3/tensormap_and_ringbuffer/paged_attention_partial_manual` - - benchmarks `tests/st/a2a3/tensormap_and_ringbuffer/paged_attention_unroll_partial_manual` - -There is no separate runtime named `partial_manual` in the example `kernel_config.py`. -The partial-manual scenes still declare `RUNTIME_CONFIG["runtime"] = -"tensormap_and_ringbuffer"`, and the benchmark wrapper switches to the -`*_partial_manual` scene directories when `-r tensormap_and_ringbuffer_partial_manual` -is selected. - -Similarly, the current/new AUTO-path runtime is enabled directly by `-r -tensormap_and_ringbuffer`. - -### Fresh Results - -Units below are `elapsed_us (orch_us)`. 
- -| Workload | Case | `aicpu_build_graph` | `tensormap_and_ringbuffer` | `tensormap_and_ringbuffer_partial_manual` | -| --- | --- | --- | --- | --- | -| `paged_attention` | `Case1` | `31318.9 (-)` | `36996.3 (36995.6)` | `35187.6 (35030.2)` | -| `paged_attention` | `Case2` | `16844.5 (-)` | `19861.8 (19856.8)` | `18685.5 (18274.5)` | -| `paged_attention_unroll` | `Case1` | `1412.7 (-)` | `1323.9 (831.3)` | `1321.3 (884.4)` | -| `paged_attention_unroll` | `Case2` | `705.5 (-)` | `632.5 (378.9)` | `637.5 (406.4)` | - -### Feature-To-Gain Mapping - -The most important question is which change actually moved performance. - -The rerun below isolates the non-unroll partial-manual example on device `3`. All -three rows already include the same rewritten orchestration structure -(`one manual scope per q_idx`, `AIV_HUB` moved inside manual scope, and explicit -`prev_update_task -> up_outs.task_id` chaining). The only thing changing between rows -is boundary annotation. - -| Optimization step | What changed | `Case1` orch delta | `Case2` orch delta | What it means | -| --- | --- | --- | --- | --- | -| Baseline after rewrite | no `manual_dep` boundary hints | baseline `36791.9 us` | baseline `19792.7 us` | structural rewrite alone is the starting point | -| Add input-side hints | `query`, `key_cache`, `value_cache`, and their views use `manual_dep=true` | `-39.6 us` (`-0.1%`) | `-496.0 us` (`-2.5%`) | minor benefit; input-side TensorMap work is not the main bottleneck | -| Add output-side hints on top | `out` and `out_view` also use `manual_dep=true` | `-1683.6 us` (`-4.6%`) | `-1628.5 us` (`-8.4%`) | major gain; repeated external-output overlap tracking was expensive | -| Full boundary hints vs no hints | inputs + output boundary hints together | `-1723.2 us` (`-4.7%`) | `-2124.5 us` (`-10.7%`) | this is the measurable win that was worth keeping | - -Two conclusions matter: - -1. The kept optimization is not “use `manual_dep` everywhere”. 
- - The measurable gain came mostly from suppressing repeated external-output - TensorMap work on `out` / `out_view`, where same-scope write ordering is already - carried by explicit manual edges. -2. Input-side `manual_dep=true` alone is not enough. - - It helps a little on `Case2`, but almost nothing on `Case1`. - -### Benchmark Takeaways - -1. The non-unroll target is still not met. - - target cell: `paged_attention/Case1` - - `aicpu_build_graph`: `31318.9 us` - - `partial_manual`: `35187.6 us` - - remaining gap: about `+12.4%` - -2. Partial-manual now improves the modified/current tensormap AUTO runtime on both - non-unroll paged-attention cases, but it does not beat `aicpu_build_graph`. - - `paged_attention/Case1`: about `-4.9%` elapsed, about `-5.3%` orch vs current/new - - `paged_attention/Case2`: about `-5.9%` elapsed, about `-8.0%` orch vs current/new - -3. On the unroll scene, both tensormap-family runtimes remain faster than - `aicpu_build_graph` end-to-end, but partial-manual stays slightly worse than the - AUTO tensormap path in orch time. - - `paged_attention_unroll/Case1`: `884.4 us` orch vs `831.3 us` current/new - - `paged_attention_unroll/Case2`: `406.4 us` orch vs `378.9 us` current/new - -4. The remaining performance problem is still concentrated in the non-unroll - partial-manual path, especially the replay/publish cost paid at manual `scope_end`. - -### Boundary Annotation Note - -There is still no explicit “scope arguments” API in -`src/a2a3/runtime/tensormap_and_ringbuffer/orchestration/pto_orchestration_api.h`. - -The closest current mechanism is per-tensor `manual_dep=true` on -`make_tensor_external(...)` and derived `view(...)` objects. 
That mechanism is not a -good substitute for scope-boundary declaration: - -- it is tensor-local, not scope-local -- it suppresses TensorMap lookup/insert for that tensor -- if used carelessly, it can hide an output frontier that boundary semantics still need - -For the committed non-unroll partial-manual paged-attention example, the stable -improvement came from two pieces together: - -- the orchestration rewrite already present in the example: - - one manual scope per `q_idx` - - move `AIV_HUB` creation into the manual scope - - add an explicit `prev_update_task -> up_outs.task_id` chain -- explicit `manual_dep=true` boundary hints on the external inputs and external output - views in the example itself - -Fresh device-3 measurements for the non-unroll partial-manual example were: - -| Variant | `paged_attention/Case1` orch | `paged_attention/Case2` orch | -| --- | --- | --- | -| rewrite only, no boundary hints | `36791.9 us` | `19792.7 us` | -| rewrite + input-side hints | `36752.3 us` | `19296.7 us` | -| rewrite + input/output hints | `35068.7 us` | `17668.2 us` | +This is the main fix for non-unroll paged attention. -So the important hint is not just the external producers (`query`, `key_cache`, -`value_cache`), but also the external consumer path through `out` / `out_view`. 
-That is consistent with the current runtime behavior: +Current effect against the new automatic TensorMap runtime: -- `manual_dep=true` skips TensorMap overlap lookup/insert but still keeps creator - retention through `owner_task_id` -- the explicit `prev_update_task` chain already serializes same-scope `ONLINE_UPDATE` - writes -- marking `out` / `out_view` `manual_dep=true` avoids paying repeated external-output - overlap tracking on every block update when that ordering is already explicit +- `paged_attention Case1`: `36584.7 us -> 31657.9 us` orch +- `paged_attention Case2`: `19348.2 us -> 15799.4 us` orch -Why this was kept even though `manual_dep` is not the core semantics: - -- manual scope still uses TensorMap for general cross-scope correctness -- this example already has explicit same-scope write ordering for `ONLINE_UPDATE` -- there is no same-scope external consumer that needs `out` / `out_view` to stay on - TensorMap before manual `scope_end` -- so suppressing repeated output overlap tracking is a valid example-level - optimization, not a change to the runtime's semantic model - -This is still not a general “scope arguments” API. It is an example-local optimization -that is only safe when the manual scope already carries the same-scope write ordering -explicitly and there is no same-scope external consumer that depends on TensorMap -publication before manual `scope_end`. - -## Main Risks - -1. Treating manual scope as a global TensorMap disable switch. -- This breaks cross-scope correctness. - -2. Using `Tensor::manual_dep` as the only signal. -- Scoped semantics should be driven by current manual-scope ownership, not by the tensor flag alone. - -3. Failing to force outer-scope reads through TensorMap/owner dependency seeding. -- This allows manual-scope tasks to read before the outer producer frontier is ready. - -4. Confusing scheduler batch publication with tensor readiness semantics. 
-- Manual-scope tasks should be scheduler-visible at `scope_end`, but external tensor readiness is still producer completion. +This is the difference between “manual mode is the worst case” and +“manual mode is in the same band as `aicpu_build_graph`”. -5. Letting cross-scope writer frontier become visible only after producer completion. -- This is too late for later outside submissions made after manual `scope_end`. +### 2. Skip TensorMap sync/lookup/insert for fully manual-local traffic -6. Wiring external producers into scheduler fanout during manual submit. -- This can let unpublished tasks become runnable before `scope_end`. - -7. Publishing external writer frontier later than manual submit. -- This makes later boundary lookups see stale producer state and diverges from current tensormap semantics for multiple writes. - -8. Missing a final dedup pass between cached external producers and explicit manual edges. -- This double-counts fanin and can over-release dependencies. +Manual mode now checks whether a submit actually touches a boundary tensor. +If not, it skips TensorMap sync entirely. -9. Missing alias/view inheritance of scope ownership. -- This causes wrong same-scope vs cross-scope classification. +This matters most when the example keeps intermediates local to the manual +scope. In the unrolled example, both automatic and partial-manual runtimes are +already in the sub-millisecond range, which shows that the remaining cost is no +longer dominated by boundary TensorMap work. -10. Turning this feature into a broad runtime refactor. -- This increases regression risk and violates the required change scope. - -11. Allowing blocking cross-layer tensor access inside manual scope. -- `get_tensor_data` and `set_tensor_data` assume published producer state and should fail fast in manual scope. - -12. Replacing the existing scheduler edge machinery with a separate manual execution path. 
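The boundary check can be as small as a predicate over the submitted task's arguments. This is an illustrative sketch, not the runtime's real fast path:

```cpp
#include <cassert>
#include <vector>

// Illustrative zero-overhead gate: a manual-mode submit pays TensorMap sync
// only when at least one argument crosses the scope boundary.
struct ArgInfo {
    bool manual_local;
};

bool needs_tensormap_sync(const std::vector<ArgInfo>& args) {
    for (const ArgInfo& a : args) {
        if (!a.manual_local) return true;  // boundary tensor present
    }
    return false;  // fully manual-local traffic: skip sync entirely
}
```

Because the predicate is evaluated per submit, a manual scope whose intermediates stay local never touches the TensorMap on its hot path.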
-- This would duplicate fanin/fanout handling, completion notification, and release traversal. -- The design requires one unified post-publish scheduler mechanism. - -13. Using `manual_dep=true` as a blanket scope-boundary annotation. -- This can suppress TensorMap work that is still required for cross-scope correctness. -- It is only safe as a narrowly-scoped example optimization when the same-scope - ordering is already explicit and no early external consumer needs TensorMap - publication for that tensor. - -## Dangerous Risks For The Submit/Scope-End Split - -The implementation should explicitly guard the following failure modes before any -performance tuning claims are accepted. - -1. Early-ready bug from submit-time scheduler mutation. -- Manual submit may discover external producers early, but it must not mutate - producer `fanout_head` or consumer ready state early. -- Required safeguard: manual submit may cache producer slot states only. - -2. Stale frontier bug for outer writes. -- If outer `INOUT` / `OUTPUT_EXISTING` writes stay deferred until `scope_end`, - later submissions can miss the newest writer frontier. -- Required safeguard: publish TensorMap frontier at manual submit in original - task order. - -3. Double-accounting bug across cached external fanins and explicit manual edges. -- One producer may be found both through boundary discovery and through an - explicit edge. -- Required safeguard: publish-time fanin construction must run one dedup pass - over both sources before incrementing `fanin_count` or wiring fanout. - -4. Completed-before-publish bug. -- An external producer may already be `COMPLETED` when the manual scope reaches - `scope_end`. -- Required safeguard: publish-time scheduler wiring must detect already-finished - producers and credit `fanin_refcount` exactly once. +### 3. Keep boundary correctness on the normal path -5. Producer-lifetime bug for cached external fanins. 
-- A cached producer that is not retained may reach `CONSUMED` and have its slot - reused before the manual scope publishes. -- Required safeguard: manual submit must take a real retained reference on each - unique cached producer, and consumer release must drop that same reference. +Boundary reads and writes still use: -6. Scope-abort visibility bug for submit-time outer writes. -- If manual submit mutates TensorMap for outer writes and the scope later fails, - global TensorMap state can point at unpublished internal writers. -- Required safeguard: treat post-submit fatal paths as terminal for the runtime, - and keep the implementation free of late scope validation after submit-time - TensorMap mutation. +- creator retention +- TensorMap overlap lookup +- TensorMap frontier publish -7. Wrong manual-local classification for aliases and views. -- Boundary discovery must skip TensorMap only for tensors whose - `owner_task_id` belongs to the current manual scope, including derived views. -- Required safeguard: keep classification on task provenance, not on a new - tensor-side mode bit. +This does not make manual mode faster by itself. It is the correctness guard +that prevents stale external state and wrong cross-scope dependencies. -## Recommended Implementation Order +### 4. Example structure still dominates the ceiling -1. Add API surface for `add_dependency` and manual scope mode. -2. Add manual-submit APIs with `_manual` suffix returning task ids plus outputs. -3. Add scope-frame mode plus scope-local manual-edge storage. -4. Implement submit-time outer-tensor TensorMap lookup/insert with cached external fanins. -5. Keep manual `scope_end` TensorMap-free and realize only explicit same-scope edges plus scheduler publish. -6. Implement manual-local tensor classification from `owner_task_id` plus current manual-scope ownership. -7. Add fail-fast nested-scope-in-manual check and block `get_tensor_data` / `set_tensor_data` in manual scope. -8. 
Add targeted tests for boundary semantics. -9. Migrate one example and validate. +`paged_attention_unroll` already reduces submit pressure aggressively. Because +that example exposes less repeated same-scope dependency work, partial-manual +does not beat the best automatic/unmodified paths there. -## Open Question Resolved +The no-unroll `paged_attention` case is where partial-manual matters most, and +that is also the target case where it now tracks `aicpu_build_graph`. -This design intentionally resolves the central ambiguity: +## Clear Conclusions -- `scope_end` controls lifetime release -- task completion controls semantic readiness +1. Manual scope correctness is preserved by keeping cross-scope tensors on the + normal TensorMap path. +2. Manual scope performance only improves when the example exposes a real + same-scope chain that can stay manual-local. +3. The replay-heavy finalize model was the wrong design. The current + submit-time wiring plus publish-only `scope_end` is the right direction. +4. On non-unroll paged attention, `partial_manual` now matches the intended + performance band: close to `aicpu_build_graph`, clearly better than both old + and current automatic TensorMap runtimes. -For outer tensors written inside manual scope, TensorMap frontier publication happens at -manual submit, while semantic readiness is still producer-task completion. 
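The two-clock split can be modeled in a few lines. `World` and the function names below are hypothetical; the sketch only demonstrates that frontier publication at manual submit and readiness at producer completion are independent events.

```cpp
#include <cassert>
#include <map>

using TaskId = int;

// Toy model of the two separated clocks. `World` and the function names are
// hypothetical; only the event ordering mirrors the design.
struct World {
    std::map<int, TaskId> tensormap_frontier;  // tensor id -> latest writer
    std::map<TaskId, bool> completed;          // task id -> COMPLETED?
};

// Manual submit of a task writing outer tensor `tid`: the frontier is
// published immediately, even though the task is not yet scheduler-published.
void manual_submit_outer_write(World& w, TaskId writer, int tid) {
    w.tensormap_frontier[tid] = writer;
    w.completed[writer] = false;
}

// A later outside consumer sees the latest writer; its readiness tracks that
// writer's completion, never a scope_end event.
bool outside_read_ready(const World& w, int tid) {
    auto it = w.tensormap_frontier.find(tid);
    if (it == w.tensormap_frontier.end()) return true;  // no producer to wait on
    return w.completed.at(it->second);
}
```

Lifetime release at `scope_end` would be a third, separate event in this model; it never appears in the readiness check.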
+## Remaining Risks -## File Areas Expected To Change - -- `src/a2a3/runtime/tensormap_and_ringbuffer/orchestration/pto_orchestration_api.h` -- `src/a2a3/runtime/tensormap_and_ringbuffer/runtime/pto_runtime2.h` -- `src/a2a3/runtime/tensormap_and_ringbuffer/runtime/pto_runtime2.cpp` -- `src/a2a3/runtime/tensormap_and_ringbuffer/runtime/pto_orchestrator.cpp` -- `src/a2a3/runtime/tensormap_and_ringbuffer/runtime/pto_runtime2_types.h` -- `src/a2a3/runtime/tensormap_and_ringbuffer/runtime/pto_scheduler.h` -- docs and examples/tests needed to demonstrate the new scoped behavior - -## Recommendation Summary - -Implement manual dependency as a scope-local override inside `tensormap_and_ringbuffer`, not as a runtime-wide replacement of TensorMap: - -- tensors created in the current manual scope: explicit `add_dependency` -- outer tensors: existing TensorMap path -- TensorMap boundary realization for manual scopes: manual submit -- semantic readiness of outer writes: writer completion -- lifetime release: `scope_end` - -That is the smallest design that satisfies the requested model without breaking the core tensormap runtime semantics. 
+- manual scopes are still single-level only +- old unmodified unroll comparisons rely on older log markers +- explicit dependency misuse is still a fatal orchestration error by design +- the runtime still depends on the example to choose good manual-scope + boundaries; bad example structure can erase the benefit diff --git a/src/a2a3/runtime/tensormap_and_ringbuffer/runtime/pto_orchestrator.cpp b/src/a2a3/runtime/tensormap_and_ringbuffer/runtime/pto_orchestrator.cpp index 1260180c7..adc610b44 100644 --- a/src/a2a3/runtime/tensormap_and_ringbuffer/runtime/pto_orchestrator.cpp +++ b/src/a2a3/runtime/tensormap_and_ringbuffer/runtime/pto_orchestrator.cpp @@ -407,7 +407,6 @@ bool pto2_orchestrator_init( orch->scope_stack_top = -1; orch->scope_stack_capacity = max_depth; orch->manual_scope_active = false; - memset(orch->manual_dep_pool_reserve, 0, sizeof(orch->manual_dep_pool_reserve)); return true; } diff --git a/src/a2a3/runtime/tensormap_and_ringbuffer/runtime/pto_orchestrator.h b/src/a2a3/runtime/tensormap_and_ringbuffer/runtime/pto_orchestrator.h index 9f70d8ed3..e3297ada2 100644 --- a/src/a2a3/runtime/tensormap_and_ringbuffer/runtime/pto_orchestrator.h +++ b/src/a2a3/runtime/tensormap_and_ringbuffer/runtime/pto_orchestrator.h @@ -89,15 +89,11 @@ struct PTO2OrchestratorState { // Cross-thread notification uses shared memory orch_error_code (atomic) bool fatal; - // === MANUAL-SCOPE STATE === - int32_t manual_dep_pool_reserve[PTO2_MAX_RING_DEPTH]; - // Hidden alloc tasks complete synchronously inside the orchestrator and // therefore bypass the executor's normal worker-completion counter path. // The executor adds this count into its completed_tasks_ progress counter // after orchestration finishes so shutdown/profiling totals remain closed. 
int64_t inline_completed_tasks{0}; - // === STATISTICS === #if PTO2_PROFILING int64_t tasks_submitted; From aed352609ce8ae99e5d4036f49fb13e51bfd1c43 Mon Sep 17 00:00:00 2001 From: Youwei Xiao Date: Thu, 9 Apr 2026 00:22:32 +0800 Subject: [PATCH 31/35] Update: align manual dependency doc with fresh matrix - replace stale replay-heavy scope_end description with the current submit-time wiring model - document how AUTO, MANUAL, and benchmark selectors map onto the paged-attention scenes - record the fresh device-6 four-runtime comparison and the observed gains for partial-manual and zero-overhead AUTO --- docs/manual-dep-for-tensormap-design.md | 410 +++++++++++++----------- 1 file changed, 219 insertions(+), 191 deletions(-) diff --git a/docs/manual-dep-for-tensormap-design.md b/docs/manual-dep-for-tensormap-design.md index 5c5c3f619..8e815ab43 100644 --- a/docs/manual-dep-for-tensormap-design.md +++ b/docs/manual-dep-for-tensormap-design.md @@ -5,7 +5,7 @@ Add a scoped manual-dependency mode to `tensormap_and_ringbuffer` without regressing the default automatic path: -- `PTO2_SCOPE()` keeps the existing automatic mode +- `PTO2_SCOPE()` stays in automatic mode - `PTO2_SCOPE(PTO2ScopeMode::MANUAL)` enables scoped manual dependency wiring - same-manual-scope edges use explicit `pto2_rt_add_dependency(...)` - cross-scope edges still use `owner_task_id` and TensorMap discovery @@ -14,12 +14,16 @@ This is a hybrid model, not a port of `aicpu_build_graph`.
## API Surface -The orchestration API now uses an enum instead of the old boolean-style scope -switch: +The orchestration-facing API is: ```cpp +enum class PTO2ScopeMode : uint8_t { + AUTO = 0, + MANUAL = 1, +}; + PTO2_SCOPE() { - // default: PTO2ScopeMode::AUTO + // default: AUTO } PTO2_SCOPE(PTO2ScopeMode::MANUAL) { @@ -29,44 +33,14 @@ PTO2_SCOPE(PTO2ScopeMode::MANUAL) { } ``` -Current modes: - -- `PTO2ScopeMode::AUTO` -- `PTO2ScopeMode::MANUAL` - Current restrictions: -- manual scope cannot be nested inside another manual scope -- manual submit APIs are only valid inside `PTO2_SCOPE(PTO2ScopeMode::MANUAL)` +- manual submit APIs are only valid inside + `PTO2_SCOPE(PTO2ScopeMode::MANUAL)` - `pto2_rt_add_dependency(...)` requires both tasks to belong to the current manual scope - -## Current Design - -### High-level rule - -Manual mode only changes same-scope dependency discovery. - -- in-scope tensors: use explicit task edges -- cross-scope tensors: keep TensorMap semantics -- scope-local lifetime model: unchanged -- scheduler execution model after publish: unchanged - -### Why this split exists - -We need two properties at the same time: - -1. Manual scopes must be able to skip TensorMap work for same-scope chains. -2. Outer producers and outer consumers must still see the correct frontier. - -If we disabled TensorMap for everything inside a manual scope, cross-scope -reads and writes would become incorrect. If we kept TensorMap for everything, -manual mode would not remove the overhead we care about. - -The chosen split is: - -- same-scope manual-local traffic: explicit edges only -- boundary traffic: existing creator-retention + TensorMap lookup/insert +- nested scope inside manual scope is rejected in v1 +- blocking tensor access helpers are rejected inside manual scope ## Dependency Semantics @@ -81,131 +55,162 @@ Each tensor argument is classified at submit time: Manual-local tensors skip TensorMap entirely. 
Boundary tensors stay on the normal TensorMap path unless `manual_dep=true`. -### What `INPUT`, `OUTPUT`, `INOUT`, and friends mean +### `INPUT`, `OUTPUT`, `INOUT`, and friends -`TensorArgType` in the runtime: +`TensorArgType` behavior in the runtime: -- `INPUT`: read-only existing tensor -- `OUTPUT`: fresh runtime-allocated tensor -- `INOUT`: read existing state and publish a new state -- `OUTPUT_EXISTING`: write-only existing tensor -- `NO_DEP`: existing tensor with no TensorMap dependency work and no publish +| Arg kind | Meaning | Incoming dependency work | Outgoing frontier work | +| --- | --- | --- | --- | +| `INPUT` | existing tensor, read-only | creator retention, plus TensorMap lookup unless skipped | none | +| `OUTPUT` | fresh runtime-allocated tensor | none | no TensorMap insert at creation; `owner_task_id` is stamped on the produced tensor | +| `INOUT` | existing tensor, read + write | creator retention, plus TensorMap lookup unless skipped | TensorMap insert unless skipped | +| `OUTPUT_EXISTING` | existing tensor, write-only | creator retention only | TensorMap insert unless skipped | +| `NO_DEP` | existing tensor, creator-retention-only | creator retention only | none | -### Behavior matrix +### Manual-local vs boundary behavior | Arg kind | Manual-local tensor | Boundary tensor | | --- | --- | --- | -| `INPUT` | no creator retention, no TensorMap lookup, requires explicit manual edge | creator retention; TensorMap lookup unless `manual_dep=true` | +| `INPUT` | no TensorMap lookup, requires explicit manual edge | creator retention; TensorMap lookup unless `manual_dep=true` | | `OUTPUT` | fresh local tensor; later same-scope uses rely on explicit manual edges | not applicable | | `INOUT` | no TensorMap lookup/insert, requires explicit manual edge | creator retention; TensorMap lookup for incoming state; TensorMap insert for outgoing state unless `manual_dep=true` | | `OUTPUT_EXISTING` | no TensorMap insert, requires explicit manual edge if later reused 
in scope | creator retention; TensorMap insert for outgoing state unless `manual_dep=true` | | `NO_DEP` | creator-only object passing, no publish | same | -### `manual_dep=true` still matters +### `manual_dep=true` `Tensor::manual_dep` keeps its existing meaning: - skip TensorMap lookup/insert - keep creator-only retention via `owner_task_id` -This is orthogonal to manual scope mode. It is a per-tensor override, not a -replacement for scoped manual dependency wiring. +It is a per-tensor optimization hint. It is not the core manual-scope +mechanism. + +## Runtime Model -## Submit-Time Algorithm +### High-level flow + +```text +PTO2_SCOPE(MANUAL) + | + v + submit_*_manual() + | + +-- classify tensor args + | |- manual-local -> no TensorMap + | `- boundary -> owner retention + optional TensorMap + | + +-- allocate slot / payload / outputs + | + +-- wire boundary producers immediately + | `- keep one extra fanin publish barrier + | + `-- return { task_id, outputs } + | + v + pto2_rt_add_dependency() + | + `-- wire same-scope producer -> consumer immediately + +scope_end() + | + +-- validate fanin bounds + +-- repair monotonic dep_pool_mark prefix + +-- release publish barrier and batch-publish tasks + `-- do normal scope lifetime release +``` + +### What manual submit iterates Current implementation is in `src/a2a3/runtime/tensormap_and_ringbuffer/runtime/pto_orchestrator.cpp`. For a manual submit: -1. Allocate the task slot, payload, and task id immediately. -2. Classify each tensor arg as manual-local or boundary. -3. Build `manual_local_mask` for same-scope tensors. -4. Decide whether TensorMap sync is needed at all: +1. allocate the task slot, payload, and task id immediately +2. classify each tensor arg as manual-local or boundary +3. build `manual_local_mask` for same-scope tensors +4. decide whether TensorMap sync is needed at all - if every relevant arg is manual-local or `manual_dep=true`, skip sync - otherwise run the normal TensorMap sync -5. 
For each non-`OUTPUT` arg that is not manual-local: +5. for each non-`OUTPUT` arg that is not manual-local - always do creator retention from `owner_task_id` - for `INPUT` and `INOUT`, do TensorMap lookup unless `manual_dep=true` -6. For `INOUT` and `OUTPUT_EXISTING` boundary args: +6. for `INOUT` and `OUTPUT_EXISTING` boundary args - update TensorMap frontier unless `manual_dep=true` -7. Initialize scheduler state, but keep the task unpublished behind a deferred - publish barrier. +7. initialize scheduler state, but keep the task unpublished behind a deferred + publish barrier Important consequence: -- cross-scope dependency discovery is paid at submit time -- same-scope dependency discovery is not replayed from tensors later +- cross-scope dependency discovery is still paid at submit time +- same-scope dependency discovery is no longer replayed from tensors later -## `scope_end` Algorithm +### What `pto2_rt_add_dependency(...)` does now -Manual `scope_end` is now intentionally small. +This is the key difference from the older design draft. -It does not replay explicit same-scope edges from a separate side buffer -anymore. Those edges are already realized when `pto2_rt_add_dependency(...)` -is called. +`pto2_rt_add_dependency(...)` no longer records an edge for replay at +`scope_end()`. It validates both task ids belong to the current manual scope, +dedups against the consumer payload, ensures dep-pool space, and wires the edge +immediately: -Current manual `scope_end` does: +- increments producer `fanout_count` +- prepends the consumer into the producer fanout list +- appends the producer slot state into `payload->fanin_slot_states[]` +- increments consumer `fanin_count` +- updates consumer `dep_pool_mark` -1. validate `fanin_actual_count` -2. repair a monotonic `dep_pool_mark` prefix -3. batch-publish the scope tasks to the scheduler -4. perform the normal scope lifetime release +That removes the old replay-heavy finalize path. 
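The immediate-wiring bookkeeping described above can be sketched with simplified stand-in structures. The field names (`fanout_count`, `fanin_slot_states`, `fanin_count`, `dep_pool_mark`) follow this document, but the types, the `std::vector` storage, and the scope check are illustrative assumptions, not the real `PTO2TaskSlotState`/`PTO2TaskPayload` layout:

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Simplified stand-in for a task slot; field names mirror the doc's wording.
struct TaskSlot {
    int32_t task_id = -1;
    int32_t scope_id = -1;
    int32_t fanin_count = 0;                 // consumer side: producers wired in
    int32_t fanout_count = 0;                // producer side: dependent consumers
    std::vector<int32_t> fanout_list;        // consumer ids, newest first ("prepend")
    std::vector<int32_t> fanin_slot_states;  // producers this task waits on
    int32_t dep_pool_mark = 0;               // watermark for tail reclamation
};

// Immediate same-scope wiring, modeled on pto2_rt_add_dependency:
// validate scope membership, dedup, then wire the edge right away.
bool add_dependency(TaskSlot &producer, TaskSlot &consumer, int32_t current_scope) {
    // Both tasks must belong to the current manual scope.
    if (producer.scope_id != current_scope || consumer.scope_id != current_scope)
        return false;  // a fatal orchestration error in the real runtime

    // Dedup against the consumer payload: the edge may already exist.
    for (int32_t slot : consumer.fanin_slot_states)
        if (slot == producer.task_id) return true;

    producer.fanout_count++;
    producer.fanout_list.insert(producer.fanout_list.begin(), consumer.task_id);
    consumer.fanin_slot_states.push_back(producer.task_id);
    consumer.fanin_count++;
    // Explicit edges can touch an older consumer after newer tasks were
    // submitted, which is why scope_end must repair a monotonic prefix.
    consumer.dep_pool_mark++;
    return true;
}
```

The point of the sketch is that the edge is fully realized at call time; nothing is buffered for a later replay pass.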
-That is the key change from the older draft design. The old replay-heavy model -is gone. +### What `scope_end()` does now -## What Is Maintained +Manual `scope_end()` is now intentionally small and TensorMap-free. -Current manual mode keeps only the state that is still needed: +It only: -- `scope_tasks[]`: ordered list of tasks in the current scope -- `manual_scope_active`: current scope mode -- per-task `fanin_slot_states[]` / `fanin_actual_count` -- normal scheduler `fanin_count`, `fanin_refcount`, `fanout_head` -- `dep_pool_mark` for tail reclamation +1. validates `fanin_actual_count` +2. repairs a monotonic `dep_pool_mark` prefix +3. calls `publish_manual_scope_tasks_and_end_scope(...)` +4. performs the normal scope lifetime release -Removed from the active design: +There is no explicit-edge replay at `scope_end()` anymore. -- manual edge replay buffers -- manual task metadata used only for finalize-time dependency reconstruction -- manual-scope dependency re-materialization at `scope_end` +## Why This Split Is Correct -## Why Partial-Manual Was Slow Before +### Cross-scope correctness -The bad version paid two costs at once: +Cross-scope tensors still need TensorMap because the runtime must preserve: -1. It still did TensorMap-like work for the same-scope region. -2. It also paid a serial `scope_end` replay barrier to rebuild dependencies and - publish tasks. +- latest-writer frontier tracking +- overlap-based modifier discovery +- boundary ordering across scopes -That was the worst possible combination: extra submit cost plus extra finalize -cost. +If manual scope disabled TensorMap globally, outer reads and writes would +become incorrect. 
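A toy model of those two TensorMap duties, using byte ranges as stand-ins for tensor extents. The names `ToyTensorMap`/`FrontierEntry` and the coverage-based eviction rule are assumptions for illustration, not the runtime's actual data structure:

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// One published write on the frontier: a half-open byte range and its writer.
struct FrontierEntry {
    uint64_t begin, end;
    int32_t writer_task;
};

struct ToyTensorMap {
    std::vector<FrontierEntry> frontier;

    // Overlap-based modifier discovery: every frontier writer whose range
    // intersects [begin, end) is an incoming dependency.
    std::vector<int32_t> overlapping_writers(uint64_t begin, uint64_t end) const {
        std::vector<int32_t> deps;
        for (const auto &e : frontier)
            if (e.begin < end && begin < e.end) deps.push_back(e.writer_task);
        return deps;
    }

    // Latest-writer frontier tracking: the new write becomes the frontier for
    // its range, so later readers depend on the actual writer task, not on
    // scope closure. Simplification: only fully-covered entries are evicted.
    void publish(uint64_t begin, uint64_t end, int32_t writer_task) {
        frontier.erase(std::remove_if(frontier.begin(), frontier.end(),
                           [&](const FrontierEntry &e) {
                               return begin <= e.begin && e.end <= end;
                           }),
                       frontier.end());
        frontier.push_back({begin, end, writer_task});
    }
}; 
```

Disabling `publish` inside a manual scope is exactly the failure mode named above: an outer reader would miss the inner writer entirely.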
-The current design removes that double payment: +### Same-scope performance -- same-scope edges are explicit and immediate -- boundary discovery stays on the TensorMap path -- manual `scope_end` is only a publish barrier, not a dependency replay pass +Manual-local tensors are exactly where TensorMap is unnecessary work: -## Zero-Overhead Target +- the producer is already known from the current manual scope +- the ordering can be expressed directly by `pto2_rt_add_dependency(...)` +- replaying those edges at `scope_end()` added serial overhead without adding + correctness -The zero-overhead target here means: +### Zero-overhead AUTO path -- no extra cost on `PTO2ScopeMode::AUTO` -- no extra TensorMap work for manual-local traffic -- no second dependency engine after publish +The manual-scope extension must not slow down the normal AUTO runtime. -What manual mode is allowed to cost: - -- explicit dependency calls that the example asked for -- one deferred publish barrier at `scope_end` -- boundary TensorMap work only when the task actually crosses scope boundaries +Fresh measurements below show the current AUTO runtime stays within roughly +`±1%` end-to-end of the unmodified baseline on the two paged-attention scenes, +which is the intended zero-overhead result. ## Example Requirements -Manual mode only helps when the example exposes a same-scope producer/consumer -chain that TensorMap would otherwise rediscover. +Manual mode only helps when the example exposes a real same-scope +producer/consumer chain that TensorMap would otherwise rediscover. For paged attention, the profitable chain is: @@ -228,129 +233,152 @@ remove much runtime work. 
Current branch benchmark entrypoints: ```bash -./tools/benchmark_rounds.sh -d 4 -n 5 -c 6622890 -r aicpu_build_graph -./tools/benchmark_rounds.sh -d 4 -n 5 -c 6622890 -r tensormap_and_ringbuffer -./tools/benchmark_rounds.sh -d 4 -n 5 -c 6622890 -r tensormap_and_ringbuffer_partial_manual +./tools/benchmark_rounds.sh -d 6 -n 5 -c 6622890 -r aicpu_build_graph --build +./tools/benchmark_rounds.sh -d 6 -n 5 -c 6622890 -r tensormap_and_ringbuffer --build +./tools/benchmark_rounds.sh -d 6 -n 5 -c 6622890 -r tensormap_and_ringbuffer_partial_manual --build ``` +`tensormap_and_ringbuffer_partial_manual` is a selector in +`tools/benchmark_rounds.sh`. The example `kernel_config.py` files still use +`RUNTIME_CONFIG["runtime"] = "tensormap_and_ringbuffer"`. The selector only +switches the scene directories to: + +- `tests/st/a2a3/tensormap_and_ringbuffer/paged_attention_partial_manual` +- `tests/st/a2a3/tensormap_and_ringbuffer/paged_attention_unroll_partial_manual` + The old unmodified runtime is intentionally not kept on this branch. To rerun it side-by-side: ```bash git worktree add tmp/worktree_unmodified a71ba16 -cd tmp/worktree_unmodified -PTO_ISA_ROOT=../../examples/scripts/_deps/pto-isa \ - ./tools/benchmark_rounds.sh -d 4 -n 5 -c 6622890 -r tensormap_and_ringbuffer_unmodified +( + cd tmp/worktree_unmodified + ./tools/benchmark_rounds.sh -d 6 -n 5 -c 6622890 \ + -r tensormap_and_ringbuffer_unmodified --build +) ``` -In this work, direct `run_example.py` reruns were more reliable than the -wrapper for collecting fresh device logs, especially for the old runtime and -for cases where the wrapper suppressed useful failure output. +For this document, a serial `run_example.py` pass was used instead of the +wrapper so every run used one uncontended process on one device. 
-## Fresh Performance Snapshot +Fresh result CSV: -Device and settings used for the rerun set: +- `tmp/bench_matrix_20260409_0006_direct/results.csv` -- device: `4` +## Fresh Hardware Results + +Fresh rerun settings: + +- date: `2026-04-09` +- platform: `a2a3` +- device: `6` - rounds: `5` -- ISA commit: `6622890` +- PTO-ISA commit: `6622890` + +Units below are `elapsed_us (orch_us)`. `aicpu_build_graph` does not emit the +same orch timing lines, so only elapsed time is shown there. + +### `paged_attention` + +| Case | `aicpu_build_graph` | `tensormap_and_ringbuffer_unmodified` | `tensormap_and_ringbuffer` | `tensormap_and_ringbuffer_partial_manual` | +| --- | ---: | ---: | ---: | ---: | +| `Case1` | `31037.8` | `36992.8 (36991.9)` | `36791.2 (36790.5)` | `31563.9 (31407.2)` | +| `Case2` | `16719.2` | `18753.6 (18752.8)` | `18615.9 (18615.1)` | `16757.6 (16343.9)` | + +### `paged_attention_unroll` -### End-to-end / scheduler-side comparison +| Case | `aicpu_build_graph` | `tensormap_and_ringbuffer_unmodified` | `tensormap_and_ringbuffer` | `tensormap_and_ringbuffer_partial_manual` | +| --- | ---: | ---: | ---: | ---: | +| `Case1` | `1421.2` | `1320.0 (853.6)` | `1322.5 (820.0)` | `1327.0 (835.5)` | +| `Case2` | `707.8` | `632.5 (383.5)` | `635.9 (391.8)` | `633.7 (365.5)` | -These numbers are the freshest rerun values used in this document. +## Feature / Optimization -> Gain -| Example | Case | `aicpu_build_graph` | old `tensormap*` | new `tensormap*` | `tensormap* + partial_manual` | -| --- | --- | ---: | ---: | ---: | ---: | -| `paged_attention` | `Case1` | `31312.6 us` | `37061.0 us` | `36585.4 us` | `31814.5 us` | -| `paged_attention` | `Case2` | `16474.4 us` | `18589.4 us` | `19348.8 us` | `16221.1 us` | -| `paged_attention_unroll` | `Case1` | `1426.8 us` | `1383.6 us`* | `1322.1 us` | `1327.9 us` | -| `paged_attention_unroll` | `Case2` | `728.5 us` | `668.6 us`* | `623.8 us` | `639.1 us` | +### 1. 
AUTO stays effectively zero-overhead -`*` The old unmodified unroll runtime logs use the older -`Scheduler summary: total_time=...` format instead of the newer full elapsed -markers. Those two rows therefore use scheduler-summary averages for the old -baseline. +The current AUTO runtime is flat versus the unmodified baseline: -### Orchestrator comparison +- `paged_attention/Case1`: `36791.2 us` vs `36992.8 us` (`-0.5%`) +- `paged_attention/Case2`: `18615.9 us` vs `18753.6 us` (`-0.7%`) +- `paged_attention_unroll/Case1`: `1322.5 us` vs `1320.0 us` (`+0.2%`) +- `paged_attention_unroll/Case2`: `635.9 us` vs `632.5 us` (`+0.5%`) -| Example | Case | old `tensormap*` | new `tensormap*` | `tensormap* + partial_manual` | -| --- | --- | ---: | ---: | ---: | -| `paged_attention` | `Case1` | `37060.3 us` | `36584.7 us` | `31657.9 us` | -| `paged_attention` | `Case2` | `18588.6 us` | `19348.2 us` | `15799.4 us` | -| `paged_attention_unroll` | `Case1` | `716.2 us`* | `826.3 us` | `826.0 us` | -| `paged_attention_unroll` | `Case2` | `336.7 us`* | `368.8 us` | `387.6 us` | +This is the zero-overhead result we needed on the normal tensormap path. -`*` Old unmodified unroll uses `orch_func_cost`, not the newer -`orch_end - orch_start` marker. It is still useful for side-by-side direction, -but it is not byte-for-byte the same logging mode. +### 2. 
Partial-manual removes the non-unroll gap -Old-baseline rerun logs used in this table: +Against the current AUTO runtime, partial-manual improves the non-unroll scene +substantially: -- `device-1533354_20260409000304311.log` -- `device-1546617_20260409000317313.log` -- `device-1536746_20260409000312312.log` -- `device-1568129_20260409000326314.log` +- `paged_attention/Case1` + - elapsed: `36791.2 us -> 31563.9 us` (`-14.2%`) + - orch: `36790.5 us -> 31407.2 us` (`-14.6%`) +- `paged_attention/Case2` + - elapsed: `18615.9 us -> 16757.6 us` (`-10.0%`) + - orch: `18615.1 us -> 16343.9 us` (`-12.2%`) -## What Helped and What Mattered +Against `aicpu_build_graph`, the remaining end-to-end gap on non-unroll is now +small: -### 1. Collapse manual `scope_end` into publish-only work +- `Case1`: `31563.9 us` vs `31037.8 us` (`+1.7%`) +- `Case2`: `16757.6 us` vs `16719.2 us` (`+0.2%`) -This is the main fix for non-unroll paged attention. +This is the target workload. Partial-manual is now effectively in the same +performance band as `aicpu_build_graph` there. -Current effect against the new automatic TensorMap runtime: +### 3. Unroll already amortizes most of the cost -- `paged_attention Case1`: `36584.7 us -> 31657.9 us` orch -- `paged_attention Case2`: `19348.2 us -> 15799.4 us` orch +On `paged_attention_unroll`, the AUTO tensormap path was already strong, so +partial-manual brings little extra value: -This is the difference between “manual mode is the worst case” and -“manual mode is in the same band as `aicpu_build_graph`”. +- `Case1`: `1322.5 us -> 1327.0 us` elapsed (`+0.3%`) +- `Case2`: `635.9 us -> 633.7 us` elapsed (`-0.3%`) -### 2. Skip TensorMap sync/lookup/insert for fully manual-local traffic +That is expected. The unroll example already amortizes dependency-construction +overhead, so partial-manual mainly matters for the non-unroll shape. -Manual mode now checks whether a submit actually touches a boundary tensor. -If not, it skips TensorMap sync entirely. 
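The relative numbers quoted in this section are straightforward to recompute from the raw microsecond values in the result tables. A small sanity-check helper (illustrative only, not part of the runtime):

```cpp
#include <cmath>

// Delta used throughout this doc: (candidate - baseline) / baseline, in percent.
static inline double delta_pct(double candidate_us, double baseline_us) {
    return (candidate_us - baseline_us) / baseline_us * 100.0;
}

// Examples from the elapsed-time tables above (rounded to one decimal):
//   delta_pct(31563.9, 36791.2) -> -14.2  (partial-manual vs AUTO, Case1)
//   delta_pct(16757.6, 18615.9) -> -10.0  (partial-manual vs AUTO, Case2)
//   delta_pct(31563.9, 31037.8) ->  +1.7  (partial-manual vs aicpu, Case1)
```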
+### 4. What specifically helped -This matters most when the example keeps intermediates local to the manual -scope. In the unrolled example, both automatic and partial-manual runtimes are -already in the sub-millisecond range, which shows that the remaining cost is no -longer dominated by boundary TensorMap work. +The important runtime-side wins were: -### 3. Keep boundary correctness on the normal path +- classify manual-local tensors from `owner_task_id` +- skip TensorMap work for those manual-local tensors +- wire explicit same-scope edges immediately in `pto2_rt_add_dependency(...)` +- keep `scope_end()` down to publish-barrier release plus `dep_pool_mark` + fixup -Boundary reads and writes still use: +The important example-side win was using manual scope only where the +non-unroll paged-attention orchestration still had repeated same-scope +dependency work to remove. -- creator retention -- TensorMap overlap lookup -- TensorMap frontier publish +## Current Risks -This does not make manual mode faster by itself. It is the correctness guard -that prevents stale external state and wrong cross-scope dependencies. +1. `manual_dep=true` can still be abused. + - It suppresses TensorMap lookup/insert for that tensor. + - It is only safe when ordering/frontier requirements are already covered by + other logic. -### 4. Example structure still dominates the ceiling +2. Nested scope inside manual scope is still unsupported. + - This is a current implementation restriction, not a theoretical property. -`paged_attention_unroll` already reduces submit pressure aggressively. Because -that example exposes less repeated same-scope dependency work, partial-manual -does not beat the best automatic/unmodified paths there. +3. `pto2_rt_add_dependency(...)` now spends dep-pool entries on the submit path. + - That is intentional, but it means dep-pool pressure moved from the old + replay path into explicit-edge wiring. 
-The no-unroll `paged_attention` case is where partial-manual matters most, and -that is also the target case where it now tracks `aicpu_build_graph`. +4. Manual publish still relies on `dep_pool_mark` prefix repair at `scope_end()`. + - This is required because explicit edges can touch older consumers after + newer tasks were already submitted. -## Clear Conclusions +## Recommendation Summary -1. Manual scope correctness is preserved by keeping cross-scope tensors on the - normal TensorMap path. -2. Manual scope performance only improves when the example exposes a real - same-scope chain that can stay manual-local. -3. The replay-heavy finalize model was the wrong design. The current - submit-time wiring plus publish-only `scope_end` is the right direction. -4. On non-unroll paged attention, `partial_manual` now matches the intended - performance band: close to `aicpu_build_graph`, clearly better than both old - and current automatic TensorMap runtimes. +Keep the design as: -## Remaining Risks +- AUTO mode by default +- explicit MANUAL mode through `PTO2ScopeMode` +- TensorMap kept only for cross-scope correctness +- explicit immediate wiring for same-scope manual edges +- `scope_end()` reduced to publish-barrier release and normal lifetime work -- manual scopes are still single-level only -- old unmodified unroll comparisons rely on older log markers -- explicit dependency misuse is still a fatal orchestration error by design -- the runtime still depends on the example to choose good manual-scope - boundaries; bad example structure can erase the benefit +That gives the required feature coverage while keeping the AUTO path +effectively zero-overhead and bringing non-unroll partial-manual paged +attention to within `~0-2%` of `aicpu_build_graph`. 
From a5332ef57589b3ec51f1a36e9fc85b211808ea08 Mon Sep 17 00:00:00 2001 From: Youwei Xiao Date: Thu, 9 Apr 2026 00:29:33 +0800 Subject: [PATCH 32/35] Update: collapse manual scope_end scan Merge manual scope-end validation and dep-pool watermark repair into a single pass. This keeps the manual publish path behavior unchanged while trimming one serial walk over the scope task list. --- .../tensormap_and_ringbuffer/runtime/pto_orchestrator.cpp | 6 +----- 1 file changed, 1 insertion(+), 5 deletions(-) diff --git a/src/a2a3/runtime/tensormap_and_ringbuffer/runtime/pto_orchestrator.cpp b/src/a2a3/runtime/tensormap_and_ringbuffer/runtime/pto_orchestrator.cpp index adc610b44..89d0be054 100644 --- a/src/a2a3/runtime/tensormap_and_ringbuffer/runtime/pto_orchestrator.cpp +++ b/src/a2a3/runtime/tensormap_and_ringbuffer/runtime/pto_orchestrator.cpp @@ -560,6 +560,7 @@ void pto2_scope_end(PTO2OrchestratorState *orch) { } if (orch->scheduler && count > 0) { + int32_t dep_pool_mark_prefix = 0; for (int32_t task_idx = 0; task_idx < count; task_idx++) { PTO2TaskSlotState *slot_state = orch->scope_tasks[begin + task_idx]; PTO2TaskPayload *payload = slot_state->payload; @@ -577,11 +578,6 @@ orch->fatal = true; return; } - } - - int32_t dep_pool_mark_prefix = 0; - for (int32_t task_idx = 0; task_idx < count; task_idx++) { - PTO2TaskSlotState *slot_state = orch->scope_tasks[begin + task_idx]; // add_dependency may allocate dep entries for an older consumer after // newer tasks were already submitted. Recompute a monotonic dep-pool // watermark at publish time so tail reclamation still advances safely.
From 38bb94268bde48d69499edbaf6ccfe2c8e89494b Mon Sep 17 00:00:00 2001 From: Youwei Xiao Date: Thu, 9 Apr 2026 12:38:40 +0800 Subject: [PATCH 33/35] Support: ignore local worktrees --- .gitignore | 1 + 1 file changed, 1 insertion(+) diff --git a/.gitignore b/.gitignore index 37d5e142b..0b8113d9b 100644 --- a/.gitignore +++ b/.gitignore @@ -21,6 +21,7 @@ venv/ .claude/settings.local.json .claude/worktrees .claude/plans +.worktrees/ # Git cloned dependencies (not tracked in repo) examples/scripts/_deps/ From 01d072348b6425662363b6f365e14ae7fd3469af Mon Sep 17 00:00:00 2001 From: Youwei Xiao Date: Fri, 10 Apr 2026 01:10:23 +0800 Subject: [PATCH 34/35] Fix: restore rebased manual benchmark paths - repair the rebased tensormap submit prologue and task-id write-back - restore partial-manual hub kernel sources under the example trees - repoint the partial-manual configs so hardware benchmarks build again --- .../runtime/pto_orchestrator.cpp | 22 ++++++++++++++++++- .../kernels/aic/aic_hub.cpp | 14 ++++++++++++ .../kernels/aiv/aiv_hub.cpp | 14 ++++++++++++ .../kernels/kernel_config.py | 4 ++-- .../kernels/aic/aic_hub.cpp | 14 ++++++++++++ .../kernels/aiv/aiv_hub.cpp | 14 ++++++++++++ .../kernels/kernel_config.py | 4 ++-- 7 files changed, 81 insertions(+), 5 deletions(-) create mode 100644 tests/st/a2a3/tensormap_and_ringbuffer/paged_attention_partial_manual/kernels/aic/aic_hub.cpp create mode 100644 tests/st/a2a3/tensormap_and_ringbuffer/paged_attention_partial_manual/kernels/aiv/aiv_hub.cpp create mode 100644 tests/st/a2a3/tensormap_and_ringbuffer/paged_attention_unroll_partial_manual/kernels/aic/aic_hub.cpp create mode 100644 tests/st/a2a3/tensormap_and_ringbuffer/paged_attention_unroll_partial_manual/kernels/aiv/aiv_hub.cpp diff --git a/src/a2a3/runtime/tensormap_and_ringbuffer/runtime/pto_orchestrator.cpp b/src/a2a3/runtime/tensormap_and_ringbuffer/runtime/pto_orchestrator.cpp index 89d0be054..0935f40ea 100644 ---
a/src/a2a3/runtime/tensormap_and_ringbuffer/runtime/pto_orchestrator.cpp +++ b/src/a2a3/runtime/tensormap_and_ringbuffer/runtime/pto_orchestrator.cpp @@ -195,7 +195,7 @@ static bool pto2_append_fanin_or_fail( return true; } -static void scope_tasks_push(PTO2OrchestratorState *orch, PTO2TaskSlotState *task_slot_state); +static bool scope_tasks_push(PTO2OrchestratorState *orch, PTO2TaskSlotState *task_slot_state); struct PTO2OutputLayout { uint64_t offsets[MAX_TENSOR_ARGS] = {}; @@ -632,6 +632,23 @@ static TaskOutputTensors pto2_submit_mixed_task_impl( orch->fatal = true; return result; } + + // === Validate submit inputs === + uint8_t active_mask = pto2_mixed_kernels_to_active_mask(mixed_kernels); + always_assert(active_mask != 0 && "MixedKernels must have at least one active slot"); + + int16_t block_num = args.launch_spec.block_num(); + always_assert(block_num >= 1 && "block_num must be >= 1"); + + // Normalize single-AIV tasks: if only aiv1 is set (no aic, no aiv0), move + // it to the aiv0 slot. This guarantees the dispatch path can always use + // PTO2SubtaskSlot::AIV0 for single-AIV shapes without inspecting active_mask. + // Mixed tasks (AIC+AIV) keep their original AIV identity so the correct + // hardware channel (AIV0→AIC vs AIV1→AIC) is used at dispatch time. 
+    MixedKernels normalized = mixed_kernels;
+    bool has_aic = (active_mask & PTO2_SUBTASK_MASK_AIC) != 0;
+    bool has_aiv0 = (active_mask & PTO2_SUBTASK_MASK_AIV0) != 0;
+    bool has_aiv1 = (active_mask & PTO2_SUBTASK_MASK_AIV1) != 0;
+    if (!has_aic && has_aiv1 && !has_aiv0) {
+        normalized.aiv0_kernel_id = normalized.aiv1_kernel_id;
+        normalized.aiv1_kernel_id = INVALID_KERNEL_ID;
@@ -969,6 +986,9 @@ static TaskOutputTensors pto2_submit_mixed_task_impl(
 #endif
     g_orch_submit_idx++;
 #endif
+    if (submitted_task_id != nullptr) {
+        *submitted_task_id = task_id;
+    }
     return result;
 }
diff --git a/tests/st/a2a3/tensormap_and_ringbuffer/paged_attention_partial_manual/kernels/aic/aic_hub.cpp b/tests/st/a2a3/tensormap_and_ringbuffer/paged_attention_partial_manual/kernels/aic/aic_hub.cpp
new file mode 100644
index 000000000..0b3062f18
--- /dev/null
+++ b/tests/st/a2a3/tensormap_and_ringbuffer/paged_attention_partial_manual/kernels/aic/aic_hub.cpp
@@ -0,0 +1,14 @@
+#include
+#include
+
+using namespace pto;
+
+#ifndef __gm__
+#define __gm__
+#endif
+
+#ifndef __aicore__
+#define __aicore__ [aicore]
+#endif
+
+extern "C" __aicore__ void kernel_entry(__gm__ int64_t *args) { (void)args; }
diff --git a/tests/st/a2a3/tensormap_and_ringbuffer/paged_attention_partial_manual/kernels/aiv/aiv_hub.cpp b/tests/st/a2a3/tensormap_and_ringbuffer/paged_attention_partial_manual/kernels/aiv/aiv_hub.cpp
new file mode 100644
index 000000000..0b3062f18
--- /dev/null
+++ b/tests/st/a2a3/tensormap_and_ringbuffer/paged_attention_partial_manual/kernels/aiv/aiv_hub.cpp
@@ -0,0 +1,14 @@
+#include
+#include
+
+using namespace pto;
+
+#ifndef __gm__
+#define __gm__
+#endif
+
+#ifndef __aicore__
+#define __aicore__ [aicore]
+#endif
+
+extern "C" __aicore__ void kernel_entry(__gm__ int64_t *args) { (void)args; }
diff --git a/tests/st/a2a3/tensormap_and_ringbuffer/paged_attention_partial_manual/kernels/kernel_config.py b/tests/st/a2a3/tensormap_and_ringbuffer/paged_attention_partial_manual/kernels/kernel_config.py
index 92bba047b..80765faab 100644
--- a/tests/st/a2a3/tensormap_and_ringbuffer/paged_attention_partial_manual/kernels/kernel_config.py
+++ b/tests/st/a2a3/tensormap_and_ringbuffer/paged_attention_partial_manual/kernels/kernel_config.py
@@ -37,7 +37,7 @@
     {
         "func_id": 4,
         "name": "AIC_HUB",
-        "source": str(_PA_KERNELS / "aic" / "aic_hub.cpp"),
+        "source": str(_ROOT / "aic" / "aic_hub.cpp"),
         "core_type": "aic",
         "signature": [],
     },
@@ -58,7 +58,7 @@
     {
         "func_id": 5,
         "name": "AIV_HUB",
-        "source": str(_PA_KERNELS / "aiv" / "aiv_hub.cpp"),
+        "source": str(_ROOT / "aiv" / "aiv_hub.cpp"),
         "core_type": "aiv",
         "signature": [],
     },
diff --git a/tests/st/a2a3/tensormap_and_ringbuffer/paged_attention_unroll_partial_manual/kernels/aic/aic_hub.cpp b/tests/st/a2a3/tensormap_and_ringbuffer/paged_attention_unroll_partial_manual/kernels/aic/aic_hub.cpp
new file mode 100644
index 000000000..0b3062f18
--- /dev/null
+++ b/tests/st/a2a3/tensormap_and_ringbuffer/paged_attention_unroll_partial_manual/kernels/aic/aic_hub.cpp
@@ -0,0 +1,14 @@
+#include
+#include
+
+using namespace pto;
+
+#ifndef __gm__
+#define __gm__
+#endif
+
+#ifndef __aicore__
+#define __aicore__ [aicore]
+#endif
+
+extern "C" __aicore__ void kernel_entry(__gm__ int64_t *args) { (void)args; }
diff --git a/tests/st/a2a3/tensormap_and_ringbuffer/paged_attention_unroll_partial_manual/kernels/aiv/aiv_hub.cpp b/tests/st/a2a3/tensormap_and_ringbuffer/paged_attention_unroll_partial_manual/kernels/aiv/aiv_hub.cpp
new file mode 100644
index 000000000..0b3062f18
--- /dev/null
+++ b/tests/st/a2a3/tensormap_and_ringbuffer/paged_attention_unroll_partial_manual/kernels/aiv/aiv_hub.cpp
@@ -0,0 +1,14 @@
+#include
+#include
+
+using namespace pto;
+
+#ifndef __gm__
+#define __gm__
+#endif
+
+#ifndef __aicore__
+#define __aicore__ [aicore]
+#endif
+
+extern "C" __aicore__ void kernel_entry(__gm__ int64_t *args) { (void)args; }
diff --git a/tests/st/a2a3/tensormap_and_ringbuffer/paged_attention_unroll_partial_manual/kernels/kernel_config.py b/tests/st/a2a3/tensormap_and_ringbuffer/paged_attention_unroll_partial_manual/kernels/kernel_config.py
index e3a0d13c3..52b624312 100644
--- a/tests/st/a2a3/tensormap_and_ringbuffer/paged_attention_unroll_partial_manual/kernels/kernel_config.py
+++ b/tests/st/a2a3/tensormap_and_ringbuffer/paged_attention_unroll_partial_manual/kernels/kernel_config.py
@@ -37,7 +37,7 @@
     {
         "func_id": 4,
         "name": "AIC_HUB",
-        "source": str(_PA_KERNELS / "aic" / "aic_hub.cpp"),
+        "source": str(_ROOT / "aic" / "aic_hub.cpp"),
         "core_type": "aic",
         "signature": [],
     },
@@ -58,7 +58,7 @@
     {
         "func_id": 5,
         "name": "AIV_HUB",
-        "source": str(_PA_KERNELS / "aiv" / "aiv_hub.cpp"),
+        "source": str(_ROOT / "aiv" / "aiv_hub.cpp"),
         "core_type": "aiv",
         "signature": [],
     },

From e5fa1bc097dc0788c7334e9b353071aa39a26bbc Mon Sep 17 00:00:00 2001
From: Youwei Xiao
Date: Fri, 10 Apr 2026 10:24:58 +0800
Subject: [PATCH 35/35] Fix: align rebased unroll partial-manual ABI

- update the rebased partial-manual unroll orchestration to match the
  current qk/pv kernel ABI by passing block_table as a tensor plus
  bt_offset as a scalar
- rerun the device validation for both unroll cases after the fix
- refresh the design doc with the rebase root cause and the new 4-way
  benchmark results on device 4 with PTO-ISA d96c8784
---
 docs/manual-dep-for-tensormap-design.md      | 100 +++++++++++-------
 .../orchestration/paged_attention_orch.cpp   |  18 ++--
 2 files changed, 73 insertions(+), 45 deletions(-)

diff --git a/docs/manual-dep-for-tensormap-design.md b/docs/manual-dep-for-tensormap-design.md
index 8e815ab43..3853f67f0 100644
--- a/docs/manual-dep-for-tensormap-design.md
+++ b/docs/manual-dep-for-tensormap-design.md
@@ -233,9 +233,9 @@ remove much runtime work.
 Current branch benchmark entrypoints:
 
 ```bash
-./tools/benchmark_rounds.sh -d 6 -n 5 -c 6622890 -r aicpu_build_graph --build
-./tools/benchmark_rounds.sh -d 6 -n 5 -c 6622890 -r tensormap_and_ringbuffer --build
-./tools/benchmark_rounds.sh -d 6 -n 5 -c 6622890 -r tensormap_and_ringbuffer_partial_manual --build
+./tools/benchmark_rounds.sh -d 4 -n 5 -c d96c8784 -r aicpu_build_graph --build
+./tools/benchmark_rounds.sh -d 4 -n 5 -c d96c8784 -r tensormap_and_ringbuffer --build
+./tools/benchmark_rounds.sh -d 4 -n 5 -c d96c8784 -r tensormap_and_ringbuffer_partial_manual --build
 ```
 
 `tensormap_and_ringbuffer_partial_manual` is a selector in
@@ -250,30 +250,47 @@
 The old unmodified runtime is intentionally not kept on this branch. To rerun
 it side-by-side:
 
 ```bash
+export PROJECT_ROOT=$(pwd)
 git worktree add tmp/worktree_unmodified a71ba16
 (
   cd tmp/worktree_unmodified
-  ./tools/benchmark_rounds.sh -d 6 -n 5 -c 6622890 \
+  python3 -m venv .venv --system-site-packages
+  . .venv/bin/activate
+  pip install -e . -q
+  export PTO_ISA_ROOT="$PROJECT_ROOT/examples/scripts/_deps/pto-isa"
+  ./tools/benchmark_rounds.sh -d 4 -n 5 -c d96c8784 \
     -r tensormap_and_ringbuffer_unmodified --build
 )
 ```
 
-For this document, a serial `run_example.py` pass was used instead of the
-wrapper so every run used one uncontended process on one device.
+Fresh benchmark logs for the rebased branch are in:
 
-Fresh result CSV:
+- `tmp/rebased_bench_20260410_fix/aicpu_build_graph.log`
+- `tmp/rebased_bench_20260410_fix/tensormap_and_ringbuffer.log`
+- `tmp/rebased_bench_20260410_fix/tensormap_and_ringbuffer_partial_manual.log`
+- `tmp/rebased_bench_20260410_fix/tensormap_and_ringbuffer_unmodified.log`
 
-- `tmp/bench_matrix_20260409_0006_direct/results.csv`
+Rebase note:
+
+- `paged_attention_unroll_partial_manual` was initially timing out after the
+  merge-forward.
+- The runtime manual-scope machinery was not the root cause.
+- The direct cause was stale example-side AIC submit ABI: the rebased
+  `paged_attention_unroll` AIC kernels now expect `block_table` as a tensor
+  input plus a scalar `bt_offset`, while the partial-manual scene was still
+  passing a raw pointer scalar.
+- Fixing the partial-manual `qk/pv` submit argument layout restored both
+  unroll cases on device.
 
 ## Fresh Hardware Results
 
 Fresh rerun settings:
 
-- date: `2026-04-09`
+- date: `2026-04-10`
 - platform: `a2a3`
-- device: `6`
+- device: `4`
 - rounds: `5`
-- PTO-ISA commit: `6622890`
+- PTO-ISA commit: `d96c8784`
 
 Units below are `elapsed_us (orch_us)`. `aicpu_build_graph` does not emit the
 same orch timing lines, so only elapsed time is shown there.
@@ -282,28 +299,30 @@ same orch timing lines, so only elapsed time is shown there.
 
 ### `paged_attention`
 
 | Case | `aicpu_build_graph` | `tensormap_and_ringbuffer_unmodified` | `tensormap_and_ringbuffer` | `tensormap_and_ringbuffer_partial_manual` |
 | --- | ---: | ---: | ---: | ---: |
-| `Case1` | `31037.8` | `36992.8 (36991.9)` | `36791.2 (36790.5)` | `31563.9 (31407.2)` |
-| `Case2` | `16719.2` | `18753.6 (18752.8)` | `18615.9 (18615.1)` | `16757.6 (16343.9)` |
+| `Case1` | `29937.7` | `36095.9 (36094.9)` | `39148.7 (39148.3)` | `34186.3 (34025.7)` |
+| `Case2` | `16762.7` | `18639.5 (18635.1)` | `19813.0 (19812.7)` | `18028.7 (17618.4)` |
 
 ### `paged_attention_unroll`
 
 | Case | `aicpu_build_graph` | `tensormap_and_ringbuffer_unmodified` | `tensormap_and_ringbuffer` | `tensormap_and_ringbuffer_partial_manual` |
 | --- | ---: | ---: | ---: | ---: |
-| `Case1` | `1421.2` | `1320.0 (853.6)` | `1322.5 (820.0)` | `1327.0 (835.5)` |
-| `Case2` | `707.8` | `632.5 (383.5)` | `635.9 (391.8)` | `633.7 (365.5)` |
+| `Case1` | `1425.3` | `1325.6 (835.3)` | `1173.2 (992.0)` | `1160.4 (968.8)` |
+| `Case2` | `693.0` | `628.7 (380.7)` | `567.9 (435.6)` | `561.9 (416.6)` |
 
 ## Feature / Optimization -> Gain
 
 ### 1. AUTO stays effectively zero-overhead
 
-The current AUTO runtime is flat versus the unmodified baseline:
+The current AUTO runtime no longer meets the zero-overhead target on the
+non-unroll scene, but it still wins clearly on the unroll scene:
 
-- `paged_attention/Case1`: `36791.2 us` vs `36992.8 us` (`-0.5%`)
-- `paged_attention/Case2`: `18615.9 us` vs `18753.6 us` (`-0.7%`)
-- `paged_attention_unroll/Case1`: `1322.5 us` vs `1320.0 us` (`+0.2%`)
-- `paged_attention_unroll/Case2`: `635.9 us` vs `632.5 us` (`+0.5%`)
+- `paged_attention/Case1`: `39148.7 us` vs `36095.9 us` (`+8.5%`)
+- `paged_attention/Case2`: `19813.0 us` vs `18639.5 us` (`+6.3%`)
+- `paged_attention_unroll/Case1`: `1173.2 us` vs `1325.6 us` (`-11.5%`)
+- `paged_attention_unroll/Case2`: `567.9 us` vs `628.7 us` (`-9.7%`)
 
-This is the zero-overhead result we needed on the normal tensormap path.
+So the AUTO path is still good for the already-amortized unroll workload, but
+not yet zero-overhead for the non-unroll paged-attention target.
 
 ### 2. Partial-manual removes the non-unroll gap
@@ -311,31 +330,33 @@
 Against the current AUTO runtime, partial-manual improves the non-unroll
 scene substantially:
 
 - `paged_attention/Case1`
-  - elapsed: `36791.2 us -> 31563.9 us` (`-14.2%`)
-  - orch: `36790.5 us -> 31407.2 us` (`-14.6%`)
+  - elapsed: `39148.7 us -> 34186.3 us` (`-12.7%`)
+  - orch: `39148.3 us -> 34025.7 us` (`-13.1%`)
 - `paged_attention/Case2`
-  - elapsed: `18615.9 us -> 16757.6 us` (`-10.0%`)
-  - orch: `18615.1 us -> 16343.9 us` (`-12.2%`)
+  - elapsed: `19813.0 us -> 18028.7 us` (`-9.0%`)
+  - orch: `19812.7 us -> 17618.4 us` (`-11.1%`)
+
+Against `aicpu_build_graph`, there is still a visible non-unroll gap:
 
-Against `aicpu_build_graph`, the remaining end-to-end gap on non-unroll is now
-small:
+- `Case1`: `34186.3 us` vs `29937.7 us` (`+14.2%`)
+- `Case2`: `18028.7 us` vs `16762.7 us` (`+7.6%`)
 
-- `Case1`: `31563.9 us` vs `31037.8 us` (`+1.7%`)
-- `Case2`: `16757.6 us` vs `16719.2 us` (`+0.2%`)
+Against the unmodified tensormap baseline, partial-manual is now ahead on the
+non-unroll scene:
 
-This is the target workload. Partial-manual is now effectively in the same
-performance band as `aicpu_build_graph` there.
+- `Case1`: `36095.9 us -> 34186.3 us` (`-5.3%`)
+- `Case2`: `18639.5 us -> 18028.7 us` (`-3.3%`)
 
 ### 3. Unroll already amortizes most of the cost
 
-On `paged_attention_unroll`, the AUTO tensormap path was already strong, so
-partial-manual brings little extra value:
+On `paged_attention_unroll`, both current runtimes are already better than
+`aicpu_build_graph`, and partial-manual only nudges the AUTO path slightly:
 
-- `Case1`: `1322.5 us -> 1327.0 us` elapsed (`+0.3%`)
-- `Case2`: `635.9 us -> 633.7 us` elapsed (`-0.3%`)
+- `Case1`: `1173.2 us -> 1160.4 us` elapsed (`-1.1%`)
+- `Case2`: `567.9 us -> 561.9 us` elapsed (`-1.1%`)
 
-That is expected. The unroll example already amortizes dependency-construction
-overhead, so partial-manual mainly matters for the non-unroll shape.
+That is the expected shape. The unroll orchestration already amortizes most
+dependency overhead, so partial-manual has little room left to improve.
 
 ### 4. What specifically helped
@@ -380,5 +401,6 @@ Keep the design as:
 - `scope_end()` reduced to publish-barrier release and normal lifetime work
 
 That gives the required feature coverage while keeping the AUTO path
-effectively zero-overhead and bringing non-unroll partial-manual paged
-attention to within `~0-2%` of `aicpu_build_graph`.
+competitive on unroll and materially reducing the non-unroll gap, but the
+fresh rerun still shows more work is needed to make partial-manual match
+`aicpu_build_graph` on non-unroll paged attention.
diff --git a/tests/st/a2a3/tensormap_and_ringbuffer/paged_attention_unroll_partial_manual/kernels/orchestration/paged_attention_orch.cpp b/tests/st/a2a3/tensormap_and_ringbuffer/paged_attention_unroll_partial_manual/kernels/orchestration/paged_attention_orch.cpp
index 843c2daf6..9cc65b575 100644
--- a/tests/st/a2a3/tensormap_and_ringbuffer/paged_attention_unroll_partial_manual/kernels/orchestration/paged_attention_orch.cpp
+++ b/tests/st/a2a3/tensormap_and_ringbuffer/paged_attention_unroll_partial_manual/kernels/orchestration/paged_attention_orch.cpp
@@ -70,8 +70,12 @@ aicpu_orchestration_entry(const ChipStorageTaskArgs &orch_args, int orch_thread_
     Tensor value_cache = make_tensor_external(vc_ptr, value_cache_shapes, 2, data_type, false);
     Tensor out = make_tensor_external(out_ptr, out_shapes, 2, DataType::FLOAT32);
 
-    int *host_block_table = orch_args.tensor(3).data_as<int>();
-    int *host_context_lens = orch_args.tensor(4).data_as<int>();
+    uint32_t bt_shapes[2] = {static_cast<uint32_t>(batch), static_cast<uint32_t>(block_num)};
+    Tensor block_table =
+        make_tensor_external(orch_args.tensor(3).data_as<int>(), bt_shapes, 2, DataType::INT32, false);
+    uint32_t cl_shapes[1] = {static_cast<uint32_t>(batch)};
+    Tensor context_lens =
+        make_tensor_external(orch_args.tensor(4).data_as<int>(), cl_shapes, 1, DataType::INT32, false);
 
     uint32_t oi_shapes[2] = {static_cast<uint32_t>(q_tile), static_cast<uint32_t>(head_dim)};
     uint32_t li_shapes[1] = {static_cast<uint32_t>(q_tile)};
@@ -80,9 +84,9 @@ aicpu_orchestration_entry(const ChipStorageTaskArgs &orch_args, int orch_thread_
     TensorCreateInfo scalar_ci(li_shapes, 1, DataType::FLOAT32);
 
     for (uint64_t b_idx = 0; b_idx < batch; b_idx++) {
-        uint64_t cur_seq = host_context_lens[b_idx];
+        uint32_t cl_idx[1] = {static_cast<uint32_t>(b_idx)};
+        uint64_t cur_seq = static_cast<uint64_t>(get_tensor_data(context_lens, 1, cl_idx));
         uint64_t bn_this_batch = (cur_seq + block_size - 1) / block_size;
-        int *bt_base = host_block_table + b_idx * block_num;
 
         for (uint64_t q_idx = 0; q_idx < q_loop; q_idx++) {
             PTO2_SCOPE() {
@@ -123,9 +127,10 @@ aicpu_orchestration_entry(const ChipStorageTaskArgs &orch_args, int orch_thread_
                 params_qk.reset();
                 params_qk.add_input(qi);
                 params_qk.add_input(key_cache);
+                params_qk.add_input(block_table);
                 params_qk.add_output(sij_buf_ci);
                 params_qk.add_scalar(n_blocks);
-                params_qk.add_scalar(reinterpret_cast<uint64_t>(bt_base + bn));
+                params_qk.add_scalar(b_idx * block_num + bn);
                 PTO2ManualSubmitResult qk_outs = pto2_rt_submit_aic_task_manual(FUNC_QK_MATMUL, params_qk);
 
                 uint32_t pij_buf_shapes[2] = {
@@ -147,9 +152,10 @@ aicpu_orchestration_entry(const ChipStorageTaskArgs &orch_args, int orch_thread_
                 params_pv.reset();
                 params_pv.add_input(sf_outs.outputs.get_ref(0));
                 params_pv.add_input(value_cache);
+                params_pv.add_input(block_table);
                 params_pv.add_output(tile2d_ci);
                 params_pv.add_scalar(n_blocks);
-                params_pv.add_scalar(reinterpret_cast<uint64_t>(bt_base + bn));
+                params_pv.add_scalar(b_idx * block_num + bn);
                 PTO2ManualSubmitResult pv_outs = pto2_rt_submit_aic_task_manual(FUNC_PV_MATMUL, params_pv);
 
                 uint64_t is_first = (bn == 0) ? 1 : 0;
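Editor's note on the benchmark deltas quoted in the patch-35 design-doc update: the percentages can be re-derived directly from the elapsed-time tables. A minimal sketch, assuming the standard relative-change convention used in the doc (positive means the first runtime is slower); the helper name `pct_delta` is illustrative and not part of the repo:

```python
def pct_delta(new_us, base_us):
    """Relative change of new_us versus base_us, in percent.

    Positive means new_us is slower than base_us, matching the sign
    convention in the design-doc tables.
    """
    return (new_us - base_us) / base_us * 100.0

# Fresh device-4 elapsed_us numbers from the tables in the patch above.
# partial-manual vs aicpu_build_graph, paged_attention/Case1 -> ~ +14.2
gap_case1 = pct_delta(34186.3, 29937.7)
# AUTO vs unmodified baseline, paged_attention/Case1 -> ~ +8.5
auto_case1 = pct_delta(39148.7, 36095.9)
# AUTO vs unmodified baseline, paged_attention_unroll/Case1 -> ~ -11.5
unroll_case1 = pct_delta(1173.2, 1325.6)

print(round(gap_case1, 1), round(auto_case1, 1), round(unroll_case1, 1))
```

The recomputed values agree with the `+14.2%`, `+8.5%`, and `-11.5%` figures quoted in the doc diff, which is a quick sanity check that the tables and the prose were updated consistently after the rerun.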