docs: add async engine dev note #490

Merged
andreatgretel merged 14 commits into main from andreatgretel/docs/async-blog
Apr 8, 2026

@andreatgretel
Contributor

📋 Summary

Add "Async All the Way Down" dev note covering the async task-queue scheduler and its impact on Data Designer pipeline performance. Covers the full async engine arc (PRs #356, #378, #404, #429, #456) in a single narrative post with benchmark results and original diagrams.

🔄 Changes

✨ Added

  • docs/devnotes/posts/async-engine.md - dev note post (~1600 words, slop-guard 93/100)
  • docs/devnotes/posts/assets/async-engine/ - 6 figures (NVIDIA-styled, dark background + green accent):
    • AI-generated hero image
    • Sync vs async Gantt timeline (values derived from real trace data)
    • DAG shape illustrations (4 benchmark workloads)
    • Grouped bar chart (sync vs async wall clock times)
    • Speedup scaling chart
    • Architecture layers SVG diagram

🔧 Changed

🔍 Attention Areas

⚠️ Reviewers: Please pay special attention to the following:

  • async-engine.md - technical claims were cross-checked against implementation code (Kahn's algorithm, AIMD, symmetric bridging, semaphores, etc.) and benchmark scripts (DAG shapes, column dependencies). The "At higher record counts" section discusses rate-limiting tradeoffs qualitatively.
  • Benchmark data is from 10-record runs. Supporting 20-record and 50-record data exist in tmp_blog_content/ (not committed) for reference.
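
For readers unfamiliar with the AIMD behaviour named in the attention areas, here is a minimal sketch of additive-increase/multiplicative-decrease applied to a concurrency limit. The class name, default values, and method names are all assumptions made for illustration; they are not taken from the Data Designer implementation.

```python
from dataclasses import dataclass


@dataclass
class AimdLimit:
    """Illustrative AIMD concurrency limit (hypothetical, not the engine's class)."""

    limit: float = 8.0      # current number of allowed in-flight requests
    min_limit: float = 1.0  # never back off below this
    max_limit: float = 64.0 # never ramp above this
    increase: float = 1.0   # additive step applied after a success
    backoff: float = 0.5    # multiplicative factor applied after a 429

    def on_success(self) -> None:
        # Additive increase: probe for more capacity one step at a time.
        self.limit = min(self.max_limit, self.limit + self.increase)

    def on_rate_limited(self) -> None:
        # Multiplicative decrease: cut quickly when the provider pushes back.
        self.limit = max(self.min_limit, self.limit * self.backoff)

    @property
    def slots(self) -> int:
        return int(self.limit)


ctl = AimdLimit()
for _ in range(4):
    ctl.on_success()
after_ramp = ctl.slots   # limit after four successes
ctl.on_rate_limited()
after_429 = ctl.slots    # limit after one rate-limit response
```

The asymmetry is the point of the technique: the limit creeps up slowly while requests succeed and drops sharply on a rate-limit response, so the client spends most of its time just below the provider's threshold.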

🤖 Generated with AI

- Fix wall-clock claim: 41% -> 22% to match benchmark table
- Fix dual-model speedup rounding: 1.7x -> 1.6x (10.0/6.1 = 1.64)
- Fix run_config API: use dd.set_run_config() instead of passing to create()

Add "Async All the Way Down" dev note covering the async task-queue
scheduler built across PRs #356, #378, #404, #429, #456. Includes
benchmark results, architecture diagrams, and DAG shape illustrations.

Build MkDocs site on PRs that touch docs and deploy to Cloudflare
Pages. Each PR gets a browseable preview URL posted as a comment.
Notebook tutorials use placeholder stubs since they require API
keys to execute.

Requires CLOUDFLARE_API_TOKEN and CLOUDFLARE_ACCOUNT_ID repo secrets.
@andreatgretel andreatgretel requested a review from a team as a code owner April 2, 2026 15:48
@greptile-apps
Contributor

greptile-apps bot commented Apr 2, 2026

Greptile Summary

This PR adds the "Async All the Way Down" dev note documenting the async task-queue scheduler and its performance impact, along with companion assets, an author entry, and an updated nav entry in mkdocs.yml. Previously identified issues (benchmark percentage mismatch, speedup rounding, and run_config API misuse) were resolved in commit 182819f. Technical claims cross-check cleanly: RunConfig.async_trace and RunConfig.progress_bar exist, both DATA_DESIGNER_ASYNC_ENGINE and DATA_DESIGNER_ASYNC_TRACE are live env vars, and the set_run_config() call in the "Try It" snippet matches the actual API.

Confidence Score: 5/5

Documentation-only PR; all previously flagged issues resolved, technical claims verified — safe to merge.

All P0/P1 findings from the prior review round (benchmark percentage, speedup rounding, run_config API) were fixed in 182819f. The remaining content was cross-checked: RunConfig.async_trace and progress_bar exist, DATA_DESIGNER_ASYNC_ENGINE and DATA_DESIGNER_ASYNC_TRACE are wired in the engine, set_run_config() usage is correct, and cross-links between articles resolve. No new issues found.

No files require special attention.

Vulnerabilities

No security concerns identified. This PR is documentation-only with no code or configuration changes that affect runtime behavior.

Important Files Changed

Filename Overview
docs/devnotes/posts/async-all-the-way-down.md New dev note covering the async engine; technical claims verified against the codebase — API usage, env vars, and benchmark numbers are all accurate.
docs/devnotes/posts/owning-the-model-stack.md Cross-reference link to the new async-all-the-way-down.md post added; no other content changes.
docs/devnotes/.authors.yml New amanoel author entry added, matching the author slug used in the new post's front matter.
mkdocs.yml New "Async All the Way Down" nav entry inserted at the top of the Dev Notes section (most-recent-first ordering).
.github/workflows/docs-preview.yml Docs preview CI workflow; no changes to build logic visible in this PR diff.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[Cell enters Frontier\nwhen upstream deps satisfied] --> B[AsyncTaskScheduler\nacquires submission semaphore slot]
    B --> C{LLM-bound\ntask?}
    C -- Yes --> D[Acquire LLM-wait semaphore\nRelease submission semaphore]
    C -- No --> E[Hold submission slot\nfor full duration]
    D --> F[Generator makes\nLLM request via ThrottledModelClient]
    E --> G[Generator runs\nCPU/non-LLM work]
    F --> H{Provider\nresponse?}
    G --> I[Release submission slot\nMark cell complete]
    H -- 429 --> J[AIMD: cut concurrency\nDefer task to frontier]
    H -- Success --> K[Release LLM-wait slot\nMark cell complete]
    J --> A
    K --> L[CompletionTracker: unlock\ndownstream cells]
    I --> L
    L --> A
    K --> M[Row group complete?\nFlush to Parquet]
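
The semaphore handoff in the flowchart can be sketched with plain `asyncio` primitives. This is an illustration under assumptions, not the scheduler's code: the function names, slot counts, and the `stats` dictionary are hypothetical, and `asyncio.sleep` stands in for both the provider request and the non-LLM work.

```python
import asyncio


async def run_task(name, llm_bound, submission, llm_wait, stats):
    await submission.acquire()
    if not llm_bound:
        # Non-LLM work holds its submission slot for the full duration.
        try:
            await asyncio.sleep(0)  # stand-in for CPU/non-LLM work
        finally:
            submission.release()
        stats["completed"].append(name)
        return
    # LLM-bound work: take an LLM-wait slot, then free the submission slot
    # so the scheduler can keep dispatching while this task waits on I/O.
    try:
        await llm_wait.acquire()
    finally:
        submission.release()
    stats["in_flight"] += 1
    stats["peak"] = max(stats["peak"], stats["in_flight"])
    try:
        await asyncio.sleep(0.01)  # stand-in for the provider request
    finally:
        stats["in_flight"] -= 1
        llm_wait.release()
    stats["completed"].append(name)


async def main():
    submission = asyncio.Semaphore(2)  # hypothetical slot counts
    llm_wait = asyncio.Semaphore(8)
    stats = {"in_flight": 0, "peak": 0, "completed": []}
    tasks = [run_task(f"llm-{i}", True, submission, llm_wait, stats) for i in range(6)]
    tasks.append(run_task("cpu-0", False, submission, llm_wait, stats))
    await asyncio.gather(*tasks)
    return stats


stats = asyncio.run(main())
```

Because LLM-bound tasks give back their submission slot as soon as they hold an LLM-wait slot, the number of requests waiting on the provider can exceed the number of submission slots, which is what `stats["peak"]` records here.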

Reviews (14): Last reviewed commit: "docs: address review feedback on async b..."

@github-actions
Contributor

github-actions bot commented Apr 2, 2026

Docs preview: https://a54387d5.dd-docs-preview.pages.dev

Notebook tutorials are placeholder-only in previews.

@andreatgretel andreatgretel force-pushed the andreatgretel/docs/async-blog branch from 7055573 to e434aad Compare April 2, 2026 18:01
andreatgretel and others added 3 commits April 2, 2026 15:14
Add DAG subtitle to sync-vs-async timeline figure and bridge the
surrounding text to explain which workload shape is being shown.
nabinchha and others added 5 commits April 2, 2026 18:39
Regenerate scale-model-timeline and scale-boxplot from nginx access
logs (column_progress.csv, sync/summary.json) instead of buffered
execution logs. Optimize both PNGs to palette mode. Adjust figure
widths and update model timeline commentary.

# **Async All the Way Down**

Every Data Designer pipeline carries a map of what can run in parallel. Consider a pipeline that generates a `topic`, writes a `summary` and a `trivia` fact from that topic, then produces an `analysis` of the summary. `summary` and `trivia` both depend on `topic`, so they could run alongside each other. `analysis` depends on `summary`, so it has to wait — but only on the same row's summary, not the entire column. These references form a per-cell dependency graph. The previous engine used that graph to order columns, but it ran each column to completion before starting the next. A row's `analysis` couldn't start until *every* row of `summary` had finished, even though it only needed its own.
Contributor

Suggested change
Every Data Designer pipeline carries a map of what can run in parallel. Consider a pipeline that generates a `topic`, writes a `summary` and a `trivia` fact from that topic, then produces an `analysis` of the summary. `summary` and `trivia` both depend on `topic`, so they could run alongside each other. `analysis` depends on `summary`, so it has to wait — but only on the same row's summary, not the entire column. These references form a per-cell dependency graph. The previous engine used that graph to order columns, but it ran each column to completion before starting the next. A row's `analysis` couldn't start until *every* row of `summary` had finished, even though it only needed its own.
Every Data Designer pipeline carries a map of what can run in parallel. Consider a pipeline that generates a `topic`, writes a `summary` and a `trivia` fact from that topic, then produces an `analysis` of the summary. `summary` and `trivia` both depend on `topic`, so they could run alongside each other. `analysis` depends on `summary`, so it has to wait — but only on the same row's summary, not the entire column. These references form a per-cell dependency graph. Data Designer’s original workflow engine used that graph to order columns, but it ran each column to completion before starting the next. A row's `analysis` couldn't start until *every* row of `summary` had finished, even though it only needed its own.

Contributor Author

Thanks, adopted your wording with a small tweak - added "within each batch" to clarify the sync engine already split into batches, it just ran columns sequentially within each one.
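
To make the distinction in this thread concrete, the sketch below orders the four example columns with Kahn's algorithm (which the PR description says the implementation uses) and then contrasts a column-at-a-time readiness rule with a per-cell one. The helper names and the `done` set representation are assumptions for illustration, not the engine's data structures.

```python
from collections import deque

# Column -> upstream columns, taken from the example pipeline in the post.
DEPS = {
    "topic": [],
    "summary": ["topic"],
    "trivia": ["topic"],
    "analysis": ["summary"],
}


def kahn_order(deps):
    """Return a topological order of the columns using Kahn's algorithm."""
    indegree = {col: len(ups) for col, ups in deps.items()}
    downstream = {col: [] for col in deps}
    for col, ups in deps.items():
        for up in ups:
            downstream[up].append(col)
    ready = deque(col for col, n in indegree.items() if n == 0)
    order = []
    while ready:
        col = ready.popleft()
        order.append(col)
        for nxt in downstream[col]:
            indegree[nxt] -= 1
            if indegree[nxt] == 0:
                ready.append(nxt)
    if len(order) != len(deps):
        raise ValueError("dependency cycle detected")
    return order


def column_ready(col, n_rows, done, deps):
    """Column-at-a-time rule: every row of every upstream column is done."""
    return all((up, r) in done for up in deps[col] for r in range(n_rows))


def cell_ready(col, row, done, deps):
    """Per-cell rule: only the same row of each upstream column must be done."""
    return all((up, row) in done for up in deps[col])


order = kahn_order(DEPS)
# Row 0 has finished topic and summary; row 1 has only finished topic.
done = {("topic", 0), ("topic", 1), ("summary", 0)}
per_column = column_ready("analysis", 2, done, DEPS)
per_cell = cell_ready("analysis", 0, done, DEPS)
```

With this state, the column-level rule still blocks `analysis` because row 1's `summary` is missing, while the per-cell rule lets row 0's `analysis` proceed.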


The scheduler maintains a *frontier* — the set of tasks whose inputs are all satisfied. Dispatch is a loop: pull ready tasks from the frontier, acquire a [semaphore](https://en.wikipedia.org/wiki/Semaphore_(programming)) slot, spawn a worker. When the worker completes, mark the cell done, which may add new tasks to the frontier. The loop runs until every cell in every row group has completed or been dropped.
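
The dispatch loop described above can be sketched in a few lines of `asyncio`. This is a simplified illustration under assumptions: the function and variable names are hypothetical, there is a single semaphore, a cell is a `(column, row)` tuple, `asyncio.sleep(0)` stands in for generating a cell, and dropped cells and row groups are left out.

```python
import asyncio

# Column -> upstream columns for the example pipeline (topic, summary, trivia, analysis).
DEPS = {"topic": [], "summary": ["topic"], "trivia": ["topic"], "analysis": ["summary"]}


async def run_pipeline(deps, n_rows, max_workers=4):
    slots = asyncio.Semaphore(max_workers)
    done = set()
    pending = {(col, row) for col in deps for row in range(n_rows)}
    running = set()
    finished_order = []

    async def worker(cell):
        try:
            await asyncio.sleep(0)  # stand-in for generating the cell
            return cell
        finally:
            slots.release()

    while pending or running:
        # Frontier: pending cells whose same-row upstream cells are all done.
        frontier = sorted(
            c for c in pending if all((up, c[1]) in done for up in deps[c[0]])
        )
        for cell in frontier:
            await slots.acquire()  # wait for a free worker slot
            pending.discard(cell)
            running.add(asyncio.create_task(worker(cell)))
        if not running:
            raise RuntimeError("no runnable cells left; dependency cycle?")
        finished, running = await asyncio.wait(
            running, return_when=asyncio.FIRST_COMPLETED
        )
        for task in finished:
            cell = task.result()
            done.add(cell)  # marking a cell done may unlock downstream cells
            finished_order.append(cell)
    return finished_order


order = asyncio.run(run_pipeline(DEPS, n_rows=3))
```

Each pass recomputes the frontier from scratch, which is fine for a sketch; a real scheduler would more likely update it incrementally as cells complete.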

Two details matter here. Multi-column generators (where one generator produces several output columns) are deduplicated so they run once. And stateful generators like seed dataset readers get per-instance `asyncio.Lock`s to preserve row-group ordering, since the order rows are read from a seed dataset matters.
Contributor

@johnnygreco johnnygreco Apr 8, 2026

nit: i feel like this bit about multi-column generators and seed readers might be TMI. I get that we want to give technical details here, but the goal is for users to not need to worry about these deep implementation details. It's also a bit confusing because the reader would need to understand how generators relate to columns and why we have multi-column generators to begin with.

Contributor Author

Good call, removed the paragraph. The two-semaphore discussion below is the interesting detail worth keeping.

- Tighten intro to a concise abstract, move pipeline narrative into
  "The Bottleneck Was Structural" section
- Remove multi-column generators / seed readers paragraph (TMI)
- Clarify sync engine ran columns sequentially within each batch
@andreatgretel andreatgretel merged commit 0e90ea6 into main Apr 8, 2026
48 checks passed

3 participants