docs: update dedup docs for WorkflowRunResult (PR #1275) #1654

Merged
lbliii merged 5 commits into NVIDIA-NeMo:26.04-staging from lbliii:lbliii/pr1275-doc-updates
Apr 2, 2026
Conversation

@lbliii
Contributor

@lbliii lbliii commented Mar 24, 2026

Description

Updates v26.04 fern documentation to reflect the new WorkflowRunResult return type introduced in #1275. All deduplication workflow run() code examples now capture the return value and show available metadata keys. Adds API reference documentation for WorkflowRunResult and WorkflowBase to the pipeline reference page. Replaces the 26.02 release notes with a 26.04 skeleton including the Workflow Results API entry and breaking changes.

Usage

from nemo_curator.stages.deduplication.exact.workflow import ExactDeduplicationWorkflow

exact_workflow = ExactDeduplicationWorkflow(...)
result = exact_workflow.run()
print(result.metadata)  # {"total_time": 42.1, "num_duplicates": 1500, ...}
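
A slightly fuller sketch of consuming the result may help here. The stand-in object below only mimics the `metadata` attribute shown in the usage snippet (the real `WorkflowRunResult` lives in NeMo Curator and is not imported here), and `summarize_metadata` is a hypothetical helper for illustration, not part of the library.

```python
from __future__ import annotations

from types import SimpleNamespace
from typing import Any


def summarize_metadata(
    result: Any, keys: tuple[str, ...] = ("total_time", "num_duplicates")
) -> str:
    """Format selected metadata entries; keys that are absent are reported as 'n/a'."""
    metadata = getattr(result, "metadata", None) or {}
    parts = [f"{key}={metadata.get(key, 'n/a')}" for key in keys]
    return ", ".join(parts)


# Stand-in for the object returned by exact_workflow.run(); values are illustrative only.
fake_result = SimpleNamespace(metadata={"total_time": 42.1, "num_duplicates": 1500})
print(summarize_metadata(fake_result))
```

Reading keys with `.get` keeps the caller tolerant of workflows that report a different set of metadata keys.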

Checklist

  • I am familiar with the Contributing Guide.
  • New or Existing tests cover these changes.
  • The documentation is up to date with these changes.

Run the following to install Fern:

npm install -g fern-api

Run the following to generate a local preview:

fern docs dev

@greptile-apps
Contributor

greptile-apps Bot commented Mar 24, 2026

Greptile Summary

This PR updates the v26.04 Fern documentation to reflect the new WorkflowRunResult return type introduced in PR #1275, capturing result in all deduplication workflow run() call examples and adding inline metadata key comments. The release notes and breaking-change entries are well-written, and the previous review concerns about WorkflowBase inheritance and missing metadata keys in semdedup.mdx have been addressed.

One minor inconsistency remains: pipeline.mdx was only reformatted (plain link → list item) but the PR description states that WorkflowRunResult and WorkflowBase API reference docs were added to that page — those additions are absent from the diff.

Confidence Score: 5/5

Safe to merge; all findings are P2 documentation improvements with no impact on runtime behaviour.

All prior P1 concerns (WorkflowBase inheritance claim, missing id_generator_path, incomplete semdedup metadata comments) have been resolved. The one remaining inline comment (incomplete TextSemanticDeduplicationWorkflow metadata list in index.mdx) is a minor P2 inconsistency between the overview page and the more detailed semdedup.mdx page.

fern/versions/v26.04/pages/curate-text/process-data/deduplication/index.mdx — the TextSemanticDeduplicationWorkflow metadata comment at line 118 is incomplete relative to semdedup.mdx and the source code.

Important Files Changed

| Filename | Overview |
| --- | --- |
| fern/versions/v26.04/pages/curate-text/process-data/deduplication/index.mdx | Adds WorkflowRunResult capture and metadata comments to all workflow examples; the TextSemanticDeduplicationWorkflow comment is still incomplete compared to semdedup.mdx and the source code. |
| fern/versions/v26.04/pages/curate-text/process-data/deduplication/exact.mdx | Captures WorkflowRunResult for all run() calls and adds accurate metadata comments including the previously-missing id_generator_path. |
| fern/versions/v26.04/pages/curate-text/process-data/deduplication/fuzzy.mdx | Captures WorkflowRunResult for all run() calls and adds accurate metadata comments (all keys verified against fuzzy/workflow.py). |
| fern/versions/v26.04/pages/curate-text/process-data/deduplication/semdedup.mdx | Updates TextSemanticDeduplicationWorkflow and SemanticDeduplicationWorkflow result capture; all seven metadata keys verified against semantic.py and semantic/workflow.py. |
| fern/versions/v26.04/pages/about/release-notes/index.mdx | Adds Workflow Results API section and breaking-change entries; correctly qualifies that TextSemanticDeduplicationWorkflow implements the interface without formal WorkflowBase inheritance. |
| fern/versions/v26.04/pages/api-reference/pipeline.mdx | Source Code section reformatted to a list item; no WorkflowRunResult or WorkflowBase API reference added despite the PR description claiming this. |

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A["workflow.run()"] --> B["WorkflowRunResult"]
    B --> C["result.metadata dict"]
    B --> D["result.pipeline_tasks dict"]

    C --> E1["ExactDeduplicationWorkflow\ntotal_time, num_duplicates,\nidentification_time, id_generator_path"]
    C --> E2["FuzzyDeduplicationWorkflow\ntotal_time, num_duplicates, minhash_time,\nlsh_time, connected_components_pipeline_time,\nid_generator_path"]
    C --> E3["TextSemanticDeduplicationWorkflow\ntotal_time, num_duplicates, num_duplicates_removed,\nembedding_time, identification_time,\nremoval_time, final_output_path"]
    C --> E4["SemanticDeduplicationWorkflow\ntotal_time, num_duplicates,\nkmeans_time, pairwise_time"]
    C --> E5["TextDuplicatesRemovalWorkflow\ntotal_time, num_duplicates_removed"]

Reviews (3): Last reviewed commit: "docs: remove internal WorkflowRunResult/..."
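
The per-workflow metadata keys in the flowchart above can be turned into a simple lookup for sanity-checking a `result.metadata` dict. This is only a sketch: the key sets are copied from the flowchart in this thread and may not match the library at any given commit, and `missing_metadata_keys` is a hypothetical helper, not a NeMo Curator function.

```python
from __future__ import annotations

# Metadata keys per workflow, copied from the flowchart in this review summary.
EXPECTED_METADATA_KEYS: dict[str, frozenset[str]] = {
    "ExactDeduplicationWorkflow": frozenset(
        {"total_time", "num_duplicates", "identification_time", "id_generator_path"}
    ),
    "FuzzyDeduplicationWorkflow": frozenset(
        {
            "total_time",
            "num_duplicates",
            "minhash_time",
            "lsh_time",
            "connected_components_pipeline_time",
            "id_generator_path",
        }
    ),
    "TextSemanticDeduplicationWorkflow": frozenset(
        {
            "total_time",
            "num_duplicates",
            "num_duplicates_removed",
            "embedding_time",
            "identification_time",
            "removal_time",
            "final_output_path",
        }
    ),
    "SemanticDeduplicationWorkflow": frozenset(
        {"total_time", "num_duplicates", "kmeans_time", "pairwise_time"}
    ),
    "TextDuplicatesRemovalWorkflow": frozenset({"total_time", "num_duplicates_removed"}),
}


def missing_metadata_keys(workflow_name: str, metadata: dict) -> list[str]:
    """Return the expected keys absent from `metadata`, sorted for stable output."""
    expected = EXPECTED_METADATA_KEYS[workflow_name]
    return sorted(expected - set(metadata))


# Illustrative metadata dict; a real one would come from workflow.run().metadata.
sample = {"total_time": 42.1, "num_duplicates": 1500}
print(missing_metadata_keys("ExactDeduplicationWorkflow", sample))
```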

Comment thread fern/versions/v26.04/pages/api-reference/pipeline.mdx Outdated
Comment thread fern/versions/v26.04/pages/about/release-notes/index.mdx Outdated
Comment thread fern/versions/v26.04/pages/curate-text/process-data/deduplication/index.mdx Outdated
Comment thread fern/versions/v26.04/pages/curate-text/process-data/deduplication/index.mdx Outdated
Comment thread fern/versions/v26.04/pages/curate-text/process-data/deduplication/semdedup.mdx Outdated
Comment thread fern/versions/v26.04/pages/curate-text/process-data/deduplication/semdedup.mdx Outdated

---

## WorkflowRunResult
Contributor

Do we need this change? @ayushdg / @sarahyurick / @VibhuJawa, this seems too internal.

Contributor

Yeah maybe it can be added after the WorkflowBase section and just be:

## WorkflowRunResult

A dataclass returned by all deduplication workflow run() methods. Contains workflow outputs, pipeline task mappings, and metadata such as timing and duplicate counts.

See the API documentation (link) for more details.

without the full list of methods etc.
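
As a rough illustration of the shape that the suggested description implies, the dataclass below is a stand-in only. The field names `metadata` and `pipeline_tasks` follow the review summary in this thread; the value types, the defaults, and the class name `SketchWorkflowRunResult` are assumptions, and the actual NeMo Curator definition may differ.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Any


@dataclass
class SketchWorkflowRunResult:
    """Illustrative stand-in only; not the NeMo Curator class definition."""

    # Assumed shape: pipeline name -> list of task objects produced by that pipeline.
    pipeline_tasks: dict[str, list[Any]] = field(default_factory=dict)
    # Assumed shape: free-form metadata such as timings and duplicate counts.
    metadata: dict[str, Any] = field(default_factory=dict)


# Values are illustrative; a real instance would be returned by a workflow's run() method.
result = SketchWorkflowRunResult(metadata={"total_time": 42.1, "num_duplicates": 1500})
print(result.metadata["num_duplicates"])
```

Using `default_factory` gives each instance its own dicts, so one run's metadata cannot leak into another's.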

Contributor

@praateekmahajan praateekmahajan left a comment

Rest of the PR LGTM except that one comment

@copy-pr-bot

copy-pr-bot Bot commented Apr 2, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

lbliii and others added 5 commits April 2, 2026 13:30
Update v26.04 fern docs to reflect the new WorkflowRunResult return type
from all deduplication workflow run() methods. Add API reference docs for
WorkflowRunResult and WorkflowBase, update code examples across dedup
pages, and replace 26.02 release notes with 26.04 skeleton.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Lawrence Lane <llane@nvidia.com>

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Lawrence Lane <llane@nvidia.com>

Add input_filegroups_time, connected_components_pipeline_time, and
complete TextSemanticDedup keys to API reference table and inline
comments. Note PR NVIDIA-NeMo#1275 provenance on workflow.py source link.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Lawrence Lane <llane@nvidia.com>

- Qualify WorkflowBase claim: "most" workflows inherit, not "all"
  (TextSemanticDeduplicationWorkflow duck-types the interface)
- Add missing id_generator_path to exact/fuzzy comments in index.mdx
- Complete semdedup.mdx metadata comments with identification_time,
  removal_time, and final_output_path

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Lawrence Lane <llane@nvidia.com>

Per praateekmahajan review feedback, these sections are too internal
for public docs. Also removes dead link in dedup index page.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Lawrence Lane <llane@nvidia.com>