feat: enhance one_logger_utils to support rankpulse integration #1988

Open
sanshang-nv wants to merge 4 commits into NVIDIA:main from sanshang-nv:sanshang/rankpulse
@sanshang-nv (Contributor) commented Oct 28, 2025

What does this PR do?

Adds support in one_logger_utils for rankpulse, a new one-logger feature that helps debug CPU/GPU hang problems. It is disabled by default and is controlled by several environment variables:

  • RANKPULSE_ENABLE enables or disables rankpulse (default: 0)
  • RANKPULSE_INTERVAL_SECONDS sets the checking period in seconds (default: 15)
  • RANKPULSE_TWINDOW_SECONDS sets the history-checking time window in seconds (default: 300)
  • RANKPULSE_GPU_DEBUG_INFO enables or disables the GPU debug information dump when a GPU hangs (default: 1)
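
As a rough illustration of the defaults listed above, here is a minimal sketch of how the four variables could be read. `RankpulseConfig` and `load_rankpulse_config` are hypothetical names made up for this example, not the actual one_logger_utils code, and treating only the string "1" as "enabled" is an assumption.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class RankpulseConfig:
    """Hypothetical container for the settings; not part of one-logger-utils."""

    enable: bool
    interval_seconds: int
    twindow_seconds: int
    gpu_debug_info: bool


def load_rankpulse_config(env=None):
    """Read the four RANKPULSE_* variables, falling back to the documented defaults.

    Assumption: the boolean switches are "on" only when set to the string "1".
    """
    env = os.environ if env is None else env
    return RankpulseConfig(
        enable=env.get("RANKPULSE_ENABLE", "0") == "1",
        interval_seconds=int(env.get("RANKPULSE_INTERVAL_SECONDS", "15")),
        twindow_seconds=int(env.get("RANKPULSE_TWINDOW_SECONDS", "300")),
        gpu_debug_info=env.get("RANKPULSE_GPU_DEBUG_INFO", "1") == "1",
    )
```

With no variables set, this sketch yields rankpulse disabled, a 15-second interval, a 300-second window, and GPU debug dumps enabled, matching the defaults in the list.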

Feature MR in the one-logger repo: link

Related integration MR in the standalone one-logger-utils repo: link

Pre-checks:

@copy-pr-bot (bot) commented Oct 28, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@jaredcasper added this to the Core 0.16 milestone Nov 17, 2025
@jaredcasper added the Expert Review label Nov 17, 2025

@pablo-garay (Contributor) left a comment

I am not the best person for this and don't have the most context; please have the other reviewers tagged here approve it.

@sanshang-nv (Contributor, Author) commented

> I am not the best person for this and don't have the most context; please have the other reviewers tagged here approve it.

Thanks, @pablo-garay. If you know who the right reviewer is, could you please @-mention them so they can approve? Thanks!

@pablo-garay (Contributor) commented

> > I am not the best person for this and don't have the most context; please have the other reviewers tagged here approve it.
>
> Thanks, @pablo-garay. If you know who the right reviewer is, could you please @-mention them so they can approve? Thanks!

I thought the reviewers currently tagged would suffice? A few of them have already reviewed or commented, so I assumed their review would be enough.

@sanshang-nv (Contributor, Author) commented

Kindly pinging @jaredcasper and @deepakn94.

@jaredcasper (Contributor) commented

Are we using environment variables? We don't use environment variables for anything else. Can you use command line arguments?
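
For illustration only, a minimal sketch of what command-line equivalents of the four settings might look like with `argparse`. The flag names and the `add_rankpulse_args` helper are hypothetical and were not proposed anywhere in this PR; the defaults mirror the ones in the description.

```python
import argparse


def add_rankpulse_args(parser):
    """Register hypothetical rankpulse flags; names are illustrative assumptions."""
    group = parser.add_argument_group("rankpulse (hypothetical flags)")
    # Disabled unless the flag is passed, mirroring RANKPULSE_ENABLE=0.
    group.add_argument("--rankpulse-enable", action="store_true")
    # Checking period in seconds, mirroring RANKPULSE_INTERVAL_SECONDS=15.
    group.add_argument("--rankpulse-interval-seconds", type=int, default=15)
    # History window in seconds, mirroring RANKPULSE_TWINDOW_SECONDS=300.
    group.add_argument("--rankpulse-twindow-seconds", type=int, default=300)
    # GPU debug dump is on by default, so the flag turns it off.
    group.add_argument(
        "--no-rankpulse-gpu-debug-info",
        dest="rankpulse_gpu_debug_info",
        action="store_false",
    )
    return parser


if __name__ == "__main__":
    args = add_rankpulse_args(argparse.ArgumentParser()).parse_args()
    print(args)
```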

@chtruong814 added the needs-follow-up label Jan 11, 2026
@chtruong814 removed the needs-follow-up label Jan 30, 2026
Labels: community-request, Expert Review

6 participants