feat: enhance one_logger_utils to support rankpulse integration #1988
sanshang-nv wants to merge 4 commits into NVIDIA:main from
pablo-garay
left a comment
I am not the best person for this and don't have the most context; please have the other reviewers tagged here approve it.
Thanks, @pablo-garay. Could you please help @-mention the right reviewer to approve this, if you know who that is? Thanks!
I thought the reviewers currently tagged should/would suffice? E.g. a few of them have reviewed/commented already, hence their review would be enough.
Kindly ping @jaredcasper @deepakn94
Are we using environment variables? We don't use environment variables for anything else. Can you use command line arguments?
What does this PR do ?
Add support in the one-logger-utils package for a new one-logger feature, rankpulse, which helps debug CPU/GPU hang problems. It is disabled by default and controlled by several environment variables:
- RANKPULSE_ENABLE: enable or disable rankpulse (default value is 0)
- RANKPULSE_INTERVAL_SECONDS: set the checking period (default value is 15)
- RANKPULSE_TWINDOW_SECONDS: set the history checking time window (default value is 300)
- RANKPULSE_GPU_DEBUG_INFO: enable or disable the GPU debug information dump when a GPU hang occurs (default value is 1)
feature MR of one-logger repo: link
related integration MR of standalone one-logger-utils repo: link
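For illustration, the environment variables and defaults described above could be read roughly as follows. This is a minimal sketch, not the code in this PR: `RankpulseConfig`, `load_rankpulse_config`, and the helper functions are hypothetical names, and the set of strings treated as "enabled" and the positive-integer check are assumptions.

```python
import os
from dataclasses import dataclass


def _env_flag(name: str, default: bool) -> bool:
    """Read a 0/1-style flag from the environment (accepted spellings are an assumption)."""
    raw = os.environ.get(name)
    if raw is None or raw.strip() == "":
        return default
    return raw.strip().lower() in ("1", "true", "yes", "on")


def _env_positive_int(name: str, default: int) -> int:
    """Read a positive integer from the environment, falling back to the default if unset."""
    raw = os.environ.get(name)
    if raw is None or raw.strip() == "":
        return default
    value = int(raw)  # raises ValueError on non-numeric input
    if value <= 0:
        raise ValueError(f"{name} must be a positive integer, got {value}")
    return value


@dataclass(frozen=True)
class RankpulseConfig:
    """Hypothetical container; field defaults mirror the values in the PR description."""
    enabled: bool = False
    interval_seconds: int = 15
    twindow_seconds: int = 300
    gpu_debug_info: bool = True


def load_rankpulse_config() -> RankpulseConfig:
    """Build the config from the RANKPULSE_* variables named in the PR description."""
    return RankpulseConfig(
        enabled=_env_flag("RANKPULSE_ENABLE", False),
        interval_seconds=_env_positive_int("RANKPULSE_INTERVAL_SECONDS", 15),
        twindow_seconds=_env_positive_int("RANKPULSE_TWINDOW_SECONDS", 300),
        gpu_debug_info=_env_flag("RANKPULSE_GPU_DEBUG_INFO", True),
    )
```

With none of the variables set, `load_rankpulse_config()` returns the defaults, so the feature stays off unless `RANKPULSE_ENABLE` is set explicitly.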
Pre-checks: