fix: make sft dynamic batch step time check more stable #1265
Signed-off-by: Terry Kong <terryk@nvidia.com>
Walkthrough
Updated a test script's metric validation: the mean calculation for timing data now uses a trailing slice (-6 to -1) instead of a single index (2). No other logic or control flow changes.
…1265) Signed-off-by: Terry Kong <terryk@nvidia.com>
…1265) Signed-off-by: Terry Kong <terryk@nvidia.com> Signed-off-by: yuanhangs <yuanhangs@nvidia.com>
The failure seems to have been there since the beginning; usually the threshold is only barely crossed.
The measured step time is biased a little high because it also includes checkpointing: https://wandb.ai/nvidia/nemo-rl/panel/z7zystoxj?nw=rv5zmj1j76g&yAxisMax=13
Performance seems to vary considerably between commits; even the initial commit had a step time pretty close to 10.
Averaging just the last few steps seems to give a larger gap from the threshold and fewer false positives.
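To illustrate why a trailing average can be steadier than a single-step reading, here is a minimal sketch. It is not the repository's actual check: the helper name `mean_over_steps`, the sample step times, and the threshold of 10 are all assumptions made for illustration, and the slice is assumed to follow Python semantics (end index exclusive), which may differ from the real test helper.

```python
from statistics import mean

# Assumed threshold for illustration only; the real check may use another value.
THRESHOLD = 10.0


def mean_over_steps(step_times, start, end):
    """Return the mean of step_times[start:end] (Python slice, end exclusive)."""
    window = step_times[start:end]
    if not window:
        raise ValueError("selected window contains no steps")
    return mean(window)


# Hypothetical per-step timings in seconds, not taken from any real run.
step_times = [12.5, 10.4, 10.2, 9.1, 9.0, 8.9, 9.2, 9.0, 8.8, 9.6]

# Old style of check: one early step decides the outcome.
single_step = step_times[2]

# New style of check: average a trailing window of steps.
trailing = mean_over_steps(step_times, -6, -1)

print(f"single step: {single_step}, passes: {single_step < THRESHOLD}")
print(f"trailing mean: {trailing:.2f}, passes: {trailing < THRESHOLD}")
```

With these made-up numbers the single early step lands just above the assumed threshold while the trailing mean sits clearly below it, which is the kind of margin the description refers to.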