fix: Support datasets saved with save_to_disk in ResponseDataset#1610
Conversation
📝 WalkthroughWalkthroughTwo utility functions were enhanced: dataset message key transformation is now idempotent, applying only when "messages" column is absent; dataset loading adds fallback logic to handle Arrow datasets saved with save_to_disk. Changes
Estimated code review effort🎯 2 (Simple) | ⏱️ ~10 minutes
Pre-merge checks and finishing touches❌ Failed checks (1 warning)
✅ Passed checks (3 passed)
✨ Finishing touches
🧪 Generate unit tests (beta)
📜 Recent review detailsConfiguration used: Path: .coderabbit.yaml Review profile: CHILL Plan: Pro 📒 Files selected for processing (2)
🧰 Additional context used📓 Path-based instructions (4)**/*.py📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
Files:
nemo_rl/**/*.py📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
Files:
!(**/tests/**|**/test_*.py|**/test_*.sh)📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
Files:
**/*.{py,sh}📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
Files:
🧬 Code graph analysis (1)nemo_rl/data/datasets/response_datasets/response_dataset.py (1)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
🔇 Additional comments (3)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
b88bada to
2472f66
Compare
yuki-97
left a comment
There was a problem hiding this comment.
@sahgerlad Thanks for supporting this! LGTM except one minor thing.
|
@sahgerlad Can you add an unit test at e.g., you can load one dataset from HF and use |
7ed83cc to
4a4cd8d
Compare
|
Thanks @sahgerlad , there's a conflict with main branch, can you solve it? |
4a4cd8d to
4a94751
Compare
…s fixes KeyError when loading Arrow datasets that were saved using HuggingFace datasets' save_to_disk() method. Signed-off-by: Sahger Lad <lad.sahger@gmail.com>
Co-authored-by: Yuki Huang <yukih@nvidia.com> Signed-off-by: sahgerlad <36946563+sahgerlad@users.noreply.github.com> Signed-off-by: Sahger Lad <lad.sahger@gmail.com>
Tests that ResponseDataset correctly handles datasets that already have a 'messages' column, which is the case when loading Arrow datasets saved with HuggingFace's save_to_disk() method. Signed-off-by: Sahger Lad <lad.sahger@gmail.com>
4a94751 to
7ea0c59
Compare
…DIA-NeMo#1610) Signed-off-by: Sahger Lad <lad.sahger@gmail.com> Signed-off-by: sahgerlad <36946563+sahgerlad@users.noreply.github.com> Co-authored-by: Yuki Huang <yukih@nvidia.com>
…DIA-NeMo#1610) Signed-off-by: Sahger Lad <lad.sahger@gmail.com> Signed-off-by: sahgerlad <36946563+sahgerlad@users.noreply.github.com> Co-authored-by: Yuki Huang <yukih@nvidia.com> Signed-off-by: Parth Mannan <pmannan@nvidia.com>
…DIA-NeMo#1610) Signed-off-by: Sahger Lad <lad.sahger@gmail.com> Signed-off-by: sahgerlad <36946563+sahgerlad@users.noreply.github.com> Co-authored-by: Yuki Huang <yukih@nvidia.com> Signed-off-by: yuanhangs <yuanhangs@nvidia.com>
…DIA-NeMo#1610) Signed-off-by: Sahger Lad <lad.sahger@gmail.com> Signed-off-by: sahgerlad <36946563+sahgerlad@users.noreply.github.com> Co-authored-by: Yuki Huang <yukih@nvidia.com> Signed-off-by: yuanhangs <yuanhangs@nvidia.com>
Signed-off-by: Sahger Lad <lad.sahger@gmail.com> Signed-off-by: sahgerlad <36946563+sahgerlad@users.noreply.github.com> Co-authored-by: Yuki Huang <yukih@nvidia.com>
Signed-off-by: Sahger Lad <lad.sahger@gmail.com> Signed-off-by: sahgerlad <36946563+sahgerlad@users.noreply.github.com> Co-authored-by: Yuki Huang <yukih@nvidia.com>
Signed-off-by: Sahger Lad <lad.sahger@gmail.com> Signed-off-by: sahgerlad <36946563+sahgerlad@users.noreply.github.com> Co-authored-by: Yuki Huang <yukih@nvidia.com>
What does this PR do ?
Support datasets saved with
save_to_diskin ResponseDataset. This fixes KeyError when loading Arrow datasets that were saved using HuggingFace datasets'save_to_disk()method.Summary by CodeRabbit
✏️ Tip: You can customize this high-level summary in your review settings.