Skip to content

fix dataloader len for tqdm and others#168

Closed
tomekrut wants to merge 2 commits intodeepspeedai:masterfrom
tomekrut:fix-dataloader-len
Closed

fix dataloader len for tqdm and others#168
tomekrut wants to merge 2 commits intodeepspeedai:masterfrom
tomekrut:fix-dataloader-len

Conversation

@tomekrut
Copy link
Copy Markdown

I found a small issue see below - the fix created for it.

#167

@tomekrut
Copy link
Copy Markdown
Author

Just few more comments to the above fix:
in case anyone wants to use tqdm for example... the number of batches to complete a single epoch is not correct. it happens because the DeepSpeed's DataLoader len is not calculated correctly. See more details in the #167

@ShadenSmith ShadenSmith requested a review from minjiaz March 25, 2020 16:37
@ShadenSmith ShadenSmith added the bug Something isn't working label Mar 25, 2020
@ShadenSmith ShadenSmith linked an issue Mar 25, 2020 that may be closed by this pull request
@tjruwase
Copy link
Copy Markdown
Contributor

@tomekrut Thanks for contributing to DeepSpeed. Apologies for the delay in getting back to you. We need time to investigate and fully understand the underlying issue that you have found. Yes, you found a subtle bug in DeepSpeed, and thanks so much for that. However, this PR does not fully address the underlying issue; it fixes one scenario but breaks others as I will explain below.

First I will provide some context to the origin of DeepSpeedDataLoader. The goal is for it to be wrapper around torch DataLoader to enable us conveniently add data loading optimizations in the future. As such we strive to maintain similar semantics as torch DataLoader.

I observe that torch DataLoader supports two modes: batching, which returns batched samples, and non-batching, which returns individual samples. This PR fixes the bug for batching mode but introduces a new bug for non-batching mode. It seems to me that more extensive code changes and unit tests are required to support both modes and address other issues in DeepSpeedDataLoader. I have now created #176 to track this bug.

@tjruwase tjruwase closed this Mar 28, 2020
@tjruwase tjruwase reopened this Mar 28, 2020
@msftclas
Copy link
Copy Markdown

msftclas commented Mar 31, 2020

CLA assistant check
All CLA requirements met.

@tjruwase
Copy link
Copy Markdown
Contributor

Closing because it is now replaced by #178

@tjruwase tjruwase closed this Mar 31, 2020
jeffra added a commit to jeffra/DeepSpeed that referenced this pull request Apr 19, 2021
Co-authored-by: Shaden Smith <Shaden.Smith@microsoft.com>
Co-authored-by: Reza Yazdani <44502768+RezaYazdaniAminabadi@users.noreply.github.com>
Co-authored-by: Cheng Li <pistasable@gmail.com>
Co-authored-by: Samyam Rajbhandari <samyamr@microsoft.com>
Co-authored-by: Shaden Smith <Shaden.Smith@microsoft.com>
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Shaden Smith <ShadenTSmith@gmail.com>
Co-authored-by: eltonzheng <eltonz@microsoft.com>
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
Co-authored-by: Stas Bekman <stas00@users.noreply.github.com>
Co-authored-by: Cheng Li <pistasable@gmail.com>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Reza Yazdani <44502768+RezaYazdaniAminabadi@users.noreply.github.com>
Co-authored-by: Samyam Rajbhandari <samyamr@microsoft.com>
Co-authored-by: eltonzheng <eltonz@microsoft.com>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Shaden Smith <Shaden.Smith@microsoft.com>
Co-authored-by: Stas Bekman <stas00@users.noreply.github.com>
Co-authored-by: Cheng Li <pistasable@gmail.com>
Co-authored-by: Reza Yazdani <44502768+RezaYazdaniAminabadi@users.noreply.github.com>
Co-authored-by: Samyam Rajbhandari <samyamr@microsoft.com>
Co-authored-by: eltonzheng <eltonz@microsoft.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Something isn't working

Projects

None yet

Development

Successfully merging this pull request may close these issues.

DeepSpeedDataLoader's len is not caculated properly

5 participants