Question: update preprocessing scripts to use HuggingFace datasets for pretraining?

Collecting the datasets needed for pretraining is a bit of work, especially when downloading from lots of different URLs behind a firewall.

https://github.com/microsoft/DeepSpeedExamples/tree/25d73cf73fb3dc66faefa141b7319526555be9fc/Megatron-LM-v1.1.5-ZeRO3#datasets

I see that versions of some of these seem to be available in the HuggingFace datasets repo, such as openwebtext.

https://huggingface.co/datasets/openwebtext

For the above, it's especially nice since @stas00 has a small subset one can use for testing:

https://huggingface.co/datasets/stas/openwebtext-10k

It's pretty straightforward to extend the preprocessing script to use an HF dataset as a source rather than a JSON file. Would something like that be acceptable as a PR?
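To make the idea concrete, here is a rough sketch of one possible approach: export an HF dataset to a file with one JSON object per line, which could then be fed to the existing preprocessing script unchanged. This is not the proposed PR itself. The helper names `write_jsonl` and `export_hf_dataset` are hypothetical, and the sketch assumes the dataset exposes its documents in a `text` column and that the preprocessing script reads a `text` key from each line.

```python
import json
from typing import Iterable, Mapping


def write_jsonl(records: Iterable[Mapping], path: str, text_key: str = "text") -> int:
    """Write one JSON object per line, keeping only the text field.

    Returns the number of records written.
    """
    count = 0
    with open(path, "w", encoding="utf-8") as f:
        for record in records:
            # json.dumps escapes embedded newlines, so each document stays on one line.
            f.write(json.dumps({text_key: record[text_key]}) + "\n")
            count += 1
    return count


def export_hf_dataset(name: str, path: str, split: str = "train") -> int:
    """Download a dataset with HF `datasets` and dump it as JSON lines.

    Assumption: the dataset has a "text" column (hypothetical for any
    dataset other than the ones checked by hand).
    """
    # Imported lazily so write_jsonl works without `datasets` installed.
    from datasets import load_dataset

    dataset = load_dataset(name, split=split)
    return write_jsonl(dataset, path)


# Example call (needs network access and the `datasets` package):
# export_hf_dataset("stas/openwebtext-10k", "openwebtext-10k.json")
```

Going through an intermediate file keeps the existing script untouched. The alternative raised above, reading the HF dataset directly inside the preprocessing script, would skip the extra copy on disk at the cost of adding `datasets` as a dependency there.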

Question: update preprocessing scripts to use HuggingFace datasets for pretraining? #120
