Conversation
@staticmethod
def get_dataset(cfg):
    if 'is_sharded' in cfg.dataset:
        is_sharded = cfg.dataset.is_sharded
        del cfg.dataset.is_sharded
    else:
        is_sharded = False

    if is_sharded:
        cfg.dataset._target_ = 'nemo.collections.tts.data.vocoder_dataset.TarredVocoderDataset'
        dataset = instantiate(cfg.dataset)
        sampler = None
    else:
        dataset = instantiate(cfg.dataset)
        sampler = dataset.get_sampler(cfg.dataloader_params.batch_size)
    return dataset, sampler
Is this the same outcome as defining this in TarredVocoderDataset?:
get_sampler():
    return None
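Concretely, that alternative might look like the following sketch (hedged; it assumes TarredVocoderDataset is an IterableDataset, which may not match the PR exactly):

from torch.utils.data import IterableDataset

class TarredVocoderDataset(IterableDataset):
    ...
    def get_sampler(self, batch_size):
        # Webdataset shards handle shuffling/splitting themselves, so no DistributedSampler is needed.
        return None

With that hook, get_dataset() could call dataset.get_sampler(...) unconditionally for both dataset types.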
Good point, I will create a simpler case here, but I created this function to set the target so as to reuse existing config files and provide a toggle option for sharding.
Given that these are experimental recipes that are undergoing breaking changes and have no public models, I think it should be OK to require config updates. The tarred dataset also requires different inputs to instantiate it.
I am against creating one new config for each feature.
Do you mean you are against creating separate config files for new features in general (eg. sampling rate, new architectures)? Or specifically for needing separate config for tarred vs non-tarred datasets?
The former of creating lots of different overlapping configs is expected across NeMo. The latter can be addressed in several different ways, if needed.
We don't want to create multiple configs with the same model architecture but different settings (like tarred/sampling_rate, etc.).
For the same model architecture with different configs, the main options I have heard before are:
1. Create different example configs.
2. Have one example config and provide web documentation and tutorials clearly indicating to users how to manually change fields for different valid configs (e.g. sample rate, language).
From past discussions around this, generally (1) was preferable, especially for changes like sample rate which are relatively complicated and error prone if you attempt to update all needed parameters manually. A combination of (1) and (2) is needed if you have too many configs to support, such as when having to support multiple sample rates and languages at the same time.
In this particular case, there are 2 problems with the existing approach.
- The model class takes the data loader class/config as input; however, the constructor requires passing `global_rank` and `world_size`, which are known only to the model class and trainer.
- Different data loaders require significantly different inputs to instantiate (and this will diverge more as we add additional sampling strategies).
Perhaps the cleanest way to address these would be to change it so the model is expected to receive a subset of dataset parameters, and then add a module which maps the parameters to the logic to create each dataset.
For example:
In .yaml:
train_ds:
  dataset_type: "vocoder_tarred"
  dataset_args:
    batch_size: ${batch_size}
    ...
In vocoder_dataset.py:
def create_dataset(dataset_type, dataset_args, global_rank, world_size, ...):
    if dataset_type == "vocoder":
        return VocoderDataset(**dataset_args)
    elif dataset_type == "vocoder_tarred":
        return TarredVocoderDataset(**dataset_args, global_rank=global_rank, world_size=world_size)
    ...
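A hypothetical call site in the model class under the same assumptions (create_dataset, dataset_type, and dataset_args are illustrative names from the sketch above, not an existing API):

# e.g. inside the model's setup_training_data(); rank info comes from the model/trainer
dataset = create_dataset(
    dataset_type=cfg.dataset_type,
    dataset_args=cfg.dataset_args,
    global_rank=self.global_rank,
    world_size=self.world_size,
)

This keeps the model class unaware of which dataset implementation gets constructed.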
Good catch, missed it. Updated to use those as well.
For logging, testing, and validation we rely only on non-tarred datasets.
self.world_size = 1
if trainer is not None:
    self.world_size = trainer.num_nodes * trainer.num_devices
In general, we need to figure out how to make the model class agnostic to the data loader, especially given that the logic will need to be kept consistent across all vocoder recipes (audio codec, hifigan, and univnet at least).
If we think upgrading the webdataset version is a blocker to streamlining the code, then we can document that with some TODOs and leave these workarounds. Otherwise, we should redesign the contract between the dataset and the model classes so they work with both.
self._dataset.rename(audio=VALID_FILE_FORMATS, key='__key__')
    .to_tuple('audio', 'key')
Nitpick: it is not very important for an audio-only dataset, but the code would be more readable and extensible using the original dictionary format instead of converting to a tuple.
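A hedged sketch of the dictionary-based variant (names mirror the snippet above; exact chaining may differ with the pinned webdataset version):

self._dataset = self._dataset.rename(audio=VALID_FILE_FORMATS, key='__key__')

def _build_sample(sample):
    # Keep the dict structure so new fields (e.g. durations, speaker id) can be added later.
    return {'audio': sample['audio'], 'key': sample['key']}

self._dataset = self._dataset.map(_build_sample)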
jenkins

jenkins

jenkins
a7a4dab to 9c52b7a
jenkins
The benefit of replication is that it allows each node to sample data points from the entire dataset independently of other nodes, and reduces dependence on value of `shuffle_n`.
If I am not mistaken, shuffle_n is the size of the shuffle buffer as the GPU reads in data from its shards. Allocating more shards to the GPU would not really reduce its dependence on shuffle_n. Intuitively, I would expect it to become more important (a larger dataset requires more shuffling).
Somewhat related: wd.WebDataset() has a shardshuffle=True parameter that ensures that the shard list itself is fully shuffled.
From webdataset Docs:
https://webdataset.github.io/webdataset/gettingstarted/
shuffle(n): shuffle the dataset with a buffer of size n; also shuffles shards (see below)
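For reference, a hedged sketch combining both knobs (availability of the shardshuffle argument and the exact keyword names depend on the webdataset version; the shard pattern and buffer size are illustrative):

import webdataset as wd

audio_tar_filepaths = "audio_{000000..000127}.tar"  # illustrative brace-expanded shard pattern
shuffle_n = 1000

# shardshuffle=True shuffles the shard list itself; .shuffle(n) shuffles samples with an n-sized buffer.
dataset = wd.WebDataset(audio_tar_filepaths, shardshuffle=True).shuffle(shuffle_n)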
Based on the thread linked below, it sounds like the comment is effectively saying "larger datasets are less sensitive to how they are shuffled".
Updated to use both buffer and initial arguments to have more control over shuffling.
self._dataset = wd.WebDataset(audio_tar_filepaths, nodesplitter=None)

if shuffle_n > 0:
    self._dataset = self._dataset.shuffle(shuffle_n)
Should we do something like dataset.shuffle(size=shuffle_n, initial=shuffle_n)? The initial buffer size of 100 is really small, which I saw slow down convergence a lot on my small test datasets (but this might be irrelevant on very large datasets).
If the set is small, then yes, the number of samples per worker would become small and the shuffle might not have much effect; it wouldn't be a problem for larger sets, as the number of samples per worker would be large. As per this discussion, it looks like if we want to get a true shuffle over the whole set, it's better to unbatch() on the loader and shuffle with a larger buffer.
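A hedged sketch of that loader-level reshuffle, adapted from the cross-worker shuffling pattern in the webdataset documentation (the fluent WebLoader API is from newer webdataset releases; shard pattern, keys, and sizes are illustrative):

import webdataset as wd

batch_size = 16  # illustrative
# Worker-local pipeline: shuffle within each worker, select (audio bytes, key), batch per worker.
dataset = (
    wd.WebDataset("shards/audio-{000000..000127}.tar")
    .shuffle(1000)
    .to_tuple("flac", "__key__")  # assumes .flac members inside the tar shards
    .batched(batch_size)
)
# Loader level: unbatch, reshuffle with a larger buffer so samples mix across workers, then rebatch.
loader = wd.WebLoader(dataset, batch_size=None, num_workers=4)
loader = loader.unbatched().shuffle(10000).batched(batch_size)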
sample_rate: int,
sample_rate: int = 24000,
Nitpick: other classes in the TTS collection either provide no default sample_rate or default to 22050.
The default for our audio codec models is 24000. Did we have a 22050 Hz model?
No, but this is a data loader for all TTS vocoders. We have never used 24 kHz in TTS before (and when I brought up the question before of whether we should do so to synchronize with SpeechLM models, the consensus was to stick with 22.05 kHz). So I suppose I would favor not having a default sample rate.
9c52b7a to 7303148
jenkins
Signed-off-by: Nithin Rao Koluguri <nithinraok>
7303148 to cc9380a
jenkins
rlangman left a comment
LGTM. Thanks a lot for the feature and changes.
Add webdataset support for VocoderDataset
Collection: TTS
Changelog
Known issue:
Usage
`is_sharded=True` in cfg for train_ds to load webdataset
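A hedged config sketch of that usage, based on the keys referenced in get_dataset() above (the _target_ path and all other values are illustrative):

train_ds:
  dataset:
    _target_: nemo.collections.tts.data.vocoder_dataset.VocoderDataset
    is_sharded: true  # get_dataset() swaps _target_ to TarredVocoderDataset when this is true
    sample_rate: ${sample_rate}
  dataloader_params:
    batch_size: 16
    num_workers: 4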