Merged

59 commits
ea13962
add virtual pipeline size to config
ericharper Jul 6, 2022
2de384e
convert model to list of modules
ericharper Aug 3, 2022
51b5639
convert model to list of modules
ericharper Aug 3, 2022
55ee50b
convert model to list of modules
ericharper Aug 3, 2022
31b001f
update for list of modules
ericharper Aug 4, 2022
6c9c6a9
add virtual to init
ericharper Aug 4, 2022
5de1c46
Merge branch 'main' of github.com:NVIDIA/NeMo into pipeline_interleaved
ericharper Aug 9, 2022
48eba0d
Merge branch 'main' of github.com:NVIDIA/NeMo into pipeline_interleaved
ericharper Aug 11, 2022
1ca0fa3
update first last stage embedding all reduce
ericharper Aug 11, 2022
56b0d4c
update sequence parallel all reduce for virtual models
ericharper Aug 11, 2022
3d3182b
Merge branch 'main' of github.com:NVIDIA/NeMo into pipeline_interleaved
ericharper Aug 15, 2022
c8d3acf
runs but we get an error
ericharper Aug 17, 2022
c83b3c9
set virtual rank 0 after looping
ericharper Aug 17, 2022
af60b68
account for virtual when determining first and last pipeline stages
ericharper Aug 17, 2022
2a76688
checkpointing for virtual models in progress
ericharper Aug 18, 2022
1c5d879
add checkpoint hooks
ericharper Aug 19, 2022
049b065
working on validation when resuming
ericharper Aug 22, 2022
9c810f7
skip sanity val steps by default in config
ericharper Aug 23, 2022
e913df1
pull main
ericharper Aug 24, 2022
c59e5fb
remove comment
ericharper Sep 27, 2022
87f995b
log number of params
ericharper Sep 28, 2022
984e77a
pull main
ericharper Sep 28, 2022
b0e0548
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Sep 28, 2022
8ef0d58
style
ericharper Sep 28, 2022
3de5bb2
Merge branch 'pipeline_interleaved' of github.com:NVIDIA/NeMo into pi…
ericharper Sep 28, 2022
e2331fb
Merge branch 'main' into pipeline_interleaved
ericharper Sep 28, 2022
8b852ac
Merge branch 'main' into pipeline_interleaved
ericharper Oct 3, 2022
1cd5f46
Merge branch 'main' into pipeline_interleaved
ericharper Oct 3, 2022
93a95dd
check if self.model is a list
ericharper Oct 3, 2022
fd6b207
Merge branch 'pipeline_interleaved' of github.com:NVIDIA/NeMo into pi…
ericharper Oct 3, 2022
bf15a0d
make virtual pipeline default size None on init
ericharper Oct 3, 2022
dba6c5f
make virtual pipeline default to None in config
ericharper Oct 3, 2022
ab5199f
Merge branch 'main' into pipeline_interleaved
ericharper Oct 3, 2022
b115d09
remove ensure_divisibility call
ericharper Oct 5, 2022
5758ab6
Merge branch 'pipeline_interleaved' of github.com:NVIDIA/NeMo into pi…
ericharper Oct 5, 2022
8349bef
Merge branch 'main' into pipeline_interleaved
ericharper Oct 5, 2022
969bf69
fix lgtm alerts
ericharper Oct 5, 2022
1e956d6
remove num_sanity_val_steps from config
ericharper Oct 5, 2022
a48e80d
default virtual pipeline size to none
ericharper Oct 5, 2022
1b7a206
check for list
ericharper Oct 5, 2022
fdf3846
update assert to make sure we are only doing virtual for gpt
ericharper Oct 5, 2022
5a65fad
revert change to get_params_for_weight_decay
ericharper Oct 5, 2022
ffc164a
pull main
ericharper Oct 11, 2022
784060e
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Oct 11, 2022
08509f9
init var
ericharper Oct 11, 2022
131a163
Merge branch 'pipeline_interleaved' of github.com:NVIDIA/NeMo into pi…
ericharper Oct 11, 2022
266d30c
Merge branch 'main' into pipeline_interleaved
ericharper Oct 11, 2022
aa21c85
add import guard for set virtual model parallel world size
ericharper Oct 11, 2022
4aa40b6
Merge branch 'main' into pipeline_interleaved
ericharper Oct 11, 2022
5b41d1f
use import guard
ericharper Oct 11, 2022
4a6b3f5
update calls to fake init in eval scripts
ericharper Oct 11, 2022
0623a37
Merge branch 'main' into pipeline_interleaved
ericharper Oct 11, 2022
f631df4
add _get_fwd_bwd_function
ericharper Oct 12, 2022
4743b07
Merge branch 'pipeline_interleaved' of github.com:NVIDIA/NeMo into pi…
ericharper Oct 12, 2022
7d3e7ff
log all total model parameters
ericharper Oct 12, 2022
0e7b2f4
Merge branch 'main' into pipeline_interleaved
ericharper Oct 12, 2022
d35c4b8
remove unused import
ericharper Oct 12, 2022
f667fef
pull main
ericharper Oct 13, 2022
408e8c9
Merge branch 'main' into pipeline_interleaved
ericharper Oct 13, 2022
4 changes: 3 additions & 1 deletion examples/nlp/language_modeling/conf/megatron_gpt_config.yaml
@@ -18,6 +18,7 @@ trainer:
accumulate_grad_batches: 1 # do not modify, grad acc is automatic for training megatron models
gradient_clip_val: 1.0
benchmark: False
+ enable_model_summary: False # default PTL callback for this does not support model parallelism, instead we log manually

exp_manager:
explicit_log_dir: null
@@ -47,7 +48,7 @@ model:
global_batch_size: 8 # will use more micro batches to reach global batch size
tensor_model_parallel_size: 1 # intra-layer model parallelism
pipeline_model_parallel_size: 1 # inter-layer model parallelism
- resume_from_checkpoint: null # manually set the checkpoint file to load from
+ virtual_pipeline_model_parallel_size: null # interleaved pipeline

# model architecture
encoder_seq_length: 512
@@ -92,6 +93,7 @@ model:

# miscellaneous
seed: 1234
+ resume_from_checkpoint: null # manually set the checkpoint file to load from
use_cpu_initialization: False # Init weights on the CPU (slow for large models)
onnx_safe: False # Use work-arounds for known problems with Torch ONNX exporter.
apex_transformer_log_level: 30 # Python logging level displays logs with severity greater than or equal to this
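The key change here is the new virtual_pipeline_model_parallel_size option, which enables Megatron-style interleaved pipeline scheduling: each pipeline rank owns several non-contiguous model chunks instead of one contiguous block, shrinking the pipeline bubble. A minimal sketch of the sizing arithmetic this implies (hypothetical helper, not NeMo code), assuming layers must divide evenly across pipeline_size * virtual_size chunks:

```python
# Hypothetical helper (not NeMo code): with P pipeline stages and a virtual
# size V, each pipeline rank holds V model chunks, so the layer count must
# split evenly into P * V chunks.
def layers_per_virtual_chunk(num_layers: int, pipeline_size: int, virtual_size: int) -> int:
    total_chunks = pipeline_size * virtual_size
    assert num_layers % total_chunks == 0, (
        f"num_layers ({num_layers}) must be divisible by {total_chunks}"
    )
    return num_layers // total_chunks

# Example: a 24-layer GPT with pipeline_model_parallel_size=4 and
# virtual_pipeline_model_parallel_size=2 gives 8 chunks of 3 layers each.
assert layers_per_virtual_chunk(24, 4, 2) == 3
```

Note that resume_from_checkpoint is not removed, only relocated from the parallelism block to the miscellaneous block below it.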
1 change: 1 addition & 0 deletions examples/nlp/language_modeling/megatron_gpt_eval.py
@@ -171,6 +171,7 @@ def main(cfg) -> None:
app_state.model_parallel_size,
app_state.data_parallel_size,
app_state.pipeline_model_parallel_split_rank,
+ app_state.virtual_pipeline_model_parallel_rank,
) = fake_initialize_model_parallel(
world_size=app_state.model_parallel_size,
rank=trainer.global_rank,
1 change: 1 addition & 0 deletions examples/nlp/language_modeling/megatron_t5_eval.py
@@ -70,6 +70,7 @@ def main():
app_state.model_parallel_size,
app_state.data_parallel_size,
app_state.pipeline_model_parallel_split_rank,
+ app_state.virtual_pipeline_model_parallel_rank,
) = fake_initialize_model_parallel(
world_size=app_state.model_parallel_size,
rank=trainer.global_rank,
(file path not shown)
@@ -56,6 +56,7 @@ def main(cfg) -> None:
app_state.model_parallel_size,
app_state.data_parallel_size,
app_state.pipeline_model_parallel_split_rank,
+ app_state.virtual_pipeline_model_parallel_rank,
) = fake_initialize_model_parallel(
world_size=app_state.model_parallel_size,
rank=trainer.global_rank,
(file path not shown)
@@ -57,6 +57,7 @@ def main(cfg) -> None:
app_state.model_parallel_size,
app_state.data_parallel_size,
app_state.pipeline_model_parallel_split_rank,
+ app_state.virtual_pipeline_model_parallel_rank,
) = fake_initialize_model_parallel(
world_size=app_state.model_parallel_size,
rank=trainer.global_rank,
(file path not shown)
@@ -57,6 +57,7 @@ def main(cfg) -> None:
app_state.model_parallel_size,
app_state.data_parallel_size,
app_state.pipeline_model_parallel_split_rank,
+ app_state.virtual_pipeline_model_parallel_rank,
) = fake_initialize_model_parallel(
world_size=app_state.model_parallel_size,
rank=trainer.global_rank,
(file path not shown)
@@ -62,6 +62,7 @@ def main(cfg) -> None:
app_state.model_parallel_size,
app_state.data_parallel_size,
app_state.pipeline_model_parallel_split_rank,
+ app_state.virtual_pipeline_model_parallel_rank,
) = fake_initialize_model_parallel(
world_size=app_state.model_parallel_size,
rank=trainer.global_rank,
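Each of the eval and inference scripts above unpacks one additional value, app_state.virtual_pipeline_model_parallel_rank, now returned by fake_initialize_model_parallel. For orientation, a hedged sketch (not NeMo source) of what a virtual rank means under Megatron's interleaved schedule, where pipeline rank p hosts the model chunks with global indices v * P + p for v = 0..V-1:

```python
# Illustrative only: map (pipeline rank, virtual rank) to a global
# model-chunk index under the interleaved schedule.
def global_chunk_index(pipeline_rank: int, virtual_rank: int, pipeline_size: int) -> int:
    return virtual_rank * pipeline_size + pipeline_rank

# With P=4 stages and V=2 virtual chunks, pipeline rank 1 hosts global
# chunks 1 (virtual rank 0) and 5 (virtual rank 1).
assert global_chunk_index(1, 0, 4) == 1
assert global_chunk_index(1, 1, 4) == 5
```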
(file path not shown)
@@ -33,6 +33,7 @@
from nemo.collections.nlp.parts.nlp_overrides import GradScaler
from nemo.core.optim import MainParamsOptimizerWrapper, prepare_lr_scheduler
from nemo.utils import AppState, logging
+ from nemo.utils.get_rank import is_global_rank_zero

try:
from apex.transformer import parallel_state
@@ -87,6 +88,7 @@ def __init__(self, cfg: DictConfig, trainer: Trainer, no_lm_init=True):
local_rank=trainer.local_rank,
tensor_model_parallel_size=cfg.get('tensor_model_parallel_size', 1),
pipeline_model_parallel_size=cfg.get('pipeline_model_parallel_size', 1),
+ virtual_pipeline_model_parallel_size=cfg.get('virtual_pipeline_model_parallel_size', None),
pipeline_model_parallel_split_rank=cfg.get('pipeline_model_parallel_split_rank', 0),
micro_batch_size=cfg.get('micro_batch_size'),
global_batch_size=cfg.get('global_batch_size'),
@@ -389,3 +391,17 @@ def _validate_config(self):
logging.info("Gradient accumulation fusion can only be used with megatron amp O2 mixed precision.")
with open_dict(self.cfg):
self.cfg.gradient_accumulation_fusion = False

+    def is_data_parallel_rank_zero(self):
+        if is_global_rank_zero():
+            return True
+        else:
+            try:
+                data_parallel_rank = parallel_state.get_data_parallel_rank()
+            except:
+                data_parallel_rank = None
+
+            if data_parallel_rank is not None and data_parallel_rank == 0:
+                return True
+            else:
+                return False
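The new is_data_parallel_rank_zero helper returns True on the global rank zero, and otherwise only when the data-parallel group is already initialized and this rank is first within it; if parallel_state is not yet set up, the except branch falls back to False. This supports the manual parameter logging mentioned in the config change above (enable_model_summary: False), since the default PTL summary does not account for model parallelism. A hypothetical usage sketch, assuming a model object exposing this method:

```python
# Hypothetical usage (not part of the diff): log parameter counts once per
# data-parallel group instead of only on the global rank zero.
def log_parameter_counts(model) -> None:
    if model.is_data_parallel_rank_zero():
        total = sum(p.numel() for p in model.parameters())
        print(f"parameters on this model-parallel rank: {total}")
```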