Describe the bug
Nemotron 12B DPO with a 12k sequence length currently requires at least 16 nodes (128 GPUs) to run, but in theory it should only need somewhere between 4 and 8 GPUs.
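For reference, a rough back-of-envelope estimate of the persistent memory footprint supports the 4-8 GPU expectation. This is only a sketch: the 12B parameter count comes from the model name, and bf16 weights/gradients, fp32 AdamW states, a frozen bf16 DPO reference policy, and 80 GiB GPUs are all assumptions, not measurements.

# Rough persistent-memory estimate for Nemotron 12B DPO (a sketch; all
# byte counts below are assumptions, not measured values).
PARAMS = 12e9
GiB = 1024**3

weights_bf16   = PARAMS * 2      # trainable policy weights, 2 bytes/param
grads_bf16     = PARAMS * 2      # gradients, 2 bytes/param
adamw_fp32     = PARAMS * 4 * 2  # exp_avg + exp_avg_sq, 4 bytes/param each
reference_bf16 = PARAMS * 2      # frozen DPO reference policy (no grads/optimizer)

total = weights_bf16 + grads_bf16 + adamw_fp32 + reference_bf16
print(f"persistent state: {total / GiB:.0f} GiB")  # ~156 GiB

# Fully sharded across N 80 GiB GPUs this is ~156/N GiB per GPU, so even
# N = 4 leaves roughly 40 GiB per GPU for activations at micro batch
# size 1 with activation checkpointing enabled.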
Steps/Code to reproduce bug
# DPO Algorithm Configuration
dpo:
  max_num_epochs: 1
  max_num_steps: 100
  val_period: 10
  val_batches: 1
  val_global_batch_size: 16
  val_micro_batch_size: 1
  val_at_start: true
  seed: 42
  reference_policy_kl_penalty: 0.1
  preference_average_log_probs: False # whether to normalize log probs by sequence length in preference_loss
  sft_average_log_probs: ${.preference_average_log_probs} # whether to normalize log probs by sequence length in sft_loss
  preference_loss_weight: 1 # coefficient of the preference loss
  sft_loss_weight: 0 # coefficient of the SFT loss

checkpointing:
  enabled: true
  checkpoint_dir: "results/dpo"
  metric_name: "val_loss"
  higher_is_better: false
  keep_top_k: null
  save_period: 50

policy:
  model_name: ""
  tokenizer:
    name: ${policy.model_name}
  # number of preference samples per batch;
  # each preference sample corresponds to a pair of chosen and rejected responses,
  # so the actual batch size processed by the model is train_global_batch_size * 2
  train_global_batch_size: 64
  train_micro_batch_size: 1
  #logprob_batch_size: ${policy.train_micro_batch_size}
  max_total_sequence_length: 12288
  precision: "bfloat16"
  fsdp_offload_enabled: false
  activation_checkpointing_enabled: true

  dtensor_cfg:
    enabled: true
    cpu_offload: false
    sequence_parallel: false
    activation_checkpointing: true
    tensor_parallel_size: 8
    context_parallel_size: 1
    custom_parallel_plan: null

  dynamic_batching:
    enabled: false

  # makes the training sequence length divisible by the tensor parallel size;
  # this is useful for sequence parallel training
  make_sequence_length_divisible_by: ${policy.dtensor_cfg.tensor_parallel_size}
  max_grad_norm: 1.0

  optimizer:
    name: "torch.optim.AdamW"
    kwargs:
      lr: 1.0e-6
      weight_decay: 0.01
      betas: [0.9, 0.999]
      eps: 1e-8
      # when using DTensor, foreach and fused must be set to False
      foreach: False
      fused: False

  scheduler:
    - name: "torch.optim.lr_scheduler.ConstantLR"
      kwargs:
        factor: 1.0
        total_iters: 10000000000
    - milestones: []

data:
  dataset_name: ""
  train_data_path: ""
  val_data_path: ""
  max_input_seq_length: ${policy.max_total_sequence_length}

logger:
  log_dir: "logs" # base directory for all logs
  wandb_enabled: true # make sure to run `wandb login [Your API key]` before running
  tensorboard_enabled: false
  monitor_gpus: true # if true, monitors GPU usage and logs to wandb and/or tensorboard
  wandb:
    project: "dpo-dev"
    name: "dpo"
  gpu_monitoring:
    collection_interval: 10 # how often to collect GPU usage metrics (in seconds)
    flush_interval: 10 # how often to flush GPU usage metrics to the loggers (in seconds)

cluster:
  gpus_per_node: 8
  num_nodes: 16
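For context, here is how the cluster and parallelism settings above combine. This is a sketch: the variable names simply mirror the YAML keys, and the data-parallel size follows the usual world_size / (TP * CP) decomposition, which is an assumption about rank layout rather than a quote from NeMo-RL internals.

# Parallel layout implied by the config above (a sketch; assumes
# data_parallel = world_size / (tensor_parallel * context_parallel)).
gpus_per_node = 8
num_nodes = 16
tensor_parallel_size = 8
context_parallel_size = 1
train_global_batch_size = 64  # preference pairs
train_micro_batch_size = 1

world_size = gpus_per_node * num_nodes  # 128 GPUs in total
dp_size = world_size // (tensor_parallel_size * context_parallel_size)  # 16 replicas

# Each preference pair contributes a chosen and a rejected sequence.
sequences_per_step = train_global_batch_size * 2  # 128 sequences per step
micro_batches_per_rank = sequences_per_step // (dp_size * train_micro_batch_size)  # 8

print(world_size, dp_size, sequences_per_step, micro_batches_per_rank)

With tensor_parallel_size: 8, each model replica already spans a full node, so the 16-node requirement corresponds to 16 data-parallel replicas.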
Expected behavior
Environment overview (please complete the following information)
Environment details
Additional context