[Nemotron 12B][DPO] GPU memory footprint higher than expectation #848

@wangshangsam

Description

Describe the bug

Running DPO on Nemotron 12B with a 12k sequence length requires at least 16 nodes (128 GPUs), but in theory it should only need somewhere between 4 and 8 GPUs.
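A rough back-of-envelope estimate supports the expectation. The sketch below assumes bf16 weights and gradients, fp32 AdamW states, a frozen bf16 reference policy, and an 80 GB accelerator; all of these numbers are illustrative assumptions, not measurements from this issue:

```python
# Static training state for DPO on a 12B-parameter policy.
# All byte counts are assumptions for illustration.
params = 12e9
bytes_weights = params * 2       # bf16 policy weights
bytes_grads = params * 2         # bf16 gradients
bytes_optim = params * 4 * 2     # fp32 AdamW states: exp_avg + exp_avg_sq
bytes_ref = params * 2           # frozen bf16 reference policy (no grads, no optimizer)

total_gb = (bytes_weights + bytes_grads + bytes_optim + bytes_ref) / 1024**3
gpu_capacity_gb = 80             # assumed per-GPU memory
min_gpus = total_gb / gpu_capacity_gb
print(f"static state ~ {total_gb:.0f} GiB -> >= {min_gpus:.1f} GPUs before activations")
```

Static state alone fits on roughly 2 such GPUs; even with generous headroom for activations at a 12k sequence length, that points to single-digit GPU counts rather than 128.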

Steps/Code to reproduce bug

# DPO Algorithm Configuration
dpo:
  max_num_epochs: 1
  max_num_steps: 100
  val_period: 10
  val_batches: 1
  val_global_batch_size: 16
  val_micro_batch_size: 1
  val_at_start: true
  seed: 42

  reference_policy_kl_penalty: 0.1
  preference_average_log_probs: False # whether to normalize log probs by sequence length in preference_loss
  sft_average_log_probs: ${.preference_average_log_probs} # whether to normalize log probs by sequence length in sft_loss

  preference_loss_weight: 1 # the coefficient of the preference loss
  sft_loss_weight: 0 # the coefficient of the SFT loss

checkpointing:
  enabled: true
  checkpoint_dir: "results/dpo"
  metric_name: "val_loss"
  higher_is_better: false
  keep_top_k: null
  save_period: 50

policy:
  model_name: ""
  tokenizer:
    name: ${policy.model_name}

  # number of preference samples per batch
  # each preference sample corresponds to a pair of chosen and rejected responses
  # so the actual batch size processed by the model is train_global_batch_size * 2
  train_global_batch_size: 64
  train_micro_batch_size: 1


  #logprob_batch_size: ${policy.train_micro_batch_size}
  max_total_sequence_length: 12288
  precision: "bfloat16"
  fsdp_offload_enabled: false
  activation_checkpointing_enabled: true

  dtensor_cfg:
    enabled: true
    cpu_offload: false
    sequence_parallel: false
    activation_checkpointing: true
    tensor_parallel_size: 8
    context_parallel_size: 1
    custom_parallel_plan: null

  dynamic_batching:
    enabled: false

  # makes the training sequence length divisible by the tensor parallel size
  # this is useful for sequence parallel training
  make_sequence_length_divisible_by: ${policy.dtensor_cfg.tensor_parallel_size}
  max_grad_norm: 1.0

  optimizer:
    name: "torch.optim.AdamW"
    kwargs:
      lr: 1.0e-6
      weight_decay: 0.01
      betas: [0.9, 0.999]
      eps: 1e-8
      # when using Dtensor, we need to set foreach
      # and fused to False
      foreach: False
      fused: False

  scheduler:
    - name: "torch.optim.lr_scheduler.ConstantLR"
      kwargs:
        factor: 1.0
        total_iters: 10000000000
    - milestones: []

data:
  dataset_name: ""
  train_data_path: ""
  val_data_path: ""
  max_input_seq_length: ${policy.max_total_sequence_length}
logger:
  log_dir: "logs"  # Base directory for all logs
  wandb_enabled: true # Make sure you run `wandb login [Your API key]` before running
  tensorboard_enabled: false
  monitor_gpus: true  # If true, will monitor GPU usage and log to wandb and/or tensorboard
  wandb:
    project: "dpo-dev"
    name: "dpo"
  gpu_monitoring:
    collection_interval: 10  # How often to collect GPU usage metrics (in seconds)
    flush_interval: 10  # How often to flush GPU usage metrics to the loggers (in seconds)

cluster:
  gpus_per_node: 8
  num_nodes: 16
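For reference, the parallelism layout implied by the config above works out as follows. This is a quick sanity-check sketch; it assumes the data-parallel degree is the remaining GPUs after tensor and context parallelism, and that train_micro_batch_size counts preference pairs (each pair contributing a chosen and a rejected sequence, per the comment on train_global_batch_size):

```python
# Parallelism layout implied by the reported config (assumptions noted above).
gpus = 16 * 8                         # cluster: num_nodes * gpus_per_node
tp = 8                                # dtensor_cfg.tensor_parallel_size
cp = 1                                # dtensor_cfg.context_parallel_size
dp = gpus // (tp * cp)                # data-parallel replicas

pairs_global = 64                     # train_global_batch_size (preference pairs)
seqs_global = pairs_global * 2        # each pair = chosen + rejected sequence
micro = 1                             # train_micro_batch_size (pairs per forward)
accum = pairs_global // (dp * micro)  # gradient-accumulation steps per DP rank

print(f"dp={dp}, sequences per step={seqs_global}, accumulation steps={accum}")
```

So each step processes only 4 micro-batches per data-parallel rank, i.e. the 128 GPUs are lightly loaded, which makes the memory pressure at this scale all the more surprising.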
