[Nemotron 12B][DPO] GPU memory footprint higher than expectation #848

@wangshangsam

Description

Describe the bug

Running DPO on Nemotron 12B with a 12k sequence length requires at least 16 nodes (128 GPUs), but in theory it should only need somewhere between 4 and 8 GPUs.
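A rough back-of-envelope estimate supports the expectation. The sketch below assumes bf16 weights and gradients, fp32 AdamW states, a frozen bf16 reference policy, and an 80 GB accelerator; all of these numbers are illustrative assumptions, not measurements from this issue:

```python
# Static training state for DPO on a 12B-parameter policy.
# All byte counts are assumptions for illustration.
params = 12e9
bytes_weights = params * 2       # bf16 policy weights
bytes_grads = params * 2         # bf16 gradients
bytes_optim = params * 4 * 2     # fp32 AdamW states: exp_avg + exp_avg_sq
bytes_ref = params * 2           # frozen bf16 reference policy (no grads, no optimizer)

total_gb = (bytes_weights + bytes_grads + bytes_optim + bytes_ref) / 1024**3
gpu_capacity_gb = 80             # assumed per-GPU memory
min_gpus = total_gb / gpu_capacity_gb
print(f"static state ~ {total_gb:.0f} GiB -> >= {min_gpus:.1f} GPUs before activations")
```

Static state alone fits on roughly 2 such GPUs; even with generous headroom for activations at a 12k sequence length, that points to single-digit GPU counts rather than 128.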

Steps/Code to reproduce bug

# DPO Algorithm Configuration
dpo:
  max_num_epochs: 1
  max_num_steps: 100
  val_period: 10
  val_batches: 1
  val_global_batch_size: 16
  val_micro_batch_size: 1
  val_at_start: true
  seed: 42

  reference_policy_kl_penalty: 0.1
  preference_average_log_probs: False # whether to normalize log probs by sequence length in preference_loss
  sft_average_log_probs: ${.preference_average_log_probs} # whether to normalize log probs by sequence length in sft_loss

  preference_loss_weight: 1 # the coefficient of the preference loss
  sft_loss_weight: 0 # the coefficient of the SFT loss

checkpointing:
  enabled: true
  checkpoint_dir: "results/dpo"
  metric_name: "val_loss"
  higher_is_better: false
  keep_top_k: null
  save_period: 50

policy:
  model_name: ""
  tokenizer:
    name: ${policy.model_name}

  # number of preference samples per batch
  # each preference sample corresponds to a pair of chosen and rejected responses
  # so the actual batch size processed by the model is train_global_batch_size * 2
  train_global_batch_size: 64
  train_micro_batch_size: 1


  #logprob_batch_size: ${policy.train_micro_batch_size}
  max_total_sequence_length: 12288
  precision: "bfloat16"
  fsdp_offload_enabled: false
  activation_checkpointing_enabled: true

  dtensor_cfg:
    enabled: true
    cpu_offload: false
    sequence_parallel: false
    activation_checkpointing: true
    tensor_parallel_size: 8
    context_parallel_size: 1
    custom_parallel_plan: null

  dynamic_batching:
    enabled: false

  # makes the training sequence length divisible by the tensor parallel size
  # this is useful for sequence parallel training
  make_sequence_length_divisible_by: ${policy.dtensor_cfg.tensor_parallel_size}
  max_grad_norm: 1.0

  optimizer:
    name: "torch.optim.AdamW"
    kwargs:
      lr: 1.0e-6
      weight_decay: 0.01
      betas: [0.9, 0.999]
      eps: 1e-8
      # when using Dtensor, we need to set foreach
      # and fused to False
      foreach: False
      fused: False

  scheduler:
    - name: "torch.optim.lr_scheduler.ConstantLR"
      kwargs:
        factor: 1.0
        total_iters: 10000000000
    - milestones: []

data:
  dataset_name: ""
  train_data_path: ""
  val_data_path: ""
  max_input_seq_length: ${policy.max_total_sequence_length}
logger:
  log_dir: "logs"  # Base directory for all logs
  wandb_enabled: true # Make sure you run `wandb login [Your API key]` before running
  tensorboard_enabled: false
  monitor_gpus: true  # If true, will monitor GPU usage and log to wandb and/or tensorboard
  wandb:
    project: "dpo-dev"
    name: "dpo"
  gpu_monitoring:
    collection_interval: 10  # How often to collect GPU usage metrics (in seconds)
    flush_interval: 10  # How often to flush GPU usage metrics to the loggers (in seconds)

cluster:
  gpus_per_node: 8
  num_nodes: 16
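For reference, the parallelism layout implied by the config above works out as follows. This is a quick sanity-check sketch; it assumes the data-parallel degree is the remaining GPUs after tensor and context parallelism, and that train_micro_batch_size counts preference pairs (each pair contributing a chosen and a rejected sequence, per the comment on train_global_batch_size):

```python
# Parallelism layout implied by the reported config (assumptions noted above).
gpus = 16 * 8                         # cluster: num_nodes * gpus_per_node
tp = 8                                # dtensor_cfg.tensor_parallel_size
cp = 1                                # dtensor_cfg.context_parallel_size
dp = gpus // (tp * cp)                # data-parallel replicas

pairs_global = 64                     # train_global_batch_size (preference pairs)
seqs_global = pairs_global * 2        # each pair = chosen + rejected sequence
micro = 1                             # train_micro_batch_size (pairs per forward)
accum = pairs_global // (dp * micro)  # gradient-accumulation steps per DP rank

print(f"dp={dp}, sequences per step={seqs_global}, accumulation steps={accum}")
```

So each step processes only 4 micro-batches per data-parallel rank, i.e. the 128 GPUs are lightly loaded, which makes the memory pressure at this scale all the more surprising.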
