Trainer is using DataParallel on parallelized models  #9577

@jncasey

Description

Environment info

  • transformers version: 4.2.0
  • Platform: Ubuntu 20.04
  • Python version: 3.8.5
  • PyTorch version (GPU?): 1.7.1 / CUDA 11.2
  • Using GPU in script?:
  • Using distributed or parallel set-up in script?:

Who can help

@sgugger @stas00

Information

I'm trying out the 4.2.0 release with a training script that had been working in 4.1.1.

I'm parallelizing my model over two GPUs, and I had been using the --model_parallel training arg in the previous version. Now that it's no longer supported, I removed the arg from my training command, but I'm getting an error as though DataParallel is still being applied and the model isn't being detected as parallelized:
RuntimeError: module must have its parameters and buffers on device cuda:0 (device_ids[0]) but found one of them on device: cuda:1

I did some debugging, and everything seems okay with my model (trainer.is_model_parallel returns True). But trainer.args.n_gpu is still 2.

I admit that I don't totally understand what's happening in the trainer code, but could there be an error on line 289?
self.args._n_gpu = 1

Should that be self.args.n_gpu = 1, without the leading underscore?
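For context on why the underscore might matter, here's a minimal, hypothetical sketch (not the actual transformers TrainingArguments code) of the two ways a class can expose n_gpu, and how that changes whether a Trainer-style write to the private attribute takes effect:

```python
# Hypothetical illustration only -- class names and attributes are made up
# to show the pattern, not copied from transformers.

class ArgsWithProperty:
    """n_gpu is a property that re-reads _n_gpu on every access."""
    def __init__(self, detected_gpus):
        self._n_gpu = detected_gpus  # e.g. torch.cuda.device_count()

    @property
    def n_gpu(self):
        return self._n_gpu


class ArgsWithCachedValue:
    """n_gpu is computed once at init and never re-read."""
    def __init__(self, detected_gpus):
        self._n_gpu = detected_gpus
        self.n_gpu = detected_gpus  # stale copy, frozen at init time


# With a live property, writing to _n_gpu (as the trainer does) is visible:
a = ArgsWithProperty(detected_gpus=2)
a._n_gpu = 1
print(a.n_gpu)  # -> 1

# With a cached value, the same write is silently ignored, which would
# leave n_gpu at 2 and trigger the DataParallel path described above:
b = ArgsWithCachedValue(detected_gpus=2)
b._n_gpu = 1
print(b.n_gpu)  # -> 2, still the stale value
```

So whether `self.args._n_gpu = 1` works depends entirely on whether the public `n_gpu` reads the private attribute back at call time.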

To reproduce

Steps to reproduce the behavior:

  1. Parallelize a model
  2. Train on a machine with multiple GPUs
