Environment info
- transformers version: 4.2.0
- Platform: Ubuntu 20.04
- Python version: 3.8.5
- PyTorch version (GPU?): 1.7.1 / CUDA 11.2
- Using GPU in script?: yes (two GPUs)
- Using distributed or parallel set-up in script?: model parallelism (no DataParallel / DistributedDataParallel)
Who can help
@sgugger @stas00
Information
I'm trying out the 4.2.0 release with a training script that had been working in 4.1.1.
I'm parallelizing my model over two GPUs, and I had been using the --model_parallel training arg in the previous version. Now that that arg has been removed, I dropped it from my training command, but I'm getting an error as though DataParallel is being applied and the model isn't being detected as parallelized:
RuntimeError: module must have its parameters and buffers on device cuda:0 (device_ids[0]) but found one of them on device: cuda:1
I did some debugging, and everything seems okay with my model (trainer.is_model_parallel returns True). But trainer.args.n_gpu is still 2.
I admit that I don't totally understand what's happening in the trainer code, but it might be an error on line 289?
self.args._n_gpu = 1
Should that be self.args.n_gpu = 1, without the leading underscore?
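For what it's worth, the leading underscore would make sense if n_gpu were a read-only property backed by a private _n_gpu field, in which case writing to _n_gpu would be the only way to override it. A minimal sketch of that pattern (hypothetical class, not the actual TrainingArguments code):

```python
# Hypothetical sketch of a read-only property backed by a private field.
# This is NOT the actual transformers TrainingArguments implementation;
# it only illustrates why assigning to _n_gpu might be intentional.
class TrainingArgs:
    def __init__(self, n_gpu):
        self._n_gpu = n_gpu  # private backing field

    @property
    def n_gpu(self):
        # read-only view; there is no setter, so external code
        # cannot assign args.n_gpu directly
        return self._n_gpu


args = TrainingArgs(n_gpu=2)
args._n_gpu = 1      # Trainer-style override of the backing field
print(args.n_gpu)    # -> 1
```

If n_gpu is such a property, then self.args.n_gpu = 1 would raise AttributeError (no setter), so the underscore in self.args._n_gpu = 1 could be deliberate rather than a typo.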
To reproduce
Steps to reproduce the behavior:
- Parallelize a model
- Train on a machine with multiple GPUs