System Info
- transformers version: 4.32.0.dev0
- Platform: Linux-5.15.0-76-generic-x86_64-with-glibc2.31
- Python version: 3.10.10
- Huggingface_hub version: 0.14.1
- Safetensors version: 0.3.1
- Accelerate version: 0.21.0
- Accelerate config: not found
- PyTorch version (GPU?): 2.0.0+cu117 (True)
- Tensorflow version (GPU?): 2.13.0 (False)
- Flax version (CPU?/GPU?/TPU?): 0.7.0 (cpu)
- Jax version: 0.4.13
- JaxLib version: 0.4.13
- Using GPU in script?: yes
- Using distributed or parallel set-up in script?: It seems yes, but I don't want to ;)
Who can help?
No response
Information
Tasks
- An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
Reproduction
Hi,
I'm running the unchanged "VisionTextDualEncoder and CLIP model training example" on my local laptop (which has 1 GPU) and wonder why it reports distributed training: True (and not False). From the output:
07/19/2023 15:21:22 - WARNING - __main__ - Process rank: 0, device: cuda:0, n_gpu: 1distributed training: True, 16-bits training: False
The above output originates from run_clip.py:

logger.warning(
    f"Process rank: {training_args.local_rank}, device: {training_args.device}, n_gpu: {training_args.n_gpu}"
    + f"distributed training: {bool(training_args.local_rank != -1)}, 16-bits training: {training_args.fp16}"
)
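For context, in launcher setups I have seen, local_rank mirrors the LOCAL_RANK environment variable that torchrun (or accelerate launch) exports for each worker, while a plain python launch leaves it unset. A minimal stdlib sketch of the same condition, assuming that convention:

```python
import os

# torchrun / `accelerate launch` export LOCAL_RANK for every worker
# process; a plain `python script.py` launch leaves it unset, so the
# fallback of -1 corresponds to "no distributed launcher".
local_rank = int(os.environ.get("LOCAL_RANK", -1))
print(f"distributed training: {local_rank != -1}")
```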
- The default should be training_args.local_rank = -1 according to TrainingArguments, but it is somehow set to 0 in this example and I don't know why.
- Adding local_rank=-1 to the run_clip.py example script has no effect.
My questions:
- Is it intended that local_rank is set to 0?
- Does local_rank=0 really mean that distributed training in Trainer is enabled? (I'm new to Trainer and usually work with DistributedDataParallel.)
- How do I switch off distributed training?
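To spell out what the logging line above actually tests, here is the condition as a standalone helper (my own illustration, not a transformers API):

```python
def is_distributed(local_rank: int) -> bool:
    """Mirrors the check in run_clip.py's warning line:
    local_rank == -1 means no distributed launcher was used."""
    return local_rank != -1

print(is_distributed(-1))  # False: plain `python` launch
print(is_distributed(0))   # True: e.g. rank 0 under torchrun
```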
Bigger picture: sometimes my training (on a cluster) hangs at iteration n-1 and never finishes. I wonder if this is related to distributed training; I don't know how to debug it.
100%|█████████▉| 2875/2876 [11:34<00:00, 4.10it/s]
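In case it helps anyone reproduce the hang: NCCL_DEBUG and TORCH_DISTRIBUTED_DEBUG are standard NCCL/PyTorch debug variables (nothing specific to run_clip.py) that can show which rank is stuck; a sketch of setting them before the process group is created:

```python
import os

# Must be set before torch.distributed initializes the process group.
# NCCL_DEBUG=INFO makes NCCL log its collectives to stderr, so a hang
# in the last iteration shows which rank is blocked in a collective.
os.environ["NCCL_DEBUG"] = "INFO"
# Enables extra consistency checks inside PyTorch's DDP wrapper.
os.environ["TORCH_DISTRIBUTED_DEBUG"] = "DETAIL"
```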
Thanks in advance!
Expected behavior
I don't want to use distributed training, i.e. training_args.local_rank = -1