Running on two 3090s? #9

@vijetadeshpande

Hi authors, I do not have access to solid hardware. What I have for now is two 3090s (24GB each). I am planning to run/debug the code with this setup and then move the experiments to A100s. The two-3090 machine has CUDA 12.4. Torch 2.0.0 (pinned in the requirements) does not support CUDA 12.x, but I found that torch 2.1.1 supports CUDA 12.1. This is the only change I have made to the requirements; otherwise the setup is as suggested.
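For what it's worth, the "APIs are compatible, accepting this combination" message in the log reflects CUDA's compatibility guarantee within a major version: a wheel built against CUDA 12.1 can generally run on a 12.2 or 12.4 system toolkit. A minimal sketch of that major-version check (the helper name is hypothetical, not code from this repo or from DeepSpeed, whose actual check is more involved):

```python
def cuda_majors_compatible(torch_cuda: str, system_cuda: str) -> bool:
    """Return True when the CUDA version torch was compiled with and the
    installed CUDA toolkit share a major version (e.g. 12.1 vs. 12.4).
    The CUDA runtime API is stable within a major release, which is why
    a cu121 torch wheel is accepted on a 12.2/12.4 system.
    Hypothetical helper, for illustration only."""
    return torch_cuda.split(".")[0] == system_cuda.split(".")[0]

print(cuda_majors_compatible("12.1", "12.4"))  # True: cu121 wheel, CUDA 12.4 toolkit
print(cuda_majors_compatible("11.8", "12.4"))  # False: major versions differ
```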

When I run

torchrun --nproc_per_node=2 --master_port=6000 train.py ...

the code gets stuck at the following step:

LlamaTokenizerFast(name_or_path='meta-llama/Llama-2-7b-hf', vocab_size=32000, model_max_length=32, is_fast=True, padding_side='right', truncation_side='left', special_tokens={'bos_token': '<s>', 'eos_token': '</s>', 'unk_token': '<unk>', 'pad_token': '</s>'}, clean_up_tokenization_spaces=False),  added_tokens_decoder={
        0: AddedToken("<unk>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
        1: AddedToken("<s>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
        2: AddedToken("</s>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
}
/mnt/shared_home/vdeshpande/miniconda3/envs/env_spag/lib/python3.9/site-packages/accelerate/accelerator.py:457: FutureWarning: Passing the following arguments to `Accelerator` is deprecated and will be removed in version 1.0 of Accelerate: dict_keys(['dispatch_batches', 'split_batches']). Please pass an `accelerate.DataLoaderConfiguration` instead:
dataloader_config = DataLoaderConfiguration(dispatch_batches=None, split_batches=False)
  warnings.warn(
Installed CUDA version 12.2 does not match the version torch was compiled with 12.1 but since the APIs are compatible, accepting this combination
Using /mnt/shared_home/vdeshpande/.cache/torch_extensions/py39_cu121 as PyTorch extensions root...
Installed CUDA version 12.2 does not match the version torch was compiled with 12.1 but since the APIs are compatible, accepting this combination
Using /mnt/shared_home/vdeshpande/.cache/torch_extensions/py39_cu121 as PyTorch extensions root...
Detected CUDA files, patching ldflags
Emitting ninja build file /mnt/shared_home/vdeshpande/.cache/torch_extensions/py39_cu121/cpu_adam/build.ninja...
Building extension module cpu_adam...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module cpu_adam...
Loading extension module cpu_adam...
Time to load cpu_adam op: 3.3634016513824463 seconds
Time to load cpu_adam op: 3.0814285278320312 seconds
Parameter Offload: Total persistent parameters: 532480 in 130 params
[2024-09-04 15:34:07,215] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 50337 closing signal SIGTERM
[2024-09-04 15:34:22,297] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: -9) local_rank: 0 (pid: 50336) of binary: /mnt/shared_home/vdeshpande/miniconda3/envs/env_spag/bin/python

Any insights on resolving this issue?
