Error building extension 'cpu_adam'

Hey guys, I'm having a problem getting DeepSpeed working with XLM-Roberta. I'm trying to run it on an Amazon Linux machine, which is based on Red Hat. Here are a some versions of packages/dependencies I'm using:

`cuda version: 10.2`
`transformers: 4.4.2`
`pytorch: 1.7.1`
`deepspeed: 0.3.13`
`gcc/c++/g++: (GCC) 7.2.1 20170915 (Red Hat 7.2.1-2)`

I must admit I had some issues upgrading the CUDA version from the default 10.0 on the instance to 10.2 and GCC from 4.8.5 to 7.2.1 but since I don't get the error that the torch and installed CUDA versions are different and that GCC has a version lower than 5, I'd assume I'm in the clear.

Here's the essential part of the code I'm running (from a notebook):

```
import os
os.environ['MASTER_ADDR'] = 'localhost'
os.environ['MASTER_PORT'] = '9994' # modify if RuntimeError: Address already in use
os.environ['RANK'] = "0"
os.environ['LOCAL_RANK'] = "0"
os.environ['WORLD_SIZE'] = "1"

from transformers import Trainer, TrainingArguments, XLMRobertaForSequenceClassification, XLMRobertaTokenizer

model = XLMRobertaForSequenceClassification.from_pretrained('xlm-roberta-base')

training_args = TrainingArguments(
    output_dir="./results",
    overwrite_output_dir=True,
    num_train_epochs=1,
    per_device_train_batch_size=64,
    per_device_eval_batch_size=64,
    save_steps=500,
    save_total_limit=2,
    deepspeed="my_ds_config.json"
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
)

trainer.train()
```

Here's the content of my config file:

```
{
    "fp16": {
        "enabled": true,
        "loss_scale": 0,
        "loss_scale_window": 1000,
        "hysteresis": 2,
        "min_loss_scale": 1
    },

    "zero_optimization": {
        "stage": 2,
       "allgather_partitions": true,
       "allgather_bucket_size": 2e8,
       "reduce_scatter": true,
       "reduce_bucket_size": 2e8,
        "overlap_comm": true,
        "contiguous_gradients": true,
        "cpu_offload": true
    },

    "optimizer": {
        "type": "Adam",
        "params": {
            "adam_w_mode": true,
            "lr": 3e-5,
            "betas": [ 0.9, 0.999 ],
            "eps": 1e-8,
            "weight_decay": 3e-7
        }
    },

    "scheduler": {
        "type": "WarmupLR",
        "params": {
            "warmup_min_lr": 0,
            "warmup_max_lr": 3e-5,
            "warmup_num_steps": 500
        }
    }
}
```

Here's the output of my `ds_config`:

```
--------------------------------------------------
DeepSpeed C++/CUDA extension op report
--------------------------------------------------
NOTE: Ops not installed will be just-in-time (JIT) compiled at
      runtime if needed. Op compatibility means that your system
      meet the required dependencies to JIT install the op.
--------------------------------------------------
JIT compiled ops requires ninja
ninja .................. [OKAY]
--------------------------------------------------
op name ................ installed .. compatible
--------------------------------------------------
cpu_adam ............... [NO] ....... [OKAY]
fused_adam ............. [NO] ....... [OKAY]
fused_lamb ............. [NO] ....... [OKAY]
/bin/sh: line 0: type: llvm-config: not found
/bin/sh: line 0: type: llvm-config-9: not found
 [WARNING]  sparse_attn requires one of the following commands '['llvm-config', 'llvm-config-9']', but it does not exist!
sparse_attn ............ [NO] ....... [NO]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
utils .................. [NO] ....... [OKAY]
--------------------------------------------------
DeepSpeed general environment info:
torch install path ............... ['/home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages/torch']
torch version .................... 1.7.1
torch cuda version ............... 10.2
nvcc version ..................... 10.2
deepspeed install path ........... ['/home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages/deepspeed']
deepspeed info ................... 0.3.13+22d5a1f, 22d5a1f, master
deepspeed wheel compiled w. ...... torch 1.7, cuda 10.2
```

And finally, here's the stack trace:

```
[2021-03-24 15:29:36,478] [INFO] [logging.py:60:log_dist] [Rank 0] DeepSpeed info: version=0.3.13, git-hash=unknown, git-branch=unknown
[2021-03-24 15:29:36,494] [INFO] [engine.py:77:_initialize_parameter_parallel_groups] data_parallel_size: 1, parameter_parallel_size: 1
Using /home/ec2-user/.cache/torch_extensions as PyTorch extensions root...
No modifications detected for re-loaded extension module cpu_adam, skipping build step...
Loading extension module cpu_adam...
---------------------------------------------------------------------------
ImportError                               Traceback (most recent call last)
<ipython-input-131-cc14ac05ecbb> in <module>
     30 )
     31 
---> 32 trainer.train()

~/anaconda3/envs/python3/lib/python3.6/site-packages/transformers/trainer.py in train(self, resume_from_checkpoint, trial, **kwargs)
    901         delay_optimizer_creation = self.sharded_ddp is not None and self.sharded_ddp != ShardedDDPOption.SIMPLE
    902         if self.args.deepspeed:
--> 903             model, optimizer, lr_scheduler = init_deepspeed(self, num_training_steps=max_steps)
    904             self.model = model.module
    905             self.model_wrapped = model  # will get further wrapped in DDP

~/anaconda3/envs/python3/lib/python3.6/site-packages/transformers/integrations.py in init_deepspeed(trainer, num_training_steps)
    416         model=model,
    417         model_parameters=model_parameters,
--> 418         config_params=config,
    419     )
    420 

~/anaconda3/envs/python3/lib/python3.6/site-packages/deepspeed/__init__.py in initialize(args, model, optimizer, model_parameters, training_data, lr_scheduler, mpu, dist_init_required, collate_fn, config_params)
    123                                  dist_init_required=dist_init_required,
    124                                  collate_fn=collate_fn,
--> 125                                  config_params=config_params)
    126     else:
    127         assert mpu is None, "mpu must be None with pipeline parallelism"

~/anaconda3/envs/python3/lib/python3.6/site-packages/deepspeed/runtime/engine.py in __init__(self, args, model, optimizer, model_parameters, training_data, lr_scheduler, mpu, dist_init_required, collate_fn, config_params, dont_change_device)
    181         self.lr_scheduler = None
    182         if model_parameters or optimizer:
--> 183             self._configure_optimizer(optimizer, model_parameters)
    184             self._configure_lr_scheduler(lr_scheduler)
    185             self._report_progress(0)

~/anaconda3/envs/python3/lib/python3.6/site-packages/deepspeed/runtime/engine.py in _configure_optimizer(self, client_optimizer, model_parameters)
    596                 logger.info('Using client Optimizer as basic optimizer')
    597         else:
--> 598             basic_optimizer = self._configure_basic_optimizer(model_parameters)
    599             if self.global_rank == 0:
    600                 logger.info(

~/anaconda3/envs/python3/lib/python3.6/site-packages/deepspeed/runtime/engine.py in _configure_basic_optimizer(self, model_parameters)
    665                     optimizer = DeepSpeedCPUAdam(model_parameters,
    666                                                  **optimizer_parameters,
--> 667                                                  adamw_mode=effective_adam_w_mode)
    668                 else:
    669                     from deepspeed.ops.adam import FusedAdam

~/anaconda3/envs/python3/lib/python3.6/site-packages/deepspeed/ops/adam/cpu_adam.py in __init__(self, model_params, lr, bias_correction, betas, eps, weight_decay, amsgrad, adamw_mode)
     76         DeepSpeedCPUAdam.optimizer_id = DeepSpeedCPUAdam.optimizer_id + 1
     77         self.adam_w_mode = adamw_mode
---> 78         self.ds_opt_adam = CPUAdamBuilder().load()
     79 
     80         self.ds_opt_adam.create_adam(self.opt_id,

~/anaconda3/envs/python3/lib/python3.6/site-packages/deepspeed/ops/op_builder/builder.py in load(self, verbose)
    213             return importlib.import_module(self.absolute_name())
    214         else:
--> 215             return self.jit_load(verbose)
    216 
    217     def jit_load(self, verbose=True):

~/anaconda3/envs/python3/lib/python3.6/site-packages/deepspeed/ops/op_builder/builder.py in jit_load(self, verbose)
    250             extra_cuda_cflags=self.nvcc_args(),
    251             extra_ldflags=self.extra_ldflags(),
--> 252             verbose=verbose)
    253         build_duration = time.time() - start_build
    254         if verbose:

~/anaconda3/envs/python3/lib/python3.6/site-packages/torch/utils/cpp_extension.py in load(name, sources, extra_cflags, extra_cuda_cflags, extra_ldflags, extra_include_paths, build_directory, verbose, with_cuda, is_python_module, is_standalone, keep_intermediates)
   1089     if isinstance(cuda_sources, str):
   1090         cuda_sources = [cuda_sources]
-> 1091 
   1092     cpp_sources.insert(0, '#include <torch/extension.h>')
   1093 

~/anaconda3/envs/python3/lib/python3.6/site-packages/torch/utils/cpp_extension.py in _jit_compile(name, sources, extra_cflags, extra_cuda_cflags, extra_ldflags, extra_include_paths, build_directory, verbose, with_cuda, is_python_module, is_standalone, keep_intermediates)
   1315 
   1316 
-> 1317 def verify_ninja_availability():
   1318     r'''
   1319     Raises ``RuntimeError`` if `ninja <https://ninja-build.org/>`_ build system is not

~/anaconda3/envs/python3/lib/python3.6/site-packages/torch/utils/cpp_extension.py in _import_module_from_library(module_name, path, is_python_module)
   1697                       sources,
   1698                       objects,
-> 1699                       ldflags,
   1700                       library_target,
   1701                       with_cuda) -> None:

~/anaconda3/envs/python3/lib/python3.6/imp.py in find_module(name, path)
    295         break  # Break out of outer loop when breaking out of inner loop.
    296     else:
--> 297         raise ImportError(_ERR_MSG.format(name), name=name)
    298 
    299     encoding = None

ImportError: No module named 'cpu_adam'
```

Thanks in advance for your help!

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Error building extension 'cpu_adam' #889

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Error building extension 'cpu_adam' #889

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions