
Add Cambricon MLU accelerator support #2552

Merged
muellerzr merged 8 commits into huggingface:main from huismiling:main on Mar 20, 2024

Conversation

@huismiling
Contributor

@huismiling huismiling commented Mar 13, 2024

What does this PR do?

To use Cambricon MLUs to train 🤗 Transformers models, support must first be added to Accelerate; it will then come to the Trainer for free.
This PR adds support for the Cambricon MLU accelerator:

  1. Sample config after running the accelerate config command:
debug: false
distributed_type: MULTI_MLU
downcast_bf16: 'no'
gpu_ids: all
machine_rank: 0
main_training_function: main
mixed_precision: 'no'
num_machines: 1
num_processes: 8
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
  2. Run nlp_example.py with MLUs:
    accelerate launch nlp_example.py
    Below are the output logs:
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
You're using a BertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
You're using a BertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
You're using a BertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
You're using a BertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
You're using a BertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
You're using a BertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
You're using a BertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
You're using a BertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
epoch 0: {'accuracy': 0.6838235294117647, 'f1': 0.8122270742358079}
epoch 1: {'accuracy': 0.7058823529411765, 'f1': 0.8170731707317073}
epoch 2: {'accuracy': 0.7598039215686274, 'f1': 0.8398692810457516}
  3. About Cambricon MLU:
    The Cambricon MLU is an AI processor that supports AI frameworks such as PyTorch and TensorFlow, so Transformers/Accelerate can run on MLUs to train foundation models (a minimal availability check is sketched below). Website: https://www.cambricon.com
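
As a quick sanity check before running accelerate config, one can verify that the Cambricon PyTorch extension is installed and that MLU devices are visible. A minimal sketch, assuming the torch_mlu package name and the torch.mlu namespace provided by Cambricon's PyTorch port (neither is shown in this PR description):

import torch

try:
    import torch_mlu  # Cambricon's PyTorch extension (assumed package name)
except ImportError:
    torch_mlu = None

if torch_mlu is not None and hasattr(torch, "mlu") and torch.mlu.is_available():
    # Report how many MLU devices this process can see.
    print(f"Found {torch.mlu.device_count()} MLU device(s)")
else:
    print("No MLU devices available")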

@huismiling
Contributor Author

@sgugger Hi, good day. Could you please review this PR? Thanks!

@muellerzr
Contributor

muellerzr commented Mar 13, 2024

Sylvain is no longer on this project/at Hugging Face, I’ll review this today. Thanks for your contribution!

@muellerzr muellerzr self-requested a review March 13, 2024 11:05
@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

Contributor

@muellerzr muellerzr left a comment


Thanks! Overall this is a very sound PR, and it's exciting to support yet another hardware platform! 🚀

Left some suggestions and a few questions I'd like addressed before moving forward. Thanks!

Three review comment threads on src/accelerate/utils/imports.py (outdated)
@muellerzr
Contributor

Also, for the quality checks, please run `pip install -e .[quality]` followed by `make style; make quality`.

Commit: "it's beautiful !" (Co-authored-by: Zach Mueller <muellerzr@gmail.com>)
@huismiling
Contributor Author

Also, for the quality checks, please run `pip install -e .[quality]` followed by `make style; make quality`.

Below is the output.

ruff format .
126 files left unchanged
doc-builder style src/accelerate docs/source --max_len 119
ruff check .
ruff format --check .
126 files already formatted
doc-builder style src/accelerate docs/source --max_len 119 --check_only

@huismiling
Contributor Author

@muellerzr Thanks for your advice. Both changes are done:

  1. Deleted the torch check.
  2. Deleted `torch.cuda = torch.mlu`; see the device-agnostic sketch below.
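
Rather than monkey-patching torch.cuda to point at torch.mlu, user code can stay device-agnostic by letting Accelerate pick the device. A minimal sketch of that pattern, assuming only that Accelerator.device resolves to an MLU device on an MLU host; it is not code from this PR.

import torch
from accelerate import Accelerator

accelerator = Accelerator()

# accelerator.device resolves to the detected backend (an "mlu" device on MLU
# hosts, "cuda" or "cpu" elsewhere), so no torch.cuda monkey-patching is needed.
model = torch.nn.Linear(8, 2).to(accelerator.device)
batch = torch.randn(4, 8).to(accelerator.device)
print(accelerator.device, model(batch).shape)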

@huismiling huismiling requested a review from muellerzr March 14, 2024 10:12
Contributor

@muellerzr muellerzr left a comment


Thanks! Overall this looks like a straightforward and easy integration!

cc @SunMarc for the big model inference stuff it touches :)

@muellerzr muellerzr requested a review from SunMarc March 14, 2024 11:29
Member

@SunMarc SunMarc left a comment


Awesome! Thanks for the clean integration of MLU with big model inference @huismiling! Can you confirm that you are able to load a model on multi-MLU when using the transformers library (by passing `device_map="auto"` when loading a model such as Llama 2 or Mistral)?

@huismiling
Contributor Author

cc @SunMarc @muellerzr
Hi, I tried the Llama-2-7b-chat-hf model with the code below on 8 MLUs.

from transformers import AutoTokenizer
import transformers
import torch

model = "/llm/models/Llama-2-7b-chat-hf/"

tokenizer = AutoTokenizer.from_pretrained(model)
# device_map="auto" lets Accelerate place model shards across all visible MLUs.
pipeline = transformers.pipeline(
    "text-generation",
    model=model,
    torch_dtype=torch.float16,
    device_map="auto",
)

sequences = pipeline(
    'I liked "Breaking Bad" and "Band of Brothers". Do you have any recommendations of other shows I might like?\n',
    do_sample=True,
    top_k=10,
    num_return_sequences=1,
    eos_token_id=tokenizer.eos_token_id,
    max_length=100,
)
for seq in sequences:
    print(f"Result: {seq['generated_text']}")

Below is the output.

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]
Loading checkpoint shards:  50%|█████     | 1/2 [00:06<00:06,  6.38s/it]
Loading checkpoint shards: 100%|██████████| 2/2 [00:07<00:00,  3.52s/it]
Loading checkpoint shards: 100%|██████████| 2/2 [00:07<00:00,  3.95s/it]
/llm/transformers/src/transformers/generation/configuration_utils.py:492: UserWarning: `do_sample` is set to `False`. However, `temperature` is set to `0.9` -- this flag is only used in sample-based generation modes. You should set `do_sample=True` or unset `temperature`. This was detected when initializing the generation config instance, which means the corresponding file may hold incorrect parameterization and should be fixed.
  warnings.warn(
/llm/transformers/src/transformers/generation/configuration_utils.py:497: UserWarning: `do_sample` is set to `False`. However, `top_p` is set to `0.6` -- this flag is only used in sample-based generation modes. You should set `do_sample=True` or unset `top_p`. This was detected when initializing the generation config instance, which means the corresponding file may hold incorrect parameterization and should be fixed.
  warnings.warn(
/llm/transformers/src/transformers/generation/configuration_utils.py:492: UserWarning: `do_sample` is set to `False`. However, `temperature` is set to `0.9` -- this flag is only used in sample-based generation modes. You should set `do_sample=True` or unset `temperature`.
  warnings.warn(
/llm/transformers/src/transformers/generation/configuration_utils.py:497: UserWarning: `do_sample` is set to `False`. However, `top_p` is set to `0.6` -- this flag is only used in sample-based generation modes. You should set `do_sample=True` or unset `top_p`.
  warnings.warn(
Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.
Result: I liked "Breaking Bad" and "Band of Brothers". Do you have any recommendations of other shows I might like?

I'm open to different genres and topics, but I prefer shows with complex characters and compelling storytelling.

Thanks!
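
To double-check that the checkpoint really is sharded across several MLUs rather than loaded onto a single device, the device map produced by Accelerate can be printed. A minimal sketch, assuming the pipeline object from the snippet above; hf_device_map is the placement mapping that transformers attaches to models loaded with device_map="auto".

# Show where each module of the model was placed (e.g. "mlu:0", "mlu:1", ...).
for module_name, device in pipeline.model.hf_device_map.items():
    print(f"{module_name} -> {device}")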

@huismiling huismiling requested review from SunMarc and muellerzr March 18, 2024 01:18
@muellerzr
Contributor

Fantastic! Thanks for verifying! I’ll merge once the CI finishes :)

@huismiling
Contributor Author

huismiling commented Mar 19, 2024

`torch._dynamo` has a conflict with `lru_cache`:
E torch._dynamo.exc.Unsupported: call_method UserDefinedObjectVariable(_lru_cache_wrapper) __call__ [] {}

Switched to using `device.type` for the MLU device check (see the sketch below).

Local tests pass!
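
For illustration only, the idea looks roughly like the sketch below: a plain device.type comparison is a simple attribute check that TorchDynamo can trace, whereas calling an lru_cache-wrapped helper from compiled code raises the Unsupported error above. This is a hedged sketch of the pattern, not the exact code in the PR.

import torch

def is_mlu_device(device: torch.device) -> bool:
    # Plain attribute comparison; nothing here is wrapped in functools.lru_cache,
    # so torch._dynamo can trace through it without raising Unsupported.
    return device.type == "mlu"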

@muellerzr
Contributor

Great work! Thanks for verifying! (failing test is unrelated)
