
[BUG] Error in llama3.1 resizing embedding with ZeRO 3 #26

@Gaiejj

Description


Required prerequisites

What version of align-anything are you using?

0.1.0-dev

System information

  • transformers version: 4.43.1
  • Platform: Linux-5.15.0-1040-nvidia-x86_64-with-glibc2.35
  • Python version: 3.11.9
  • Huggingface_hub version: 0.24.1
  • Safetensors version: 0.4.3
  • Accelerate version: 0.33.0
  • Accelerate config: not found
  • PyTorch version (GPU?): 2.3.1+cu121 (True)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using distributed or parallel set-up in script?:
  • Using GPU in script?:

Problem description

Llama 3.1 is naturally supported by the training, evaluation, and deployment modules of Align-Anything. However, according to our tests, an issue in the current transformers release means it temporarily cannot be trained with DeepSpeed's ZeRO-3. Our developers have reported this issue to the transformers community; we have received a clear response and will continue to follow up.

This bug may also affect the training of other model types. If you need a stable version for training right now, you can temporarily pin transformers to version 4.41.2.
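As a quick guard, a training script can check the installed transformers version before attempting ZeRO-3 training. The sketch below is illustrative: the function name is hypothetical, and the 4.43 cutoff is an assumption based only on this report (4.43.1 observed broken, 4.41.2 known good).

```python
def zero3_resize_known_good(tf_version: str) -> bool:
    """Return True if `tf_version` predates the 4.43.x series, where this
    report observed the ZeRO-3 embedding-resize failure (assumed cutoff)."""
    major, minor = (int(x) for x in tf_version.split('.')[:2])
    return (major, minor) < (4, 43)
```

A script could call this with `transformers.__version__` and fall back to ZeRO-2 (or refuse to start) when it returns `False`.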

If you want to fine-tune Llama 3.1, we have verified that ZeRO-2 runs without errors on the latest transformers 4.43.0.
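For reference, a minimal ZeRO-2 DeepSpeed config along these lines can be passed as `ds_cfgs_path` in the reproducer below (illustrative values, not the project's shipped config):

```json
{
  "bf16": { "enabled": true },
  "zero_optimization": {
    "stage": 2
  },
  "gradient_accumulation_steps": 1,
  "train_micro_batch_size_per_gpu": 1
}
```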

Reproducible example code

import contextlib
import json

import torch
import deepspeed

from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
)
from transformers.integrations.deepspeed import (
    HfDeepSpeedConfig,
    is_deepspeed_zero3_enabled,
)


DEFAULT_BOS_TOKEN: str = '<s>'
DEFAULT_EOS_TOKEN: str = '</s>'
DEFAULT_PAD_TOKEN: str = '<pad>'
DEFAULT_UNK_TOKEN: str = '<unk>'

model_name_or_path = 'PATHTO/Llama-3.1'
ds_cfgs_path = 'PATH'

deepspeed.init_distributed()

with open(ds_cfgs_path) as f:
    ds_cfgs = json.load(f)
    ds_cfgs['bf16']['enabled'] = True

dstchf = HfDeepSpeedConfig(ds_cfgs)

tokenizer = AutoTokenizer.from_pretrained(
    model_name_or_path,
    model_max_length=2048,
    padding_side='right',
    trust_remote_code=True,
)
model = AutoModelForCausalLM.from_pretrained(
        model_name_or_path,
        torch_dtype=torch.bfloat16,
        trust_remote_code=True,
)

# Reference: https://github.com/tatsu-lab/stanford_alpaca/blob/main/train.py
def resize_tokenizer_embedding(tokenizer, model) -> None:
    """Resize tokenizer and embedding.

    Note: This is the unoptimized version that may make your embedding size not be divisible by 64.
    """
    def init_new_embeddings(
        embeddings,
        new_num_embeddings: int,
        num_new_embeddings: int,
    ) -> None:
        if embeddings is None:
            return

        params = [embeddings.weight]
        # True for transformers 4.43.1, False for transformers 4.41.2,
        # i.e. 4.43.1 leaves the resized weight as a ZeRO-3 partitioned parameter.
        print(hasattr(embeddings.weight, 'ds_id'))
        context = (
            deepspeed.zero.GatheredParameters(params, modifier_rank=0)
            if is_deepspeed_zero3_enabled()
            else contextlib.nullcontext()
        )
        with context:
            for param in params:
                if param is None:
                    continue
                assert param.size(0) == new_num_embeddings, f'{param.size(0)}, {new_num_embeddings}'
                # Bug here under ZeRO-3 with transformers 4.43.1:
                # param.size(0) is 32000 while new_num_embeddings is 32001
                param_data = param.data
                param_mean = param_data[:-num_new_embeddings].mean(dim=0, keepdim=True)
                param_data[-num_new_embeddings:] = param_mean

    special_tokens_dict = {}
    if tokenizer.pad_token is None:
        special_tokens_dict['pad_token'] = DEFAULT_PAD_TOKEN
    if tokenizer.eos_token is None:
        special_tokens_dict['eos_token'] = DEFAULT_EOS_TOKEN
    if tokenizer.bos_token is None:
        special_tokens_dict['bos_token'] = DEFAULT_BOS_TOKEN
    if tokenizer.unk_token is None:
        special_tokens_dict['unk_token'] = DEFAULT_UNK_TOKEN

    num_new_tokens = tokenizer.add_special_tokens(special_tokens_dict)
    new_num_embeddings = len(tokenizer)

    model.config.bos_token_id = tokenizer.bos_token_id
    model.config.eos_token_id = tokenizer.eos_token_id
    model.config.pad_token_id = tokenizer.pad_token_id

    if num_new_tokens > 0:
        hf_device_map = getattr(model, 'hf_device_map', {})
        devices = {
            torch.device(device)
            for device in hf_device_map.values()
            if device not in {'cpu', 'disk'}
        }
        is_model_parallel = len(devices) > 1

        if not is_model_parallel:
            model.resize_token_embeddings(new_num_embeddings)

            init_new_embeddings(
                model.get_input_embeddings(),
                new_num_embeddings=new_num_embeddings,
                num_new_embeddings=num_new_tokens,
            )
            init_new_embeddings(
                model.get_output_embeddings(),
                new_num_embeddings=new_num_embeddings,
                num_new_embeddings=num_new_tokens,
            )

resize_tokenizer_embedding(tokenizer=tokenizer, model=model)
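The mean-initialization step at the heart of `init_new_embeddings` can be exercised in isolation, which makes it clear what the ZeRO-3 gather is supposed to enable. The sketch below is a pure-Python analogue (lists of rows instead of torch tensors; the function name is mine, not from the reproducer):

```python
def mean_init_new_rows(weight, num_new):
    """Return a copy of `weight` (a list of equal-length float rows) whose
    last `num_new` rows are replaced by the column-wise mean of the earlier
    rows — the same recipe `init_new_embeddings` applies to the embedding
    matrix after `resize_token_embeddings`."""
    old = weight[:-num_new]
    dim = len(weight[0])
    mean = [sum(row[i] for row in old) / len(old) for i in range(dim)]
    return weight[:-num_new] + [mean[:] for _ in range(num_new)]
```

Under ZeRO-3, each rank only holds a shard of the embedding matrix, which is why the original code wraps this step in `deepspeed.zero.GatheredParameters` to see the full `(vocab_size, hidden_dim)` weight.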

Traceback

No response

Expected behavior

No response

Additional context

No response

Labels

bug (Something isn't working)