Is there an existing issue for this bug?
🐛 Describe the bug
I got an error when running `applications/Colossal-LLaMA/prepare_sft_dataset.py`.

The script is:

```shell
python /mnt/data/tool/ColossalAI-0.4.0/applications/Colossal-LLaMA/prepare_sft_dataset.py \
    --data_input_dirs "/mnt/data/dataset/llama3/prepare/original/2000items" \
    --tokenizer_dir "/mnt/data/model/modelscope/Meta-Llama-3-8B-Instruct" \
    --data_output_dirs "/mnt/data/dataset/llama3/prepare/2000items-llama3" \
    --max_length 1024 \
    --num_spliced_dataset_bins 10 \
    --llama_version 3
```
The error is:

```
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
[07/19/24 16:52:06] INFO colossalai - colossalai - INFO: /mnt/data/tool/ColossalAI-0.4.0/applications/Colossal-LLaMA/prepare_sft_dataset.py:102 main
                    INFO colossalai - colossalai - INFO: Start to process part-0/10 of all original datasets.
Traceback (most recent call last):
  File "/mnt/data/tool/ColossalAI-0.4.0/applications/Colossal-LLaMA/prepare_sft_dataset.py", line 147, in <module>
    main()
  File "/mnt/data/tool/ColossalAI-0.4.0/applications/Colossal-LLaMA/prepare_sft_dataset.py", line 106, in main
    "tokenizer": tokenizer,
    ^^^^^^^^^
UnboundLocalError: cannot access local variable 'default_conversation' where it is not associated with a value
```
I've solved this bug and will submit a PR soon.
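For context, the error above is the classic Python pattern where a local variable is only bound inside some branches of an `if`/`elif` chain, so any unmatched case leaves it undefined and the first later use raises `UnboundLocalError`. Below is a minimal, hypothetical sketch of that pattern and one common fix (failing fast with a clear message); the function and template names are illustrative and not taken from the actual `prepare_sft_dataset.py` source:

```python
def select_conversation_template(llama_version: int) -> str:
    """Illustrative: pick a conversation template by LLaMA version.

    Bug pattern: if `default_conversation` were only assigned in the
    matching branches, an unexpected `llama_version` would fall through
    and the `return` would raise UnboundLocalError.
    """
    if llama_version == 2:
        default_conversation = "llama2_template"  # placeholder value
    elif llama_version == 3:
        default_conversation = "llama3_template"  # placeholder value
    else:
        # Fix: cover the fall-through case explicitly instead of
        # leaving the variable unbound.
        raise ValueError(f"Unsupported llama_version: {llama_version}")
    return default_conversation
```

A subtle variant of the same failure is a type mismatch: if argparse delivers `llama_version` as the string `"3"`, the integer comparisons above never match and the fall-through is hit even for a "valid" version.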
Environment
● Ubuntu 22.04
● CPU: 96 cores
● RAM: 736 GiB
● GPU: 8 × NVIDIA V100 (32 GB)
● Python 3.11.5
● ColossalAI 0.4.0
● CUDA 11.8
● PyTorch 2.1.0+cu118