Merged

tra #183

Changes from all commits (30 commits)
c3bef20
add autotune (#4822)
Xu-Kai Sep 28, 2023
ed06731
update Colossal (#4832)
TongLi3701 Sep 28, 2023
3a74eb4
[Infer] Colossal-Inference serving example w/ TorchServe (single GPU …
yuanheng-zhao Oct 2, 2023
573f270
[Infer] Serving example w/ ray-serve (multiple GPU case) (#4841)
yuanheng-zhao Oct 2, 2023
013a4be
[inference]fix import bug and delete down useless init (#4830)
CjhHa1 Oct 4, 2023
d1fcc0f
[infer] fix test bug (#4838)
Xu-Kai Oct 4, 2023
db40e08
[test] modify model supporting part of low_level_zero plugin (includi…
Oct 5, 2023
c97a352
fix: typo in comment of low_level_zero plugin
shawlleyw Oct 5, 2023
81ee91f
Merge pull request #4858 from Shawlleyw/main
ppt0011 Oct 6, 2023
ad23460
Merge pull request #4856 from KKZ20/test/model_support_for_low_level_…
ppt0011 Oct 6, 2023
cb3a25a
[checkpointio] hotfix torch 2.0 compatibility (#4824)
ver217 Oct 7, 2023
eef96e0
polish code for gptq (#4793)
littsk Sep 25, 2023
07ed155
[NFC] polish colossalai/inference/quant/gptq/cai_gptq/__init__.py cod…
MichelleMa8 Sep 27, 2023
cd6a962
[NFC] polish code style (#4799)
Camille7777 Sep 27, 2023
8aed02b
[nfc] fix minor typo in README (#4846)
blagoySimandov Oct 7, 2023
6a21f96
[doc] update advanced tutorials, training gpt with hybrid parallelism…
flybird11111 Oct 10, 2023
3043d5d
Update modelscope link in README.md
Camille7777 Oct 10, 2023
d6c4b9b
Update main README.md
Camille7777 Oct 10, 2023
afe10a8
Update README.md
Camille7777 Oct 10, 2023
652adc2
Update README.md
Camille7777 Oct 10, 2023
08a9f76
[Pipeline Inference] Sync pipeline inference branch to main (#4820)
FoolPlayer Oct 11, 2023
fdec650
fix test llama (#4884)
Xu-Kai Oct 11, 2023
1dcaf24
[doc] add reminder for issue encountered with hybrid adam
ppt0011 Oct 11, 2023
ffd9a3c
[hotfix] fix bug in sequence parallel test (#4887)
littsk Oct 11, 2023
c1fab95
Merge pull request #4889 from ppt0011/main
ppt0011 Oct 12, 2023
df63564
[gemini] support amp o3 for gemini (#4872)
ver217 Oct 12, 2023
83b52c5
[feature] Add clip_grad_norm for hybrid_parallel_plugin (#4837)
littsk Oct 12, 2023
39f2582
[hotfix] fix lr scheduler bug in torch 2.0 (#4864)
Oct 12, 2023
77a9328
[inference] add llama2 support (#4898)
Xu-Kai Oct 13, 2023
a0684e7
[feature] support no master weights option for low level zero plugin …
KKZ20 Oct 13, 2023

README.md (3 changes: 2 additions & 1 deletion)

@@ -132,7 +132,8 @@ distributed training and inference in a few lines.
- One half-day of training using a few hundred dollars yields similar results to mainstream large models, open-source and commercial-free domain-specific LLM solution.
[[code]](https://github.com/hpcaitech/ColossalAI/tree/main/applications/Colossal-LLaMA-2)
[[blog]](https://www.hpc-ai.tech/blog/one-half-day-of-training-using-a-few-hundred-dollars-yields-similar-results-to-mainstream-large-models-open-source-and-commercial-free-domain-specific-llm-solution)
-[[model weights]](https://huggingface.co/hpcai-tech/Colossal-LLaMA-2-7b-base)
+[[HuggingFace model weights]](https://huggingface.co/hpcai-tech/Colossal-LLaMA-2-7b-base)
+[[Modelscope model weights]](https://www.modelscope.cn/models/colossalai/Colossal-LLaMA-2-7b-base/summary)

| | Backbone | Tokens Consumed | | MMLU | CMMLU | AGIEval | GAOKAO | CEval |
| :----------------------------: | :--------: | :-------------: | :------------------: | :-----------: | :-----: | :----: | :----: | :------------------------------: |

applications/Colossal-LLaMA-2/README.md (22 changes: 20 additions & 2 deletions)

@@ -25,7 +25,9 @@
* [2023/09] [One Half-Day of Training Using a Few Hundred Dollars Yields Similar Results to Mainstream Large Models, Open-Source and Commercial-Free Domain-Specific Llm Solution](https://www.hpc-ai.tech/blog/one-half-day-of-training-using-a-few-hundred-dollars-yields-similar-results-to-mainstream-large-models-open-source-and-commercial-free-domain-specific-llm-solution)
[[code]](https://github.com/hpcaitech/ColossalAI/tree/main/applications/Colossal-LLaMA-2)
[[blog]](https://www.hpc-ai.tech/blog/one-half-day-of-training-using-a-few-hundred-dollars-yields-similar-results-to-mainstream-large-models-open-source-and-commercial-free-domain-specific-llm-solution)
-[[model weights]](https://huggingface.co/hpcai-tech/Colossal-LLaMA-2-7b-base)
+[[HuggingFace model weights]](https://huggingface.co/hpcai-tech/Colossal-LLaMA-2-7b-base)
+[[Modelscope model weights]](https://www.modelscope.cn/models/colossalai/Colossal-LLaMA-2-7b-base/summary)


## Colossal-LLaMA-2-7B
The [Colossal-AI](https://github.com/hpcaitech/ColossalAI) team has introduced the open-source model **Colossal-LLaMA-2-7B-base**. This model, a derivation of LLaMA-2, has undergone continual pre-training involving approximately 8.5 billion tokens over a duration of 15 hours with 64 A800 GPUs. At a cost of **less than $1,000**, you can achieve results **similar to those that cost millions of dollars to pretrain from scratch**. It is licensed under the LLaMA-2 license and [Apache 2.0 License](https://github.com/hpcaitech/ColossalAI/blob/main/LICENSE) **without any additional commercial use restrictions**. This solution can also be used to build models of specific domain knowledge or tasks.
@@ -122,7 +124,23 @@ pred = model.generate(**inputs,
print(tokenizer.decode(pred.cpu()[0], skip_special_tokens=True)[len(input):])
```

-You can also download model weights from [🤗HuggingFace](https://huggingface.co/hpcai-tech/Colossal-LLaMA-2-7b-base).
+You can also load our model using ModelScope with the following code:
+```Python
+from modelscope import AutoModelForCausalLM, AutoTokenizer, snapshot_download
+model_dir = snapshot_download('colossalai/Colossal-LLaMA-2-7b-base', revision='v1.0.1')
+tokenizer = AutoTokenizer.from_pretrained(model_dir, device_map="auto", trust_remote_code=True)
+model = AutoModelForCausalLM.from_pretrained(model_dir, device_map="auto", trust_remote_code=True).eval()
+generation_kwargs = {"max_new_tokens": 256,
+                     "top_p": 0.95,
+                     "temperature": 0.3
+                     }
+input = '离离原上草,'  # prompt: the opening line of a classic Chinese poem
+inputs = tokenizer(input, return_token_type_ids=False, return_tensors='pt')
+inputs = inputs.to('cuda:0')
+output = model.generate(**inputs, **generation_kwargs)
+print(tokenizer.decode(output.cpu()[0], skip_special_tokens=True)[len(input):])
+```
+You can download model weights from [🤗HuggingFace](https://huggingface.co/hpcai-tech/Colossal-LLaMA-2-7b-base) or [👾Modelscope](https://modelscope.cn/models/colossalai/Colossal-LLaMA-2-7b-base/summary).
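
For the HuggingFace route, a minimal loading sketch follows. It mirrors the ModelScope snippet above rather than quoting the repository's own example, assumes the standard `transformers` Auto classes, and passes `do_sample=True`, without which `top_p` and `temperature` have no effect.

```Python
# Minimal sketch: load the same checkpoint from the HuggingFace Hub.
# Assumes the standard transformers Auto classes; illustrative only,
# mirroring the ModelScope snippet above.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "hpcai-tech/Colossal-LLaMA-2-7b-base"
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_path, torch_dtype=torch.bfloat16, device_map="auto", trust_remote_code=True
).eval()

prompt = "离离原上草,"  # same poem-completion prompt as above
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
# do_sample=True is required for top_p/temperature to take effect
output = model.generate(**inputs, max_new_tokens=256, do_sample=True, top_p=0.95, temperature=0.3)
print(tokenizer.decode(output[0].cpu(), skip_special_tokens=True)[len(prompt):])
```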

## Usage
### Install

applications/README.md (2 changes: 1 addition & 1 deletion)

@@ -6,7 +6,7 @@ The list of applications include:

- [X] [Colossal-LLaMA-2](./Colossal-LLaMA-2/): Continual Pre-training of LLaMA-2.
- [X] [ColossalEval](./ColossalEval): Evaluation Pipeline for LLMs.
-- [X] [Chatbot](./Chat/README.md): Replication of ChatGPT with RLHF.
+- [X] [ColossalChat](./Chat/README.md): Replication of ChatGPT with RLHF.
- [X] [FastFold](https://github.com/hpcaitech/FastFold): Optimizing AlphaFold (Biomedicine) Training and Inference on GPU Clusters.

> Please note that the `Chatbot` application is migrated from the original `ChatGPT` folder.

colossalai/amp/naive_amp/mixed_precision_optimizer.py (75 changes: 60 additions & 15 deletions)

@@ -1,7 +1,7 @@
-from typing import Dict, List
+from typing import Dict, List, Tuple

import torch
-from torch import Tensor
+from torch import Tensor, inf
from torch.nn import Module, Parameter
from torch.optim import Optimizer

@@ -68,8 +68,6 @@ def __init__(
            self.mixed_precision = BF16MixedPrecisionMixin()
        else:
            raise ValueError(f"Unsupported precision: {precision}")
-        if max_norm > 0.0:
-            raise NotImplementedError("max_norm is not supported yet.")
        self.max_norm = max_norm
        self.working_to_master_map: Dict[Parameter, Tensor] = {}
        self.master_to_working_map: Dict[Tensor, Parameter] = {}
@@ -102,32 +100,65 @@ def zero_grad(self, *args, **kwargs):
        return super().zero_grad(*args, **kwargs)

    def _unscale_and_clip_grads(self, total_norm: float) -> None:
+        """
+        Unscale and clip gradients before performing the optimization step.
+
+        Args:
+            total_norm (float): The computed total gradient norm.
+
+        Returns:
+            None
+        """
        div_scale = 1.0

+        # If mixed-precision training is used, get the gradient division scale from the mixed-precision handler.
        if self.mixed_precision is not None:
            div_scale = self.mixed_precision.get_grad_div_scale()

        if self.max_norm > 0.0:
-            # norm is in fact norm*scale
+            # Calculate the scaling factor for gradient clipping
+            # The gradient norm is scaled by 'div_scale' and then clipped to 'max_norm'
            clip = ((total_norm / div_scale) + 1e-6) / self.max_norm

+            # If the clip factor exceeds 1, adjust 'div_scale' accordingly to ensure clipping
            if clip > 1:
                div_scale = clip * div_scale

+        # Apply the scaling factor to gradients
        for group in self.param_groups:
            for p in group["params"]:
                if p.grad is None:
                    continue
                p.grad.data.mul_(1.0 / div_scale)

-    def _compute_grad_norm(self) -> float:
-        if self.max_norm <= 0.0:
-            return 0.0
-        grads = [p.grad for group in self.param_groups for p in group["params"] if p.grad is not None]
-        if len(grads) == 0:
-            return 0.0
-        device = grads[0].device
-        # TODO(ver217): support tp
-        total_norm = torch.norm(torch.stack([torch.norm(g.detach(), 2).to(device) for g in grads]), 2)
-        return total_norm.item()
+    def _compute_grad_norm(self, param_gradient_pairs: List[Tuple[Tensor, Tensor]], norm_type: int = 2) -> float:
+        r"""
+        Compute and return the gradient norm for gradient clipping.
+
+        Args:
+            param_gradient_pairs (List[Tuple[Tensor, Tensor]]): List of (parameter, gradient) pairs; gradients are used for norm calculation.
+            norm_type (int, optional): Type of the norm used (e.g., 2 for L2 norm). Defaults to 2.
+
+        Returns:
+            float: The total norm of the given gradients.
+        """
+        if len(param_gradient_pairs) == 0:
+            return 0.0
+
+        # gradients used for norm calculation.
+        gradients = [grad for param, grad in param_gradient_pairs]
+
+        if norm_type == inf:
+            total_norm = max(grad.data.abs().max() for grad in gradients)
+        else:
+            total_norm_exponentiated = 0.0
+            for grad in gradients:
+                total_norm_exponentiated += grad.data.double().norm(norm_type) ** norm_type
+            total_norm = total_norm_exponentiated ** (1.0 / norm_type)
+
+        return total_norm

    def step(self, *args, **kwargs):
        if self.mixed_precision.should_skip_step():

@@ -142,8 +173,22 @@ def step(self, *args, **kwargs):
                if working_param.grad is not None:
                    p.grad = working_param.grad.data.float()
                    working_param.grad = None

-        total_norm = self._compute_grad_norm()
+        # gradient unscale and clip.
+        if self.max_norm <= 0:
+            # no need to compute gradient norm.
+            total_norm = 0.0
+        else:
+            # compute the total norm.
+            param_gradient_pairs = [
+                (self.master_to_working_map[p], p.grad)
+                for group in self.param_groups
+                for p in group["params"]
+                if p.grad is not None
+            ]
+            total_norm = self._compute_grad_norm(param_gradient_pairs)
        self._unscale_and_clip_grads(total_norm)

        self.optim.step(*args, **kwargs)
        # update working params
        for group in self.optim.param_groups:
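
Taken together, the new path in `step` gathers (working parameter, gradient) pairs, `_compute_grad_norm` reduces them to one total norm, and `_unscale_and_clip_grads` folds any excess over `max_norm` into the gradient division scale. Below is a standalone sketch of that arithmetic; it uses plain tensors and hypothetical helper names, not the optimizer's API.

```Python
# Standalone sketch of the norm-and-clip arithmetic from this diff.
# Plain tensors stand in for parameter gradients; div_scale plays the
# role of the loss-scale divisor. Helper names are hypothetical.
import torch

def compute_grad_norm(grads, norm_type=2.0):
    # inf-norm: largest absolute entry; otherwise sum |g|^p and take the p-th root
    if norm_type == torch.inf:
        return max(g.abs().max().item() for g in grads)
    total = sum(g.double().norm(norm_type) ** norm_type for g in grads)
    return float(total ** (1.0 / norm_type))

def unscale_and_clip(grads, total_norm, max_norm, div_scale=1.0):
    # total_norm was computed on scaled grads, so unscale by div_scale first;
    # if the unscaled norm still exceeds max_norm, fold the excess into div_scale
    if max_norm > 0.0:
        clip = ((total_norm / div_scale) + 1e-6) / max_norm
        if clip > 1:
            div_scale = clip * div_scale
    for g in grads:
        g.mul_(1.0 / div_scale)

grads = [torch.full((4,), 3.0), torch.full((4,), 4.0)]
norm = compute_grad_norm(grads)              # sqrt(4 * 9 + 4 * 16) = 10.0
unscale_and_clip(grads, norm, max_norm=5.0)
print(compute_grad_norm(grads))              # ~5.0 after clipping
```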

colossalai/booster/plugin/gemini_plugin.py (5 changes: 4 additions & 1 deletion)

@@ -97,7 +97,7 @@ def save_sharded_model(

        Path(checkpoint_path).mkdir(parents=True, exist_ok=True)

-        state_dict_shard = model.state_dict_shard(max_shard_size=max_shard_size, only_rank_0=True, dtype=torch.float32)
+        state_dict_shard = model.state_dict_shard(max_shard_size=max_shard_size, only_rank_0=True)
        weights_name, save_index_file = get_model_base_filenames(prefix, use_safetensors)
        index_file = CheckpointIndexFile(checkpoint_path)

@@ -257,6 +257,7 @@ class GeminiPlugin(DPPluginBase):
        warmup_non_model_data_ratio (float, optional): ratio of expected non-model data memory during warmup. Only for "auto" placement. Defaults to 0.8.
        steady_cuda_cap_ratio (float, optional): ratio of allowed cuda capacity for model data during steady state. Only for "auto" placement. Defaults to 0.9.
        precision (str, optional): precision. Support 'fp16' and 'bf16'. Defaults to 'fp16'.
+        master_weights (bool, optional): whether to keep fp32 master weights of the parameters for the optimizer update. Defaults to True.
        pin_memory (bool, optional): use pin memory on CPU. Defaults to False.
        force_outputs_fp32 (bool, optional): force outputs are fp32. Defaults to False.
        strict_ddp_mode (bool, optional): use strict ddp mode (only use dp without other parallelism). Defaults to False.
@@ -296,6 +297,7 @@ def __init__(
        warmup_non_model_data_ratio: float = 0.8,  # only for auto placement
        steady_cuda_cap_ratio: float = 0.9,  # only for auto placement
        precision: str = "fp16",
+        master_weights: bool = True,
        pin_memory: bool = False,
        force_outputs_fp32: bool = False,
        strict_ddp_mode: bool = False,
@@ -334,6 +336,7 @@
            min_chunk_size_m=min_chunk_size_m,
            memstats=memstats,
            mixed_precision=PRECISION_STR_TO_DTYPE[precision],
+            master_weights=master_weights,
        )
        self.zero_optim_config = dict(
            gpu_margin_mem_ratio=gpu_margin_mem_ratio,
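
Usage-wise, the new option is just another constructor keyword. A minimal sketch, with illustrative values only, assuming the standard `Booster` entry point:

```Python
# Minimal sketch of opting out of fp32 master weights via the new flag.
# Values are illustrative; other constructor arguments keep their defaults.
import colossalai
from colossalai.booster import Booster
from colossalai.booster.plugin import GeminiPlugin

colossalai.launch_from_torch(config={})  # requires a distributed launcher, e.g. torchrun

plugin = GeminiPlugin(
    precision="bf16",      # bf16 often tolerates training without master weights
    master_weights=False,  # skip fp32 master copies to cut optimizer memory
)
booster = Booster(plugin=plugin)
```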