Add Starcoder 2 #502

Merged
awni merged 25 commits into ml-explore:main from Muhtasham:add/sc2
Mar 3, 2024
Conversation

@Muhtasham
Contributor

Adding the new StarCoder2 code models that just dropped and were merged into Transformers.

@awni
Member

awni commented Feb 28, 2024

This looks nice. Is it working?

There is an open issue btw #500, not sure if @lazarust made any progress? Maybe this is a good PR to collaborate on, though it looks like it is basically done?

@lazarust
Contributor

@awni I haven't had time to make progress yet, but this solution looks like it's the right direction

@Muhtasham
Contributor Author

Is it working?

@awni not yet, HF is down 😞. But when trying to generate locally, despite the files being there and the import working, I get

ValueError: Model type starcoder2 not supported. Error: No module named 'mlx_lm.models.starcoder2'

But when I import it in a Python shell I do not get that error, only the HF-down error. Is this behaviour intended?

@lazarust happy to collaborate. This is still a draft based on the original Mistral implementation; I think we need some more changes from huggingface/transformers#29120

@awni
Member

awni commented Feb 29, 2024

but when I do in python shell I do not get that error but hf down error, is this behaviour intended

No it's not. Make sure you do not have another MLX LM installed (pip uninstall mlx-lm). Also make sure you have an editable install: pip install -e .

@Muhtasham
Contributor Author

qq @awni: is the MLX equivalent of gelu_pytorch_tanh this?

    def __call__(self, x) -> mx.array:
        return self.w2(nn.gelu(self.w1(x)) * self.w3(x))

@angeloskath
Member

@Muhtasham the tanh approximation in PyTorch is not strictly the same as nn.gelu, but it is close to several significant digits, so it should not cause any issues. See the plot at ml-explore/mlx#744 (comment) for the difference between PyTorch's tanh approximation and nn.gelu.
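For reference, the two variants can be compared numerically without either framework. Below is a plain-Python sketch (not the MLX or PyTorch code itself) of the exact GELU against the tanh approximation that gelu_pytorch_tanh uses; 0.044715 is the standard constant from that approximation:

```python
import math

def gelu_exact(x):
    # Exact GELU: x * Phi(x), computed via the Gauss error function
    return x * 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def gelu_tanh(x):
    # The tanh approximation used by gelu_pytorch_tanh
    return 0.5 * x * (1.0 + math.tanh(math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)))

# The two agree to a few significant digits across the typical activation range
for x in (-3.0, -1.0, 0.0, 0.5, 2.0):
    print(f"x={x:5.1f}  exact={gelu_exact(x):+.6f}  tanh={gelu_tanh(x):+.6f}")
```

The printed columns differ only around the fourth decimal place, which matches the observation that the approximation should not cause issues in practice.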

Comment thread llms/README.md
@awni
Member

awni commented Mar 1, 2024

How is this going? Were you able to generate any sensible code yet?

@Muhtasham
Contributor Author

Having some trouble with

raise ValueError(f"Received parameters not in model: {extras}.")

I went through the model config and nothing seems extra. Any hints @awni?

command

python -m mlx_lm.convert \
    --hf-path bigcode/starcoder2-3b \
    -q \
    --upload-repo mlx-community/starcoder2-3b-4bit

Comment thread llms/mlx_lm/models/starcoder2.py
Comment thread llms/mlx_lm/models/starcoder2.py Outdated
@mzbac
Contributor

mzbac commented Mar 1, 2024

I suggest either referring to the transformers implementation if you are familiar with the codebase, or loading the model weights and checking the weight names, which will give you a hint of the model structure.

https://github.com/huggingface/transformers/blob/main/src/transformers/models/starcoder2/modeling_starcoder2.py#L1070
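One hedged way to act on that suggestion is to diff the checkpoint's weight names against the model's parameter names. The dicts below are stand-ins for illustration; in practice the two sides would come from something like mx.load on the safetensors files and tree_flatten(model.parameters()):

```python
# Stand-in for the names found in the downloaded checkpoint
checkpoint_weights = {
    "model.layers.0.self_attn.q_proj.weight": None,
    "model.layers.0.self_attn.q_proj.bias": None,   # bias present in the checkpoint
    "model.norm.weight": None,
}

# Stand-in for the names the MLX model currently defines (no bias yet)
model_params = {
    "model.layers.0.self_attn.q_proj.weight": None,
    "model.norm.weight": None,
}

# Names in the checkpoint but not the model trigger the
# "Received parameters not in model" error; names only in the
# model would load as uninitialized weights.
extras = sorted(set(checkpoint_weights) - set(model_params))
missing = sorted(set(model_params) - set(checkpoint_weights))
print("extras: ", extras)
print("missing:", missing)
```

In this toy case the diff immediately points at a bias weight the model structure is missing, which is exactly the kind of hint the weight names give.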

Comment thread llms/mlx_lm/models/starcoder2.py Outdated
Comment thread llms/mlx_lm/models/starcoder2.py Outdated
Comment thread llms/mlx_lm/models/starcoder2.py Outdated
@mzbac
Contributor

mzbac commented Mar 1, 2024

And please note that starcoder2 seems to have enabled the bias weight for all the linear layers, so you may need to enable it in nn.Linear.

@awni
Member

awni commented Mar 1, 2024

Thanks for the help here @mzbac !! @Muhtasham after those fixes is it working?

@mzbac
Contributor

mzbac commented Mar 2, 2024

Looking at the tokenizer config, it specifies tokens that allow the model to fill in missing code in the middle of a snippet, with a prompt like prompt = <fim_prefix>${textAboveCursor}<fim_suffix>${textBelowCursor}<fim_middle>.

I'm not sure I follow how you generate just the middle portion... I guess I should look at the paper for the details.

Yeah, basically it's just a different way to structure the prompt that allows the model to autocomplete the middle portion. A more practical example is GitHub Copilot, where the model fills in the content at the cursor position.
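The prompt structure described above can be sketched as a small helper. The sentinel strings are the ones quoted from the tokenizer config; the function name is made up here for illustration:

```python
def build_fim_prompt(text_above_cursor: str, text_below_cursor: str) -> str:
    # StarCoder2 fill-in-the-middle sentinels: the model is asked to
    # produce the code that belongs between the prefix and the suffix.
    return (
        f"<fim_prefix>{text_above_cursor}"
        f"<fim_suffix>{text_below_cursor}"
        f"<fim_middle>"
    )

# E.g. the cursor sits just after a function signature
prompt = build_fim_prompt("\ndef quicksort(arr):\n", "\n")
print(prompt)
```

Generation then continues from <fim_middle>, so everything the model emits (up to the end token) is the "middle" that belongs at the cursor position.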

@awni
Member

awni commented Mar 2, 2024

Ok, I'm a little confused as to what to do with this. The model doesn't generate sensible outputs, so is the plan to land it so people can use LoRA / fine-tunes? Or do we need to fix the prompt? Or is there some other thing that I'm missing?

@mzbac
Contributor

mzbac commented Mar 2, 2024

Ok, I'm a little confused as to what to do with this. The model doesn't generate sensible outputs, so is the plan to land it so people can use LoRA / fine-tunes? Or do we need to fix the prompt? Or is there some other thing that I'm missing?

I will do some testing tonight. In terms of fine-tuning, my understanding is that a FIM model differs only in the training-data prompt format and is otherwise normal fine-tuning. However, I have not done any FIM fine-tuning, so I may be wrong.

@mzbac
Contributor

mzbac commented Mar 2, 2024

I did some quick tests and it looks like there are issues in the mlx implementation. I couldn't figure out exactly where, but when you use the transformers example as shown below, you can see that the model generates correct FIM code. However, this doesn't work in the mlx implementation.

# pip install git+https://github.com/huggingface/transformers.git # TODO: merge PR to main
from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "bigcode/starcoder2-3b"
device = "cuda" # for GPU usage or "cpu" for CPU usage

tokenizer = AutoTokenizer.from_pretrained(checkpoint)

model = AutoModelForCausalLM.from_pretrained(checkpoint).to(device)

tokenizer.eos_token = "<file_sep>"
inputs = tokenizer.encode("<fim_prefix>\ndef quicksort(arr):\n<fim_suffix>\n<fim_middle>\n", return_tensors="pt").to(device)
outputs = model.generate(input_ids=inputs,
    temperature=0, 
    max_new_tokens=100,
)
print(tokenizer.decode(outputs[0]))

output:

<fim_prefix>
def quicksort(arr):
<fim_suffix>
<fim_middle>
        if len(arr) <= 1:
                return arr
        else:
                pivot = arr[0]
                less = [i for i in arr[1:] if i <= pivot]
                greater = [i for i in arr[1:] if i > pivot]
                return quicksort(less) + [pivot] + quicksort(greater)

print(quicksort([10, 5, 2, 3]))<file_sep>

@awni
Member

awni commented Mar 2, 2024

However, this doesn't work in the mlx implementation

You tried it with the correct prompt right? Do we add the FIM prompt by default or is that something you have to do manually at the moment?

@mzbac
Contributor

mzbac commented Mar 2, 2024

However, this doesn't work in the mlx implementation

You tried it with the correct prompt right? Do we add the FIM prompt by default or is that something you have to do manually at the moment?

I used python -m mlx_lm.generate --model bigcode/starcoder2-3b --prompt "<fim_prefix>\ndef quicksort(arr):\n<fim_suffix>\n<fim_middle>\n" --temp 0.0 --eos-token "<file_sep>" --ignore-chat-template, which should be the exact prompt used in the transformers implementation.

@mzbac
Contributor

mzbac commented Mar 3, 2024

Based on @Blaizzy's PR (#518), the instruction prompt failed because it was using traditional RoPE. Changing to traditional = False should fix it. However, I have tried his implementation and it still doesn't work with the FIM prompt on my local machine. I suspect there may be something wrong with the RoPE implementation or configuration.

Edit:
Just found that FIM works if I change the prompt to:
python -m mlx_lm.generate --model bigcode/starcoder2-3b --prompt "<file_sep>quick_sort.py<file_sep><fim_prefix>\ndef quicksort(arr):\n<fim_suffix>\n<fim_middle>" --temp 0.0 --eos-token "<file_sep>" --ignore-chat-template
Not sure why the transformers implementation doesn't need the prompt to be so strict, though.

@awni
Member

awni commented Mar 3, 2024

Ok just to make sure I understand - the only difference between this PR and #518 is that the RoPE traditional is False?

@awni
Member

awni commented Mar 3, 2024

If that's the case, let's simply update that here? It's a one line change. I think it makes more sense than starting on a separate PR?

@awni
Member

awni commented Mar 3, 2024

Can confirm, seems to work now! Thanks!

@awni awni left a comment

Member

Thanks for the addition and contributions everyone!

@awni awni merged commit 81e2a80 into ml-explore:main Mar 3, 2024
@Blaizzy
Contributor

Blaizzy commented Mar 3, 2024

Based on @Blaizzy's PR (#518),
Can confirm, seems to work now! Thanks!
Thanks for the addition and contributions everyone!

Most welcome 😊!

@mzbac
Contributor

mzbac commented Mar 3, 2024

Ok just to make sure I understand - the only difference between this PR and #518 is that the RoPE traditional is False?

@awni, there are some minor model args that need to be cleaned up. For example, rms_norm_eps should be renamed to norm_epsilon to match Starcoder2's configuration. Also, it would be more efficient to use repeat for the GQA heads instead of mapping and concatenating. I'm not sure if @Blaizzy would like to make a follow-up PR for it; otherwise, I don't mind creating a patch PR to clean it up.
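A minimal, hypothetical sketch of the renaming, assuming the dataclass-based ModelArgs pattern used across mlx_lm models (the field set here is trimmed for illustration). Filtering the HF config through the dataclass fields keeps unknown keys from crashing, and naming the field norm_epsilon matches what Starcoder2's config.json actually contains:

```python
from dataclasses import dataclass

@dataclass
class ModelArgs:
    # Field names mirror Starcoder2's config.json, which uses
    # norm_epsilon rather than rms_norm_eps as in Mistral/Llama configs.
    model_type: str
    hidden_size: int = 3072
    norm_epsilon: float = 1e-5

    @classmethod
    def from_dict(cls, config: dict) -> "ModelArgs":
        # Keep only the keys this dataclass declares; extra config
        # entries (vocab_size, etc. in this trimmed sketch) are ignored.
        known = {k: v for k, v in config.items() if k in cls.__dataclass_fields__}
        return cls(**known)

args = ModelArgs.from_dict(
    {"model_type": "starcoder2", "norm_epsilon": 1e-5, "vocab_size": 49152}
)
print(args)
```

With the field named to match the config, no rms_norm_eps-to-norm_epsilon translation layer is needed at load time.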

@Blaizzy
Contributor

Blaizzy commented Mar 3, 2024

Ok just to make sure I understand - the only difference between this PR and #518 is that the RoPE traditional is False?

I didn't notice the last update on this PR until early Saturday.

I'm happy I could help make it work for everyone 🚀!

@Blaizzy
Contributor

Blaizzy commented Mar 3, 2024

@mzbac I can do that :)

@Blaizzy
Contributor

Blaizzy commented Mar 3, 2024

Also, it would be more efficient to use repeat for the GQA heads instead of mapping and concatenating.

Could you elaborate on this?

@mzbac
Contributor

mzbac commented Mar 3, 2024

Also, it would be more efficient to use repeat for the GQA heads instead of mapping and concatenating.

Could you elaborate on this?

Yeah, the current implementation here https://github.com/ml-explore/mlx-examples/blob/main/llms/mlx_lm/models/starcoder2.py#L71-L76 would be updated to something like https://github.com/ml-explore/mlx-examples/blob/main/llms/mlx_lm/models/llama.py#L83-L85.
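The difference can be illustrated without the framework. This plain-Python sketch uses nested lists in place of arrays; both expansions of the K/V heads for grouped-query attention produce the same result, but the repeat form is a single pass (roughly what an element-wise repeat like mx.repeat along the heads axis computes) instead of materializing per-group copies and stitching them back together:

```python
# 2 K/V heads of head_dim 2, expanded for 6 query heads (repeats = 3)
kv_heads = [[1.0, 2.0], [3.0, 4.0]]
repeats = 3  # n_heads // n_kv_heads

# concatenate-style: build each group's copies, then stitch them together
concat_style = []
for head in kv_heads:
    concat_style.extend([head] * repeats)

# repeat-style: one flat repeat along the heads axis
repeat_style = [head for head in kv_heads for _ in range(repeats)]

# Same expanded head sequence either way: each K/V head appears
# `repeats` times in a row, grouped with its query heads.
print(concat_style == repeat_style)
```

With real arrays the repeat version avoids the intermediate per-group tensors and the concatenation, which is where the speedup noted in #443 comes from.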

@Blaizzy
Contributor

Blaizzy commented Mar 3, 2024

I see, thanks for clarifying! Will do.

I thought so too, which is why I used Llama's repeat.

If I may ask, why was it done differently here?

@mzbac
Contributor

mzbac commented Mar 3, 2024

Yeah, based on the previous PR, simply repeating is faster than concatenating. Just FYI: #443

@sislam-provenir

This feature came at the perfect time for me! 24 hours ago there was no support and now there is. Love open source! 🩶

@sislam-provenir

sislam-provenir commented Mar 3, 2024

So I'm trying to fine-tune StarCoder2 using QLoRA.

$ pwd
>>> /.../mlx-examples/llms

When I issue the fine-tuning command, I get an AttributeError:

$ python -m mlx_lm.lora \                                           
    --model mlx-community/starcoder2-3b-4bit \
    --train \
    --data $(realpath ../lora/data) \
    --iters 10


>>> None of PyTorch, TensorFlow >= 2.0, or Flax have been found. Models won't be available and only tokenizers, configuration and file/data utilities can be used.
Loading pretrained model
Fetching 7 files: 100%|███████████████████████████████████████████████████████████████████████| 7/7 [00:00<00:00, 61422.86it/s]
Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "/Users/sameenislam/source/sameenislam/mlx-examples/llms/mlx_lm/lora.py", line 246, in <module>
    run(args)
  File "/Users/sameenislam/source/sameenislam/mlx-examples/llms/mlx_lm/lora.py", line 172, in run
    linear_to_lora_layers(model, args.lora_layers)
  File "/Users/sameenislam/source/sameenislam/mlx-examples/llms/mlx_lm/tuner/utils.py", line 27, in linear_to_lora_layers
    if model.model_type in [
       ^^^^^^^^^^^^^^^^
  File "/Users/sameenislam/anaconda3/lib/python3.11/site-packages/mlx/nn/layers/base.py", line 137, in __getattr__
    super(Module, self).__getattr__(key, val)
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
AttributeError: 'super' object has no attribute '__getattr__'. Did you mean: '__setattr__'?

Has anyone encountered this?

Also including this to show that inference is working:

$ python -m mlx_lm.generate --model mlx-community/starcoder2-3b-4bit --prompt "hello"

>>> None of PyTorch, TensorFlow >= 2.0, or Flax have been found. Models won't be available and only tokenizers, configuration and file/data utilities can be used.
Fetching 7 files: 100%|██████████████████████████████████████████████████████████████████████| 7/7 [00:00<00:00, 145347.17it/s]
==========
Prompt: hello
_1(void) {
	printf("hello, world\n");
}

void helloworld_print_double_1(double a) {
	printf("%f\n", a);
}

double helloworld_square_1(double a) {
	return a * a;
}

double helloworld_square_2(double a) {
	return a * a;
}

double helloworld_square_
==========
Prompt: 16.348 tokens-per-sec
Generation: 40.031 tokens-per-sec

@mzbac
Contributor

mzbac commented Mar 3, 2024

So I'm trying to fine-tune StarCoder2 using QLoRA. […] When I issue the fine-tuning command, I get an AttributeError: […] Has anyone encountered this?

There is a missing model_type in the starcoder2 model. You can try adding it to your local code as shown in this PR.
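A toy sketch of that failure mode and the fix (these classes stand in for the real Model; the real error surfaces through MLX's Module attribute lookup): the LoRA tuning code reads model.model_type, so the constructor has to keep the value around instead of dropping it.

```python
class BrokenModel:
    # Drops model_type, so a later model.model_type lookup
    # raises AttributeError during LoRA setup
    def __init__(self, model_type: str):
        pass

class FixedModel:
    # Stores the value from the model args, as the patch does
    def __init__(self, model_type: str):
        self.model_type = model_type

model = FixedModel("starcoder2")
print(model.model_type)

try:
    BrokenModel("starcoder2").model_type
except AttributeError as exc:
    print("broken:", exc)
```

The one-line fix in the model's __init__ is all that's needed for linear_to_lora_layers to dispatch on the model type.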

@sislam-provenir

There is a missing model_type in the starcoder2. You can try adding it to your local code as shown in this PR.

Awesome, I've made the change locally from PR #522 and it's working like a charm!

$ python -m mlx_lm.lora \ 
    --model $(realpath mlx_model) \
    --train \
    --data $(realpath ../lora/data) \
    --iters 10

>>> None of PyTorch, TensorFlow >= 2.0, or Flax have been found. Models won't be available and only tokenizers, configuration and file/data utilities can be used.
Loading pretrained model
Total parameters 602.983M
Trainable parameters 1.212M
Loading datasets
Training
Starting training..., iters: 10
Iter 1: Val loss 2.363, Val took 30.839s
Iter 10: Train loss 2.274, Learning Rate 1.000e-05, It/sec 0.490, Tokens/sec 196.036, Trained Tokens 3999
Saved final adapter weights to adapters.npz.

P.S. $(realpath mlx_model) is the model obtained with:

$ python -m mlx_lm.convert \                                        
    --hf-path bigcode/starcoder2-3b \
    -q

devonthomas35 pushed a commit to devonthomas35/mlx-examples that referenced this pull request Mar 11, 2024
* Add Starcoder2 model and update utils.py

* Refactor model arguments and modules in starcoder2.py

* Refactor FeedForward class to MLP in starcoder2.py

* Fix typo

* pre-commit

* Refactor starcoder2.py: Update model arguments and modules

* Fix LM head and MLP layers

* Rename  input layer norm

* Update bias in linear layers

* Refactor token embeddings in Starcoder2Model

* Rename to standard HF attention layer name

* Add LayerNorm

* Add transposed token embeddings (like in Gemma)

* Refactor MLP and TransformerBlock classes

* Add tie_word_embeddings option to ModelArgs and update Model implementation

* Add conditional check for tying word embeddings in Starcoder2Model

* Fix bias in lm_head linear layer

* Remove unused LayerNorm in stablelm

* Update transformers dependency to use GitHub repository

* fix lm head bug, revert transformer req

* Update RoPE initialization in Attention class

---------

Co-authored-by: Awni Hannun <awni@apple.com>
junpeiz pushed a commit to junpeiz/mlx-examples that referenced this pull request Jul 28, 2025
…-explore#877)

* Implementation of mlx.random.multivariate_normal (ml-explore#502)

* Update python/src/random.cpp

Co-authored-by: Awni Hannun <awni.hannun@gmail.com>

* Update python/src/random.cpp

Co-authored-by: Awni Hannun <awni.hannun@gmail.com>

* Update python/src/random.cpp

Co-authored-by: Awni Hannun <awni.hannun@gmail.com>

* Updated typo in docstring

* Restricted multivariate_normal to  float32

* Generic mean and variance shapes

* Review edits

* Update mlx/random.cpp

Co-authored-by: Awni Hannun <awni.hannun@gmail.com>

* Update python/src/random.cpp

Co-authored-by: Awni Hannun <awni.hannun@gmail.com>

* Update python/src/random.cpp

Co-authored-by: Awni Hannun <awni.hannun@gmail.com>

* Update python/src/random.cpp

Co-authored-by: Awni Hannun <awni.hannun@gmail.com>

* Test for ndim of mean and cov

* nits

* smaller size for test

* fix broadcasted sampling

---------

Co-authored-by: Awni Hannun <awni.hannun@gmail.com>
Co-authored-by: Awni Hannun <awni@apple.com>

9 participants