A C/CUDA implementation of Qwen3-0.6B inference.
First, clone the repo:
```sh
git clone https://github.com/asdf93074/qwen.c
```

Then put the model.safetensors file from this link into the root of your repo:

https://huggingface.co/Qwen/Qwen3-0.6B/blob/main/model.safetensors
```sh
make release chat
```

Builds the implementation into a shared library, which is then used by `python chat.py` for chatting.
```sh
make run
```

Uses run.c as the entrypoint, which loads the model and prints the generated tokens. Not really of much use unless you want to hack on it yourself.
NOTE: It only supports CUDA as a backend.
I wanted to learn more about C, CUDA, and deep learning libraries in general. The goal was to build something you could actually talk to.
I tried it with Qwen3-0.6B. Since the other Qwen3 models share the same architecture, you could technically use this with any of them too (remember to update the hardcoded number of layers/heads in json.c); the relevant shape constants are sketched below.
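For reference, these are the Qwen3-0.6B shape constants as given in the Hugging Face config.json. The identifier names here are illustrative, not the ones json.c actually uses:

```c
/* Qwen3-0.6B shapes per the HF config.json; the names below are
 * illustrative stand-ins, not the identifiers used in json.c. */
#define N_LAYERS      28
#define N_HEADS       16   /* query heads            */
#define N_KV_HEADS     8   /* grouped-query KV heads */
#define HIDDEN_SIZE 1024
#define HEAD_DIM     128
```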
- Most of the kernels are naively written; optimized versions are still to be implemented.
- Decoding is greedy: it always picks the argmax token, which is known to cause repetitive output (though I didn't run into this in testing); see the sketch after this list.
- The C code can only load safetensors, but you can also load weights in Python and run them through it.
- The KV cache and RoPE matrices are generated for a max length of only 2048 to save memory; a sketch of the RoPE precompute also follows this list.
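Greedy decoding is just an argmax over the final-position logits. A minimal host-side sketch; the logits layout and the `vocab_size` parameter are assumptions, not the repo's actual interface:

```c
/* Greedy decoding sketch: return the highest-scoring token id.
 * Assumes `logits` holds one float per vocabulary entry for the
 * last position, already copied back to host memory. */
int argmax_token(const float *logits, int vocab_size) {
    int best = 0;
    for (int i = 1; i < vocab_size; i++) {
        if (logits[i] > logits[best]) best = i;
    }
    return best;
}
```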
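And a sketch of precomputing the RoPE tables up to a fixed max length, which is why the 2048 cap is baked in at init. The `head_dim` and `theta` values are assumptions taken from the Qwen3-0.6B HF config (head_dim 128, rope_theta 1e6); verify against the config before reusing:

```c
#include <math.h>

/* Precompute RoPE cos/sin tables for positions [0, max_len).
 * cos_t and sin_t must each hold max_len * (head_dim / 2) floats.
 * head_dim and theta are assumed from the HF config (head_dim 128,
 * rope_theta 1e6 for Qwen3-0.6B); check the config before reuse. */
void build_rope_tables(float *cos_t, float *sin_t,
                       int max_len, int head_dim, float theta) {
    int half = head_dim / 2;
    for (int pos = 0; pos < max_len; pos++) {
        for (int i = 0; i < half; i++) {
            float freq = powf(theta, -2.0f * (float)i / (float)head_dim);
            float angle = (float)pos * freq;
            cos_t[pos * half + i] = cosf(angle);
            sin_t[pos * half + i] = sinf(angle);
        }
    }
}
```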
There are lots of possible extension points here if someone is looking to learn or contribute:
- better kernels (there's a lot of room for improvement here)
- dynamic KV caching (the size is fixed at init)
- KV cache offload to CPU
- remove the Python dependency by doing the byte-level BPE tokenization in C too
- better sampling techniques (temperature, top-p, top-k, etc.); a minimal temperature-sampling sketch follows this list
- better memory allocation (saves us from making lots of cudaMalloc/cudaFree calls); see the arena sketch after this list
- partial offload to CPU
- supporting quantized versions
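As a starting point for the sampling item above, here is a minimal host-side temperature-sampling sketch (no top-p/top-k). The logits buffer and `vocab_size` are assumptions, and it destructively overwrites `logits` with unnormalized probabilities:

```c
#include <math.h>
#include <stdlib.h>

/* Temperature sampling sketch: scale logits by 1/temperature,
 * softmax on the host, then draw from the resulting distribution.
 * Overwrites `logits` in place with unnormalized probabilities. */
int sample_temperature(float *logits, int vocab_size, float temperature) {
    /* Subtract the max logit for numerical stability. */
    float max_logit = logits[0];
    for (int i = 1; i < vocab_size; i++)
        if (logits[i] > max_logit) max_logit = logits[i];

    float sum = 0.0f;
    for (int i = 0; i < vocab_size; i++) {
        logits[i] = expf((logits[i] - max_logit) / temperature);
        sum += logits[i];
    }

    /* Inverse-CDF draw with a uniform sample in [0, sum). */
    float r = ((float)rand() / ((float)RAND_MAX + 1.0f)) * sum;
    float acc = 0.0f;
    for (int i = 0; i < vocab_size; i++) {
        acc += logits[i];
        if (r < acc) return i;
    }
    return vocab_size - 1; /* guard against float rounding */
}
```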
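And for the memory-allocation item, one common approach is a bump arena over a single large cudaMalloc done at startup. A sketch; the API shape and the 256-byte alignment are assumptions, not anything the repo currently provides:

```c
#include <cuda_runtime.h>
#include <stddef.h>

/* Bump-arena sketch: one big cudaMalloc up front, then cheap
 * pointer-bump sub-allocations instead of repeated
 * cudaMalloc/cudaFree calls during inference. */
typedef struct {
    char  *base;
    size_t used;
    size_t capacity;
} CudaArena;

int arena_init(CudaArena *a, size_t capacity) {
    a->used = 0;
    a->capacity = capacity;
    return cudaMalloc((void **)&a->base, capacity) == cudaSuccess ? 0 : -1;
}

void *arena_alloc(CudaArena *a, size_t size) {
    size_t aligned = (a->used + 255) & ~(size_t)255; /* 256B alignment (assumed) */
    if (aligned + size > a->capacity) return NULL;   /* arena exhausted */
    a->used = aligned + size;
    return a->base + aligned;
}

void arena_reset(CudaArena *a)   { a->used = 0; }    /* reuse per step */
void arena_destroy(CudaArena *a) { cudaFree(a->base); }
```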
License: MIT
