feat: Add Bitsandbytes quantization for transformer backend #1775

@fakezeta

Description

Is your feature request related to a problem? Please describe.

Quantization is not available for the transformer backend.
Describe the solution you'd like

Add bitsandbytes 4-bit quantization, triggered by the user with the `low_vram` flag in the model definition.
Additionally, I propose using the `f16` flag to change the `compute_dtype` to `bfloat16` for better performance on Nvidia cards.
Describe alternatives you've considered

Additional context

I've implemented this while fixing #1774.
This issue is opened for tracking.

Labels: enhancement (New feature or request)