ggml : fix unnecessary f32 -> f16 -> f32 casts (mmla) #5951

Merged
ggerganov merged 1 commit into master from gg/fix-mmla-q4_1-q8_1 on Mar 9, 2024

Conversation

@ggerganov
Member

ref #4966

The struct block_q8_1 on the CPU uses float instead of ggml_fp16_t:

#define QK8_1 32
typedef struct {
    float d;               // delta
    float s;               // d * sum(qs[i])
    int8_t  qs[QK8_1];     // quants
} block_q8_1;
static_assert(sizeof(block_q8_1) == 2*sizeof(float) + QK8_1, "wrong q8_1 block size/padding");
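
To see what the redundant cast costs, here is a minimal standalone sketch (illustrative only, not the actual mmla kernel): the fp16 helpers below stand in for ggml's GGML_FP32_TO_FP16 / GGML_FP16_TO_FP32 macros, and _Float16 assumes a compiler/target that supports it, e.g. AArch64 GCC/Clang. Since d is already a float, the f32 -> f16 -> f32 round trip only loses precision and adds conversions on the hot path.

#include <stdio.h>
#include <stdint.h>

#define QK8_1 32
typedef struct {
    float   d;             // delta
    float   s;             // d * sum(qs[i])
    int8_t  qs[QK8_1];     // quants
} block_q8_1;

// Stand-ins for ggml's fp16 conversion macros (assumes native _Float16 support)
typedef _Float16 ggml_fp16_t;
static inline ggml_fp16_t fp32_to_fp16(float x)       { return (ggml_fp16_t) x; }
static inline float       fp16_to_fp32(ggml_fp16_t x) { return (float) x; }

int main(void) {
    block_q8_1 y = { .d = 0.1234567f };

    // Redundant: d is already f32, so the f32 -> f16 -> f32 round trip
    // buys nothing and truncates the value to fp16 precision
    float d_roundtrip = fp16_to_fp32(fp32_to_fp16(y.d));

    // Fixed: read the float field directly
    float d_direct = y.d;

    printf("round-trip: %.9f\ndirect:     %.9f\n", d_roundtrip, d_direct);
    return 0;
}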

@ggerganov
Member Author

@snadampal I haven't tested this change - please give it a try just in case

@snadampal
Contributor

Hi @ggerganov, LGTM. I have tested it on AWS Graviton3-based c7g instances.

ggerganov merged commit 8380ecf into master on Mar 9, 2024
ggerganov deleted the gg/fix-mmla-q4_1-q8_1 branch on Mar 9, 2024 at 15:36
hazelnutcloud pushed a commit to hazelnutcloud/llama.cpp that referenced this pull request Mar 10, 2024
NeoZhangJianyu pushed a commit to NeoZhangJianyu/llama.cpp that referenced this pull request Mar 12, 2024
jordankanter pushed a commit to jordankanter/llama.cpp that referenced this pull request Mar 13, 2024
Seunghhon pushed a commit to Seunghhon/llama.cpp that referenced this pull request Apr 26, 2026
phuongncn pushed a commit to phuongncn/llama.cpp-gx10-dgx-sparks-deepseekv4 that referenced this pull request Apr 28, 2026