Skip to content

Implement backward passes for llama with small training llama from scratch example#1360

Merged
ggerganov merged 110 commits intoggml-org:masterfrom
xaedes:train-example
May 13, 2023
Merged

Implement backward passes for llama with small training llama from scratch example#1360
ggerganov merged 110 commits intoggml-org:masterfrom
xaedes:train-example

Conversation

@xaedes
Copy link
Copy Markdown
Collaborator

@xaedes xaedes commented May 7, 2023

Training a llama directly with ggml would be really nice.

For this I implemented the backward passes required for the llama model, tested them with test_grad0 from ggml repo and trained a small llama from scratch to output a sinus wave.

Also see the more detailed discussion in ggml-org/ggml#8 (comment)

List of all new operations that I had to add:

  • GGML_OP_ADD1 : Could be replaced with add(X,repeat(Y,X)) but it is faster when repeat can be avoided.
  • GGML_OP_ACC : Necessary for view backward pass. This adds src1 to view(src0, nb, offset) and returns tensor of shape src0.
  • GGML_OP_SET : (this is new) Necessary for propagating gradients through kv cache. Instead of copying to kv cache this function sets the values in the kv cache viewed with offsets and strides and returns a tensor representing the modified kv cache. This can also be inplace returning a view to the modified kv cache.
  • GGML_OP_LOG : (this is new) Necessary for cross entropy loss
  • GGML_OP_SUM_ROWS : Necessary for repeat backward pass: Reduces rows by summing them. shape[a,b,c,d] -> shape[1,b,c,d]
  • GGML_OP_SILU_BACK : Necessary for silu backward pass
  • GGML_OP_RMS_NORM_BACK : Could also be implemented using primitives, at the cost of performance.
  • GGML_OP_GET_ROWS_BACK : Necessary for get_rows backward pass: Adds src0[i] rows to opt0[src1[i]] rows, returning a tensor of shape opt0.
  • GGML_OP_DIAG : Necessary for softmax backward pass, alternative would have been to implement SOFTMAX_BACK directly, but DIAG is at least usable for other stuff. It turns rows into diagonal matrices.
  • GGML_OP_DIAG_MASK_ZERO : Necessary for diag_mask_inf backward pass
  • GGML_OP_ROPE_BACK : Necessary for rope backward pass.

Notable other changes:

  • add inplace and non-inplace variants for scale, diag_mask_inf, soft_max and rope
  • fix sub, mul and div functions to work correctly with transposed tensor, uses the same logic as in add:
  • fix ggml_forward_add functions to work correctly with transposed tensors. uses the same logic as in ggml_compute_forward_add_q_f32, but make it consistent across all ggml_compute_forward_add_... functions. this also slightly changes the mem access pattern of the different threads to work as in ggml_compute_forward_add_q_f32.
  • de-duplicate ggml_forward_dup code taking care of contiguous tensors of same type. with this we can duplicate tensor of any type as long as they are contiguous. the function is used in dup, get_rows_back and diag_mask (when not inplace).
  • there are some maybe too verbose comments including step-by-step derivation of gradients that could (or should?) be cleaned away.
  • (this is new) I added 1d and 4d functions for ggml_reshape and ggml_view.

The performance of various parts of the training could be improved, especially a fast ggml_out_prod could help speeding up the matrix multiplication backward pass.

There are two additional test files, one for testing gradients taken from ggml repo and one small test for testing optimization in general.

Exemplary training of a small llama model is demonstrated in the self-contained baby-llama example, where it is trained to output a sinus wave.

Some notes on first training tests:

  • lbfgs optimizer is faster and trains better than adam
  • target logits should represent NEXT token, not the current
  • target logits -1 & +1 work much better than 0 & +1
  • I think adding a BOS token also helped improving the training
  • trained with cross entropy loss gave worse generation results than trained with summed squared logit difference loss

A parallel batched forward function would probably be a good improvement. Training on multiple examples in (parallel) batch really seems to improve the training, but currently I can only do that by calling the forward function multiple times with different input data, which costs a lot of nodes in the computation graph, especially since backward pass is necessary as well.
The batches could be stored in another dimension of the tensors. It would probably just require some reshapes and view operations to make it work.

I did not look into training a LoRa finetune yet, but all the necessary machinery for that seems to be working.

xaedes added 30 commits May 1, 2023 02:41
- GGML_OP_ADD_AT
- GGML_OP_CPY
- GGML_OP_MUL_MAT (src0.grad)
- GGML_OP_PERMUTE
- GGML_OP_RESHAPE
- GGML_OP_SCALE
- GGML_OP_TRANSPOSE
- GGML_OP_VIEW

implement additional ggml operation GGML_OP_ADD_AT, which is necessary for backward pass of GGML_OP_VIEW.

this operation adds src1 to src0 with data offset, i.e. to view(src0, ..., offset).
the values are return in a tensor size of src0. values outside of [data+offset:data+offset+nbytes(src1)] are just the original values from src0.

still missing backward passes for llama:

- GGML_OP_DIAG_MASK_INF
- GGML_OP_GET_ROWS
- GGML_OP_RMS_NORM
- GGML_OP_ROPE
- GGML_OP_SILU
- GGML_OP_SOFT_MAX
- GGML_OP_DIAG_MASK_INF
- GGML_OP_GET_ROWS
- GGML_OP_RMS_NORM
- GGML_OP_SILU
- GGML_OP_SOFT_MAX

add necessary ggml operations GGML_OP_ADD1, GGML_OP_SILU_BACK, GGML_OP_RMS_NORM_BACK, GGML_OP_DIAG_MASK_ZERO, and GGML_OP_ROPE_BACK

GGML_OP_ADD1 is necessary to add a scalar value in the backward pass of GGML_OP_SOFT_MAX
GGML_OP_ADD1 could also be replaced by using GGML_OP_ADD and GGML_OP_REPEAT, but the performance would be worse. additionally GGML_OP_REPEAT will return unexpected value when the the input to GGML_OP_SOFT_MAX contains only a single scalar. in this case GGML_OP_REPEAT will not return the value that should be repeated (src1) but the value which shape the result should take (src0). So in this case it can not replace GGML_OP_ADD1.

GGML_OP_SILU_BACK, GGML_OP_RMS_NORM_BACK and GGML_OP_ROPE_BACK are necessary for backward pass of GGML_OP_SILU, GGML_OP_RMS_NORM and GGML_OP_ROPE. The backward pass for these functions cannot be easily composed of existing operations. Since the backward pass builds a computation graph we need operations forward pass implementations of the the required backward passes. Sounds a bit confusing at first, I know...

GGML_OP_DIAG_MASK_ZERO is necessary for backward pass of GGML_OP_DIAG_MASK_INF.

Some operations where previously inplace-only. for backward pass there needs to be non-inplace variants.
staying consistent with other operations that have non-inplace and inplace variants, the operations are changed to non-inplace and
functions with "_inplace" are added which are inplace.
in llama we need to call the inplace variants so that it is implemented as before.
for llama backward pass we need to use the non-inplace variants.

still not completely implemented backward passes for llama:

- GGML_OP_ROPE: needs forward pass for GGML_OP_ROPE_BACK
- GGML_OP_GET_ROWS: only necessary for tokenizer
after investigation rms norm for quite some time I come to the conclusion that neither norm, nor rms_norm can be threaded, because we need mean over all items, not just of the slices each thread sees.
use sum instead of mean for gradient of scalar scale parameter
use add1(x,y) instead of add(x,repeat(y,x))
use scale(x,y) instead of mul(x,repeat(y,x))
this uses ggml_opt to train a,b for minimal e=sum(sqr(c - a*b)) for random initial a,b,c
ggml_diag constructs diagonal matrices with entries.
ggml_diag(shape[a,1,c,d]) -> shape[a,a,c,d]
…of same type.

with this we can duplicate tensor of any typ as long as they are contiguous.
when more threads are used than elements exist ie1 was less than ie0, resulting in invalid negative byte count argument in memcpy
required for view backward pass

src0 values must be copied to dst, because during addition we don't touch all dst elements in contrast to the normal add function.
@ggerganov
Copy link
Copy Markdown
Member

After fixing the vDSP_vsub argument order, it works now

image

@xaedes
Copy link
Copy Markdown
Collaborator Author

xaedes commented May 11, 2023

I got a batched forward function working. With this greater number of parallel batches can be trained with ease.

This should also be useful to implement beam search sampling.

@ggerganov
Copy link
Copy Markdown
Member

ggerganov commented May 11, 2023

Ha, that is interesting. I somehow thought that to support batched inference some changes in ggml.c would be needed.
Well done!

This should also be useful to implement beam search sampling.

Yes, this is needed for the beam-search decoding in whisper.cpp
Also, another project that can greatly benefit from batched inference is bert.cpp

Edit: now that #1405 has been merged, this PR is now highest priority for merging. Will be look into more details in the following days

Comment thread ggml.c
struct ggml_tensor* F08 = ggml_transpose (ctx, F07);
struct ggml_tensor* F09 = ggml_cont (ctx, F08);
struct ggml_tensor* F10 = ggml_reshape (ctx, F09, src0->grad);

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wow!

Btw, would it make sense to have GGML_OP_REPEAT_BACK that implements this in a kernel?
Maybe if this is some sort of bottleneck. Otherwise, keep it like this.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yea I shuddered myself at this^^ I think one of the conts is not even necessary.
I also think it would make sense to implement this as an extra operation, should not be that difficult.
Most of the reshaping etc is only necessary to use sum_rows which could be done better in a special operation.

github-actions[bot]

This comment was marked as off-topic.

github-actions[bot]

This comment was marked as off-topic.

@ggerganov
Copy link
Copy Markdown
Member

ggerganov commented May 13, 2023

Hmm, what's wrong with the CI?
It was successful before my changes, then I made some minor edits (fix warnings, indentation) and now it is failing.

/home/runner/work/llama.cpp/llama.cpp/ggml.c: In function ‘ggml_compute_forward_add1’:
/home/runner/work/llama.cpp/llama.cpp/ggml.c:7398:14: error: ‘GGML_TYPE_Q4_2’ undeclared (first use in this function); did you mean ‘GGML_TYPE_Q4_1’?
 7398 |         case GGML_TYPE_Q4_2:
      |              ^~~~~~~~~~~~~~
      |              GGML_TYPE_Q4_1

Did it checkout the wrong branch 🤔

Edit: I see now - it is actually smart enough to merge origin/master into the branch

@ggerganov
Copy link
Copy Markdown
Member

ggerganov commented May 13, 2023

@xaedes Is it ok if I merged latest master into your branch? In case you have some pending changes, I can wait for you to do it yourself so I don't mess up your flow

@xaedes
Copy link
Copy Markdown
Collaborator Author

xaedes commented May 13, 2023

Is it ok if I merged latest master into your branch? In case you have some pending changes, I can wait for you to do it yourself so I don't mess up your flow

@ggerganov Sure

Copy link
Copy Markdown
Contributor

@github-actions github-actions Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

clang-tidy made some suggestions

Comment thread ggml.h Outdated
Comment thread ggml.h
Comment thread ggml.h
Comment thread ggml.c
Comment thread ggml.c
Comment thread ggml.c
Comment thread ggml.c
Comment thread ggml.c
Comment thread ggml.c
Comment thread ggml.c
Copy link
Copy Markdown
Member

@ggerganov ggerganov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a pretty big addition to ggml - I hope it will open up some applications for training in the future

  • The tests will be moved to the ggml repo after we merge this PR
  • The batched forward example is very valuable. Have to apply it to whisper.cpp and bert.cpp
  • After merging this, it would be a good time for some refactoring passes in ggml.c to try and reduce code duplication and overall code size

Thanks @xaedes - excellent effort!

@ggerganov ggerganov merged commit f954edd into ggml-org:master May 13, 2023
@xaedes xaedes mentioned this pull request May 30, 2023
@RonanKMcGovern
Copy link
Copy Markdown

@xaedes has there been further progress in this direction since June? In particular, how far are things from being able to train LoRA adapters?

@xaedes
Copy link
Copy Markdown
Collaborator Author

xaedes commented Sep 22, 2023

@RonanKMcGovern
Yep, finetuning LORA adapters on LLAMA models works: #2632

Seunghhon pushed a commit to Seunghhon/llama.cpp that referenced this pull request Apr 26, 2026
…scratch example (ggml-org#1360)

* implement 8 of 14 missing backward pass operations used by llama

- GGML_OP_ADD_AT
- GGML_OP_CPY
- GGML_OP_MUL_MAT (src0.grad)
- GGML_OP_PERMUTE
- GGML_OP_RESHAPE
- GGML_OP_SCALE
- GGML_OP_TRANSPOSE
- GGML_OP_VIEW

implement additional ggml operation GGML_OP_ADD_AT, which is necessary for backward pass of GGML_OP_VIEW.

this operation adds src1 to src0 with data offset, i.e. to view(src0, ..., offset).
the values are return in a tensor size of src0. values outside of [data+offset:data+offset+nbytes(src1)] are just the original values from src0.

still missing backward passes for llama:

- GGML_OP_DIAG_MASK_INF
- GGML_OP_GET_ROWS
- GGML_OP_RMS_NORM
- GGML_OP_ROPE
- GGML_OP_SILU
- GGML_OP_SOFT_MAX

* implement 5 of 6 missing backward pass operations used by llama

- GGML_OP_DIAG_MASK_INF
- GGML_OP_GET_ROWS
- GGML_OP_RMS_NORM
- GGML_OP_SILU
- GGML_OP_SOFT_MAX

add necessary ggml operations GGML_OP_ADD1, GGML_OP_SILU_BACK, GGML_OP_RMS_NORM_BACK, GGML_OP_DIAG_MASK_ZERO, and GGML_OP_ROPE_BACK

GGML_OP_ADD1 is necessary to add a scalar value in the backward pass of GGML_OP_SOFT_MAX
GGML_OP_ADD1 could also be replaced by using GGML_OP_ADD and GGML_OP_REPEAT, but the performance would be worse. additionally GGML_OP_REPEAT will return unexpected value when the the input to GGML_OP_SOFT_MAX contains only a single scalar. in this case GGML_OP_REPEAT will not return the value that should be repeated (src1) but the value which shape the result should take (src0). So in this case it can not replace GGML_OP_ADD1.

GGML_OP_SILU_BACK, GGML_OP_RMS_NORM_BACK and GGML_OP_ROPE_BACK are necessary for backward pass of GGML_OP_SILU, GGML_OP_RMS_NORM and GGML_OP_ROPE. The backward pass for these functions cannot be easily composed of existing operations. Since the backward pass builds a computation graph we need operations forward pass implementations of the the required backward passes. Sounds a bit confusing at first, I know...

GGML_OP_DIAG_MASK_ZERO is necessary for backward pass of GGML_OP_DIAG_MASK_INF.

Some operations where previously inplace-only. for backward pass there needs to be non-inplace variants.
staying consistent with other operations that have non-inplace and inplace variants, the operations are changed to non-inplace and
functions with "_inplace" are added which are inplace.
in llama we need to call the inplace variants so that it is implemented as before.
for llama backward pass we need to use the non-inplace variants.

still not completely implemented backward passes for llama:

- GGML_OP_ROPE: needs forward pass for GGML_OP_ROPE_BACK
- GGML_OP_GET_ROWS: only necessary for tokenizer

* norm & rms_norm can not be threaded:

after investigation rms norm for quite some time I come to the conclusion that neither norm, nor rms_norm can be threaded, because we need mean over all items, not just of the slices each thread sees.

* remove already resolved TODO

* implement backward pass of ggml_rope and ggml_rope_back

* implement backward pass for ggml_get_rows and for new operation ggml_get_rows_back

* add test-grad0.c

* use GGML_PRINT_DEBUG for debug messages which will otherwise flood the console

* test both gradients of mul_mat

* disable graph dot export as it floods console

* bug fixes for silu_back

* successfully test silu backward

* bug fix for scale backward pass

use sum instead of mean for gradient of scalar scale parameter

* successfully test scale backward

* improve performance of sum backward pass

use add1(x,y) instead of add(x,repeat(y,x))

* improve performance of sqr backward pass

use scale(x,y) instead of mul(x,repeat(y,x))

* successfully test rope backward

* bug fix for cpy backward pass

* successfully test cpy backward

* bug fix for reshape backward pass

* successfully test reshape backward

* add test-opt.c

this uses ggml_opt to train a,b for minimal e=sum(sqr(c - a*b)) for random initial a,b,c

* correctly implement softmax backward pass using new operation ggml_diag

ggml_diag constructs diagonal matrices with entries.
ggml_diag(shape[a,1,c,d]) -> shape[a,a,c,d]

* successfully test soft_max backward

* align shape annotations

* add shape annotations for llama

* de-duplicate ggml_forward_dup code taking care of contiguous tensors of same type.

with this we can duplicate tensor of any typ as long as they are contiguous.

* fix ggml_compute_forward_dup_same_cont for when nelements < nthreads

when more threads are used than elements exist ie1 was less than ie0, resulting in invalid negative byte count argument in memcpy

* bug fix for add_at forward

required for view backward pass

src0 values must be copied to dst, because during addition we don't touch all dst elements in contrast to the normal add function.

* successfully test view backward

* minor code format improvement

* fix ggml_forward_add functions to work correctly with transposed tensors

uses the same logic as in ggml_compute_forward_add_q_f32, but make it consistent across all ggml_compute_forward_add_... functions.
this also slightly changes the mem access pattern of the different threads to works as in ggml_compute_forward_add_q_f32.

* fix ggml_forward_add1 functions to work correctly with transposed tensors

uses the same logic as in ggml_compute_forward_add1_q_f32, but make it consistent across all ggml_compute_forward_add1_... functions.
this also slightly changes the mem access pattern of the different threads to works as in ggml_compute_forward_add1_q_f32.

* test-grad0.c : add print_elements to help with debugging

* successfully test permute backward

* some minor test-grad0 fixes

* fix sub, mul and div functions to work correctly with transposed tensors

uses the same logic as in add

* implement ggml_cont backward pass

* successfully test transpose backward and permute for all permutations

also test sub, mul and div up to max n_dims

* test-grad0.c add TODO for view_2d and view_3d

add_at (required for view backward pass) is a bit tricky for n_dims > 1.

* fix comments

* successfully test diag_mask_inf and diag_mask_zero backward

* test-grad0 : fix test for div

nargs and ndims was swapped, corrupting the stack

* fix diag_mask to work with non-inplace input

* move dup call into the actual add_at functions

* fix get rows backward pass

* successfully test get_rows backward

* fix view backward pass

add nb parameters to add_at like in view.
together with offset they define how to view dst and src0 during the add_at operation.

* successfully test backward pass of view_1d, view_2d and view_3d

* fix backward pass for rms_norm

I would have used formulas from other frameworks, but they differed so I could not decide which is correct.
Instead it was derived here in comment using manual forward-backward automatic differention of rms_norm and simplification.

* successfully test backward pass of rms_norm

some tests may fail when gradients are large.
could not find a satisfying configuration to check for abs error and relative error that passes all tests while still actually testing the results with tight enough error bounds.
when looking at the values the "failed" tests look actually ok. for example:

rms_norm: ndims=2, i=0, k=2, x0=0.000153, xm=0.000053, xp=0.000253, f0=0.278594, f1=0.086213, g0=961.905457, g1=966.064941, eps=0.000100, error_abs=4.159485, error_rel=0.004324

it is due to the test logic in check_gradients that they fail.

* add todos for llama backward pass

- implementation for ADD1 backward pass should probably use sum instead of mean (but this backward pass is not required)
- repeat is not yet tested and looks like it only works for single element src0 inputs.

* add operation ggml_sum_rows

ggml_sum_rows(shape[a,b,c,d]) -> shape[1,b,c,d]

* add missing GGML_OP_SUM_ROWS

* fix backward pass for repeat

requires ggml_sum_rows

* successfully test backward pass of repeat

* update quantization types in switch-case of add_at and add1

* add baby-llama example training a very small llama model from scratch to output a sinusoidal wave.

had to increase maximum number of optimization parameters to train from scratch.

* fix softmax in baby-llama example

* switching from training with adam to lbfgs produces much better results in the baby-llama example

* train with two examples, creating new tensors each time..

* fix bug when using ggml_opt to optimize params in one context and use a renewable context for eval and opt

when not keeping gradients of model parameters they are overwritten by tensors created by opt, which may be invalid after opt context is renewed.
so we need to keep the original gradients and make dups for opt

* train on multiple examples, generate & print tokens with trained model afterwards

ctx0 for evaluation and optimization is renewed for each sample

* add ggml_reshape_1d, ggml_reshape_4d and ggml_view_4d

* fix soft_max backward pass for input->ne[1] != 1

* add ggml_log operation necessary for cross entropy loss

* add test for ggml_log gradients

* implement backward pass for ggml_sum_rows, necessary for cross entropy loss

* implement ggml_repeat support for rank > 2 tensors

* add test for ggml_sum_rows gradients

* fix training get_example_targets

predict the next token, not the current token!

* add square_error_loss and cross_entropy_loss functions

* optimize loss over multiple samples

this increases computation graph, need parallel batched forward for more efficiency.

* fix backward pass for add_at and change arguments to have same order as in view

* add ggml_set(ctx, a, b) to set b in view of a and return modified a

necessary to set values into kv_self cache and properly propagate the gradients

* fix kv_self gradients for training

use ggml_set instead of ggml_cpy to set kv_self cache with properly propagating gradients

* replace inplace operations for training with copying operations to allow gradient propagation

* add GGML_ASSERT to catch ggml_rope and back value errors

* add trainable lora-only model with all big matrices C split into A,B with A*B=C

this is not a lora-finetune, but the whole model changed to have only low-rank "lora" matrices.

training this instead of the normal model resulted in much worse results though...

* vastly improve training results

instead of logit targets 0 and 1 use -1 and +1.

* shorten code using a variable

* change name of GGML_OP_ADD_AT to GGML_OP_ACC

* smaller default values for baby llama model parameters

* update static assert of GGML_OP_COUNT

* remove shape annotations in llama_eval_internal

* revert disabling of threading for rms_norm and norm

* rename print functions in baby-llama example

* fix call to ggml_set_name

* add missing include for strcmp, etc

* remove trailing whitespace

* reduce number of test-grad0 iterations

avoid exceeding timeout of automated tests

* remove busy loop that was used as sleep for slower sinus wave generation

* disable slow tests grad0 and opt to avoid exceeding timeouts

* c++ in baby-llama example

use c++ includes instead of c includes
use std::min, std::max instead of MIN, MAX macros

* c++ in baby-llama example

use c++ includes instead of c includes
use std::min, std::max instead of MIN, MAX macros

* ggml : fix compiler warnings + cosmetic changes

* ggml : fix nullptr derefs in GGML_OP_CONT and GGML_OP_RESHAPE back

* swap arguments to vDSP_vdiv call

documentation for vDSP_vdiv states: "Note that B comes before A!"

* swap arguments to vDSP_vdiv call

documentation for vDSP_vdiv states: "Note that B comes before A!"

* ggml : swap vDSP_vsub args as per documentation

* add parallel batched forward function for baby-llama training

* cleanup code for batched training

* remove trailing whitespace

* minor : fix compiler warnings + indentation style

* ggml : fix null ptr deref in backward pass

* ggml : remove Q4_2 remnants

* ggml : fix clang-tidy warnings

* baby-llama : couple of clang-tidy warnings

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
phuongncn pushed a commit to phuongncn/llama.cpp-gx10-dgx-sparks-deepseekv4 that referenced this pull request Apr 28, 2026
…scratch example (ggml-org#1360)

* implement 8 of 14 missing backward pass operations used by llama

- GGML_OP_ADD_AT
- GGML_OP_CPY
- GGML_OP_MUL_MAT (src0.grad)
- GGML_OP_PERMUTE
- GGML_OP_RESHAPE
- GGML_OP_SCALE
- GGML_OP_TRANSPOSE
- GGML_OP_VIEW

implement additional ggml operation GGML_OP_ADD_AT, which is necessary for backward pass of GGML_OP_VIEW.

this operation adds src1 to src0 with data offset, i.e. to view(src0, ..., offset).
the values are return in a tensor size of src0. values outside of [data+offset:data+offset+nbytes(src1)] are just the original values from src0.

still missing backward passes for llama:

- GGML_OP_DIAG_MASK_INF
- GGML_OP_GET_ROWS
- GGML_OP_RMS_NORM
- GGML_OP_ROPE
- GGML_OP_SILU
- GGML_OP_SOFT_MAX

* implement 5 of 6 missing backward pass operations used by llama

- GGML_OP_DIAG_MASK_INF
- GGML_OP_GET_ROWS
- GGML_OP_RMS_NORM
- GGML_OP_SILU
- GGML_OP_SOFT_MAX

add necessary ggml operations GGML_OP_ADD1, GGML_OP_SILU_BACK, GGML_OP_RMS_NORM_BACK, GGML_OP_DIAG_MASK_ZERO, and GGML_OP_ROPE_BACK

GGML_OP_ADD1 is necessary to add a scalar value in the backward pass of GGML_OP_SOFT_MAX
GGML_OP_ADD1 could also be replaced by using GGML_OP_ADD and GGML_OP_REPEAT, but the performance would be worse. additionally GGML_OP_REPEAT will return unexpected value when the the input to GGML_OP_SOFT_MAX contains only a single scalar. in this case GGML_OP_REPEAT will not return the value that should be repeated (src1) but the value which shape the result should take (src0). So in this case it can not replace GGML_OP_ADD1.

GGML_OP_SILU_BACK, GGML_OP_RMS_NORM_BACK and GGML_OP_ROPE_BACK are necessary for backward pass of GGML_OP_SILU, GGML_OP_RMS_NORM and GGML_OP_ROPE. The backward pass for these functions cannot be easily composed of existing operations. Since the backward pass builds a computation graph we need operations forward pass implementations of the the required backward passes. Sounds a bit confusing at first, I know...

GGML_OP_DIAG_MASK_ZERO is necessary for backward pass of GGML_OP_DIAG_MASK_INF.

Some operations where previously inplace-only. for backward pass there needs to be non-inplace variants.
staying consistent with other operations that have non-inplace and inplace variants, the operations are changed to non-inplace and
functions with "_inplace" are added which are inplace.
in llama we need to call the inplace variants so that it is implemented as before.
for llama backward pass we need to use the non-inplace variants.

still not completely implemented backward passes for llama:

- GGML_OP_ROPE: needs forward pass for GGML_OP_ROPE_BACK
- GGML_OP_GET_ROWS: only necessary for tokenizer

* norm & rms_norm can not be threaded:

after investigation rms norm for quite some time I come to the conclusion that neither norm, nor rms_norm can be threaded, because we need mean over all items, not just of the slices each thread sees.

* remove already resolved TODO

* implement backward pass of ggml_rope and ggml_rope_back

* implement backward pass for ggml_get_rows and for new operation ggml_get_rows_back

* add test-grad0.c

* use GGML_PRINT_DEBUG for debug messages which will otherwise flood the console

* test both gradients of mul_mat

* disable graph dot export as it floods console

* bug fixes for silu_back

* successfully test silu backward

* bug fix for scale backward pass

use sum instead of mean for gradient of scalar scale parameter

* successfully test scale backward

* improve performance of sum backward pass

use add1(x,y) instead of add(x,repeat(y,x))

* improve performance of sqr backward pass

use scale(x,y) instead of mul(x,repeat(y,x))

* successfully test rope backward

* bug fix for cpy backward pass

* successfully test cpy backward

* bug fix for reshape backward pass

* successfully test reshape backward

* add test-opt.c

this uses ggml_opt to train a,b for minimal e=sum(sqr(c - a*b)) for random initial a,b,c

* correctly implement softmax backward pass using new operation ggml_diag

ggml_diag constructs diagonal matrices with entries.
ggml_diag(shape[a,1,c,d]) -> shape[a,a,c,d]

* successfully test soft_max backward

* align shape annotations

* add shape annotations for llama

* de-duplicate ggml_forward_dup code taking care of contiguous tensors of same type.

with this we can duplicate tensor of any typ as long as they are contiguous.

* fix ggml_compute_forward_dup_same_cont for when nelements < nthreads

when more threads are used than elements exist ie1 was less than ie0, resulting in invalid negative byte count argument in memcpy

* bug fix for add_at forward

required for view backward pass

src0 values must be copied to dst, because during addition we don't touch all dst elements in contrast to the normal add function.

* successfully test view backward

* minor code format improvement

* fix ggml_forward_add functions to work correctly with transposed tensors

uses the same logic as in ggml_compute_forward_add_q_f32, but make it consistent across all ggml_compute_forward_add_... functions.
this also slightly changes the mem access pattern of the different threads to works as in ggml_compute_forward_add_q_f32.

* fix ggml_forward_add1 functions to work correctly with transposed tensors

uses the same logic as in ggml_compute_forward_add1_q_f32, but make it consistent across all ggml_compute_forward_add1_... functions.
this also slightly changes the mem access pattern of the different threads to works as in ggml_compute_forward_add1_q_f32.

* test-grad0.c : add print_elements to help with debugging

* successfully test permute backward

* some minor test-grad0 fixes

* fix sub, mul and div functions to work correctly with transposed tensors

uses the same logic as in add

* implement ggml_cont backward pass

* successfully test transpose backward and permute for all permutations

also test sub, mul and div up to max n_dims

* test-grad0.c add TODO for view_2d and view_3d

add_at (required for view backward pass) is a bit tricky for n_dims > 1.

* fix comments

* successfully test diag_mask_inf and diag_mask_zero backward

* test-grad0 : fix test for div

nargs and ndims was swapped, corrupting the stack

* fix diag_mask to work with non-inplace input

* move dup call into the actual add_at functions

* fix get rows backward pass

* successfully test get_rows backward

* fix view backward pass

add nb parameters to add_at like in view.
together with offset they define how to view dst and src0 during the add_at operation.

* successfully test backward pass of view_1d, view_2d and view_3d

* fix backward pass for rms_norm

I would have used formulas from other frameworks, but they differed so I could not decide which is correct.
Instead it was derived here in comment using manual forward-backward automatic differention of rms_norm and simplification.

* successfully test backward pass of rms_norm

some tests may fail when gradients are large.
could not find a satisfying configuration to check for abs error and relative error that passes all tests while still actually testing the results with tight enough error bounds.
when looking at the values the "failed" tests look actually ok. for example:

rms_norm: ndims=2, i=0, k=2, x0=0.000153, xm=0.000053, xp=0.000253, f0=0.278594, f1=0.086213, g0=961.905457, g1=966.064941, eps=0.000100, error_abs=4.159485, error_rel=0.004324

it is due to the test logic in check_gradients that they fail.

* add todos for llama backward pass

- implementation for ADD1 backward pass should probably use sum instead of mean (but this backward pass is not required)
- repeat is not yet tested and looks like it only works for single element src0 inputs.

* add operation ggml_sum_rows

ggml_sum_rows(shape[a,b,c,d]) -> shape[1,b,c,d]

* add missing GGML_OP_SUM_ROWS

* fix backward pass for repeat

requires ggml_sum_rows

* successfully test backward pass of repeat

* update quantization types in switch-case of add_at and add1

* add baby-llama example training a very small llama model from scratch to output a sinusoidal wave.

had to increase maximum number of optimization parameters to train from scratch.

* fix softmax in baby-llama example

* switching from training with adam to lbfgs produces much better results in the baby-llama example

* train with two examples, creating new tensors each time..

* fix bug when using ggml_opt to optimize params in one context and use a renewable context for eval and opt

when not keeping gradients of model parameters they are overwritten by tensors created by opt, which may be invalid after opt context is renewed.
so we need to keep the original gradients and make dups for opt

* train on multiple examples, generate & print tokens with trained model afterwards

ctx0 for evaluation and optimization is renewed for each sample

* add ggml_reshape_1d, ggml_reshape_4d and ggml_view_4d

* fix soft_max backward pass for input->ne[1] != 1

* add ggml_log operation necessary for cross entropy loss

* add test for ggml_log gradients

* implement backward pass for ggml_sum_rows, necessary for cross entropy loss

* implement ggml_repeat support for rank > 2 tensors

* add test for ggml_sum_rows gradients

* fix training get_example_targets

predict the next token, not the current token!

* add square_error_loss and cross_entropy_loss functions

* optimize loss over multiple samples

this increases computation graph, need parallel batched forward for more efficiency.

* fix backward pass for add_at and change arguments to have same order as in view

* add ggml_set(ctx, a, b) to set b in view of a and return modified a

necessary to set values into kv_self cache and properly propagate the gradients

* fix kv_self gradients for training

use ggml_set instead of ggml_cpy to set kv_self cache with properly propagating gradients

* replace inplace operations for training with copying operations to allow gradient propagation

* add GGML_ASSERT to catch ggml_rope and back value errors

* add trainable lora-only model with all big matrices C split into A,B with A*B=C

this is not a lora-finetune, but the whole model changed to have only low-rank "lora" matrices.

training this instead of the normal model resulted in much worse results though...

* vastly improve training results

instead of logit targets 0 and 1 use -1 and +1.

* shorten code using a variable

* change name of GGML_OP_ADD_AT to GGML_OP_ACC

* smaller default values for baby llama model parameters

* update static assert of GGML_OP_COUNT

* remove shape annotations in llama_eval_internal

* revert disabling of threading for rms_norm and norm

* rename print functions in baby-llama example

* fix call to ggml_set_name

* add missing include for strcmp, etc

* remove trailing whitespace

* reduce number of test-grad0 iterations

avoid exceeding timeout of automated tests

* remove busy loop that was used as sleep for slower sinus wave generation

* disable slow tests grad0 and opt to avoid exceeding timeouts

* c++ in baby-llama example

use c++ includes instead of c includes
use std::min, std::max instead of MIN, MAX macros

* c++ in baby-llama example

use c++ includes instead of c includes
use std::min, std::max instead of MIN, MAX macros

* ggml : fix compiler warnings + cosmetic changes

* ggml : fix nullptr derefs in GGML_OP_CONT and GGML_OP_RESHAPE back

* swap arguments to vDSP_vdiv call

documentation for vDSP_vdiv states: "Note that B comes before A!"

* swap arguments to vDSP_vdiv call

documentation for vDSP_vdiv states: "Note that B comes before A!"

* ggml : swap vDSP_vsub args as per documentation

* add parallel batched forward function for baby-llama training

* cleanup code for batched training

* remove trailing whitespace

* minor : fix compiler warnings + indentation style

* ggml : fix null ptr deref in backward pass

* ggml : remove Q4_2 remnants

* ggml : fix clang-tidy warnings

* baby-llama : couple of clang-tidy warnings

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

high priority Very important issue

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants