* code beautification
* code beautification, move functions together
* make --device fast the default (#515)
* make --device fast the default
* Update iOS.md (#517)
* Update iOS.md
* Update iOS.md
* Pip to pip3 (#504)
* remove macos-12 test
* pip to pip3
* break aoti CI jobs separately (#500)
* init
* fixes
* more fixes
* fixes
* fix
* fix
* bug fix
* add objcopy update
* suppress int8
* undefined variable
---------
Co-authored-by: Michael Gschwind <mikekg@meta.com>
* Support llama3 in chat in run.cpp (#486)
* refactor chat runner in preparation for llama3
* add sketch for llama3 prompt template and move to returning tokens
* fix tiktoken
* fixes to chat
* add default llama_ver
* Add tests for quantize json, add cuda device specification and precision to cuda.json (#519)
* remove code for no KV Cache path (#527)
* Update ADVANCED-USERS.md (#529)
Update Advanced Users description to reflect changes in the repo since the description was initially created.
* runner-aoti on cuda (#531)
* runner-aoti on cuda
* transfer results back to CPU
* transfer results back to CPU
* runner-aoti on cuda
* Update runner_build.md (#530)
Update description of runner and build process in runner_build.md
* clean up runner code a little (#532)
* clean up runner code a little
* update
* update
* pull out generate loop in chat
* updates
* edit docs
* typo
* move int8 linear class and function into qops.py (#534)
* add dtype tests for runner-aoti + runner-et (#539)
* add dtype tests for runner-aoti + runner-et
* typo
* Quantized embedding (#536)
* move int8 linear class and function into qops.py
* move Quantized Embedding to qops.py
* Move Linear int4 to qops (#537)
* move int8 linear class and function into qops.py
* move Quantized Embedding to qops.py
* move int4 linear to qops
* Revert "add dtype tests for runner-aoti + runner-et (#539)" (#548)
This reverts commit a7a24577a65be67ac9ae4dc05452f35d9c49e5d1.
* fix generate for llama3 (#538)
* fix generate for llama3
* switch more things to C
* remove C++ header
* add delegation visualization instructions (#551)
* Add dtype runner aoti (#552)
* add dtype tests for runner-aoti + runner-et
* typo
* add dtype test runner-aoti
* test sdpa with fp16 (#553)
* test sdpa with fp16
* kv cache fp32
* typo
* update (#560)
* Only support newest versions of lm-eval (#556)
Summary:
remove support for lm-eval 0.3 to reduce the options we have
Test Plan:
CI
* split cpu eval CI by dtype (#554)
* split cpu eval CI by dtype
* fix
* differentiate names with checks
* keep one name the same as old
* fix
* Removing duplicate HF issue message from README (#559)
Co-authored-by: Michael Gschwind <61328285+mikekgfb@users.noreply.github.com>
* doc updates (#567)
* Add VM-safe MPS check
---------
Co-authored-by: Anthony Shoumikhin <anthony@shoumikh.in>
Co-authored-by: metascroy <161522778+metascroy@users.noreply.github.com>
Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
Co-authored-by: lucylq <lfq@meta.com>
Co-authored-by: Jerry Zhang <jerryzh168@gmail.com>
Co-authored-by: Jack-Khuu <jack.khuu.7@gmail.com>
* add unpacking support (#525)
* add unpacking support
* fix typos and linter
* perform parallel prefill when possible (#568)
* perform parallel prefill when possible
* typo
* disable hack
* remove print
* remove debug messages which prevent export
* fixes
* stream results in generate.py (#571)
* remove logging interfering with export
Differential Revision: D49843232