Implement basic chat/completions openai endpoint #461
LostRuins merged 8 commits into LostRuins:concedo_experimental from teddybear082:experimental_openai_chat_completions_api
Conversation
-Basic support for openai chat/completions endpoint documented at: https://platform.openai.com/docs/api-reference/chat/create
-Tested with example code from openai for chat/completions and chat/completions with stream=True parameter found here: https://cookbook.openai.com/examples/how_to_stream_completions.
-Tested with Mantella, the Skyrim mod that turns all the NPCs into AI-chattable characters, which uses openai's acreate / async completions method: https://github.com/art-from-the-machine/Mantella/blob/main/src/output_manager.py
-Tested default koboldcpp api behavior with streaming and non-streaming generate endpoints and running the GUI; everything seems to be fine.
-Still TODO / evaluate before merging:
(1) implement the rest of the openai chat/completion parameters to the extent possible, mapping them to koboldcpp parameters
(2) determine if there is a way to use kobold's prompt formats for certain models when translating the openai messages format into a prompt string (not sure if this is possible or where these formats are in the code; a rough sketch of the basic translation appears below)
(3) have chat/completions responses include the actual local model the user is using instead of just koboldcpp (not sure if this is possible)
Note I am a python noob, so if there is a more elegant way of doing this, at minimum hopefully I have done some of the grunt work for you to implement on your own.
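For context, here is a minimal sketch of the translation step described in item (2): flattening an OpenAI-style `messages` array into a single Alpaca-style prompt string. The function name and the exact `### Instruction:` / `### Response:` markers are illustrative assumptions, not the code merged in this PR.

```python
def messages_to_prompt(messages):
    """Flatten an OpenAI chat 'messages' list into one Alpaca-style prompt."""
    parts = []
    for msg in messages:
        role = msg.get("role", "user")
        content = msg.get("content", "")
        if role == "system":
            # System messages become plain preamble text.
            parts.append(content)
        elif role == "assistant":
            parts.append("### Response:\n" + content)
        else:
            parts.append("### Instruction:\n" + content)
    # End with an open response marker so the model continues as the assistant.
    parts.append("### Response:\n")
    return "\n".join(parts)

print(messages_to_prompt([
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Say hello."},
]))
```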
-Mistakenly left code relating to streaming argument from main branch in experimental.
-support stop parameter mapped to koboldai stop_sequence parameter -make default max_length / max_tokens parameter consistent with default 80 token length in generate function -add support for providing name of local model in openai responses
This reverts commit 443a6f7.
-support stop parameter mapped to koboldai stop_sequence parameter -make default max_length / max_tokens parameter consistent with default 80 token length in generate function -add support for providing name of local model in openai responses
Does Mantella require using the chat completions api? I never liked that API since it forces everything into a specific instruct format. Will it work with the standard completions (https://platform.openai.com/docs/api-reference/completions/create#completions/create-stream) api instead?
Yeah, Mantella requires using the chat completions api. This seems to be becoming the standard overall, because the completions endpoint is marked as "legacy" and chat completions is also the way of steering the newer models like the 3.5-turbo variants and GPT-4 which are supported by Mantella. Herika (the other Skyrim AI mod, for followers) also uses the chat completions API. Is there currently any way to "find" in the koboldcpp code what template is assigned to a model? I was going to look at that today but had not gotten a chance yet. I think that's the last thing I'd like to get into my PR if possible, instead of using the ###Instruction / ###Response hack to translate the openai messages format into a prompt. The hack actually seems to work fine for me so far with my test scripts [which I had accidentally submitted in one of the changes above and then reverted, if you want to use them :) ] as well as with Mantella on llama2-gguf and synthia-gguf 7b models, but I imagine it would break some models.
EDIT: I did look at the code, and I have no idea how you make kobold work so well with so many models; the scenarios are really fun too. I don't see anywhere in koboldcpp.py where prompt templates for different models like llama2 vs. llama vs. pygmalion are loaded. Anyway, amazing work, this is a monster project for sure!
Yeah the code is quite cryptic haha. You can see the instruct prompt templates here.
Thanks! And thanks btw for working through this with me and spending time reviewing; I can't imagine how much time you spend on this project for everyone's benefit, given all the upstream changes alone. Is there somewhere in the koboldcpp code where this information from the lite.koboldai.net code is passed through and saved? Or does the front end basically reformat the user's simple free-text prompt and then send the whole formatted prompt to koboldcpp.py, so that koboldcpp.py is not typically involved in the formatting at all?
Basically, at around line 417 or around line 1751 of koboldcpp.py, I am looking to pull the applicable template based on the model, store "openai_system_prefix", "openai_user_prefix", and "openai_assistant_prefix" based on the model, and apply them accordingly when parsing the messages in openai format when the user is using the chat/completions API. Failing that, if koboldcpp.py does not have access to the templates in a variable or anything, I might be able to do a simple approximation based on, say, the three most popular model types by parsing local_model_name, and if no match is found, default to alpaca (if that is acceptable to you), based on the code you shared about templates from lite.koboldai.net. A rough sketch of what I have in mind is below.
EDIT: OHHH neat, I see now: what I defaulted to (I think?) is basically the ###INSTRUCTION: / ###RESPONSE: default format for alpaca. Great. I'll take a look and make sure it completely conforms to that standard if I wind up having to just go with that and not make model-specific ones, like making it not all caps; that explains why it "just worked", I guess, with my hack.
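A rough sketch of the per-model lookup described above. The model-name substrings, the prefix strings, and names like `openai_user_prefix` are illustrative assumptions for this discussion, not code from koboldcpp.py:

```python
# Hypothetical per-model instruct templates, keyed by a substring of the
# local model's filename; falls back to Alpaca when nothing matches.
TEMPLATES = {
    "llama2": {"system": "", "user": "[INST] ", "assistant": " [/INST] "},
    "alpaca": {"system": "", "user": "\n### Instruction:\n", "assistant": "\n### Response:\n"},
}

def pick_template(local_model_name):
    name = local_model_name.lower()
    for key, template in TEMPLATES.items():
        if key in name:
            return template
    # Default to Alpaca, as lite.koboldai.net does.
    return TEMPLATES["alpaca"]

template = pick_template("synthia-7b.q4_K_M.gguf")
openai_system_prefix = template["system"]
openai_user_prefix = template["user"]
openai_assistant_prefix = template["assistant"]
```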
to conform with alpaca standard used as default in lite.koboldai.net
Hi @teddybear082, I've cleaned up the code a little; can you test that everything is working for you, including both streaming and non-streaming versions?
Regarding the instruct tag formats, I think we can just stick to Alpaca format. Since this information is not stored in the model file, nor does the official OAI endpoint support setting it, we would need a custom endpoint to set it, something I feel is currently unnecessary.
Yes, thank you, I will take a look! Much appreciated!
These changes worked in all my tests (chat completion, streaming chat completion, chat completion with the stop parameter as a string, streaming chat completion with stop parameters as a list, and in Mantella). THANK YOU!!!! I sent the new code over to the Mantella and Herika discord servers to let a few other people test today; assuming no one reports problems, and if you're comfortable with it, I think this can be merged whenever you see fit.
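Since OpenAI's `stop` parameter may be omitted, a single string, or a list of strings, while koboldcpp's generate endpoint takes a list under `stop_sequence`, the mapping exercised by the tests above presumably normalizes it along these lines (a hedged sketch, not the exact merged code):

```python
def to_stop_sequence(stop):
    # OpenAI accepts "stop" as None, one string, or a list of strings;
    # koboldcpp wants a list for its "stop_sequence" parameter.
    if stop is None:
        return []
    if isinstance(stop, str):
        return [stop]
    return list(stop)

assert to_stop_sequence("###") == ["###"]
assert to_stop_sequence(["\n", "User:"]) == ["\n", "User:"]
```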
Thanks for the initiative! Currently, I see a problem with the implementation - I tried it with Mistral7b (fine-tuned on openorca). See this line: https://github.com/LostRuins/koboldcpp/pull/461/files#diff-885e6237f0dc0cc77c7b4a47ef801248f4d2e6a7743b37b85a451c3ac446cbd2R414
We are setting the system message templates in a hardcoded manner, but the model I tried expects a different format. Can we extend the API to accept an optional object?

```
{
  "temperature": 0.5,
  "messages": [
    {
      "role": "system",
      "content": "you are a dungeons and dragons dungeon master"
    },
    ...
  ],
+ "adapter": {
+   "templates": {
+     "system": { "start": "", "end": "" },
+     "user": { "start": "", "end": "" }
+   }
+ }
}
```

Introducing an optional `adapter` object would leave existing clients unaffected while letting others override the hardcoded templates.
Implemented in #466
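If such an `adapter` field were accepted, applying it could look like the following sketch. The field names mirror the proposal above; the fallback markers and the ChatML-style example values are assumptions for illustration:

```python
def wrap(role, content, adapter=None):
    # Hardcoded Alpaca-style defaults, overridable per role via the
    # optional "adapter" payload proposed above.
    defaults = {
        "system": {"start": "", "end": "\n"},
        "user": {"start": "\n### Instruction:\n", "end": ""},
        "assistant": {"start": "\n### Response:\n", "end": ""},
    }
    templates = (adapter or {}).get("templates", {})
    t = {**defaults.get(role, {"start": "", "end": ""}), **templates.get(role, {})}
    return t["start"] + content + t["end"]

# With an adapter supplying ChatML-style markers (purely illustrative):
adapter = {"templates": {"system": {"start": "<|im_start|>system\n", "end": "<|im_end|>\n"}}}
print(wrap("system", "you are a dungeons and dragons dungeon master", adapter))
```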
-Basic support for openai chat/completions endpoint documented at: https://platform.openai.com/docs/api-reference/chat/create
-Tested with example code from openai for chat/completions and chat/completions with stream=True parameter found here: https://cookbook.openai.com/examples/how_to_stream_completions.
-Tested with Mantella, the Skyrim mod that turns all the NPCs into AI-chattable characters, which uses openai's acreate / async completions method: https://github.com/art-from-the-machine/Mantella/blob/main/src/output_manager.py
-Tested default koboldcpp api behavior with streaming and non-streaming generate endpoints and running the GUI; everything seems to be fine.
-Still TODO / evaluate before merging:
(1) determine if there is a way to use kobold's prompt formats for certain models when translating openai messages format into a prompt string. (Not sure if possible or where these are in the code)
(2) remove print statements throughout new code used for debug / evaluation purposes
Note I am a python noob, so if there is a more elegant way of doing this, at minimum hopefully I have done some of the grunt work for you to implement on your own.