
Implement basic chat/completions openai endpoint #461

Merged
LostRuins merged 8 commits into LostRuins:concedo_experimental from teddybear082:experimental_openai_chat_completions_api on Oct 5, 2023

Conversation

teddybear082 commented Oct 3, 2023

- Basic support for the OpenAI chat/completions endpoint documented at: https://platform.openai.com/docs/api-reference/chat/create

- Tested with OpenAI's example code for chat/completions, both with and without the stream=True parameter, found here: https://cookbook.openai.com/examples/how_to_stream_completions.

- Tested with Mantella, the Skyrim mod that turns all the NPCs into AI-chattable characters, which uses OpenAI's acreate / async completions method: https://github.com/art-from-the-machine/Mantella/blob/main/src/output_manager.py

- Tested default koboldcpp API behavior with the streaming and non-streaming generate endpoints and with the GUI running, and everything seems to be fine.

- Still TODO / evaluate before merging:

(1) implement the rest of the OpenAI chat/completion parameters to the extent possible, mapping them to koboldcpp parameters

(2) determine if there is a way to use kobold's prompt formats for certain models when translating the OpenAI messages format into a prompt string (not sure if this is possible or where these formats live in the code)

(3) have chat/completions responses include the actual local model the user is running instead of just "koboldcpp" (not sure if this is possible)

Note: I am a Python noob, so if there is a more elegant way of doing this, at minimum hopefully I have done some of the grunt work for you to implement on your own.
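TODO item (2) concerns turning the OpenAI messages array into one prompt string. As a rough illustration of the fallback approach discussed in this thread (flattening messages into the Alpaca-style "### Instruction / ### Response" format), here is a minimal sketch; the function name and exact template strings are my assumptions, not the PR's actual code:

```python
# Hypothetical sketch: flatten OpenAI chat messages into a single
# Alpaca-format prompt string. Template strings are illustrative.

def messages_to_prompt(messages):
    """Flatten an OpenAI-style messages list into one Alpaca-format prompt."""
    parts = []
    for msg in messages:
        role = msg.get("role", "user")
        content = msg.get("content", "")
        if role == "system":
            parts.append(content)
        elif role == "user":
            parts.append("### Instruction:\n" + content)
        else:  # "assistant"
            parts.append("### Response:\n" + content)
    # End with a Response header so the model continues as the assistant.
    parts.append("### Response:\n")
    return "\n".join(parts)

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Name one moon of Mars."},
]
print(messages_to_prompt(messages))
```

A real server would also have to handle multi-turn histories and missing roles, but the shape of the translation is the same.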
- Mistakenly left code relating to the streaming argument from the main branch in experimental.

- Support the stop parameter, mapped to the koboldai stop_sequence parameter.

- Make the default max_length / max_tokens parameter consistent with the default 80-token length in the generate function.

- Add support for providing the name of the local model in OpenAI responses.
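The commits above describe a parameter mapping from the OpenAI request onto koboldcpp's generate parameters. A minimal sketch of what that mapping could look like (function and key names are assumptions, not the PR's exact code; the 80-token default follows the commit message above):

```python
# Hypothetical sketch: map OpenAI chat/completions parameters onto
# koboldcpp generate parameters, per the commits described above.

DEFAULT_MAX_LENGTH = 80  # same default as the generate endpoint

def openai_to_kobold(params):
    """Translate a dict of OpenAI request fields into koboldcpp fields."""
    kobold = {
        "max_length": params.get("max_tokens", DEFAULT_MAX_LENGTH),
        "temperature": params.get("temperature", 0.7),
    }
    # OpenAI's "stop" may be a single string or a list of strings;
    # koboldcpp's stop_sequence expects a list.
    stop = params.get("stop")
    if stop is not None:
        kobold["stop_sequence"] = [stop] if isinstance(stop, str) else list(stop)
    return kobold
```

The string-or-list handling of `stop` matters because OpenAI clients send both forms, as the testing later in this thread confirms.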
LostRuins (Owner) commented:

Does Mantella require using the chat completions API? I never liked that API since it forces everything into a specific instruct format. Will it work with the standard completions API (https://platform.openai.com/docs/api-reference/completions/create#completions/create-stream) instead?

teddybear082 (Author) commented Oct 4, 2023

Yeah, Mantella requires using the chat completions API. This seems to be becoming the standard overall, because the completions endpoint is marked as "legacy" and chat completions is the way of steering the newer models like the 3.5-turbo variants and GPT-4, which Mantella supports. Herika (the other Skyrim AI mod, for followers) also uses the chat completions API. Is there currently any way to "find" in the koboldcpp code which template is assigned to a model? I was going to look at that today but haven't gotten the chance yet. I think that's the last thing I'd like to get into my PR if possible, instead of using the ### Instruction / ### Response hack to translate the OpenAI messages format into a prompt. The hack actually seems to work fine for me so far with my test scripts [which I accidentally submitted in one of the changes above and then reverted, if you want to use them :) ] as well as with Mantella, using llama2-gguf and synthia-gguf 7B models, but I imagine it would break some models.

EDIT: I did look at the code and I have no idea how you make kobold work so well with so many models, and the scenarios are really fun too. I don't see prompt templates for different models (like llama2 vs. llama vs. pygmalion) loaded anywhere in koboldcpp.py. Anyway, amazing work, this is a monster project for sure!

LostRuins (Owner) commented:

Yeah the code is quite cryptic haha. You can see the instruct prompt templates here.
https://github.com/LostRuins/lite.koboldai.net/blob/main/index.html#L6207
Alpaca will do fine.

teddybear082 (Author) commented Oct 4, 2023

Thanks! And thanks btw for working through this with me and spending time reviewing, can't imagine how much time you spend on this project for everyone's benefit given all the upstream changes alone.

Is there somewhere in the koboldcpp code where this information from the lite.koboldai.net code is passed through and saved? Or does the front end basically reformat the user's free-text prompt and send the whole formatted prompt to koboldcpp.py, so that koboldcpp.py is not typically involved in the formatting at all?

Basically, I am looking to pull the applicable template based on the model (at around line 417 or line 1751 of koboldcpp.py), store "openai_system_prefix", "openai_user_prefix", and "openai_assistant_prefix" based on the model, and apply them accordingly when parsing the messages in OpenAI format when the user is using the chat/completions API.

Failing that, if koboldcpp.py does not have access to the templates in a variable, I might be able to do a simple approximation based on, say, the three most popular model types by parsing local_model_name, and if no match is found, default to Alpaca (if that is acceptable to you), based on the template code you shared from lite.koboldai.net.

EDIT: OHHH neat, I see now: what I defaulted to (I think?) is basically the ### Instruction: / ### Response: default format for Alpaca. Great. I'll take a look and make sure it completely conforms to that standard (like making it not all caps) if I wind up going with that instead of making model-specific ones; that explains why it "just worked" with my hack, I guess.
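The fallback idea described above (guess the template from the model's filename, default to Alpaca) could be sketched roughly like this; the template strings and matching keys are my assumptions for illustration, not koboldcpp's actual internals:

```python
# Hypothetical sketch: pick an instruct template by inspecting the local
# model's filename, defaulting to Alpaca when nothing matches.

TEMPLATES = {
    "llama2": {"user": "[INST] ", "assistant": " [/INST]"},
    "vicuna": {"user": "USER: ", "assistant": "\nASSISTANT: "},
    "alpaca": {"user": "\n### Instruction:\n", "assistant": "\n### Response:\n"},
}

def pick_template(local_model_name):
    """Return a template dict based on substrings of the model filename."""
    name = local_model_name.lower()
    for key in ("llama2", "vicuna"):
        if key in name:
            return TEMPLATES[key]
    return TEMPLATES["alpaca"]  # default, per the discussion above

print(pick_template("synthia-7b.gguf"))  # no match, so Alpaca is used
```

As the thread concludes, this guessing game is fragile, which is part of why sticking with Alpaca as the single format was the simpler call.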

- Conform with the Alpaca standard used as the default in lite.koboldai.net.
LostRuins added the "enhancement" (new feature or request) label on Oct 5, 2023
LostRuins (Owner) left a comment:

Hi @teddybear082, I've cleaned up the code a little. Can you test that everything is working for you, including both the streaming and non-streaming versions?

Regarding the instruct tag formats, I think we can just stick to the Alpaca format. Since this information is not stored in the model file, and the official OAI endpoint does not support setting it, we would need a custom endpoint to set it, which I feel is currently unnecessary.

teddybear082 (Author) commented:

Yes thank you I will take a look! Much appreciated!

teddybear082 (Author) commented:

These changes worked on all my tests (chat completion, streaming chat completion, chat completion with stop parameter as a string, streaming chat completion with stop parameters as a list, and in mantella). THANK YOU!!!! I sent the new code over to the mantella and herika discord servers to let a few other people test today; assuming no one reports problems, if you're comfortable with it I think this can be merged whenever you see fit.
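For reference, the streaming tests above exercise OpenAI's server-sent-events format: each event is a JSON "chat.completion.chunk" carrying a delta, and the stream ends with "data: [DONE]". A sketch of one such chunk (field values are illustrative, not taken from the PR's code):

```python
# Sketch of the SSE chunk shape an OpenAI-compatible streaming endpoint
# emits. Each token goes out as one "data: {...}" event.

import json
import time

def sse_chunk(token, model="koboldcpp"):
    """Build one server-sent event carrying a single streamed token."""
    payload = {
        "id": "chatcmpl-0",
        "object": "chat.completion.chunk",
        "created": int(time.time()),
        "model": model,
        "choices": [
            {"index": 0, "delta": {"content": token}, "finish_reason": None}
        ],
    }
    return "data: " + json.dumps(payload) + "\n\n"
```

Clients like the OpenAI cookbook example concatenate the `delta.content` fields of successive chunks to reconstruct the reply.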

LostRuins (Owner) left a comment:

lgtm

LostRuins merged commit f9f4cdf into LostRuins:concedo_experimental on Oct 5, 2023
lofcz commented Oct 8, 2023

Thanks for the initiative! Currently, I see a problem with the implementation; I tried it with Mistral 7B (fine-tuned on OpenOrca). See this line: https://github.com/LostRuins/koboldcpp/pull/461/files#diff-885e6237f0dc0cc77c7b4a47ef801248f4d2e6a7743b37b85a451c3ac446cbd2R414

We are setting the system message templates in a hardcoded manner; the model I tried expects [INST] msg [/INST] for a user prompt and <s>[INST] msg [/INST] for a system prompt. KoboldCpp allows setting the message template, but this is currently not possible through the OpenAI-like API.

Can we extend the API to accept an optional adapter object at the root of the /v1/chat/completions request that would contain KoboldCpp-specific settings? I can offer to implement this, but I'd like to gather feedback on whether this is an agreeable approach.

{
    "temperature": 0.5,
    "messages": [
        {
            "role": "system",
            "content": "you are a dungeons and dragons dungeon master"
        },
        ...
    ], 
+   "adapter": {
+        "templates": {
+              "system": { "start": "", "end": ""},
+              "user": { "start": "", "end": "" }
+          }
+   }
}

Introducing an optional adapter object still has major advantages over using the native KoboldCpp's API for libraries written primarily for OpenAI as we can introduce just one optional field that the user can fill as they see fit.
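Server-side, applying such an adapter could look roughly like the sketch below: wrap each message in the caller-supplied start/end strings, falling back to a built-in default when no adapter is sent. The defaults and structure follow the proposal above; the function and the Mistral template values are illustrative assumptions, not a merged implementation:

```python
# Hypothetical sketch: apply per-role start/end templates from the proposed
# "adapter" object when building the prompt string.

DEFAULTS = {
    "system": {"start": "", "end": "\n"},
    "user": {"start": "\n### Instruction:\n", "end": ""},
}

def apply_adapter(messages, adapter=None):
    """Join messages into one prompt using adapter templates, if provided."""
    templates = (adapter or {}).get("templates", DEFAULTS)
    out = []
    for msg in messages:
        tpl = templates.get(msg["role"], DEFAULTS["user"])
        out.append(tpl.get("start", "") + msg["content"] + tpl.get("end", ""))
    return "".join(out)

# Example adapter for a Mistral-style instruct model, per the comment above.
mistral = {"templates": {
    "system": {"start": "<s>[INST] ", "end": " [/INST]"},
    "user": {"start": "[INST] ", "end": " [/INST]"},
}}
```

Because the adapter is optional, plain OpenAI clients keep working unchanged, while callers who know their model's template can override it per request.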

Implemented in #466

