server : PoC implementation of "interim" server #13400
ngxson wants to merge 1 commit into ggml-org:master
Nice. Maybe the interim API should also have logic to route main API requests to the respective server based on the model ID. That way, third-party apps can always communicate with a single network port.
Yes, that could be a good idea. I'm thinking about abstracting out the HTTP server implementation so we can implement the routing logic more easily. In any case, I think separating the HTTP layer from the handler code will be one of our main goals in the very short term, before we can do anything else. The problem is that
In case it helps, the way I.e. to route to the model with ID
curl -X POST http://127.0.0.1:8080/upstream/gemma-3-4b-it-GGUF/v1/chat/completions # etc...
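The path-based routing in the curl example above could be sketched roughly as follows. This is only a minimal illustration, not code from this PR or from any existing project: `upstream_route`, `parse_upstream_path`, `resolve_port` and the model-to-port registry are all hypothetical names, and the sketch assumes each loaded model is served by its own instance on a known local port.

```cpp
#include <optional>
#include <string>
#include <unordered_map>

// Result of splitting "/upstream/<model-id>/<rest>" (hypothetical type).
struct upstream_route {
    std::string model_id; // e.g. "gemma-3-4b-it-GGUF"
    std::string path;     // e.g. "/v1/chat/completions"
};

// Split a request path of the form /upstream/<model-id>/<rest>.
// Returns std::nullopt when the path does not have that shape.
std::optional<upstream_route> parse_upstream_path(const std::string & path) {
    static const std::string prefix = "/upstream/";
    if (path.compare(0, prefix.size(), prefix) != 0) {
        return std::nullopt; // not an upstream request
    }
    const std::string::size_type id_begin = prefix.size();
    const std::string::size_type id_end   = path.find('/', id_begin);
    if (id_end == std::string::npos || id_end == id_begin) {
        return std::nullopt; // missing model id or missing forwarded path
    }
    return upstream_route{ path.substr(id_begin, id_end - id_begin), path.substr(id_end) };
}

// Look up the local port of the instance serving a model.
// The registry (model id -> port) is an assumption made for this sketch.
std::optional<int> resolve_port(const std::unordered_map<std::string, int> & ports,
                                const std::string & model_id) {
    const auto it = ports.find(model_id);
    if (it == ports.end()) {
        return std::nullopt; // model is not loaded
    }
    return it->second;
}
```

A front-end listening on the single public port would then forward the request to the resolved port with the remaining path, and could answer with an error status when either lookup fails.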
This PR acts as a PoC to illustrate my idea in #13367.
It works by spawning an "interim" server that exposes a /load endpoint. For example:
The implementation separates run_interim_server and run_main_server because run_main_server can be converted to create a child process, though I'm not sure if this is the preferable way to go.

WDYT about this approach @ggerganov @slaren?
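To make the two-phase split easier to discuss, here is a minimal sketch of the control flow as I read it from the description. The signatures are assumptions made for illustration: the actual parameters of run_interim_server and run_main_server are not shown here, and `load_request`, `interim_fn`, `main_fn` and `run_two_phase` are hypothetical names.

```cpp
#include <functional>
#include <optional>
#include <string>

// Hypothetical payload of a /load request; the real fields are not shown in the PR text.
struct load_request {
    std::string model;
};

// Phase 1 (assumed signature): serve only /load and return the first request
// received, or std::nullopt if the process is asked to shut down first.
using interim_fn = std::function<std::optional<load_request>()>;

// Phase 2 (assumed signature): serve the main API for the requested model and
// return an exit code. This is the piece that could later be swapped for
// spawning a child process without touching phase 1.
using main_fn = std::function<int(const load_request &)>;

// Run the interim phase, then hand the load request over to the main phase.
int run_two_phase(const interim_fn & run_interim, const main_fn & run_main) {
    const std::optional<load_request> req = run_interim();
    if (!req) {
        return 0; // shut down before any model was requested
    }
    return run_main(*req);
}
```

Because the second phase is only reached through one call, replacing that call with a process launch would leave the interim logic unchanged, which seems to be the point of keeping the two functions separate.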