server: enhanced health endpoint #5548
{"slots_idle", available_slots},
{"slots_processing", processing_slots}};
res.set_content(health.dump(), "application/json");
res.status = 503; // HTTP Service Unavailable
@phymbert thanks for introducing this additional metadata to the health check!
one nit: it seems unidiomatic for health to return an error status code for an expected and error-free state. in practice, for a local inference server with a single slot (the default behavior), this is particularly unintuitive.
while the server is busy wrt inference, it can happily process health check requests — why return an error (5xx) status code, rather than a success (request understood and processed just fine) along with the actual information desired, the count of available slots (0)?
503 or 409 conflict make more sense to me for /completion or chat completion requests — their request can genuinely not be processed. but the health check returning 5xx codes during normal operation feels wrong to me. the server is not unhealthy by any metric.
it seems this is not an uncommon point of bike shedding so I will happily work around this behavior if i'm in the minority, but wanted to share in case there was any other agreement to this effect.
happy to put up a patch if so!
hi @brittlewis12, thanks for your feedback.
My primary goal is to point a Kubernetes readiness probe at the health endpoint. That way, the server will not receive new incoming requests while it has no free slot; they will be routed to another available pod instead. It does not mean the server is down, but, as 503 says, that it is overloaded. This is the standard for cloud-native applications.
@brittlewis12 I finally got your point. PR #5594 addresses it, thanks for pointing this out.
* server: enrich health endpoint with available slots, return 503 if no slots are available
* server: document new status no slot available in the README.md
Context
It can be useful to monitor server slot activity, especially when no slot is available. In a cluster of llama servers, this allows incoming requests to be routed to an instance with available slots, for example by using Kubernetes probes.
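The routing idea can be sketched roughly as follows. Kubernetes HTTP probes treat any status code from 200 up to (but not including) 400 as success, so an instance answering 503 is taken out of rotation. The `InstanceHealth` struct, the function names, and the instance names are hypothetical and exist only for this illustration; they are not part of the server or of Kubernetes.

```cpp
#include <string>
#include <vector>

// Hypothetical snapshot of one server instance's latest health check.
struct InstanceHealth {
    std::string name;  // illustrative instance identifier
    int http_status;   // status code returned by the health endpoint
};

// Kubernetes HTTP probes count any status in [200, 400) as success.
bool probe_succeeded(int http_status) {
    return http_status >= 200 && http_status < 400;
}

// Return the names of the instances that should keep receiving traffic.
std::vector<std::string> ready_instances(const std::vector<InstanceHealth>& all) {
    std::vector<std::string> ready;
    for (const auto& inst : all) {
        if (probe_succeeded(inst.http_status)) {
            ready.push_back(inst.name);
        }
    }
    return ready;
}
```

With this sketch, an instance reporting 200 stays in the ready set while one reporting 503 is skipped until a slot frees up.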
Proposed changes
Add slots_idle and slots_processing fields to the health endpoint response; answer 503 if no slot is available.
Closes #4746
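A minimal, dependency-free sketch of the proposed behavior might look like this. It is not the PR's actual code, which builds the JSON with a `health` object and writes into the HTTP response directly. The `HealthReply` type and `make_health_reply` function are hypothetical, and the `"ok"` status string is an assumption; `"no slot available"` follows the commit message above.

```cpp
#include <cassert>
#include <string>

// Hypothetical result type for this sketch; the real handler writes
// straight into the HTTP response object instead.
struct HealthReply {
    int status;        // HTTP status code
    std::string body;  // JSON payload
};

// Build the health reply from the current slot counts:
// 200 when at least one slot is idle, 503 otherwise.
HealthReply make_health_reply(int slots_idle, int slots_processing) {
    assert(slots_idle >= 0 && slots_processing >= 0);
    const bool has_idle = slots_idle > 0;
    // "ok" is an assumed status string for the healthy case.
    const std::string state = has_idle ? "ok" : "no slot available";
    const std::string body =
        "{\"status\":\"" + state + "\"," +
        "\"slots_idle\":" + std::to_string(slots_idle) + "," +
        "\"slots_processing\":" + std::to_string(slots_processing) + "}";
    return {has_idle ? 200 : 503, body};
}
```

Calling `make_health_reply(0, 1)` in this sketch yields status 503 with both slot counts in the body, so a monitoring client still gets the counts even when the probe is failing.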