-
Notifications
You must be signed in to change notification settings - Fork 71
Track LLM Metrics #356
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Track LLM Metrics #356
Conversation
| """ | ||
| Creates an LLM endpoint for the current user. | ||
| """ | ||
| external_interfaces.monitoring_metrics_gateway.emit_route_call_metric( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we emit a metric here? Or add tags to the current trace? cc @song-william
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think adding tags to the current trace would "kill 2 birds with one stone" so we can search for traces based on user_id as well.
Metrics generated by Traces are not sampled out: https://docs.datadoghq.com/tracing/metrics/metrics_namespace/
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@saiatmakuri If we add a tag to the existing trace, then add this logic into the auth dependency. This would then make it so we don't need to copy-pasta this call to each route.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ahh, it looks like we can't customize the tags for trace-based metrics.
Trace metrics tags, possible tags are: env, service, version, resource, sublayer_type, sublayer_service, http.status_code, http.status_class, Datadog Agent tags (including the host and second primary tag). Note: Tags set on spans do not count and will not be available as tags for your traces metrics.
https://docs.datadoghq.com/tracing/metrics/metrics_namespace/ (edited)
| def emit_database_cache_miss_metric(self): | ||
| self.database_cache_miss += 1 | ||
|
|
||
| def emit_route_call_metric(self, route: str, _metadata: MetricMetadata): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe outside of the scope of this PR, but might be good to adopt Datadog-esque terminology, where increment means += 1. emit to me sounds more like a gauge or count.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you have a link to the "Datadog-esque terminology" you are referring to? What is currently here looks like it is just pattern matching the emit_* pattern.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah was thinking of https://docs.datadoghq.com/metrics/custom_metrics/dogstatsd_metrics_submission/

But yeah acknowledge that this is just following the existing naming pattern (which is what I think could be tweaked).
song-william
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Approving to unblock
song-william
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for addressing the comments!
|
|
||
| llm_router_v1 = APIRouter(prefix="/v1/llm") | ||
|
|
||
| async def record_route_call( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We could consider adding this at the highest level so all model-engine routes get it (e.g bundles/endpoits).
| def emit_database_cache_miss_metric(self): | ||
| self.database_cache_miss += 1 | ||
|
|
||
| def emit_route_call_metric(self, route: str, _metadata: MetricMetadata): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you have a link to the "Datadog-esque terminology" you are referring to? What is currently here looks like it is just pattern matching the emit_* pattern.
Emit metrics on each route calls for the LLM endpoints.
Tested locally with:
curl -H "Authorization: <AUTH>" http://localhost:5001/v1/llm/model-endpoints