Skip to content

Conversation

@saiatmakuri
Copy link
Contributor

@saiatmakuri saiatmakuri commented Oct 30, 2023

Emit metrics on each route calls for the LLM endpoints.

Tested locally with:

curl -H "Authorization: <AUTH>" http://localhost:5001/v1/llm/model-endpoints
curl -H "Content-Type: application/json" -H "Authorization: <AUTH>" -d '{ "prompt": "Hello!", "max_new_tokens": 10, "temperature": 0.1 }' http://localhost:5001/v1/llm/completions-sync?model_endpoint_name=llama-2-7b-test-vllm

"""
Creates an LLM endpoint for the current user.
"""
external_interfaces.monitoring_metrics_gateway.emit_route_call_metric(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we emit a metric here? Or add tags to the current trace? cc @song-william

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think adding tags to the current trace would "kill 2 birds with one stone" so we can search for traces based on user_id as well.

Metrics generated by Traces are not sampled out: https://docs.datadoghq.com/tracing/metrics/metrics_namespace/

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@saiatmakuri If we add a tag to the existing trace, then add this logic into the auth dependency. This would then make it so we don't need to copy-pasta this call to each route.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ahh, it looks like we can't customize the tags for trace-based metrics.

Trace metrics tags, possible tags are: env, service, version, resource, sublayer_type, sublayer_service, http.status_code, http.status_class, Datadog Agent tags (including the host and second primary tag). Note: Tags set on spans do not count and will not be available as tags for your traces metrics.
https://docs.datadoghq.com/tracing/metrics/metrics_namespace/ (edited)

def emit_database_cache_miss_metric(self):
self.database_cache_miss += 1

def emit_route_call_metric(self, route: str, _metadata: MetricMetadata):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe outside of the scope of this PR, but might be good to adopt Datadog-esque terminology, where increment means += 1. emit to me sounds more like a gauge or count.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you have a link to the "Datadog-esque terminology" you are referring to? What is currently here looks like it is just pattern matching the emit_* pattern.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah was thinking of https://docs.datadoghq.com/metrics/custom_metrics/dogstatsd_metrics_submission/
image

But yeah acknowledge that this is just following the existing naming pattern (which is what I think could be tweaked).

Copy link
Contributor

@song-william song-william left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approving to unblock

Copy link
Contributor

@song-william song-william left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for addressing the comments!


llm_router_v1 = APIRouter(prefix="/v1/llm")

async def record_route_call(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could consider adding this at the highest level so all model-engine routes get it (e.g bundles/endpoits).

def emit_database_cache_miss_metric(self):
self.database_cache_miss += 1

def emit_route_call_metric(self, route: str, _metadata: MetricMetadata):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you have a link to the "Datadog-esque terminology" you are referring to? What is currently here looks like it is just pattern matching the emit_* pattern.

@saiatmakuri saiatmakuri merged commit da9f82b into main Nov 2, 2023
@saiatmakuri saiatmakuri deleted the saiatmakuri/track-llm-call-metrics branch November 2, 2023 17:35
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants