
Feat - parse LiteLLM headers to record metrics regarding backend used and fallbacks and also costs - GENAI-4264#118

Merged
subpath merged 10 commits into main from feat-tracking-api-base-and-fallbacks-geanai-4264 on Mar 25, 2026
Conversation


@subpath (Collaborator) commented Mar 23, 2026

What's new:

Add a LiteLLM response header parser to record extra metrics.
Jira: https://mozilla-hub.atlassian.net/browse/GENAI-4264

Note that vertex_ai does not return an api_base header, but TogetherAI and RayServe do.

New metrics

Recorded for successful completions only; labels include requested model, backend, service type, purpose, and fallback_used where applicable:

  • mlpa_litellm_routed_completions_total
  • mlpa_litellm_attempted_fallbacks / mlpa_litellm_attempted_retries (histograms)
  • mlpa_litellm_reported_duration_seconds (histogram, from proxy duration header)
  • mlpa_litellm_reported_cost_usd_total (counter: increments by reported USD per completion for windowed sums via increase())
  • mlpa_litellm_routed_tokens_total (prompt/completion tokens aligned with routing labels)
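
For example, windowed spend per backend can be derived from the cost counter with a query along these lines (metric and label names as emitted below; the window is illustrative):

```promql
# 24h spend per backend and requested model, summed from the cost counter
sum by (backend, requested_model) (
  increase(mlpa_litellm_reported_cost_usd_total[24h])
)
```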

QA:

Old and new tests pass ✅
Local QA ✅

# HELP python_gc_objects_collected_total Objects collected during gc
# TYPE python_gc_objects_collected_total counter
python_gc_objects_collected_total{generation="0"} 875.0
python_gc_objects_collected_total{generation="1"} 80.0
python_gc_objects_collected_total{generation="2"} 10.0
# HELP python_gc_objects_uncollectable_total Uncollectable objects found during GC
# TYPE python_gc_objects_uncollectable_total counter
python_gc_objects_uncollectable_total{generation="0"} 0.0
python_gc_objects_uncollectable_total{generation="1"} 0.0
python_gc_objects_uncollectable_total{generation="2"} 0.0
# HELP python_gc_collections_total Number of times this generation was collected
# TYPE python_gc_collections_total counter
python_gc_collections_total{generation="0"} 291.0
python_gc_collections_total{generation="1"} 26.0
python_gc_collections_total{generation="2"} 2.0
# HELP python_info Python platform information
# TYPE python_info gauge
python_info{implementation="CPython",major="3",minor="12",patchlevel="13",version="3.12.13"} 1.0
# HELP mlpa_in_progress_requests Number of requests currently in progress.
# TYPE mlpa_in_progress_requests gauge
mlpa_in_progress_requests 1.0
# HELP mlpa_requests_total Total number of requests handled by the proxy.
# TYPE mlpa_requests_total counter
mlpa_requests_total{endpoint="/v1/chat/completions",method="POST",purpose="",service_type="ai"} 1.0
# HELP mlpa_requests_created Total number of requests handled by the proxy.
# TYPE mlpa_requests_created gauge
mlpa_requests_created{endpoint="/v1/chat/completions",method="POST",purpose="",service_type="ai"} 1.774341358378579e+09
# HELP mlpa_response_status_codes_total Total number of response status codes.
# TYPE mlpa_response_status_codes_total counter
mlpa_response_status_codes_total{status_code="200"} 1.0
# HELP mlpa_response_status_codes_created Total number of response status codes.
# TYPE mlpa_response_status_codes_created gauge
mlpa_response_status_codes_created{status_code="200"} 1.774341358378586e+09
# HELP mlpa_request_latency_seconds Request latency in seconds.
# TYPE mlpa_request_latency_seconds histogram
mlpa_request_latency_seconds_bucket{endpoint="/v1/chat/completions",le="0.005",method="POST"} 0.0
mlpa_request_latency_seconds_bucket{endpoint="/v1/chat/completions",le="0.01",method="POST"} 0.0
mlpa_request_latency_seconds_bucket{endpoint="/v1/chat/completions",le="0.025",method="POST"} 0.0
mlpa_request_latency_seconds_bucket{endpoint="/v1/chat/completions",le="0.05",method="POST"} 0.0
mlpa_request_latency_seconds_bucket{endpoint="/v1/chat/completions",le="0.1",method="POST"} 0.0
mlpa_request_latency_seconds_bucket{endpoint="/v1/chat/completions",le="0.25",method="POST"} 0.0
mlpa_request_latency_seconds_bucket{endpoint="/v1/chat/completions",le="0.5",method="POST"} 0.0
mlpa_request_latency_seconds_bucket{endpoint="/v1/chat/completions",le="1.0",method="POST"} 0.0
mlpa_request_latency_seconds_bucket{endpoint="/v1/chat/completions",le="2.5",method="POST"} 0.0
mlpa_request_latency_seconds_bucket{endpoint="/v1/chat/completions",le="5.0",method="POST"} 1.0
mlpa_request_latency_seconds_bucket{endpoint="/v1/chat/completions",le="10.0",method="POST"} 1.0
mlpa_request_latency_seconds_bucket{endpoint="/v1/chat/completions",le="+Inf",method="POST"} 1.0
mlpa_request_latency_seconds_count{endpoint="/v1/chat/completions",method="POST"} 1.0
mlpa_request_latency_seconds_sum{endpoint="/v1/chat/completions",method="POST"} 4.77369670799817
# HELP mlpa_request_latency_seconds_created Request latency in seconds.
# TYPE mlpa_request_latency_seconds_created gauge
mlpa_request_latency_seconds_created{endpoint="/v1/chat/completions",method="POST"} 1.7743413583785589e+09
# HELP mlpa_validate_challenge_latency_seconds Challenge validation latency in seconds.
# TYPE mlpa_validate_challenge_latency_seconds histogram
# HELP mlpa_validate_app_attest_latency_seconds App Attest authentication latency in seconds.
# TYPE mlpa_validate_app_attest_latency_seconds histogram
# HELP mlpa_validate_app_assert_latency_seconds App Assert authentication latency in seconds.
# TYPE mlpa_validate_app_assert_latency_seconds histogram
# HELP mlpa_validate_fxa_latency_seconds FxA authentication latency in seconds.
# TYPE mlpa_validate_fxa_latency_seconds histogram
mlpa_validate_fxa_latency_seconds_bucket{le="0.05",result="success",verification_source="local"} 0.0
mlpa_validate_fxa_latency_seconds_bucket{le="0.1",result="success",verification_source="local"} 0.0
mlpa_validate_fxa_latency_seconds_bucket{le="0.25",result="success",verification_source="local"} 0.0
mlpa_validate_fxa_latency_seconds_bucket{le="0.5",result="success",verification_source="local"} 0.0
mlpa_validate_fxa_latency_seconds_bucket{le="1.0",result="success",verification_source="local"} 0.0
mlpa_validate_fxa_latency_seconds_bucket{le="2.5",result="success",verification_source="local"} 0.0
mlpa_validate_fxa_latency_seconds_bucket{le="5.0",result="success",verification_source="local"} 1.0
mlpa_validate_fxa_latency_seconds_bucket{le="+Inf",result="success",verification_source="local"} 1.0
mlpa_validate_fxa_latency_seconds_count{result="success",verification_source="local"} 1.0
mlpa_validate_fxa_latency_seconds_sum{result="success",verification_source="local"} 3.342585708000115
# HELP mlpa_validate_fxa_latency_seconds_created FxA authentication latency in seconds.
# TYPE mlpa_validate_fxa_latency_seconds_created gauge
mlpa_validate_fxa_latency_seconds_created{result="success",verification_source="local"} 1.774341356950788e+09
# HELP mlpa_fxa_verifications_total Total number of FxA token verifications.
# TYPE mlpa_fxa_verifications_total counter
mlpa_fxa_verifications_total{verification_source="local"} 1.0
# HELP mlpa_fxa_verifications_created Total number of FxA token verifications.
# TYPE mlpa_fxa_verifications_created gauge
mlpa_fxa_verifications_created{verification_source="local"} 1.7743413569505558e+09
# HELP mlpa_chat_completion_latency_seconds Chat completion latency in seconds.
# TYPE mlpa_chat_completion_latency_seconds histogram
mlpa_chat_completion_latency_seconds_bucket{le="0.5",model="openai/gpt-4o",purpose="",result="success",service_type="ai"} 0.0
mlpa_chat_completion_latency_seconds_bucket{le="1.0",model="openai/gpt-4o",purpose="",result="success",service_type="ai"} 0.0
mlpa_chat_completion_latency_seconds_bucket{le="2.5",model="openai/gpt-4o",purpose="",result="success",service_type="ai"} 1.0
mlpa_chat_completion_latency_seconds_bucket{le="5.0",model="openai/gpt-4o",purpose="",result="success",service_type="ai"} 1.0
mlpa_chat_completion_latency_seconds_bucket{le="10.0",model="openai/gpt-4o",purpose="",result="success",service_type="ai"} 1.0
mlpa_chat_completion_latency_seconds_bucket{le="20.0",model="openai/gpt-4o",purpose="",result="success",service_type="ai"} 1.0
mlpa_chat_completion_latency_seconds_bucket{le="30.0",model="openai/gpt-4o",purpose="",result="success",service_type="ai"} 1.0
mlpa_chat_completion_latency_seconds_bucket{le="60.0",model="openai/gpt-4o",purpose="",result="success",service_type="ai"} 1.0
mlpa_chat_completion_latency_seconds_bucket{le="120.0",model="openai/gpt-4o",purpose="",result="success",service_type="ai"} 1.0
mlpa_chat_completion_latency_seconds_bucket{le="180.0",model="openai/gpt-4o",purpose="",result="success",service_type="ai"} 1.0
mlpa_chat_completion_latency_seconds_bucket{le="+Inf",model="openai/gpt-4o",purpose="",result="success",service_type="ai"} 1.0
mlpa_chat_completion_latency_seconds_count{model="openai/gpt-4o",purpose="",result="success",service_type="ai"} 1.0
mlpa_chat_completion_latency_seconds_sum{model="openai/gpt-4o",purpose="",result="success",service_type="ai"} 1.3718272090045502
# HELP mlpa_chat_completion_latency_seconds_created Chat completion latency in seconds.
# TYPE mlpa_chat_completion_latency_seconds_created gauge
mlpa_chat_completion_latency_seconds_created{model="openai/gpt-4o",purpose="",result="success",service_type="ai"} 1.774341358377821e+09
# HELP mlpa_chat_completion_ttft_seconds Time to first token for streaming chat completions in seconds.
# TYPE mlpa_chat_completion_ttft_seconds histogram
# HELP mlpa_chat_tokens_total Number of tokens for chat completions.
# TYPE mlpa_chat_tokens_total counter
mlpa_chat_tokens_total{model="openai/gpt-4o",purpose="",service_type="ai",type="prompt"} 402.0
mlpa_chat_tokens_total{model="openai/gpt-4o",purpose="",service_type="ai",type="completion"} 42.0
# HELP mlpa_chat_tokens_created Number of tokens for chat completions.
# TYPE mlpa_chat_tokens_created gauge
mlpa_chat_tokens_created{model="openai/gpt-4o",purpose="",service_type="ai",type="prompt"} 1.774341358377654e+09
mlpa_chat_tokens_created{model="openai/gpt-4o",purpose="",service_type="ai",type="completion"} 1.7743413583777149e+09
# HELP mlpa_chat_tokens_per_request Distribution of tokens per chat completion request.
# TYPE mlpa_chat_tokens_per_request histogram
mlpa_chat_tokens_per_request_bucket{le="0.0",model="openai/gpt-4o",purpose="",service_type="ai",type="prompt"} 0.0
mlpa_chat_tokens_per_request_bucket{le="10.0",model="openai/gpt-4o",purpose="",service_type="ai",type="prompt"} 0.0
mlpa_chat_tokens_per_request_bucket{le="50.0",model="openai/gpt-4o",purpose="",service_type="ai",type="prompt"} 0.0
mlpa_chat_tokens_per_request_bucket{le="100.0",model="openai/gpt-4o",purpose="",service_type="ai",type="prompt"} 0.0
mlpa_chat_tokens_per_request_bucket{le="250.0",model="openai/gpt-4o",purpose="",service_type="ai",type="prompt"} 0.0
mlpa_chat_tokens_per_request_bucket{le="500.0",model="openai/gpt-4o",purpose="",service_type="ai",type="prompt"} 1.0
mlpa_chat_tokens_per_request_bucket{le="1000.0",model="openai/gpt-4o",purpose="",service_type="ai",type="prompt"} 1.0
mlpa_chat_tokens_per_request_bucket{le="2500.0",model="openai/gpt-4o",purpose="",service_type="ai",type="prompt"} 1.0
mlpa_chat_tokens_per_request_bucket{le="5000.0",model="openai/gpt-4o",purpose="",service_type="ai",type="prompt"} 1.0
mlpa_chat_tokens_per_request_bucket{le="10000.0",model="openai/gpt-4o",purpose="",service_type="ai",type="prompt"} 1.0
mlpa_chat_tokens_per_request_bucket{le="25000.0",model="openai/gpt-4o",purpose="",service_type="ai",type="prompt"} 1.0
mlpa_chat_tokens_per_request_bucket{le="+Inf",model="openai/gpt-4o",purpose="",service_type="ai",type="prompt"} 1.0
mlpa_chat_tokens_per_request_count{model="openai/gpt-4o",purpose="",service_type="ai",type="prompt"} 1.0
mlpa_chat_tokens_per_request_sum{model="openai/gpt-4o",purpose="",service_type="ai",type="prompt"} 402.0
mlpa_chat_tokens_per_request_bucket{le="0.0",model="openai/gpt-4o",purpose="",service_type="ai",type="completion"} 0.0
mlpa_chat_tokens_per_request_bucket{le="10.0",model="openai/gpt-4o",purpose="",service_type="ai",type="completion"} 0.0
mlpa_chat_tokens_per_request_bucket{le="50.0",model="openai/gpt-4o",purpose="",service_type="ai",type="completion"} 1.0
mlpa_chat_tokens_per_request_bucket{le="100.0",model="openai/gpt-4o",purpose="",service_type="ai",type="completion"} 1.0
mlpa_chat_tokens_per_request_bucket{le="250.0",model="openai/gpt-4o",purpose="",service_type="ai",type="completion"} 1.0
mlpa_chat_tokens_per_request_bucket{le="500.0",model="openai/gpt-4o",purpose="",service_type="ai",type="completion"} 1.0
mlpa_chat_tokens_per_request_bucket{le="1000.0",model="openai/gpt-4o",purpose="",service_type="ai",type="completion"} 1.0
mlpa_chat_tokens_per_request_bucket{le="2500.0",model="openai/gpt-4o",purpose="",service_type="ai",type="completion"} 1.0
mlpa_chat_tokens_per_request_bucket{le="5000.0",model="openai/gpt-4o",purpose="",service_type="ai",type="completion"} 1.0
mlpa_chat_tokens_per_request_bucket{le="10000.0",model="openai/gpt-4o",purpose="",service_type="ai",type="completion"} 1.0
mlpa_chat_tokens_per_request_bucket{le="25000.0",model="openai/gpt-4o",purpose="",service_type="ai",type="completion"} 1.0
mlpa_chat_tokens_per_request_bucket{le="+Inf",model="openai/gpt-4o",purpose="",service_type="ai",type="completion"} 1.0
mlpa_chat_tokens_per_request_count{model="openai/gpt-4o",purpose="",service_type="ai",type="completion"} 1.0
mlpa_chat_tokens_per_request_sum{model="openai/gpt-4o",purpose="",service_type="ai",type="completion"} 42.0
# HELP mlpa_chat_tokens_per_request_created Distribution of tokens per chat completion request.
# TYPE mlpa_chat_tokens_per_request_created gauge
mlpa_chat_tokens_per_request_created{model="openai/gpt-4o",purpose="",service_type="ai",type="prompt"} 1.774341358377682e+09
mlpa_chat_tokens_per_request_created{model="openai/gpt-4o",purpose="",service_type="ai",type="completion"} 1.774341358377721e+09
# HELP mlpa_chat_tool_calls_total Total number of LLM tool invocations.
# TYPE mlpa_chat_tool_calls_total counter
# HELP mlpa_chat_completions_with_tools_total Number of completions that contained at least one tool call.
# TYPE mlpa_chat_completions_with_tools_total counter
# HELP mlpa_chat_tool_calls_per_completion Distribution of tool calls per completion.
# TYPE mlpa_chat_tool_calls_per_completion histogram
# HELP mlpa_chat_requests_with_tools_total Number of chat requests that included a tools payload.
# TYPE mlpa_chat_requests_with_tools_total counter
# HELP mlpa_chat_request_rejections_total Number of chat requests rejected due to budget, rate limit, payload size, or managed-user signup cap.
# TYPE mlpa_chat_request_rejections_total counter
# HELP mlpa_litellm_routed_completions_total Successful chat completions with LiteLLM routing labels from response headers.
# TYPE mlpa_litellm_routed_completions_total counter
mlpa_litellm_routed_completions_total{backend="https://api.openai.com",fallback_used="false",purpose="",requested_model="openai/gpt-4o",service_type="ai"} 1.0
# HELP mlpa_litellm_routed_completions_created Successful chat completions with LiteLLM routing labels from response headers.
# TYPE mlpa_litellm_routed_completions_created gauge
mlpa_litellm_routed_completions_created{backend="https://api.openai.com",fallback_used="false",purpose="",requested_model="openai/gpt-4o",service_type="ai"} 1.7743413583777559e+09
# HELP mlpa_litellm_attempted_fallbacks LiteLLM-reported fallback attempts per successful completion (from x-litellm-attempted-fallbacks).
# TYPE mlpa_litellm_attempted_fallbacks histogram
mlpa_litellm_attempted_fallbacks_bucket{backend="https://api.openai.com",le="0.0",purpose="",requested_model="openai/gpt-4o",service_type="ai"} 1.0
mlpa_litellm_attempted_fallbacks_bucket{backend="https://api.openai.com",le="1.0",purpose="",requested_model="openai/gpt-4o",service_type="ai"} 1.0
mlpa_litellm_attempted_fallbacks_bucket{backend="https://api.openai.com",le="2.0",purpose="",requested_model="openai/gpt-4o",service_type="ai"} 1.0
mlpa_litellm_attempted_fallbacks_bucket{backend="https://api.openai.com",le="3.0",purpose="",requested_model="openai/gpt-4o",service_type="ai"} 1.0
mlpa_litellm_attempted_fallbacks_bucket{backend="https://api.openai.com",le="5.0",purpose="",requested_model="openai/gpt-4o",service_type="ai"} 1.0
mlpa_litellm_attempted_fallbacks_bucket{backend="https://api.openai.com",le="+Inf",purpose="",requested_model="openai/gpt-4o",service_type="ai"} 1.0
mlpa_litellm_attempted_fallbacks_count{backend="https://api.openai.com",purpose="",requested_model="openai/gpt-4o",service_type="ai"} 1.0
mlpa_litellm_attempted_fallbacks_sum{backend="https://api.openai.com",purpose="",requested_model="openai/gpt-4o",service_type="ai"} 0.0
# HELP mlpa_litellm_attempted_fallbacks_created LiteLLM-reported fallback attempts per successful completion (from x-litellm-attempted-fallbacks).
# TYPE mlpa_litellm_attempted_fallbacks_created gauge
mlpa_litellm_attempted_fallbacks_created{backend="https://api.openai.com",purpose="",requested_model="openai/gpt-4o",service_type="ai"} 1.774341358377763e+09
# HELP mlpa_litellm_attempted_retries LiteLLM-reported retry attempts per successful completion (from x-litellm-attempted-retries).
# TYPE mlpa_litellm_attempted_retries histogram
mlpa_litellm_attempted_retries_bucket{backend="https://api.openai.com",le="0.0",purpose="",requested_model="openai/gpt-4o",service_type="ai"} 1.0
mlpa_litellm_attempted_retries_bucket{backend="https://api.openai.com",le="1.0",purpose="",requested_model="openai/gpt-4o",service_type="ai"} 1.0
mlpa_litellm_attempted_retries_bucket{backend="https://api.openai.com",le="2.0",purpose="",requested_model="openai/gpt-4o",service_type="ai"} 1.0
mlpa_litellm_attempted_retries_bucket{backend="https://api.openai.com",le="3.0",purpose="",requested_model="openai/gpt-4o",service_type="ai"} 1.0
mlpa_litellm_attempted_retries_bucket{backend="https://api.openai.com",le="5.0",purpose="",requested_model="openai/gpt-4o",service_type="ai"} 1.0
mlpa_litellm_attempted_retries_bucket{backend="https://api.openai.com",le="+Inf",purpose="",requested_model="openai/gpt-4o",service_type="ai"} 1.0
mlpa_litellm_attempted_retries_count{backend="https://api.openai.com",purpose="",requested_model="openai/gpt-4o",service_type="ai"} 1.0
mlpa_litellm_attempted_retries_sum{backend="https://api.openai.com",purpose="",requested_model="openai/gpt-4o",service_type="ai"} 0.0
# HELP mlpa_litellm_attempted_retries_created LiteLLM-reported retry attempts per successful completion (from x-litellm-attempted-retries).
# TYPE mlpa_litellm_attempted_retries_created gauge
mlpa_litellm_attempted_retries_created{backend="https://api.openai.com",purpose="",requested_model="openai/gpt-4o",service_type="ai"} 1.774341358377772e+09
# HELP mlpa_litellm_reported_duration_seconds LiteLLM proxy-reported request duration in seconds (x-litellm-response-duration-ms / 1000).
# TYPE mlpa_litellm_reported_duration_seconds histogram
mlpa_litellm_reported_duration_seconds_bucket{backend="https://api.openai.com",fallback_used="false",le="0.5",purpose="",requested_model="openai/gpt-4o",service_type="ai"} 0.0
mlpa_litellm_reported_duration_seconds_bucket{backend="https://api.openai.com",fallback_used="false",le="1.0",purpose="",requested_model="openai/gpt-4o",service_type="ai"} 0.0
mlpa_litellm_reported_duration_seconds_bucket{backend="https://api.openai.com",fallback_used="false",le="2.5",purpose="",requested_model="openai/gpt-4o",service_type="ai"} 1.0
mlpa_litellm_reported_duration_seconds_bucket{backend="https://api.openai.com",fallback_used="false",le="5.0",purpose="",requested_model="openai/gpt-4o",service_type="ai"} 1.0
mlpa_litellm_reported_duration_seconds_bucket{backend="https://api.openai.com",fallback_used="false",le="10.0",purpose="",requested_model="openai/gpt-4o",service_type="ai"} 1.0
mlpa_litellm_reported_duration_seconds_bucket{backend="https://api.openai.com",fallback_used="false",le="20.0",purpose="",requested_model="openai/gpt-4o",service_type="ai"} 1.0
mlpa_litellm_reported_duration_seconds_bucket{backend="https://api.openai.com",fallback_used="false",le="30.0",purpose="",requested_model="openai/gpt-4o",service_type="ai"} 1.0
mlpa_litellm_reported_duration_seconds_bucket{backend="https://api.openai.com",fallback_used="false",le="60.0",purpose="",requested_model="openai/gpt-4o",service_type="ai"} 1.0
mlpa_litellm_reported_duration_seconds_bucket{backend="https://api.openai.com",fallback_used="false",le="120.0",purpose="",requested_model="openai/gpt-4o",service_type="ai"} 1.0
mlpa_litellm_reported_duration_seconds_bucket{backend="https://api.openai.com",fallback_used="false",le="180.0",purpose="",requested_model="openai/gpt-4o",service_type="ai"} 1.0
mlpa_litellm_reported_duration_seconds_bucket{backend="https://api.openai.com",fallback_used="false",le="+Inf",purpose="",requested_model="openai/gpt-4o",service_type="ai"} 1.0
mlpa_litellm_reported_duration_seconds_count{backend="https://api.openai.com",fallback_used="false",purpose="",requested_model="openai/gpt-4o",service_type="ai"} 1.0
mlpa_litellm_reported_duration_seconds_sum{backend="https://api.openai.com",fallback_used="false",purpose="",requested_model="openai/gpt-4o",service_type="ai"} 1.285656
# HELP mlpa_litellm_reported_duration_seconds_created LiteLLM proxy-reported request duration in seconds (x-litellm-response-duration-ms / 1000).
# TYPE mlpa_litellm_reported_duration_seconds_created gauge
mlpa_litellm_reported_duration_seconds_created{backend="https://api.openai.com",fallback_used="false",purpose="",requested_model="openai/gpt-4o",service_type="ai"} 1.774341358377782e+09
# HELP mlpa_litellm_reported_cost_usd_total Cumulative LiteLLM-reported spend in USD (x-litellm-response-cost); use increase() over a range for windowed sums.
# TYPE mlpa_litellm_reported_cost_usd_total counter
mlpa_litellm_reported_cost_usd_total{backend="https://api.openai.com",fallback_used="false",purpose="",requested_model="openai/gpt-4o",service_type="ai"} 0.001425
# HELP mlpa_litellm_reported_cost_usd_created Cumulative LiteLLM-reported spend in USD (x-litellm-response-cost); use increase() over a range for windowed sums.
# TYPE mlpa_litellm_reported_cost_usd_created gauge
mlpa_litellm_reported_cost_usd_created{backend="https://api.openai.com",fallback_used="false",purpose="",requested_model="openai/gpt-4o",service_type="ai"} 1.774341358377802e+09
# HELP mlpa_litellm_routed_tokens_total Token counts attributed to LiteLLM winning backend (from usage, same completion as routing headers).
# TYPE mlpa_litellm_routed_tokens_total counter
mlpa_litellm_routed_tokens_total{backend="https://api.openai.com",fallback_used="false",purpose="",requested_model="openai/gpt-4o",service_type="ai",type="prompt"} 402.0
mlpa_litellm_routed_tokens_total{backend="https://api.openai.com",fallback_used="false",purpose="",requested_model="openai/gpt-4o",service_type="ai",type="completion"} 42.0
# HELP mlpa_litellm_routed_tokens_created Token counts attributed to LiteLLM winning backend (from usage, same completion as routing headers).
# TYPE mlpa_litellm_routed_tokens_created gauge
mlpa_litellm_routed_tokens_created{backend="https://api.openai.com",fallback_used="false",purpose="",requested_model="openai/gpt-4o",service_type="ai",type="prompt"} 1.7743413583778088e+09
mlpa_litellm_routed_tokens_created{backend="https://api.openai.com",fallback_used="false",purpose="",requested_model="openai/gpt-4o",service_type="ai",type="completion"} 1.774341358377813e+09

@subpath subpath changed the title wip Feat - parse LiteLLM headers to recort metrics regarding backend used and fallbacks and also costs Mar 24, 2026
@subpath subpath marked this pull request as ready for review March 24, 2026 08:40
@subpath subpath changed the title Feat - parse LiteLLM headers to recort metrics regarding backend used and fallbacks and also costs Feat - parse LiteLLM headers to record metrics regarding backend used and fallbacks and also costs Mar 24, 2026
@subpath subpath changed the title Feat - parse LiteLLM headers to record metrics regarding backend used and fallbacks and also costs Feat - parse LiteLLM headers to record metrics regarding backend used and fallbacks and also costs - GENAI-4264 Mar 24, 2026

def valid_purposes_for_service_type(self, service_type: str) -> list[str]:
"""Return valid purpose values for a service type (empty if purpose not used)."""
return self.service_type_purposes.get(service_type, [])
Suggested change
return self.service_type_purposes.get(service_type, {})


Or we could remove the second parameter entirely since it's explicitly defined above, wdyt?

fallbacks = _safe_int_header(headers, LITELLM_HEADER_ATTEMPTED_FALLBACKS)
retries = _safe_int_header(headers, LITELLM_HEADER_ATTEMPTED_RETRIES)
duration_ms = _safe_float_header(headers, LITELLM_HEADER_RESPONSE_DURATION_MS)
if duration_ms is not None and duration_ms < 0:

Does LiteLLM ever return a negative value here?

float(snapshot.attempted_fallbacks)
)
metrics.litellm_attempted_retries.labels(**labels_base).observe(
float(snapshot.attempted_retries)

This is fine as an int right?

fallback_used=fallback_used,
).inc()
metrics.litellm_attempted_fallbacks.labels(**labels_base).observe(
float(snapshot.attempted_fallbacks)

This is fine as an int right?

if raw is None:
return 0
try:
return int(float(str(raw).strip()))

suggestion: We should probably either pick int or float, not both


Since it's _safe_int_header, int is best 👍

@noahpodgurski (Collaborator)

Looks good to me, just a few comments 👍

subpath and others added 6 commits March 24, 2026 15:16
Co-authored-by: Noah Podgurski <42069075+noahpodgurski@users.noreply.github.com>
…thub.com:Firefox-AI/MLPA into feat-tracking-api-base-and-fallbacks-geanai-4264
@subpath subpath merged commit d15aae6 into main Mar 25, 2026
1 check passed
@subpath subpath deleted the feat-tracking-api-base-and-fallbacks-geanai-4264 branch March 25, 2026 12:49