Describe your environment
OS: Linux (Ubuntu 24.04.3 LTS)
Python version: 3.10
Package version: opentelemetry-instrumentation-google-genai 0.5b0
What happened?
The Google GenAI instrumentation reports token counts that can be 50x or more above actual usage when using streaming responses. This happens because _maybe_update_token_counts uses += to accumulate token counts from each streaming chunk, but:
- Input tokens: All models report the same constant value (e.g., 9) in every chunk. Using += sums this across all chunks, causing massive overcounting.
- Output tokens: Some models, such as gemini-3-pro-preview, report cumulative counts in each chunk (not delta values). Using += sums all these cumulative values, causing massive overcounting.
| Model | Input Overcounting | Output Overcounting |
| --- | --- | --- |
| gemini-2.0-flash | ~30x | 1.0x (unaffected) |
| gemini-3-pro-preview | ~58x | ~30x |
The overcounting factor varies based on the number of streaming chunks in the response.
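The mechanism can be illustrated with a small standalone sketch (the chunk values below are fabricated for illustration, not real API output): each chunk repeats the constant prompt count and carries a cumulative candidates count, so accumulating with += multiplies the input count by the number of chunks and sums the cumulative output totals.

```python
# Fabricated usage_metadata chunks: constant prompt count,
# cumulative (running-total) candidates count per chunk.
chunks = [
    {"prompt_token_count": 9, "candidates_token_count": 500},
    {"prompt_token_count": 9, "candidates_token_count": 1000},
    {"prompt_token_count": 9, "candidates_token_count": 1414},
]

# Buggy accumulation, mirroring the += in _maybe_update_token_counts:
input_tokens = 0
output_tokens = 0
for chunk in chunks:
    input_tokens += chunk["prompt_token_count"]
    output_tokens += chunk["candidates_token_count"]

print(input_tokens)   # 27   -> 3x the true 9 (factor equals the number of chunks)
print(output_tokens)  # 2914 -> sums the cumulative totals instead of taking the last
```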
Steps to Reproduce
# Requirements:
# pip install google-genai opentelemetry-instrumentation-google-genai
import os
from google import genai
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import SimpleSpanProcessor
from opentelemetry.sdk.trace.export.in_memory_span_exporter import InMemorySpanExporter
from opentelemetry.instrumentation.google_genai import GoogleGenAiSdkInstrumentor
# Set up OpenTelemetry tracing
exporter = InMemorySpanExporter()
provider = TracerProvider()
provider.add_span_processor(SimpleSpanProcessor(exporter))
trace.set_tracer_provider(provider)
# Instrument Google GenAI (this is the buggy code)
GoogleGenAiSdkInstrumentor().instrument()
client = genai.Client(api_key=os.environ.get("GEMINI_API_KEY"))
def test_model(model_name):
    print(f"\n{'='*60}")
    print(f"Model: {model_name}")
    print('='*60)
    exporter.clear()
    response = client.models.generate_content_stream(
        model=model_name,
        contents="Write a detailed essay about programming history."
    )
    # Consume the stream and track correct token counts
    correct_input_tokens = []
    correct_output_tokens = []
    for chunk in response:
        if hasattr(chunk, "usage_metadata") and chunk.usage_metadata:
            inp = getattr(chunk.usage_metadata, "prompt_token_count", None)
            out = getattr(chunk.usage_metadata, "candidates_token_count", None)
            if inp is not None:
                correct_input_tokens.append(inp)
            if out is not None:
                correct_output_tokens.append(out)
    # Get instrumented token counts from spans
    spans = exporter.get_finished_spans()
    for span in spans:
        if span.attributes:
            input_tokens = span.attributes.get("gen_ai.usage.input_tokens")
            output_tokens = span.attributes.get("gen_ai.usage.output_tokens")
            if input_tokens or output_tokens:
                print("Instrumented span reports:")
                print(f"  gen_ai.usage.input_tokens: {input_tokens}")
                print(f"  gen_ai.usage.output_tokens: {output_tokens}")
                if correct_input_tokens:
                    print(f"\nCorrect input tokens: {correct_input_tokens[-1]}")
                    if input_tokens and correct_input_tokens[-1]:
                        print(f"Input overcounting factor: {input_tokens / correct_input_tokens[-1]:.1f}x")
                if correct_output_tokens:
                    print(f"Correct output tokens: {correct_output_tokens[-1]}")
                    if output_tokens and correct_output_tokens[-1]:
                        print(f"Output overcounting factor: {output_tokens / correct_output_tokens[-1]:.1f}x")
test_model("gemini-2.0-flash")
test_model("gemini-3-pro-preview")
# Now patch the buggy code and re-test
print(f"\n{'='*60}")
print("Applying patch: changing += to =")
print('='*60)
from opentelemetry.instrumentation.google_genai import generate_content
helper_class = getattr(generate_content, "_GenerateContentInstrumentationHelper", None)
_get_response_property = getattr(generate_content, "_get_response_property", None)
def patched_maybe_update_token_counts(self, response):
    input_tokens = _get_response_property(response, "usage_metadata.prompt_token_count")
    output_tokens = _get_response_property(response, "usage_metadata.candidates_token_count")
    # FIX: Use = instead of +=
    if input_tokens and isinstance(input_tokens, int):
        self._input_tokens = input_tokens
    if output_tokens and isinstance(output_tokens, int):
        self._output_tokens = output_tokens
helper_class._maybe_update_token_counts = patched_maybe_update_token_counts
test_model("gemini-3-pro-preview")
Output (note: token counts will vary on each run):
============================================================
Model: gemini-2.0-flash
============================================================
Instrumented span reports:
gen_ai.usage.input_tokens: 242
gen_ai.usage.output_tokens: 1409
Correct input tokens: 8
Input overcounting factor: 30.2x
Correct output tokens: 1409
Output overcounting factor: 1.0x
============================================================
Model: gemini-3-pro-preview
============================================================
Instrumented span reports:
gen_ai.usage.input_tokens: 522
gen_ai.usage.output_tokens: 42447
Correct input tokens: 9
Input overcounting factor: 58.0x
Correct output tokens: 1414
Output overcounting factor: 30.0x
============================================================
Applying patch: changing += to =
============================================================
============================================================
Model: gemini-3-pro-preview
============================================================
Instrumented span reports:
gen_ai.usage.input_tokens: 9
gen_ai.usage.output_tokens: 1767
Correct input tokens: 9
Input overcounting factor: 1.0x
Correct output tokens: 1767
Output overcounting factor: 1.0x
Expected Result
Token counts should match the correct values from the API:
- Input tokens: ~9 (for the test prompt)
- Output tokens: ~1,400-1,800 (varies by run)
Actual Result
- gemini-2.0-flash: Input tokens overcounted ~30x (242 vs 8), output tokens correct
- gemini-3-pro-preview: Input tokens overcounted ~58x (522 vs 9), output tokens overcounted ~30x (42,447 vs 1,414)
After applying the patch (changing += to =), both input and output token counts are correct (1.0x).
Additional context
The bug is in generate_content.py:
def _maybe_update_token_counts(self, response: GenerateContentResponse):
    # ...
    if input_tokens and isinstance(input_tokens, int):
        self._input_tokens += input_tokens  # BUG: should be =
    if output_tokens and isinstance(output_tokens, int):
        self._output_tokens += output_tokens  # BUG: should be =
The fix is to use assignment (=) instead of accumulation (+=) since:
- Input tokens are reported as the same constant in every chunk
- Output tokens are reported as cumulative totals
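A minimal standalone sketch of why plain assignment is correct (the chunk values are fabricated for illustration, not real API output): because each chunk's usage_metadata already carries the totals so far, keeping only the most recently reported value converges to the true final counts.

```python
# Fabricated usage_metadata chunks: constant prompt count,
# cumulative candidates count per chunk.
chunks = [
    {"prompt_token_count": 9, "candidates_token_count": 500},
    {"prompt_token_count": 9, "candidates_token_count": 1000},
    {"prompt_token_count": 9, "candidates_token_count": 1414},
]

input_tokens = 0
output_tokens = 0
for chunk in chunks:
    # Plain assignment: the last chunk's values win, matching the
    # final usage_metadata for the whole response.
    input_tokens = chunk["prompt_token_count"]
    output_tokens = chunk["candidates_token_count"]

print(input_tokens, output_tokens)  # 9 1414
```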
Would you like to implement a fix?
None