
[google-genai] Streaming token counts massively overcounted due to incorrect accumulation #4120

@mtakikawa

Description


Describe your environment

OS: Linux (Ubuntu 24.04.3 LTS)
Python version: 3.10
Package version: opentelemetry-instrumentation-google-genai 0.5b0

What happened?

The Google GenAI instrumentation reports token counts that can be 50x or more above actual usage when using streaming responses. The cause is that _maybe_update_token_counts uses += to accumulate token counts from each streaming chunk, but:

  1. Input tokens: All models report the same constant value (e.g., 9) in every chunk. Using += sums this across all chunks, causing massive overcounting.

  2. Output tokens: Some models like gemini-3-pro-preview report cumulative counts in each chunk (not delta values). Using += sums all these cumulative values, causing massive overcounting.

| Model | Input overcounting | Output overcounting |
| --- | --- | --- |
| gemini-2.0-flash | ~30x | 1.0x (unaffected) |
| gemini-3-pro-preview | ~58x | ~30x |

The overcounting factor varies based on the number of streaming chunks in the response.
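The two failure modes above can be sketched with a few lines of plain Python. The chunk values below are hypothetical (five chunks, a constant prompt count of 9, and cumulative output counts ending at 1414), chosen only to illustrate how += inflates both numbers while keeping the latest value does not:

```python
# Simulated usage_metadata from 5 streaming chunks (hypothetical values):
# every chunk repeats the same prompt_token_count, and
# candidates_token_count is a running cumulative total.
chunks = [
    {"prompt_token_count": 9, "candidates_token_count": c}
    for c in (300, 600, 900, 1200, 1414)
]

# Buggy accumulation (what the instrumentation does today):
buggy_in = buggy_out = 0
for ch in chunks:
    buggy_in += ch["prompt_token_count"]
    buggy_out += ch["candidates_token_count"]

# Correct handling: keep only the most recently reported value.
fixed_in = fixed_out = 0
for ch in chunks:
    fixed_in = ch["prompt_token_count"]
    fixed_out = ch["candidates_token_count"]

print(buggy_in, buggy_out)  # 45 4414 -> 5x input, ~3.1x output overcount
print(fixed_in, fixed_out)  # 9 1414  -> matches the API's final totals
```

The overcount scales with the number of chunks, which is why longer responses (more chunks) show larger factors.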

Steps to Reproduce

# Requirements:
#   pip install google-genai opentelemetry-instrumentation-google-genai

import os
from google import genai
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import SimpleSpanProcessor
from opentelemetry.sdk.trace.export.in_memory_span_exporter import InMemorySpanExporter
from opentelemetry.instrumentation.google_genai import GoogleGenAiSdkInstrumentor

# Set up OpenTelemetry tracing
exporter = InMemorySpanExporter()
provider = TracerProvider()
provider.add_span_processor(SimpleSpanProcessor(exporter))
trace.set_tracer_provider(provider)

# Instrument Google GenAI (this is the buggy code)
GoogleGenAiSdkInstrumentor().instrument()

client = genai.Client(api_key=os.environ.get("GEMINI_API_KEY"))

def test_model(model_name):
    print(f"\n{'='*60}")
    print(f"Model: {model_name}")
    print('='*60)

    exporter.clear()

    response = client.models.generate_content_stream(
        model=model_name,
        contents="Write a detailed essay about programming history."
    )

    # Consume the stream and track correct token counts
    correct_input_tokens = []
    correct_output_tokens = []
    for chunk in response:
        if hasattr(chunk, "usage_metadata") and chunk.usage_metadata:
            inp = getattr(chunk.usage_metadata, "prompt_token_count", None)
            out = getattr(chunk.usage_metadata, "candidates_token_count", None)
            if inp is not None:
                correct_input_tokens.append(inp)
            if out is not None:
                correct_output_tokens.append(out)

    # Get instrumented token counts from spans
    spans = exporter.get_finished_spans()
    input_tokens = output_tokens = None  # avoid NameError if no span matches
    for span in spans:
        if span.attributes:
            input_tokens = span.attributes.get("gen_ai.usage.input_tokens")
            output_tokens = span.attributes.get("gen_ai.usage.output_tokens")
            if input_tokens or output_tokens:
                print("Instrumented span reports:")
                print(f"  gen_ai.usage.input_tokens: {input_tokens}")
                print(f"  gen_ai.usage.output_tokens: {output_tokens}")

    if correct_input_tokens:
        print(f"\nCorrect input tokens: {correct_input_tokens[-1]}")
        if input_tokens and correct_input_tokens[-1]:
            print(f"Input overcounting factor: {input_tokens / correct_input_tokens[-1]:.1f}x")

    if correct_output_tokens:
        print(f"Correct output tokens: {correct_output_tokens[-1]}")
        if output_tokens and correct_output_tokens[-1]:
            print(f"Output overcounting factor: {output_tokens / correct_output_tokens[-1]:.1f}x")

test_model("gemini-2.0-flash")
test_model("gemini-3-pro-preview")

# Now patch the buggy code and re-test
print(f"\n{'='*60}")
print("Applying patch: changing += to =")
print('='*60)

from opentelemetry.instrumentation.google_genai import generate_content

helper_class = getattr(generate_content, "_GenerateContentInstrumentationHelper", None)
_get_response_property = getattr(generate_content, "_get_response_property", None)

def patched_maybe_update_token_counts(self, response):
    input_tokens = _get_response_property(response, "usage_metadata.prompt_token_count")
    output_tokens = _get_response_property(response, "usage_metadata.candidates_token_count")
    # FIX: Use = instead of +=
    if input_tokens and isinstance(input_tokens, int):
        self._input_tokens = input_tokens
    if output_tokens and isinstance(output_tokens, int):
        self._output_tokens = output_tokens

helper_class._maybe_update_token_counts = patched_maybe_update_token_counts

test_model("gemini-3-pro-preview")

Output (note: token counts will vary on each run):

============================================================
Model: gemini-2.0-flash
============================================================
Instrumented span reports:
  gen_ai.usage.input_tokens: 242
  gen_ai.usage.output_tokens: 1409

Correct input tokens: 8
Input overcounting factor: 30.2x
Correct output tokens: 1409
Output overcounting factor: 1.0x

============================================================
Model: gemini-3-pro-preview
============================================================
Instrumented span reports:
  gen_ai.usage.input_tokens: 522
  gen_ai.usage.output_tokens: 42447

Correct input tokens: 9
Input overcounting factor: 58.0x
Correct output tokens: 1414
Output overcounting factor: 30.0x

============================================================
Applying patch: changing += to =
============================================================

============================================================
Model: gemini-3-pro-preview
============================================================
Instrumented span reports:
  gen_ai.usage.input_tokens: 9
  gen_ai.usage.output_tokens: 1767

Correct input tokens: 9
Input overcounting factor: 1.0x
Correct output tokens: 1767
Output overcounting factor: 1.0x

Expected Result

Token counts should match the correct values from the API:

  • Input tokens: ~9 (for the test prompt)
  • Output tokens: ~1,400-1,800 (varies by run)

Actual Result

  • gemini-2.0-flash: Input tokens overcounted ~30x (242 vs 8), output tokens correct
  • gemini-3-pro-preview: Input tokens overcounted ~58x (522 vs 9), output tokens overcounted ~30x (42,447 vs 1,414)

After applying the patch (changing += to =), both input and output token counts are correct (1.0x).

Additional context

The bug is in generate_content.py:

def _maybe_update_token_counts(self, response: GenerateContentResponse):
    # ...
    if input_tokens and isinstance(input_tokens, int):
        self._input_tokens += input_tokens  # BUG: should be =
    if output_tokens and isinstance(output_tokens, int):
        self._output_tokens += output_tokens  # BUG: should be =

The fix is to use assignment (=) instead of accumulation (+=) since:

  • Input tokens are reported as the same constant in every chunk
  • Output tokens are reported as cumulative totals
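One plausible explanation for gemini-2.0-flash's output counts being unaffected (1.0x in the table above) is that only its final chunk carries candidates_token_count, so += happens to sum a single value. The sketch below uses hypothetical chunk values to check that plain assignment, combined with the existing truthiness/isinstance guard, stays correct for that reporting style too:

```python
# Hypothetical flash-style chunks: constant prompt count, output count
# present only in the final chunk (None elsewhere).
flash_chunks = [
    {"prompt_token_count": 8, "candidates_token_count": None},
    {"prompt_token_count": 8, "candidates_token_count": None},
    {"prompt_token_count": 8, "candidates_token_count": 1409},
]

in_tokens = out_tokens = 0
for ch in flash_chunks:
    inp = ch["prompt_token_count"]
    out = ch["candidates_token_count"]
    # Same guards as the instrumentation, but with = instead of +=:
    if inp and isinstance(inp, int):
        in_tokens = inp
    if out and isinstance(out, int):
        out_tokens = out

print(in_tokens, out_tokens)  # 8 1409 -> both correct
```

Because None values are skipped by the guard, assignment keeps the last reported value in both the cumulative-per-chunk and final-chunk-only styles.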

Would you like to implement a fix?

None


Labels: bug