`instrument_openai`, `instrument_openai_agents`, and `instrument_anthropic` currently use a messy mix of legacy custom data formats. None of them use the semantic conventions in https://opentelemetry.io/docs/specs/semconv/gen-ai/gen-ai-spans/. We want to fix all that.
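For concreteness, here's roughly what the target data looks like for a single chat call under those conventions. This is a hand-written sketch based on the semconv spec, not actual Logfire output, and the values are illustrative:

```python
# Span name per semconv: "{gen_ai.operation.name} {gen_ai.request.model}"
span_name = 'chat gpt-4o'
span_attributes = {
    'gen_ai.operation.name': 'chat',
    'gen_ai.provider.name': 'openai',  # successor to the older gen_ai.system
    'gen_ai.request.model': 'gpt-4o',
    'gen_ai.response.model': 'gpt-4o-2024-08-06',
    'gen_ai.usage.input_tokens': 150,
    'gen_ai.usage.output_tokens': 25,
}
```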
Eventually we want to only produce new-style data and not keep any legacy code around. This will require at least one major version bump. Ideally only one, but that might be tricky. It could help to have a branch into which several PRs are made.
Ideally we would be using opentelemetry-python-contrib libraries to reduce the maintenance load and avoid reinventing the wheel, but they seem to move very slowly. Here's the state they're currently in:
- open-telemetry/opentelemetry-python-contrib#4042 ("OpenAI Responses and Compaction APIs") was recently opened and instruments the OpenAI Responses API, which I requested some 8 months ago.
- `opentelemetry-instrumentation-openai-agents-v2` could deal with `instrument_openai_agents`, but I haven't looked at how well it works.
- open-telemetry/opentelemetry-python-contrib#4034 ("feat(anthropic): add Messages.create sync instrumentation") adds basic Anthropic instrumentation; right now that library is just a boilerplate skeleton.
I think simultaneously updating behaviour and switching to these libraries is too difficult, especially as it would require making major contributions to those libraries. I've been waiting for them to catch up (e.g. to instrument the Responses API), but it's been too slow. We should still aim to use them eventually, so we should check what those libraries produce and try to align with them to avoid future breaking changes. If the contrib libraries don't follow semconv, an issue should be opened there. If there's no semconv for some particular bit of data, we should follow whatever the library does.
TODO:
- Use `parse_json_attributes=True` in all calls to `exporter.exported_spans_as_dict` in related tests. This can be a trivial PR on its own that doesn't touch behaviour but will make future PRs cleaner (see the test sketch after this list).
- Migrate attributes of `instrument_openai` for chat completions and `instrument_anthropic`, being done in #1580 ("Update LLM SDK instrumentations for better otel semconv compliance"). Drop `request_data` and `response_data` entirely in favour of `gen_ai.*`.
- Drop `gen_ai.system` in favour of `gen_ai.provider.name` - probably requires some backend and frontend changes too. This actually also applies to other instrumentations, e.g. langchain and litellm. Search the repo.
- Migrate span names to `{gen_ai.operation.name} {gen_ai.request.model}` (with the values filled in) wherever applicable, which would resolve #1482 ("Why does the span name include the message template and not what the template resolves to?"). Where this doesn't really apply, still aim for span names that have low-cardinality values (e.g. agent/model/tool names) filled in rather than templates. This was also complained about in https://techcommunity.microsoft.com/blog/azure-ai-foundry-blog/monitor-openai-agents-sdk-with-application-insights/4393949
- Migrate the OpenAI Responses API (which currently uses a different format and code path) to the same semconv attributes.
- Migrate `instrument_openai_agents` attributes, which use both the chat completions and responses legacy formats in their own way. This would resolve #1585 ("Unrecognized" LLM completion with OpenAI Agents SDK + Streamed Chat Completions).
- Set `gen_ai.operation.name` to one of the following where applicable: `chat`, `embeddings`, `execute_tool`, `invoke_agent`, `generate_content`.
- Drop the `LLM` tag.
- Use a single span for streaming, instead of a second log with the response. Unfortunately this will sometimes lead to odd behaviour when reading the stream is interrupted, but the OTel SIG decided this edge case was acceptable, as noted in open-telemetry/semantic-conventions#1170 (comment) ("GenAI (LLM): how to capture streaming"). See also the note there about implementing `__del__` to minimise such problems, at least when GC happens nicely... (see the streaming sketch after this list).
- Ensure all kinds of token usage are recorded. Currently all details are available in e.g. `response_data.usage`, and some users might be making use of that to e.g. track cached tokens. Outside of that we only have `gen_ai.usage.input_tokens` and `gen_ai.usage.output_tokens`, which is less detailed. See open-telemetry/semantic-conventions#3163 ("genai: define cached tokens attributes"), open-telemetry/semantic-conventions#1959 ("More detailed token usage span attributes and metrics"), and open-telemetry/semantic-conventions#3194 ("Add `usage.reasoning_tokens` to span attributes"). See the token-usage sketch after this list.
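As a rough illustration of the `parse_json_attributes=True` item: attribute values that are stored as JSON strings come back as parsed Python objects, so test assertions can index into them directly. The test name and attribute contents below are just examples:

```python
def test_chat_span(exporter):  # `exporter` is the test exporter fixture
    spans = exporter.exported_spans_as_dict(parse_json_attributes=True)
    # Without parse_json_attributes=True, 'request_data' would be a JSON
    # string and the test would need its own json.loads call.
    assert spans[0]['attributes']['request_data']['messages'][0]['role'] == 'user'
```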
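For the streaming item, a minimal sketch of the single-span approach, assuming the instrumentation wraps the provider's stream object. All names here are invented for illustration:

```python
from opentelemetry.trace import Span


class InstrumentedStream:
    """Wraps a provider stream so one span covers the whole request."""

    def __init__(self, stream, span: Span):
        self._stream = stream
        self._span = span

    def __iter__(self):
        try:
            for chunk in self._stream:
                # accumulate chunks into gen_ai.* response attributes here
                yield chunk
        except Exception as exc:
            self._span.record_exception(exc)
            raise
        finally:
            # End the span whether iteration finished, raised, or was closed.
            self._end_span()

    def _end_span(self):
        if self._span.is_recording():
            self._span.end()

    def __del__(self):
        # Safety net for streams that are never iterated: end the span at GC
        # time so it isn't lost entirely. Only helps when GC happens promptly.
        self._end_span()
```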
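And for the token-usage item, a hedged sketch of pulling the detailed counts out of an OpenAI chat completion's usage object. The detail attribute names below are placeholders based on the linked semconv proposals and are not yet stable:

```python
def token_usage_attributes(usage) -> dict:
    """Map an OpenAI CompletionUsage object to span attributes."""
    attrs = {
        'gen_ai.usage.input_tokens': usage.prompt_tokens,
        'gen_ai.usage.output_tokens': usage.completion_tokens,
    }
    prompt_details = getattr(usage, 'prompt_tokens_details', None)
    if prompt_details and prompt_details.cached_tokens:
        # Placeholder name; see open-telemetry/semantic-conventions#3163.
        attrs['gen_ai.usage.cached_input_tokens'] = prompt_details.cached_tokens
    completion_details = getattr(usage, 'completion_tokens_details', None)
    if completion_details and completion_details.reasoning_tokens:
        # Placeholder name; see open-telemetry/semantic-conventions#3194.
        attrs['gen_ai.usage.reasoning_tokens'] = completion_details.reasoning_tokens
    return attrs
```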