`instrument_openai`, `instrument_openai_agents`, and `instrument_anthropic` currently use a messy mix of legacy custom data formats. None of them use the semantic conventions in https://opentelemetry.io/docs/specs/semconv/gen-ai/gen-ai-spans/. We want to fix all that.
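For concreteness, here's roughly what the target data looks like for a single chat call under those conventions. This is a hand-written sketch based on the semconv spec, not actual Logfire output, and the values are illustrative:

```python
# Span name per semconv: "{gen_ai.operation.name} {gen_ai.request.model}"
span_name = 'chat gpt-4o'
span_attributes = {
    'gen_ai.operation.name': 'chat',
    'gen_ai.provider.name': 'openai',  # successor to the older gen_ai.system
    'gen_ai.request.model': 'gpt-4o',
    'gen_ai.response.model': 'gpt-4o-2024-08-06',
    'gen_ai.usage.input_tokens': 150,
    'gen_ai.usage.output_tokens': 25,
}
```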
Eventually we want to only produce new-style data and not keep any legacy code around. This will require at least one major version bump. Ideally only one, but that might be tricky. It could help to have a branch into which several PRs are made.
Ideally we would be using opentelemetry-python-contrib libraries to reduce the maintenance load and avoid reinventing the wheel, but they seem to move very slowly. Here's the state they're currently in:
- open-telemetry/opentelemetry-python-contrib#4042 ("OpenAI Responses and Compaction APIs") was recently opened and instruments the OpenAI Responses API, which I requested some 8 months ago.
- `opentelemetry-instrumentation-openai-agents-v2` could deal with `instrument_openai_agents`, but I haven't looked at how well it works.
- open-telemetry/opentelemetry-python-contrib#4034 ("feat(anthropic): add Messages.create sync instrumentation") adds basic Anthropic instrumentation; right now that library is just a boilerplate skeleton.
I think simultaneously updating behaviour and switching to these libraries is too difficult, especially as it would require making major contributions to those libraries. I've been waiting for them to catch up (e.g. to instrument the Responses API), but it's been too slow. We should still aim to use them eventually, so we should check what those libraries produce and try to align with them to avoid future breaking changes. If the contrib libraries don't follow semconv, an issue should be opened there. If there's no semconv for some particular bit of data, we should follow whatever the library does.
TODO:
- Use `parse_json_attributes=True` in all calls to `exporter.exported_spans_as_dict` in related tests. This can be a trivial PR on its own that doesn't touch behaviour but will make future PRs cleaner (see the test sketch after this list).
- Migrate attributes of `instrument_openai` for chat completions and `instrument_anthropic`, being done in #1580 ("Update LLM SDK instrumentations for better otel semconv compliance"). Drop `request_data` and `response_data` entirely in favour of `gen_ai.*`.
- Drop `gen_ai.system` in favour of `gen_ai.provider.name` - probably requires some backend and frontend changes too. This actually also applies to other instrumentations, e.g. langchain and litellm. Search the repo.
- Migrate span names to `{gen_ai.operation.name} {gen_ai.request.model}` (with the values filled in) wherever applicable, which would resolve #1482 ("Why does the span name include the message template and not what the template resolves to?"). Where this doesn't really apply, still aim for span names that have low-cardinality values (e.g. agent/model/tool names) filled in rather than templates. This was also complained about in https://techcommunity.microsoft.com/blog/azure-ai-foundry-blog/monitor-openai-agents-sdk-with-application-insights/4393949
- Migrate the OpenAI Responses API (which currently uses a different format and code path) to the same semconv attributes.
- Migrate `instrument_openai_agents` attributes, which use both the chat completions and responses legacy formats in their own way. This would resolve #1585 ("Unrecognized" LLM completion with OpenAI Agents SDK + Streamed Chat Completions).
- Set `gen_ai.operation.name` to one of the following where applicable: `chat`, `embeddings`, `execute_tool`, `invoke_agent`, `generate_content`.
- Drop the `LLM` tag.
- Use a single span for streaming, instead of a second log with the response. Unfortunately this will sometimes lead to odd behaviour when reading the stream is interrupted, but the OTel SIG decided this edge case was acceptable, as noted in open-telemetry/semantic-conventions#1170 (comment) ("GenAI (LLM): how to capture streaming"). See also the note there about implementing `__del__` to minimise such problems, at least when GC happens nicely... (see the streaming sketch after this list).
- Ensure all kinds of token usage are recorded. Currently all details are available in e.g. `response_data.usage`, and some users might be making use of that to e.g. track cached tokens. Outside of that we only have `gen_ai.usage.input_tokens` and `gen_ai.usage.output_tokens`, which is less detailed. See open-telemetry/semantic-conventions#3163 ("genai: define cached tokens attributes"), open-telemetry/semantic-conventions#1959 ("More detailed token usage span attributes and metrics"), and open-telemetry/semantic-conventions#3194 ("Add `usage.reasoning_tokens` to span attributes"). See the token-usage sketch after this list.
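As a rough illustration of the `parse_json_attributes=True` item: attribute values that are stored as JSON strings come back as parsed Python objects, so test assertions can index into them directly. The test name and attribute contents below are just examples:

```python
def test_chat_span(exporter):  # `exporter` is the test exporter fixture
    spans = exporter.exported_spans_as_dict(parse_json_attributes=True)
    # Without parse_json_attributes=True, 'request_data' would be a JSON
    # string and the test would need its own json.loads call.
    assert spans[0]['attributes']['request_data']['messages'][0]['role'] == 'user'
```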
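For the streaming item, a minimal sketch of the single-span approach, assuming the instrumentation wraps the provider's stream object. All names here are invented for illustration:

```python
from opentelemetry.trace import Span


class InstrumentedStream:
    """Wraps a provider stream so one span covers the whole request."""

    def __init__(self, stream, span: Span):
        self._stream = stream
        self._span = span

    def __iter__(self):
        try:
            for chunk in self._stream:
                # accumulate chunks into gen_ai.* response attributes here
                yield chunk
        except Exception as exc:
            self._span.record_exception(exc)
            raise
        finally:
            # End the span whether iteration finished, raised, or was closed.
            self._end_span()

    def _end_span(self):
        if self._span.is_recording():
            self._span.end()

    def __del__(self):
        # Safety net for streams that are never iterated: end the span at GC
        # time so it isn't lost entirely. Only helps when GC happens promptly.
        self._end_span()
```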
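And for the token-usage item, a hedged sketch of pulling the detailed counts out of an OpenAI chat completion's usage object. The detail attribute names below are placeholders based on the linked semconv proposals and are not yet stable:

```python
def token_usage_attributes(usage) -> dict:
    """Map an OpenAI CompletionUsage object to span attributes."""
    attrs = {
        'gen_ai.usage.input_tokens': usage.prompt_tokens,
        'gen_ai.usage.output_tokens': usage.completion_tokens,
    }
    prompt_details = getattr(usage, 'prompt_tokens_details', None)
    if prompt_details and prompt_details.cached_tokens:
        # Placeholder name; see open-telemetry/semantic-conventions#3163.
        attrs['gen_ai.usage.cached_input_tokens'] = prompt_details.cached_tokens
    completion_details = getattr(usage, 'completion_tokens_details', None)
    if completion_details and completion_details.reasoning_tokens:
        # Placeholder name; see open-telemetry/semantic-conventions#3194.
        attrs['gen_ai.usage.reasoning_tokens'] = completion_details.reasoning_tokens
    return attrs
```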