diff --git a/.agents/skills/honeyhive-evaluators/SKILL.md b/.agents/skills/honeyhive-evaluators/SKILL.md new file mode 100644 index 00000000..5fa49689 --- /dev/null +++ b/.agents/skills/honeyhive-evaluators/SKILL.md @@ -0,0 +1,460 @@ +--- +name: honeyhive-evaluators +description: Add HoneyHive evaluators and run experiments against datasets. Use when asked to set up evaluations, write evaluator functions, run experiments with evaluate(), compare prompt versions, or add client-side scoring to an AI application. Covers client-side evaluators, the evaluate() API, multi-step pipeline evaluation, and integration with tracing. +metadata: + author: honeyhive + version: "1.0" +--- + +# HoneyHive Evaluators + +Set up HoneyHive evaluators and run experiments. This skill covers writing client-side evaluator functions, running experiments with `evaluate()`, scoring multi-step pipelines, and understanding how evaluators integrate with tracing. + +## Prerequisites + +- A HoneyHive project (create at https://app.honeyhive.ai/projects) +- A HoneyHive API key (org settings > Copy API Key) +- Python 3.9+ +- `pip install honeyhive` + +Environment variables expected: +- `HH_API_KEY` - HoneyHive API key +- `HH_PROJECT` - HoneyHive project name +- `HH_API_URL` - (optional) Custom API URL for self-hosted/dedicated deployments + +--- + +## Concepts + +### Experiment Structure + +Every experiment combines three independent, decoupled parts: + +``` +Dataset --> Your Function --> Evaluators --> Results +``` + +| Component | What it is | Interface | +|-----------|------------|-----------| +| **Dataset** | Test cases with inputs and expected outputs | List of `{inputs, ground_truth}` dicts, or a `dataset_id` | +| **Function** | Your application logic | `def fn(datapoint)` -> output dict | +| **Evaluators** | Scoring functions that assess outputs | `def eval(outputs, inputs, ground_truth)` -> score | + +These are deliberately decoupled: reuse a dataset across multiple functions, run the same 
function against different datasets, and swap evaluators without changing anything else. + +### Evaluator Types + +| Type | What runs the logic | Best for | +|------|---------------------|----------| +| **Code (client-side)** | Deterministic Python in your env | Format checks, metrics, validation | +| **Code (server-side)** | Python on HoneyHive infra | Consistent eval across all traces | +| **LLM-as-judge** | An LLM model (server-side) | Subjective quality, relevance, tone | +| **Human** | Domain experts (server-side) | Edge cases, compliance, ground truth | +| **Composite** | Aggregation formula (server-side) | Weighted indexes, pass/fail gates | + +This skill focuses on **client-side code evaluators** (the ones you write and pass to `evaluate()`). Server-side evaluators are configured in the HoneyHive UI and run automatically on matching traces without code changes. + +### Client-Side vs Server-Side + +| | Client-side | Server-side | +|---|---|---| +| **Where it runs** | Your environment | HoneyHive infrastructure | +| **When it runs** | During `evaluate()` only | Every matching trace (production + experiments) | +| **Setup** | Define in code, pass to `evaluate()` | Configure once in HoneyHive UI | +| **Data interface** | `(outputs, inputs, ground_truth)` | `event` dict or `{{ }}` templates | +| **Versioning** | Your source control | Built-in version history with rollback | + +You can use both together. Common pattern: client-side for experiment-specific scoring, server-side for baseline checks (toxicity, format, PII) that run on all traces automatically. 
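To make the decoupling concrete, here is a rough sketch of the loop that `evaluate()` performs conceptually. This is illustrative plain Python, not the SDK's actual implementation (which also handles tracing, concurrency, and uploading results):

```python
# Illustrative only: a hand-rolled version of Dataset -> Function -> Evaluators.
# evaluate() does this (plus tracing and result upload) for you.
def run_experiment(dataset, function, evaluators):
    results = []
    for datapoint in dataset:
        outputs = function(datapoint)  # your application logic
        scores = {
            evaluator.__name__: evaluator(
                outputs, datapoint["inputs"], datapoint.get("ground_truth", {})
            )
            for evaluator in evaluators
        }
        results.append({"outputs": outputs, "scores": scores})
    return results

# Any dataset, function, and evaluator that follow the interfaces compose freely:
dataset = [{"inputs": {"text": "hi"}, "ground_truth": {"intent": "general"}}]

def fn(datapoint):
    return {"intent": "general"}

def intent_match(outputs, inputs, ground_truth):
    return 1.0 if outputs.get("intent") == ground_truth.get("intent") else 0.0

results = run_experiment(dataset, fn, [intent_match])
```

Because the three parts only meet at these interfaces, swapping any one of them leaves the other two untouched.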
+ +--- + +## Step 1: Create Your Dataset + +Define test cases with inputs and (optionally) expected outputs: + +```python +dataset = [ + { + "inputs": {"text": "I was charged twice for my subscription."}, + "ground_truth": {"intent": "billing"}, + }, + { + "inputs": {"text": "The export button gives error 500."}, + "ground_truth": {"intent": "technical"}, + }, + { + "inputs": {"text": "I forgot my password and reset email never arrived."}, + "ground_truth": {"intent": "account"}, + }, + { + "inputs": {"text": "Your support team was amazing. Thanks!"}, + "ground_truth": {"intent": "general"}, + }, +] +``` + +You can also reference a managed dataset by ID: + +```python +result = evaluate( + function=my_function, + dataset_id="dataset_id_here", + evaluators=[my_evaluator], + name="my-experiment", +) +``` + +--- + +## Step 2: Write Your Function + +Your function receives a `datapoint` dict and returns an output dict. There are no constraints on what happens inside --- call models, query databases, invoke tools, orchestrate sub-agents. + +```python +from openai import OpenAI + +client = OpenAI() + +def classify_intent(datapoint): + text = datapoint["inputs"]["text"] + response = client.chat.completions.create( + model="gpt-4o-mini", + messages=[{"role": "user", "content": f"""Classify this message into ONE category: +- billing: payment issues, invoices, charges, refunds +- technical: bugs, errors, how to use features +- account: login, password, profile, settings +- general: other questions, feedback + +Reply with ONLY the category name. + +Message: {text} +Category:"""}], + temperature=0, + ) + return {"intent": response.choices[0].message.content.strip().lower()} +``` + +**Key pattern**: `def fn(datapoint)` receives `{"inputs": {...}, "ground_truth": {...}}` and returns a dict. 
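When first wiring up a dataset and evaluators, it can help to start with a deterministic stub that has the same `def fn(datapoint) -> dict` shape but no API dependency. This stub is a hypothetical placeholder, not part of the SDK:

```python
# Hypothetical stub: same (datapoint) -> dict interface as classify_intent,
# but deterministic and offline -- useful for verifying dataset/evaluator
# plumbing before spending API calls.
def classify_intent_stub(datapoint):
    text = datapoint["inputs"]["text"].lower()
    # Crude keyword routing standing in for the LLM call
    if any(word in text for word in ("charge", "invoice", "refund", "subscription")):
        return {"intent": "billing"}
    if any(word in text for word in ("error", "bug", "500")):
        return {"intent": "technical"}
    if any(word in text for word in ("password", "login", "account")):
        return {"intent": "account"}
    return {"intent": "general"}
```

Once the real function works, swap the stub out; the dataset and evaluators stay unchanged.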
+ +--- + +## Step 3: Write Evaluator Functions + +Evaluators receive three arguments and return a score: + +```python +def my_evaluator(outputs, inputs, ground_truth): + """ + Args: + outputs: Return value from your function (dict) + inputs: The inputs dict from the datapoint + ground_truth: The ground_truth dict from the datapoint + + Returns: + A score (number, boolean, or string) + """ + ... +``` + +### Common Evaluator Patterns + +**Exact match:** + +```python +def intent_match(outputs, inputs, ground_truth): + actual = outputs.get("intent", "").lower() + expected = ground_truth.get("intent", "").lower() + return 1.0 if expected in actual else 0.0 +``` + +**Length / format check:** + +```python +def length_check(outputs, inputs, ground_truth): + answer = outputs.get("answer", "") + return 1.0 if len(answer) > 50 else 0.0 +``` + +**Substring containment:** + +```python +def answer_contains_expected(outputs, inputs, ground_truth): + expected = ground_truth.get("answer", "").lower() + actual = str(outputs).lower() + return 1.0 if expected in actual else 0.0 +``` + +**Multi-criteria scoring:** + +```python +def quality_score(outputs, inputs, ground_truth): + answer = outputs.get("answer", "") + score = 0.0 + if len(answer) > 20: + score += 0.25 + if not answer.startswith("I'm sorry"): + score += 0.25 + if ground_truth.get("keyword", "") in answer.lower(): + score += 0.5 + return score +``` + +--- + +## Step 4: Run Experiments with `evaluate()` + +```python +import os +from honeyhive import evaluate + +result = evaluate( + function=classify_intent, + dataset=dataset, + evaluators=[intent_match], + name="intent-classifier-v1", + # project must be passed explicitly (does not auto-read from env): + project=os.getenv("HH_PROJECT", "my-project"), + # api_key is read from HH_API_KEY env var if not provided: + # api_key=os.getenv("HH_API_KEY"), +) + +print(f"Run ID: {result.run_id}") +``` + +### Comparing Two Versions + +Run the same dataset with different functions to 
compare: + +```python +result_v1 = evaluate( + function=classify_vague, + dataset=dataset, + evaluators=[intent_match], + project=os.getenv("HH_PROJECT", "my-project"), + name="intent-vague-prompt", +) + +result_v2 = evaluate( + function=classify_structured, + dataset=dataset, + evaluators=[intent_match], + project=os.getenv("HH_PROJECT", "my-project"), + name="intent-structured-prompt", +) + +print(f"V1 run: {result_v1.run_id}") +print(f"V2 run: {result_v2.run_id}") +``` + +View and compare results in the HoneyHive dashboard under **Experiments**. + +--- + +## Step 5: Built-in Tracing (Automatic) + +When you call `evaluate()`, your function is automatically traced using HoneyHive's OpenTelemetry-based tracing. Every datapoint execution produces a full traced session --- no additional setup required. + +**Important**: Do NOT create your own `HoneyHiveTracer.init()` alongside `evaluate()`. The SDK creates a new tracer per datapoint automatically. A global tracer will conflict and cause traces to land in the wrong session. + +```python +# WRONG - global tracer conflicts with evaluate() +tracer = HoneyHiveTracer.init(...) +@trace(event_type="tool", tracer=tracer) +def my_function(datapoint): + ... + +# CORRECT - let evaluate() manage tracers +@trace(event_type="tool") # No tracer parameter +def my_function(datapoint): + ... +``` + +All tracing primitives work inside your function: +- **Auto-instrumentation**: LLM calls via OpenAI, Anthropic, etc. 
are captured if you have instrumentors configured +- **Custom spans**: Use `@trace` to create spans for any step +- **Enrichment**: Call `enrich_span()` to attach metrics, metadata, or feedback to any span +- **Nested traces**: Multi-agent orchestration is traced with full parent-child relationships + +--- + +## Step 6: Multi-Step Pipeline Evaluation + +For pipelines with multiple steps, combine session-level evaluators (via `evaluate()`) with span-level metrics (via `enrich_span()`): + +```python +import os +from honeyhive import evaluate, trace, enrich_span + +# Session-level evaluator: scores the final pipeline output +def answer_quality(outputs, inputs, ground_truth): + expected = ground_truth.get("answer", "") + return 1.0 if expected.lower() in str(outputs).lower() else 0.0 + +# Span-level metrics: scores individual steps +@trace +def retrieve_docs(query): + docs = search_database(query) + enrich_span(metrics={"num_docs": len(docs), "retrieval_score": 0.85}) + return docs + +@trace +def generate_answer(docs, query): + answer = call_llm(docs, query) + enrich_span(metrics={"answer_length": len(answer)}) + return answer + +# Pipeline function +def rag_pipeline(datapoint): + query = datapoint["inputs"]["query"] + docs = retrieve_docs(query) + return generate_answer(docs, query) + +# Run experiment +result = evaluate( + function=rag_pipeline, + dataset=my_dataset, + evaluators=[answer_quality], + project=os.getenv("HH_PROJECT", "my-project"), + name="rag-eval", +) +``` + +After running, the dashboard shows: +- `answer_quality` scores at the **session level** +- `num_docs`, `retrieval_score`, `answer_length` at individual **span levels** + +### Evaluation Scope + +| Scope | What it evaluates | How | +|-------|-------------------|-----| +| **Session-level** | End-to-end pipeline output | Pass evaluators to `evaluate()` | +| **Span-level** | Individual steps | Call `enrich_span(metrics={...})` inside traced functions | + +--- + +## Step 7: Adding Client-Side Metrics 
to Production Traces + +Outside of experiments, you can add evaluation metrics directly to production traces using `enrich_span()` and `enrich_session()`. This is useful for guardrails, format validation, and real-time scoring. + +```python +import os +from honeyhive import HoneyHiveTracer, trace, enrich_span + +HoneyHiveTracer.init( + api_key=os.getenv("HH_API_KEY"), + project=os.getenv("HH_PROJECT"), +) + +@trace +def generate_response(query): + response = call_llm(query) + + # Compute and attach metrics inline + enrich_span(metrics={ + "response_length": len(response), + "contains_pii": check_pii(response), + "relevance_score": compute_relevance(query, response), + "json_valid": is_valid_json(response), + }) + + return response +``` + +### Metrics Data Types + +| Type | Available Measurements | Use Case | +|------|------------------------|----------| +| Boolean | True/False percentage | Pass/fail checks | +| Number | Sum, Avg, Median, Min, Max, P95, P98, P99 | Scores, latencies | +| String | Filters and group by | Classifications | + +Metrics appear in the HoneyHive dashboard for charting, alerting, and filtering. + +--- + +## Complete Example + +```python +import os +from openai import OpenAI +from honeyhive import evaluate, trace, enrich_span + +client = OpenAI() + +# --- Dataset --- +dataset = [ + { + "inputs": {"text": "I was charged twice for my subscription."}, + "ground_truth": {"intent": "billing"}, + }, + { + "inputs": {"text": "The export button gives error code 500."}, + "ground_truth": {"intent": "technical"}, + }, + { + "inputs": {"text": "I forgot my password and reset email never arrived."}, + "ground_truth": {"intent": "account"}, + }, + { + "inputs": {"text": "Your support team was amazing. 
Thanks!"}, + "ground_truth": {"intent": "general"}, + }, +] + +# --- Function --- +@trace +def classify_intent(datapoint): + text = datapoint["inputs"]["text"] + enrich_span(metadata={"text_length": len(text)}) + + response = client.chat.completions.create( + model="gpt-4o-mini", + messages=[{"role": "user", "content": f"""Classify into ONE category: +- billing, technical, account, general +Reply with ONLY the category name. +Message: {text} +Category:"""}], + temperature=0, + ) + return {"intent": response.choices[0].message.content.strip().lower()} + +# --- Evaluator --- +def intent_match(outputs, inputs, ground_truth): + actual = outputs.get("intent", "").lower() + expected = ground_truth.get("intent", "").lower() + return 1.0 if expected in actual else 0.0 + +# --- Run --- +result = evaluate( + function=classify_intent, + dataset=dataset, + evaluators=[intent_match], + project=os.getenv("HH_PROJECT", "my-project"), + name="intent-classifier-v1", +) + +print(f"Run ID: {result.run_id}") +``` + +--- + +## Best Practices + +1. **Evaluator signature**: Always `(outputs, inputs, ground_truth)` -> score. Return a number (0.0-1.0), boolean, or string. +2. **No global tracer with `evaluate()`**: Let the SDK manage per-datapoint tracers automatically. +3. **Use `@trace` without `tracer=`** inside `evaluate()` functions --- the SDK provides the tracer. +4. **Combine session-level and span-level**: Use `evaluate(evaluators=[...])` for end-to-end scoring and `enrich_span(metrics={...})` for per-step scoring. +5. **Keep evaluators simple and deterministic**: Complex eval logic should be in server-side LLM-as-judge evaluators. +6. **Use consistent metric names** across experiments for meaningful comparisons. +7. **Name experiments descriptively**: `"rag-v2-gpt4o-temperature0.3"` not `"test-1"`. +8. **Use production trace metrics for guardrails**: Attach `enrich_span(metrics={...})` for real-time format validation, PII detection, safety checks. 
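The guardrail metrics in practice 8 are ordinary Python. Here is a minimal sketch of helpers like the `is_valid_json` / `check_pii` names assumed in Step 7 -- hypothetical implementations, shown for shape only (a production PII check would cover far more than this):

```python
import json
import re

def is_valid_json(text: str) -> bool:
    """Pass/fail format check -- boolean metrics chart as a percentage."""
    try:
        json.loads(text)
        return True
    except (ValueError, TypeError):
        return False

def check_pii(text: str) -> bool:
    """Naive PII flag (email addresses only) -- illustrative, not exhaustive."""
    return bool(re.search(r"[\w.+-]+@[\w-]+\.[\w.]+", text))
```

These become dashboard metrics once attached via `enrich_span(metrics={...})` inside a traced function.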
+ +## Troubleshooting + +| Issue | Solution | +|-------|----------| +| Traces landing in wrong session | Remove any global `HoneyHiveTracer.init()` when using `evaluate()` | +| Evaluator not scoring | Check function signature is `(outputs, inputs, ground_truth)` | +| Missing span-level metrics | Ensure `enrich_span()` is called inside a `@trace`-decorated function | +| `evaluate()` hangs | Check network connectivity to HoneyHive API and valid `HH_API_KEY` | +| Server-side evals not running | Server-side evaluators are configured in UI, not passed to `evaluate()` | diff --git a/.agents/skills/honeyhive-tracing/SKILL.md b/.agents/skills/honeyhive-tracing/SKILL.md new file mode 100644 index 00000000..6a36da93 --- /dev/null +++ b/.agents/skills/honeyhive-tracing/SKILL.md @@ -0,0 +1,516 @@ +--- +name: honeyhive-tracing +description: Add HoneyHive tracing to a Python application. Use when asked to instrument an app with HoneyHive, add observability, set up tracing, or integrate HoneyHive's SDK for monitoring AI/LLM calls. Covers tracer initialization, auto-instrumentation, custom spans, enrichment, and deployment patterns. +metadata: + author: honeyhive + version: "1.0" +--- + +# HoneyHive Tracing + +Add HoneyHive's OpenTelemetry-based tracing to a Python application. This skill covers tracer initialization, auto-instrumentation of LLM providers, custom spans, trace enrichment, and production deployment patterns. + +## Prerequisites + +- A HoneyHive project (create at https://app.honeyhive.ai/projects) +- A HoneyHive API key (org settings > Copy API Key) +- Python 3.9+ + +Environment variables expected: +- `HH_API_KEY` - HoneyHive API key +- `HH_PROJECT` - HoneyHive project name +- `HH_API_URL` - (optional) Custom API URL for self-hosted/dedicated deployments + +--- + +## Concepts + +### Data Model + +HoneyHive uses a **wide-event** data model. Every event carries its full context in a single record: inputs, outputs, timing, metrics, metadata, feedback, and errors. 
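As an illustration, a single `model` event rendered as one wide record might look like the following (all values invented; field names follow the event schema described in this section):

```python
# Illustrative wide event (invented values): everything about the operation
# travels in one record, so no joins are needed at query time.
example_event = {
    "event_id": "8f1c2a9e-0000-0000-0000-000000000000",
    "session_id": "3b7d4e1a-0000-0000-0000-000000000000",
    "parent_id": "5c2e9d07-0000-0000-0000-000000000000",  # the parent event's event_id
    "event_type": "model",
    "event_name": "llm_completion",
    "inputs": {"messages": [{"role": "user", "content": "Hello!"}]},
    "outputs": {"content": "Hi there!"},
    "config": {"model": "gpt-4o-mini", "temperature": 0},
    "metadata": {"app_version": "2.1.0"},
    "metrics": {"latency_ms": 412, "total_tokens": 23},
    "feedback": {},
    "error": None,
}
```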
+ +Events form a hierarchical tree: + +``` +session (root) # event_type: session ++-- validate_input # event_type: tool ++-- retrieve_context # event_type: tool ++-- llm_completion # event_type: model ++-- format_response # event_type: chain +``` + +**Session**: The root event. Groups all child events. Can be single-turn (one request) or multi-turn (entire conversation). Equivalent to a "trace" in APM tools. + +**Event**: A discrete operation. Each has an `event_type`: + +| `event_type` | What it represents | Examples | +|--------------|-------------------|----------| +| `model` | An LLM API request | GPT-4 completion, Claude message | +| `tool` | An external service or function call | Vector DB search, API call, database query | +| `chain` | A logical grouping of child events | RAG pipeline, agent workflow | + +**Event Schema** (all types share this): + +| Field | Description | +|-------|-------------| +| `event_id` | Unique identifier (UUID) | +| `session_id` | Groups all events in the same trace | +| `parent_id` | Links child to parent (`null` for root session) | +| `event_type` | `"session"`, `"model"`, `"tool"`, or `"chain"` | +| `event_name` | Human-readable operation name | +| `inputs` / `outputs` | Input/output data | +| `config` | Configuration (model params, prompt template, etc.) | +| `metadata` | Custom key-value pairs | +| `metrics` | Numeric measurements (latency, tokens, cost, eval scores) | +| `feedback` | User ratings, corrections | +| `error` | Error details if failed | + +### Architecture + +HoneyHive is built on OpenTelemetry. The SDK wraps an OTel `TracerProvider` and exports spans via OTLP. Any OTel-compatible instrumentor works. The SDK itself has zero dependencies on AI libraries --- instrumentors are installed separately (BYOI: Bring Your Own Instrumentor). 
+ +--- + +## Step 1: Install Dependencies + +Install the HoneyHive SDK plus the instrumentor for your LLM provider: + +```bash +# Core SDK +pip install honeyhive + +# Provider instrumentors (install only what you use) +pip install openinference-instrumentation-openai # OpenAI +pip install openinference-instrumentation-anthropic # Anthropic +pip install openinference-instrumentation-bedrock # AWS Bedrock +pip install openinference-instrumentation-litellm # LiteLLM +pip install openinference-instrumentation-langchain # LangChain +pip install openinference-instrumentation-llama-index # LlamaIndex +pip install openinference-instrumentation-crewai # CrewAI +pip install openinference-instrumentation-google-adk # Google ADK +``` + +For agent frameworks with native OTel support (e.g., AWS Strands, PydanticAI), check framework-specific docs --- they may not need an OpenInference instrumentor. + +--- + +## Step 2: Initialize the Tracer + +Choose the pattern that matches your use case: + +### Scripts / Notebooks (simplest) + +Initialize once at module level. All traced operations share the same session. + +```python +import os +from honeyhive import HoneyHiveTracer, trace + +tracer = HoneyHiveTracer.init( + api_key=os.getenv("HH_API_KEY"), + project=os.getenv("HH_PROJECT"), + session_name="my-session", # Optional: human-readable label + source="development", # Optional: label where traces come from + # server_url=os.getenv("HH_API_URL"), # Required for self-hosted/dedicated +) +``` + +### Web Servers (FastAPI / Flask / Django) + +Initialize **one** tracer at startup, create a **new session per request** using `create_session()` (sync) or `acreate_session()` (async). 
+ +```python +import os +from fastapi import FastAPI, Request +from honeyhive import HoneyHiveTracer, trace + +tracer = HoneyHiveTracer.init( + api_key=os.getenv("HH_API_KEY"), + project=os.getenv("HH_PROJECT"), + source="production", +) + +app = FastAPI() + +@app.middleware("http") +async def session_middleware(request: Request, call_next): + session_id = await tracer.acreate_session( + session_name=f"api-{request.url.path}", + inputs={"method": request.method, "path": str(request.url)}, + ) + response = await call_next(request) + tracer.enrich_session(outputs={"status_code": response.status_code}) + if session_id: + response.headers["X-Session-ID"] = session_id + return response +``` + +**Important**: Use `create_session()` / `acreate_session()`, NOT `session_start()` for web servers. The former stores session ID in request-scoped baggage (safe for concurrent requests); the latter stores it on the tracer instance (race condition). + +### Serverless (AWS Lambda) + +Lazy init + per-request sessions. Set `disable_batch=True` to flush spans before the function terminates. + +```python +import os +from typing import Optional +from honeyhive import HoneyHiveTracer, trace + +_tracer: Optional[HoneyHiveTracer] = None + +def get_tracer() -> HoneyHiveTracer: + global _tracer + if _tracer is None: + _tracer = HoneyHiveTracer.init( + api_key=os.getenv("HH_API_KEY"), + project=os.getenv("HH_PROJECT"), + disable_batch=True, + ) + return _tracer + +def lambda_handler(event, context): + tracer = get_tracer() + session_id = tracer.create_session( + session_name=f"lambda-{context.aws_request_id}", + inputs={"event": event}, + ) + result = process_event(event) + tracer.enrich_session(outputs={"result": result}) + return result +``` + +### Evaluation / Experiments + +When running experiments with `evaluate()`, **do NOT** create your own tracer. The SDK creates a new tracer per datapoint automatically. See the `honeyhive-evaluators` skill for details. 
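Across all of these patterns, the init kwargs come from the same environment variables listed in Prerequisites. A small hypothetical convenience helper (not part of the SDK) that resolves them once and fails fast on missing required values:

```python
import os

def tracer_kwargs(source: str = "development") -> dict:
    """Resolve HoneyHiveTracer.init(...) kwargs from the environment.

    Hypothetical helper -- raises on missing required vars rather than
    letting a misconfigured tracer fail later.
    """
    api_key = os.getenv("HH_API_KEY")
    project = os.getenv("HH_PROJECT")
    if not api_key or not project:
        raise RuntimeError("HH_API_KEY and HH_PROJECT must be set")
    kwargs = {"api_key": api_key, "project": project, "source": source}
    if os.getenv("HH_API_URL"):  # only for self-hosted/dedicated deployments
        kwargs["server_url"] = os.environ["HH_API_URL"]
    return kwargs

# tracer = HoneyHiveTracer.init(**tracer_kwargs(source="production"))
```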
+ +--- + +## Step 3: Add Auto-Instrumentation + +After initializing the tracer, register instrumentors for your LLM providers. This captures all LLM calls automatically. + +```python +from openinference.instrumentation.openai import OpenAIInstrumentor + +# Register the instrumentor with the tracer's provider +OpenAIInstrumentor().instrument(tracer_provider=tracer.provider) + +# Now all OpenAI calls are automatically traced +from openai import OpenAI +client = OpenAI() +response = client.chat.completions.create( + model="gpt-4o-mini", + messages=[{"role": "user", "content": "Hello!"}], +) +``` + +Multiple instrumentors can be registered simultaneously: + +```python +from openinference.instrumentation.openai import OpenAIInstrumentor +from openinference.instrumentation.anthropic import AnthropicInstrumentor + +OpenAIInstrumentor().instrument(tracer_provider=tracer.provider) +AnthropicInstrumentor().instrument(tracer_provider=tracer.provider) +``` + +--- + +## Step 4: Add Custom Spans with `@trace` + +Auto-instrumentation captures LLM and vector DB calls. For any other function (preprocessing, postprocessing, business logic), use the `@trace` decorator: + +```python +from honeyhive import trace, enrich_span + +@trace +def process_request(user_id: str, data: dict) -> dict: + """Automatically traced with inputs/outputs captured.""" + enrich_span(metadata={"user_id": user_id}) + result = do_processing(data) + return {"status": "success", "data": result} + +@trace +def nested_workflow(request: dict) -> dict: + """Nested calls create trace hierarchy automatically.""" + validated = validate(request) # Child span + processed = process(validated) # Child span + return save(processed) # Child span +``` + +### Decorator Options + +```python +# Specify event type and name +@trace(event_type="tool", event_name="database_lookup") +def lookup(query: str): + ... 
+ +# Bind to a specific tracer instance (recommended for production) +@trace(event_type="chain", tracer=tracer) +def my_pipeline(input_data): + ... +``` + +### Async Functions + +`@trace` works with both sync and async functions automatically --- no separate decorator needed: + +```python +@trace +async def fetch_data(url: str) -> dict: + async with aiohttp.ClientSession() as session: + async with session.get(url) as response: + return await response.json() +``` + +### Context Managers (for loops, conditionals) + +Use `enrich_span_context()` when decorators don't fit: + +```python +from honeyhive.tracer.processing.context import enrich_span_context + +@trace +def process_batch(items: list) -> list: + results = [] + for i, item in enumerate(items): + with enrich_span_context( + event_name=f"process_item_{i}", + inputs={"item": item}, + ): + result = transform_item(item) + results.append(result) + return results +``` + +--- + +## Step 5: Enrich Traces + +Enrichments add context beyond what auto-instrumentation captures. 
Three levels: + +### Session-Level (applies to all events) + +```python +tracer.enrich_session( + metadata={"tenant_id": "acme_corp", "app_version": "2.1.0"}, + user_properties={"user_id": "user_123", "plan": "premium"}, + config={"model": "gpt-4o", "prompt_version": "v2.3"}, +) +``` + +### Span-Level (inside a `@trace` function) + +```python +from honeyhive import enrich_span + +@trace +def generate_response(query: str): + response = call_llm(query) + enrich_span( + metadata={"query_length": len(query)}, + metrics={"relevance_score": 0.95, "contains_pii": False}, + feedback={"rating": True}, + ) + return response +``` + +### Auto-Instrumented Span Enrichment (without `@trace`) + +Use `using_attributes` from OpenInference to enrich auto-instrumented LLM spans: + +```python +from openinference.instrumentation import using_attributes + +with using_attributes( + user_id="user_12345", + metadata={"feature": "chat_support"}, +): + response = client.chat.completions.create(...) +``` + +### Enrichment Namespaces + +| Namespace | Type | Description | +|-----------|------|-------------| +| `config` | Object | Model params, prompt templates, hyperparameters | +| `feedback` | Object | User ratings, corrections, ground truth | +| `metrics` | Object | Scores, evaluations, numeric measurements | +| `metadata` | Object | Arbitrary key-value pairs (catch-all) | +| `inputs` | Object | Input data | +| `outputs` | Object | Output data | +| `user_properties` | Object | User ID, tier, email, etc. 
| +| `error` | String | Error information (span-level only) | + +### Invocation Patterns + +All of these are equivalent: + +```python +# Simple dict +enrich_span({"user_id": "user_123", "feature": "chat"}) + +# Keyword arguments (go to metadata) +enrich_span(user_id="user_123", feature="chat") + +# Explicit namespaces +enrich_span( + metadata={"user_id": "user_123"}, + metrics={"score": 0.95}, +) + +# Mixed +enrich_span( + metadata={"user_id": "user_123"}, + metrics={"score": 0.95}, + feature="chat", # extra kwargs go to metadata +) +``` + +--- + +## Step 6: Distributed Tracing (Multi-Service) + +### Simple: Session ID Passing + +Pass `session_id` between services so events land in the same session: + +```python +# Service A: get session_id +session_id = tracer.session_id + +# Service B: init tracer with same session_id +tracer_b = HoneyHiveTracer.init( + api_key=os.getenv("HH_API_KEY"), + project=os.getenv("HH_PROJECT"), + session_id=session_id, +) +``` + +### Full: W3C Context Propagation + +For true parent-child relationships across services: + +```python +# Client side: inject context into outgoing headers +from honeyhive.tracer.processing.context import inject_context_into_carrier, enrich_span_context + +with enrich_span_context(event_name="call_remote"): + headers = {"Content-Type": "application/json"} + inject_context_into_carrier(headers, tracer) + response = requests.post(url, json=payload, headers=headers) + +# Server side: extract context from incoming headers +from honeyhive.tracer.processing.context import with_distributed_trace_context + +@app.route("/agent/invoke", methods=["POST"]) +async def invoke_agent(): + with with_distributed_trace_context(dict(request.headers), tracer): + result = await run_agent(...) +``` + +--- + +## Step 7: Multi-Turn Conversations + +For multi-turn conversations in web servers, the first request creates a session and returns the ID. 
Subsequent requests link to that session: + +```python +@app.middleware("http") +async def session_middleware(request: Request, call_next): + existing_session = request.headers.get("X-Session-ID") + if existing_session: + await tracer.acreate_session( + session_id=existing_session, + skip_api_call=True, # Session already exists, just set context + ) + else: + session_id = await tracer.acreate_session( + session_name=f"conversation-{request.url.path}" + ) + request.state.new_session_id = session_id + + response = await call_next(request) + if hasattr(request.state, "new_session_id"): + response.headers["X-Session-ID"] = request.state.new_session_id + return response +``` + +--- + +## Complete Example + +```python +import os +from honeyhive import HoneyHiveTracer, trace, enrich_span +from openinference.instrumentation.openai import OpenAIInstrumentor +from openai import OpenAI + +# 1. Initialize tracer +tracer = HoneyHiveTracer.init( + api_key=os.getenv("HH_API_KEY"), + project=os.getenv("HH_PROJECT"), + session_name="rag-pipeline", + source="development", +) + +# 2. Register instrumentor +OpenAIInstrumentor().instrument(tracer_provider=tracer.provider) + +# 3. Session-level enrichment +tracer.enrich_session( + user_properties={"user_id": "user_123", "plan": "premium"}, + metadata={"app_version": "2.1.0"}, +) + +client = OpenAI() + +# 4. 
Custom spans with enrichment +@trace(event_type="tool") +def retrieve_docs(query: str) -> list: + docs = search_vector_db(query) + enrich_span(metrics={"num_docs": len(docs)}) + return docs + +@trace(event_type="chain") +def rag_pipeline(query: str) -> str: + docs = retrieve_docs(query) + response = client.chat.completions.create( + model="gpt-4o-mini", + messages=[ + {"role": "system", "content": f"Answer using context: {docs}"}, + {"role": "user", "content": query}, + ], + ) + answer = response.choices[0].message.content + enrich_span(metrics={"answer_length": len(answer)}) + return answer + +result = rag_pipeline("How do I build an integration?") +print(result) +``` + +--- + +## Best Practices + +1. **Pass an explicit tracer to `@trace`** in production: `@trace(event_type="tool", tracer=tracer)` +2. **Create sessions per logical unit of work** even with a global tracer +3. **Use `test_mode=True`** for local development without sending data: `HoneyHiveTracer.init(..., test_mode=True)` +4. **Use descriptive span names**: `@trace(event_name="payment_processing_stripe")` not `@trace(event_name="process")` +5. **Avoid over-instrumentation**: Don't create a span per item in a hot loop --- trace the batch +6. **Use consistent key names** across your app for enrichment +7. **Don't include sensitive data** (passwords, API keys, PII) in enrichments +8. **Keep enrichment values under 1KB** per field +9. 
**Use namespaces explicitly**: `metadata=`, `metrics=`, `user_properties=` for clarity + +## Troubleshooting + +| Issue | Solution | +|-------|----------| +| Traces not appearing | Check `HH_API_KEY` and `HH_PROJECT` are set correctly | +| Events in wrong session | Remove global `HoneyHiveTracer.init()` if using `evaluate()` | +| Race conditions in web server | Use `create_session()` not `session_start()` | +| Lambda spans missing | Set `disable_batch=True` on tracer init | +| Instrumentor not capturing | Ensure `tracer_provider=tracer.provider` is passed to `.instrument()` |