refactor(core): migrate LLM providers from Vercel AI SDK to @mariozechner/pi-ai #1205

@christso

Objective

Replace the Vercel AI SDK (@ai-sdk/*, ai, @openrouter/ai-sdk-provider) with @mariozechner/pi-ai as the LLM provider layer for grader/rubric/agentv-provider call sites in packages/core. AgentV already depends on @mariozechner/pi-coding-agent (which sits on top of pi-ai), so this consolidates onto a single LLM stack and removes ~6 SDK packages from the dependency graph.

Background

packages/core/src/evaluation/providers/ai-sdk.ts (559 lines) wraps Vercel AI SDK to expose 5 providers: OpenAIProvider, AzureProvider, OpenRouterProvider, AnthropicProvider, GeminiProvider. All converge on a single generateText() call per invoke() (stateless RPC shape).

pi-ai covers the same provider surface natively:

  • OpenAI, Azure OpenAI (Responses), Anthropic, Google, OpenRouter, plus Vertex/Bedrock/Mistral/Groq/xAI/Cerebras/etc.
  • Dedicated provider files in pi-ai/dist/providers/ including azure-openai-responses.js.
  • Unified complete(model, context) and stream(model, context) APIs.
  • Built-in token usage + cost tracking.
  • Unified reasoning: 'minimal'|'low'|'medium'|'high'|'xhigh' thinking control.

OpenRouter is a first-class supported provider with its own routing config (openRouterRouting).

Call sites in scope

  1. packages/core/src/evaluation/providers/ai-sdk.ts: 5 provider classes
  2. packages/core/src/evaluation/providers/agentv-provider.ts: built-in grader provider
  3. packages/core/src/evaluation/graders/llm-grader.ts: generateText() + filesystem tool() definitions
  4. packages/core/src/evaluation/graders/composite.ts: generateText()
  5. packages/core/src/evaluation/generators/rubric-generator.ts: generateText()
  6. packages/core/src/evaluation/providers/index.ts: registry wiring

Design latitude

  • Keep the existing Provider.invoke(request) -> response contract. Implement provider classes as thin adapters over pi-ai's complete(). Don't refactor call sites to a session-based shape — that's a much larger change and pi-coding-agent's session model is heavier than graders need.
  • Tool definitions move from Zod (ai-sdk's tool()) to TypeBox (pi-ai's Type.Object()). Mechanical port for the small set of filesystem tools in llm-grader.
  • Anthropic thinking budget: today the config takes a numeric budgetTokens; pi-ai exposes a 5-bucket reasoning enum. Map numeric budgets to the closest bucket and document the change.
  • Retry/backoff: ai-sdk.ts lines 520–559 have a custom exponential-backoff loop. Either preserve as a wrapper around complete() or accept pi-ai's defaults.
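
A minimal sketch of the "thin adapter" idea from the first bullet, under stated assumptions. The ProviderRequest/ProviderResponse fields, the CompletionContext/CompletionResult shapes, and the PiAiAdapter name are all hypothetical and simplified. The completion function is injected as a stand-in for pi-ai's complete(model, context), so nothing here should be read as pi-ai's actual type signatures.

```typescript
// Hypothetical, simplified request/response shapes. The real
// ProviderRequest/ProviderResponse in packages/core carry more fields.
interface ProviderRequest {
  systemPrompt?: string;
  question: string;
}

interface ProviderResponse {
  text: string;
  usage?: { input: number; output: number };
}

interface Provider {
  invoke(request: ProviderRequest): Promise<ProviderResponse>;
}

// Assumed stand-ins for the context and result of complete(model, context).
interface CompletionContext {
  systemPrompt?: string;
  messages: { role: "user"; content: string }[];
}

interface CompletionResult {
  text: string;
  usage: { input: number; output: number };
}

// The completion call is injected so the adapter can be exercised with a fake.
type CompleteFn<M> = (model: M, context: CompletionContext) => Promise<CompletionResult>;

class PiAiAdapter<M> implements Provider {
  private readonly model: M;
  private readonly complete: CompleteFn<M>;

  constructor(model: M, complete: CompleteFn<M>) {
    this.model = model;
    this.complete = complete;
  }

  // One stateless completion per invoke(), mirroring the current
  // one-generateText()-per-invoke() shape.
  async invoke(request: ProviderRequest): Promise<ProviderResponse> {
    const result = await this.complete(this.model, {
      systemPrompt: request.systemPrompt,
      messages: [{ role: "user", content: request.question }],
    });
    return { text: result.text, usage: result.usage };
  }
}
```

Injecting the completion function keeps the adapter testable offline and leaves room to put a retry wrapper between the adapter and the library call.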

Spike scope (first PR)

Single-provider PoC to de-risk the migration before scoping the full port:

  • Port OpenAIProvider only to a pi-ai adapter behind the existing Provider interface.
  • Leave the other 4 providers and all consumers unchanged.
  • Run the existing grader-score baselines (scripts/check-grader-scores.ts) against an OpenAI-targeted eval and confirm scores stay within range.
  • Capture findings on: token-usage shape mapping, retry-loop placement, tool-definition port complexity, any Azure/Anthropic-specific gotchas observed while reading pi-ai source.

The spike PR is not intended to remove @ai-sdk/openai from package.json — both libraries co-exist for the duration of the spike.
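
For the retry-loop placement question, one option is a generic wrapper around the completion call. The sketch below is not the loop in ai-sdk.ts: withRetry, RetryOptions, and the parameter names are hypothetical, and it retries on every error, whereas a real port would likely filter on retryable failures only.

```typescript
interface RetryOptions {
  maxRetries: number; // retries allowed after the first attempt
  baseDelayMs: number; // delay before the first retry; doubles each time
  sleep?: (ms: number) => Promise<void>; // injectable for tests
}

async function withRetry<T>(call: () => Promise<T>, options: RetryOptions): Promise<T> {
  const sleep =
    options.sleep ??
    ((ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms)));

  let lastError: unknown;
  for (let attempt = 0; attempt <= options.maxRetries; attempt++) {
    try {
      return await call();
    } catch (error) {
      lastError = error;
      // Out of retries: stop and rethrow the last error below.
      if (attempt === options.maxRetries) break;
      // Exponential backoff: base, 2x base, 4x base, ...
      await sleep(options.baseDelayMs * 2 ** attempt);
    }
  }
  throw lastError;
}
```

A provider adapter could call `withRetry(() => complete(model, context), { maxRetries: 3, baseDelayMs: 500 })`, which keeps the backoff policy in AgentV regardless of what the underlying library does.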

Acceptance signals (full migration)

  • All 5 provider classes in ai-sdk.ts reimplemented over pi-ai.
  • llm-grader, composite, rubric-generator, agentv-provider updated.
  • @ai-sdk/anthropic, @ai-sdk/azure, @ai-sdk/google, @ai-sdk/openai, ai, @openrouter/ai-sdk-provider removed from all package.json files.
  • All grader-score baselines under examples/**/*.grader-scores.yaml pass.
  • At least one live eval per provider (OpenAI, Azure, Anthropic, Google, OpenRouter) produces correct scores[].type, scores in expected range, and non-zero token usage.
  • Anthropic thinking budget config: numeric → bucket mapping documented in skill files and code header.
  • No regressions in bun run test or bun run validate:examples.
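
The numeric-to-bucket mapping for the Anthropic thinking budget could be sketched as a nearest-representative lookup. The five level names come from this issue; the representative token budgets per level and the budgetToReasoning name are assumptions for illustration and would need to be checked against what pi-ai actually sends for each level.

```typescript
type ReasoningLevel = "minimal" | "low" | "medium" | "high" | "xhigh";

// Hypothetical representative budgets per bucket. These numbers are
// placeholders, not values taken from pi-ai.
const REPRESENTATIVE_BUDGETS: ReadonlyArray<readonly [ReasoningLevel, number]> = [
  ["minimal", 1024],
  ["low", 2048],
  ["medium", 8192],
  ["high", 16384],
  ["xhigh", 32768],
];

// Picks the bucket whose representative budget is closest to budgetTokens.
// Ties resolve to the lower bucket because the comparison is strict.
function budgetToReasoning(budgetTokens: number): ReasoningLevel {
  let bestLevel: ReasoningLevel = "minimal";
  let bestDistance = Number.POSITIVE_INFINITY;
  for (const [level, representative] of REPRESENTATIVE_BUDGETS) {
    const distance = Math.abs(budgetTokens - representative);
    if (distance < bestDistance) {
      bestLevel = level;
      bestDistance = distance;
    }
  }
  return bestLevel;
}
```

Keeping the table in one exported constant makes it straightforward to reproduce verbatim in the skill files and the code header.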

Risks / unknowns

  • Token-usage object shape. pi-ai returns {input, output, cost}; ai-sdk surfaces {inputTokens, outputTokens, cachedInputTokens, reasoningTokens}. JSONL output and Studio aggregation may need adjustment if any consumer relies on cached/reasoning fields.
  • Azure Responses API parity. useDeploymentBasedUrls + apiFormat: 'responses' switching needs verification against a real deployment.
  • Anthropic thinking. Going from numeric budget to 5-bucket enum is a lossy API change for anyone setting fine-grained budgets — call out as a behavior change in the PR.
  • Retry semantics. ai-sdk.ts has bespoke backoff; pi-ai's behavior differs. Decide: wrap or replace.
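
The token-usage risk can be made concrete with a small mapping function. This is a sketch based only on the two shapes named in the first bullet. The PiAiUsage and LegacyUsage type names and the toLegacyUsage helper are hypothetical, and the structure of cost is not specified in this issue, so it is typed as unknown and left unmapped.

```typescript
// Shape attributed to pi-ai in this issue. The structure of `cost` is an
// unknown here, so it is not interpreted.
interface PiAiUsage {
  input: number;
  output: number;
  cost?: unknown;
}

// Shape surfaced by the current ai-sdk based providers, per this issue.
interface LegacyUsage {
  inputTokens: number;
  outputTokens: number;
  cachedInputTokens?: number;
  reasoningTokens?: number;
}

// Maps the new shape onto the old one. cachedInputTokens and reasoningTokens
// stay undefined because the source shape described here has no equivalent.
function toLegacyUsage(usage: PiAiUsage): LegacyUsage {
  return {
    inputTokens: usage.input,
    outputTokens: usage.output,
  };
}
```

Any JSONL or Studio consumer that reads cachedInputTokens or reasoningTokens would see undefined after this mapping, which is the adjustment the risk item refers to.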

Non-goals

  • No streaming. Current call sites are non-streaming; don't add streaming as part of this migration.
  • No move to pi-coding-agent's session model — keep grader calls stateless.
  • Not changing the public Provider interface or ProviderRequest/ProviderResponse shapes consumed elsewhere in core.
  • Not adding new providers exposed by pi-ai (Bedrock, Vertex, Mistral, etc.) in this issue — separate work.

Metadata

Labels: core, refactor
Status: Done