refactor(core): migrate LLM providers from Vercel AI SDK to @mariozechner/pi-ai

## Objective

Replace the Vercel AI SDK (`@ai-sdk/*`, `ai`, `@openrouter/ai-sdk-provider`) with `@mariozechner/pi-ai` as the LLM provider layer for the grader, rubric, and agentv-provider call sites in `packages/core`. AgentV already depends on `@mariozechner/pi-coding-agent` (which sits on top of `pi-ai`), so this consolidates onto a single LLM stack and removes the six SDK packages listed under Acceptance signals from the dependency graph.

## Background

`packages/core/src/evaluation/providers/ai-sdk.ts` (559 lines) wraps the Vercel AI SDK to expose 5 providers: `OpenAIProvider`, `AzureProvider`, `OpenRouterProvider`, `AnthropicProvider`, `GeminiProvider`. All of them converge on a single `generateText()` call per `invoke()` (stateless RPC shape).

`pi-ai` covers the same provider surface natively:
- OpenAI, Azure OpenAI (Responses), Anthropic, Google, OpenRouter, plus Vertex/Bedrock/Mistral/Groq/xAI/Cerebras/etc.
- Dedicated provider files in `pi-ai/dist/providers/` including `azure-openai-responses.js`.
- Unified `complete(model, context)` and `stream(model, context)` APIs.
- Built-in token usage + cost tracking.
- Unified `reasoning: 'minimal'|'low'|'medium'|'high'|'xhigh'` thinking control.

OpenRouter is a first-class supported provider with its own routing config (`openRouterRouting`).

## Call sites in scope

1. `packages/core/src/evaluation/providers/ai-sdk.ts` — 5 provider classes
2. `packages/core/src/evaluation/providers/agentv-provider.ts` — built-in grader provider
3. `packages/core/src/evaluation/graders/llm-grader.ts` — `generateText()` + filesystem `tool()` definitions
4. `packages/core/src/evaluation/graders/composite.ts` — `generateText()`
5. `packages/core/src/evaluation/generators/rubric-generator.ts` — `generateText()`
6. `packages/core/src/evaluation/providers/index.ts` — registry wiring

## Design latitude

- Keep the existing `Provider.invoke(request) -> response` contract. Implement provider classes as thin adapters over `pi-ai`'s `complete()`. Don't refactor call sites to a session-based shape — that's a much larger change and `pi-coding-agent`'s session model is heavier than graders need.
- Tool definitions move from Zod (ai-sdk's `tool()`) to TypeBox (pi-ai's `Type.Object()`). Mechanical port for the small set of filesystem tools in `llm-grader`.
- Anthropic thinking budget: today the config takes a numeric `budgetTokens`; pi-ai exposes a 5-bucket `reasoning` enum. Map numeric budgets to the closest bucket and document the change.
- Retry/backoff: `ai-sdk.ts` lines 520–559 have a custom exponential-backoff loop. Either preserve it as a wrapper around `complete()` or accept pi-ai's defaults.
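
To make the numeric-to-bucket mapping for the Anthropic thinking budget concrete, here is a minimal sketch that picks the closest bucket. The bucket names come from the `reasoning` enum above; the representative token values per bucket and the function name `budgetToReasoning` are hypothetical placeholders, not values taken from pi-ai, and would need to be checked against the pi-ai source during the spike.

```typescript
// Bucket names as listed in this issue; pi-ai's own exported type is not assumed here.
type ReasoningLevel = 'minimal' | 'low' | 'medium' | 'high' | 'xhigh';

// HYPOTHETICAL representative budgets (in tokens) for each bucket.
// These numbers are illustrative only and are not taken from pi-ai.
const REPRESENTATIVE_BUDGETS: ReadonlyArray<readonly [ReasoningLevel, number]> = [
  ['minimal', 1_024],
  ['low', 4_096],
  ['medium', 8_192],
  ['high', 16_384],
  ['xhigh', 32_768],
];

/**
 * Maps a numeric `budgetTokens` value to the bucket whose representative
 * budget is numerically closest. Ties resolve to the smaller bucket.
 */
function budgetToReasoning(budgetTokens: number): ReasoningLevel {
  if (!Number.isFinite(budgetTokens) || budgetTokens <= 0) {
    throw new RangeError(`budgetTokens must be a positive number, got ${budgetTokens}`);
  }
  let best: ReasoningLevel = 'minimal';
  let bestDistance = Number.POSITIVE_INFINITY;
  for (const [level, representative] of REPRESENTATIVE_BUDGETS) {
    const distance = Math.abs(representative - budgetTokens);
    // Strict comparison keeps the earlier (smaller) bucket on a tie.
    if (distance < bestDistance) {
      best = level;
      bestDistance = distance;
    }
  }
  return best;
}
```

Keeping the table in one place would make it straightforward to document the mapping in the skill files and the code header, and to adjust it once the real per-bucket budgets are known.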

## Spike scope (first PR)

Single-provider PoC to de-risk the migration before scoping the full port:

- Port **OpenAIProvider only** to a `pi-ai` adapter behind the existing `Provider` interface.
- Leave the other 4 providers and all consumers unchanged.
- Run the existing grader-score baselines (`scripts/check-grader-scores.ts`) against an OpenAI-targeted eval and confirm scores stay within range.
- Capture findings on token-usage shape mapping, retry-loop placement, tool-definition port complexity, and any Azure/Anthropic-specific gotchas observed while reading the pi-ai source.
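
As a rough illustration of the adapter shape the spike is aiming for, the sketch below wraps an injected `complete(model, context)` function behind an `invoke()` contract and maps the usage object into the ai-sdk style field names. Everything here is an assumption for illustration: the `ProviderRequest`/`ProviderResponse` fields, the context shape, the flat `text` result, and the name `createPiAiOpenAIProvider` are hypothetical stand-ins, not AgentV's or pi-ai's actual types. The `complete` function is injected so the sketch runs without the real library.

```typescript
// HYPOTHETICAL stand-ins for AgentV's Provider types; the real shapes live in packages/core.
interface ProviderRequest {
  question: string;
  systemPrompt?: string;
}
interface LegacyTokenUsage {
  inputTokens: number;
  outputTokens: number;
  cachedInputTokens?: number;
  reasoningTokens?: number;
}
interface ProviderResponse {
  text: string;
  tokenUsage: LegacyTokenUsage;
}
interface Provider {
  invoke(request: ProviderRequest): Promise<ProviderResponse>;
}

// ASSUMED, simplified view of a complete(model, context) call. The real pi-ai
// context and result types may differ and should be checked in its source.
interface CompletionContext {
  systemPrompt?: string;
  messages: Array<{ role: 'user'; content: string }>;
}
interface CompletionResult {
  text: string;
  usage: { input: number; output: number };
}
type CompleteFn<M> = (model: M, context: CompletionContext) => Promise<CompletionResult>;

// Maps the {input, output} usage shape onto the ai-sdk style field names.
// Cached and reasoning token counts are left unset because the source shape
// described in this issue does not carry them.
function toLegacyUsage(usage: CompletionResult['usage']): LegacyTokenUsage {
  return { inputTokens: usage.input, outputTokens: usage.output };
}

// Thin, stateless adapter: one complete() call per invoke().
function createPiAiOpenAIProvider<M>(model: M, complete: CompleteFn<M>): Provider {
  return {
    async invoke(request: ProviderRequest): Promise<ProviderResponse> {
      const context: CompletionContext = {
        messages: [{ role: 'user', content: request.question }],
      };
      if (request.systemPrompt !== undefined) {
        context.systemPrompt = request.systemPrompt;
      }
      const result = await complete(model, context);
      return { text: result.text, tokenUsage: toLegacyUsage(result.usage) };
    },
  };
}
```

Injecting `complete` also gives the spike a seam for unit tests: a fake function can stand in for the network call while the grader-score baselines exercise the real one.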

The spike PR is **not** intended to remove `@ai-sdk/openai` from `package.json` — both libraries co-exist for the duration of the spike.

## Acceptance signals (full migration)

- [x] All 5 provider classes in `ai-sdk.ts` reimplemented over `pi-ai`.
- [ ] `llm-grader`, `composite`, `rubric-generator`, `agentv-provider` updated.
- [ ] `@ai-sdk/anthropic`, `@ai-sdk/azure`, `@ai-sdk/google`, `@ai-sdk/openai`, `ai`, `@openrouter/ai-sdk-provider` removed from all `package.json` files.
- [x] All grader-score baselines under `examples/**/*.grader-scores.yaml` pass.
- [ ] At least one live eval per provider (OpenAI, Azure, Anthropic, Google, OpenRouter) produces correct `scores[].type`, scores in expected range, and non-zero token usage.
- [ ] Anthropic thinking budget config: numeric → bucket mapping documented in skill files and code header.
- [ ] No regressions in `bun run test` or `bun run validate:examples`.

## Risks / unknowns

- **Token-usage object shape.** pi-ai returns `{input, output, cost}`; ai-sdk surfaces `{inputTokens, outputTokens, cachedInputTokens, reasoningTokens}`. JSONL output and Studio aggregation may need adjustment if any consumer relies on cached/reasoning fields.
- **Azure Responses API parity.** `useDeploymentBasedUrls` + `apiFormat: 'responses'` switching needs verification against a real deployment.
- **Anthropic thinking.** Going from numeric budget to 5-bucket enum is a **lossy** API change for anyone setting fine-grained budgets — call out as a behavior change in the PR.
- **Retry semantics.** `ai-sdk.ts` has a bespoke backoff loop; pi-ai's retry behavior differs. Decide whether to wrap `complete()` in the existing loop or replace the loop with pi-ai's defaults.
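
If the decision is to wrap, the retry loop could live in a small generic helper around the call rather than inside each provider. The sketch below is one possible shape under stated assumptions: the helper name `withRetry`, the option names, and the doubling schedule are hypothetical and are not copied from the existing loop in `ai-sdk.ts`, whose exact parameters are not reproduced here. It has no jitter, which a production version would likely want.

```typescript
// HYPOTHETICAL option names; not taken from ai-sdk.ts or pi-ai.
interface RetryOptions {
  maxAttempts: number;
  baseDelayMs: number;
  maxDelayMs: number;
  // Decides whether an error is worth retrying (e.g. rate limits, timeouts).
  isRetryable: (error: unknown) => boolean;
  // Injectable for tests; defaults to a real timer.
  sleep?: (ms: number) => Promise<void>;
}

/**
 * Runs `operation`, retrying retryable failures with exponential backoff:
 * baseDelayMs, 2x, 4x, ... capped at maxDelayMs. Rethrows the last error
 * once maxAttempts is reached or the error is not retryable.
 */
async function withRetry<T>(operation: () => Promise<T>, options: RetryOptions): Promise<T> {
  const sleep =
    options.sleep ?? ((ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms)));
  let attempt = 0;
  for (;;) {
    attempt += 1;
    try {
      return await operation();
    } catch (error) {
      if (attempt >= options.maxAttempts || !options.isRetryable(error)) {
        throw error;
      }
      const delayMs = Math.min(options.maxDelayMs, options.baseDelayMs * 2 ** (attempt - 1));
      await sleep(delayMs);
    }
  }
}
```

A provider adapter would then call something like `withRetry(() => complete(model, context), options)`, which keeps retry policy in one place and makes it easy to drop later if pi-ai's defaults turn out to be sufficient.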

## Non-goals

- No streaming. Current call sites are non-streaming; don't add streaming as part of this migration.
- No move to `pi-coding-agent`'s session model — keep grader calls stateless.
- Not changing the public `Provider` interface or `ProviderRequest`/`ProviderResponse` shapes consumed elsewhere in core.
- Not adding new providers exposed by pi-ai (Bedrock, Vertex, Mistral, etc.) in this issue — separate work.
