From 249a04c09a0cf5eefe2a89a3e5f482382a5b8575 Mon Sep 17 00:00:00 2001 From: Gautam Datla Date: Sun, 18 Jan 2026 21:44:16 -0500 Subject: [PATCH] Claude code skills for transformers-api --- .claude/skills/transformers-api/SKILL.md | 159 +++++ .../reference/areas/export-serving.md | 286 +++++++++ .../reference/areas/generation.md | 466 ++++++++++++++ .../reference/areas/inference.md | 286 +++++++++ .../reference/areas/performance.md | 371 ++++++++++++ .../reference/areas/preprocessing.md | 434 +++++++++++++ .../reference/areas/repo-contributing.md | 315 ++++++++++ .../reference/areas/training.md | 353 +++++++++++ .../reference/areas/troubleshooting.md | 343 +++++++++++ .../reference/generated/module_tree.md | 553 +++++++++++++++++ .../reference/generated/public_api.md | 572 ++++++++++++++++++ .../templates/minimal_repro.md | 167 +++++ 12 files changed, 4305 insertions(+) create mode 100644 .claude/skills/transformers-api/SKILL.md create mode 100644 .claude/skills/transformers-api/reference/areas/export-serving.md create mode 100644 .claude/skills/transformers-api/reference/areas/generation.md create mode 100644 .claude/skills/transformers-api/reference/areas/inference.md create mode 100644 .claude/skills/transformers-api/reference/areas/performance.md create mode 100644 .claude/skills/transformers-api/reference/areas/preprocessing.md create mode 100644 .claude/skills/transformers-api/reference/areas/repo-contributing.md create mode 100644 .claude/skills/transformers-api/reference/areas/training.md create mode 100644 .claude/skills/transformers-api/reference/areas/troubleshooting.md create mode 100644 .claude/skills/transformers-api/reference/generated/module_tree.md create mode 100644 .claude/skills/transformers-api/reference/generated/public_api.md create mode 100644 .claude/skills/transformers-api/templates/minimal_repro.md diff --git a/.claude/skills/transformers-api/SKILL.md b/.claude/skills/transformers-api/SKILL.md new file mode 100644 index 
000000000000..24c7d70db219 --- /dev/null +++ b/.claude/skills/transformers-api/SKILL.md @@ -0,0 +1,159 @@ +--- +name: transformers-api +description: Guides coding and debugging in the Hugging Face Transformers repo. Use when questions involve transformers APIs (pipeline, AutoModel*, AutoTokenizer, Trainer, generate), repo navigation (“where is X implemented?”), performance/quantization, export/serving, or stack traces referencing transformers/ or src/transformers/. +--- + +# Transformers API Navigator (Claude Code) + +## Purpose +This Skill is an **operating playbook** for working with the `huggingface/transformers` codebase and answering Transformers API questions **without guessing**. + +It optimizes for: +- correct API choice (pipeline vs Auto* vs Trainer vs export/perf) +- fast debugging (minimal repro-first) +- accurate repo navigation (“where is X implemented?”) +- small, testable changes when modifying the repo + +This file is intentionally **high-level**. Detailed breakdowns live in individual markdown files under `reference/areas/*`. + +--- + +## When to activate +Activate this Skill if **any** of the following are true: +- The user mentions Transformers or `transformers` APIs (`pipeline`, `AutoModel*`, `AutoTokenizer`, `Trainer`, `generate`, etc.) +- They reference Transformers artifacts (`config.json`, `tokenizer.json`, `generation_config.json`, `model.safetensors`, etc.) +- They show code importing `transformers` or stack traces mentioning `transformers/` or `src/transformers/` +- They need a Transformers-specific decision (inference vs training, generation knobs, perf/quantization, export/serving) +- They ask repo questions: “where is X implemented?”, “which file owns Y?” + +Do **not** activate if the request is mostly: +- Hub/Datasets usage with no `transformers` callsite, or +- **tokenizers library internals** (the separate tokenizers repo / Rust internals) with no Transformers usage. 
+ +Do activate if it’s **Transformers usage of tokenizers/processors** (route to Preprocessing). + +--- + +## Reference entry points + +### Buckets (open exactly ONE first) +- Inference → `reference/areas/inference.md` +- Preprocessing → `reference/areas/preprocessing.md` +- Generation → `reference/areas/generation.md` +- Training / Evaluation → `reference/areas/training.md` +- Performance / Memory / Quantization → `reference/areas/performance.md` +- Export / Serving → `reference/areas/export-serving.md` +- Repo navigation / Contributing → `reference/areas/repo-contributing.md` +- Debugging / Troubleshooting → `reference/areas/troubleshooting.md` + +### Verification (“don’t hallucinate”) +- Symbol/arg exists → `reference/generated/public_api.md` +- Where implemented → `reference/generated/module_tree.md` + +Full repo structure is captured in: `reference/generated/module_tree.md` + +### Debug template +- Minimal repro form → `templates/minimal_repro.md` + +--- + +## Exact sequential process (always follow this order) + +### Step 1 — Classify the request (pick ONE bucket) +- **Inference** (pipelines, Auto* inference) +- **Preprocessing** (tokenizers / processors) +- **Generation** (generate/decoding/chat/streaming) +- **Training / Evaluation** (Trainer, arguments, callbacks) +- **Performance / Memory / Quantization** +- **Export / Serving** +- **Repo navigation / Contributing** +- **Debugging / Troubleshooting** + +### Step 2 — Ask only what’s missing (0–5 questions, only if ambiguous) +Ask only the minimum to proceed: +1) Goal/outcome in one sentence (only if unclear) +2) Modality/task (Text / Vision / Audio / Video / Multimodal) (only if relevant) +3) Model id or local path (and revision/commit if pinned) (if loading/inference/training is involved) +4) Environment: `transformers` version + backend (PyTorch/TF/JAX) + device (CPU/CUDA/MPS) (+ rough VRAM/RAM if perf matters) +5) If blocked: full stack trace + minimal repro snippet (use `templates/minimal_repro.md`) + 
+### Step 3 — Route first (deterministic router embedded here) +Follow this router and open **exactly one** bucket file from the list above. + +#### Routing rules +- If the user is blocked by an exception/traceback, regression, or wrong output → open **Troubleshooting** first + **unless** it is clearly a `Trainer`/training-loop failure → open **Training** first. +- If multiple buckets match, prioritize the user’s **desired outcome** over the first keyword seen. +- If still tied, use this fixed priority order: + **Troubleshooting > Training > Generation > Inference > Preprocessing > Performance > Export/Serving > Repo/Contributing** + +#### Routing table (open exactly ONE file first) + +| User intent / signal | Open this first | Common keywords / symptoms | +|---|---|---| +| Run inference / predict / use a model quickly | `reference/areas/inference.md` | `pipeline`, `AutoModelFor*`, `from_pretrained`, logits, predict, embeddings, classification, ASR/VQA/etc. | +| Preprocessing / inputs formatting (text/vision/audio/video) | `reference/areas/preprocessing.md` | `AutoTokenizer`, `AutoProcessor`, `AutoImageProcessor`, `AutoVideoProcessor`, (audio) `FeatureExtractor`, padding, truncation, transforms, normalization, resizing, sampling rate | +| Text generation / chat behavior | `reference/areas/generation.md` | `generate`, decoding, `max_new_tokens`, sampling, beams, stop tokens, streaming, chat templates | +| Fine-tuning / training / evaluation | `reference/areas/training.md` | `Trainer`, `TrainingArguments`, `train`, `evaluate`, metrics, collators, checkpoints, distributed, FSDP/DeepSpeed/Accelerate | +| Performance / memory / quantization | `reference/areas/performance.md` | VRAM/OOM, `device_map`, `torch_dtype`, fp16/bf16, attention backends, 8-bit/4-bit, bitsandbytes/GPTQ/AWQ | +| Export / serving / deployment | `reference/areas/export-serving.md` | ONNX/export, serving, batching, vLLM/TGI/SGLang, `transformers serve` (moderate-load/experimental), `transformers 
chat` | +| Repo navigation / contributing / “where is X implemented?” | `reference/areas/repo-contributing.md` | “where is”, “which file”, “implementation”, `src/transformers`, tests, docs, PR, add model | +| Errors, crashes, regressions, wrong outputs | `reference/areas/troubleshooting.md` | traceback, exception, mismatch, device/dtype errors, missing files, unexpected output | + +#### Verification shortcuts +Use these only when uncertain about an API/arg/behavior, or when locating code/docs: +- **Does a symbol/arg exist?** → `reference/generated/public_api.md` +- **Where is it implemented?** → `reference/generated/module_tree.md` + +#### Fallback (if nothing matches) +- Open `reference/generated/public_api.md` to identify the closest public surface area. +- Then route to the nearest bucket in the table above and continue. + +### Step 4 — If blocked by an error: reproduce/triage first +If the user cannot proceed due to an exception or incorrect outputs: +- prioritize minimal repro + full stack trace + versions +- classify the failure: **loading** vs **preprocessing** vs **forward/generate** vs **Trainer** vs **integration** +- apply a targeted fix + propose the smallest next diagnostic step + +### Step 5 — Verify only when uncertain (never guess) +Only consult verification sources when you are unsure about a symbol/arg/behavior/default, or when locating an implementation. + +Verification order: +1) `reference/generated/public_api.md` : confirms what is publicly exposed (what exists) +2) `reference/generated/module_tree.md` : finds where it lives in `src/transformers/` (where it’s implemented) +3) Fallback if needed: inspect `src/transformers/`, `docs/source/`, and/or repo search + +If `reference/generated/*` looks missing or stale, **regenerate/update it before relying on it**. +If you cannot verify, say so and point to the most likely file/module to inspect next. 
+ +### Step 6 — Respond using the output contract +Every answer must include: +- **Steps** (numbered) +- **Minimal runnable snippet** (copy/paste) +- **Pitfalls & fixes** (“If X → do Y”) +- **What to change** (3–8 knobs likely to matter) + +If the user is changing repo code, also include: +- exact file paths to edit +- tests to run (smallest relevant set) + +--- + +## Repo anchors (use when needed) +- Core library: `src/transformers/` +- Tests: `tests/` +- Docs source: `docs/source/` (commonly `docs/source/en/`) +- Examples: `examples/` + +When asked “where is X implemented?”: +- use `reference/generated/module_tree.md` first +- then point to exact file paths under `src/transformers/` +- include 1–3 search keywords the user can grep for + +--- + +## Guardrails (non-negotiable) +- Do not invent APIs/args/behavior. Verify if uncertain. +- Do not propose large refactors when a small targeted change will do. +- Behavior changes should come with a test (or an explicit reason why not). +- Keep Transformers responsibilities separate from Hub/Datasets/Accelerate/PEFT unless the integration point is the blocker. 
\ No newline at end of file diff --git a/.claude/skills/transformers-api/reference/areas/export-serving.md b/.claude/skills/transformers-api/reference/areas/export-serving.md new file mode 100644 index 000000000000..61c8d7135bcf --- /dev/null +++ b/.claude/skills/transformers-api/reference/areas/export-serving.md @@ -0,0 +1,286 @@ +# Export & Serving (deployment, runtimes, CLIs) + +## Contents +- [Scope](#scope) +- [Minimum questions to ask](#minimum-questions-to-ask) +- [Decision guide: serve in Python vs export a portable artifact](#decision-guide-serve-in-python-vs-export-a-portable-artifact) +- [Quickstarts](#quickstarts) + - [1) Local OpenAI-compatible server (`transformers serve`)](#1-local-openai-compatible-server-transformers-serve) + - [2) Sanity-check the server (curl)](#2-sanity-check-the-server-curl) + - [3) Export to ONNX (Optimum CLI)](#3-export-to-onnx-optimum-cli) + - [4) Load + run an ONNX export (ORTModel)](#4-load--run-an-onnx-export-ortmodel) + - [5) Export to ExecuTorch (edge/mobile)](#5-export-to-executorch-edgemobile) + - [6) Export to TorchScript (PyTorch-only; limited)](#6-export-to-torchscript-pytorch-only-limited) +- [Knobs that matter (3–8)](#knobs-that-matter-38) +- [Pitfalls & fixes](#pitfalls--fixes) +- [Verify / locate in repo](#verify--locate-in-repo) + +--- + +## Scope + +Use this page when the user wants to **ship** a Transformers model: +- serve it behind an HTTP API (local dev → deployment) +- export it to another runtime (ONNX / ExecuTorch / TFLite via Optimum; TorchScript via PyTorch/Transformers) +- choose the right “packaging” path given constraints (latency/throughput, hardware, Python vs non-Python) + +--- + +## Minimum questions to ask + +Ask only what you need to pick a path (0–5 questions): +1) **Workload**: encoder inference (cls/embeddings) vs **LLM generation** (chat/completions)? +2) **Target runtime**: must run **outside Python**? must run on **mobile/edge**? OpenAI-compatible API required? 
+3) **Hardware**: CPU / CUDA GPU / MPS / edge accelerator; memory limits +4) **Model id/path + revision** (pin if you care about reproducibility) +5) If blocked: exact error + smallest repro + versions (`transformers`, PyTorch, CUDA, Optimum/runtime) + +--- + +## Decision guide: serve in Python vs export a portable artifact + +### Choose “Serve” when… +- you want a fast integration path for an app +- you can keep Python in the stack +- you want an HTTP boundary (and potentially OpenAI-compatible endpoints) + +Typical choices: +- **`transformers serve`**: quick local server; good for dev/moderate load +- production LLM throughput: consider dedicated serving stacks (outside this repo) that specialize in continuous batching, KV cache, tensor parallel, etc. + +### Choose “Export” when… +- you must run in a non-Python runtime +- you need a portable artifact for inference engines / mobile / embedded + +Typical choices: +- **ONNX** (via Optimum): broad runtime support +- **ExecuTorch** (via Optimum): PyTorch-native edge/mobile packaging +- **TorchScript**: PyTorch-only and can be brittle; best for simpler encoder models +- **TFLite** (via Optimum TF exporters): TensorFlow Lite ecosystems (mobile/edge), often needs fixed shapes + +--- + +## Quickstarts + +### 1. Local OpenAI-compatible server (`transformers serve`) + +Use this for local/dev integration tests. Always check the current flags in your environment: + +```bash +transformers serve --help +``` + +Install serving dependencies: + +```bash +pip install transformers[serving] +``` + +Then start the server: + +```bash +transformers serve +# Optional: force a single model for all requests (avoids per-request model hints) +# transformers serve --force-model "Qwen/Qwen2.5-0.5B-Instruct" +``` + +Notes: +- Treat this as **developer-friendly** serving. For high-QPS production, you’ll usually reach for specialized serving runtimes. + +--- + +### 2. 
Sanity-check the server (curl) + +Chat Completions request (OpenAI-compatible): + +```bash +curl -X POST http://localhost:8000/v1/chat/completions \ + -H "Content-Type: application/json" \ + -d '{"messages":[{"role":"system","content":"hello"}],"temperature":0.9,"max_tokens":1000,"stream":true,"model":"Qwen/Qwen2.5-0.5B-Instruct"}' +``` + +The same server also supports the Responses API: + +```bash +curl http://localhost:8000/v1/responses \ + -H "Content-Type: application/json" \ + -d '{ + "model": "Qwen/Qwen2.5-0.5B-Instruct", + "stream": true, + "input": "Tell me a three sentence bedtime story about a unicorn." + }' +``` + +If requests fail: +- run `transformers serve --help` and confirm host/port/model settings +- confirm the client `model` string matches what the server expects + +--- + +### 3. Export to ONNX (Optimum CLI) + +Install Optimum ONNX tooling: + +```bash +pip install optimum-onnx +``` + +Export a model to ONNX: + +```bash +optimum-cli export onnx \ + --model distilbert/distilbert-base-uncased-distilled-squad \ + distilbert_squad_onnx/ +``` + +Notes: +- If exporting from a local directory, ensure tokenizer/config live alongside weights. +- If task inference is ambiguous, pass `--task` (e.g., `question-answering`, `text-classification`, `text-generation`). + +--- + +### 4. Load + run an ONNX export (ORTModel) + +```python +from transformers import AutoTokenizer +from optimum.onnxruntime import ORTModelForQuestionAnswering + +onnx_dir = "distilbert_squad_onnx" + +tokenizer = AutoTokenizer.from_pretrained(onnx_dir) +model = ORTModelForQuestionAnswering.from_pretrained(onnx_dir) + +inputs = tokenizer( + "What runtime is this?", + "This is ONNX Runtime via Optimum.", + return_tensors="pt", +) +outputs = model(**inputs) + +print(outputs.start_logits.shape, outputs.end_logits.shape) +``` + +Sanity validation tip: +- compare logits on 3–10 fixed inputs between PyTorch and ONNX before shipping + +--- + +### 5. 
Export to ExecuTorch (edge/mobile) + +This is a practical path when you want a PyTorch-native on-device artifact. + +Install ExecuTorch exporter dependencies: + +```bash +git clone https://github.com/huggingface/optimum-executorch.git +cd optimum-executorch +pip install . +``` + +Export (CLI): + +```bash +optimum-cli export executorch \ + --model "HuggingFaceTB/SmolLM2-135M-Instruct" \ + --task "text-generation" \ + --recipe "xnnpack" \ + --output_dir "smollm2_executorch" +``` + +Run (Python wrapper around the exported artifact): + +```python +from transformers import AutoTokenizer +from optimum.executorch import ExecuTorchModelForCausalLM + +# Load the tokenizer from the same checkpoint that was exported +tokenizer = AutoTokenizer.from_pretrained("HuggingFaceTB/SmolLM2-135M-Instruct") +model = ExecuTorchModelForCausalLM.from_pretrained("smollm2_executorch/") + +prompt = "Explain KV cache in one sentence." +print(model.text_generation(tokenizer=tokenizer, prompt=prompt, max_seq_len=64)) +``` + +Validation tip: +- run the same 3–10 prompts on the original model and the exported artifact; compare outputs at the token level where possible (or at least with consistent decoding settings) + +--- + +### 6. Export to TorchScript (PyTorch-only; limited) + +TorchScript is best for simpler, stable encoder-style graphs. Many Transformers models require enabling TorchScript mode so outputs are traceable. 
+ +```python +import torch +from transformers import AutoTokenizer, AutoModelForSequenceClassification + +model_id = "distilbert/distilbert-base-uncased-finetuned-sst-2-english" + +tok = AutoTokenizer.from_pretrained(model_id) +model = AutoModelForSequenceClassification.from_pretrained( + model_id, + torchscript=True, # important for many models +).eval() + +# Dummy inputs for tracing (trace will specialize to these shapes) +ex = tok("hello world", return_tensors="pt") + +with torch.no_grad(): + traced = torch.jit.trace(model, (ex["input_ids"], ex["attention_mask"])) + +traced.save("model_ts.pt") +``` + +Pitfalls: +- `torchscript=True` is required for models with tied weights (typically models with a language-model head). Models without an LM head can be exported without it. +- tracing is shape-sensitive; the trace generally only supports the same input shapes used during tracing (pad/choose a max expected shape). + + +--- + +## Knobs that matter (3–8) + +1) **Serve vs export** + - Need an API quickly → serve + - Need a portable artifact / non-Python runtime → export +2) **Workload** + - LLM generation is sensitive to KV-cache + batching; encoder inference exports more easily +3) **Repro pinning** + - pin model `revision` and record tool/runtime versions +4) **Export “task”** + - pass `--task` when exporting local models or ambiguous checkpoints +5) **Shapes** + - TorchScript and many mobile exports are sensitive to shapes; validate with representative inputs +6) **Runtime choice** + - ONNX Runtime vs other accelerators; for edge/mobile consider ExecuTorch/TFLite +7) **Correctness validation** + - always compare outputs on a small fixed suite before shipping +8) **Performance validation** + - measure latency/throughput on the target hardware (not just dev machine) + +--- + +## Pitfalls & fixes + +- **Server starts but requests fail** + - check `transformers serve --help` for port/model routing + - confirm endpoint path and request JSON match what your server 
expects +- **ONNX export “works” but outputs differ** + - verify tokenizer parity (same files/config), and compare logits first + - ensure you didn’t accidentally change padding/truncation/max_length +- **TorchScript breaks on real inputs** + - tracing used one example shape; real shapes differ → prefer ONNX or constrain shapes +- **Edge export slow** + - ensure you chose an appropriate recipe/backend and validated quantization/perf settings for the device + +--- + +## Verify / locate in repo + +Use Skill verification indexes when uncertain: +- “Does this symbol/arg exist?” → `reference/generated/public_api.md` +- “Where is it implemented?” → `reference/generated/module_tree.md` + +Useful repo grep keywords: +- `transformers serve`, `openai`, `chat/completions`, `responses` +- `export`, `onnx`, `executorch`, `torchscript` +- `pipelines`, `generation`, `cache`, `continuous batching` (if serving overlaps with perf questions) \ No newline at end of file diff --git a/.claude/skills/transformers-api/reference/areas/generation.md b/.claude/skills/transformers-api/reference/areas/generation.md new file mode 100644 index 000000000000..b5f4034f9aef --- /dev/null +++ b/.claude/skills/transformers-api/reference/areas/generation.md @@ -0,0 +1,466 @@ +# Generation (decode, sampling, beams, stopping, streaming, chat) + +## Contents +- [Scope](#scope) +- [Minimum questions to ask](#minimum-questions-to-ask) +- [Always-follow workflow](#always-follow-workflow) +- [Quickstarts](#quickstarts) + - [A. Decoder-only (CausalLM) minimal generation (greedy)](#a-decoder-only-causallm-minimal-generation-greedy) + - [B. Encoder-decoder (Seq2Seq) minimal generation (greedy)](#b-encoder-decoder-seq2seq-minimal-generation-greedy) +- [Output length (do this first)](#output-length-do-this-first) +- [Decoding strategies (choose one)](#decoding-strategies-choose-one) + - [1. Greedy (deterministic baseline)](#1-greedy-deterministic-baseline) + - [2. 
Sampling (creative / diverse)](#2-sampling-creative--diverse) + - [3. Beam search (more exhaustive, more deterministic)](#3-beam-search-more-exhaustive-more-deterministic) + - [4. Diverse candidates (multiple outputs)](#4-diverse-candidates-multiple-outputs) +- [Chat prompting (chat templates)](#chat-prompting-chat-templates) + - [Chat template → generate (decoder-only)](#chat-template--generate-decoder-only) +- [“Decoder-only returns the prompt too” (slice it)](#decoder-only-returns-the-prompt-too-slice-it) +- [Stopping](#stopping) + - [1. EOS-based stopping (default)](#1-eos-based-stopping-default) + - [2. Stop on custom condition (StoppingCriteria)](#2-stop-on-custom-condition-stoppingcriteria) + - [3. Stop on strings (built-in: stop_strings)](#3-stop-on-strings-built-in-stop_strings) +- [Streaming](#streaming) + - [TextIteratorStreamer (common pattern with a background thread)](#textiteratorstreamer-common-pattern-with-a-background-thread) +- [Inspecting generation internals (scores, beams, etc.)](#inspecting-generation-internals-scores-beams-etc) +- [What to change (knobs that matter most)](#what-to-change-knobs-that-matter-most) +- [Pitfalls & fixes (high-frequency)](#pitfalls--fixes-high-frequency) +- [Repo hotspots (when asked “where is this implemented?”)](#repo-hotspots-when-asked-where-is-this-implemented) +- [Verification checklist (anti-hallucination)](#verification-checklist-anti-hallucination) + + +## Scope + +Use this page when the user’s goal is **text generation / chat behavior**: +- `.generate()` decoding strategy (greedy / sampling / beams) +- output length control (`max_new_tokens`, `min_new_tokens`, etc.) 
+- repetition control (`repetition_penalty`, `no_repeat_ngram_size`) +- stopping (EOS, custom stopping criteria) +- streaming (streamers) +- chat templates + generation together + +--- + +## Minimum questions to ask + +Ask only what’s required to produce a runnable snippet: +1) Model type: **decoder-only** (CausalLM) vs **encoder-decoder** (Seq2Seq) +2) Desired behavior: **deterministic** vs **creative** +3) Output constraints: length, stop condition, format (JSON, bullets, etc.) +4) Environment: `transformers` version + backend/device (CPU/CUDA/MPS) +5) If blocked: full traceback + minimal repro + +--- + +## Always-follow workflow + +1) Load model + tokenizer from the same checkpoint. +2) Prepare prompt (raw text or chat template). +3) Put tensors + model on the same device. +4) Choose a decoding strategy (greedy / sampling / beam) and set length via `max_new_tokens`. +5) Generate under `torch.inference_mode()` (PyTorch). +6) Decode, and (for decoder-only models) optionally slice off the prompt tokens. + +--- + +## Quickstarts + +### A. Decoder-only (CausalLM) minimal generation (greedy) +```python +import torch +from transformers import AutoModelForCausalLM, AutoTokenizer + +model_id = "distilbert/distilgpt2" +tok = AutoTokenizer.from_pretrained(model_id) +model = AutoModelForCausalLM.from_pretrained(model_id) +model.eval() + +prompt = "Write a one-sentence summary of Transformers:" +inputs = tok(prompt, return_tensors="pt").to(model.device) + +with torch.inference_mode(): + out = model.generate(**inputs, max_new_tokens=50) + +text = tok.decode(out[0], skip_special_tokens=True) +print(text) +``` + +### B. Encoder-decoder (Seq2Seq) minimal generation (greedy) +```python +import torch +from transformers import AutoModelForSeq2SeqLM, AutoTokenizer + +model_id = "google/flan-t5-small" +tok = AutoTokenizer.from_pretrained(model_id) +model = AutoModelForSeq2SeqLM.from_pretrained(model_id) +model.eval() + +prompt = "Translate to German: The cat is on the table." 
+inputs = tok(prompt, return_tensors="pt").to(model.device) + +with torch.inference_mode(): + out = model.generate(**inputs, max_new_tokens=50) + +print(tok.decode(out[0], skip_special_tokens=True)) +``` + +--- + +## Output length (do this first) + +Prefer `max_new_tokens` over `max_length`. + +- `max_new_tokens`: number of tokens **generated beyond** the prompt (recommended) +- `max_length`: prompt length + generated length (often confusing) + +Also consider: +- `min_new_tokens` (or `min_length` depending on model/version) +- `early_stopping` (beam search behavior) + +--- + +## Decoding strategies (choose one) + +### 1. Greedy (deterministic baseline) +Good for short, factual, structured outputs. Can repeat for long outputs. +```python +out = model.generate(**inputs, max_new_tokens=200, do_sample=False) +``` + +### 2. Sampling (creative / diverse) +Use when you want variation. Typical defaults: +- `do_sample=True` +- `temperature` ~ 0.7–1.0 +- `top_p` ~ 0.9–0.95 (nucleus) +- optionally `top_k` ~ 40–100 + +```python +out = model.generate( + **inputs, + max_new_tokens=200, + do_sample=True, + temperature=0.8, + top_p=0.95, + top_k=50, +) +``` + +### 3. Beam search (more exhaustive, more deterministic) +Useful for translation/summarization; can become repetitive for open-ended chat. + +```python +out = model.generate( + **inputs, + max_new_tokens=200, + num_beams=4, + do_sample=False, + early_stopping=True, +) +``` + +### 4. Diverse candidates (multiple outputs) +```python +out = model.generate( + **inputs, + max_new_tokens=120, + do_sample=True, + temperature=0.9, + top_p=0.95, + num_return_sequences=3, +) +texts = tok.batch_decode(out, skip_special_tokens=True) +for i, t in enumerate(texts, 1): + print(f"\n--- candidate {i} ---\n{t}") +``` + +--- + +## Chat prompting (chat templates) + +If the model expects chat formatting, use `apply_chat_template` (tokenizer or processor). 
+If you’re unsure whether the model is “chat/instruct”, check its docs/model card or your `reference/generated/*`. + +### Chat template → generate (decoder-only) + +```python +import torch +from transformers import AutoModelForCausalLM, AutoTokenizer + +model_id = "meta-llama/Llama-3.1-8B-Instruct" +tok = AutoTokenizer.from_pretrained(model_id) +model = AutoModelForCausalLM.from_pretrained(model_id) +model.eval() + +messages = [ + {"role": "system", "content": "You are concise."}, + {"role": "user", "content": "Explain beam search in one paragraph."}, +] + +prompt = tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True) +inputs = tok(prompt, return_tensors="pt", add_special_tokens=False).to(model.device) + +with torch.inference_mode(): + out = model.generate(**inputs, max_new_tokens=180, do_sample=False) + +print(tok.decode(out[0], skip_special_tokens=True)) +``` + +--- + +## “Decoder-only returns the prompt too” (slice it) + +For decoder-only LMs, `generate()` returns `[prompt + completion]`. +If you only want the completion tokens: + +```python +with torch.inference_mode(): + out = model.generate(**inputs, max_new_tokens=80) + +prompt_len = inputs["input_ids"].shape[-1] +completion_ids = out[0, prompt_len:] +completion_text = tok.decode(completion_ids, skip_special_tokens=True) +print(completion_text) +``` + +(For encoder-decoder models, the generated sequence is usually just the decoder output.) + +--- + +## Stopping + +### 1. EOS-based stopping (default) +Most models stop when `eos_token_id` is produced (or hit length limits). +If you see “never stops” behavior, verify: +- `eos_token_id` exists and is correct +- you didn’t set an incompatible `min_length` / `min_new_tokens` + +### 2. Stop on custom condition (StoppingCriteria) +Use this when you need “stop when a phrase appears” or other custom termination. 
+ +```python +import torch +from transformers import StoppingCriteria, StoppingCriteriaList + +class StopOnTokenSequence(StoppingCriteria): + def __init__(self, stop_ids: list[int]): + self.stop_ids = torch.tensor(stop_ids, dtype=torch.long) + + def __call__(self, input_ids: torch.LongTensor, scores: torch.FloatTensor, **kwargs): + # Return shape: (batch_size,) — True means “stop for that row” + stop_ids = self.stop_ids.to(input_ids.device) + bsz, seqlen = input_ids.shape + n = stop_ids.numel() + + if seqlen < n: + return torch.zeros((bsz,), dtype=torch.bool, device=input_ids.device) + + tail = input_ids[:, -n:] # (bsz, n) + matched = (tail == stop_ids).all(dim=1) # (bsz,) + return matched + + +stop_text = "\n###" +stop_ids = tok(stop_text, add_special_tokens=False)["input_ids"] +criteria = StoppingCriteriaList([StopOnTokenSequence(stop_ids)]) + +with torch.inference_mode(): + out = model.generate( + **inputs, + max_new_tokens=300, + do_sample=True, + temperature=0.8, + top_p=0.95, + stopping_criteria=criteria, + ) + +print(tok.decode(out[0], skip_special_tokens=True)) +``` + +Notes: +- In batched generation, stopping criteria return a per-sample boolean mask of shape `(batch_size,)`, which generation combines with its internal “unfinished sequences” mask. +- However, generation often keeps tensor shapes fixed (e.g., padding finished rows), so you may not get compute savings unless you re-batch unfinished samples. + + +### 3. 
Stop on strings (built-in: stop_strings) + +If you want to stop when the model outputs a specific string, you can use `stop_strings`: + +```python +import torch +from transformers import AutoModelForCausalLM, AutoTokenizer + +model_id = "distilbert/distilgpt2" +tok = AutoTokenizer.from_pretrained(model_id) +model = AutoModelForCausalLM.from_pretrained(model_id) +model.eval() + +prompt = "Write a short answer, then end with a line containing ###:\n" +inputs = tok(prompt, return_tensors="pt").to(model.device) + +with torch.inference_mode(): + out = model.generate( + **inputs, + max_new_tokens=300, + do_sample=True, + temperature=0.8, + top_p=0.95, + stop_strings=["\n###"], + tokenizer=tok, # required so stop_strings can be matched against decoded text + ) + +text = tok.decode(out[0], skip_special_tokens=True) +print(text) +``` + +Notes: +- `stop_strings` stops generation *after* the stop string is produced. +- Pass `tokenizer=tok` so Transformers can detect the stop string correctly during generation. +- If you need the returned text *without* the stop string, trim it after decoding (e.g., `text.split("\n###")[0]`). + +--- + +## Streaming + +Use streamers when you want token-by-token (or chunk-by-chunk) output. 
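For quick demos where printing to stdout is enough, `TextStreamer` streams in the calling thread, with no background thread needed. A minimal sketch, reusing the same distilgpt2 setup as the other quickstarts on this page:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, TextStreamer

model_id = "distilbert/distilgpt2"
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id).eval()

inputs = tok("Tell me a short story about a robot:", return_tensors="pt").to(model.device)

# Decodes and prints chunks to stdout as tokens are generated
streamer = TextStreamer(tok, skip_prompt=True, skip_special_tokens=True)

with torch.inference_mode():
    out = model.generate(**inputs, max_new_tokens=60, do_sample=False, streamer=streamer)
```

If you need the chunks programmatically (e.g., to forward to a web UI), use `TextIteratorStreamer` instead, as shown in the existing pattern on this page.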
+ +### TextIteratorStreamer (common pattern with a background thread) +```python +import torch +from threading import Thread +from transformers import AutoModelForCausalLM, AutoTokenizer, TextIteratorStreamer + +model_id = "distilbert/distilgpt2" +tok = AutoTokenizer.from_pretrained(model_id) +device = "cuda" if torch.cuda.is_available() else "cpu" +model = AutoModelForCausalLM.from_pretrained(model_id).to(device) +model.eval() + +prompt = "Tell me a short story about a robot:" +inputs = tok(prompt, return_tensors="pt").to(model.device) +streamer = TextIteratorStreamer(tok, skip_special_tokens=True, skip_prompt=True) +generation_kwargs = dict( +    **inputs, +    max_new_tokens=120, +    do_sample=True, +    temperature=0.9, +    top_p=0.95, +    streamer=streamer, +) + +thread = Thread(target=model.generate, kwargs=generation_kwargs) +thread.start() + +for text_chunk in streamer: +    print(text_chunk, end="", flush=True) + +thread.join() +print() +``` + +Pitfall: Some pipelines/deepcopies can conflict with streamer objects; if you hit errors, call `model.generate` directly (like above) rather than wrapping in a pipeline. + +--- + +## Inspecting generation internals (scores, beams, etc.)
+ +If you need token-level probabilities, request structured outputs from `generate()`: + +```python +with torch.inference_mode(): + out = model.generate( + **inputs, + max_new_tokens=50, + do_sample=False, + return_dict_in_generate=True, + output_scores=True, + ) + +# out.sequences: token ids +# out.scores: tuple of per-step logits (one tensor per generated step) +print(type(out)) +print(out.sequences.shape, len(out.scores)) +``` +--- + +## What to change (knobs that matter most) + +Length / termination: +- `max_new_tokens` (primary) +- `min_new_tokens` / `min_length` +- `eos_token_id`, `pad_token_id` +- `stopping_criteria` + +Creativity / diversity: +- `do_sample` +- `temperature` +- `top_p`, `top_k` +- `typical_p` (if supported by your version/model) + +Determinism / search: +- `num_beams` +- `early_stopping` +- `length_penalty` + +Repetition control: +- `repetition_penalty` +- `no_repeat_ngram_size` +- `encoder_no_repeat_ngram_size` (encoder-decoder) + +Multiple outputs: +- `num_return_sequences` (sampling or beams + sampling variants) + +--- + +## Pitfalls & fixes (high-frequency) + +### “It ignores temperature/top_p” +Sampling knobs only apply when `do_sample=True`. +Fix: set `do_sample=True` (and typically keep `num_beams=1` for pure sampling). + +### “It stops too early / too late” +- Prefer `max_new_tokens` for length. +- Verify `eos_token_id` and that you didn’t set `min_new_tokens` too high. + +### “Beam search is repetitive” +Try: +- smaller `num_beams` (e.g., 2–4) +- `repetition_penalty` or `no_repeat_ngram_size` +- or switch to sampling with moderate temperature/top_p. + +### “Decoder-only output contains prompt” +Slice using `prompt_len` (see above). 
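A minimal stand-in for this slice (toy token ids in plain Python lists; the same `[prompt_len:]` slice works on a tensor row such as `out[0]`):

```python
# Toy illustration with hypothetical token ids: for decoder-only models,
# generate() returns prompt + continuation, so drop the first prompt_len tokens.
prompt_ids = [101, 2054, 2003]              # stand-in for inputs["input_ids"][0]
generated = prompt_ids + [1037, 3231, 102]  # stand-in for out[0]
prompt_len = len(prompt_ids)
new_token_ids = generated[prompt_len:]
print(new_token_ids)  # [1037, 3231, 102]
```

With real outputs, compute `prompt_len = inputs["input_ids"].shape[1]` and decode `out[0][prompt_len:]` with `skip_special_tokens=True`.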
+ +### “Batched generation breaks on padding” +For decoder-only: +- ensure a pad token exists (`tok.pad_token = tok.eos_token` is common) +- consider `tok.padding_side = "left"` for batched generation + +### “OOM during generation” +Route to `performance.md` for: +- `device_map="auto"`, dtype reduction, quantization +- smaller `max_new_tokens`, smaller batch size +- attention backend / KV cache strategies + +--- + +## Repo hotspots (when asked “where is this implemented?”) + +Generation configuration + defaults: +- `src/transformers/generation/configuration_utils.py` + +Streaming: +- `src/transformers/generation/streamers.py` + +Logits processors / warpers (repetition penalty, top-k/top-p, etc.): +- `src/transformers/generation/logits_process.py` + +Pipelines wrapping generation: +- `src/transformers/pipelines/text_generation.py` + +Core generate logic commonly lives under: +- `src/transformers/generation/` (search for `GenerationMixin` and `generate`) + +--- + +## Verification checklist (anti-hallucination) + +When uncertain, verify in this order: +1) `reference/generated/public_api.md` (does the symbol/kwarg exist in this version?) +2) `reference/generated/module_tree.md` (where is it implemented?) +3) `reference/generated/docs_map.md` (where is it documented?) +4) Then inspect `src/transformers/generation/...` and grep the exact name (e.g., `stop_strings`, `typical_p`, `TextIteratorStreamer`).
\ No newline at end of file diff --git a/.claude/skills/transformers-api/reference/areas/inference.md b/.claude/skills/transformers-api/reference/areas/inference.md new file mode 100644 index 000000000000..bc8d19163fbb --- /dev/null +++ b/.claude/skills/transformers-api/reference/areas/inference.md @@ -0,0 +1,286 @@ +# Inference (pipelines + Auto* inference) + +## Contents +- [Scope](#scope) +- [Minimum questions to ask](#minimum-questions-to-ask) +- [Decision guide: `pipeline()` vs manual Auto*](#decision-guide-pipeline-vs-manual-auto) +- [Quickstarts](#quickstarts) + - [1) Pipeline: text classification (single + batch)](#1-pipeline-text-classification-single--batch) + - [2) Pipeline: iterate a Dataset efficiently (KeyDataset)](#2-pipeline-iterate-a-dataset-efficiently-keydataset) + - [3) Pipeline: generator input (num_workers caveat)](#3-pipeline-generator-input-num_workers-caveat) + - [4) Pipeline: image classification (non-text example)](#4-pipeline-image-classification-non-text-example) + - [5) Manual Auto*: classification logits (most control)](#5-manual-auto-classification-logits-most-control) + - [6) Manual Auto*: embeddings (mean pool)](#6-manual-auto-embeddings-mean-pool) +- [Knobs that matter (3–8)](#knobs-that-matter-38) +- [Pitfalls & fixes](#pitfalls--fixes) +- [Chunk batching (QA / zero-shot) and why it matters](#chunk-batching-qa--zero-shot-and-why-it-matters) +- [Verify / locate in repo](#verify--locate-in-repo) + +--- + +## Scope + +Use this page when the user wants to **run a model for inference** (predict/classify/score/encode) in `transformers`. 
+ +--- + +## Minimum questions to ask + +Ask only what you need to produce a runnable snippet (0–5 questions): +1) **Task** (e.g., `text-classification`, `question-answering`, `automatic-speech-recognition`, `image-classification`, `feature-extraction`) +2) **Model id or local path** (and `revision` if pinned) +3) **Backend + device** (PyTorch/TF/JAX; CPU/CUDA/MPS; rough VRAM if relevant) +4) **Input modality** (text/image/audio) if unclear +5) If blocked: **full traceback + exact versions** + smallest repro + +--- + +## Decision guide: `pipeline()` vs manual Auto* + +### Prefer `pipeline()` when… +- You want the fastest path to correct inference with task-specific preprocessing/postprocessing +- You want easy batching or dataset iteration +- You’re okay with outputs formatted by the task pipeline + +### Prefer manual Auto* when… +- You need direct control over tensors/logits/hidden states and custom pooling/postprocessing +- You need to debug shapes/dtypes/devices precisely +- You’re integrating into an existing service/loop and want strict control + +--- + +## Quickstarts + +### 1. Pipeline: text classification (single + batch) + +```python +from transformers import pipeline + +pipe = pipeline( + task="text-classification", + model="distilbert/distilbert-base-uncased-finetuned-sst-2-english", + device=0, # GPU ordinal; use -1 for CPU + dtype="auto", # can also be torch.float16 / "float16" for PyTorch models +) + +print(pipe("This restaurant is awesome")) +print(pipe(["Great!", "Terrible..."], batch_size=8)) +``` + +Notes: +- For large models, prefer `device_map="auto"` over a single `device` (sharding/offload). +- If you must set `trust_remote_code=True`, pin `revision=` and treat it like running third-party code. + +--- + +### 2. Pipeline: iterate a Dataset efficiently (KeyDataset) + +Recommended for large datasets: iterate the dataset directly to avoid loading everything into memory and to avoid writing your own batching loops. 
+ +```python +import datasets +from tqdm.auto import tqdm +from transformers import pipeline +from transformers.pipelines.pt_utils import KeyDataset + +pipe = pipeline( + "text-classification", + model="distilbert/distilbert-base-uncased-finetuned-sst-2-english", + device=0, +) + +ds = datasets.load_dataset("imdb", split="test[:200]") + +# Some texts tokenize longer than the model’s max sequence length (e.g., 512), causing a size-mismatch error; truncation (and padding for batching) fixes it by enforcing a consistent max length. +for out in tqdm(pipe(KeyDataset(ds, "text"), batch_size=16, truncation=True, max_length=512, padding=True)): + pass +``` + +--- + +### 3. Pipeline: generator input (num_workers caveat) + +A generator/iterator is convenient for streaming inputs (queues/HTTP/DB), but note the caveat: with iterative generators you cannot use `num_workers > 1` for multi-process preprocessing. + +```python +from transformers import pipeline + +pipe = pipeline( + "text-classification", + model="distilbert/distilbert-base-uncased-finetuned-sst-2-english", + device=0, +) + +def data(): + for i in range(100): + yield f"My example {i}" + +# Caveat: because this is iterative, you cannot use num_workers > 1 to preprocess in parallel. +for out in pipe(data(), batch_size=8): + pass +``` + +--- + +### 4. Pipeline: image classification (non-text example) + +Pipelines support computer vision tasks. Inputs may be: +- an HTTP(S) URL string +- a local file path string +- a PIL image object + +If you pass a *batch* of images, they must all be in the same format (all URLs, all paths, or all PIL images). + +```python +from transformers import pipeline + +# Vision pipelines require Pillow (PIL). If you get: "This image processor cannot be instantiated... install Pillow", +# run: pip install -U pillow (or: conda install -c conda-forge pillow) and restart your notebook/kernel. 
+clf = pipeline( + "image-classification", + model="google/vit-base-patch16-224", + device=0, + dtype="auto", +) + +img_url = "https://huggingface.co/datasets/Narsil/image_dummy/raw/main/parrots.png" +print(clf(img_url)) +``` + +--- + +### 5. Manual Auto*: classification logits (most control) + +Use this as the baseline when debugging correctness or needing raw logits. + +```python +import torch +from transformers import AutoTokenizer, AutoModelForSequenceClassification + +model_id = "distilbert/distilbert-base-uncased-finetuned-sst-2-english" + +tok = AutoTokenizer.from_pretrained(model_id) +model = AutoModelForSequenceClassification.from_pretrained(model_id) + +device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu") +model.to(device).eval() + +texts = ["I love this.", "I hate this."] +batch = tok(texts, return_tensors="pt", padding=True, truncation=True) +batch = {k: v.to(device) for k, v in batch.items()} + +with torch.inference_mode(): + logits = model(**batch).logits + probs = logits.softmax(dim=-1) + +print("probs:", probs) +print("pred:", probs.argmax(dim=-1)) +``` + +--- + +### 6. Manual Auto*: embeddings (mean pool) + +Use when the user wants embeddings/features (not generation). 
+ +```python +import torch +from transformers import AutoTokenizer, AutoModel + +model_id = "distilbert-base-uncased" + +tok = AutoTokenizer.from_pretrained(model_id) +model = AutoModel.from_pretrained(model_id) + +device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu") +model.to(device).eval() + +texts = ["hello world", "another sentence"] +batch = tok(texts, return_tensors="pt", padding=True, truncation=True) +batch = {k: v.to(device) for k, v in batch.items()} + +with torch.inference_mode(): + out = model(**batch) # out.last_hidden_state: (B, T, H) + mask = batch["attention_mask"].unsqueeze(-1).type_as(out.last_hidden_state) # (B, T, 1) + summed = (out.last_hidden_state * mask).sum(dim=1) # (B, H) + counts = mask.sum(dim=1).clamp(min=1) # (B, 1) + emb = summed / counts # (B, H) + +print("embeddings shape:", emb.shape) +``` + +If they need “sentence embeddings” in production: +- confirm pooling + normalization strategy +- validate with a small retrieval sanity check (nearest neighbors look sensible) + +--- + +## Knobs that matter (3–8) + +Prioritize these knobs before anything else: + +1) **Task ↔ checkpoint compatibility** + - Pipeline: correct `task` (or model with an embedded task) + - Manual: correct `AutoModelFor*` class +2) **`model` + `revision`** (pin for reproducibility) +3) **Placement:** `device` vs `device_map` +4) **Precision:** `dtype` (pipeline) / `torch_dtype` (many manual loading paths) +5) **Batching:** list inputs + `batch_size` (avoid per-example loops) +6) **Tokenization:** `padding`, `truncation`, `max_length` +7) **Overrides:** `tokenizer`, `feature_extractor`, `image_processor`, `processor` (when default loading is wrong) +8) **Security/repro:** `trust_remote_code` (only if trusted) + pinned `revision` + +Useful to know: the documented `pipeline()` constructor includes (among others) +`task`, `model`, `config`, `tokenizer`, `feature_extractor`, `image_processor`, `processor`, `revision`, `use_fast`, `token`, +`device`, 
`device_map`, `dtype='auto'`, `trust_remote_code`, and `model_kwargs`. + +--- + +## Pitfalls & fixes + +- **It’s slow** + - You’re processing one-by-one → pass a **list** / **dataset iterator** and use `batch_size` (avoid per-example loops) + - Batching isn’t always faster → **measure** on your hardware/model; batching is often most helpful on GPU + - You’re on CPU → consider moving to GPU; (rule of thumb: batching on CPU often doesn’t help much) + - Inputs are huge → set `truncation=True` and tune `max_length` (shorter `max_length` is usually faster/cheaper) + +- **Wrong head / mismatched task** + - Pipeline: ensure `task` matches the checkpoint’s intent (e.g., `"text-classification"` vs `"token-classification"`) + - Manual: choose the correct `AutoModelFor*` (e.g., `AutoModelForSequenceClassification`, `AutoModelForTokenClassification`) + +- **Device / dtype issues** + - Manual: move **both** the model and **all** input tensors to the same device + - Inference best-practice: `model.eval()` (often already true after `from_pretrained`) + `torch.inference_mode()` + - Pipeline placement: use **either** `device` **or** `device_map` (don’t set both) + +- **Batching causes OOM** + - Reduce `batch_size`; consider smaller `max_length`; handle OOM gracefully (retry with a smaller batch) + - If lengths vary a lot, consider bucketing by length or using smaller `max_length` to stabilize memory + - For large models, consider `device_map="auto"` (sharding/offload) and lower precision (`dtype="float16"` / `torch.float16` where supported; PyTorch backend) + +--- + +## Chunk batching (QA / zero-shot) and why it matters + +Some tasks (notably `question-answering` and `zero-shot-classification`) may require **multiple forward passes per “one” user input**. +Transformers handles this via a `ChunkPipeline` implementation so you can tune `batch_size` without manually accounting for how many +forward passes a single input triggers. 
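A rough sketch of the effect, with hypothetical window sizes (the real windowing lives inside each `ChunkPipeline` task implementation): one over-long input can expand into several forward passes.

```python
def num_chunks(context_len: int, max_len: int, stride: int) -> int:
    """Hypothetical count of overlapping windows (forward passes) for one input."""
    if context_len <= max_len:
        return 1
    n = 1
    covered = max_len
    while covered < context_len:
        covered += max_len - stride  # each extra window advances by max_len - stride
        n += 1
    return n

print(num_chunks(300, max_len=384, stride=128))   # 1 forward pass
print(num_chunks(1000, max_len=384, stride=128))  # 4 forward passes for one input
```

So a `batch_size` of 8 may translate into far more than 8 forward passes' worth of work per "batch" for these tasks.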
+ +Practical implications: +- If a user reports “batch_size doesn’t behave as expected” for QA/zero-shot, check whether chunking is the cause. +- Don’t assume “1 input = 1 forward pass” for these pipelines. + +--- + +## Verify / locate in repo + +Common repo hotspots: +- Pipelines: +  - `src/transformers/pipelines/__init__.py` (factory/registry) +  - `src/transformers/pipelines/base.py` (base `Pipeline` / batching machinery) +  - `src/transformers/pipelines/*.py` (task implementations) +- Auto factories: +  - `src/transformers/models/auto/` (AutoModel/AutoConfig/AutoTokenizer mappings) +- Core loading utilities: +  - `src/transformers/modeling_utils.py` +  - `src/transformers/configuration_utils.py` \ No newline at end of file diff --git a/.claude/skills/transformers-api/reference/areas/performance.md b/.claude/skills/transformers-api/reference/areas/performance.md new file mode 100644 index 000000000000..0952842a97fc --- /dev/null +++ b/.claude/skills/transformers-api/reference/areas/performance.md @@ -0,0 +1,371 @@ +# Performance (memory + speed + quantization) + +## Contents +- [Scope](#scope) +- [Minimum questions to ask](#minimum-questions-to-ask) +- [Triage ladder (do these first)](#triage-ladder-do-these-first) +- [Quickstarts](#quickstarts) +  - [1) Baseline: correct device placement + mixed precision](#1-baseline-correct-device-placement--mixed-precision) +  - [2) Faster attention: set `attn_implementation`](#2-faster-attention-set-attn_implementation) +  - [3) `torch.compile`: static cache + compile `forward` (generation)](#3-torchcompile-static-cache--compile-forward-generation) +  - [4) bitsandbytes 8-bit / 4-bit: `BitsAndBytesConfig`](#4-bitsandbytes-8-bit--4-bit-bitsandbytesconfig) +  - [5) GPTQ: post-training int4 with `gptqmodel` + `GPTQConfig`](#5-gptq-post-training-int4-with-gptqmodel--gptqconfig) +  - [6) Continuous batching for serving: `generate_batch()` / `transformers serve`](#6-continuous-batching-for-serving-generate_batch--transformers-serve) +- [Knobs that matter
(3–8)](#knobs-that-matter-38) +- [Pitfalls & fixes](#pitfalls--fixes) +- [Verify / locate in repo](#verify--locate-in-repo) + +--- + +## Scope + +Use this page when the user’s goal is **performance** in `transformers`: +- Reduce **VRAM/RAM** (fit the model) +- Increase **throughput** (tokens/sec, examples/sec) +- Reduce **latency** (time-to-first-token, p95) +- Use **quantization**, **compiled execution**, **optimized attention/kernels**, **parallelism**, or **continuous batching** + +--- + +## Minimum questions to ask + +Ask only what you need to recommend the right optimization (0–5 questions): +1) **Workload**: inference vs training? generation vs encoder-only? +2) **Target**: memory bound vs compute bound? (OOM? too slow? p95 latency? throughput?) +3) **Hardware**: CPU vs GPU (which GPU?) vs multi-GPU? +4) **Model + dtype constraints**: model id/path + `transformers` version + backend (PyTorch/TF/JAX) +5) If blocked: exact **OOM/traceback**, plus a minimal runnable snippet + +--- + +## Triage ladder (do these first) + +This ordering avoids “cool tricks” before basics: + +1) **Stop accidental slow paths** + - Batch your requests; avoid per-item loops. + - Ensure the model and inputs are on the same device. +2) **Right-size precision** + - Mixed precision (`float16` / `bfloat16`) usually yields large speed/memory wins on GPUs. +3) **Use an optimized attention backend** + - Swap `attn_implementation` before changing architectures. +4) **Compile** + - `torch.compile` can reduce Python overhead and fuse kernels. +5) **Quantize** + - 8-bit / 4-bit (bitsandbytes) or GPTQ can be the difference between “fits” and “doesn’t”. +6) **Scale/serve** + - Continuous batching and parallelism matter most when serving many concurrent requests. + +--- + +## Quickstarts + +### 1. Baseline: correct device placement + mixed precision + +Use this when the user says “it’s slow” or “it OOMs” and you need a sane baseline. 
+ +```python +import torch +from transformers import AutoTokenizer, AutoModelForCausalLM + +model_id = "google/gemma-2b" # example + +tokenizer = AutoTokenizer.from_pretrained(model_id) + +# Mixed precision + automatic device placement (single GPU or multi-GPU sharding/offload) +model = AutoModelForCausalLM.from_pretrained( + model_id, + device_map="auto", + dtype=torch.bfloat16, # or torch.float16 +).eval() + + +# Put inputs on the model's device +inputs = tokenizer("Hello!", return_tensors="pt").to(model.device) + +with torch.inference_mode(): + out = model.generate(**inputs, max_new_tokens=32) + +print(tokenizer.decode(out[0], skip_special_tokens=True)) +``` + +The `dtype` argument controls the instantiated weight dtype. +- Use `dtype="auto"` to load the checkpoint’s intended dtype. +- Or force `dtype=torch.float16` / `dtype=torch.bfloat16` for mixed precision (GPU permitting). + + +--- + +### 2. Faster attention: set `attn_implementation` + +Transformers exposes multiple attention backends through a single knob: `attn_implementation`. +Supported values in the attention-backends interface include (among others): +"flash_attention_3", "flash_attention_2", "flex_attention", "sdpa" (and "eager"), plus paged variants like "paged|flash_attention_3" / "paged|flash_attention_2" / "paged|sdpa" / "paged|eager". 
+ + +```python +from transformers import AutoModelForCausalLM + +model = AutoModelForCausalLM.from_pretrained( + "meta-llama/Llama-3.2-1B", + attn_implementation="flash_attention_2", +) +``` + +You can also switch implementations at runtime without reloading: + +```python +model.set_attn_implementation("sdpa") +``` + +If you don’t want to install a FlashAttention package (CUDA/PyTorch version mismatch pain), you can load a compiled kernel from the Hub via the Kernels integration: + +```python +from transformers import AutoModelForCausalLM + +model = AutoModelForCausalLM.from_pretrained( + "meta-llama/Llama-3.2-1B", + attn_implementation="kernels-community/flash-attn2", +) +``` + +**Gotchas (read before benchmarking):** +- **Backend availability depends on model + PyTorch/CUDA + dtype.** + For example, FlashAttention2 requires CUDA and typically `float16` or `bfloat16`; it will silently fall back or error if the dtype or build is incompatible. +- **FlashAttention2 does not support attention over padded tokens.** + In batched generation with padding, this can reduce performance unless you avoid padding, unpad inputs, or use an alternative backend (e.g. SDPA). +- **Some attention params force a fallback to eager.** + For example, `output_attentions=True` is unsupported in some optimized attention paths and triggers a fallback warning. + +--- + +### 3. `torch.compile`: static cache + compile `forward` (generation) + +For generation workloads, Transformers recommends enabling StaticCache via `cache_implementation="static"`. This also turns on automatic compilation of the decoding stage for greedy and sampling decode. You can control this via `compile_config` (or disable it with `disable_compile`) and still need stable shapes to avoid recompilation. 
+ + +```python +import os +import torch +from transformers import AutoModelForCausalLM, AutoTokenizer + +os.environ["TOKENIZERS_PARALLELISM"] = "false" + +model_id = "google/gemma-2b" +tokenizer = AutoTokenizer.from_pretrained(model_id) + +model = AutoModelForCausalLM.from_pretrained( + model_id, + device_map="auto", + dtype="auto", # or torch.float16 / torch.bfloat16 +).eval() + +# Compile the forward pass; generate() calls model.forward internally +model.forward = torch.compile( + model.forward, + mode="reduce-overhead", + fullgraph=True, +) + +# Keep shapes stable to avoid recompilation +inputs = tokenizer( + "Hello!", + return_tensors="pt", + pad_to_multiple_of=8, +).to(model.device) + +with torch.inference_mode(): + outputs = model.generate(**inputs, max_new_tokens=32, cache_implementation="static") + +print(tokenizer.decode(outputs[0], skip_special_tokens=True)) +``` + +Notes: +- The **first call is slower** due to compilation; benchmark after warmup. +- Keep batch size, prompt length, and `max_new_tokens` stable to avoid recompilation. +- If `fullgraph=True` fails due to graph breaks, retry with `fullgraph=False`. + +--- + +### 4. bitsandbytes 8-bit / 4-bit: `BitsAndBytesConfig` + +This is the fastest “make it fit” move for many LLMs. 
Install deps first: + +```bash +pip install --upgrade transformers accelerate bitsandbytes +``` + +**8-bit example (generation path):** + +```python +from transformers import BitsAndBytesConfig, AutoTokenizer, AutoModelForCausalLM + +quantization_config = BitsAndBytesConfig(load_in_8bit=True) + +tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B") +model = AutoModelForCausalLM.from_pretrained( + "meta-llama/Llama-3.1-8B", + device_map="auto", + quantization_config=quantization_config, +) + +inputs = tokenizer("Hello, my llama is cute", return_tensors="pt").to(model.device) +generated_ids = model.generate(**inputs) +print(tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]) +``` +**Preferred API:** +Use `quantization_config=BitsAndBytesConfig(...)` when loading models. +Avoid passing `load_in_8bit` or `load_in_4bit` directly to `from_pretrained()`; these flags exist for compatibility but are not the recommended interface. + +Notes: +- The GPU performance guide explicitly recommends **using `generate()` rather than the Pipeline API** for **8-bit text generation**, because Pipeline is not optimized for 8-bit models and some sampling strategies may not be supported there. +- For multi-GPU/distributed, you can pass `max_memory={...}` to control per-device allocation when using `device_map="auto"`. + +--- + +### 5. GPTQ: post-training int4 with `gptqmodel` + `GPTQConfig` + +Transformers’ GPTQ doc states: +- GPTQ is supported via the **`gptqmodel`** package. +- Transformers supports GPTQ via GPTQModel and still documents AutoGPTQ, but AutoGPTQ is likely to be deprecated; prefer GPTQModel going forward. 
+ +Install: + +```bash +pip install --upgrade accelerate optimum transformers +pip install gptqmodel --no-build-isolation +``` + +Quantize (example pattern): + +```python +from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig + +tokenizer = AutoTokenizer.from_pretrained("facebook/opt-125m") +gptq_config = GPTQConfig(bits=4, dataset="c4", tokenizer=tokenizer) + +quantized_model = AutoModelForCausalLM.from_pretrained( + "facebook/opt-125m", + device_map="auto", + quantization_config=gptq_config, +) +``` + +If you hit memory pressure during quantization, GPTQ docs recommend using `max_memory={...}` (disk offloading is not supported for the dataset). + +--- + +### 6. Continuous batching for serving: `generate_batch()` / `transformers serve` + +Continuous batching increases throughput and reduces latency by dynamically re-forming the batch each step (removing finished requests and adding new ones) to avoid GPU idling. It works with `transformers serve` and `generate_batch()`. + +- **PagedAttention is automatically enabled under continuous batching.** + You can also explicitly select a paged backend via `attn_implementation="paged|..."` if needed. 
+ + +Minimal `generate_batch()` shape (tokenized inputs list + `GenerationConfig`): + +```python +import datasets +import torch +from transformers import AutoModelForCausalLM, AutoTokenizer +from transformers.generation import GenerationConfig + +model = AutoModelForCausalLM.from_pretrained( + "Qwen/Qwen3-4B-Instruct-2507", + attn_implementation="sdpa_paged", + device_map="cuda", + dtype=torch.bfloat16, +) +tokenizer = AutoTokenizer.from_pretrained( + "Qwen/Qwen3-4B-Instruct-2507", + padding_side="left", +) + +dataset = datasets.load_dataset("openai/gsm8k", "socratic", split="test").select(range(8)) +tokenized = dataset.map(lambda x: tokenizer(x["question"]), batched=True) + +simple_batch_inputs = [item["input_ids"] for item in tokenized] + +generation_config = GenerationConfig( + max_new_tokens=32, + do_sample=False, + eos_token_id=tokenizer.eos_token_id, + pad_token_id=tokenizer.pad_token_id, + max_batch_tokens=512, # token budget for batching + use_cuda_graph=False, +) + +batch_outputs = model.generate_batch(inputs=simple_batch_inputs, generation_config=generation_config) + +for request_id, output in batch_outputs.items(): + print(request_id, tokenizer.decode(output.generated_tokens, skip_special_tokens=True)) +``` + +If you need custom scheduling, the docs expose a `ContinuousBatchingManager` and schedulers (default FIFO). + +--- + +## Knobs that matter (3–8) + +Prioritize these knobs before anything else: + +1) **Batching + padding strategy** + - batch requests; for LLM generation use left-padding (`padding_side="left"`) when appropriate +2) **Placement**: `device` vs `device_map` (and **inputs on `model.device`**) +3) **Precision**: `dtype` / `torch_dtype` (fp16/bf16) vs full fp32 +4) **Attention backend**: `attn_implementation` (FlashAttention/SDPA/paged variants) +5) **Compilation**: `torch.compile(...)` knobs (`mode`, `fullgraph`) or compile-via-`generate()` with a static cache +6) **Quantization**: `quantization_config` (bitsandbytes 8/4-bit, GPTQ, etc.) 
+7) **Memory partitioning** (multi-GPU/offload): `max_memory={...}` with `device_map="auto"` +8) **Serving throughput**: continuous batching (`generate_batch()`, `max_batch_tokens`) / `transformers serve` + +--- + +## Pitfalls & fixes + +- **“It’s still slow after moving to GPU”** +  - Inputs not on GPU → ensure `tokenizer(...).to(model.device)` +  - You’re running batch_size=1 loops → batch requests; avoid Python overhead +- **“FlashAttention enabled but errors”** +  - Use the Kernels integration (`attn_implementation="kernels-community/flash-attn2"`) to avoid local build/version mismatch +  - Or fall back to `attn_implementation="sdpa"` for a safer baseline +- **“`torch.compile` made it slower”** +  - First run includes compilation; benchmark after warmup +  - Try `mode="reduce-overhead"`; avoid recompiling on shape changes (keep shapes stable) +- **“8-bit pipeline is slow / sampling not supported”** +  - For 8-bit text generation, prefer calling `model.generate()` directly (per GPU perf guide) +- **“OOM when quantizing (GPTQ)”** +  - Use `device_map="auto"` and constrain with `max_memory={...}` +  - Prefer loading an already-quantized checkpoint from the Hub when available +- **“Serving latency spikes under load”** +  - Use continuous batching to prevent GPU idle bubbles and handle ragged request lengths +  - Tune `max_batch_tokens` and request scheduling + +--- + +## Verify / locate in repo + +Repo hotspots (performance): +- **Loading / placement (`from_pretrained`, `device_map`, `max_memory`, `dtype`)**: `src/transformers/modeling_utils.py` +- **Attention backend interface (`attn_implementation`, `set_attn_implementation`)**: docs “Attention backends” + model code in `src/transformers/models/<model_name>/modeling_<model_name>.py` (where eager/SDPA/FA branches usually live) +- **KV cache internals (Static/DynamicCache)**: `src/transformers/cache_utils.py` + KV-cache docs (shows `cache_implementation="static"` + compile behavior) +- **Generation cache/config knobs (`GenerationConfig`, cache impl
wiring)**: `src/transformers/generation/configuration_utils.py` +- **Core `generate()` perf paths**: `src/transformers/generation/utils.py` +- **Continuous batching (`generate_batch`)**: `src/transformers/generation/continuous_batching/continuous_api.py` +- **Quantization config objects**: `src/transformers/utils/quantization_config.py` +- **Quantizer routing (which quantizer gets picked)**: `src/transformers/quantizers/auto.py` +- **bitsandbytes glue + bnb 4bit internals**: `src/transformers/integrations/bitsandbytes.py` and `src/transformers/quantizers/quantizer_bnb_4bit.py` +- **`transformers serve` (CLI + behavior)**: docs “Serving” and implementation under `src/transformers/commands/serving.py` (shows up in tracebacks) + +When uncertain, use Skill verification indexes: +- “Does this symbol/arg exist?” → `reference/generated/public_api.md` +- “Where is it implemented?” → `reference/generated/module_tree.md` + +High-signal repo search keywords (grep these): +- `attn_implementation`, `set_attn_implementation` +- `torch.compile`, `cache_implementation="static"` +- `BitsAndBytesConfig`, `quantization_config`, `load_in_8bit`, `load_in_4bit` +- `GPTQConfig`, `gptqmodel` +- `generate_batch`, `ContinuousBatchingManager`, `init_continuous_batching`, `max_batch_tokens` +- `device_map`, `max_memory` \ No newline at end of file diff --git a/.claude/skills/transformers-api/reference/areas/preprocessing.md b/.claude/skills/transformers-api/reference/areas/preprocessing.md new file mode 100644 index 000000000000..11de3d7d518c --- /dev/null +++ b/.claude/skills/transformers-api/reference/areas/preprocessing.md @@ -0,0 +1,434 @@ +# Preprocessing (tokenizers, processors, image/video processors, feature extractors) + +## Contents +- [Scope](#scope) +- [Minimum questions (0–4)](#minimum-questions-04) +- [Choose the right preprocessor](#choose-the-right-preprocessor) +- [Text preprocessing: `AutoTokenizer`](#text-preprocessing-autotokenizer) +- [Chat templating: 
`apply_chat_template`](#chat-templating-apply_chat_template) +- [Vision preprocessing: `AutoImageProcessor`](#vision-preprocessing-autoimageprocessor) +- [Audio preprocessing: `AutoFeatureExtractor` and `AutoProcessor`](#audio-preprocessing-autofeatureextractor-and-autoprocessor) +- [Video preprocessing: `AutoVideoProcessor`](#video-preprocessing-autovideoprocessor) +- [Multimodal preprocessing: `AutoProcessor`](#multimodal-preprocessing-autoprocessor) +- [Batching + device sanity](#batching--device-sanity) +- [Pitfalls & fixes](#pitfalls--fixes) +- [Repo hotspots](#repo-hotspots) + +--- + +## Scope + +Use this page when the user needs to convert **raw inputs** (text / images / audio / video / multimodal messages) into **model-ready tensors** for `transformers`. + +--- + +## Minimum questions (0–4) + +Ask only what’s needed to produce a runnable snippet: +1) **Modality + task + raw input format** + - Text / vision / audio / video / multimodal + - What you’re passing in (e.g., plain strings, chat messages, image URL/path/PIL, audio array + sampling rate, video frames) + - Desired output (logits, embeddings, generated tokens) / expected shapes if relevant +2) **Model id/path** + - Hugging Face Hub id or local path + - Optional but recommended for reproducibility/security: pinned `revision` (tag/branch/commit) +3) **Backend + device** + - PyTorch / TensorFlow / JAX + - CPU / CUDA / MPS (and which GPU index if CUDA) +4) If blocked: **full traceback + minimal repro** + - Smallest code sample that still fails + the exact error + +--- + +## Choose the right preprocessor + +Rule: **load preprocessing artifacts from the same checkpoint as the model**. 
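A cheap way to catch a mismatched pairing early is to compare the preprocessor's output keys against the model's forward signature. A minimal sketch — `unexpected_input_keys` is an illustrative helper, not a transformers API; in real code you would pass `model.forward` and the dict returned by the preprocessor:

```python
import inspect

def unexpected_input_keys(inputs: dict, forward_fn) -> set:
    """Return the keys in `inputs` that `forward_fn` does not accept.

    A non-empty result usually means the preprocessor and the model were
    loaded from incompatible checkpoints (or the wrong modality class).
    """
    params = inspect.signature(forward_fn).parameters
    # If forward takes **kwargs, every key is formally accepted.
    if any(p.kind is inspect.Parameter.VAR_KEYWORD for p in params.values()):
        return set()
    return set(inputs) - set(params)

# Stand-in for model.forward, just to illustrate the check:
def fake_text_forward(input_ids=None, attention_mask=None, labels=None):
    pass

batch = {"input_ids": [[1, 2]], "attention_mask": [[1, 1]], "pixel_values": [[0.0]]}
print(unexpected_input_keys(batch, fake_text_forward))  # → {'pixel_values'}
```

Note that many recent model forwards accept `**kwargs`, so an empty result is necessary but not sufficient; the table below is the authoritative guide for which class to load.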
+ +| Modality | Preferred class | Typical output keys | +|---|---|---| +| Text | `AutoTokenizer` | `input_ids`, `attention_mask` (maybe `token_type_ids`) | +| Image | `AutoImageProcessor` | `pixel_values` (maybe `pixel_mask`) | +| Audio | `AutoFeatureExtractor` *or* `AutoProcessor` (model-dependent) | `input_values` **or** `input_features` (sometimes `attention_mask`) | +| Video | `AutoVideoProcessor` **or** `AutoImageProcessor` (frame-based; model-dependent) | model-dependent video/frame tensors + optional metadata | +| Multimodal (text+image/audio/video) | `AutoProcessor` | combination (e.g., `input_ids` + `pixel_values`) | + +If the model card/examples show `AutoProcessor`, prefer `AutoProcessor`. +Note: Some video classification models (e.g., VideoMAE) use a frame/image processor (`AutoImageProcessor` / `VideoMAEImageProcessor`) rather than `AutoVideoProcessor`. + +--- + +## Text preprocessing: `AutoTokenizer` + +### Minimal batch tokenization (PyTorch) +```python +from transformers import AutoTokenizer + +model_id = "bert-base-uncased" +tok = AutoTokenizer.from_pretrained(model_id) + +texts = ["hello world", "a much longer example sentence"] +batch = tok( + texts, + padding=True, # pad to longest in batch + truncation=True, # truncate if needed + return_tensors="pt", +) + +print(batch.keys()) +print(batch["input_ids"].shape) +``` + +### Practical padding/truncation defaults +- Safe batch default: `padding=True, truncation=True` +- Deterministic cap: add `max_length=...` +- Static shapes: `padding="max_length"` + `max_length=...` + +### Decoder-only LMs: pad token + left padding for batching +Some causal LMs do not define a pad token. For batched inputs (esp. generation), set it explicitly. 
+```python +from transformers import AutoTokenizer + +model_id = "gpt2" +tok = AutoTokenizer.from_pretrained(model_id) + +if tok.pad_token is None: + tok.pad_token = tok.eos_token + +tok.padding_side = "left" # common for decoder-only batching + +batch = tok(["hi", "hello there"], padding=True, return_tensors="pt") +print(batch["input_ids"].shape) +``` + +### Long inputs: sliding window with overlap (`stride`) +Use this when text exceeds context length and you want overlapping windows. +```python +from transformers import AutoTokenizer + +tok = AutoTokenizer.from_pretrained("bert-base-uncased") + +text = "very long text " * 2000 +enc = tok( + text, + truncation=True, + max_length=512, + stride=128, + return_overflowing_tokens=True, + return_offsets_mapping=True, # best with fast tokenizers +) + +print("num_windows:", len(enc["input_ids"])) +``` + +### Token classification: word alignment (`is_split_into_words`) +```python +from transformers import AutoTokenizer + +tok = AutoTokenizer.from_pretrained("bert-base-cased") + +words = ["New", "York", "City"] +enc = tok(words, is_split_into_words=True, return_tensors="pt") + +# Fast tokenizers provide token->word alignment +word_ids = enc.word_ids(batch_index=0) +print(word_ids) +``` + +--- + +## Chat templating: `apply_chat_template` + +Use chat templates when the model expects a specific conversation format. +If the user’s issue is decoding/stopping/streaming, route to `generation.md`. + +```python +from transformers import AutoTokenizer + +model_id = "meta-llama/Llama-3.1-8B-Instruct" + +# Access/auth note: +# - If this line fails with 401 Unauthorized / GatedRepoError, the repo is gated or private. 
+# - Fix: (1) request/accept access on the model page, then (2) authenticate: +# * terminal: `huggingface-cli login` +# * or set env var `HF_TOKEN=hf_...` and restart your kernel/session +# - Optional token examples: +# * AutoTokenizer.from_pretrained(model_id, token=True) # use cached login or HF_TOKEN +# * AutoTokenizer.from_pretrained(model_id, token="hf_...") # explicit token +# - Public demo alternative (no gating): "TinyLlama/TinyLlama-1.1B-Chat-v1.0" +tok = AutoTokenizer.from_pretrained(model_id) + +messages = [ + {"role": "system", "content": "You are a helpful assistant."}, + {"role": "user", "content": "Write a haiku about preprocessing."}, +] + +prompt = tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True) +print(prompt) + +# If you later tokenize `prompt` yourself, set add_special_tokens=False to avoid duplicating special tokens. +``` + +To directly get token ids: +```python +enc = tok.apply_chat_template(messages, tokenize=True, add_generation_prompt=True, return_tensors="pt") +print(enc.shape) +``` + +--- + +## Vision preprocessing: `AutoImageProcessor` + +### Minimal image preprocessing (PIL) +```python +from transformers import AutoImageProcessor +from PIL import Image + +model_id = "google/vit-base-patch16-224" +imgp = AutoImageProcessor.from_pretrained(model_id) + +image = Image.open("image.jpg") +inputs = imgp(images=image, return_tensors="pt") + +print(inputs.keys()) # typically includes "pixel_values" (and sometimes "pixel_mask", model-dependent) +print(inputs["pixel_values"].shape) +``` + +### Fast image processors (if supported) +Some checkpoints provide a “fast” image processor path. +```python +from transformers import AutoImageProcessor + +imgp = AutoImageProcessor.from_pretrained( + "google/vit-base-patch16-224", + use_fast=True, +) +``` + +### What to change (typical; varies by checkpoint) +Prefer changing processor config/kwargs rather than writing ad-hoc transforms. 
Not every processor supports every knob below: +- resize/crop: `do_resize`, `size`, `do_center_crop`, `crop_size` +- normalize: `do_normalize`, `image_mean`, `image_std` +- rescale: `do_rescale`, `rescale_factor` + +--- + +## Audio preprocessing: `AutoFeatureExtractor` and `AutoProcessor` + +### Key rule: sampling rate must match +If audio outputs are “nonsense,” sampling-rate mismatch is a top cause. +Prefer reading the expected sampling rate from the preprocessor rather than hardcoding it. + +### Waveform models (e.g., wav2vec2): `AutoFeatureExtractor` +```python +import numpy as np +from transformers import AutoFeatureExtractor + +model_id = "facebook/wav2vec2-base-960h" +fe = AutoFeatureExtractor.from_pretrained(model_id) +# 1 second of silence at the model's expected sampling rate (replace with real audio) +sr = fe.sampling_rate +waveform = np.zeros(sr, dtype=np.float32) +inputs = fe( + waveform, + sampling_rate=sr, + padding=True, + return_tensors="pt", +) +print(inputs.keys()) # typically includes "input_values" (+ "attention_mask" sometimes, model-dependent) +``` + +### Spectrogram-feature models (common for Whisper): `AutoProcessor` +Whisper-style models typically use a processor that returns `input_features`. +```python +import numpy as np +from transformers import AutoProcessor + +model_id = "openai/whisper-small" +proc = AutoProcessor.from_pretrained(model_id) +sr = proc.feature_extractor.sampling_rate +waveform = np.zeros(sr, dtype=np.float32) +inputs = proc( + waveform, + sampling_rate=sr, + return_tensors="pt", +) +print(inputs.keys()) # typically includes "input_features" +``` + +--- + +## Video preprocessing: `AutoVideoProcessor` + +Video preprocessing may require a decoding backend depending on how you provide video. +Safest approach (no decoder dependency): **decode frames yourself** and pass frames. + +### Option A (decoder-free): pass frames you already have +Example assumes you have a list of PIL images (frames) or numpy arrays. 
+For a batch of videos, pass a list of frame-lists: `[[frame1, frame2, ...], [...]]` + +```python +# VideoMAE uses an *image/frame* processor; AutoVideoProcessor is for certain VLM/video-chat model types. +import numpy as np +from PIL import Image +from transformers import AutoImageProcessor, VideoMAEForVideoClassification + +mid = "MCG-NJU/videomae-base-finetuned-kinetics" +proc = AutoImageProcessor.from_pretrained(mid) +model = VideoMAEForVideoClassification.from_pretrained(mid) +num_frames = getattr(model.config, "num_frames", 16) +H, W = 224, 224 +frames = [Image.fromarray(np.random.randint(0,256,(H,W,3),dtype=np.uint8)) + for _ in range(num_frames)] +inputs = proc(images=frames, return_tensors="pt") # <-- key change +pred = model(**inputs).logits.argmax(-1).item() +print(model.config.id2label[pred]) +``` + +### Option B: decode with TorchCodec, then pass frames to VideoMAE +```python +# Requirements: +# pip install torch transformers torchcodec +# + install FFmpeg (shared libs; on Windows this matters) + +import torch +from torchcodec.decoders import VideoDecoder +from transformers import AutoImageProcessor, VideoMAEForVideoClassification + +video_path = "video.mp4" +model_id = "MCG-NJU/videomae-base-finetuned-kinetics" + +proc = AutoImageProcessor.from_pretrained(model_id) # VideoMAE is frame-based +model = VideoMAEForVideoClassification.from_pretrained(model_id).eval() + +# TorchCodec decodes frames as uint8 tensors; use NHWC to get (N, H, W, C) +decoder = VideoDecoder(video_path, dimension_order="NHWC") +T = len(decoder) +if T == 0: + raise RuntimeError(f"Video has 0 frames: {video_path}") + +num = getattr(model.config, "num_frames", 16) +idx = torch.linspace(0, T - 1, num).round().long().clamp(0, T - 1) + +fb = decoder.get_frames_at(indices=idx.tolist()) # FrameBatch; pixels in fb.data (uint8) +frames = [fb.data[i].cpu().numpy() for i in range(fb.data.shape[0])] # list of HWC uint8 arrays + +inputs = proc(images=frames, return_tensors="pt") 
+print(inputs.keys()) + +with torch.no_grad(): + pred = model(**inputs).logits.argmax(-1).item() + +print(model.config.id2label[pred]) +``` +--- + +## Multimodal preprocessing: `AutoProcessor` + +Use `AutoProcessor` for models that combine modalities (text + image/audio/video). + +### Recommended: chat template + image + +```python +from transformers import AutoProcessor +from PIL import Image + +model_id = "llava-hf/llava-onevision-qwen2-0.5b-ov-hf" +# For **LLaVA-OneVision**, it’s safest to build the prompt with the **chat template** (it inserts the required image placeholder token). +proc = AutoProcessor.from_pretrained(model_id) +image = Image.open("image.jpg") +messages = [ + { + "role": "user", + "content": [ + {"type": "image"}, + {"type": "text", "text": "Describe this image."}, + ], + } +] +prompt = proc.apply_chat_template(messages, add_generation_prompt=True) +inputs = proc(text=prompt, images=image, return_tensors="pt") +print(inputs.keys()) # typically includes input_ids/attention_mask + pixel_values (and possibly others) +``` +--- + +## Batching + device sanity +### Inspect keys, shapes, dtypes +```python +import torch +for k, v in inputs.items(): + if torch.is_tensor(v): + print(k, tuple(v.shape), v.dtype, v.device) + else: + print(k, type(v)) +``` + +### Move tensors to device (PyTorch) +Some outputs support `.to(device)`; otherwise move per-tensor. 
+```python +import torch +device = "cuda" if torch.cuda.is_available() else "cpu" +try: + inputs = inputs.to(device) +except Exception: + inputs = {k: (v.to(device) if torch.is_tensor(v) else v) for k, v in inputs.items()} +``` +--- + +## Pitfalls & fixes + +### “Batch fails / shapes differ” +- Text: `padding=True` and usually `truncation=True` +- Audio: `padding=True` + consistent `sampling_rate` +- Vision/video: pass lists consistently; avoid mixing PIL paths/URLs/arrays in the same batch + +### “Tokenizer has no pad token” +- Decoder-only: set `pad_token` (often to `eos_token`) and consider `padding_side="left"` + +### “Output keys don’t match model forward” +Print `inputs.keys()` and confirm expected keys: +- text: `input_ids`, `attention_mask` (maybe `token_type_ids`) +- vision: `pixel_values` (maybe `pixel_mask`) +- audio: `input_values` or `input_features` +- multimodal: combinations + +### “Audio outputs are wrong” +- Verify sampling rate, dtype, and that you’re passing a 1D waveform (not stereo without handling) + +### “Double preprocessing (manual normalize + processor normalize)” +- Prefer processor config; if you must customize, disable the relevant processor steps (model-dependent) + +--- + +## Repo hotspots + +### Tokenizers +- src/transformers/tokenization_utils_base.py +- src/transformers/tokenization_utils_fast.py +- src/transformers/tokenization_utils_tokenizers.py +- src/transformers/models/auto/tokenization_auto.py + +### Processors +- src/transformers/processing_utils.py +- src/transformers/models/auto/processing_auto.py + +### Image processors +- src/transformers/image_processing_utils.py +- src/transformers/image_processing_base.py +- src/transformers/models/auto/image_processing_auto.py +- model-specific: src/transformers/models/*/image_processing_*.py + +### Feature extractors +- src/transformers/feature_extraction_utils.py +- src/transformers/models/auto/feature_extraction_auto.py +- model-specific: 
src/transformers/models/*/feature_extraction_*.py + +### Video processors +- src/transformers/video_processing_utils.py +- src/transformers/models/auto/video_processing_auto.py +- src/transformers/video_utils.py +- model-specific: src/transformers/models/*/video_processing_*.py + - example: src/transformers/models/videomae/video_processing_videomae.py + +### Tests (entry points) +- tests/test_tokenization_common.py +- model-specific: tests/models//... \ No newline at end of file diff --git a/.claude/skills/transformers-api/reference/areas/repo-contributing.md b/.claude/skills/transformers-api/reference/areas/repo-contributing.md new file mode 100644 index 000000000000..2d26ef38650c --- /dev/null +++ b/.claude/skills/transformers-api/reference/areas/repo-contributing.md @@ -0,0 +1,315 @@ +# Repo navigation & contributing (where is X implemented? + PR hygiene) + +## Contents +- [Scope](#scope) +- [Minimum questions to ask](#minimum-questions-to-ask) +- [Decision guide](#decision-guide) +- [Quickstarts](#quickstarts) + - [1) Locate an implementation (public API → module → file)](#1-locate-an-implementation-public-api--module--file) + - [2) Set up a dev environment (editable install)](#2-set-up-a-dev-environment-editable-install) + - [3) Run the smallest relevant tests](#3-run-the-smallest-relevant-tests) + - [4) Run style/quality checks (make targets)](#4-run-stylequality-checks-make-targets) + - [5) Run repo consistency checks (make repo-consistency)](#5-run-repo-consistency-checks-make-repo-consistency) + - [6) Build docs locally (doc-builder)](#6-build-docs-locally-doc-builder) + - [7) Model contributions (modular approach + checklist)](#7-model-contributions-modular-approach--checklist) +- [Knobs that matter (3–8)](#knobs-that-matter-38) +- [Pitfalls & fixes](#pitfalls--fixes) +- [Repo hotspots](#repo-hotspots) +- [Verify / locate in repo](#verify--locate-in-repo) + +--- + +## Scope + +Use this page when the user wants to: +- find “**where is X implemented?**” 
(exact file/class/function) +- understand the **repo layout** (`src/`, `tests/`, `docs/`, `examples/`) +- make a **small targeted change** and open a PR safely +- add/update **docs**, **tests**, or a **model** + +--- + +## Minimum questions to ask + +Ask only what you need (0–5 questions): +1) The **symbol/name** (class/function/arg) OR the **behavior** (what changed / what’s wrong) +2) Is the request “**where is it**” or “**change it**” or “**add it**”? +3) Which backend matters (PyTorch/TF/JAX) and which area (pipelines/generation/trainer/tokenizers/processors)? +4) Do they have a **repro** or failing test? (ideal) +5) Are they changing **public API** or internal behavior only? + +--- + +## Decision guide + +### If the question is “Where is X implemented?” +Use this ladder (don’t guess): +1) Confirm the public symbol exists → `reference/generated/public_api.md` +2) Map it to a file path → `reference/generated/module_tree.md` +3) Grep the repo for the symbol / error substring / config key +4) Find the tests that cover it, then adjust minimally + +### If the goal is “Change X” (bug fix / behavior change) +1) Reproduce (minimal script) OR write a failing test first +2) Make the smallest code change +3) Run the smallest relevant tests +4) Run `make fixup` and fix remaining issues +5) Open PR with a clear title and minimal diff + +### If the goal is “Add X” (new model / new feature) +1) Prefer the modular approach when available (keeps contributions maintainable) +2) Add code + docs + tests together +3) Run repo consistency checks so required registries/indexes don’t get missed +4) Keep the PR as small and focused as possible + +--- + +## Quickstarts + +### 1. 
Locate an implementation (public API → module → file) + +Follow this sequence: + +1) **Does the symbol exist publicly?** + Open: `reference/generated/public_api.md` + +2) **Where is it implemented?** + Open: `reference/generated/module_tree.md` + - Identify the owning module/file under `src/transformers/` + - Note adjacent files in the same folder (helpers/configs/variants) + +3) **Grep keywords** + Use 1–3 high-signal search terms: + - exact symbol name (e.g., `set_attn_implementation`) + - error substring from traceback + - config key (e.g., `attn_implementation`, `torch_dtype`) + +--- + +### 2. Set up a dev environment (editable install) + +```bash +git clone https://github.com/<your-username>/transformers.git +cd transformers +git remote add upstream https://github.com/huggingface/transformers.git +git checkout -b my-descriptive-branch +``` + +Before opening a PR (or if a maintainer asks), rebase your branch on upstream: + +```bash +git fetch upstream +git rebase upstream/main +``` + +Editable install in a virtualenv: + +```bash +pip install -e ".[dev]" +``` + +If that fails (optional deps can be heavy), install PyTorch first, then: + +```bash +pip install -e ".[quality]" +``` + +If Transformers was already installed in that env, uninstall it first: + +```bash +pip uninstall transformers +``` + +--- + +### 3. Run the smallest relevant tests + +Run only what you touched first: + +```bash +pytest tests/<path_to_test_file>.py +``` + +Iterate faster with keyword filtering: + +```bash +pytest -k "keyword_here" tests/<path_to_test_file>.py +``` + +#### Match CI’s test selection (tests_fetcher) + +Transformers CI selects tests impacted by your PR diff. You can reproduce that selection locally by running the same helper script CI uses. + +```bash +python utils/tests_fetcher.py +``` + +This creates a `test_list.txt` file with the tests to run; execute them like this: + +```bash +python -m pytest -n 8 --dist=loadfile -rA -s $(cat test_list.txt) +``` + +If you add/modify `@slow` tests, run them explicitly.
By default, slow tests are skipped; set `RUN_SLOW=yes` to enable them — note this can download **many gigabytes** of models (disk + bandwidth required). + +```bash +RUN_SLOW=yes python -m pytest -n auto --dist=loadfile -s -v tests/ +``` + +Accepted variant (you’ll also see this form used in some docs/CI contexts): + +```bash +RUN_SLOW=1 pytest tests/ +``` + + +--- + +### 4. Run style/quality checks (make targets) + +Full formatting: + +```bash +make style +``` + +Quality checks: + +```bash +make quality +``` + +Fast path for PR iteration (targets modified files and also runs repo consistency): + +```bash +make fixup +``` + +--- + +### 5. Run repo consistency checks (make repo-consistency) + +Run: + +```bash +make repo-consistency +``` + +If it fails on copies / generated-content checks, run: + +```bash +make fix-copies +``` + +Then rerun: + +```bash +make repo-consistency +``` + +--- + +### 6. Build docs locally (hf-doc-builder) + +If you modified anything under `docs/source`, make sure the documentation can still be built. + +Install the documentation builder: + +```bash +pip install hf-doc-builder +``` + +Run the following command from the root of the repository: + +```bash +doc-builder build transformers docs/source/en --build_dir ~/tmp/test-build +``` + +Inspect the output under `~/tmp/test-build`. + +--- + +### 7. Model contributions (modular approach + checklist) + +For vision-language / multimodal models (images/videos), follow the official Transformers contribution checklist. + +#### Required checklist (vision-language / multimodal) + +1) **Implement a modular file** +- Prefer the modular architecture pattern: create `modular_<model_name>.py`. +- Use the CLI to scaffold a modular skeleton: + - `transformers add-new-model-like` +- Verify the modular file with: +~~~bash +python utils/modular_model_converter.py +~~~ +This generates the derived files (`modeling_*.py`, `configuration_*.py`, etc.) and CI enforces that they match the modular source.
+ +2) **Add a fast image processor (for image/video models)** +- If your model processes images, add a fast image processor that inherits from `BaseImageProcessorFast` (torch/torchvision-based) for better performance. + +3) **Create a weight conversion script** +- Add `convert_<model_name>_to_hf.py` to convert original checkpoints to the Hugging Face format (load, map keys, save), including usage examples in the script. + +4) **Add integration tests with exact output matching** +- Add an `IntegrationTest` that runs end-to-end processing + modeling with **exact output matching** (generated text for generative models; logits for non-generative models). +- Use real checkpoints + real inputs (consider 4-bit / half precision if the checkpoint is large for CI). + +5) **Update documentation** +- Add or update `docs/source/en/model_doc/<model_name>.md` with usage examples, model description + paper link, and basic usage with `Pipeline` and `AutoModel`. +- Add the model to the appropriate TOC files. + +6) **Look for reusable patterns** +- Reuse established patterns from similar models (LLaVA, Idefics2, Fuyu, etc.) and avoid reinventing core components.
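For the weight conversion script in the checklist above, the heart of most conversions is a deterministic rename pass over the original state dict. A minimal, library-free sketch — the rules and key names here are purely illustrative, not any real model's mapping:

```python
import re

# Illustrative (pattern, replacement) rules — a real script derives these by
# diffing the original checkpoint's keys against the HF model's expected keys.
KEY_RULES = [
    (r"^backbone\.", "model.vision_tower."),
    (r"\.gamma$", ".weight"),  # legacy LayerNorm naming
    (r"\.beta$", ".bias"),
]

def convert_state_dict_keys(state_dict):
    """Rename original checkpoint keys to the (assumed) HF-style names."""
    converted = {}
    for key, tensor in state_dict.items():
        new_key = key
        for pattern, repl in KEY_RULES:
            new_key = re.sub(pattern, repl, new_key)
        converted[new_key] = tensor
    return converted

print(convert_state_dict_keys({"backbone.ln.gamma": 0}))
# → {'model.vision_tower.ln.weight': 0}
```

Keeping the rules as data (rather than ad-hoc `if` chains) makes the mapping easy to review and to unit-test against a handful of known keys.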
+ +Before pushing, run: +~~~bash +make fixup +~~~ + +--- + +## Knobs that matter (3–8) + +1) Keep PRs **small** (avoid drive-by refactors) +2) Repro-first: failing test or minimal repro before changing logic +3) Run the **smallest relevant tests** first, then expand +4) Always run `make fixup` before pushing +5) For new models/features: run `make repo-consistency` +6) If docs changed: run `doc-builder build ...` +7) If slow tests changed/added: run `RUN_SLOW=1 pytest ...` +8) When changing public API: verify docs + exports + tests + +--- + +## Pitfalls & fixes + +- Can’t find where something is defined: + - confirm in `public_api.md`, then locate via `module_tree.md`, then grep +- CI fails on formatting/lint: + - run `make fixup`, then rerun failing checks +- Repo consistency fails: + - run `make repo-consistency`; if it points to copy checks, try `make fix-copies` +- Docs build fails: + - run `doc-builder build transformers docs/source/ --build_dir ...` and fix missing toctree/refs + +--- + +## Repo hotspots + +- Core library: `src/transformers/` +- Models: `src/transformers/models/` +- Pipelines: `src/transformers/pipelines/` +- Generation: `src/transformers/generation/` +- Trainer: `src/transformers/trainer.py` (+ related modules) +- Tests: `tests/` (model tests usually under `tests/models//`) +- Docs: `docs/source/` (English content commonly under `docs/source/en/`) +- Examples: `examples/` + +--- + +## Verify / locate in repo + +When uncertain, use Skill verification indexes: +- “Does this symbol/arg exist?” → `reference/generated/public_api.md` +- “Where is it implemented?” → `reference/generated/module_tree.md` diff --git a/.claude/skills/transformers-api/reference/areas/training.md b/.claude/skills/transformers-api/reference/areas/training.md new file mode 100644 index 000000000000..b07d1591dd48 --- /dev/null +++ b/.claude/skills/transformers-api/reference/areas/training.md @@ -0,0 +1,353 @@ +# Training / Fine-tuning (Trainer + Seq2SeqTrainer) + +## 
Contents +- [Scope](#scope) +- [Minimum questions to ask](#minimum-questions-to-ask) +- [Decision guide: `Trainer` vs `Seq2SeqTrainer` vs custom loop](#decision-guide-trainer-vs-seq2seqtrainer-vs-custom-loop) +- [Quickstarts](#quickstarts) + - [1) Trainer: text classification (baseline + eval)](#1-trainer-text-classification-baseline--eval) + - [2) Trainer: map/tokenize a Dataset safely (columns + labels)](#2-trainer-maptokenize-a-dataset-safely-columns--labels) + - [3) Trainer: distributed / multi-GPU launch (Accelerate/torchrun)](#3-trainer-distributed--multi-gpu-launch-acceleratetorchrun) + - [4) Trainer: image classification (non-text example; `remove_unused_columns=False`)](#4-trainer-image-classification-non-text-example-remove_unused_columnsfalse) + - [5) Trainer: custom loss (minimal override)](#5-trainer-custom-loss-minimal-override) + - [6) Trainer: evaluate/predict-only (no training)](#6-trainer-evaluatepredict-only-no-training) +- [Knobs that matter (3–8)](#knobs-that-matter-38) +- [Pitfalls & fixes](#pitfalls--fixes) +- [Column dropping and why it matters](#column-dropping-and-why-it-matters) +- [Verify / locate in repo](#verify--locate-in-repo) + +--- + +## Scope + +Use this page when the user wants to **fine-tune / train / evaluate** a model in `transformers` using `Trainer` or `Seq2SeqTrainer`. 
+ +--- + +## Minimum questions to ask + +Ask only what you need to produce a runnable snippet (0–6 questions): +1) **Task** (classification / token classification / seq2seq / causal LM / vision / audio) +2) **Model id or local path** (and `revision` if pinned) +3) **Dataset** source + columns (inputs, labels, any extra metadata needed) +4) **Backend + device** (PyTorch; CPU/CUDA/MPS; num GPUs; rough VRAM) +5) **Goal** (correctness vs speed vs memory vs reproducibility) +6) If blocked: **full traceback + exact versions** + smallest repro + +--- +## Decision guide: `Trainer` vs `Seq2SeqTrainer` vs custom loop + +### Prefer `Trainer` when… +- You want the **standard, feature-complete** training/eval loop with minimal custom code. +- Your evaluation can be done from a **forward pass** (loss/logits → `compute_metrics`), optionally with `preprocess_logits_for_metrics` to transform logits before metrics caching. +- You may still be doing seq2seq *training*, but you **don’t need `generate()` during eval/predict** (e.g., loss-based evaluation only). + +### Prefer `Seq2SeqTrainer` when… +- You’re training **sequence-to-sequence** models (e.g., summarization/translation) and want the seq2seq-adapted training path. +- You want evaluation/prediction **with generation** (`predict_with_generate=True`) so you can compute ROUGE/BLEU-style metrics from generated sequences. +- You want easy control over generation at eval/predict time (e.g., `max_length`, `num_beams`, and other `generate` kwargs). + +### Prefer a custom loop when… +- You need **nonstandard optimizer steps**, RL-style objectives, multi-stage losses, or very custom batching/updates that don’t fit cleanly into Trainer customization. +- You’re ready to write your own loop (often with **Accelerate** to avoid distributed/mixed-precision boilerplate). +--- + +## Quickstarts + +### 1. 
Trainer: text classification (baseline + eval) + +```python +import numpy as np +from datasets import load_dataset +from transformers import ( + AutoTokenizer, + AutoModelForSequenceClassification, + DataCollatorWithPadding, + TrainingArguments, + Trainer, +) + +model_id = "distilbert/distilbert-base-uncased-finetuned-sst-2-english" + +ds = load_dataset("imdb") +tok = AutoTokenizer.from_pretrained(model_id) + +def preprocess(batch): + return tok(batch["text"], truncation=True) + +tok_ds = ds.map(preprocess, batched=True, remove_columns=["text"]) + +if "label" in tok_ds["train"].column_names and "labels" not in tok_ds["train"].column_names: + tok_ds = tok_ds.rename_column("label", "labels") + +train_ds = tok_ds["train"].shuffle(seed=42).select(range(2000)) +eval_ds = tok_ds["test"].shuffle(seed=42).select(range(2000)) + +model = AutoModelForSequenceClassification.from_pretrained(model_id, num_labels=2) +collator = DataCollatorWithPadding(tokenizer=tok) + +def compute_metrics(eval_pred): + logits = eval_pred.predictions if hasattr(eval_pred, "predictions") else eval_pred[0] + labels = eval_pred.label_ids if hasattr(eval_pred, "label_ids") else eval_pred[1] + preds = np.argmax(logits, axis=-1) + return {"accuracy": float((preds == labels).mean())} + +args = TrainingArguments( + output_dir="./out_cls", + learning_rate=2e-5, + per_device_train_batch_size=8, + num_train_epochs=1, + weight_decay=0.01, + eval_strategy="no", + save_strategy="no", + load_best_model_at_end=False, + report_to="none", +) + + +trainer = Trainer( + model=model, + args=args, + train_dataset=train_ds, + eval_dataset=eval_ds, + processing_class=tok, + data_collator=collator, + compute_metrics=compute_metrics, +) + +trainer.train() +print(trainer.evaluate()) +trainer.save_model("./out_cls/final") +``` + +Notes: +- If you don’t want eval, set `eval_strategy="no"` and omit `eval_dataset`. 
+- Start by training on a small sample (e.g., 200–2,000 examples) to quickly verify the pipeline runs end-to-end before scaling to the full dataset. +--- + +### 2. Trainer: map/tokenize a Dataset safely (columns + labels) + +This checklist prevents 80% of “why is loss None / labels missing / shapes wrong” issues. + +```python +from datasets import load_dataset +from transformers import AutoTokenizer + +model_id = "distilbert/distilbert-base-uncased" +ds = load_dataset("imdb") +tok = AutoTokenizer.from_pretrained(model_id) + +def preprocess(batch): + out = tok(batch["text"], truncation=True) + out["labels"] = batch["label"] # make supervision explicit + return out + +proc = ds["train"].map(preprocess, batched=True, remove_columns=["text"]) + +ex = proc[0] +print(sorted(ex.keys())) +print("len(input_ids):", len(ex["input_ids"]), "labels:", ex["labels"]) +``` + +If you have multiple supervision fields (e.g., `start_positions`/`end_positions` or multi-task), +keep them as explicit columns and handle them via your model forward and/or `label_names` (advanced). + +--- + +### 3. Trainer: distributed / multi-GPU launch (Accelerate/torchrun) + +Trainer typically scales via the launcher you use (code often stays the same). + +**Option A: Accelerate** +```bash +accelerate config +accelerate launch train.py +``` + +**Option B: torchrun** +```bash +torchrun --nproc_per_node 2 train.py +``` + +Practical scaling knobs: +- Reduce per-device batch size and use `gradient_accumulation_steps` to keep the same global batch. +- For instability, start with fewer GPUs and confirm correctness first. + +--- + +### 4. Trainer: image classification (non-text example; `remove_unused_columns=False`) + +For vision/video, you often need the raw `image`/`video` column to build `pixel_values`. +Trainer may drop columns by default, so set `remove_unused_columns=False`. 
+
+```python
+from datasets import load_dataset
+from transformers import (
+    AutoImageProcessor,
+    AutoModelForImageClassification,
+    DefaultDataCollator,
+    TrainingArguments,
+    Trainer,
+)
+
+model_id = "google/vit-base-patch16-224"
+ds = load_dataset("beans")  # has an `image` column
+
+processor = AutoImageProcessor.from_pretrained(model_id)
+model = AutoModelForImageClassification.from_pretrained(
+    model_id,
+    num_labels=3,
+    ignore_mismatched_sizes=True,  # checkpoint head has 1000 classes; re-initialize it for 3
+)
+
+def transform(batch):
+    # `with_transform` receives batches (dicts of lists); `batch["image"]` is a list of PIL images
+    batch["pixel_values"] = [
+        processor(img.convert("RGB"), return_tensors="pt")["pixel_values"][0]
+        for img in batch["image"]
+    ]
+    if "label" in batch and "labels" not in batch:
+        batch["labels"] = batch.pop("label")
+    # drop raw columns the default collator cannot turn into tensors
+    for col in ("image", "image_file_path"):
+        batch.pop(col, None)
+    return batch
+
+train_ds = ds["train"].with_transform(transform)
+eval_ds = ds["validation"].with_transform(transform)
+
+args = TrainingArguments(
+    output_dir="./out_vit",
+    per_device_train_batch_size=8,
+    per_device_eval_batch_size=8,
+    num_train_epochs=1,
+    learning_rate=5e-5,
+    eval_strategy="epoch",
+    save_strategy="epoch",
+    remove_unused_columns=False,  # IMPORTANT for transforms that rely on raw columns
+    report_to="none",
+)
+
+trainer = Trainer(
+    model=model,
+    args=args,
+    train_dataset=train_ds,
+    eval_dataset=eval_ds,
+    processing_class=processor,
+    data_collator=DefaultDataCollator(),
+)
+
+trainer.train()
+print(trainer.evaluate())
+```
+
+---
+
+### 5. Trainer: custom loss (minimal override)
+
+Use this when you need a custom loss but want to keep Trainer’s loop.
+
+```python
+import torch
+from transformers import Trainer
+
+class CustomLossTrainer(Trainer):
+    # Recent Trainer versions also pass `num_items_in_batch`; accept it even if unused.
+    def compute_loss(self, model, inputs, return_outputs=False, num_items_in_batch=None):
+        labels = inputs.pop("labels")
+        outputs = model(**inputs)
+        logits = outputs.logits
+
+        # Example: multi-label BCE loss (labels should be float multi-hot)
+        loss = torch.nn.functional.binary_cross_entropy_with_logits(logits, labels)
+
+        return (loss, outputs) if return_outputs else loss
+```
+
+Then use it like `Trainer`:
+```python
+# trainer = CustomLossTrainer(model=..., args=..., train_dataset=..., eval_dataset=..., ...)
+```
+
+---
+
+### 6. Trainer: evaluate/predict-only (no training)
+
+Useful for smoke tests, regression checks, or “just compute metrics”.
+
+```python
+# assume you already built: trainer = Trainer(...)
+metrics = trainer.evaluate()
+print("eval:", metrics)
+
+pred = trainer.predict(trainer.eval_dataset)
+print("metrics:", pred.metrics)
+print("predictions shape:", getattr(pred.predictions, "shape", None))
+```
+
+---
+
+## Knobs that matter (3–8)
+
+Prioritize these knobs before anything else:
+
+1) **Task ↔ model head compatibility**
+   - classification → `AutoModelForSequenceClassification`
+   - seq2seq → `AutoModelForSeq2SeqLM` + `Seq2SeqTrainer`
+2) **`model` + `revision`** (pin for reproducibility)
+3) **Data correctness**
+   - label key: prefer `labels`
+   - correct dtypes/shapes (class ids vs multi-hot vs token ids)
+4) **Batching vs memory**
+   - `per_device_train_batch_size`, `gradient_accumulation_steps`
+5) **Evaluation/save cadence**
+   - `eval_strategy`, `eval_steps`, `save_strategy`, `save_steps`
+6) **Precision**
+   - `fp16` / `bf16` (if supported)
+7) **Column handling**
+   - `remove_unused_columns` (often needs `False` for vision/video or custom transforms)
+8) **Best model selection**
+   - `load_best_model_at_end`, `metric_for_best_model`, `greater_is_better`
+
+---
+
+## Pitfalls & fixes
+
+- **TypeError: unexpected keyword**
+  - `eval_strategy` → try 
`evaluation_strategy`
+  - `processing_class` → try `tokenizer`
+- **Eval enabled but no eval dataset**
+  - Provide `eval_dataset`, or set `eval_strategy="no"`.
+- **Loss is `None` / labels ignored**
+  - Ensure the label key is `labels` and its dtype matches the loss (int class ids vs float multi-hot).
+- **Trainer drops columns you still need**
+  - Set `remove_unused_columns=False` and manage inputs carefully (especially vision/video transforms).
+- **OOM**
+  - Reduce batch size, increase `gradient_accumulation_steps`, lower precision, shorten sequence lengths.
+  - For deeper tuning, route to `reference/areas/performance.md`.
+- **Very slow “time to first step”**
+  - Dataset transforms/caching/dataloader workers can dominate; start with a tiny subset and `dataloader_num_workers=0`.
+
+---
+
+## Column dropping and why it matters
+
+By default, Trainer removes dataset columns that aren’t accepted by `model.forward()`.
+
+This is usually helpful, but it can break workflows where:
+- you need raw columns to build model inputs (e.g., `image` → `pixel_values`)
+- you keep metadata columns for metrics/debugging
+
+What to do:
+- If your preprocessing happens in a dataset transform (e.g., `with_transform`) and needs raw columns:
+  - set `TrainingArguments(remove_unused_columns=False)`
+- Ensure your transform or collator produces exactly the tensors the model expects.
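The dropping logic is driven by the signature of `model.forward()`, which you can mimic to predict which columns survive. A minimal sketch (the `forward` below is a hypothetical stand-in, not a transformers model):

```python
import inspect

# Hypothetical stand-in for a model's forward(); real models accept more arguments.
def forward(input_ids=None, attention_mask=None, labels=None):
    pass

accepted = set(inspect.signature(forward).parameters)

batch = {
    "input_ids": [101, 2009, 102],
    "attention_mask": [1, 1, 1],
    "labels": 1,
    "text": "raw example text",  # metadata column
}

# With remove_unused_columns=True (the default), only forward()-compatible keys survive:
kept = {k: v for k, v in batch.items() if k in accepted}
print(sorted(kept))  # ['attention_mask', 'input_ids', 'labels']; 'text' is gone
```

If a column you need (like `text` here, or `image` for vision) disappears, that is exactly the case for `remove_unused_columns=False`.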
+ +--- + +## Verify / locate in repo + +Common repo hotspots: +- Trainer loop + internals: + - `src/transformers/trainer.py` + - `src/transformers/trainer_utils.py` + - `src/transformers/trainer_callback.py` +- Seq2Seq training: + - `src/transformers/trainer_seq2seq.py` + - `src/transformers/training_args_seq2seq.py` +- Training args + defaults: + - `src/transformers/training_args.py` +- Collators: + - `src/transformers/data/data_collator.py` +- Integrations (DeepSpeed/FSDP/etc.): + - `src/transformers/integrations/` \ No newline at end of file diff --git a/.claude/skills/transformers-api/reference/areas/troubleshooting.md b/.claude/skills/transformers-api/reference/areas/troubleshooting.md new file mode 100644 index 000000000000..b5c41ac9e74a --- /dev/null +++ b/.claude/skills/transformers-api/reference/areas/troubleshooting.md @@ -0,0 +1,343 @@ +# Troubleshooting (errors, wrong outputs, regressions) + +## Contents +- [Scope](#scope) +- [Minimum questions to ask](#minimum-questions-to-ask) +- [Decision guide: classify the failure](#decision-guide-classify-the-failure) +- [Quickstarts](#quickstarts) + - [1) Make the error actionable (logging + minimal repro)](#1-make-the-error-actionable-logging--minimal-repro) + - [2) Firewalled / offline / “Connection error”](#2-firewalled--offline--connection-error) + - [3) CUDA out of memory (OOM)](#3-cuda-out-of-memory-oom) + - [4) ImportError / missing class after copy-pasting docs](#4-importerror--missing-class-after-copy-pasting-docs) + - [5) CUDA error: device-side assert triggered](#5-cuda-error-device-side-assert-triggered) + - [6) Silent wrong output from padding tokens (missing attention_mask)](#6-silent-wrong-output-from-padding-tokens-missing-attention_mask) +- [Knobs that matter (3–8)](#knobs-that-matter-38) +- [Pitfalls & fixes](#pitfalls--fixes) +- [Triage flow (repeatable checklist)](#triage-flow-repeatable-checklist) +- [Verify / locate in repo](#verify--locate-in-repo) + +--- + +## Scope + +Use this page when 
the user is **blocked** (exception, crash, hang, or wrong output) while using `transformers`, or they suspect a regression. + +--- + +## Minimum questions to ask + +Ask only what you need (0–5 questions). If the user already pasted these, don’t re-ask. + +1) **Exact failure**: full traceback, or “expected vs actual output” +2) **Minimal repro**: smallest runnable snippet (use `templates/minimal_repro.md`) +3) **Versions**: `transformers`, backend (`torch` / TF / JAX), Python, CUDA (if relevant) +4) **Model + revision**: model id or local path; pinned `revision`/commit if applicable +5) **Hardware**: CPU/CUDA/MPS and rough VRAM if memory/perf related + +### 1-minute triage (when the user is blocked) + +1) Classify the failure (download/cache, install/version, CUDA runtime, silent correctness, task mismatch) +2) Ask at most 3 missing facts (traceback, minimal repro, versions) +3) Apply one smallest fix and one next diagnostic step + +--- + +## Decision guide: classify the failure + +Classify before fixing. Most issues fall into one of these buckets: + +1) **Download / cache / connectivity** + - “Connection error… cannot find requested files in cached path” + - hanging at model download / corporate network / firewalled machines + +2) **Install / version mismatch** + - `ImportError: cannot import name ... from transformers` + - missing newer models/features + +3) **GPU runtime / CUDA** + - CUDA OOM + - `device-side assert triggered` + - dtype/device mismatch + +4) **Silent correctness bugs** + - wrong logits/hidden states with padding + - wrong outputs due to missing masks or wrong preprocessing + +5) **Auto-class / task mismatch** + - `ValueError: Unrecognized configuration class ... for this kind of AutoModel` + - checkpoint doesn’t support the requested task + +Then apply the smallest fix + the smallest next diagnostic step. + +--- + +## Quickstarts + +### 1. 
Make the error actionable (logging + minimal repro) + +Turn up logging and isolate to a minimal repro **before** “trying random flags”. + +```python +# 1) Make transformers logs more verbose (runtime) +from transformers.utils import logging +logging.set_verbosity_debug() # or set_verbosity_info() +logging.enable_default_handler() +logging.enable_explicit_format() + +# 2) If your script is noisy, you can also: +# logging.disable_progress_bar() +``` + +If you can’t change code easily, use environment variables: + +```bash +# More/less logging without editing code: +TRANSFORMERS_VERBOSITY=debug python your_script.py +# To suppress "advice" warnings (not errors): +TRANSFORMERS_NO_ADVISORY_WARNINGS=1 python your_script.py +``` + +Now shrink to a repro: +- one model +- one input +- one forward/generate call +- print shapes/dtypes/devices right before the failure + +(Use `templates/minimal_repro.md`.) + +--- + +### 2. Firewalled / offline / “Connection error” + +Symptoms: connection errors and the cache doesn’t contain the files yet, often in restricted networks. + +Two reliable patterns: + +**A. Pre-download the repo, then run offline** + +```python +from huggingface_hub import snapshot_download + +local_path = snapshot_download( + repo_id="meta-llama/Llama-2-7b-hf", + repo_type="model", + # revision="main", # or a tag/commit for reproducibility +) +print(local_path) +``` +Note: if the model is gated or private, you must be authenticated to download files. Use `hf auth login`, or `huggingface_hub.login()`, or pass `token=...` to loading/downloading methods (including `snapshot_download()` / `from_pretrained()`). + + +```bash +# Avoid HTTP calls to the Hub: +HF_HUB_OFFLINE=1 python your_script.py +``` + +**B. 
Force local-only loading (no network calls)**
+
+```python
+from transformers import AutoModel
+
+model = AutoModel.from_pretrained("./path/to/local/directory", local_files_only=True)
+```
+
+Also sanity-check cache location if you’re in containers/CI:
+- Default cache location is `~/.cache/huggingface/hub`
+  - Windows: `C:\Users\<username>\.cache\huggingface\hub`
+- You can redirect the cache via environment variables (priority order):
+  1) `HF_HUB_CACHE` (sets the hub cache path directly; takes precedence)
+  2) `HF_HOME` (cache lands under `$HF_HOME/hub`)
+  3) `XDG_CACHE_HOME` + `/huggingface` (only if `HF_HOME` is not set)
+
+---
+
+### 3. CUDA out of memory (OOM)
+
+Start with the two levers recommended in the official Transformers troubleshooting guide (training):
+- Reduce `per_device_train_batch_size`
+- Increase `gradient_accumulation_steps` to keep the same overall batch size
+
+```python
+# Trainer-side (example)
+from transformers import TrainingArguments
+
+args = TrainingArguments(
+    output_dir="out",
+    per_device_train_batch_size=1,
+    gradient_accumulation_steps=8,
+)
+```
+
+Common additional levers: reduce inference `batch_size`, reduce `max_length` / `max_new_tokens`, and avoid returning activation-heavy outputs (like hidden states) unless needed.
+
+---
+
+### 4. ImportError / missing class after copy-pasting docs
+
+Symptom example:
+
+`ImportError: cannot import name 'SomeNewThing' from 'transformers'`
+
+This commonly means the docs/snippet assumes a newer version of Transformers.
+
+Fix: upgrade Transformers (and restart the runtime/kernel):
+
+```bash
+pip install --upgrade transformers
+# or install from source (latest changes):
+pip install git+https://github.com/huggingface/transformers
+```
+
+If the model is *very new*, verify you’re on a version that includes it, or install from source.
+
+---
+
+### 5. CUDA error: device-side assert triggered
+
+This is often a vague GPU-side error. Two reliable ways to get a real traceback:
+
+**A. 
Run on CPU to get a better error message** + +```python +# Important: set this before any CUDA context is initialized +import os +os.environ["CUDA_VISIBLE_DEVICES"] = "" # forces CPU +``` + +**B. Force synchronous CUDA to pinpoint the failing op** + +```python +# Important: set this before the first CUDA operation +import os +os.environ["CUDA_LAUNCH_BLOCKING"] = "1" +``` + +Once you have a real stack trace, the most common underlying causes are: +- invalid labels / out-of-range class indices (classification) +- bad token ids (negative or >= vocab size) +- shape mismatches that only surface on GPU kernels + +--- + +### 6. Silent wrong output from padding tokens (missing attention_mask) + +Symptom: outputs/logits differ for padded sequences vs the “true” unpadded sequence, without an obvious error. + +Most of the time, fix by passing `attention_mask` so the model ignores padding tokens: + +```python +import torch +from transformers import AutoModelForSequenceClassification + +model = AutoModelForSequenceClassification.from_pretrained("google-bert/bert-base-uncased") + +# Two sequences, second is padded with 0 +input_ids = torch.tensor([ + [7592, 2057, 2097, 2393, 9611, 2115], + [7592, 0, 0, 0, 0, 0], +]) + +# Correct: mask out padding +attention_mask = torch.tensor([ + [1, 1, 1, 1, 1, 1], + [1, 0, 0, 0, 0, 0], +]) + +out = model(input_ids, attention_mask=attention_mask) +print(out.logits) +``` + +Note: tokenizers often create `attention_mask` for you when you call them, but if you bypass tokenizers and hand-craft `input_ids`, you must provide the mask yourself. + +Why it’s manual: Transformers does not automatically infer `attention_mask` from padding because some models have no padding token, and some use-cases intentionally attend to padding tokens. 
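When you do hand-craft `input_ids`, the mask can be derived from the padding value. A framework-agnostic sketch using plain lists (assumes pad id `0`, matching the example above; wrap the lists in `torch.tensor(...)` before calling the model):

```python
pad_token_id = 0  # assumption: 0 is the padding id (true for BERT-style vocabularies)

input_ids = [
    [7592, 2057, 2097, 2393, 9611, 2115],
    [7592, 0, 0, 0, 0, 0],
]

# 1 = attend to this position, 0 = ignore (padding)
attention_mask = [[int(tok != pad_token_id) for tok in seq] for seq in input_ids]
print(attention_mask)  # [[1, 1, 1, 1, 1, 1], [1, 0, 0, 0, 0, 0]]
```

This only works when the pad id never appears as a real token in the sequence; prefer the tokenizer-produced mask whenever you have one.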
+ +--- + +## Knobs that matter (3–8) + +Prioritize these knobs before anything else: + +1) **Versions**: `transformers` + backend framework version (Torch/TF/JAX) +2) **Model identity**: model id/path + pinned `revision` (reproducibility) +3) **Connectivity mode**: `HF_HUB_OFFLINE`, `local_files_only=True`, cache location env vars +4) **Device placement**: CPU vs CUDA vs MPS; single device vs sharding (`device_map`) when relevant +5) **Batch/shape**: `batch_size`, sequence length, image size, audio length +6) **Masks**: `attention_mask` (text), pixel masks where applicable +7) **Task ↔ class match**: correct `AutoModelFor*` / pipeline task for the checkpoint +8) **Logging**: `TRANSFORMERS_VERBOSITY`, explicit formatting, disable noisy progress bars + +--- + +## Pitfalls & fixes + +- **“Connection error… cannot find requested files in cached path”** + - You’re firewalled/offline and the model isn’t cached → pre-download (`snapshot_download`) then set `HF_HUB_OFFLINE=1`, or use `local_files_only=True`. + +- **ImportError for a class shown in docs** + - You’re on an older Transformers → upgrade or install from source. + +- **OOM** + - Lower batch/length first; then route to `reference/areas/performance.md`. + +- **CUDA device-side assert** + - Run on CPU or set `CUDA_LAUNCH_BLOCKING=1` to get a real traceback; then validate label/token id ranges. + +- **Wrong outputs with padding** + - Pass `attention_mask` (especially when you create `input_ids` manually). + +- **AutoModel config mismatch** + - The checkpoint configuration cannot be mapped to the requested task head (most commonly because the checkpoint does not support that task) → load with a compatible `AutoModel*` or choose a checkpoint that supports the task. 
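For the device-side assert pitfall above, a CPU-side range check on ids often finds the culprit before any GPU run. A minimal sketch (the `vocab_size`/`num_labels` values are assumptions; read the real ones from `model.config`):

```python
def check_ids(name, ids, low, high):
    # Out-of-range ids are a classic cause of CUDA device-side asserts.
    bad = [i for i in ids if not (low <= i < high)]
    if bad:
        raise ValueError(f"{name}: {len(bad)} id(s) outside [{low}, {high}): {bad[:5]}")

vocab_size, num_labels = 30522, 2  # assumed values; use model.config.vocab_size / num_labels

check_ids("input_ids", [101, 7592, 102], 0, vocab_size)
check_ids("labels", [0, 1, 1], 0, num_labels)
print("id ranges OK")
```

Run it over a few batches from your dataloader; a single out-of-range label or token id is enough to crash a GPU kernel.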
+ +--- + +## Triage flow (repeatable checklist) + +Use this flow to avoid random guessing: + +1) **Freeze the environment** + - record versions + model id + revision/commit + - re-run in a clean venv if dependency conflicts are suspected + +2) **Minimize** + - one model + - one batch + - one call (forward or generate) + - print shapes/dtypes/devices right before the failure + +3) **Classify** + - download/cache vs install/version vs CUDA runtime vs silent correctness vs task mismatch + +4) **Apply the smallest fix** + - one change at a time, re-run the minimal repro + +5) **Only then expand** + - re-introduce batching, datasets, distributed, larger inputs, etc. + +6) **If you suspect a regression** + - try the same repro on a known-good version and the current version + - pin the version in the repro so others can reproduce it + +--- + +## Verify / locate in repo + +When uncertain, use Skill verification indexes: +- “Does this symbol/arg exist?” → `reference/generated/public_api.md` +- “Where is it implemented?” → `reference/generated/module_tree.md` + +Common repo hotspots (for debugging “why is this happening?”): +- Central logging utilities: `src/transformers/utils/logging.py` +- Import/version gating: `src/transformers/utils/import_utils.py` +- Model loading + weight init: `src/transformers/modeling_utils.py` +- Auto class mappings: + - `src/transformers/models/auto/modeling_auto.py` + - `src/transformers/models/auto/configuration_auto.py` +- Pipelines core: + - `src/transformers/pipelines/__init__.py` + - `src/transformers/pipelines/base.py` + +If you can’t verify quickly: +- say what you *did* verify, +- name the most likely file to inspect next, +- provide 1–3 grep keywords based on the error string. 
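Picking those grep keywords can be mechanized: quote-delimited names and CamelCase identifiers in the error line are usually the best starting points. A rough heuristic sketch (not a transformers API):

```python
import re

def grep_keywords(error_line: str, limit: int = 3):
    # Prefer names the library itself quoted, then CamelCase identifiers.
    quoted = re.findall(r"'([^']+)'", error_line)
    idents = re.findall(r"\b[A-Z][A-Za-z0-9_]{3,}\b", error_line)
    out = []
    for kw in quoted + idents:
        if kw not in out:
            out.append(kw)
    return out[:limit]

msg = "ImportError: cannot import name 'SomeNewThing' from 'transformers'"
print(grep_keywords(msg))  # ['SomeNewThing', 'transformers', 'ImportError']
```

Feed the results to `git grep -n` inside `src/transformers` to locate the owning module.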
\ No newline at end of file
diff --git a/.claude/skills/transformers-api/reference/generated/module_tree.md b/.claude/skills/transformers-api/reference/generated/module_tree.md
new file mode 100644
index 000000000000..2c766aac8a4a
--- /dev/null
+++ b/.claude/skills/transformers-api/reference/generated/module_tree.md
@@ -0,0 +1,553 @@
+# Transformers `src/transformers/` module tree (curated) — **v4.57.6**
+
+> **Purpose**: Fast repo navigation for Transformers API without guessing.
+> **Pinned revision (current)**: `transformers==4.57.6` (PyPI release: **2026-01-16**).
+> **Design goal**:
+> - Prefer **patterns + canonical entry points + grep keywords** over enumerating every file.
+> - Treat this as **generated**: pin a Transformers revision (tag/commit or exact PyPI version) and regenerate on upgrades.
+> **Not exhaustive**: For model-specific code, use the `models/<model>/` patterns and grep tips.
+
+---
+
+## How to use this file
+
+1. Pick the **surface area** below (Loading, Preprocessing, Generation, Pipelines, Training, Integrations/Quantization, Export/ONNX, CLI).
+2. Jump to the **canonical entry point(s)** and search there.
+3. If you need the exact implementation:
+   - `git grep -n "<keyword>" src/transformers` (keywords provided per area)
+   - follow imports into submodules
+
+---
+
+## Core package entry points
+
+```
+src/transformers/
+__init__.py
+dependency_versions_check.py
+dependency_versions_table.py
+```
+
+- `__init__.py` is the public import surface (re-exports / lazy-import wiring for `from transformers import X`).
+- `dependency_versions_check.py` is where import-time version guards often trigger.
+ +Grep keywords: +- `_LazyModule` +- `dependency_versions_check` +- `require_version` + +--- + +## Configuration, modeling, and loading (PyTorch) + +Canonical entry points: + +``` + +src/transformers/ +configuration_utils.py +modeling_utils.py +pytorch_utils.py +modeling_outputs.py +modeling_layers.py + +``` + +Primary responsibilities: +- `PreTrainedConfig` (config serialization, `from_pretrained` for configs, validation helpers) +- `PreTrainedModel` (weight loading/saving, `from_pretrained` for models, sharding, tying weights) +- torch helpers + shared model output dataclasses/layers + +Grep keywords: +- `from_pretrained(` +- `save_pretrained(` +- `get_checkpoint_shard_files` +- `tie_weights` +- `state_dict` + +Related (often on stack traces): + +``` + +src/transformers/ +dynamic_module_utils.py + +src/transformers/utils/ +hub.py +import_utils.py + +``` + +- `dynamic_module_utils.py` is where `trust_remote_code` plumbing typically lands. +- `utils/hub.py` is where Hub/caching helpers like `cached_file` and shard resolution live. +- `utils/import_utils.py` is lazy-import + optional dependency gating. 
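When grep is unavailable (e.g., a pip-installed package rather than a checkout), Python introspection answers the same "which file owns X?" question. A small sketch, demonstrated on a stdlib function; the same call works for any importable symbol, such as `PreTrainedModel.from_pretrained` in an environment with transformers installed:

```python
import inspect
import json

def locate(obj):
    # Returns the defining file and the first line number of the definition.
    return inspect.getsourcefile(obj), inspect.getsourcelines(obj)[1]

# Stdlib demo; with transformers installed you would do e.g.:
#   from transformers import PreTrainedModel
#   print(locate(PreTrainedModel.from_pretrained))
path, line = locate(json.dumps)
print(path, line)
```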
+
+---
+
+## Tokenization and preprocessing (text / vision / audio / video / multimodal)
+
+### Tokenization (slow + fast)
+
+Canonical entry points:
+
+```
+src/transformers/
+tokenization_utils_base.py
+tokenization_utils.py
+tokenization_utils_fast.py
+tokenization_mistral_common.py
+```
+
+Notes:
+- Slow (Python) tokenizers: `tokenization_utils.py`
+- Fast tokenizers (Rust `tokenizers` wrappers): `tokenization_utils_fast.py`
+- Shared bases: `tokenization_utils_base.py`
+- Newer/common helpers for Mistral ecosystem: `tokenization_mistral_common.py`
+
+Grep keywords:
+- `PreTrainedTokenizerBase`
+- `BatchEncoding`
+- `AutoTokenizer`
+- `TokenizerFast`
+- `convert_tokens_to_ids`
+
+Related conversion helpers:
+
+```
+src/transformers/
+convert_slow_tokenizer.py
+convert_slow_tokenizers_checkpoints_to_fast.py
+```
+
+Grep keywords:
+- `convert_slow_tokenizer`
+- `SpmConverter`
+- `sentencepiece`
+
+### Processors / image / feature extraction / audio / video
+
+Canonical entry points:
+
+```
+src/transformers/
+processing_utils.py
+feature_extraction_utils.py
+feature_extraction_sequence_utils.py
+image_processing_base.py
+image_processing_utils.py
+image_processing_utils_fast.py
+image_transforms.py
+image_utils.py
+audio_utils.py
+video_processing_utils.py
+video_utils.py
+```
+
+Primary responsibilities:
+- Processor composition (combining tokenizer + modality preprocessors)
+- Feature extractors and base contracts
+- Image processing base classes + shared image transforms/utils
+- Audio/video helpers used by processors and pipelines
+
+Grep keywords:
+- `AutoProcessor`
+- `ProcessorMixin`
+- `FeatureExtractionMixin`
+- `ImageProcessingMixin`
+- `VideoProcessingMixin`
+
+---
+
+## Generation (text generation / decoding / streaming)
+
+Canonical entry points:
+
+```
+src/transformers/generation/
+configuration_utils.py
+utils.py
+logits_process.py
+stopping_criteria.py
+streamers.py
+beam_search.py
+beam_constraints.py
+candidate_generator.py
+watermarking.py
+
+# Cache utilities used by generation (and models)
+
+src/transformers/
+cache_utils.py
+```
+
+Primary responsibilities:
+- `GenerationConfig` (defaults + `generation_config.json` serialization)
+- `GenerationMixin.generate()` (PyTorch generation loop)
+- Logits processors/warpers, stopping criteria, streamers
+- Beam search + constraints, candidate generation helpers, watermarking
+- KV cache helpers (`cache_utils.py`)
+
+Grep keywords:
+- `class GenerationMixin`
+- `def generate(`
+- `LogitsProcessor`
+- `StoppingCriteria`
+- `TextStreamer`
+- `DynamicCache` / `StaticCache`
+
+---
+
+## Pipelines (high-level inference)
+
+Canonical entry points:
+
+```
+src/transformers/pipelines/
+__init__.py
+base.py
+```
+
+Notes:
+- `pipelines/__init__.py` defines the task registry and the `pipeline()` entry point.
+- `pipelines/base.py` contains the core `Pipeline` base class and shared inference glue.
+- Task-specific pipelines typically follow `pipelines/<task>.py`.
+
+Grep keywords:
+- `class Pipeline`
+- `pipeline(`
+- `SUPPORTED_TASKS`
+
+---
+
+## Training / evaluation (Trainer)
+
+Canonical entry points:
+
+```
+src/transformers/
+trainer.py
+trainer_seq2seq.py
+trainer_callback.py
+trainer_utils.py
+trainer_pt_utils.py
+training_args.py
+training_args_seq2seq.py
+optimization.py
+
+src/transformers/data/
+__init__.py
+data_collator.py
+```
+
+Primary responsibilities:
+- `Trainer` training/eval loops, logging, checkpointing
+- callback system
+- `TrainingArguments` and helper utilities
+- optimizer/scheduler helpers (`optimization.py`)
+- data collators
+
+Grep keywords:
+- `class Trainer`
+- `TrainingArguments`
+- `def training_step(`
+- `CallbackHandler`
+- `get_scheduler`
+- `DataCollator`
+
+---
+
+## Auto classes (model/config/tokenizer/processor dispatch)
+
+Canonical entry points:
+
+```
+src/transformers/models/auto/
+configuration_auto.py
+modeling_auto.py
+modeling_tf_auto.py
+modeling_flax_auto.py
+tokenization_auto.py
+processing_auto.py
+feature_extraction_auto.py
+image_processing_auto.py
+video_processing_auto.py
+auto_factory.py
+```
+
+Primary responsibilities:
+- mapping tables from `model_type` / config class → model/tokenizer/processor classes
+- common auto-loading errors are raised from the Auto* dispatch stack (often `configuration_auto.py` / `auto_factory.py`)
+
+Grep keywords:
+- `MODEL_MAPPING`
+- `CONFIG_MAPPING`
+- `TOKENIZER_MAPPING`
+- `PROCESSOR_MAPPING`
+- `model_type`
+
+---
+
+## Models (per-architecture packages)
+
+**Pattern (model implementations):**
+
+```
+src/transformers/models/<model>/
+configuration_<model>.py
+modeling_<model>.py
+modeling_tf_<model>.py         # optional
+modeling_flax_<model>.py       # optional
+tokenization_<model>.py        # optional
+tokenization_<model>_fast.py   # optional
+processing_<model>.py          # optional
+image_processing_<model>.py    # optional
+feature_extraction_<model>.py  # optional
+generation_<model>.py          # optional (model-specific generation helpers)
+
+# sometimes: video_processing_<model>.py, etc.
+```
+
+Handy anchors (examples you’ll often see):
+
+```
+src/transformers/models/bert/modeling_bert.py
+src/transformers/models/t5/modeling_t5.py
+src/transformers/models/llama/modeling_llama.py
+src/transformers/models/qwen2/modeling_qwen2.py
+src/transformers/models/clip/modeling_clip.py
+```
+
+Grep keywords:
+- `class .*Model`
+- `class .*PreTrainedModel`
+- `config_class`
+
+---
+
+## Performance / kernels / attention backends (common “why is this slow / different?”)
+
+Canonical entry points:
+
+```
+src/transformers/
+modeling_attn_mask_utils.py
+modeling_flash_attention_utils.py
+modeling_rope_utils.py
+modeling_gguf_pytorch_utils.py
+```
+
+Related integration shims (backend-specific routing often lives here):
+
+```
+src/transformers/integrations/
+flash_attention.py
+flex_attention.py
+sdpa_attention.py
+tensor_parallel.py
+```
+
+Grep keywords:
+- `flash_attention`
+- `scaled_dot_product_attention`
+- `sdpa`
+- `use_flash_attention`
+- `gguf`
+
+---
+
+## Utilities and internals
+
+Canonical entry points (frequently involved in stack traces):
+
+```
+src/transformers/utils/
+import_utils.py
+hub.py
+logging.py
+versions.py
+generic.py
+doc.py
+chat_template_utils.py
+peft_utils.py
+quantization_config.py
+
+src/transformers/
+file_utils.py
+debug_utils.py
+testing_utils.py
+```
+
+Primary responsibilities:
+- Lazy import mechanics and optional dependency gating
+- Hub caching/download helpers used by `from_pretrained`
+- logging + version utilities
+- docstring tooling and generic helpers
+- chat template parsing/formatting helpers
+- PEFT helper glue
+- quantization config objects
+- legacy helpers (`file_utils.py`) + debugging/testing utilities
+
+Grep keywords:
+- `_LazyModule`
+- `requires_backends`
+- `is_torch_available`
+- `cached_file`
+- `apply_chat_template`
+- `BitsAndBytesConfig`
+
+---
+
+## Integrations and quantization
+
+### Integrations (external libs + runtimes)
+
+Canonical entry points:
+
+```
+src/transformers/integrations/
+integration_utils.py +accelerate.py +deepspeed.py +fsdp.py +peft.py +bitsandbytes.py +tiktoken.py +awq.py +quanto.py +``` + +What lives here: +- external library shims (Accelerate/DeepSpeed/FSDP/PEFT) +- tokenizer backends (e.g., tiktoken) and quant backends (AWQ/Quanto/etc.) +- backend-specific feature routing + capability checks + +Grep keywords: +- `requires_backends` +- `is_accelerate_available` +- `is_deepspeed_available` +- `is_bitsandbytes_available` +- `device_map` + +### Quantizers (unified quantization abstraction) + +Canonical entry points: + +``` +src/transformers/quantizers/ +auto.py +base.py +quantizers_utils.py +quantizer_bnb_4bit.py +quantizer_bnb_8bit.py +quantizer_awq.py +quantizer_gptq.py +quantizer_quanto.py + + +src/transformers/utils/ +quantization_config.py +``` + +Grep keywords: +- `HfQuantizer` +- `quant_method` +- `BitsAndBytesConfig` +- `load_in_4bit` / `load_in_8bit` +- `AutoHfQuantizer` + +--- + +## Export / ONNX + +Canonical entry points: + +``` +src/transformers/ +convert_graph_to_onnx.py +src/transformers/onnx/ +**main**.py +config.py +convert.py +features.py +utils.py +``` + +Grep keywords: +- `OnnxConfig` +- `export` +- `opset` +- `transformers.onnx` + +--- + +## CLI / repo tooling (developer workflows) + +Canonical entry points: + +``` +src/transformers/commands/ +transformers_cli.py +chat.py +serving.py +add_new_model_like.py +add_fast_image_processor.py +convert.py +download.py +env.py +run.py +train.py +``` + +Notes: +- `transformers_cli.py` is the CLI dispatcher. +- `chat.py` implements `transformers chat ...` +- `serving.py` implements `transformers serve ...` + +Grep keywords: +- `main(` +- `argparse` +- `transformers chat` +- `transformers serve` +- `add_new_model_like` + +--- + +## Production notes (for Skills maintainers) + +1. **Pin Transformers**: tie generated references to a specific tag/commit or exact PyPI version. +2. 
**Regenerate on upgrade**: when bumping Transformers, regenerate this map alongside any other generated references. +3. **Keep this file curated**: add new *canonical entry points* as Transformers evolves—don’t mirror the full repo tree. +4. **Security**: if you ship scripts alongside Skills, keep them least-privilege and auditable. + +--- + +## Quick “where is X implemented?” cheat sheet + +| User asks about… | Start here | Then follow into… | +|---|---|---| +| `pipeline()` / task pipelines | `src/transformers/pipelines/__init__.py` | `pipelines/base.py` + task file | +| `AutoModel*` / auto dispatch | `src/transformers/models/auto/modeling_auto.py` | `auto_factory.py` + model subpackage | +| `AutoTokenizer` | `src/transformers/models/auto/tokenization_auto.py` | model tokenizer module | +| `AutoProcessor` | `src/transformers/models/auto/processing_auto.py` | model processor module | +| `from_pretrained` (models) | `src/transformers/modeling_utils.py` | then `src/transformers/utils/hub.py` (caching/shards) | +| `from_pretrained` (configs) | `src/transformers/configuration_utils.py` | config subclass in model subpackage | +| `generate()` behavior | `src/transformers/generation/utils.py` | logits/stopping/streamers + beam/candidate helpers | +| stopping criteria / stop strings | `src/transformers/generation/stopping_criteria.py` | called from generation utils | +| KV cache / caching behavior | `src/transformers/cache_utils.py` | used by generation + some models | +| quantization (general) | `src/transformers/quantizers/auto.py` | specific `quantizer_*.py` + `utils/quantization_config.py` | +| bitsandbytes 4-bit/8-bit | `src/transformers/integrations/bitsandbytes.py` | `quantizers/quantizer_bnb_*.py` | +| `Trainer` loop / callbacks | `src/transformers/trainer.py` | `trainer_callback.py`, `trainer_utils.py` | +| schedulers / optim helpers | `src/transformers/optimization.py` | used from Trainer / scripts | +| data collators | `src/transformers/data/data_collator.py` | 
task-specific collator classes |
+| ONNX export | `src/transformers/onnx/convert.py` | `onnx/config.py` + `onnx/features.py` |
+| CLI: `transformers chat` | `src/transformers/commands/chat.py` | `commands/transformers_cli.py` |
+| CLI: `transformers serve` | `src/transformers/commands/serving.py` | `commands/transformers_cli.py` |
\ No newline at end of file
diff --git a/.claude/skills/transformers-api/reference/generated/public_api.md b/.claude/skills/transformers-api/reference/generated/public_api.md
new file mode 100644
index 000000000000..eed1c1970628
--- /dev/null
+++ b/.claude/skills/transformers-api/reference/generated/public_api.md
@@ -0,0 +1,572 @@
+# Transformers Public API (Verification Guide)
+
+## Table of Contents
+
+1. [Definition of “Public API”](#1-definition-of-public-api)
+2. [Version Discipline](#2-version-discipline)
+3. [Mandatory Verification Workflow](#3-mandatory-verification-workflow)
+4. [Public API Surfaces (by Area)](#4-public-api-surfaces-by-area)
+   - 4.1 [Inference](#41-inference)
+   - 4.2 [Preprocessing](#42-preprocessing)
+   - 4.3 [Model Loading & Base Classes](#43-model-loading--base-classes)
+   - 4.4 [Generation](#44-generation)
+   - 4.5 [Training / Evaluation](#45-training--evaluation)
+   - 4.6 [Performance / Quantization](#46-performance--quantization)
+   - 4.7 [Export / Serving](#47-export--serving)
+5. [Deprecations & Compatibility Traps (Verify, Don’t Assume)](#5-deprecations--compatibility-traps-verify-dont-assume)
+6. [Model Artifact Files (On-Disk Reality Check)](#6-model-artifact-files-on-disk-reality-check)
+7. [Regeneration Strategy (Keep This File Correct)](#7-regeneration-strategy-keep-this-file-correct)
+8. [Minimal Repro Template (Copy/Paste)](#8-minimal-repro-template-copypaste)
+
+---
+
+## 1. Definition of “Public API”
+
+An API surface in `transformers` is considered **public** if **at least one** of the following is true:
+
+1. 
It is importable directly from the top-level package: + ```python + from transformers import X + ``` +2. It is explicitly documented in the official Hugging Face Transformers documentation (e.g., “Main classes”, “Pipelines”, “Trainer”, “Generation”). +3. It is a documented CLI, configuration file, or runtime behavior supported in the installed version. + +Everything else is **implementation detail** and must not be treated as stable or user-facing. + +**Explicitly non-public by default (unless docs say otherwise):** +- `transformers.models.*` +- deep imports from `transformers.generation.*` (treat as internal **unless explicitly documented as public** and/or importable from `transformers`) +- `transformers.pipelines.*` internals +- anything in `transformers.utils.*` that is not documented as public + +**Production rule:** +If you can’t +(a) import it from `transformers` OR +(b) find it in the official docs for the target version OR +(c) verify it by runtime introspection, **do not present it as supported**. + +--- + +## 2. Version Discipline + +### 2.1 Pin versions (required) +For production systems, pin **all** of: +- `transformers` (exact version or exact git commit) +- backend framework (`torch` / `tensorflow` / `jax`) version +- key accelerators if used (e.g., `accelerate`, quantization libs, ONNX runtimes) + +### 2.2 Record environment fingerprint (required) +Any debugging request must include: +- `transformers.__version__` +- backend + version +- device (CPU/CUDA/MPS) + CUDA version if applicable + +Minimal snippet: +```python +import transformers +print("transformers:", transformers.__version__) + +try: + import torch + print("torch:", torch.__version__) + print("cuda available:", torch.cuda.is_available()) + print("cuda version:", getattr(torch.version, "cuda", None)) +except Exception as e: + print("torch not available:", repr(e)) +``` + +--- + +## 3. Mandatory Verification Workflow + +This is the *only* safe way to answer “does this exist?” questions. 
+ +### 3.1 Verify a top-level symbol exists +```python +import transformers + +def verify_symbol(name: str) -> None: + ok = hasattr(transformers, name) + print(f"{name}: {'OK' if ok else 'MISSING'}") + +for name in [ + "pipeline", + "AutoTokenizer", + "AutoModel", + "Trainer", + "TrainingArguments", + "GenerationConfig", +]: + verify_symbol(name) +``` + +**If missing:** +- Do not guess alternatives. +- Use discovery helpers (below), then present only what is verifiably present. + +### 3.2 Verify an argument exists (inspect signature) +Never claim a kwarg exists without checking the signature in the user’s environment. + +```python +import inspect +from transformers import AutoModel + +sig = inspect.signature(AutoModel.from_pretrained) +print(sig) + +def has_kwarg(fn, kw: str) -> bool: + return kw in inspect.signature(fn).parameters + +print("has token?", has_kwarg(AutoModel.from_pretrained, "token")) +print("has use_auth_token?", has_kwarg(AutoModel.from_pretrained, "use_auth_token")) +``` + +**Rule:** If the kwarg is not in the signature, do not instruct users to pass it. + +### 3.3 Discover available “Auto*” and “Config” classes +Different versions ship different helpers. Discover dynamically: + +```python +import transformers + +def list_names(prefix: str): + return sorted([n for n in dir(transformers) if n.startswith(prefix)]) + +print("Auto*:", list_names("Auto")[:80]) +print("... (total)", len(list_names("Auto"))) + +print("*Config:", [n for n in dir(transformers) if n.endswith("Config")][:80]) +``` + +### 3.4 Verify runtime behavior with a minimal forward / generate +A symbol can exist but still fail due to missing extras, device issues, or incompatible model files. 
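Before running the sanity checks, it can help to probe optional backends without importing them; `importlib.util.find_spec` reports whether a package is installed without triggering slow imports or import-time side effects. A minimal sketch (the package list below is illustrative, not exhaustive):

```python
import importlib.util

# Probe optional dependencies without importing them: find_spec only checks
# that the package is installed, so it is cheap and side-effect free.
OPTIONAL_PACKAGES = ["torch", "tensorflow", "jax", "accelerate", "safetensors", "bitsandbytes"]

for pkg in OPTIONAL_PACKAGES:
    installed = importlib.util.find_spec(pkg) is not None
    print(f"{pkg}: {'installed' if installed else 'MISSING'}")
```

If a backend shows as MISSING, install it before interpreting an import-time failure as a Transformers bug.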
+ +**Forward sanity check:** +```python +from transformers import AutoTokenizer, AutoModel +import torch + +model_id = "distilbert-base-uncased" # replace +tok = AutoTokenizer.from_pretrained(model_id) +model = AutoModel.from_pretrained(model_id) + +inputs = tok("hello world", return_tensors="pt") +with torch.no_grad(): + out = model(**inputs) +print(type(out)) +``` + +**Generate sanity check (only for causal/seq2seq models):** +```python +from transformers import AutoTokenizer, AutoModelForCausalLM +import torch + +model_id = "gpt2" # replace +tok = AutoTokenizer.from_pretrained(model_id) +model = AutoModelForCausalLM.from_pretrained(model_id) + +inputs = tok("Hello", return_tensors="pt") +with torch.no_grad(): + ids = model.generate(**inputs, max_new_tokens=10) +print(tok.decode(ids[0], skip_special_tokens=True)) +``` + +--- + +## 4. Public API Surfaces (by Area) + +**Important:** The lists below are “common public entry points”, not a guarantee for every version. +Always run [Section 3](#3-mandatory-verification-workflow) in the user’s environment. + +### 4.1 Inference + +**Canonical entry point** +```python +from transformers import pipeline +``` + +**Verify supported tasks in the install** +```python +# Verify supported pipeline tasks WITHOUT assuming a specific registry constant exists. +from transformers import pipelines + +# Prefer the documented registry if present (custom pipeline docs point to PIPELINE_REGISTRY), +# but fall back gracefully if the installed version uses something else. +if hasattr(pipelines, "PIPELINE_REGISTRY"): + reg = pipelines.PIPELINE_REGISTRY + + # Try a few common ways a registry might expose tasks, but only use what actually exists. 
+ for cand in ["get_supported_tasks", "supported_tasks", "SUPPORTED_TASKS"]: + if hasattr(reg, cand): + obj = getattr(reg, cand) + tasks = obj() if callable(obj) else obj + print("num tasks:", len(tasks)) + print("example tasks:", sorted(list(tasks))[:30]) + break + else: + print("PIPELINE_REGISTRY present; inspect it for task listing:", [n for n in dir(reg) if "task" in n.lower()]) + +elif hasattr(pipelines, "SUPPORTED_TASKS"): + tasks = pipelines.SUPPORTED_TASKS + print("num tasks:", len(tasks)) + print("example tasks:", sorted(tasks.keys())[:30]) + +else: + print("No known pipeline task registry found; inspect transformers.pipelines:", [n for n in dir(pipelines) if "task" in n.lower()]) +``` + +**Pitfalls & fixes** +- If a pipeline task errors with “unknown task”: list the supported tasks (see the snippet above) and pick an available task name. +- If the pipeline tries to download unexpected files: confirm model id/path + revision, and verify local directory contents. + +**Knobs likely to matter** +- `device` / `device_map` +- `dtype` (or `torch_dtype` in older installs, **inspect `inspect.signature(transformers.pipeline)`** before recommending) +- `batch_size` +- `max_length` / `truncation` / `padding` (varies by pipeline) +- model-specific kwargs (must be verified) + +--- + +### 4.2 Preprocessing + +**Canonical entry points** +```python +from transformers import AutoTokenizer, AutoProcessor +``` + +Depending on modality and version, these may or may not exist: +- `AutoImageProcessor` +- `AutoFeatureExtractor` +- `AutoVideoProcessor` + +**Verify availability** +```python +import transformers +for name in ["AutoTokenizer", "AutoProcessor", "AutoImageProcessor", "AutoFeatureExtractor", "AutoVideoProcessor"]: + print(name, hasattr(transformers, name)) +``` + +**Pitfalls & fixes** +- “Tokenizer class not found”: verify model repo contains tokenizer artifacts (see Section 6) and that you’re using the right Auto* loader.
+- “Padding/truncation mismatch”: set `padding=True/False`, `truncation=True/False`, and confirm expected tensor shapes. + +**Knobs likely to matter** +- `padding`, `truncation`, `max_length` +- `return_tensors` (`"pt"`, `"tf"`, `"np"`) +- modality-specific preprocessing params (verify via processor docs or runtime inspection) + +--- + +### 4.3 Model Loading & Base Classes + +**Canonical entry points** +```python +from transformers import AutoConfig, AutoModel +``` + +Task-specific autos typically exist as `AutoModelFor*` classes, but do not assume which ones. +Discover in the user’s environment: + +```python +import transformers +heads = sorted([n for n in dir(transformers) if n.startswith("AutoModelFor")]) +print("AutoModelFor* count:", len(heads)) +print("sample:", heads[:40]) +``` + +**Base classes (commonly public)** +```python +from transformers import PreTrainedModel, PretrainedConfig +``` + +Note the inconsistent casing: the config base class is `PretrainedConfig` (lowercase “t”) in most releases, while the model base class is `PreTrainedModel`. Verify with `hasattr` if in doubt. + +**Pitfalls & fixes** +- “Unrecognized model type”: verify `config.json` has `model_type`, and that the installed `transformers` supports it. +- “Missing weights”: confirm `model.safetensors` / shards exist and match index file if sharded.
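For the “Missing weights” case, a sketch that checks a local checkpoint directory: the helper name and the synthetic directory below are illustrative, but the file names (`model.safetensors.index.json` and its `weight_map`) follow the standard sharded-checkpoint layout.

```python
import json
import tempfile
from pathlib import Path

def check_weight_files(model_dir):
    """Return a list of problems found with weight files in a local model dir."""
    d = Path(model_dir)
    problems = []
    index = d / "model.safetensors.index.json"
    if index.exists():
        # Sharded checkpoint: every shard named in the index must exist on disk.
        weight_map = json.loads(index.read_text())["weight_map"]
        for shard in sorted(set(weight_map.values())):
            if not (d / shard).exists():
                problems.append(f"missing shard: {shard}")
    elif not (d / "model.safetensors").exists() and not (d / "pytorch_model.bin").exists():
        problems.append("no model.safetensors / pytorch_model.bin / index found")
    return problems

# Demo on a synthetic directory with one shard deliberately missing.
with tempfile.TemporaryDirectory() as tmp:
    d = Path(tmp)
    (d / "model.safetensors.index.json").write_text(json.dumps(
        {"weight_map": {"a.weight": "model-00001-of-00002.safetensors",
                        "b.weight": "model-00002-of-00002.safetensors"}}))
    (d / "model-00001-of-00002.safetensors").write_bytes(b"")
    print(check_weight_files(tmp))  # -> ['missing shard: model-00002-of-00002.safetensors']
```

The same check catches truncated downloads, where the index file survives but a shard did not.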
+ +**Knobs likely to matter (verify before recommending)** +- `dtype` (or `torch_dtype` in older installs — **inspect signatures** because dtype/precision knobs vary by version/backend) +- `device_map` +- `low_cpu_mem_usage` +- auth kwargs (e.g., `token` vs older names) — verify via signature +- `trust_remote_code` (security-sensitive; do not recommend unless necessary and understood) + +--- + +### 4.4 Generation + +**Canonical surface** +- `model.generate(...)` (method on generation-capable model classes) + +**Generation config (often public, verify)** +```python +import transformers +print("GenerationConfig present?", hasattr(transformers, "GenerationConfig")) +``` + +**Streaming helpers (often public, verify)** +```python +import transformers +for name in ["TextStreamer", "TextIteratorStreamer"]: + print(name, hasattr(transformers, name)) +``` + +**Pitfalls & fixes** +- “generate() got unexpected keyword”: inspect `generate` signature and/or use `model.generation_config` to set fields. +- “Stops too late / never stops”: verify EOS token id(s) and stopping criteria; confirm tokenizer special tokens. + +**Knobs likely to matter** +- `max_new_tokens`, `min_new_tokens` +- `do_sample`, `temperature`, `top_p`, `top_k` +- `num_beams`, `early_stopping` +- `repetition_penalty`, `no_repeat_ngram_size` +- `eos_token_id`, `pad_token_id` +*(All must be version-verified.)* + +--- + +### 4.5 Training / Evaluation + +**Canonical Trainer surface (verify)** +```python +from transformers import Trainer, TrainingArguments +``` + +Optional trainer variants may exist (verify): +- `Seq2SeqTrainer` +- `Seq2SeqTrainingArguments` + +**Verify availability** +```python +import transformers +for name in ["Trainer", "TrainingArguments", "Seq2SeqTrainer", "Seq2SeqTrainingArguments"]: + print(name, hasattr(transformers, name)) +``` + +**Pitfalls & fixes** +- “KeyError in metrics / labels”: confirm dataset fields and data collator output keys. 
+- “Distributed mismatch”: confirm versions of `accelerate`/backend and consistent launch method. + +**Knobs likely to matter** +- `per_device_train_batch_size`, `gradient_accumulation_steps` +- `learning_rate`, `warmup_steps`, `lr_scheduler_type` +- `fp16` / `bf16` (verify supported in the version/backend) +- `logging_steps`, `eval_steps`, `save_steps` +- `report_to` integrations (verify installed extras) + +--- + +### 4.6 Performance / Quantization + +Quantization support changes across versions and depends on optional dependencies. +Never claim a quantization config exists without verifying importability. + +**Discovery pattern** +```python +import transformers +candidates = [ + "BitsAndBytesConfig", + "GPTQConfig", + "AwqConfig", + "QuantoConfig", +] +for name in candidates: + print(name, hasattr(transformers, name)) +``` + +**Pitfalls & fixes** +- “ModuleNotFoundError for quantization backend”: install required dependency and re-verify. +- “dtype/device mismatch”: ensure model weights + inputs on same device; validate `torch_dtype`. + +**Knobs likely to matter** +- `device_map` +- `dtype` (or `torch_dtype` in older installs — **inspect signatures** because dtype/precision knobs vary by version/backend) +- quantization config object fields (version-dependent; verify via signature/dir) + +--- + +### 4.7 Export / Serving + +Export/serving is often handled by adjacent tooling (e.g., ONNX/export toolchains and serving runtimes). +Do not invent “native exporter APIs” unless you verify they exist in the target version and are documented. + +**Safe guidance approach** +1. Identify the target runtime (ONNX Runtime / TensorRT / TGI / vLLM / etc.). +2. Verify which tool owns export in the user’s stack (Transformers vs external). +3. Provide only documented + verifiable steps. + +**Pitfalls & fixes** +- “Export fails due to unsupported ops”: confirm opset, model architecture, and runtime support. + +--- + +## 5. 
Deprecations & Compatibility Traps (Verify, Don’t Assume) + +This section is intentionally conservative: it tells you **how** to verify, not **what** to assume. + +### 5.1 Authentication keyword arguments +Auth-related kwargs have changed over time across the ecosystem. +**Always inspect `from_pretrained` signature**: +```python +import inspect +from transformers import AutoTokenizer +print(inspect.signature(AutoTokenizer.from_pretrained)) +``` +Only recommend kwargs that appear in the signature. + +### 5.2 Download/cache kwargs +Download/caching controls can change; some kwargs become no-ops or get removed. +Again: inspect signatures and/or consult official docs for the pinned version. + +### 5.3 “Internal helpers” are not stable +If a solution requires importing from deep modules (e.g., `transformers.models...`), treat it as: +- “implementation detail” +- “may break across versions” +- “should be avoided unless you own the pinned commit” + +--- + +## 6. Model Artifact Files (On-Disk Reality Check) + +These are common files found in HF model repos or local export directories; actual sets vary. + +**Common config/tokenizer files** +- `config.json` +- `generation_config.json` (may be absent) +- `tokenizer.json` (fast tokenizer) +- `tokenizer_config.json` +- `special_tokens_map.json` + +**Common weights files** +- `model.safetensors` (or sharded: `model-00001-of-000xx.safetensors` + index json) +- `pytorch_model.bin` (legacy) +- backend-specific equivalents may exist depending on framework + +**Sanity check: load config + tokenizer** +```python +from transformers import AutoConfig, AutoTokenizer + +path_or_id = "YOUR_MODEL" # local path or model id +cfg = AutoConfig.from_pretrained(path_or_id) +tok = AutoTokenizer.from_pretrained(path_or_id) + +print("model_type:", getattr(cfg, "model_type", None)) +print("tokenizer:", tok.__class__.__name__) +``` + +**If load fails** +- Confirm the directory contains expected artifacts. 
+- Confirm backend compatibility (Torch vs TF vs Flax). +- If `trust_remote_code` is involved, treat it as a security decision: + - verify it is required + - verify the exact repo revision you trust + +--- + +## 7. Regeneration Strategy (Keep This File Correct) + +This file should remain correct across releases by being **workflow-first** and **snapshot-driven**, not a giant hardcoded list. + +### 7.1 CI snapshot (recommended) +In your pinned environment, run a script that records: +- `transformers.__version__` +- top-level symbols (filtered) +- available `Auto*` classes +- available quantization config candidates + +Example snapshot script: +```python +import json +import transformers + +def filt(names, prefixes=(), suffixes=(), contains=()): + # Keep a name if it matches ANY of the given prefixes, suffixes, or substrings. + out = [] + for n in names: + if (any(n.startswith(p) for p in prefixes) + or any(n.endswith(s) for s in suffixes) + or any(c in n for c in contains)): + out.append(n) + return sorted(out) + +names = dir(transformers) +snapshot = { + "transformers_version": transformers.__version__, + "top_level_selected": filt( + names, + prefixes=("Auto", "PreTrained", "Text", "Trainer", "Training", "Generation", "pipeline"), + suffixes=(), + contains=("Config",), + )[:2000], + "auto_classes": filt(names, prefixes=("Auto",)), + "model_for_heads": sorted([n for n in names if n.startswith("AutoModelFor")]), + "config_like": sorted([n for n in names if n.endswith("Config")]), +} + +print(json.dumps(snapshot, indent=2)[:20000]) +``` + +Store this snapshot alongside releases and update this file if: +- major surfaces change +- verification steps need to accommodate new patterns + +### 7.2 What never changes +Even when symbols change, the safe workflow remains: +- check importability +- inspect signatures +- run minimal repro + +--- + +## 8. Minimal Repro Template (Copy/Paste) + +Use this when users report errors. Require them to fill it.
+ +```python +""" +MINIMAL REPRO TEMPLATE (Transformers) + +1) Environment +- transformers==? +- backend: torch/tf/jax == ? +- device: CPU/CUDA/MPS (+ CUDA version if relevant) +- OS: ? + +2) Model +- model id or local path: +- revision/commit (if pinned): +- trust_remote_code: True/False (and why) + +3) Repro +- exact code below +- exact traceback output +""" + +import transformers +print("transformers:", transformers.__version__) + +# Optional backend info +try: + import torch + print("torch:", torch.__version__) + print("cuda available:", torch.cuda.is_available()) + print("cuda version:", getattr(torch.version, "cuda", None)) +except Exception as e: + print("torch not available:", repr(e)) + +MODEL = "REPLACE_ME" + +# Choose one path (tokenizer/model OR pipeline) depending on issue: +from transformers import AutoTokenizer, AutoModel + +tok = AutoTokenizer.from_pretrained(MODEL) +model = AutoModel.from_pretrained(MODEL) + +inputs = tok("hello", return_tensors="pt") +out = model(**inputs) +print(type(out)) +``` + +--- \ No newline at end of file diff --git a/.claude/skills/transformers-api/templates/minimal_repro.md b/.claude/skills/transformers-api/templates/minimal_repro.md new file mode 100644 index 000000000000..afd5540c51a0 --- /dev/null +++ b/.claude/skills/transformers-api/templates/minimal_repro.md @@ -0,0 +1,167 @@ +# Minimal Repro Template (Transformers) + +Use this template to produce a **copy/paste runnable** repro that someone else can run and see the same issue. + +## 0. One-line goal +**Goal:** + +## 1. What is happening (actual) +**Actual:** + +## 2. Environment (must be exact) +Fill in all that apply. 
+ +- OS: Windows / Linux / macOS (include version) +- Python: `python -V` +- Transformers: `python -c "import transformers; print(transformers.__version__)"` +- Backend: PyTorch / TensorFlow / JAX (pick one) + - PyTorch: `python -c "import torch; print(torch.__version__); print(torch.cuda.is_available())"` + - TF: `python -c "import tensorflow as tf; print(tf.__version__)"` + - JAX: `python -c "import jax; print(jax.__version__)"` +- Device: CPU / CUDA / MPS +- GPU (if any): model + VRAM +- Install method: + - pip/venv OR conda (include env name) +- Install source: + - PyPI release OR editable install from repo (`pip install -e .`) OR specific commit/revision +- Reproducibility: + - Does it happen every run? Y/N + - First bad version / last good version (if known) + +## 3. Installation commands (exact) +Provide the minimal set of commands someone needs to create a clean environment. + +### Option A — venv + pip +```bash +python -m venv .venv +# Windows PowerShell: +# .\.venv\Scripts\Activate.ps1 +# macOS/Linux: +# source .venv/bin/activate +pip install -U pip +pip install "transformers[torch]" # or your exact extras +``` + +### Option B — conda +```bash +conda create -n repro python=3.11 -y +conda activate repro +pip install -U pip +pip install transformers +``` + +> If you're using the repo source, replace installs with: +> `pip install -e .` (from repo root) + +## 4. Minimal script (single file) + +Create `repro.py` with the smallest code that still fails. +Rules: + +* Use a **single model id** (or local path) and include revision if pinned +* Set seeds +* Print versions +* Avoid unrelated features (Trainer, accelerate, etc.) 
unless they are the bug + +```python +import os +import sys +import platform +import random + +def print_env(): + print("== ENV ==") + print("python:", sys.version.replace("\n", " ")) + print("platform:", platform.platform()) + try: + import transformers + print("transformers:", transformers.__version__) + except Exception as e: + print("transformers import failed:", repr(e)) + try: + import torch + print("torch:", torch.__version__) + print("cuda available:", torch.cuda.is_available()) + if torch.cuda.is_available(): + print("cuda device:", torch.cuda.get_device_name(0)) + except Exception as e: + print("torch import failed:", repr(e)) + print("HF_HOME:", os.getenv("HF_HOME")) + print("HF_HUB_CACHE:", os.getenv("HF_HUB_CACHE")) + print("TRANSFORMERS_CACHE:", os.getenv("TRANSFORMERS_CACHE")) + print() + +def set_seeds(seed=0): + random.seed(seed) + try: + import numpy as np + np.random.seed(seed) + except Exception: + pass + try: + import torch + torch.manual_seed(seed) + if torch.cuda.is_available(): + torch.cuda.manual_seed_all(seed) + except Exception: + pass + +def main(): + print_env() + set_seeds(0) + + # TODO: replace with minimal failing code + from transformers import pipeline + + model_id = "distilbert-base-uncased-finetuned-sst-2-english" + nlp = pipeline("sentiment-analysis", model=model_id) + print(nlp("hello world")) + +if __name__ == "__main__": + main() +``` + +## 5. Run command + full output + +Command used: +```bash +python repro.py +``` + +Paste the **full output** here (don't truncate). + +## 6. Expected vs actual (explicit) + +* **Expected:** +* **Actual:** + +## 7. 
Smallest knobs to try (pick only relevant) + +Include only the knobs that could change the failure: + +* Model: different revision / different model id +* Device: CPU vs CUDA +* dtype: `torch_dtype=float16/bfloat16/float32` +* `device_map="auto"` vs explicit device +* `low_cpu_mem_usage=True/False` +* `trust_remote_code=True/False` +* Tokenization: `padding/truncation/max_length` +* Generation: `do_sample`, `temperature`, `top_p`, `num_beams`, `max_new_tokens` +* Attention backend: SDPA / flash-attn (if applicable) +* Quantization: 8-bit/4-bit settings (bitsandbytes/GPTQ/AWQ) + +## 8. If it's a repo bug (for contributors) + +* Suspected module/file: + * `src/transformers/...` +* Related tests to run: + * `python -m pytest tests/<...> -k ""` +* Minimal patch idea: + * <1–3 sentences> + +## 9. Attachments checklist (only if needed) + +* config.json / tokenizer.json / generation_config.json +* exact traceback (full) +* small input sample(s) +* exact command line flags / env vars \ No newline at end of file