From 249a04c09a0cf5eefe2a89a3e5f482382a5b8575 Mon Sep 17 00:00:00 2001 From: Gautam Datla Date: Sun, 18 Jan 2026 21:44:16 -0500 Subject: [PATCH] Claude code skills for transformers-api --- .claude/skills/transformers-api/SKILL.md | 159 +++++ .../reference/areas/export-serving.md | 286 +++++++++ .../reference/areas/generation.md | 466 ++++++++++++++ .../reference/areas/inference.md | 286 +++++++++ .../reference/areas/performance.md | 371 ++++++++++++ .../reference/areas/preprocessing.md | 434 +++++++++++++ .../reference/areas/repo-contributing.md | 315 ++++++++++ .../reference/areas/training.md | 353 +++++++++++ .../reference/areas/troubleshooting.md | 343 +++++++++++ .../reference/generated/module_tree.md | 553 +++++++++++++++++ .../reference/generated/public_api.md | 572 ++++++++++++++++++ .../templates/minimal_repro.md | 167 +++++ 12 files changed, 4305 insertions(+) create mode 100644 .claude/skills/transformers-api/SKILL.md create mode 100644 .claude/skills/transformers-api/reference/areas/export-serving.md create mode 100644 .claude/skills/transformers-api/reference/areas/generation.md create mode 100644 .claude/skills/transformers-api/reference/areas/inference.md create mode 100644 .claude/skills/transformers-api/reference/areas/performance.md create mode 100644 .claude/skills/transformers-api/reference/areas/preprocessing.md create mode 100644 .claude/skills/transformers-api/reference/areas/repo-contributing.md create mode 100644 .claude/skills/transformers-api/reference/areas/training.md create mode 100644 .claude/skills/transformers-api/reference/areas/troubleshooting.md create mode 100644 .claude/skills/transformers-api/reference/generated/module_tree.md create mode 100644 .claude/skills/transformers-api/reference/generated/public_api.md create mode 100644 .claude/skills/transformers-api/templates/minimal_repro.md diff --git a/.claude/skills/transformers-api/SKILL.md b/.claude/skills/transformers-api/SKILL.md new file mode 100644 index 
000000000000..24c7d70db219 --- /dev/null +++ b/.claude/skills/transformers-api/SKILL.md @@ -0,0 +1,159 @@ +--- +name: transformers-api +description: Guides coding and debugging in the Hugging Face Transformers repo. Use when questions involve transformers APIs (pipeline, AutoModel*, AutoTokenizer, Trainer, generate), repo navigation (“where is X implemented?”), performance/quantization, export/serving, or stack traces referencing transformers/ or src/transformers/. +--- + +# Transformers API Navigator (Claude Code) + +## Purpose +This Skill is an **operating playbook** for working with the `huggingface/transformers` codebase and answering Transformers API questions **without guessing**. + +It optimizes for: +- correct API choice (pipeline vs Auto* vs Trainer vs export/perf) +- fast debugging (minimal repro-first) +- accurate repo navigation (“where is X implemented?”) +- small, testable changes when modifying the repo + +This file is intentionally **high-level**. Detailed breakdowns live in individual markdown files under `reference/areas/*`. + +--- + +## When to activate +Activate this Skill if **any** of the following are true: +- The user mentions Transformers or `transformers` APIs (`pipeline`, `AutoModel*`, `AutoTokenizer`, `Trainer`, `generate`, etc.) +- They reference Transformers artifacts (`config.json`, `tokenizer.json`, `generation_config.json`, `model.safetensors`, etc.) +- They show code importing `transformers` or stack traces mentioning `transformers/` or `src/transformers/` +- They need a Transformers-specific decision (inference vs training, generation knobs, perf/quantization, export/serving) +- They ask repo questions: “where is X implemented?”, “which file owns Y?” + +Do **not** activate if the request is mostly: +- Hub/Datasets usage with no `transformers` callsite, or +- **tokenizers library internals** (the separate tokenizers repo / Rust internals) with no Transformers usage. 
+ +Do activate if it’s **Transformers usage of tokenizers/processors** (route to Preprocessing). + +--- + +## Reference entry points + +### Buckets (open exactly ONE first) +- Inference → `reference/areas/inference.md` +- Preprocessing → `reference/areas/preprocessing.md` +- Generation → `reference/areas/generation.md` +- Training / Evaluation → `reference/areas/training.md` +- Performance / Memory / Quantization → `reference/areas/performance.md` +- Export / Serving → `reference/areas/export-serving.md` +- Repo navigation / Contributing → `reference/areas/repo-contributing.md` +- Debugging / Troubleshooting → `reference/areas/troubleshooting.md` + +### Verification (“don’t hallucinate”) +- Symbol/arg exists → `reference/generated/public_api.md` +- Where implemented → `reference/generated/module_tree.md` + +Full repo structure is captured in: `reference/generated/module_tree.md` + +### Debug template +- Minimal repro form → `templates/minimal_repro.md` + +--- + +## Exact sequential process (always follow this order) + +### Step 1 — Classify the request (pick ONE bucket) +- **Inference** (pipelines, Auto* inference) +- **Preprocessing** (tokenizers / processors) +- **Generation** (generate/decoding/chat/streaming) +- **Training / Evaluation** (Trainer, arguments, callbacks) +- **Performance / Memory / Quantization** +- **Export / Serving** +- **Repo navigation / Contributing** +- **Debugging / Troubleshooting** + +### Step 2 — Ask only what’s missing (0–5 questions, only if ambiguous) +Ask only the minimum to proceed: +1) Goal/outcome in one sentence (only if unclear) +2) Modality/task (Text / Vision / Audio / Video / Multimodal) (only if relevant) +3) Model id or local path (and revision/commit if pinned) (if loading/inference/training is involved) +4) Environment: `transformers` version + backend (PyTorch/TF/JAX) + device (CPU/CUDA/MPS) (+ rough VRAM/RAM if perf matters) +5) If blocked: full stack trace + minimal repro snippet (use `templates/minimal_repro.md`) + 
+### Step 3 — Route first (deterministic router embedded here) +Follow this router and open **exactly one** bucket file from the list above. + +#### Routing rules +- If the user is blocked by an exception/traceback, regression, or wrong output → open **Troubleshooting** first + **unless** it is clearly a `Trainer`/training-loop failure → open **Training** first. +- If multiple buckets match, prioritize the user’s **desired outcome** over the first keyword seen. +- If still tied, use this fixed priority order: + **Troubleshooting > Training > Generation > Inference > Preprocessing > Performance > Export/Serving > Repo/Contributing** + +#### Routing table (open exactly ONE file first) + +| User intent / signal | Open this first | Common keywords / symptoms | +|---|---|---| +| Run inference / predict / use a model quickly | `reference/areas/inference.md` | `pipeline`, `AutoModelFor*`, `from_pretrained`, logits, predict, embeddings, classification, ASR/VQA/etc. | +| Preprocessing / inputs formatting (text/vision/audio/video) | `reference/areas/preprocessing.md` | `AutoTokenizer`, `AutoProcessor`, `AutoImageProcessor`, `AutoVideoProcessor`, (audio) `FeatureExtractor`, padding, truncation, transforms, normalization, resizing, sampling rate | +| Text generation / chat behavior | `reference/areas/generation.md` | `generate`, decoding, `max_new_tokens`, sampling, beams, stop tokens, streaming, chat templates | +| Fine-tuning / training / evaluation | `reference/areas/training.md` | `Trainer`, `TrainingArguments`, `train`, `evaluate`, metrics, collators, checkpoints, distributed, FSDP/DeepSpeed/Accelerate | +| Performance / memory / quantization | `reference/areas/performance.md` | VRAM/OOM, `device_map`, `torch_dtype`, fp16/bf16, attention backends, 8-bit/4-bit, bitsandbytes/GPTQ/AWQ | +| Export / serving / deployment | `reference/areas/export-serving.md` | ONNX/export, serving, batching, vLLM/TGI/SGLang, `transformers serve` (moderate-load/experimental), `transformers 
chat` | +| Repo navigation / contributing / “where is X implemented?” | `reference/areas/repo-contributing.md` | “where is”, “which file”, “implementation”, `src/transformers`, tests, docs, PR, add model | +| Errors, crashes, regressions, wrong outputs | `reference/areas/troubleshooting.md` | traceback, exception, mismatch, device/dtype errors, missing files, unexpected output | + +#### Verification shortcuts +Use these only when uncertain about an API/arg/behavior, or when locating code/docs: +- **Does a symbol/arg exist?** → `reference/generated/public_api.md` +- **Where is it implemented?** → `reference/generated/module_tree.md` + +#### Fallback (if nothing matches) +- Open `reference/generated/public_api.md` to identify the closest public surface area. +- Then route to the nearest bucket in the table above and continue. + +### Step 4 — If blocked by an error: reproduce/triage first +If the user cannot proceed due to an exception or incorrect outputs: +- prioritize minimal repro + full stack trace + versions +- classify the failure: **loading** vs **preprocessing** vs **forward/generate** vs **Trainer** vs **integration** +- apply a targeted fix + propose the smallest next diagnostic step + +### Step 5 — Verify only when uncertain (never guess) +Only consult verification sources when you are unsure about a symbol/arg/behavior/default, or when locating an implementation. + +Verification order: +1) `reference/generated/public_api.md` : confirms what is publicly exposed (what exists) +2) `reference/generated/module_tree.md` : finds where it lives in `src/transformers/` (where it’s implemented) +3) Fallback if needed: inspect `src/transformers/`, `docs/source/`, and/or repo search + +If `reference/generated/*` looks missing or stale, **regenerate/update it before relying on it**. +If you cannot verify, say so and point to the most likely file/module to inspect next. 
+ +### Step 6 — Respond using the output contract +Every answer must include: +- **Steps** (numbered) +- **Minimal runnable snippet** (copy/paste) +- **Pitfalls & fixes** (“If X → do Y”) +- **What to change** (3–8 knobs likely to matter) + +If the user is changing repo code, also include: +- exact file paths to edit +- tests to run (smallest relevant set) + +--- + +## Repo anchors (use when needed) +- Core library: `src/transformers/` +- Tests: `tests/` +- Docs source: `docs/source/` (commonly `docs/source/en/`) +- Examples: `examples/` + +When asked “where is X implemented?”: +- use `reference/generated/module_tree.md` first +- then point to exact file paths under `src/transformers/` +- include 1–3 search keywords the user can grep for + +--- + +## Guardrails (non-negotiable) +- Do not invent APIs/args/behavior. Verify if uncertain. +- Do not propose large refactors when a small targeted change will do. +- Behavior changes should come with a test (or an explicit reason why not). +- Keep Transformers responsibilities separate from Hub/Datasets/Accelerate/PEFT unless the integration point is the blocker. 
\ No newline at end of file diff --git a/.claude/skills/transformers-api/reference/areas/export-serving.md b/.claude/skills/transformers-api/reference/areas/export-serving.md new file mode 100644 index 000000000000..61c8d7135bcf --- /dev/null +++ b/.claude/skills/transformers-api/reference/areas/export-serving.md @@ -0,0 +1,286 @@ +# Export & Serving (deployment, runtimes, CLIs) + +## Contents +- [Scope](#scope) +- [Minimum questions to ask](#minimum-questions-to-ask) +- [Decision guide: serve in Python vs export a portable artifact](#decision-guide-serve-in-python-vs-export-a-portable-artifact) +- [Quickstarts](#quickstarts) + - [1) Local OpenAI-compatible server (`transformers serve`)](#1-local-openai-compatible-server-transformers-serve) + - [2) Sanity-check the server (curl)](#2-sanity-check-the-server-curl) + - [3) Export to ONNX (Optimum CLI)](#3-export-to-onnx-optimum-cli) + - [4) Load + run an ONNX export (ORTModel)](#4-load--run-an-onnx-export-ortmodel) + - [5) Export to ExecuTorch (edge/mobile)](#5-export-to-executorch-edgemobile) + - [6) Export to TorchScript (PyTorch-only; limited)](#6-export-to-torchscript-pytorch-only-limited) +- [Knobs that matter (3–8)](#knobs-that-matter-38) +- [Pitfalls & fixes](#pitfalls--fixes) +- [Verify / locate in repo](#verify--locate-in-repo) + +--- + +## Scope + +Use this page when the user wants to **ship** a Transformers model: +- serve it behind an HTTP API (local dev → deployment) +- export it to another runtime (ONNX / ExecuTorch / TFLite via Optimum; TorchScript via PyTorch/Transformers) +- choose the right “packaging” path given constraints (latency/throughput, hardware, Python vs non-Python) + +--- + +## Minimum questions to ask + +Ask only what you need to pick a path (0–5 questions): +1) **Workload**: encoder inference (cls/embeddings) vs **LLM generation** (chat/completions)? +2) **Target runtime**: must run **outside Python**? must run on **mobile/edge**? OpenAI-compatible API required? 
+3) **Hardware**: CPU / CUDA GPU / MPS / edge accelerator; memory limits +4) **Model id/path + revision** (pin if you care about reproducibility) +5) If blocked: exact error + smallest repro + versions (`transformers`, PyTorch, CUDA, Optimum/runtime) + +--- + +## Decision guide: serve in Python vs export a portable artifact + +### Choose “Serve” when… +- you want a fast integration path for an app +- you can keep Python in the stack +- you want an HTTP boundary (and potentially OpenAI-compatible endpoints) + +Typical choices: +- **`transformers serve`**: quick local server; good for dev/moderate load +- production LLM throughput: consider dedicated serving stacks (outside this repo) that specialize in continuous batching, KV cache, tensor parallel, etc. + +### Choose “Export” when… +- you must run in a non-Python runtime +- you need a portable artifact for inference engines / mobile / embedded + +Typical choices: +- **ONNX** (via Optimum): broad runtime support +- **ExecuTorch** (via Optimum): PyTorch-native edge/mobile packaging +- **TorchScript**: PyTorch-only and can be brittle; best for simpler encoder models +- **TFLite** (via Optimum TF exporters): TensorFlow Lite ecosystems (mobile/edge), often needs fixed shapes + +--- + +## Quickstarts + +### 1. Local OpenAI-compatible server (`transformers serve`) + +Use this for local/dev integration tests. Always check the current flags in your environment: + +```bash +transformers serve --help +``` + +Install serving dependencies: + +```bash +pip install transformers[serving] +``` + +Then start the server: + +```bash +transformers serve +# Optional: force a single model for all requests (avoids per-request model hints) +# transformers serve --force-model "Qwen/Qwen2.5-0.5B-Instruct" +``` + +Notes: +- Treat this as **developer-friendly** serving. For high-QPS production, you’ll usually reach for specialized serving runtimes. + +--- + +### 2. 
Sanity-check the server (curl) + +Chat Completions request (OpenAI-compatible): + +```bash +curl -X POST http://localhost:8000/v1/chat/completions \ + -H "Content-Type: application/json" \ + -d '{"messages":[{"role":"system","content":"hello"}],"temperature":0.9,"max_tokens":1000,"stream":true,"model":"Qwen/Qwen2.5-0.5B-Instruct"}' +``` + +The same server also supports the Responses API: + +```bash +curl http://localhost:8000/v1/responses \ + -H "Content-Type: application/json" \ + -d '{ + "model": "Qwen/Qwen2.5-0.5B-Instruct", + "stream": true, + "input": "Tell me a three sentence bedtime story about a unicorn." + }' +``` + +If requests fail: +- run `transformers serve --help` and confirm host/port/model settings +- confirm the client `model` string matches what the server expects + +--- + +### 3. Export to ONNX (Optimum CLI) + +Install Optimum ONNX tooling: + +```bash +pip install optimum-onnx +``` + +Export a model to ONNX: + +```bash +optimum-cli export onnx \ + --model distilbert/distilbert-base-uncased-distilled-squad \ + distilbert_squad_onnx/ +``` + +Notes: +- If exporting from a local directory, ensure tokenizer/config live alongside weights. +- If task inference is ambiguous, pass `--task` (e.g., `question-answering`, `text-classification`, `text-generation`). + +--- + +### 4. Load + run an ONNX export (ORTModel) + +```python +from transformers import AutoTokenizer +from optimum.onnxruntime import ORTModelForQuestionAnswering + +onnx_dir = "distilbert_squad_onnx" + +tokenizer = AutoTokenizer.from_pretrained(onnx_dir) +model = ORTModelForQuestionAnswering.from_pretrained(onnx_dir) + +inputs = tokenizer( + "What runtime is this?", + "This is ONNX Runtime via Optimum.", + return_tensors="pt", +) +outputs = model(**inputs) + +print(outputs.start_logits.shape, outputs.end_logits.shape) +``` + +Sanity validation tip: +- compare logits on 3–10 fixed inputs between PyTorch and ONNX before shipping + +--- + +### 5. 
Export to ExecuTorch (edge/mobile) + +This is a practical path when you want a PyTorch-native on-device artifact. + +Install ExecuTorch exporter dependencies: + +```bash +git clone https://github.com/huggingface/optimum-executorch.git +cd optimum-executorch +pip install . +``` + +Export (CLI): + +```bash +optimum-cli export executorch \ + --model "HuggingFaceTB/SmolLM2-135M-Instruct" \ + --task "text-generation" \ + --recipe "xnnpack" \ + --output_dir "smollm2_executorch" +``` + +Run (Python wrapper around the exported artifact): + +```python +from transformers import AutoTokenizer +from optimum.executorch import ExecuTorchModelForCausalLM + +# Load the tokenizer from the same checkpoint that was exported +tokenizer = AutoTokenizer.from_pretrained("HuggingFaceTB/SmolLM2-135M-Instruct") +model = ExecuTorchModelForCausalLM.from_pretrained("smollm2_executorch/") + +prompt = "Explain KV cache in one sentence." +print(model.text_generation(tokenizer=tokenizer, prompt=prompt, max_seq_len=64)) +``` + +Validation tip: +- run the same 3–10 prompts on the original model and the exported artifact; compare outputs at the token level where possible (or at least with consistent decoding settings) + +--- + +### 6. Export to TorchScript (PyTorch-only; limited) + +TorchScript is best for simpler, stable encoder-style graphs. Many Transformers models require enabling TorchScript mode so outputs are traceable. 
+ +```python +import torch +from transformers import AutoTokenizer, AutoModelForSequenceClassification + +model_id = "distilbert/distilbert-base-uncased-finetuned-sst-2-english" + +tok = AutoTokenizer.from_pretrained(model_id) +model = AutoModelForSequenceClassification.from_pretrained( + model_id, + torchscript=True, # important for many models +).eval() + +# Dummy inputs for tracing (trace will specialize to these shapes) +ex = tok("hello world", return_tensors="pt") + +with torch.no_grad(): + traced = torch.jit.trace(model, (ex["input_ids"], ex["attention_mask"])) + +traced.save("model_ts.pt") +``` + +Pitfalls: +- `torchscript=True` is required for models with tied weights (typically models with a language-model head). Models without an LM head can be exported without it. +- tracing is shape-sensitive; the trace generally only supports the same input shapes used during tracing (pad/choose a max expected shape). + + +--- + +## Knobs that matter (3–8) + +1) **Serve vs export** + - Need an API quickly → serve + - Need a portable artifact / non-Python runtime → export +2) **Workload** + - LLM generation is sensitive to KV-cache + batching; encoder inference exports more easily +3) **Repro pinning** + - pin model `revision` and record tool/runtime versions +4) **Export “task”** + - pass `--task` when exporting local models or ambiguous checkpoints +5) **Shapes** + - TorchScript and many mobile exports are sensitive to shapes; validate with representative inputs +6) **Runtime choice** + - ONNX Runtime vs other accelerators; for edge/mobile consider ExecuTorch/TFLite +7) **Correctness validation** + - always compare outputs on a small fixed suite before shipping +8) **Performance validation** + - measure latency/throughput on the target hardware (not just dev machine) + +--- + +## Pitfalls & fixes + +- **Server starts but requests fail** + - check `transformers serve --help` for port/model routing + - confirm endpoint path and request JSON match what your server 
expects +- **ONNX export “works” but outputs differ** + - verify tokenizer parity (same files/config), and compare logits first + - ensure you didn’t accidentally change padding/truncation/max_length +- **TorchScript breaks on real inputs** + - tracing used one example shape; real shapes differ → prefer ONNX or constrain shapes +- **Edge export slow** + - ensure you chose an appropriate recipe/backend and validated quantization/perf settings for the device + +--- + +## Verify / locate in repo + +Use Skill verification indexes when uncertain: +- “Does this symbol/arg exist?” → `reference/generated/public_api.md` +- “Where is it implemented?” → `reference/generated/module_tree.md` + +Useful repo grep keywords: +- `transformers serve`, `openai`, `chat/completions`, `responses` +- `export`, `onnx`, `executorch`, `torchscript` +- `pipelines`, `generation`, `cache`, `continuous batching` (if serving overlaps with perf questions) \ No newline at end of file diff --git a/.claude/skills/transformers-api/reference/areas/generation.md b/.claude/skills/transformers-api/reference/areas/generation.md new file mode 100644 index 000000000000..b5f4034f9aef --- /dev/null +++ b/.claude/skills/transformers-api/reference/areas/generation.md @@ -0,0 +1,466 @@ +# Generation (decode, sampling, beams, stopping, streaming, chat) + +## Contents +- [Scope](#scope) +- [Minimum questions to ask](#minimum-questions-to-ask) +- [Always-follow workflow](#always-follow-workflow) +- [Quickstarts](#quickstarts) + - [A. Decoder-only (CausalLM) minimal generation (greedy)](#a-decoder-only-causallm-minimal-generation-greedy) + - [B. Encoder-decoder (Seq2Seq) minimal generation (greedy)](#b-encoder-decoder-seq2seq-minimal-generation-greedy) +- [Output length (do this first)](#output-length-do-this-first) +- [Decoding strategies (choose one)](#decoding-strategies-choose-one) + - [1. Greedy (deterministic baseline)](#1-greedy-deterministic-baseline) + - [2. 
Sampling (creative / diverse)](#2-sampling-creative--diverse) + - [3. Beam search (more exhaustive, more deterministic)](#3-beam-search-more-exhaustive-more-deterministic) + - [4. Diverse candidates (multiple outputs)](#4-diverse-candidates-multiple-outputs) +- [Chat prompting (chat templates)](#chat-prompting-chat-templates) + - [Chat template → generate (decoder-only)](#chat-template--generate-decoder-only) +- [“Decoder-only returns the prompt too” (slice it)](#decoder-only-returns-the-prompt-too-slice-it) +- [Stopping](#stopping) + - [1. EOS-based stopping (default)](#1-eos-based-stopping-default) + - [2. Stop on custom condition (StoppingCriteria)](#2-stop-on-custom-condition-stoppingcriteria) + - [3. Stop on strings (built-in: stop_strings)](#3-stop-on-strings-built-in-stop_strings) +- [Streaming](#streaming) + - [TextIteratorStreamer (common pattern with a background thread)](#textiteratorstreamer-common-pattern-with-a-background-thread) +- [Inspecting generation internals (scores, beams, etc.)](#inspecting-generation-internals-scores-beams-etc) +- [What to change (knobs that matter most)](#what-to-change-knobs-that-matter-most) +- [Pitfalls & fixes (high-frequency)](#pitfalls--fixes-high-frequency) +- [Repo hotspots (when asked “where is this implemented?”)](#repo-hotspots-when-asked-where-is-this-implemented) +- [Verification checklist (anti-hallucination)](#verification-checklist-anti-hallucination) + + +## Scope + +Use this page when the user’s goal is **text generation / chat behavior**: +- `.generate()` decoding strategy (greedy / sampling / beams) +- output length control (`max_new_tokens`, `min_new_tokens`, etc.) 
+- repetition control (`repetition_penalty`, `no_repeat_ngram_size`) +- stopping (EOS, custom stopping criteria) +- streaming (streamers) +- chat templates + generation together + +--- + +## Minimum questions to ask + +Ask only what’s required to produce a runnable snippet: +1) Model type: **decoder-only** (CausalLM) vs **encoder-decoder** (Seq2Seq) +2) Desired behavior: **deterministic** vs **creative** +3) Output constraints: length, stop condition, format (JSON, bullets, etc.) +4) Environment: `transformers` version + backend/device (CPU/CUDA/MPS) +5) If blocked: full traceback + minimal repro + +--- + +## Always-follow workflow + +1) Load model + tokenizer from the same checkpoint. +2) Prepare prompt (raw text or chat template). +3) Put tensors + model on the same device. +4) Choose a decoding strategy (greedy / sampling / beam) and set length via `max_new_tokens`. +5) Generate under `torch.inference_mode()` (PyTorch). +6) Decode, and (for decoder-only models) optionally slice off the prompt tokens. + +--- + +## Quickstarts + +### A. Decoder-only (CausalLM) minimal generation (greedy) +```python +import torch +from transformers import AutoModelForCausalLM, AutoTokenizer + +model_id = "distilbert/distilgpt2" +tok = AutoTokenizer.from_pretrained(model_id) +model = AutoModelForCausalLM.from_pretrained(model_id) +model.eval() + +prompt = "Write a one-sentence summary of Transformers:" +inputs = tok(prompt, return_tensors="pt").to(model.device) + +with torch.inference_mode(): + out = model.generate(**inputs, max_new_tokens=50) + +text = tok.decode(out[0], skip_special_tokens=True) +print(text) +``` + +### B. Encoder-decoder (Seq2Seq) minimal generation (greedy) +```python +import torch +from transformers import AutoModelForSeq2SeqLM, AutoTokenizer + +model_id = "google/flan-t5-small" +tok = AutoTokenizer.from_pretrained(model_id) +model = AutoModelForSeq2SeqLM.from_pretrained(model_id) +model.eval() + +prompt = "Translate to German: The cat is on the table." 
+inputs = tok(prompt, return_tensors="pt").to(model.device) + +with torch.inference_mode(): + out = model.generate(**inputs, max_new_tokens=50) + +print(tok.decode(out[0], skip_special_tokens=True)) +``` + +--- + +## Output length (do this first) + +Prefer `max_new_tokens` over `max_length`. + +- `max_new_tokens`: number of tokens **generated beyond** the prompt (recommended) +- `max_length`: prompt length + generated length (often confusing) + +Also consider: +- `min_new_tokens` (or `min_length` depending on model/version) +- `early_stopping` (beam search behavior) + +--- + +## Decoding strategies (choose one) + +### 1. Greedy (deterministic baseline) +Good for short, factual, structured outputs. Can repeat for long outputs. +```python +out = model.generate(**inputs, max_new_tokens=200, do_sample=False) +``` + +### 2. Sampling (creative / diverse) +Use when you want variation. Typical defaults: +- `do_sample=True` +- `temperature` ~ 0.7–1.0 +- `top_p` ~ 0.9–0.95 (nucleus) +- optionally `top_k` ~ 40–100 + +```python +out = model.generate( + **inputs, + max_new_tokens=200, + do_sample=True, + temperature=0.8, + top_p=0.95, + top_k=50, +) +``` + +### 3. Beam search (more exhaustive, more deterministic) +Useful for translation/summarization; can become repetitive for open-ended chat. + +```python +out = model.generate( + **inputs, + max_new_tokens=200, + num_beams=4, + do_sample=False, + early_stopping=True, +) +``` + +### 4. Diverse candidates (multiple outputs) +```python +out = model.generate( + **inputs, + max_new_tokens=120, + do_sample=True, + temperature=0.9, + top_p=0.95, + num_return_sequences=3, +) +texts = tok.batch_decode(out, skip_special_tokens=True) +for i, t in enumerate(texts, 1): + print(f"\n--- candidate {i} ---\n{t}") +``` + +--- + +## Chat prompting (chat templates) + +If the model expects chat formatting, use `apply_chat_template` (tokenizer or processor). 
+If you’re unsure whether the model is “chat/instruct”, check its docs/model card or your `reference/generated/*`. + +### Chat template → generate (decoder-only) + +```python +import torch +from transformers import AutoModelForCausalLM, AutoTokenizer + +model_id = "meta-llama/Llama-3.1-8B-Instruct" +tok = AutoTokenizer.from_pretrained(model_id) +model = AutoModelForCausalLM.from_pretrained(model_id) +model.eval() + +messages = [ + {"role": "system", "content": "You are concise."}, + {"role": "user", "content": "Explain beam search in one paragraph."}, +] + +prompt = tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True) +inputs = tok(prompt, return_tensors="pt", add_special_tokens=False).to(model.device) + +with torch.inference_mode(): + out = model.generate(**inputs, max_new_tokens=180, do_sample=False) + +print(tok.decode(out[0], skip_special_tokens=True)) +``` + +--- + +## “Decoder-only returns the prompt too” (slice it) + +For decoder-only LMs, `generate()` returns `[prompt + completion]`. +If you only want the completion tokens: + +```python +with torch.inference_mode(): + out = model.generate(**inputs, max_new_tokens=80) + +prompt_len = inputs["input_ids"].shape[-1] +completion_ids = out[0, prompt_len:] +completion_text = tok.decode(completion_ids, skip_special_tokens=True) +print(completion_text) +``` + +(For encoder-decoder models, the generated sequence is usually just the decoder output.) + +--- + +## Stopping + +### 1. EOS-based stopping (default) +Most models stop when `eos_token_id` is produced (or hit length limits). +If you see “never stops” behavior, verify: +- `eos_token_id` exists and is correct +- you didn’t set an incompatible `min_length` / `min_new_tokens` + +### 2. Stop on custom condition (StoppingCriteria) +Use this when you need “stop when a phrase appears” or other custom termination. 
+ +```python +import torch +from transformers import StoppingCriteria, StoppingCriteriaList + +class StopOnTokenSequence(StoppingCriteria): + def __init__(self, stop_ids: list[int]): + self.stop_ids = torch.tensor(stop_ids, dtype=torch.long) + + def __call__(self, input_ids: torch.LongTensor, scores: torch.FloatTensor, **kwargs): + # Return shape: (batch_size,) — True means “stop for that row” + stop_ids = self.stop_ids.to(input_ids.device) + bsz, seqlen = input_ids.shape + n = stop_ids.numel() + + if seqlen < n: + return torch.zeros((bsz,), dtype=torch.bool, device=input_ids.device) + + tail = input_ids[:, -n:] # (bsz, n) + matched = (tail == stop_ids).all(dim=1) # (bsz,) + return matched + + +stop_text = "\n###" +stop_ids = tok(stop_text, add_special_tokens=False)["input_ids"] +criteria = StoppingCriteriaList([StopOnTokenSequence(stop_ids)]) + +with torch.inference_mode(): + out = model.generate( + **inputs, + max_new_tokens=300, + do_sample=True, + temperature=0.8, + top_p=0.95, + stopping_criteria=criteria, + ) + +print(tok.decode(out[0], skip_special_tokens=True)) +``` + +Notes: +- In batched generation, stopping criteria return a per-sample boolean mask of shape `(batch_size,)`, which generation combines with its internal “unfinished sequences” mask. +- However, generation often keeps tensor shapes fixed (e.g., padding finished rows), so you may not get compute savings unless you re-batch unfinished samples. + + +### 3. 
Stop on strings (built-in: stop_strings) + +If you want to stop when the model outputs a specific string, you can use `stop_strings`: + +```python +import torch +from transformers import AutoModelForCausalLM, AutoTokenizer + +model_id = "distilbert/distilgpt2" +tok = AutoTokenizer.from_pretrained(model_id) +model = AutoModelForCausalLM.from_pretrained(model_id) +model.eval() + +prompt = "Write a short answer, then end with a line containing ###:\n" +inputs = tok(prompt, return_tensors="pt").to(model.device) + +with torch.inference_mode(): + out = model.generate( + **inputs, + max_new_tokens=300, + do_sample=True, + temperature=0.8, + top_p=0.95, + stop_strings=["\n###"], + tokenizer=tok, # required so stop_strings can be matched against decoded text + ) + +text = tok.decode(out[0], skip_special_tokens=True) +print(text) +``` + +Notes: +- `stop_strings` stops generation *after* the stop string is produced. +- Pass `tokenizer=tok` so Transformers can detect the stop string correctly during generation. +- If you need the returned text *without* the stop string, trim it after decoding (e.g., `text.split("\n###")[0]`). + +--- + +## Streaming + +Use streamers when you want token-by-token (or chunk-by-chunk) output. 
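For quick demos where printing to stdout is enough, `TextStreamer` streams in the calling thread, with no background thread needed. A minimal sketch, reusing the same distilgpt2 setup as the other quickstarts on this page:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, TextStreamer

model_id = "distilbert/distilgpt2"
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id).eval()

inputs = tok("Tell me a short story about a robot:", return_tensors="pt").to(model.device)

# Decodes and prints chunks to stdout as tokens are generated
streamer = TextStreamer(tok, skip_prompt=True, skip_special_tokens=True)

with torch.inference_mode():
    out = model.generate(**inputs, max_new_tokens=60, do_sample=False, streamer=streamer)
```

If you need the chunks programmatically (e.g., to forward to a web UI), use `TextIteratorStreamer` instead, as shown in the existing pattern on this page.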
+ +### TextIteratorStreamer (common pattern with a background thread) +```python +import torch +from threading import Thread +from transformers import AutoModelForCausalLM, AutoTokenizer, TextIteratorStreamer + +model_id = "distilbert/distilgpt2" +tok = AutoTokenizer.from_pretrained(model_id) +device = "cuda" if torch.cuda.is_available() else "cpu" +model = AutoModelForCausalLM.from_pretrained(model_id).to(device) +model.eval() + +prompt = "Tell me a short story about a robot:" +inputs = tok(prompt, return_tensors="pt").to(model.device) +streamer = TextIteratorStreamer(tok, skip_special_tokens=True, skip_prompt=True) +generation_kwargs = dict( +    **inputs, +    max_new_tokens=120, +    do_sample=True, +    temperature=0.9, +    top_p=0.95, +    streamer=streamer, +) + +thread = Thread(target=model.generate, kwargs=generation_kwargs) +thread.start() + +for text_chunk in streamer: +    print(text_chunk, end="", flush=True) + +thread.join() +print() +``` + +Pitfall: Some pipelines/deepcopies can conflict with streamer objects; if you hit errors, call `model.generate` directly (like above) rather than wrapping in a pipeline. + +--- + +## Inspecting generation internals (scores, beams, etc.)
+ +If you need token-level probabilities, request structured outputs from `generate()`: + +```python +with torch.inference_mode(): + out = model.generate( + **inputs, + max_new_tokens=50, + do_sample=False, + return_dict_in_generate=True, + output_scores=True, + ) + +# out.sequences: token ids +# out.scores: tuple of per-step logits (one tensor per generated step) +print(type(out)) +print(out.sequences.shape, len(out.scores)) +``` +--- + +## What to change (knobs that matter most) + +Length / termination: +- `max_new_tokens` (primary) +- `min_new_tokens` / `min_length` +- `eos_token_id`, `pad_token_id` +- `stopping_criteria` + +Creativity / diversity: +- `do_sample` +- `temperature` +- `top_p`, `top_k` +- `typical_p` (if supported by your version/model) + +Determinism / search: +- `num_beams` +- `early_stopping` +- `length_penalty` + +Repetition control: +- `repetition_penalty` +- `no_repeat_ngram_size` +- `encoder_no_repeat_ngram_size` (encoder-decoder) + +Multiple outputs: +- `num_return_sequences` (sampling or beams + sampling variants) + +--- + +## Pitfalls & fixes (high-frequency) + +### “It ignores temperature/top_p” +Sampling knobs only apply when `do_sample=True`. +Fix: set `do_sample=True` (and typically keep `num_beams=1` for pure sampling). + +### “It stops too early / too late” +- Prefer `max_new_tokens` for length. +- Verify `eos_token_id` and that you didn’t set `min_new_tokens` too high. + +### “Beam search is repetitive” +Try: +- smaller `num_beams` (e.g., 2–4) +- `repetition_penalty` or `no_repeat_ngram_size` +- or switch to sampling with moderate temperature/top_p. + +### “Decoder-only output contains prompt” +Slice using `prompt_len` (see above). 
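A minimal stand-in for this slice (toy token ids in plain Python lists; the same `[prompt_len:]` slice works on a tensor row such as `out[0]`):

```python
# Toy illustration with hypothetical token ids: for decoder-only models,
# generate() returns prompt + continuation, so drop the first prompt_len tokens.
prompt_ids = [101, 2054, 2003]              # stand-in for inputs["input_ids"][0]
generated = prompt_ids + [1037, 3231, 102]  # stand-in for out[0]
prompt_len = len(prompt_ids)
new_token_ids = generated[prompt_len:]
print(new_token_ids)  # [1037, 3231, 102]
```

With real outputs, compute `prompt_len = inputs["input_ids"].shape[1]` and decode `out[0][prompt_len:]` with `skip_special_tokens=True`.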
+ +### “Batched generation breaks on padding” +For decoder-only: +- ensure a pad token exists (`tok.pad_token = tok.eos_token` is common) +- consider `tok.padding_side = "left"` for batched generation + +### “OOM during generation” +Route to `performance.md` for: +- `device_map="auto"`, dtype reduction, quantization +- smaller `max_new_tokens`, smaller batch size +- attention backend / KV cache strategies + +--- + +## Repo hotspots (when asked “where is this implemented?”) + +Generation configuration + defaults: +- `src/transformers/generation/configuration_utils.py` + +Streaming: +- `src/transformers/generation/streamers.py` + +Logits processors / warpers (repetition penalty, top-k/top-p, etc.): +- `src/transformers/generation/logits_process.py` + +Pipelines wrapping generation: +- `src/transformers/pipelines/text_generation.py` + +Core generate logic commonly lives under: +- `src/transformers/generation/` (search for `GenerationMixin` and `generate`) + +--- + +## Verification checklist (anti-hallucination) + +When uncertain, verify in this order: +1) `reference/generated/public_api.md` (does the symbol/kwarg exist in this version?) +2) `reference/generated/module_tree.md` (where is it implemented?) +3) `reference/generated/docs_map.md` (where is it documented?) +4) Then inspect `src/transformers/generation/...` and grep the exact name (e.g., `stop_strings`, `typical_p`, `TextIteratorStreamer`).
\ No newline at end of file diff --git a/.claude/skills/transformers-api/reference/areas/inference.md b/.claude/skills/transformers-api/reference/areas/inference.md new file mode 100644 index 000000000000..bc8d19163fbb --- /dev/null +++ b/.claude/skills/transformers-api/reference/areas/inference.md @@ -0,0 +1,286 @@ +# Inference (pipelines + Auto* inference) + +## Contents +- [Scope](#scope) +- [Minimum questions to ask](#minimum-questions-to-ask) +- [Decision guide: `pipeline()` vs manual Auto*](#decision-guide-pipeline-vs-manual-auto) +- [Quickstarts](#quickstarts) + - [1) Pipeline: text classification (single + batch)](#1-pipeline-text-classification-single--batch) + - [2) Pipeline: iterate a Dataset efficiently (KeyDataset)](#2-pipeline-iterate-a-dataset-efficiently-keydataset) + - [3) Pipeline: generator input (num_workers caveat)](#3-pipeline-generator-input-num_workers-caveat) + - [4) Pipeline: image classification (non-text example)](#4-pipeline-image-classification-non-text-example) + - [5) Manual Auto*: classification logits (most control)](#5-manual-auto-classification-logits-most-control) + - [6) Manual Auto*: embeddings (mean pool)](#6-manual-auto-embeddings-mean-pool) +- [Knobs that matter (3–8)](#knobs-that-matter-38) +- [Pitfalls & fixes](#pitfalls--fixes) +- [Chunk batching (QA / zero-shot) and why it matters](#chunk-batching-qa--zero-shot-and-why-it-matters) +- [Verify / locate in repo](#verify--locate-in-repo) + +--- + +## Scope + +Use this page when the user wants to **run a model for inference** (predict/classify/score/encode) in `transformers`. 
+ +--- + +## Minimum questions to ask + +Ask only what you need to produce a runnable snippet (0–5 questions): +1) **Task** (e.g., `text-classification`, `question-answering`, `automatic-speech-recognition`, `image-classification`, `feature-extraction`) +2) **Model id or local path** (and `revision` if pinned) +3) **Backend + device** (PyTorch/TF/JAX; CPU/CUDA/MPS; rough VRAM if relevant) +4) **Input modality** (text/image/audio) if unclear +5) If blocked: **full traceback + exact versions** + smallest repro + +--- + +## Decision guide: `pipeline()` vs manual Auto* + +### Prefer `pipeline()` when… +- You want the fastest path to correct inference with task-specific preprocessing/postprocessing +- You want easy batching or dataset iteration +- You’re okay with outputs formatted by the task pipeline + +### Prefer manual Auto* when… +- You need direct control over tensors/logits/hidden states and custom pooling/postprocessing +- You need to debug shapes/dtypes/devices precisely +- You’re integrating into an existing service/loop and want strict control + +--- + +## Quickstarts + +### 1. Pipeline: text classification (single + batch) + +```python +from transformers import pipeline + +pipe = pipeline( + task="text-classification", + model="distilbert/distilbert-base-uncased-finetuned-sst-2-english", + device=0, # GPU ordinal; use -1 for CPU + dtype="auto", # can also be torch.float16 / "float16" for PyTorch models +) + +print(pipe("This restaurant is awesome")) +print(pipe(["Great!", "Terrible..."], batch_size=8)) +``` + +Notes: +- For large models, prefer `device_map="auto"` over a single `device` (sharding/offload). +- If you must set `trust_remote_code=True`, pin `revision=` and treat it like running third-party code. + +--- + +### 2. Pipeline: iterate a Dataset efficiently (KeyDataset) + +Recommended for large datasets: iterate the dataset directly to avoid loading everything into memory and to avoid writing your own batching loops. 
+ +```python +import datasets +from tqdm.auto import tqdm +from transformers import pipeline +from transformers.pipelines.pt_utils import KeyDataset + +pipe = pipeline( + "text-classification", + model="distilbert/distilbert-base-uncased-finetuned-sst-2-english", + device=0, +) + +ds = datasets.load_dataset("imdb", split="test[:200]") + +# Some texts tokenize longer than the model’s max sequence length (e.g., 512), causing a size-mismatch error; truncation (and padding for batching) fixes it by enforcing a consistent max length. +for out in tqdm(pipe(KeyDataset(ds, "text"), batch_size=16, truncation=True, max_length=512, padding=True)): + pass +``` + +--- + +### 3. Pipeline: generator input (num_workers caveat) + +A generator/iterator is convenient for streaming inputs (queues/HTTP/DB), but note the caveat: with iterative generators you cannot use `num_workers > 1` for multi-process preprocessing. + +```python +from transformers import pipeline + +pipe = pipeline( + "text-classification", + model="distilbert/distilbert-base-uncased-finetuned-sst-2-english", + device=0, +) + +def data(): + for i in range(100): + yield f"My example {i}" + +# Caveat: because this is iterative, you cannot use num_workers > 1 to preprocess in parallel. +for out in pipe(data(), batch_size=8): + pass +``` + +--- + +### 4. Pipeline: image classification (non-text example) + +Pipelines support computer vision tasks. Inputs may be: +- an HTTP(S) URL string +- a local file path string +- a PIL image object + +If you pass a *batch* of images, they must all be in the same format (all URLs, all paths, or all PIL images). + +```python +from transformers import pipeline + +# Vision pipelines require Pillow (PIL). If you get: "This image processor cannot be instantiated... install Pillow", +# run: pip install -U pillow (or: conda install -c conda-forge pillow) and restart your notebook/kernel. 
+clf = pipeline( + "image-classification", + model="google/vit-base-patch16-224", + device=0, + dtype="auto", +) + +img_url = "https://huggingface.co/datasets/Narsil/image_dummy/raw/main/parrots.png" +print(clf(img_url)) +``` + +--- + +### 5. Manual Auto*: classification logits (most control) + +Use this as the baseline when debugging correctness or needing raw logits. + +```python +import torch +from transformers import AutoTokenizer, AutoModelForSequenceClassification + +model_id = "distilbert/distilbert-base-uncased-finetuned-sst-2-english" + +tok = AutoTokenizer.from_pretrained(model_id) +model = AutoModelForSequenceClassification.from_pretrained(model_id) + +device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu") +model.to(device).eval() + +texts = ["I love this.", "I hate this."] +batch = tok(texts, return_tensors="pt", padding=True, truncation=True) +batch = {k: v.to(device) for k, v in batch.items()} + +with torch.inference_mode(): + logits = model(**batch).logits + probs = logits.softmax(dim=-1) + +print("probs:", probs) +print("pred:", probs.argmax(dim=-1)) +``` + +--- + +### 6. Manual Auto*: embeddings (mean pool) + +Use when the user wants embeddings/features (not generation). 
+ +```python +import torch +from transformers import AutoTokenizer, AutoModel + +model_id = "distilbert-base-uncased" + +tok = AutoTokenizer.from_pretrained(model_id) +model = AutoModel.from_pretrained(model_id) + +device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu") +model.to(device).eval() + +texts = ["hello world", "another sentence"] +batch = tok(texts, return_tensors="pt", padding=True, truncation=True) +batch = {k: v.to(device) for k, v in batch.items()} + +with torch.inference_mode(): + out = model(**batch) # out.last_hidden_state: (B, T, H) + mask = batch["attention_mask"].unsqueeze(-1).type_as(out.last_hidden_state) # (B, T, 1) + summed = (out.last_hidden_state * mask).sum(dim=1) # (B, H) + counts = mask.sum(dim=1).clamp(min=1) # (B, 1) + emb = summed / counts # (B, H) + +print("embeddings shape:", emb.shape) +``` + +If they need “sentence embeddings” in production: +- confirm pooling + normalization strategy +- validate with a small retrieval sanity check (nearest neighbors look sensible) + +--- + +## Knobs that matter (3–8) + +Prioritize these knobs before anything else: + +1) **Task ↔ checkpoint compatibility** + - Pipeline: correct `task` (or model with an embedded task) + - Manual: correct `AutoModelFor*` class +2) **`model` + `revision`** (pin for reproducibility) +3) **Placement:** `device` vs `device_map` +4) **Precision:** `dtype` (pipeline) / `torch_dtype` (many manual loading paths) +5) **Batching:** list inputs + `batch_size` (avoid per-example loops) +6) **Tokenization:** `padding`, `truncation`, `max_length` +7) **Overrides:** `tokenizer`, `feature_extractor`, `image_processor`, `processor` (when default loading is wrong) +8) **Security/repro:** `trust_remote_code` (only if trusted) + pinned `revision` + +Useful to know: the documented `pipeline()` constructor includes (among others) +`task`, `model`, `config`, `tokenizer`, `feature_extractor`, `image_processor`, `processor`, `revision`, `use_fast`, `token`, +`device`, 
`device_map`, `dtype='auto'`, `trust_remote_code`, and `model_kwargs`. + +--- + +## Pitfalls & fixes + +- **It’s slow** + - You’re processing one-by-one → pass a **list** / **dataset iterator** and use `batch_size` (avoid per-example loops) + - Batching isn’t always faster → **measure** on your hardware/model; batching is often most helpful on GPU + - You’re on CPU → consider moving to GPU; (rule of thumb: batching on CPU often doesn’t help much) + - Inputs are huge → set `truncation=True` and tune `max_length` (shorter `max_length` is usually faster/cheaper) + +- **Wrong head / mismatched task** + - Pipeline: ensure `task` matches the checkpoint’s intent (e.g., `"text-classification"` vs `"token-classification"`) + - Manual: choose the correct `AutoModelFor*` (e.g., `AutoModelForSequenceClassification`, `AutoModelForTokenClassification`) + +- **Device / dtype issues** + - Manual: move **both** the model and **all** input tensors to the same device + - Inference best-practice: `model.eval()` (often already true after `from_pretrained`) + `torch.inference_mode()` + - Pipeline placement: use **either** `device` **or** `device_map` (don’t set both) + +- **Batching causes OOM** + - Reduce `batch_size`; consider smaller `max_length`; handle OOM gracefully (retry with a smaller batch) + - If lengths vary a lot, consider bucketing by length or using smaller `max_length` to stabilize memory + - For large models, consider `device_map="auto"` (sharding/offload) and lower precision (`dtype="float16"` / `torch.float16` where supported; PyTorch backend) + +--- + +## Chunk batching (QA / zero-shot) and why it matters + +Some tasks (notably `question-answering` and `zero-shot-classification`) may require **multiple forward passes per “one” user input**. +Transformers handles this via a `ChunkPipeline` implementation so you can tune `batch_size` without manually accounting for how many +forward passes a single input triggers. 
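A rough sketch of the effect, with hypothetical window sizes (the real windowing lives inside each `ChunkPipeline` task implementation): one over-long input can expand into several forward passes.

```python
def num_chunks(context_len: int, max_len: int, stride: int) -> int:
    """Hypothetical count of overlapping windows (forward passes) for one input."""
    if context_len <= max_len:
        return 1
    n = 1
    covered = max_len
    while covered < context_len:
        covered += max_len - stride  # each extra window advances by max_len - stride
        n += 1
    return n

print(num_chunks(300, max_len=384, stride=128))   # 1 forward pass
print(num_chunks(1000, max_len=384, stride=128))  # 4 forward passes for one input
```

So a `batch_size` of 8 may translate into far more than 8 forward passes' worth of work per "batch" for these tasks.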
+ +Practical implications: +- If a user reports “batch_size doesn’t behave as expected” for QA/zero-shot, check whether chunking is the cause. +- Don’t assume “1 input = 1 forward pass” for these pipelines. + +--- + +## Verify / locate in repo + +Common repo hotspots: +- Pipelines: +  - `src/transformers/pipelines/__init__.py` (factory/registry) +  - `src/transformers/pipelines/base.py` (base `Pipeline` / batching machinery) +  - `src/transformers/pipelines/*.py` (task implementations) +- Auto factories: +  - `src/transformers/models/auto/` (AutoModel/AutoConfig/AutoTokenizer mappings) +- Core loading utilities: +  - `src/transformers/modeling_utils.py` +  - `src/transformers/configuration_utils.py` \ No newline at end of file diff --git a/.claude/skills/transformers-api/reference/areas/performance.md b/.claude/skills/transformers-api/reference/areas/performance.md new file mode 100644 index 000000000000..0952842a97fc --- /dev/null +++ b/.claude/skills/transformers-api/reference/areas/performance.md @@ -0,0 +1,371 @@ +# Performance (memory + speed + quantization) + +## Contents +- [Scope](#scope) +- [Minimum questions to ask](#minimum-questions-to-ask) +- [Triage ladder (do these first)](#triage-ladder-do-these-first) +- [Quickstarts](#quickstarts) +  - [1) Baseline: correct device placement + mixed precision](#1-baseline-correct-device-placement--mixed-precision) +  - [2) Faster attention: set `attn_implementation`](#2-faster-attention-set-attn_implementation) +  - [3) `torch.compile`: static cache + compile `forward` (generation)](#3-torchcompile-static-cache--compile-forward-generation) +  - [4) bitsandbytes 8-bit / 4-bit: `BitsAndBytesConfig`](#4-bitsandbytes-8-bit--4-bit-bitsandbytesconfig) +  - [5) GPTQ: post-training int4 with `gptqmodel` + `GPTQConfig`](#5-gptq-post-training-int4-with-gptqmodel--gptqconfig) +  - [6) Continuous batching for serving: `generate_batch()` / `transformers serve`](#6-continuous-batching-for-serving-generate_batch--transformers-serve) +- [Knobs that matter
(3–8)](#knobs-that-matter-38) +- [Pitfalls & fixes](#pitfalls--fixes) +- [Verify / locate in repo](#verify--locate-in-repo) + +--- + +## Scope + +Use this page when the user’s goal is **performance** in `transformers`: +- Reduce **VRAM/RAM** (fit the model) +- Increase **throughput** (tokens/sec, examples/sec) +- Reduce **latency** (time-to-first-token, p95) +- Use **quantization**, **compiled execution**, **optimized attention/kernels**, **parallelism**, or **continuous batching** + +--- + +## Minimum questions to ask + +Ask only what you need to recommend the right optimization (0–5 questions): +1) **Workload**: inference vs training? generation vs encoder-only? +2) **Target**: memory bound vs compute bound? (OOM? too slow? p95 latency? throughput?) +3) **Hardware**: CPU vs GPU (which GPU?) vs multi-GPU? +4) **Model + dtype constraints**: model id/path + `transformers` version + backend (PyTorch/TF/JAX) +5) If blocked: exact **OOM/traceback**, plus a minimal runnable snippet + +--- + +## Triage ladder (do these first) + +This ordering avoids “cool tricks” before basics: + +1) **Stop accidental slow paths** + - Batch your requests; avoid per-item loops. + - Ensure the model and inputs are on the same device. +2) **Right-size precision** + - Mixed precision (`float16` / `bfloat16`) usually yields large speed/memory wins on GPUs. +3) **Use an optimized attention backend** + - Swap `attn_implementation` before changing architectures. +4) **Compile** + - `torch.compile` can reduce Python overhead and fuse kernels. +5) **Quantize** + - 8-bit / 4-bit (bitsandbytes) or GPTQ can be the difference between “fits” and “doesn’t”. +6) **Scale/serve** + - Continuous batching and parallelism matter most when serving many concurrent requests. + +--- + +## Quickstarts + +### 1. Baseline: correct device placement + mixed precision + +Use this when the user says “it’s slow” or “it OOMs” and you need a sane baseline. 
+ +```python +import torch +from transformers import AutoTokenizer, AutoModelForCausalLM + +model_id = "google/gemma-2b" # example + +tokenizer = AutoTokenizer.from_pretrained(model_id) + +# Mixed precision + automatic device placement (single GPU or multi-GPU sharding/offload) +model = AutoModelForCausalLM.from_pretrained( + model_id, + device_map="auto", + dtype=torch.bfloat16, # or torch.float16 +).eval() + + +# Put inputs on the model's device +inputs = tokenizer("Hello!", return_tensors="pt").to(model.device) + +with torch.inference_mode(): + out = model.generate(**inputs, max_new_tokens=32) + +print(tokenizer.decode(out[0], skip_special_tokens=True)) +``` + +The `dtype` argument controls the instantiated weight dtype. +- Use `dtype="auto"` to load the checkpoint’s intended dtype. +- Or force `dtype=torch.float16` / `dtype=torch.bfloat16` for mixed precision (GPU permitting). + + +--- + +### 2. Faster attention: set `attn_implementation` + +Transformers exposes multiple attention backends through a single knob: `attn_implementation`. +Supported values in the attention-backends interface include (among others): +"flash_attention_3", "flash_attention_2", "flex_attention", "sdpa" (and "eager"), plus paged variants like "paged|flash_attention_3" / "paged|flash_attention_2" / "paged|sdpa" / "paged|eager". 
+ + +```python +from transformers import AutoModelForCausalLM + +model = AutoModelForCausalLM.from_pretrained( + "meta-llama/Llama-3.2-1B", + attn_implementation="flash_attention_2", +) +``` + +You can also switch implementations at runtime without reloading: + +```python +model.set_attn_implementation("sdpa") +``` + +If you don’t want to install a FlashAttention package (CUDA/PyTorch version mismatch pain), you can load a compiled kernel from the Hub via the Kernels integration: + +```python +from transformers import AutoModelForCausalLM + +model = AutoModelForCausalLM.from_pretrained( + "meta-llama/Llama-3.2-1B", + attn_implementation="kernels-community/flash-attn2", +) +``` + +**Gotchas (read before benchmarking):** +- **Backend availability depends on model + PyTorch/CUDA + dtype.** + For example, FlashAttention2 requires CUDA and typically `float16` or `bfloat16`; it will silently fall back or error if the dtype or build is incompatible. +- **FlashAttention2 does not support attention over padded tokens.** + In batched generation with padding, this can reduce performance unless you avoid padding, unpad inputs, or use an alternative backend (e.g. SDPA). +- **Some attention params force a fallback to eager.** + For example, `output_attentions=True` is unsupported in some optimized attention paths and triggers a fallback warning. + +--- + +### 3. `torch.compile`: static cache + compile `forward` (generation) + +For generation workloads, Transformers recommends enabling StaticCache via `cache_implementation="static"`. This also turns on automatic compilation of the decoding stage for greedy and sampling decode. You can control this via `compile_config` (or disable it with `disable_compile`) and still need stable shapes to avoid recompilation. 
+ + +```python +import os +import torch +from transformers import AutoModelForCausalLM, AutoTokenizer + +os.environ["TOKENIZERS_PARALLELISM"] = "false" + +model_id = "google/gemma-2b" +tokenizer = AutoTokenizer.from_pretrained(model_id) + +model = AutoModelForCausalLM.from_pretrained( + model_id, + device_map="auto", + dtype="auto", # or torch.float16 / torch.bfloat16 +).eval() + +# Compile the forward pass; generate() calls model.forward internally +model.forward = torch.compile( + model.forward, + mode="reduce-overhead", + fullgraph=True, +) + +# Keep shapes stable to avoid recompilation +inputs = tokenizer( + "Hello!", + return_tensors="pt", + pad_to_multiple_of=8, +).to(model.device) + +with torch.inference_mode(): + outputs = model.generate(**inputs, max_new_tokens=32, cache_implementation="static") + +print(tokenizer.decode(outputs[0], skip_special_tokens=True)) +``` + +Notes: +- The **first call is slower** due to compilation; benchmark after warmup. +- Keep batch size, prompt length, and `max_new_tokens` stable to avoid recompilation. +- If `fullgraph=True` fails due to graph breaks, retry with `fullgraph=False`. + +--- + +### 4. bitsandbytes 8-bit / 4-bit: `BitsAndBytesConfig` + +This is the fastest “make it fit” move for many LLMs. 
Install deps first: + +```bash +pip install --upgrade transformers accelerate bitsandbytes +``` + +**8-bit example (generation path):** + +```python +from transformers import BitsAndBytesConfig, AutoTokenizer, AutoModelForCausalLM + +quantization_config = BitsAndBytesConfig(load_in_8bit=True) + +tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B") +model = AutoModelForCausalLM.from_pretrained( + "meta-llama/Llama-3.1-8B", + device_map="auto", + quantization_config=quantization_config, +) + +inputs = tokenizer("Hello, my llama is cute", return_tensors="pt").to(model.device) +generated_ids = model.generate(**inputs) +print(tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]) +``` +**Preferred API:** +Use `quantization_config=BitsAndBytesConfig(...)` when loading models. +Avoid passing `load_in_8bit` or `load_in_4bit` directly to `from_pretrained()`; these flags exist for compatibility but are not the recommended interface. + +Notes: +- The GPU performance guide explicitly recommends **using `generate()` rather than the Pipeline API** for **8-bit text generation**, because Pipeline is not optimized for 8-bit models and some sampling strategies may not be supported there. +- For multi-GPU/distributed, you can pass `max_memory={...}` to control per-device allocation when using `device_map="auto"`. + +--- + +### 5. GPTQ: post-training int4 with `gptqmodel` + `GPTQConfig` + +Transformers’ GPTQ doc states: +- GPTQ is supported via the **`gptqmodel`** package. +- Transformers supports GPTQ via GPTQModel and still documents AutoGPTQ, but AutoGPTQ is likely to be deprecated; prefer GPTQModel going forward. 
+ +Install: + +```bash +pip install --upgrade accelerate optimum transformers +pip install gptqmodel --no-build-isolation +``` + +Quantize (example pattern): + +```python +from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig + +tokenizer = AutoTokenizer.from_pretrained("facebook/opt-125m") +gptq_config = GPTQConfig(bits=4, dataset="c4", tokenizer=tokenizer) + +quantized_model = AutoModelForCausalLM.from_pretrained( + "facebook/opt-125m", + device_map="auto", + quantization_config=gptq_config, +) +``` + +If you hit memory pressure during quantization, GPTQ docs recommend using `max_memory={...}` (disk offloading is not supported for the dataset). + +--- + +### 6. Continuous batching for serving: `generate_batch()` / `transformers serve` + +Continuous batching increases throughput and reduces latency by dynamically re-forming the batch each step (removing finished requests and adding new ones) to avoid GPU idling. It works with `transformers serve` and `generate_batch()`. + +- **PagedAttention is automatically enabled under continuous batching.** + You can also explicitly select a paged backend via `attn_implementation="paged|..."` if needed. 
+ + +Minimal `generate_batch()` shape (tokenized inputs list + `GenerationConfig`): + +```python +import datasets +import torch +from transformers import AutoModelForCausalLM, AutoTokenizer +from transformers.generation import GenerationConfig + +model = AutoModelForCausalLM.from_pretrained( + "Qwen/Qwen3-4B-Instruct-2507", + attn_implementation="sdpa_paged", + device_map="cuda", + dtype=torch.bfloat16, +) +tokenizer = AutoTokenizer.from_pretrained( + "Qwen/Qwen3-4B-Instruct-2507", + padding_side="left", +) + +dataset = datasets.load_dataset("openai/gsm8k", "socratic", split="test").select(range(8)) +tokenized = dataset.map(lambda x: tokenizer(x["question"]), batched=True) + +simple_batch_inputs = [item["input_ids"] for item in tokenized] + +generation_config = GenerationConfig( + max_new_tokens=32, + do_sample=False, + eos_token_id=tokenizer.eos_token_id, + pad_token_id=tokenizer.pad_token_id, + max_batch_tokens=512, # token budget for batching + use_cuda_graph=False, +) + +batch_outputs = model.generate_batch(inputs=simple_batch_inputs, generation_config=generation_config) + +for request_id, output in batch_outputs.items(): + print(request_id, tokenizer.decode(output.generated_tokens, skip_special_tokens=True)) +``` + +If you need custom scheduling, the docs expose a `ContinuousBatchingManager` and schedulers (default FIFO). + +--- + +## Knobs that matter (3–8) + +Prioritize these knobs before anything else: + +1) **Batching + padding strategy** + - batch requests; for LLM generation use left-padding (`padding_side="left"`) when appropriate +2) **Placement**: `device` vs `device_map` (and **inputs on `model.device`**) +3) **Precision**: `dtype` / `torch_dtype` (fp16/bf16) vs full fp32 +4) **Attention backend**: `attn_implementation` (FlashAttention/SDPA/paged variants) +5) **Compilation**: `torch.compile(...)` knobs (`mode`, `fullgraph`) or compile-via-`generate()` with a static cache +6) **Quantization**: `quantization_config` (bitsandbytes 8/4-bit, GPTQ, etc.) 
+7) **Memory partitioning** (multi-GPU/offload): `max_memory={...}` with `device_map="auto"` +8) **Serving throughput**: continuous batching (`generate_batch()`, `max_batch_tokens`) / `transformers serve` + +--- + +## Pitfalls & fixes + +- **“It’s still slow after moving to GPU”** +  - Inputs not on GPU → ensure `tokenizer(...).to(model.device)` +  - You’re running batch_size=1 loops → batch requests; avoid Python overhead +- **“FlashAttention enabled but errors”** +  - Use the Kernels integration (`attn_implementation="kernels-community/flash-attn2"`) to avoid local build/version mismatch +  - Or fall back to `attn_implementation="sdpa"` for a safer baseline +- **“`torch.compile` made it slower”** +  - First run includes compilation; benchmark after warmup +  - Try `mode="reduce-overhead"`; avoid recompiling on shape changes (keep shapes stable) +- **“8-bit pipeline is slow / sampling not supported”** +  - For 8-bit text generation, prefer calling `model.generate()` directly (per GPU perf guide) +- **“OOM when quantizing (GPTQ)”** +  - Use `device_map="auto"` and constrain with `max_memory={...}` +  - Prefer loading an already-quantized checkpoint from the Hub when available +- **“Serving latency spikes under load”** +  - Use continuous batching to prevent GPU idle bubbles and handle ragged request lengths +  - Tune `max_batch_tokens` and request scheduling + +--- + +## Verify / locate in repo + +Repo hotspots (performance): +- **Loading / placement (`from_pretrained`, `device_map`, `max_memory`, `dtype`)**: `src/transformers/modeling_utils.py` +- **Attention backend interface (`attn_implementation`, `set_attn_implementation`)**: docs “Attention backends” + model code in `src/transformers/models/<model_name>/modeling_<model_name>.py` (where eager/SDPA/FA branches usually live) +- **KV cache internals (Static/DynamicCache)**: `src/transformers/cache_utils.py` + KV-cache docs (shows `cache_implementation="static"` + compile behavior) +- **Generation cache/config knobs (`GenerationConfig`, cache impl
wiring)**: `src/transformers/generation/configuration_utils.py` +- **Core `generate()` perf paths**: `src/transformers/generation/utils.py` +- **Continuous batching (`generate_batch`)**: `src/transformers/generation/continuous_batching/continuous_api.py` +- **Quantization config objects**: `src/transformers/utils/quantization_config.py` +- **Quantizer routing (which quantizer gets picked)**: `src/transformers/quantizers/auto.py` +- **bitsandbytes glue + bnb 4bit internals**: `src/transformers/integrations/bitsandbytes.py` and `src/transformers/quantizers/quantizer_bnb_4bit.py` +- **`transformers serve` (CLI + behavior)**: docs “Serving” and implementation under `src/transformers/commands/serving.py` (shows up in tracebacks) + +When uncertain, use Skill verification indexes: +- “Does this symbol/arg exist?” → `reference/generated/public_api.md` +- “Where is it implemented?” → `reference/generated/module_tree.md` + +High-signal repo search keywords (grep these): +- `attn_implementation`, `set_attn_implementation` +- `torch.compile`, `cache_implementation="static"` +- `BitsAndBytesConfig`, `quantization_config`, `load_in_8bit`, `load_in_4bit` +- `GPTQConfig`, `gptqmodel` +- `generate_batch`, `ContinuousBatchingManager`, `init_continuous_batching`, `max_batch_tokens` +- `device_map`, `max_memory` \ No newline at end of file diff --git a/.claude/skills/transformers-api/reference/areas/preprocessing.md b/.claude/skills/transformers-api/reference/areas/preprocessing.md new file mode 100644 index 000000000000..11de3d7d518c --- /dev/null +++ b/.claude/skills/transformers-api/reference/areas/preprocessing.md @@ -0,0 +1,434 @@ +# Preprocessing (tokenizers, processors, image/video processors, feature extractors) + +## Contents +- [Scope](#scope) +- [Minimum questions (0–4)](#minimum-questions-04) +- [Choose the right preprocessor](#choose-the-right-preprocessor) +- [Text preprocessing: `AutoTokenizer`](#text-preprocessing-autotokenizer) +- [Chat templating: 
`apply_chat_template`](#chat-templating-apply_chat_template) +- [Vision preprocessing: `AutoImageProcessor`](#vision-preprocessing-autoimageprocessor) +- [Audio preprocessing: `AutoFeatureExtractor` and `AutoProcessor`](#audio-preprocessing-autofeatureextractor-and-autoprocessor) +- [Video preprocessing: `AutoVideoProcessor`](#video-preprocessing-autovideoprocessor) +- [Multimodal preprocessing: `AutoProcessor`](#multimodal-preprocessing-autoprocessor) +- [Batching + device sanity](#batching--device-sanity) +- [Pitfalls & fixes](#pitfalls--fixes) +- [Repo hotspots](#repo-hotspots) + +--- + +## Scope + +Use this page when the user needs to convert **raw inputs** (text / images / audio / video / multimodal messages) into **model-ready tensors** for `transformers`. + +--- + +## Minimum questions (0–4) + +Ask only what’s needed to produce a runnable snippet: +1) **Modality + task + raw input format** + - Text / vision / audio / video / multimodal + - What you’re passing in (e.g., plain strings, chat messages, image URL/path/PIL, audio array + sampling rate, video frames) + - Desired output (logits, embeddings, generated tokens) / expected shapes if relevant +2) **Model id/path** + - Hugging Face Hub id or local path + - Optional but recommended for reproducibility/security: pinned `revision` (tag/branch/commit) +3) **Backend + device** + - PyTorch / TensorFlow / JAX + - CPU / CUDA / MPS (and which GPU index if CUDA) +4) If blocked: **full traceback + minimal repro** + - Smallest code sample that still fails + the exact error + +--- + +## Choose the right preprocessor + +Rule: **load preprocessing artifacts from the same checkpoint as the model**. 
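A cheap way to catch a mismatched pairing early is to compare the preprocessor's output keys against the model's forward signature. A minimal sketch — `unexpected_input_keys` is an illustrative helper, not a transformers API; in real code you would pass `model.forward` and the dict returned by the preprocessor:

```python
import inspect

def unexpected_input_keys(inputs: dict, forward_fn) -> set:
    """Return the keys in `inputs` that `forward_fn` does not accept.

    A non-empty result usually means the preprocessor and the model were
    loaded from incompatible checkpoints (or the wrong modality class).
    """
    params = inspect.signature(forward_fn).parameters
    # If forward takes **kwargs, every key is formally accepted.
    if any(p.kind is inspect.Parameter.VAR_KEYWORD for p in params.values()):
        return set()
    return set(inputs) - set(params)

# Stand-in for model.forward, just to illustrate the check:
def fake_text_forward(input_ids=None, attention_mask=None, labels=None):
    pass

batch = {"input_ids": [[1, 2]], "attention_mask": [[1, 1]], "pixel_values": [[0.0]]}
print(unexpected_input_keys(batch, fake_text_forward))  # → {'pixel_values'}
```

Note that many recent model forwards accept `**kwargs`, so an empty result is necessary but not sufficient; the table below is the authoritative guide for which class to load.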
+ +| Modality | Preferred class | Typical output keys | +|---|---|---| +| Text | `AutoTokenizer` | `input_ids`, `attention_mask` (maybe `token_type_ids`) | +| Image | `AutoImageProcessor` | `pixel_values` (maybe `pixel_mask`) | +| Audio | `AutoFeatureExtractor` *or* `AutoProcessor` (model-dependent) | `input_values` **or** `input_features` (sometimes `attention_mask`) | +| Video | `AutoVideoProcessor` **or** `AutoImageProcessor` (frame-based; model-dependent) | model-dependent video/frame tensors + optional metadata | +| Multimodal (text+image/audio/video) | `AutoProcessor` | combination (e.g., `input_ids` + `pixel_values`) | + +If the model card/examples show `AutoProcessor`, prefer `AutoProcessor`. +Note: Some video classification models (e.g., VideoMAE) use a frame/image processor (`AutoImageProcessor` / `VideoMAEImageProcessor`) rather than `AutoVideoProcessor`. + +--- + +## Text preprocessing: `AutoTokenizer` + +### Minimal batch tokenization (PyTorch) +```python +from transformers import AutoTokenizer + +model_id = "bert-base-uncased" +tok = AutoTokenizer.from_pretrained(model_id) + +texts = ["hello world", "a much longer example sentence"] +batch = tok( + texts, + padding=True, # pad to longest in batch + truncation=True, # truncate if needed + return_tensors="pt", +) + +print(batch.keys()) +print(batch["input_ids"].shape) +``` + +### Practical padding/truncation defaults +- Safe batch default: `padding=True, truncation=True` +- Deterministic cap: add `max_length=...` +- Static shapes: `padding="max_length"` + `max_length=...` + +### Decoder-only LMs: pad token + left padding for batching +Some causal LMs do not define a pad token. For batched inputs (esp. generation), set it explicitly. 
+```python +from transformers import AutoTokenizer + +model_id = "gpt2" +tok = AutoTokenizer.from_pretrained(model_id) + +if tok.pad_token is None: + tok.pad_token = tok.eos_token + +tok.padding_side = "left" # common for decoder-only batching + +batch = tok(["hi", "hello there"], padding=True, return_tensors="pt") +print(batch["input_ids"].shape) +``` + +### Long inputs: sliding window with overlap (`stride`) +Use this when text exceeds context length and you want overlapping windows. +```python +from transformers import AutoTokenizer + +tok = AutoTokenizer.from_pretrained("bert-base-uncased") + +text = "very long text " * 2000 +enc = tok( + text, + truncation=True, + max_length=512, + stride=128, + return_overflowing_tokens=True, + return_offsets_mapping=True, # best with fast tokenizers +) + +print("num_windows:", len(enc["input_ids"])) +``` + +### Token classification: word alignment (`is_split_into_words`) +```python +from transformers import AutoTokenizer + +tok = AutoTokenizer.from_pretrained("bert-base-cased") + +words = ["New", "York", "City"] +enc = tok(words, is_split_into_words=True, return_tensors="pt") + +# Fast tokenizers provide token->word alignment +word_ids = enc.word_ids(batch_index=0) +print(word_ids) +``` + +--- + +## Chat templating: `apply_chat_template` + +Use chat templates when the model expects a specific conversation format. +If the user’s issue is decoding/stopping/streaming, route to `generation.md`. + +```python +from transformers import AutoTokenizer + +model_id = "meta-llama/Llama-3.1-8B-Instruct" + +# Access/auth note: +# - If this line fails with 401 Unauthorized / GatedRepoError, the repo is gated or private. 
+# - Fix: (1) request/accept access on the model page, then (2) authenticate: +# * terminal: `huggingface-cli login` +# * or set env var `HF_TOKEN=hf_...` and restart your kernel/session +# - Optional token examples: +# * AutoTokenizer.from_pretrained(model_id, token=True) # use cached login or HF_TOKEN +# * AutoTokenizer.from_pretrained(model_id, token="hf_...") # explicit token +# - Public demo alternative (no gating): "TinyLlama/TinyLlama-1.1B-Chat-v1.0" +tok = AutoTokenizer.from_pretrained(model_id) + +messages = [ + {"role": "system", "content": "You are a helpful assistant."}, + {"role": "user", "content": "Write a haiku about preprocessing."}, +] + +prompt = tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True) +print(prompt) + +# If you later tokenize `prompt` yourself, set add_special_tokens=False to avoid duplicating special tokens. +``` + +To directly get token ids: +```python +enc = tok.apply_chat_template(messages, tokenize=True, add_generation_prompt=True, return_tensors="pt") +print(enc.shape) +``` + +--- + +## Vision preprocessing: `AutoImageProcessor` + +### Minimal image preprocessing (PIL) +```python +from transformers import AutoImageProcessor +from PIL import Image + +model_id = "google/vit-base-patch16-224" +imgp = AutoImageProcessor.from_pretrained(model_id) + +image = Image.open("image.jpg") +inputs = imgp(images=image, return_tensors="pt") + +print(inputs.keys()) # typically includes "pixel_values" (and sometimes "pixel_mask", model-dependent) +print(inputs["pixel_values"].shape) +``` + +### Fast image processors (if supported) +Some checkpoints provide a “fast” image processor path. +```python +from transformers import AutoImageProcessor + +imgp = AutoImageProcessor.from_pretrained( + "google/vit-base-patch16-224", + use_fast=True, +) +``` + +### What to change (typical; varies by checkpoint) +Prefer changing processor config/kwargs rather than writing ad-hoc transforms. 
Not every processor supports every knob below: +- resize/crop: `do_resize`, `size`, `do_center_crop`, `crop_size` +- normalize: `do_normalize`, `image_mean`, `image_std` +- rescale: `do_rescale`, `rescale_factor` + +--- + +## Audio preprocessing: `AutoFeatureExtractor` and `AutoProcessor` + +### Key rule: sampling rate must match +If audio outputs are “nonsense,” sampling-rate mismatch is a top cause. +Prefer reading the expected sampling rate from the preprocessor rather than hardcoding it. + +### Waveform models (e.g., wav2vec2): `AutoFeatureExtractor` +```python +import numpy as np +from transformers import AutoFeatureExtractor + +model_id = "facebook/wav2vec2-base-960h" +fe = AutoFeatureExtractor.from_pretrained(model_id) +# 1 second of silence at the model's expected sampling rate (replace with real audio) +sr = fe.sampling_rate +waveform = np.zeros(sr, dtype=np.float32) +inputs = fe( + waveform, + sampling_rate=sr, + padding=True, + return_tensors="pt", +) +print(inputs.keys()) # typically includes "input_values" (+ "attention_mask" sometimes, model-dependent) +``` + +### Spectrogram-feature models (common for Whisper): `AutoProcessor` +Whisper-style models typically use a processor that returns `input_features`. +```python +import numpy as np +from transformers import AutoProcessor + +model_id = "openai/whisper-small" +proc = AutoProcessor.from_pretrained(model_id) +sr = proc.feature_extractor.sampling_rate +waveform = np.zeros(sr, dtype=np.float32) +inputs = proc( + waveform, + sampling_rate=sr, + return_tensors="pt", +) +print(inputs.keys()) # typically includes "input_features" +``` + +--- + +## Video preprocessing: `AutoVideoProcessor` + +Video preprocessing may require a decoding backend depending on how you provide video. +Safest approach (no decoder dependency): **decode frames yourself** and pass frames. + +### Option A (decoder-free): pass frames you already have +Example assumes you have a list of PIL images (frames) or numpy arrays. 
+For a batch of videos, pass a list of frame-lists: `[[frame1, frame2, ...], [...]]` + +```python +# VideoMAE uses an *image/frame* processor; AutoVideoProcessor is for certain VLM/video-chat model types. +import numpy as np +from PIL import Image +from transformers import AutoImageProcessor, VideoMAEForVideoClassification + +mid = "MCG-NJU/videomae-base-finetuned-kinetics" +proc = AutoImageProcessor.from_pretrained(mid) +model = VideoMAEForVideoClassification.from_pretrained(mid) +num_frames = getattr(model.config, "num_frames", 16) +H, W = 224, 224 +frames = [Image.fromarray(np.random.randint(0,256,(H,W,3),dtype=np.uint8)) + for _ in range(num_frames)] +inputs = proc(images=frames, return_tensors="pt") # <-- key change +pred = model(**inputs).logits.argmax(-1).item() +print(model.config.id2label[pred]) +``` + +### Option B: decode with TorchCodec, then pass frames to VideoMAE +```python +# Requirements: +# pip install torch transformers torchcodec +# + install FFmpeg (shared libs; on Windows this matters) + +import torch +from torchcodec.decoders import VideoDecoder +from transformers import AutoImageProcessor, VideoMAEForVideoClassification + +video_path = "video.mp4" +model_id = "MCG-NJU/videomae-base-finetuned-kinetics" + +proc = AutoImageProcessor.from_pretrained(model_id) # VideoMAE is frame-based +model = VideoMAEForVideoClassification.from_pretrained(model_id).eval() + +# TorchCodec decodes frames as uint8 tensors; use NHWC to get (N, H, W, C) +decoder = VideoDecoder(video_path, dimension_order="NHWC") +T = len(decoder) +if T == 0: + raise RuntimeError(f"Video has 0 frames: {video_path}") + +num = getattr(model.config, "num_frames", 16) +idx = torch.linspace(0, T - 1, num).round().long().clamp(0, T - 1) + +fb = decoder.get_frames_at(indices=idx.tolist()) # FrameBatch; pixels in fb.data (uint8) +frames = [fb.data[i].cpu().numpy() for i in range(fb.data.shape[0])] # list of HWC uint8 arrays + +inputs = proc(images=frames, return_tensors="pt") 
+print(inputs.keys()) + +with torch.no_grad(): + pred = model(**inputs).logits.argmax(-1).item() + +print(model.config.id2label[pred]) +``` +--- + +## Multimodal preprocessing: `AutoProcessor` + +Use `AutoProcessor` for models that combine modalities (text + image/audio/video). + +### Recommended: chat template + image + +```python +from transformers import AutoProcessor +from PIL import Image + +model_id = "llava-hf/llava-onevision-qwen2-0.5b-ov-hf" +# For **LLaVA-OneVision**, it’s safest to build the prompt with the **chat template** (it inserts the required image placeholder token). +proc = AutoProcessor.from_pretrained(model_id) +image = Image.open("image.jpg") +messages = [ + { + "role": "user", + "content": [ + {"type": "image"}, + {"type": "text", "text": "Describe this image."}, + ], + } +] +prompt = proc.apply_chat_template(messages, add_generation_prompt=True) +inputs = proc(text=prompt, images=image, return_tensors="pt") +print(inputs.keys()) # typically includes input_ids/attention_mask + pixel_values (and possibly others) +``` +--- + +## Batching + device sanity +### Inspect keys, shapes, dtypes +```python +import torch +for k, v in inputs.items(): + if torch.is_tensor(v): + print(k, tuple(v.shape), v.dtype, v.device) + else: + print(k, type(v)) +``` + +### Move tensors to device (PyTorch) +Some outputs support `.to(device)`; otherwise move per-tensor. 
+```python +import torch +device = "cuda" if torch.cuda.is_available() else "cpu" +try: + inputs = inputs.to(device) +except Exception: + inputs = {k: (v.to(device) if torch.is_tensor(v) else v) for k, v in inputs.items()} +``` +--- + +## Pitfalls & fixes + +### “Batch fails / shapes differ” +- Text: `padding=True` and usually `truncation=True` +- Audio: `padding=True` + consistent `sampling_rate` +- Vision/video: pass lists consistently; avoid mixing PIL paths/URLs/arrays in the same batch + +### “Tokenizer has no pad token” +- Decoder-only: set `pad_token` (often to `eos_token`) and consider `padding_side="left"` + +### “Output keys don’t match model forward” +Print `inputs.keys()` and confirm expected keys: +- text: `input_ids`, `attention_mask` (maybe `token_type_ids`) +- vision: `pixel_values` (maybe `pixel_mask`) +- audio: `input_values` or `input_features` +- multimodal: combinations + +### “Audio outputs are wrong” +- Verify sampling rate, dtype, and that you’re passing a 1D waveform (not stereo without handling) + +### “Double preprocessing (manual normalize + processor normalize)” +- Prefer processor config; if you must customize, disable the relevant processor steps (model-dependent) + +--- + +## Repo hotspots + +### Tokenizers +- src/transformers/tokenization_utils_base.py +- src/transformers/tokenization_utils_fast.py +- src/transformers/tokenization_utils_tokenizers.py +- src/transformers/models/auto/tokenization_auto.py + +### Processors +- src/transformers/processing_utils.py +- src/transformers/models/auto/processing_auto.py + +### Image processors +- src/transformers/image_processing_utils.py +- src/transformers/image_processing_base.py +- src/transformers/models/auto/image_processing_auto.py +- model-specific: src/transformers/models/*/image_processing_*.py + +### Feature extractors +- src/transformers/feature_extraction_utils.py +- src/transformers/models/auto/feature_extraction_auto.py +- model-specific: 
src/transformers/models/*/feature_extraction_*.py + +### Video processors +- src/transformers/video_processing_utils.py +- src/transformers/models/auto/video_processing_auto.py +- src/transformers/video_utils.py +- model-specific: src/transformers/models/*/video_processing_*.py + - example: src/transformers/models/videomae/video_processing_videomae.py + +### Tests (entry points) +- tests/test_tokenization_common.py +- model-specific: tests/models//... \ No newline at end of file diff --git a/.claude/skills/transformers-api/reference/areas/repo-contributing.md b/.claude/skills/transformers-api/reference/areas/repo-contributing.md new file mode 100644 index 000000000000..2d26ef38650c --- /dev/null +++ b/.claude/skills/transformers-api/reference/areas/repo-contributing.md @@ -0,0 +1,315 @@ +# Repo navigation & contributing (where is X implemented? + PR hygiene) + +## Contents +- [Scope](#scope) +- [Minimum questions to ask](#minimum-questions-to-ask) +- [Decision guide](#decision-guide) +- [Quickstarts](#quickstarts) + - [1) Locate an implementation (public API → module → file)](#1-locate-an-implementation-public-api--module--file) + - [2) Set up a dev environment (editable install)](#2-set-up-a-dev-environment-editable-install) + - [3) Run the smallest relevant tests](#3-run-the-smallest-relevant-tests) + - [4) Run style/quality checks (make targets)](#4-run-stylequality-checks-make-targets) + - [5) Run repo consistency checks (make repo-consistency)](#5-run-repo-consistency-checks-make-repo-consistency) + - [6) Build docs locally (doc-builder)](#6-build-docs-locally-doc-builder) + - [7) Model contributions (modular approach + checklist)](#7-model-contributions-modular-approach--checklist) +- [Knobs that matter (3–8)](#knobs-that-matter-38) +- [Pitfalls & fixes](#pitfalls--fixes) +- [Repo hotspots](#repo-hotspots) +- [Verify / locate in repo](#verify--locate-in-repo) + +--- + +## Scope + +Use this page when the user wants to: +- find “**where is X implemented?**” 
(exact file/class/function) +- understand the **repo layout** (`src/`, `tests/`, `docs/`, `examples/`) +- make a **small targeted change** and open a PR safely +- add/update **docs**, **tests**, or a **model** + +--- + +## Minimum questions to ask + +Ask only what you need (0–5 questions): +1) The **symbol/name** (class/function/arg) OR the **behavior** (what changed / what’s wrong) +2) Is the request “**where is it**” or “**change it**” or “**add it**”? +3) Which backend matters (PyTorch/TF/JAX) and which area (pipelines/generation/trainer/tokenizers/processors)? +4) Do they have a **repro** or failing test? (ideal) +5) Are they changing **public API** or internal behavior only? + +--- + +## Decision guide + +### If the question is “Where is X implemented?” +Use this ladder (don’t guess): +1) Confirm the public symbol exists → `reference/generated/public_api.md` +2) Map it to a file path → `reference/generated/module_tree.md` +3) Grep the repo for the symbol / error substring / config key +4) Find the tests that cover it, then adjust minimally + +### If the goal is “Change X” (bug fix / behavior change) +1) Reproduce (minimal script) OR write a failing test first +2) Make the smallest code change +3) Run the smallest relevant tests +4) Run `make fixup` and fix remaining issues +5) Open PR with a clear title and minimal diff + +### If the goal is “Add X” (new model / new feature) +1) Prefer the modular approach when available (keeps contributions maintainable) +2) Add code + docs + tests together +3) Run repo consistency checks so required registries/indexes don’t get missed +4) Keep the PR as small and focused as possible + +--- + +## Quickstarts + +### 1. 
Locate an implementation (public API → module → file) + +Follow this sequence: + +1) **Does the symbol exist publicly?** + Open: `reference/generated/public_api.md` + +2) **Where is it implemented?** + Open: `reference/generated/module_tree.md` + - Identify the owning module/file under `src/transformers/` + - Note adjacent files in the same folder (helpers/configs/variants) + +3) **Grep keywords** + Use 1–3 high-signal search terms: + - exact symbol name (e.g., `set_attn_implementation`) + - error substring from traceback + - config key (e.g., `attn_implementation`, `torch_dtype`) + +--- + +### 2. Set up a dev environment (editable install) + +```bash +git clone https://github.com/<your-username>/transformers.git +cd transformers +git remote add upstream https://github.com/huggingface/transformers.git +git checkout -b my-descriptive-branch +``` + +Before opening a PR (or if a maintainer asks), rebase your branch on upstream: + +```bash +git fetch upstream +git rebase upstream/main +``` + +Editable install in a virtualenv: + +```bash +pip install -e ".[dev]" +``` + +If that fails (optional deps can be heavy), install PyTorch first, then: + +```bash +pip install -e ".[quality]" +``` + +If Transformers was already installed in that env, uninstall it first: + +```bash +pip uninstall transformers +``` + +--- + +### 3. Run the smallest relevant tests + +Run only what you touched first: + +```bash +pytest tests/<path_to_test_file>.py +``` + +Iterate faster with keyword filtering: + +```bash +pytest -k "keyword_here" tests/<path_to_test_file>.py +``` + +#### Match CI’s test selection (tests_fetcher) + +Transformers CI selects tests impacted by your PR diff. You can reproduce that selection locally by running the same helper script CI uses. + +```bash +python utils/tests_fetcher.py +``` + +This creates a `test_list.txt` file with the tests to run; execute them like this: + +```bash +python -m pytest -n 8 --dist=loadfile -rA -s $(cat test_list.txt) +``` + +If you add/modify `@slow` tests, run them explicitly.
By default, slow tests are skipped; set `RUN_SLOW=yes` to enable them — note this can download **many gigabytes** of models (disk + bandwidth required). + +```bash +RUN_SLOW=yes python -m pytest -n auto --dist=loadfile -s -v tests/ +``` + +Accepted variant (you’ll also see this form used in some docs/CI contexts): + +```bash +RUN_SLOW=1 pytest tests/ +``` + + +--- + +### 4. Run style/quality checks (make targets) + +Full formatting: + +```bash +make style +``` + +Quality checks: + +```bash +make quality +``` + +Fast path for PR iteration (targets modified files and also runs repo consistency): + +```bash +make fixup +``` + +--- + +### 5. Run repo consistency checks (make repo-consistency) + +Run: + +```bash +make repo-consistency +``` + +If it fails on copies / generated-content checks, run: + +```bash +make fix-copies +``` + +Then rerun: + +```bash +make repo-consistency +``` + +--- + +### 6. Build docs locally (hf-doc-builder) + +If you modified anything under `docs/source`, make sure the documentation can still be built. + +Install the documentation builder: + +```bash +pip install hf-doc-builder +``` + +Run the following command from the root of the repository: + +```bash +doc-builder build transformers docs/source/en --build_dir ~/tmp/test-build +``` + +Inspect the output under `~/tmp/test-build`. + +--- + +### 7. Model contributions (modular approach + checklist) + +For vision-language / multimodal models (images/videos), follow the official Transformers contribution checklist. + +#### Required checklist (vision-language / multimodal) + +1) **Implement a modular file** +- Prefer the modular architecture pattern: create `modular_<model_name>.py`. +- Use the CLI to scaffold a modular skeleton: + - `transformers add-new-model-like` +- Verify the modular file with: +~~~bash +python utils/modular_model_converter.py +~~~ +This generates the derived files (`modeling_*.py`, `configuration_*.py`, etc.) and CI enforces that they match the modular source.
+ +2) **Add a fast image processor (for image/video models)** +- If your model processes images, add a fast image processor that inherits from `BaseImageProcessorFast` (torch/torchvision-based) for better performance. + +3) **Create a weight conversion script** +- Add `convert_<model_name>_to_hf.py` to convert original checkpoints to the Hugging Face format (load, map keys, save), including usage examples in the script. + +4) **Add integration tests with exact output matching** +- Add an `IntegrationTest` that runs end-to-end processing + modeling with **exact output matching** (generated text for generative models; logits for non-generative models). +- Use real checkpoints + real inputs (consider 4-bit / half precision if the checkpoint is large for CI). + +5) **Update documentation** +- Add or update `docs/source/en/model_doc/<model_name>.md` with usage examples, model description + paper link, and basic usage with `Pipeline` and `AutoModel`. +- Add the model to the appropriate TOC files. + +6) **Look for reusable patterns** +- Reuse established patterns from similar models (LLaVA, Idefics2, Fuyu, etc.) and avoid reinventing core components.
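For the weight conversion script in the checklist above, the heart of most conversions is a deterministic rename pass over the original state dict. A minimal, library-free sketch — the rules and key names here are purely illustrative, not any real model's mapping:

```python
import re

# Illustrative (pattern, replacement) rules — a real script derives these by
# diffing the original checkpoint's keys against the HF model's expected keys.
KEY_RULES = [
    (r"^backbone\.", "model.vision_tower."),
    (r"\.gamma$", ".weight"),  # legacy LayerNorm naming
    (r"\.beta$", ".bias"),
]

def convert_state_dict_keys(state_dict):
    """Rename original checkpoint keys to the (assumed) HF-style names."""
    converted = {}
    for key, tensor in state_dict.items():
        new_key = key
        for pattern, repl in KEY_RULES:
            new_key = re.sub(pattern, repl, new_key)
        converted[new_key] = tensor
    return converted

print(convert_state_dict_keys({"backbone.ln.gamma": 0}))
# → {'model.vision_tower.ln.weight': 0}
```

Keeping the rules as data (rather than ad-hoc `if` chains) makes the mapping easy to review and to unit-test against a handful of known keys.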
+ +Before pushing, run: +~~~bash +make fixup +~~~ + +--- + +## Knobs that matter (3–8) + +1) Keep PRs **small** (avoid drive-by refactors) +2) Repro-first: failing test or minimal repro before changing logic +3) Run the **smallest relevant tests** first, then expand +4) Always run `make fixup` before pushing +5) For new models/features: run `make repo-consistency` +6) If docs changed: run `doc-builder build ...` +7) If slow tests changed/added: run `RUN_SLOW=1 pytest ...` +8) When changing public API: verify docs + exports + tests + +--- + +## Pitfalls & fixes + +- Can’t find where something is defined: + - confirm in `public_api.md`, then locate via `module_tree.md`, then grep +- CI fails on formatting/lint: + - run `make fixup`, then rerun failing checks +- Repo consistency fails: + - run `make repo-consistency`; if it points to copy checks, try `make fix-copies` +- Docs build fails: + - run `doc-builder build transformers docs/source/ --build_dir ...` and fix missing toctree/refs + +--- + +## Repo hotspots + +- Core library: `src/transformers/` +- Models: `src/transformers/models/` +- Pipelines: `src/transformers/pipelines/` +- Generation: `src/transformers/generation/` +- Trainer: `src/transformers/trainer.py` (+ related modules) +- Tests: `tests/` (model tests usually under `tests/models//`) +- Docs: `docs/source/` (English content commonly under `docs/source/en/`) +- Examples: `examples/` + +--- + +## Verify / locate in repo + +When uncertain, use Skill verification indexes: +- “Does this symbol/arg exist?” → `reference/generated/public_api.md` +- “Where is it implemented?” → `reference/generated/module_tree.md` diff --git a/.claude/skills/transformers-api/reference/areas/training.md b/.claude/skills/transformers-api/reference/areas/training.md new file mode 100644 index 000000000000..b07d1591dd48 --- /dev/null +++ b/.claude/skills/transformers-api/reference/areas/training.md @@ -0,0 +1,353 @@ +# Training / Fine-tuning (Trainer + Seq2SeqTrainer) + +## 
Contents +- [Scope](#scope) +- [Minimum questions to ask](#minimum-questions-to-ask) +- [Decision guide: `Trainer` vs `Seq2SeqTrainer` vs custom loop](#decision-guide-trainer-vs-seq2seqtrainer-vs-custom-loop) +- [Quickstarts](#quickstarts) + - [1) Trainer: text classification (baseline + eval)](#1-trainer-text-classification-baseline--eval) + - [2) Trainer: map/tokenize a Dataset safely (columns + labels)](#2-trainer-maptokenize-a-dataset-safely-columns--labels) + - [3) Trainer: distributed / multi-GPU launch (Accelerate/torchrun)](#3-trainer-distributed--multi-gpu-launch-acceleratetorchrun) + - [4) Trainer: image classification (non-text example; `remove_unused_columns=False`)](#4-trainer-image-classification-non-text-example-remove_unused_columnsfalse) + - [5) Trainer: custom loss (minimal override)](#5-trainer-custom-loss-minimal-override) + - [6) Trainer: evaluate/predict-only (no training)](#6-trainer-evaluatepredict-only-no-training) +- [Knobs that matter (3–8)](#knobs-that-matter-38) +- [Pitfalls & fixes](#pitfalls--fixes) +- [Column dropping and why it matters](#column-dropping-and-why-it-matters) +- [Verify / locate in repo](#verify--locate-in-repo) + +--- + +## Scope + +Use this page when the user wants to **fine-tune / train / evaluate** a model in `transformers` using `Trainer` or `Seq2SeqTrainer`. 
+ +--- + +## Minimum questions to ask + +Ask only what you need to produce a runnable snippet (0–6 questions): +1) **Task** (classification / token classification / seq2seq / causal LM / vision / audio) +2) **Model id or local path** (and `revision` if pinned) +3) **Dataset** source + columns (inputs, labels, any extra metadata needed) +4) **Backend + device** (PyTorch; CPU/CUDA/MPS; num GPUs; rough VRAM) +5) **Goal** (correctness vs speed vs memory vs reproducibility) +6) If blocked: **full traceback + exact versions** + smallest repro + +--- +## Decision guide: `Trainer` vs `Seq2SeqTrainer` vs custom loop + +### Prefer `Trainer` when… +- You want the **standard, feature-complete** training/eval loop with minimal custom code. +- Your evaluation can be done from a **forward pass** (loss/logits → `compute_metrics`), optionally with `preprocess_logits_for_metrics` to transform logits before metrics caching. +- You may still be doing seq2seq *training*, but you **don’t need `generate()` during eval/predict** (e.g., loss-based evaluation only). + +### Prefer `Seq2SeqTrainer` when… +- You’re training **sequence-to-sequence** models (e.g., summarization/translation) and want the seq2seq-adapted training path. +- You want evaluation/prediction **with generation** (`predict_with_generate=True`) so you can compute ROUGE/BLEU-style metrics from generated sequences. +- You want easy control over generation at eval/predict time (e.g., `max_length`, `num_beams`, and other `generate` kwargs). + +### Prefer a custom loop when… +- You need **nonstandard optimizer steps**, RL-style objectives, multi-stage losses, or very custom batching/updates that don’t fit cleanly into Trainer customization. +- You’re ready to write your own loop (often with **Accelerate** to avoid distributed/mixed-precision boilerplate). +--- + +## Quickstarts + +### 1. 
Trainer: text classification (baseline + eval) + +```python +import numpy as np +from datasets import load_dataset +from transformers import ( + AutoTokenizer, + AutoModelForSequenceClassification, + DataCollatorWithPadding, + TrainingArguments, + Trainer, +) + +model_id = "distilbert/distilbert-base-uncased-finetuned-sst-2-english" + +ds = load_dataset("imdb") +tok = AutoTokenizer.from_pretrained(model_id) + +def preprocess(batch): + return tok(batch["text"], truncation=True) + +tok_ds = ds.map(preprocess, batched=True, remove_columns=["text"]) + +if "label" in tok_ds["train"].column_names and "labels" not in tok_ds["train"].column_names: + tok_ds = tok_ds.rename_column("label", "labels") + +train_ds = tok_ds["train"].shuffle(seed=42).select(range(2000)) +eval_ds = tok_ds["test"].shuffle(seed=42).select(range(2000)) + +model = AutoModelForSequenceClassification.from_pretrained(model_id, num_labels=2) +collator = DataCollatorWithPadding(tokenizer=tok) + +def compute_metrics(eval_pred): + logits = eval_pred.predictions if hasattr(eval_pred, "predictions") else eval_pred[0] + labels = eval_pred.label_ids if hasattr(eval_pred, "label_ids") else eval_pred[1] + preds = np.argmax(logits, axis=-1) + return {"accuracy": float((preds == labels).mean())} + +args = TrainingArguments( + output_dir="./out_cls", + learning_rate=2e-5, + per_device_train_batch_size=8, + num_train_epochs=1, + weight_decay=0.01, + eval_strategy="no", + save_strategy="no", + load_best_model_at_end=False, + report_to="none", +) + + +trainer = Trainer( + model=model, + args=args, + train_dataset=train_ds, + eval_dataset=eval_ds, + processing_class=tok, + data_collator=collator, + compute_metrics=compute_metrics, +) + +trainer.train() +print(trainer.evaluate()) +trainer.save_model("./out_cls/final") +``` + +Notes: +- If you don’t want eval, set `eval_strategy="no"` and omit `eval_dataset`. 
+- Start by training on a small sample (e.g., 200–2,000 examples) to quickly verify the pipeline runs end-to-end before scaling to the full dataset. +--- + +### 2. Trainer: map/tokenize a Dataset safely (columns + labels) + +This checklist prevents 80% of “why is loss None / labels missing / shapes wrong” issues. + +```python +from datasets import load_dataset +from transformers import AutoTokenizer + +model_id = "distilbert/distilbert-base-uncased" +ds = load_dataset("imdb") +tok = AutoTokenizer.from_pretrained(model_id) + +def preprocess(batch): + out = tok(batch["text"], truncation=True) + out["labels"] = batch["label"] # make supervision explicit + return out + +proc = ds["train"].map(preprocess, batched=True, remove_columns=["text"]) + +ex = proc[0] +print(sorted(ex.keys())) +print("len(input_ids):", len(ex["input_ids"]), "labels:", ex["labels"]) +``` + +If you have multiple supervision fields (e.g., `start_positions`/`end_positions` or multi-task), +keep them as explicit columns and handle them via your model forward and/or `label_names` (advanced). + +--- + +### 3. Trainer: distributed / multi-GPU launch (Accelerate/torchrun) + +Trainer typically scales via the launcher you use (code often stays the same). + +**Option A: Accelerate** +```bash +accelerate config +accelerate launch train.py +``` + +**Option B: torchrun** +```bash +torchrun --nproc_per_node 2 train.py +``` + +Practical scaling knobs: +- Reduce per-device batch size and use `gradient_accumulation_steps` to keep the same global batch. +- For instability, start with fewer GPUs and confirm correctness first. + +--- + +### 4. Trainer: image classification (non-text example; `remove_unused_columns=False`) + +For vision/video, you often need the raw `image`/`video` column to build `pixel_values`. +Trainer may drop columns by default, so set `remove_unused_columns=False`. 
+
+```python
+from datasets import load_dataset
+from transformers import (
+    AutoImageProcessor,
+    AutoModelForImageClassification,
+    DefaultDataCollator,
+    TrainingArguments,
+    Trainer,
+)
+
+model_id = "google/vit-base-patch16-224"
+ds = load_dataset("beans")  # has an `image` column
+
+processor = AutoImageProcessor.from_pretrained(model_id)
+model = AutoModelForImageClassification.from_pretrained(
+    model_id,
+    num_labels=3,
+    ignore_mismatched_sizes=True,  # checkpoint head has 1000 classes; re-initialize it for 3
+)
+
+def transform(batch):
+    # `with_transform` receives batches (dicts of lists); `batch["image"]` is a list of PIL images
+    batch["pixel_values"] = [
+        processor(img.convert("RGB"), return_tensors="pt")["pixel_values"][0]
+        for img in batch["image"]
+    ]
+    if "label" in batch and "labels" not in batch:
+        batch["labels"] = batch.pop("label")
+    # drop raw columns the default collator cannot turn into tensors
+    for col in ("image", "image_file_path"):
+        batch.pop(col, None)
+    return batch
+
+train_ds = ds["train"].with_transform(transform)
+eval_ds = ds["validation"].with_transform(transform)
+
+args = TrainingArguments(
+    output_dir="./out_vit",
+    per_device_train_batch_size=8,
+    per_device_eval_batch_size=8,
+    num_train_epochs=1,
+    learning_rate=5e-5,
+    eval_strategy="epoch",
+    save_strategy="epoch",
+    remove_unused_columns=False,  # IMPORTANT for transforms that rely on raw columns
+    report_to="none",
+)
+
+trainer = Trainer(
+    model=model,
+    args=args,
+    train_dataset=train_ds,
+    eval_dataset=eval_ds,
+    processing_class=processor,
+    data_collator=DefaultDataCollator(),
+)
+
+trainer.train()
+print(trainer.evaluate())
+```
+
+---
+
+### 5. Trainer: custom loss (minimal override)
+
+Use this when you need a custom loss but want to keep Trainer’s loop.
+
+```python
+import torch
+from transformers import Trainer
+
+class CustomLossTrainer(Trainer):
+    # Recent Trainer versions also pass `num_items_in_batch`; accept it even if unused.
+    def compute_loss(self, model, inputs, return_outputs=False, num_items_in_batch=None):
+        labels = inputs.pop("labels")
+        outputs = model(**inputs)
+        logits = outputs.logits
+
+        # Example: multi-label BCE loss (labels should be float multi-hot)
+        loss = torch.nn.functional.binary_cross_entropy_with_logits(logits, labels)
+
+        return (loss, outputs) if return_outputs else loss
+```
+
+Then use it like `Trainer`:
+```python
+# trainer = CustomLossTrainer(model=..., args=..., train_dataset=..., eval_dataset=..., ...)
+```
+
+---
+
+### 6. Trainer: evaluate/predict-only (no training)
+
+Useful for smoke tests, regression checks, or “just compute metrics”.
+
+```python
+# assume you already built: trainer = Trainer(...)
+metrics = trainer.evaluate()
+print("eval:", metrics)
+
+pred = trainer.predict(trainer.eval_dataset)
+print("metrics:", pred.metrics)
+print("predictions shape:", getattr(pred.predictions, "shape", None))
+```
+
+---
+
+## Knobs that matter (3–8)
+
+Prioritize these knobs before anything else:
+
+1) **Task ↔ model head compatibility**
+   - classification → `AutoModelForSequenceClassification`
+   - seq2seq → `AutoModelForSeq2SeqLM` + `Seq2SeqTrainer`
+2) **`model` + `revision`** (pin for reproducibility)
+3) **Data correctness**
+   - label key: prefer `labels`
+   - correct dtypes/shapes (class ids vs multi-hot vs token ids)
+4) **Batching vs memory**
+   - `per_device_train_batch_size`, `gradient_accumulation_steps`
+5) **Evaluation/save cadence**
+   - `eval_strategy`, `eval_steps`, `save_strategy`, `save_steps`
+6) **Precision**
+   - `fp16` / `bf16` (if supported)
+7) **Column handling**
+   - `remove_unused_columns` (often needs `False` for vision/video or custom transforms)
+8) **Best model selection**
+   - `load_best_model_at_end`, `metric_for_best_model`, `greater_is_better`
+
+---
+
+## Pitfalls & fixes
+
+- **TypeError: unexpected keyword**
+  - `eval_strategy` → try 
`evaluation_strategy`
+  - `processing_class` → try `tokenizer`
+- **Eval enabled but no eval dataset**
+  - Provide `eval_dataset`, or set `eval_strategy="no"`.
+- **Loss is `None` / labels ignored**
+  - Ensure the label key is `labels` and its dtype matches the loss (int class ids vs float multi-hot).
+- **Trainer drops columns you still need**
+  - Set `remove_unused_columns=False` and manage inputs carefully (especially vision/video transforms).
+- **OOM**
+  - Reduce batch size, increase `gradient_accumulation_steps`, lower precision, shorten sequence lengths.
+  - For deeper tuning, route to `reference/areas/performance.md`.
+- **Very slow “time to first step”**
+  - Dataset transforms/caching/dataloader workers can dominate; start with a tiny subset and `dataloader_num_workers=0`.
+
+---
+
+## Column dropping and why it matters
+
+By default, Trainer removes dataset columns that aren’t accepted by `model.forward()`.
+
+This is usually helpful, but it can break workflows where:
+- you need raw columns to build model inputs (e.g., `image` → `pixel_values`)
+- you keep metadata columns for metrics/debugging
+
+What to do:
+- If your preprocessing happens in a dataset transform (e.g., `with_transform`) and needs raw columns:
+  - set `TrainingArguments(remove_unused_columns=False)`
+- Ensure your transform or collator produces exactly the tensors the model expects.
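The dropping logic is driven by the signature of `model.forward()`, which you can mimic to predict which columns survive. A minimal sketch (the `forward` below is a hypothetical stand-in, not a transformers model):

```python
import inspect

# Hypothetical stand-in for a model's forward(); real models accept more arguments.
def forward(input_ids=None, attention_mask=None, labels=None):
    pass

accepted = set(inspect.signature(forward).parameters)

batch = {
    "input_ids": [101, 2009, 102],
    "attention_mask": [1, 1, 1],
    "labels": 1,
    "text": "raw example text",  # metadata column
}

# With remove_unused_columns=True (the default), only forward()-compatible keys survive:
kept = {k: v for k, v in batch.items() if k in accepted}
print(sorted(kept))  # ['attention_mask', 'input_ids', 'labels']; 'text' is gone
```

If a column you need (like `text` here, or `image` for vision) disappears, that is exactly the case for `remove_unused_columns=False`.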
+ +--- + +## Verify / locate in repo + +Common repo hotspots: +- Trainer loop + internals: + - `src/transformers/trainer.py` + - `src/transformers/trainer_utils.py` + - `src/transformers/trainer_callback.py` +- Seq2Seq training: + - `src/transformers/trainer_seq2seq.py` + - `src/transformers/training_args_seq2seq.py` +- Training args + defaults: + - `src/transformers/training_args.py` +- Collators: + - `src/transformers/data/data_collator.py` +- Integrations (DeepSpeed/FSDP/etc.): + - `src/transformers/integrations/` \ No newline at end of file diff --git a/.claude/skills/transformers-api/reference/areas/troubleshooting.md b/.claude/skills/transformers-api/reference/areas/troubleshooting.md new file mode 100644 index 000000000000..b5c41ac9e74a --- /dev/null +++ b/.claude/skills/transformers-api/reference/areas/troubleshooting.md @@ -0,0 +1,343 @@ +# Troubleshooting (errors, wrong outputs, regressions) + +## Contents +- [Scope](#scope) +- [Minimum questions to ask](#minimum-questions-to-ask) +- [Decision guide: classify the failure](#decision-guide-classify-the-failure) +- [Quickstarts](#quickstarts) + - [1) Make the error actionable (logging + minimal repro)](#1-make-the-error-actionable-logging--minimal-repro) + - [2) Firewalled / offline / “Connection error”](#2-firewalled--offline--connection-error) + - [3) CUDA out of memory (OOM)](#3-cuda-out-of-memory-oom) + - [4) ImportError / missing class after copy-pasting docs](#4-importerror--missing-class-after-copy-pasting-docs) + - [5) CUDA error: device-side assert triggered](#5-cuda-error-device-side-assert-triggered) + - [6) Silent wrong output from padding tokens (missing attention_mask)](#6-silent-wrong-output-from-padding-tokens-missing-attention_mask) +- [Knobs that matter (3–8)](#knobs-that-matter-38) +- [Pitfalls & fixes](#pitfalls--fixes) +- [Triage flow (repeatable checklist)](#triage-flow-repeatable-checklist) +- [Verify / locate in repo](#verify--locate-in-repo) + +--- + +## Scope + +Use this page when 
the user is **blocked** (exception, crash, hang, or wrong output) while using `transformers`, or they suspect a regression. + +--- + +## Minimum questions to ask + +Ask only what you need (0–5 questions). If the user already pasted these, don’t re-ask. + +1) **Exact failure**: full traceback, or “expected vs actual output” +2) **Minimal repro**: smallest runnable snippet (use `templates/minimal_repro.md`) +3) **Versions**: `transformers`, backend (`torch` / TF / JAX), Python, CUDA (if relevant) +4) **Model + revision**: model id or local path; pinned `revision`/commit if applicable +5) **Hardware**: CPU/CUDA/MPS and rough VRAM if memory/perf related + +### 1-minute triage (when the user is blocked) + +1) Classify the failure (download/cache, install/version, CUDA runtime, silent correctness, task mismatch) +2) Ask at most 3 missing facts (traceback, minimal repro, versions) +3) Apply one smallest fix and one next diagnostic step + +--- + +## Decision guide: classify the failure + +Classify before fixing. Most issues fall into one of these buckets: + +1) **Download / cache / connectivity** + - “Connection error… cannot find requested files in cached path” + - hanging at model download / corporate network / firewalled machines + +2) **Install / version mismatch** + - `ImportError: cannot import name ... from transformers` + - missing newer models/features + +3) **GPU runtime / CUDA** + - CUDA OOM + - `device-side assert triggered` + - dtype/device mismatch + +4) **Silent correctness bugs** + - wrong logits/hidden states with padding + - wrong outputs due to missing masks or wrong preprocessing + +5) **Auto-class / task mismatch** + - `ValueError: Unrecognized configuration class ... for this kind of AutoModel` + - checkpoint doesn’t support the requested task + +Then apply the smallest fix + the smallest next diagnostic step. + +--- + +## Quickstarts + +### 1. 
Make the error actionable (logging + minimal repro) + +Turn up logging and isolate to a minimal repro **before** “trying random flags”. + +```python +# 1) Make transformers logs more verbose (runtime) +from transformers.utils import logging +logging.set_verbosity_debug() # or set_verbosity_info() +logging.enable_default_handler() +logging.enable_explicit_format() + +# 2) If your script is noisy, you can also: +# logging.disable_progress_bar() +``` + +If you can’t change code easily, use environment variables: + +```bash +# More/less logging without editing code: +TRANSFORMERS_VERBOSITY=debug python your_script.py +# To suppress "advice" warnings (not errors): +TRANSFORMERS_NO_ADVISORY_WARNINGS=1 python your_script.py +``` + +Now shrink to a repro: +- one model +- one input +- one forward/generate call +- print shapes/dtypes/devices right before the failure + +(Use `templates/minimal_repro.md`.) + +--- + +### 2. Firewalled / offline / “Connection error” + +Symptoms: connection errors and the cache doesn’t contain the files yet, often in restricted networks. + +Two reliable patterns: + +**A. Pre-download the repo, then run offline** + +```python +from huggingface_hub import snapshot_download + +local_path = snapshot_download( + repo_id="meta-llama/Llama-2-7b-hf", + repo_type="model", + # revision="main", # or a tag/commit for reproducibility +) +print(local_path) +``` +Note: if the model is gated or private, you must be authenticated to download files. Use `hf auth login`, or `huggingface_hub.login()`, or pass `token=...` to loading/downloading methods (including `snapshot_download()` / `from_pretrained()`). + + +```bash +# Avoid HTTP calls to the Hub: +HF_HUB_OFFLINE=1 python your_script.py +``` + +**B. 
Force local-only loading (no network calls)**
+
+```python
+from transformers import AutoModel
+
+model = AutoModel.from_pretrained("./path/to/local/directory", local_files_only=True)
+```
+
+Also sanity-check cache location if you’re in containers/CI:
+- Default cache location is `~/.cache/huggingface/hub`
+  - Windows: `C:\Users\<username>\.cache\huggingface\hub`
+- You can redirect the cache via environment variables (priority order):
+  1) `HF_HUB_CACHE` (sets the hub cache path directly; takes precedence)
+  2) `HF_HOME` (cache lands under `$HF_HOME/hub`)
+  3) `XDG_CACHE_HOME` + `/huggingface` (only if `HF_HOME` is not set)
+
+---
+
+### 3. CUDA out of memory (OOM)
+
+Start with the two levers recommended in the official Transformers troubleshooting guide (training):
+- Reduce `per_device_train_batch_size`
+- Increase `gradient_accumulation_steps` to keep the same overall batch size
+
+```python
+# Trainer-side (example)
+from transformers import TrainingArguments
+
+args = TrainingArguments(
+    output_dir="out",
+    per_device_train_batch_size=1,
+    gradient_accumulation_steps=8,
+)
+```
+
+Common additional levers: reduce inference `batch_size`, reduce `max_length` / `max_new_tokens`, and avoid returning activation-heavy outputs (like hidden states) unless needed.
+
+---
+
+### 4. ImportError / missing class after copy-pasting docs
+
+Symptom example:
+
+`ImportError: cannot import name 'SomeNewThing' from 'transformers'`
+
+This commonly means the docs/snippet assumes a newer version of Transformers.
+
+Fix: upgrade Transformers (and restart the runtime/kernel):
+
+```bash
+pip install --upgrade transformers
+# or install from source (latest changes):
+pip install git+https://github.com/huggingface/transformers
+```
+
+If the model is *very new*, verify you’re on a version that includes it, or install from source.
+
+---
+
+### 5. CUDA error: device-side assert triggered
+
+This is often a vague GPU-side error. Two reliable ways to get a real traceback:
+
+**A. 
Run on CPU to get a better error message** + +```python +# Important: set this before any CUDA context is initialized +import os +os.environ["CUDA_VISIBLE_DEVICES"] = "" # forces CPU +``` + +**B. Force synchronous CUDA to pinpoint the failing op** + +```python +# Important: set this before the first CUDA operation +import os +os.environ["CUDA_LAUNCH_BLOCKING"] = "1" +``` + +Once you have a real stack trace, the most common underlying causes are: +- invalid labels / out-of-range class indices (classification) +- bad token ids (negative or >= vocab size) +- shape mismatches that only surface on GPU kernels + +--- + +### 6. Silent wrong output from padding tokens (missing attention_mask) + +Symptom: outputs/logits differ for padded sequences vs the “true” unpadded sequence, without an obvious error. + +Most of the time, fix by passing `attention_mask` so the model ignores padding tokens: + +```python +import torch +from transformers import AutoModelForSequenceClassification + +model = AutoModelForSequenceClassification.from_pretrained("google-bert/bert-base-uncased") + +# Two sequences, second is padded with 0 +input_ids = torch.tensor([ + [7592, 2057, 2097, 2393, 9611, 2115], + [7592, 0, 0, 0, 0, 0], +]) + +# Correct: mask out padding +attention_mask = torch.tensor([ + [1, 1, 1, 1, 1, 1], + [1, 0, 0, 0, 0, 0], +]) + +out = model(input_ids, attention_mask=attention_mask) +print(out.logits) +``` + +Note: tokenizers often create `attention_mask` for you when you call them, but if you bypass tokenizers and hand-craft `input_ids`, you must provide the mask yourself. + +Why it’s manual: Transformers does not automatically infer `attention_mask` from padding because some models have no padding token, and some use-cases intentionally attend to padding tokens. 
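When you do hand-craft `input_ids`, the mask can be derived from the padding value. A framework-agnostic sketch using plain lists (assumes pad id `0`, matching the example above; wrap the lists in `torch.tensor(...)` before calling the model):

```python
pad_token_id = 0  # assumption: 0 is the padding id (true for BERT-style vocabularies)

input_ids = [
    [7592, 2057, 2097, 2393, 9611, 2115],
    [7592, 0, 0, 0, 0, 0],
]

# 1 = attend to this position, 0 = ignore (padding)
attention_mask = [[int(tok != pad_token_id) for tok in seq] for seq in input_ids]
print(attention_mask)  # [[1, 1, 1, 1, 1, 1], [1, 0, 0, 0, 0, 0]]
```

This only works when the pad id never appears as a real token in the sequence; prefer the tokenizer-produced mask whenever you have one.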
+ +--- + +## Knobs that matter (3–8) + +Prioritize these knobs before anything else: + +1) **Versions**: `transformers` + backend framework version (Torch/TF/JAX) +2) **Model identity**: model id/path + pinned `revision` (reproducibility) +3) **Connectivity mode**: `HF_HUB_OFFLINE`, `local_files_only=True`, cache location env vars +4) **Device placement**: CPU vs CUDA vs MPS; single device vs sharding (`device_map`) when relevant +5) **Batch/shape**: `batch_size`, sequence length, image size, audio length +6) **Masks**: `attention_mask` (text), pixel masks where applicable +7) **Task ↔ class match**: correct `AutoModelFor*` / pipeline task for the checkpoint +8) **Logging**: `TRANSFORMERS_VERBOSITY`, explicit formatting, disable noisy progress bars + +--- + +## Pitfalls & fixes + +- **“Connection error… cannot find requested files in cached path”** + - You’re firewalled/offline and the model isn’t cached → pre-download (`snapshot_download`) then set `HF_HUB_OFFLINE=1`, or use `local_files_only=True`. + +- **ImportError for a class shown in docs** + - You’re on an older Transformers → upgrade or install from source. + +- **OOM** + - Lower batch/length first; then route to `reference/areas/performance.md`. + +- **CUDA device-side assert** + - Run on CPU or set `CUDA_LAUNCH_BLOCKING=1` to get a real traceback; then validate label/token id ranges. + +- **Wrong outputs with padding** + - Pass `attention_mask` (especially when you create `input_ids` manually). + +- **AutoModel config mismatch** + - The checkpoint configuration cannot be mapped to the requested task head (most commonly because the checkpoint does not support that task) → load with a compatible `AutoModel*` or choose a checkpoint that supports the task. 
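For the device-side assert pitfall above, a CPU-side range check on ids often finds the culprit before any GPU run. A minimal sketch (the `vocab_size`/`num_labels` values are assumptions; read the real ones from `model.config`):

```python
def check_ids(name, ids, low, high):
    # Out-of-range ids are a classic cause of CUDA device-side asserts.
    bad = [i for i in ids if not (low <= i < high)]
    if bad:
        raise ValueError(f"{name}: {len(bad)} id(s) outside [{low}, {high}): {bad[:5]}")

vocab_size, num_labels = 30522, 2  # assumed values; use model.config.vocab_size / num_labels

check_ids("input_ids", [101, 7592, 102], 0, vocab_size)
check_ids("labels", [0, 1, 1], 0, num_labels)
print("id ranges OK")
```

Run it over a few batches from your dataloader; a single out-of-range label or token id is enough to crash a GPU kernel.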
+ +--- + +## Triage flow (repeatable checklist) + +Use this flow to avoid random guessing: + +1) **Freeze the environment** + - record versions + model id + revision/commit + - re-run in a clean venv if dependency conflicts are suspected + +2) **Minimize** + - one model + - one batch + - one call (forward or generate) + - print shapes/dtypes/devices right before the failure + +3) **Classify** + - download/cache vs install/version vs CUDA runtime vs silent correctness vs task mismatch + +4) **Apply the smallest fix** + - one change at a time, re-run the minimal repro + +5) **Only then expand** + - re-introduce batching, datasets, distributed, larger inputs, etc. + +6) **If you suspect a regression** + - try the same repro on a known-good version and the current version + - pin the version in the repro so others can reproduce it + +--- + +## Verify / locate in repo + +When uncertain, use Skill verification indexes: +- “Does this symbol/arg exist?” → `reference/generated/public_api.md` +- “Where is it implemented?” → `reference/generated/module_tree.md` + +Common repo hotspots (for debugging “why is this happening?”): +- Central logging utilities: `src/transformers/utils/logging.py` +- Import/version gating: `src/transformers/utils/import_utils.py` +- Model loading + weight init: `src/transformers/modeling_utils.py` +- Auto class mappings: + - `src/transformers/models/auto/modeling_auto.py` + - `src/transformers/models/auto/configuration_auto.py` +- Pipelines core: + - `src/transformers/pipelines/__init__.py` + - `src/transformers/pipelines/base.py` + +If you can’t verify quickly: +- say what you *did* verify, +- name the most likely file to inspect next, +- provide 1–3 grep keywords based on the error string. 
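Picking those grep keywords can be mechanized: quote-delimited names and CamelCase identifiers in the error line are usually the best starting points. A rough heuristic sketch (not a transformers API):

```python
import re

def grep_keywords(error_line: str, limit: int = 3):
    # Prefer names the library itself quoted, then CamelCase identifiers.
    quoted = re.findall(r"'([^']+)'", error_line)
    idents = re.findall(r"\b[A-Z][A-Za-z0-9_]{3,}\b", error_line)
    out = []
    for kw in quoted + idents:
        if kw not in out:
            out.append(kw)
    return out[:limit]

msg = "ImportError: cannot import name 'SomeNewThing' from 'transformers'"
print(grep_keywords(msg))  # ['SomeNewThing', 'transformers', 'ImportError']
```

Feed the results to `git grep -n` inside `src/transformers` to locate the owning module.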
\ No newline at end of file
diff --git a/.claude/skills/transformers-api/reference/generated/module_tree.md b/.claude/skills/transformers-api/reference/generated/module_tree.md
new file mode 100644
index 000000000000..2c766aac8a4a
--- /dev/null
+++ b/.claude/skills/transformers-api/reference/generated/module_tree.md
@@ -0,0 +1,553 @@
+# Transformers `src/transformers/` module tree (curated) — **v4.57.6**
+
+> **Purpose**: Fast repo navigation for Transformers API without guessing.
+> **Pinned revision (current)**: `transformers==4.57.6` (PyPI release: **2026-01-16**).
+> **Design goal**:
+> - Prefer **patterns + canonical entry points + grep keywords** over enumerating every file.
+> - Treat this as **generated**: pin a Transformers revision (tag/commit or exact PyPI version) and regenerate on upgrades.
+> **Not exhaustive**: For model-specific code, use the `models/<model>/` patterns and grep tips.
+
+---
+
+## How to use this file
+
+1. Pick the **surface area** below (Loading, Preprocessing, Generation, Pipelines, Training, Integrations/Quantization, Export/ONNX, CLI).
+2. Jump to the **canonical entry point(s)** and search there.
+3. If you need the exact implementation:
+   - `git grep -n "<keyword>" src/transformers` (keywords provided per area)
+   - follow imports into submodules
+
+---
+
+## Core package entry points
+
+```
+src/transformers/
+__init__.py
+dependency_versions_check.py
+dependency_versions_table.py
+```
+
+- `__init__.py` is the public import surface (re-exports / lazy-import wiring for `from transformers import X`).
+- `dependency_versions_check.py` is where import-time version guards often trigger.
+ +Grep keywords: +- `_LazyModule` +- `dependency_versions_check` +- `require_version` + +--- + +## Configuration, modeling, and loading (PyTorch) + +Canonical entry points: + +``` + +src/transformers/ +configuration_utils.py +modeling_utils.py +pytorch_utils.py +modeling_outputs.py +modeling_layers.py + +``` + +Primary responsibilities: +- `PreTrainedConfig` (config serialization, `from_pretrained` for configs, validation helpers) +- `PreTrainedModel` (weight loading/saving, `from_pretrained` for models, sharding, tying weights) +- torch helpers + shared model output dataclasses/layers + +Grep keywords: +- `from_pretrained(` +- `save_pretrained(` +- `get_checkpoint_shard_files` +- `tie_weights` +- `state_dict` + +Related (often on stack traces): + +``` + +src/transformers/ +dynamic_module_utils.py + +src/transformers/utils/ +hub.py +import_utils.py + +``` + +- `dynamic_module_utils.py` is where `trust_remote_code` plumbing typically lands. +- `utils/hub.py` is where Hub/caching helpers like `cached_file` and shard resolution live. +- `utils/import_utils.py` is lazy-import + optional dependency gating. 
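When grep is unavailable (e.g., a pip-installed package rather than a checkout), Python introspection answers the same "which file owns X?" question. A small sketch, demonstrated on a stdlib function; the same call works for any importable symbol, such as `PreTrainedModel.from_pretrained` in an environment with transformers installed:

```python
import inspect
import json

def locate(obj):
    # Returns the defining file and the first line number of the definition.
    return inspect.getsourcefile(obj), inspect.getsourcelines(obj)[1]

# Stdlib demo; with transformers installed you would do e.g.:
#   from transformers import PreTrainedModel
#   print(locate(PreTrainedModel.from_pretrained))
path, line = locate(json.dumps)
print(path, line)
```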
+
+---
+
+## Tokenization and preprocessing (text / vision / audio / video / multimodal)
+
+### Tokenization (slow + fast)
+
+Canonical entry points:
+
+```
+src/transformers/
+tokenization_utils_base.py
+tokenization_utils.py
+tokenization_utils_fast.py
+tokenization_mistral_common.py
+```
+
+Notes:
+- Slow (Python) tokenizers: `tokenization_utils.py`
+- Fast tokenizers (Rust `tokenizers` wrappers): `tokenization_utils_fast.py`
+- Shared bases: `tokenization_utils_base.py`
+- Newer/common helpers for Mistral ecosystem: `tokenization_mistral_common.py`
+
+Grep keywords:
+- `PreTrainedTokenizerBase`
+- `BatchEncoding`
+- `AutoTokenizer`
+- `TokenizerFast`
+- `convert_tokens_to_ids`
+
+Related conversion helpers:
+
+```
+src/transformers/
+convert_slow_tokenizer.py
+convert_slow_tokenizers_checkpoints_to_fast.py
+```
+
+Grep keywords:
+- `convert_slow_tokenizer`
+- `SpmConverter`
+- `sentencepiece`
+
+### Processors / image / feature extraction / audio / video
+
+Canonical entry points:
+
+```
+src/transformers/
+processing_utils.py
+feature_extraction_utils.py
+feature_extraction_sequence_utils.py
+image_processing_base.py
+image_processing_utils.py
+image_processing_utils_fast.py
+image_transforms.py
+image_utils.py
+audio_utils.py
+video_processing_utils.py
+video_utils.py
+```
+
+Primary responsibilities:
+- Processor composition (combining tokenizer + modality preprocessors)
+- Feature extractors and base contracts
+- Image processing base classes + shared image transforms/utils
+- Audio/video helpers used by processors and pipelines
+
+Grep keywords:
+- `AutoProcessor`
+- `ProcessorMixin`
+- `FeatureExtractionMixin`
+- `ImageProcessingMixin`
+- `VideoProcessingMixin`
+
+---
+
+## Generation (text generation / decoding / streaming)
+
+Canonical entry points:
+
+```
+src/transformers/generation/
+configuration_utils.py
+utils.py
+logits_process.py
+stopping_criteria.py
+streamers.py
+beam_search.py
+beam_constraints.py
+candidate_generator.py
+watermarking.py
+
+# Cache utilities used by generation (and models)
+
+src/transformers/
+cache_utils.py
+```
+
+Primary responsibilities:
+- `GenerationConfig` (defaults + `generation_config.json` serialization)
+- `GenerationMixin.generate()` (PyTorch generation loop)
+- Logits processors/warpers, stopping criteria, streamers
+- Beam search + constraints, candidate generation helpers, watermarking
+- KV cache helpers (`cache_utils.py`)
+
+Grep keywords:
+- `class GenerationMixin`
+- `def generate(`
+- `LogitsProcessor`
+- `StoppingCriteria`
+- `TextStreamer`
+- `DynamicCache` / `StaticCache`
+
+---
+
+## Pipelines (high-level inference)
+
+Canonical entry points:
+
+```
+src/transformers/pipelines/
+__init__.py
+base.py
+```
+
+Notes:
+- `pipelines/__init__.py` defines the task registry and the `pipeline()` entry point.
+- `pipelines/base.py` contains the core `Pipeline` base class and shared inference glue.
+- Task-specific pipelines typically follow `pipelines/<task>.py`.
+
+Grep keywords:
+- `class Pipeline`
+- `pipeline(`
+- `SUPPORTED_TASKS`
+
+---
+
+## Training / evaluation (Trainer)
+
+Canonical entry points:
+
+```
+src/transformers/
+trainer.py
+trainer_seq2seq.py
+trainer_callback.py
+trainer_utils.py
+trainer_pt_utils.py
+training_args.py
+training_args_seq2seq.py
+optimization.py
+
+src/transformers/data/
+__init__.py
+data_collator.py
+```
+
+Primary responsibilities:
+- `Trainer` training/eval loops, logging, checkpointing
+- callback system
+- `TrainingArguments` and helper utilities
+- optimizer/scheduler helpers (`optimization.py`)
+- data collators
+
+Grep keywords:
+- `class Trainer`
+- `TrainingArguments`
+- `def training_step(`
+- `CallbackHandler`
+- `get_scheduler`
+- `DataCollator`
+
+---
+
+## Auto classes (model/config/tokenizer/processor dispatch)
+
+Canonical entry points:
+
+```
+src/transformers/models/auto/
+configuration_auto.py
+modeling_auto.py
+modeling_tf_auto.py
+modeling_flax_auto.py
+tokenization_auto.py
+processing_auto.py
+feature_extraction_auto.py
+image_processing_auto.py
+video_processing_auto.py
+auto_factory.py
+```
+
+Primary responsibilities:
+- mapping tables from `model_type` / config class → model/tokenizer/processor classes
+- common auto-loading errors are raised from the Auto* dispatch stack (often `configuration_auto.py` / `auto_factory.py`)
+
+Grep keywords:
+- `MODEL_MAPPING`
+- `CONFIG_MAPPING`
+- `TOKENIZER_MAPPING`
+- `PROCESSOR_MAPPING`
+- `model_type`
+
+---
+
+## Models (per-architecture packages)
+
+**Pattern (model implementations):**
+
+```
+src/transformers/models/<model>/
+configuration_<model>.py
+modeling_<model>.py
+modeling_tf_<model>.py         # optional
+modeling_flax_<model>.py       # optional
+tokenization_<model>.py        # optional
+tokenization_<model>_fast.py   # optional
+processing_<model>.py          # optional
+image_processing_<model>.py    # optional
+feature_extraction_<model>.py  # optional
+generation_<model>.py          # optional (model-specific generation helpers)
+
+# sometimes: video_processing_<model>.py, etc.
+```
+
+Handy anchors (examples you’ll often see):
+
+```
+src/transformers/models/bert/modeling_bert.py
+src/transformers/models/t5/modeling_t5.py
+src/transformers/models/llama/modeling_llama.py
+src/transformers/models/qwen2/modeling_qwen2.py
+src/transformers/models/clip/modeling_clip.py
+```
+
+Grep keywords:
+- `class .*Model`
+- `class .*PreTrainedModel`
+- `config_class`
+
+---
+
+## Performance / kernels / attention backends (common “why is this slow / different?”)
+
+Canonical entry points:
+
+```
+src/transformers/
+modeling_attn_mask_utils.py
+modeling_flash_attention_utils.py
+modeling_rope_utils.py
+modeling_gguf_pytorch_utils.py
+```
+
+Related integration shims (backend-specific routing often lives here):
+
+```
+src/transformers/integrations/
+flash_attention.py
+flex_attention.py
+sdpa_attention.py
+tensor_parallel.py
+```
+
+Grep keywords:
+- `flash_attention`
+- `scaled_dot_product_attention`
+- `sdpa`
+- `use_flash_attention`
+- `gguf`
+
+---
+
+## Utilities and internals
+
+Canonical entry points (frequently involved in stack traces):
+
+```
+src/transformers/utils/
+import_utils.py
+hub.py
+logging.py
+versions.py
+generic.py
+doc.py
+chat_template_utils.py
+peft_utils.py
+quantization_config.py
+
+src/transformers/
+file_utils.py
+debug_utils.py
+testing_utils.py
+```
+
+Primary responsibilities:
+- Lazy import mechanics and optional dependency gating
+- Hub caching/download helpers used by `from_pretrained`
+- logging + version utilities
+- docstring tooling and generic helpers
+- chat template parsing/formatting helpers
+- PEFT helper glue
+- quantization config objects
+- legacy helpers (`file_utils.py`) + debugging/testing utilities
+
+Grep keywords:
+- `_LazyModule`
+- `requires_backends`
+- `is_torch_available`
+- `cached_file`
+- `apply_chat_template`
+- `BitsAndBytesConfig`
+
+---
+
+## Integrations and quantization
+
+### Integrations (external libs + runtimes)
+
+Canonical entry points:
+
+```
+src/transformers/integrations/
+integration_utils.py +accelerate.py +deepspeed.py +fsdp.py +peft.py +bitsandbytes.py +tiktoken.py +awq.py +quanto.py +``` + +What lives here: +- external library shims (Accelerate/DeepSpeed/FSDP/PEFT) +- tokenizer backends (e.g., tiktoken) and quant backends (AWQ/Quanto/etc.) +- backend-specific feature routing + capability checks + +Grep keywords: +- `requires_backends` +- `is_accelerate_available` +- `is_deepspeed_available` +- `is_bitsandbytes_available` +- `device_map` + +### Quantizers (unified quantization abstraction) + +Canonical entry points: + +``` +src/transformers/quantizers/ +auto.py +base.py +quantizers_utils.py +quantizer_bnb_4bit.py +quantizer_bnb_8bit.py +quantizer_awq.py +quantizer_gptq.py +quantizer_quanto.py + + +src/transformers/utils/ +quantization_config.py +``` + +Grep keywords: +- `HfQuantizer` +- `quant_method` +- `BitsAndBytesConfig` +- `load_in_4bit` / `load_in_8bit` +- `AutoHfQuantizer` + +--- + +## Export / ONNX + +Canonical entry points: + +``` +src/transformers/ +convert_graph_to_onnx.py +src/transformers/onnx/ +**main**.py +config.py +convert.py +features.py +utils.py +``` + +Grep keywords: +- `OnnxConfig` +- `export` +- `opset` +- `transformers.onnx` + +--- + +## CLI / repo tooling (developer workflows) + +Canonical entry points: + +``` +src/transformers/commands/ +transformers_cli.py +chat.py +serving.py +add_new_model_like.py +add_fast_image_processor.py +convert.py +download.py +env.py +run.py +train.py +``` + +Notes: +- `transformers_cli.py` is the CLI dispatcher. +- `chat.py` implements `transformers chat ...` +- `serving.py` implements `transformers serve ...` + +Grep keywords: +- `main(` +- `argparse` +- `transformers chat` +- `transformers serve` +- `add_new_model_like` + +--- + +## Production notes (for Skills maintainers) + +1. **Pin Transformers**: tie generated references to a specific tag/commit or exact PyPI version. +2. 
**Regenerate on upgrade**: when bumping Transformers, regenerate this map alongside any other generated references. +3. **Keep this file curated**: add new *canonical entry points* as Transformers evolves—don’t mirror the full repo tree. +4. **Security**: if you ship scripts alongside Skills, keep them least-privilege and auditable. + +--- + +## Quick “where is X implemented?” cheat sheet + +| User asks about… | Start here | Then follow into… | +|---|---|---| +| `pipeline()` / task pipelines | `src/transformers/pipelines/__init__.py` | `pipelines/base.py` + task file | +| `AutoModel*` / auto dispatch | `src/transformers/models/auto/modeling_auto.py` | `auto_factory.py` + model subpackage | +| `AutoTokenizer` | `src/transformers/models/auto/tokenization_auto.py` | model tokenizer module | +| `AutoProcessor` | `src/transformers/models/auto/processing_auto.py` | model processor module | +| `from_pretrained` (models) | `src/transformers/modeling_utils.py` | then `src/transformers/utils/hub.py` (caching/shards) | +| `from_pretrained` (configs) | `src/transformers/configuration_utils.py` | config subclass in model subpackage | +| `generate()` behavior | `src/transformers/generation/utils.py` | logits/stopping/streamers + beam/candidate helpers | +| stopping criteria / stop strings | `src/transformers/generation/stopping_criteria.py` | called from generation utils | +| KV cache / caching behavior | `src/transformers/cache_utils.py` | used by generation + some models | +| quantization (general) | `src/transformers/quantizers/auto.py` | specific `quantizer_*.py` + `utils/quantization_config.py` | +| bitsandbytes 4-bit/8-bit | `src/transformers/integrations/bitsandbytes.py` | `quantizers/quantizer_bnb_*.py` | +| `Trainer` loop / callbacks | `src/transformers/trainer.py` | `trainer_callback.py`, `trainer_utils.py` | +| schedulers / optim helpers | `src/transformers/optimization.py` | used from Trainer / scripts | +| data collators | `src/transformers/data/data_collator.py` | 
task-specific collator classes |
+| ONNX export | `src/transformers/onnx/convert.py` | `onnx/config.py` + `onnx/features.py` |
+| CLI: `transformers chat` | `src/transformers/commands/chat.py` | `commands/transformers_cli.py` |
+| CLI: `transformers serve` | `src/transformers/commands/serving.py` | `commands/transformers_cli.py` |
\ No newline at end of file
diff --git a/.claude/skills/transformers-api/reference/generated/public_api.md b/.claude/skills/transformers-api/reference/generated/public_api.md
new file mode 100644
index 000000000000..eed1c1970628
--- /dev/null
+++ b/.claude/skills/transformers-api/reference/generated/public_api.md
@@ -0,0 +1,572 @@
+# Transformers Public API (Verification Guide)
+
+## Table of Contents
+
+1. [Definition of “Public API”](#1-definition-of-public-api)
+2. [Version Discipline](#2-version-discipline)
+3. [Mandatory Verification Workflow](#3-mandatory-verification-workflow)
+4. [Public API Surfaces (by Area)](#4-public-api-surfaces-by-area)
+   - 4.1 [Inference](#41-inference)
+   - 4.2 [Preprocessing](#42-preprocessing)
+   - 4.3 [Model Loading & Base Classes](#43-model-loading--base-classes)
+   - 4.4 [Generation](#44-generation)
+   - 4.5 [Training / Evaluation](#45-training--evaluation)
+   - 4.6 [Performance / Quantization](#46-performance--quantization)
+   - 4.7 [Export / Serving](#47-export--serving)
+5. [Deprecations & Compatibility Traps (Verify, Don’t Assume)](#5-deprecations--compatibility-traps-verify-dont-assume)
+6. [Model Artifact Files (On-Disk Reality Check)](#6-model-artifact-files-on-disk-reality-check)
+7. [Regeneration Strategy (Keep This File Correct)](#7-regeneration-strategy-keep-this-file-correct)
+8. [Minimal Repro Template (Copy/Paste)](#8-minimal-repro-template-copypaste)
+
+---
+
+## 1. Definition of “Public API”
+
+An API surface in `transformers` is considered **public** if **at least one** of the following is true:
+
+1. 
It is importable directly from the top-level package: + ```python + from transformers import X + ``` +2. It is explicitly documented in the official Hugging Face Transformers documentation (e.g., “Main classes”, “Pipelines”, “Trainer”, “Generation”). +3. It is a documented CLI, configuration file, or runtime behavior supported in the installed version. + +Everything else is **implementation detail** and must not be treated as stable or user-facing. + +**Explicitly non-public by default (unless docs say otherwise):** +- `transformers.models.*` +- deep imports from `transformers.generation.*` (treat as internal **unless explicitly documented as public** and/or importable from `transformers`) +- `transformers.pipelines.*` internals +- anything in `transformers.utils.*` that is not documented as public + +**Production rule:** +If you can’t +(a) import it from `transformers` OR +(b) find it in the official docs for the target version OR +(c) verify it by runtime introspection, **do not present it as supported**. + +--- + +## 2. Version Discipline + +### 2.1 Pin versions (required) +For production systems, pin **all** of: +- `transformers` (exact version or exact git commit) +- backend framework (`torch` / `tensorflow` / `jax`) version +- key accelerators if used (e.g., `accelerate`, quantization libs, ONNX runtimes) + +### 2.2 Record environment fingerprint (required) +Any debugging request must include: +- `transformers.__version__` +- backend + version +- device (CPU/CUDA/MPS) + CUDA version if applicable + +Minimal snippet: +```python +import transformers +print("transformers:", transformers.__version__) + +try: + import torch + print("torch:", torch.__version__) + print("cuda available:", torch.cuda.is_available()) + print("cuda version:", getattr(torch.version, "cuda", None)) +except Exception as e: + print("torch not available:", repr(e)) +``` + +--- + +## 3. Mandatory Verification Workflow + +This is the *only* safe way to answer “does this exist?” questions. 
+ +### 3.1 Verify a top-level symbol exists +```python +import transformers + +def verify_symbol(name: str) -> None: + ok = hasattr(transformers, name) + print(f"{name}: {'OK' if ok else 'MISSING'}") + +for name in [ + "pipeline", + "AutoTokenizer", + "AutoModel", + "Trainer", + "TrainingArguments", + "GenerationConfig", +]: + verify_symbol(name) +``` + +**If missing:** +- Do not guess alternatives. +- Use discovery helpers (below), then present only what is verifiably present. + +### 3.2 Verify an argument exists (inspect signature) +Never claim a kwarg exists without checking the signature in the user’s environment. + +```python +import inspect +from transformers import AutoModel + +sig = inspect.signature(AutoModel.from_pretrained) +print(sig) + +def has_kwarg(fn, kw: str) -> bool: + return kw in inspect.signature(fn).parameters + +print("has token?", has_kwarg(AutoModel.from_pretrained, "token")) +print("has use_auth_token?", has_kwarg(AutoModel.from_pretrained, "use_auth_token")) +``` + +**Rule:** If the kwarg is not in the signature, do not instruct users to pass it. + +### 3.3 Discover available “Auto*” and “Config” classes +Different versions ship different helpers. Discover dynamically: + +```python +import transformers + +def list_names(prefix: str): + return sorted([n for n in dir(transformers) if n.startswith(prefix)]) + +print("Auto*:", list_names("Auto")[:80]) +print("... (total)", len(list_names("Auto"))) + +print("*Config:", [n for n in dir(transformers) if n.endswith("Config")][:80]) +``` + +### 3.4 Verify runtime behavior with a minimal forward / generate +A symbol can exist but still fail due to missing extras, device issues, or incompatible model files. 
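Before running the sanity checks, it can help to probe optional backends without importing them; `importlib.util.find_spec` reports whether a package is installed without triggering slow imports or import-time side effects. A minimal sketch (the package list below is illustrative, not exhaustive):

```python
import importlib.util

# Probe optional dependencies without importing them: find_spec only checks
# that the package is installed, so it is cheap and side-effect free.
OPTIONAL_PACKAGES = ["torch", "tensorflow", "jax", "accelerate", "safetensors", "bitsandbytes"]

for pkg in OPTIONAL_PACKAGES:
    installed = importlib.util.find_spec(pkg) is not None
    print(f"{pkg}: {'installed' if installed else 'MISSING'}")
```

If a backend shows as MISSING, install it before interpreting an import-time failure as a Transformers bug.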
+ +**Forward sanity check:** +```python +from transformers import AutoTokenizer, AutoModel +import torch + +model_id = "distilbert-base-uncased" # replace +tok = AutoTokenizer.from_pretrained(model_id) +model = AutoModel.from_pretrained(model_id) + +inputs = tok("hello world", return_tensors="pt") +with torch.no_grad(): + out = model(**inputs) +print(type(out)) +``` + +**Generate sanity check (only for causal/seq2seq models):** +```python +from transformers import AutoTokenizer, AutoModelForCausalLM +import torch + +model_id = "gpt2" # replace +tok = AutoTokenizer.from_pretrained(model_id) +model = AutoModelForCausalLM.from_pretrained(model_id) + +inputs = tok("Hello", return_tensors="pt") +with torch.no_grad(): + ids = model.generate(**inputs, max_new_tokens=10) +print(tok.decode(ids[0], skip_special_tokens=True)) +``` + +--- + +## 4. Public API Surfaces (by Area) + +**Important:** The lists below are “common public entry points”, not a guarantee for every version. +Always run [Section 3](#3-mandatory-verification-workflow) in the user’s environment. + +### 4.1 Inference + +**Canonical entry point** +```python +from transformers import pipeline +``` + +**Verify supported tasks in the install** +```python +# Verify supported pipeline tasks WITHOUT assuming a specific registry constant exists. +from transformers import pipelines + +# Prefer the documented registry if present (custom pipeline docs point to PIPELINE_REGISTRY), +# but fall back gracefully if the installed version uses something else. +if hasattr(pipelines, "PIPELINE_REGISTRY"): + reg = pipelines.PIPELINE_REGISTRY + + # Try a few common ways a registry might expose tasks, but only use what actually exists. 
+ for cand in ["get_supported_tasks", "supported_tasks", "SUPPORTED_TASKS"]: + if hasattr(reg, cand): + obj = getattr(reg, cand) + tasks = obj() if callable(obj) else obj + print("num tasks:", len(tasks)) + print("example tasks:", sorted(list(tasks))[:30]) + break + else: + print("PIPELINE_REGISTRY present; inspect it for task listing:", [n for n in dir(reg) if "task" in n.lower()]) + +elif hasattr(pipelines, "SUPPORTED_TASKS"): + tasks = pipelines.SUPPORTED_TASKS + print("num tasks:", len(tasks)) + print("example tasks:", sorted(tasks.keys())[:30]) + +else: + print("No known pipeline task registry found; inspect transformers.pipelines:", [n for n in dir(pipelines) if "task" in n.lower()]) +``` + +**Pitfalls & fixes** +- If a pipeline task errors with “unknown task”: list the supported tasks (see the snippet above) and pick an available task name. +- If the pipeline tries to download unexpected files: confirm model id/path + revision, and verify local directory contents. + +**Knobs likely to matter** +- `device` / `device_map` +- `dtype` (or `torch_dtype` in older installs, **inspect `inspect.signature(transformers.pipeline)`** before recommending) +- `batch_size` +- `max_length` / `truncation` / `padding` (varies by pipeline) +- model-specific kwargs (must be verified) + +--- + +### 4.2 Preprocessing + +**Canonical entry points** +```python +from transformers import AutoTokenizer, AutoProcessor +``` + +Depending on modality and version, these may or may not exist: +- `AutoImageProcessor` +- `AutoFeatureExtractor` +- `AutoVideoProcessor` + +**Verify availability** +```python +import transformers +for name in ["AutoTokenizer", "AutoProcessor", "AutoImageProcessor", "AutoFeatureExtractor", "AutoVideoProcessor"]: + print(name, hasattr(transformers, name)) +``` + +**Pitfalls & fixes** +- “Tokenizer class not found”: verify model repo contains tokenizer artifacts (see Section 6) and that you’re using the right Auto* loader.
+- “Padding/truncation mismatch”: set `padding=True/False`, `truncation=True/False`, and confirm expected tensor shapes. + +**Knobs likely to matter** +- `padding`, `truncation`, `max_length` +- `return_tensors` (`"pt"`, `"tf"`, `"np"`) +- modality-specific preprocessing params (verify via processor docs or runtime inspection) + +--- + +### 4.3 Model Loading & Base Classes + +**Canonical entry points** +```python +from transformers import AutoConfig, AutoModel +``` + +Task-specific autos typically exist as `AutoModelFor*` classes, but do not assume which ones. +Discover in the user’s environment: + +```python +import transformers +heads = sorted([n for n in dir(transformers) if n.startswith("AutoModelFor")]) +print("AutoModelFor* count:", len(heads)) +print("sample:", heads[:40]) +``` + +**Base classes (commonly public)** +```python +from transformers import PreTrainedModel, PretrainedConfig +``` + +Note the inconsistent casing: the config base class is `PretrainedConfig` (lowercase “t”) in most releases, while the model base class is `PreTrainedModel`. Verify with `hasattr` if in doubt. + +**Pitfalls & fixes** +- “Unrecognized model type”: verify `config.json` has `model_type`, and that the installed `transformers` supports it. +- “Missing weights”: confirm `model.safetensors` / shards exist and match index file if sharded.
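For the “Missing weights” case, a sketch that checks a local checkpoint directory: the helper name and the synthetic directory below are illustrative, but the file names (`model.safetensors.index.json` and its `weight_map`) follow the standard sharded-checkpoint layout.

```python
import json
import tempfile
from pathlib import Path

def check_weight_files(model_dir):
    """Return a list of problems found with weight files in a local model dir."""
    d = Path(model_dir)
    problems = []
    index = d / "model.safetensors.index.json"
    if index.exists():
        # Sharded checkpoint: every shard named in the index must exist on disk.
        weight_map = json.loads(index.read_text())["weight_map"]
        for shard in sorted(set(weight_map.values())):
            if not (d / shard).exists():
                problems.append(f"missing shard: {shard}")
    elif not (d / "model.safetensors").exists() and not (d / "pytorch_model.bin").exists():
        problems.append("no model.safetensors / pytorch_model.bin / index found")
    return problems

# Demo on a synthetic directory with one shard deliberately missing.
with tempfile.TemporaryDirectory() as tmp:
    d = Path(tmp)
    (d / "model.safetensors.index.json").write_text(json.dumps(
        {"weight_map": {"a.weight": "model-00001-of-00002.safetensors",
                        "b.weight": "model-00002-of-00002.safetensors"}}))
    (d / "model-00001-of-00002.safetensors").write_bytes(b"")
    print(check_weight_files(tmp))  # -> ['missing shard: model-00002-of-00002.safetensors']
```

The same check catches truncated downloads, where the index file survives but a shard did not.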
+ +**Knobs likely to matter (verify before recommending)** +- `dtype` (or `torch_dtype` in older installs — **inspect signatures** because dtype/precision knobs vary by version/backend) +- `device_map` +- `low_cpu_mem_usage` +- auth kwargs (e.g., `token` vs older names) — verify via signature +- `trust_remote_code` (security-sensitive; do not recommend unless necessary and understood) + +--- + +### 4.4 Generation + +**Canonical surface** +- `model.generate(...)` (method on generation-capable model classes) + +**Generation config (often public, verify)** +```python +import transformers +print("GenerationConfig present?", hasattr(transformers, "GenerationConfig")) +``` + +**Streaming helpers (often public, verify)** +```python +import transformers +for name in ["TextStreamer", "TextIteratorStreamer"]: + print(name, hasattr(transformers, name)) +``` + +**Pitfalls & fixes** +- “generate() got unexpected keyword”: inspect `generate` signature and/or use `model.generation_config` to set fields. +- “Stops too late / never stops”: verify EOS token id(s) and stopping criteria; confirm tokenizer special tokens. + +**Knobs likely to matter** +- `max_new_tokens`, `min_new_tokens` +- `do_sample`, `temperature`, `top_p`, `top_k` +- `num_beams`, `early_stopping` +- `repetition_penalty`, `no_repeat_ngram_size` +- `eos_token_id`, `pad_token_id` +*(All must be version-verified.)* + +--- + +### 4.5 Training / Evaluation + +**Canonical Trainer surface (verify)** +```python +from transformers import Trainer, TrainingArguments +``` + +Optional trainer variants may exist (verify): +- `Seq2SeqTrainer` +- `Seq2SeqTrainingArguments` + +**Verify availability** +```python +import transformers +for name in ["Trainer", "TrainingArguments", "Seq2SeqTrainer", "Seq2SeqTrainingArguments"]: + print(name, hasattr(transformers, name)) +``` + +**Pitfalls & fixes** +- “KeyError in metrics / labels”: confirm dataset fields and data collator output keys. 
+- “Distributed mismatch”: confirm versions of `accelerate`/backend and consistent launch method. + +**Knobs likely to matter** +- `per_device_train_batch_size`, `gradient_accumulation_steps` +- `learning_rate`, `warmup_steps`, `lr_scheduler_type` +- `fp16` / `bf16` (verify supported in the version/backend) +- `logging_steps`, `eval_steps`, `save_steps` +- `report_to` integrations (verify installed extras) + +--- + +### 4.6 Performance / Quantization + +Quantization support changes across versions and depends on optional dependencies. +Never claim a quantization config exists without verifying importability. + +**Discovery pattern** +```python +import transformers +candidates = [ + "BitsAndBytesConfig", + "GPTQConfig", + "AwqConfig", + "QuantoConfig", +] +for name in candidates: + print(name, hasattr(transformers, name)) +``` + +**Pitfalls & fixes** +- “ModuleNotFoundError for quantization backend”: install required dependency and re-verify. +- “dtype/device mismatch”: ensure model weights + inputs on same device; validate `torch_dtype`. + +**Knobs likely to matter** +- `device_map` +- `dtype` (or `torch_dtype` in older installs — **inspect signatures** because dtype/precision knobs vary by version/backend) +- quantization config object fields (version-dependent; verify via signature/dir) + +--- + +### 4.7 Export / Serving + +Export/serving is often handled by adjacent tooling (e.g., ONNX/export toolchains and serving runtimes). +Do not invent “native exporter APIs” unless you verify they exist in the target version and are documented. + +**Safe guidance approach** +1. Identify the target runtime (ONNX Runtime / TensorRT / TGI / vLLM / etc.). +2. Verify which tool owns export in the user’s stack (Transformers vs external). +3. Provide only documented + verifiable steps. + +**Pitfalls & fixes** +- “Export fails due to unsupported ops”: confirm opset, model architecture, and runtime support. + +--- + +## 5. 
Deprecations & Compatibility Traps (Verify, Don’t Assume) + +This section is intentionally conservative: it tells you **how** to verify, not **what** to assume. + +### 5.1 Authentication keyword arguments +Auth-related kwargs have changed over time across the ecosystem. +**Always inspect `from_pretrained` signature**: +```python +import inspect +from transformers import AutoTokenizer +print(inspect.signature(AutoTokenizer.from_pretrained)) +``` +Only recommend kwargs that appear in the signature. + +### 5.2 Download/cache kwargs +Download/caching controls can change; some kwargs become no-ops or get removed. +Again: inspect signatures and/or consult official docs for the pinned version. + +### 5.3 “Internal helpers” are not stable +If a solution requires importing from deep modules (e.g., `transformers.models...`), treat it as: +- “implementation detail” +- “may break across versions” +- “should be avoided unless you own the pinned commit” + +--- + +## 6. Model Artifact Files (On-Disk Reality Check) + +These are common files found in HF model repos or local export directories; actual sets vary. + +**Common config/tokenizer files** +- `config.json` +- `generation_config.json` (may be absent) +- `tokenizer.json` (fast tokenizer) +- `tokenizer_config.json` +- `special_tokens_map.json` + +**Common weights files** +- `model.safetensors` (or sharded: `model-00001-of-000xx.safetensors` + index json) +- `pytorch_model.bin` (legacy) +- backend-specific equivalents may exist depending on framework + +**Sanity check: load config + tokenizer** +```python +from transformers import AutoConfig, AutoTokenizer + +path_or_id = "YOUR_MODEL" # local path or model id +cfg = AutoConfig.from_pretrained(path_or_id) +tok = AutoTokenizer.from_pretrained(path_or_id) + +print("model_type:", getattr(cfg, "model_type", None)) +print("tokenizer:", tok.__class__.__name__) +``` + +**If load fails** +- Confirm the directory contains expected artifacts. 
+- Confirm backend compatibility (Torch vs TF vs Flax). +- If `trust_remote_code` is involved, treat it as a security decision: + - verify it is required + - verify the exact repo revision you trust + +--- + +## 7. Regeneration Strategy (Keep This File Correct) + +This file should remain correct across releases by being **workflow-first** and **snapshot-driven**, not a giant hardcoded list. + +### 7.1 CI snapshot (recommended) +In your pinned environment, run a script that records: +- `transformers.__version__` +- top-level symbols (filtered) +- available `Auto*` classes +- available quantization config candidates + +Example snapshot script: +```python +import json +import transformers + +def filt(names, prefixes=(), suffixes=(), contains=()): + # Keep a name if it matches ANY of the given prefixes, suffixes, or substrings. + out = [] + for n in names: + if (any(n.startswith(p) for p in prefixes) + or any(n.endswith(s) for s in suffixes) + or any(c in n for c in contains)): + out.append(n) + return sorted(out) + +names = dir(transformers) +snapshot = { + "transformers_version": transformers.__version__, + "top_level_selected": filt( + names, + prefixes=("Auto", "PreTrained", "Text", "Trainer", "Training", "Generation", "pipeline"), + suffixes=(), + contains=("Config",), + )[:2000], + "auto_classes": filt(names, prefixes=("Auto",)), + "model_for_heads": sorted([n for n in names if n.startswith("AutoModelFor")]), + "config_like": sorted([n for n in names if n.endswith("Config")]), +} + +print(json.dumps(snapshot, indent=2)[:20000]) +``` + +Store this snapshot alongside releases and update this file if: +- major surfaces change +- verification steps need to accommodate new patterns + +### 7.2 What never changes +Even when symbols change, the safe workflow remains: +- check importability +- inspect signatures +- run minimal repro + +--- + +## 8. Minimal Repro Template (Copy/Paste) + +Use this when users report errors. Require them to fill it.
+ +```python +""" +MINIMAL REPRO TEMPLATE (Transformers) + +1) Environment +- transformers==? +- backend: torch/tf/jax == ? +- device: CPU/CUDA/MPS (+ CUDA version if relevant) +- OS: ? + +2) Model +- model id or local path: +- revision/commit (if pinned): +- trust_remote_code: True/False (and why) + +3) Repro +- exact code below +- exact traceback output +""" + +import transformers +print("transformers:", transformers.__version__) + +# Optional backend info +try: + import torch + print("torch:", torch.__version__) + print("cuda available:", torch.cuda.is_available()) + print("cuda version:", getattr(torch.version, "cuda", None)) +except Exception as e: + print("torch not available:", repr(e)) + +MODEL = "REPLACE_ME" + +# Choose one path (tokenizer/model OR pipeline) depending on issue: +from transformers import AutoTokenizer, AutoModel + +tok = AutoTokenizer.from_pretrained(MODEL) +model = AutoModel.from_pretrained(MODEL) + +inputs = tok("hello", return_tensors="pt") +out = model(**inputs) +print(type(out)) +``` + +--- \ No newline at end of file diff --git a/.claude/skills/transformers-api/templates/minimal_repro.md b/.claude/skills/transformers-api/templates/minimal_repro.md new file mode 100644 index 000000000000..afd5540c51a0 --- /dev/null +++ b/.claude/skills/transformers-api/templates/minimal_repro.md @@ -0,0 +1,167 @@ +# Minimal Repro Template (Transformers) + +Use this template to produce a **copy/paste runnable** repro that someone else can run and see the same issue. + +## 0. One-line goal +**Goal:** + +## 1. What is happening (actual) +**Actual:** + +## 2. Environment (must be exact) +Fill in all that apply. 
+ +- OS: Windows / Linux / macOS (include version) +- Python: `python -V` +- Transformers: `python -c "import transformers; print(transformers.__version__)"` +- Backend: PyTorch / TensorFlow / JAX (pick one) + - PyTorch: `python -c "import torch; print(torch.__version__); print(torch.cuda.is_available())"` + - TF: `python -c "import tensorflow as tf; print(tf.__version__)"` + - JAX: `python -c "import jax; print(jax.__version__)"` +- Device: CPU / CUDA / MPS +- GPU (if any): model + VRAM +- Install method: + - pip/venv OR conda (include env name) +- Install source: + - PyPI release OR editable install from repo (`pip install -e .`) OR specific commit/revision +- Reproducibility: + - Does it happen every run? Y/N + - First bad version / last good version (if known) + +## 3. Installation commands (exact) +Provide the minimal set of commands someone needs to create a clean environment. + +### Option A — venv + pip +```bash +python -m venv .venv +# Windows PowerShell: +# .\.venv\Scripts\Activate.ps1 +# macOS/Linux: +# source .venv/bin/activate +pip install -U pip +pip install "transformers[torch]" # or your exact extras +``` + +### Option B — conda +```bash +conda create -n repro python=3.11 -y +conda activate repro +pip install -U pip +pip install transformers +``` + +> If you're using the repo source, replace installs with: +> `pip install -e .` (from repo root) + +## 4. Minimal script (single file) + +Create `repro.py` with the smallest code that still fails. +Rules: + +* Use a **single model id** (or local path) and include revision if pinned +* Set seeds +* Print versions +* Avoid unrelated features (Trainer, accelerate, etc.) 
unless they are the bug + +```python +import os +import sys +import platform +import random + +def print_env(): + print("== ENV ==") + print("python:", sys.version.replace("\n", " ")) + print("platform:", platform.platform()) + try: + import transformers + print("transformers:", transformers.__version__) + except Exception as e: + print("transformers import failed:", repr(e)) + try: + import torch + print("torch:", torch.__version__) + print("cuda available:", torch.cuda.is_available()) + if torch.cuda.is_available(): + print("cuda device:", torch.cuda.get_device_name(0)) + except Exception as e: + print("torch import failed:", repr(e)) + print("HF_HOME:", os.getenv("HF_HOME")) + print("HF_HUB_CACHE:", os.getenv("HF_HUB_CACHE")) + print("TRANSFORMERS_CACHE:", os.getenv("TRANSFORMERS_CACHE")) + print() + +def set_seeds(seed=0): + random.seed(seed) + try: + import numpy as np + np.random.seed(seed) + except Exception: + pass + try: + import torch + torch.manual_seed(seed) + if torch.cuda.is_available(): + torch.cuda.manual_seed_all(seed) + except Exception: + pass + +def main(): + print_env() + set_seeds(0) + + # TODO: replace with minimal failing code + from transformers import pipeline + + model_id = "distilbert-base-uncased-finetuned-sst-2-english" + nlp = pipeline("sentiment-analysis", model=model_id) + print(nlp("hello world")) + +if __name__ == "__main__": + main() +``` + +## 5. Run command + full output + +Command used: +```bash +python repro.py +``` + +Paste the **full output** here (don't truncate). + +## 6. Expected vs actual (explicit) + +* **Expected:** +* **Actual:** + +## 7. 
Smallest knobs to try (pick only relevant) + +Include only the knobs that could change the failure: + +* Model: different revision / different model id +* Device: CPU vs CUDA +* dtype: `torch_dtype=float16/bfloat16/float32` +* `device_map="auto"` vs explicit device +* `low_cpu_mem_usage=True/False` +* `trust_remote_code=True/False` +* Tokenization: `padding/truncation/max_length` +* Generation: `do_sample`, `temperature`, `top_p`, `num_beams`, `max_new_tokens` +* Attention backend: SDPA / flash-attn (if applicable) +* Quantization: 8-bit/4-bit settings (bitsandbytes/GPTQ/AWQ) + +## 8. If it's a repo bug (for contributors) + +* Suspected module/file: + * `src/transformers/...` +* Related tests to run: + * `python -m pytest tests/<...> -k ""` +* Minimal patch idea: + * <1–3 sentences> + +## 9. Attachments checklist (only if needed) + +* config.json / tokenizer.json / generation_config.json +* exact traceback (full) +* small input sample(s) +* exact command line flags / env vars \ No newline at end of file