
🚨 [v5] Delete feature extractors used for vision #41174

Merged
zucchini-nlp merged 7 commits into huggingface:main from zucchini-nlp:delete-image-feat-extractors on Oct 1, 2025

Conversation

@zucchini-nlp (Member) commented Sep 26, 2025

What does this PR do?

As per title, let's clean up for v5.

These were supposed to be deleted anyway, and the deprecation warning has been logged for a long time, even before I joined. Feature extractors are now reserved for audio models only.
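
For anyone migrating, a minimal before/after sketch (ViT is used as an illustrative model; any vision model follows the same `FeatureExtractor` → `ImageProcessor` rename):

```python
from PIL import Image
import requests

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# Before (v4): the vision feature extractor, long deprecated with a warning
# from transformers import ViTFeatureExtractor
# extractor = ViTFeatureExtractor.from_pretrained("google/vit-base-patch16-224")
# inputs = extractor(images=image, return_tensors="pt")

# After (v5): the ImageProcessor equivalent is a drop-in replacement
from transformers import ViTImageProcessor

image_processor = ViTImageProcessor.from_pretrained("google/vit-base-patch16-224")
inputs = image_processor(images=image, return_tensors="pt")
```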

zucchini-nlp requested a review from gante on Sep 26, 2025
@molbap (Contributor) commented Sep 29, 2025

Nice! IIRC there were a few processing utils that also relied on the `feature_extractor` keyword.

@zucchini-nlp (Member, Author)

Yeah, the tests are failing; I'll do one more round of cleaning

@github-actions (Contributor)

[For maintainers] Suggested jobs to run (before merge)

run-slow: auto

@zucchini-nlp (Member, Author)

Done

@gante (Contributor) left a comment

LGTM 🤗 🧹 🧹 🧹

(probably needs a few more deletions, looking at the CI issues)

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

zucchini-nlp merged commit ae879f6 into huggingface:main on Oct 1, 2025
25 checks passed
vijayabhaskar-ev pushed a commit to vijayabhaskar-ev/transformers that referenced this pull request Oct 2, 2025
* bye bye
* remove from docs
* do not use feature extractor here
* fix docs
* do not delete it
* forgot these
yuchenxie4645 pushed a commit to yuchenxie4645/transformers that referenced this pull request Oct 4, 2025
zucchini-nlp mentioned this pull request on Oct 9, 2025
AhnJoonSung pushed a commit to AhnJoonSung/transformers that referenced this pull request Oct 12, 2025
ssaliceTT added a commit to tenstorrent/tt-xla that referenced this pull request Mar 18, 2026
### Ticket
N/A

### Problem description
Uplift the transformers library from `4.57.1` to `5.2.0` to broaden
model support and enable new models such as GLM-5 to run on our stack.
Transformers 5.x is a major version with several breaking changes that
required fixes across both tt-xla and tt-forge-models.

### What's changed

#### Transformers 5.x breaking changes and how we addressed them

**Flax/JAX backend removed (transformers 5.0, [PR #40760](huggingface/transformers#40760))**
All `FlaxXxx` model classes were removed from the library. As a result:
- All JAX tests backed by `FlaxPreTrainedModel` are now marked
`NOT_SUPPORTED_SKIP` (82 test entries updated in
`test_config_inference_single_device.yaml`). Affected model families:
albert, bart, beit, bert/masked_lm, longt5, mt5, t5, regnet, resnet,
vit, dinov2, bloom, clip, distilbert, electra, gpt_j, gpt_neo, gpt_sw3,
mistral, opt, roberta, roformer, squeezebert, wav2vec2, whisper, xglm,
xlm_roberta, marian_mt, mbart50, bigbird, pegasus,
vision_text_dual_encoder
- Removed `FlaxPreTrainedModel` from the `Model` type alias in
`types.py` and from `isinstance` checks and parameter handling in
`jax_model_tester.py` and `dynamic_jax_model_tester.py`
- Four mamba tensor-parallel test entries removed from
`test_config_inference_tensor_parallel.yaml` (Flax mamba model class was
removed)
- EasyDel-based JAX models (falcon, phi1, phi1_5, phi2, phi3, gpt2, qwen
2.5/coder/3, llama, whisper) remain functional and are pinned to
`transformers==4.57.1` via per-model `requirements.txt` in
tt-forge-models, since EasyDel itself requires the older transformers
API

**Legacy cache format removed (transformers 5.0–5.2, [PR #41378](huggingface/transformers#41378), [PR #43168](huggingface/transformers#43168))**
`to_legacy_cache()`, `from_legacy_cache()`, `get_usable_length()`, and
all deprecated `Cache` subclasses were removed. Changes made:
- Updated `kimi_k2/modeling_deepseek.py`: replaced
`DynamicCache.from_legacy_cache()` with a manual layer-by-layer
construction, replaced `to_legacy_cache()` with a manual tuple, and
replaced `get_usable_length()` with `get_seq_length()`
- Updated `kimi_k2/test_kimi_k2.py`: replaced tuple-indexed shard spec
keys (`args[3][0][0]`) with the new layer attribute API
(`args[3].layers[0].compressed_kv`), and added `lazy_initialization()`
calls for `StaticCache` layers

**Unified attention interface (transformers 5.x)**
Attention modules no longer return `attn_weights` when using the unified
SDPA/flash/eager dispatch path, and require `_attn_implementation` to be
set explicitly on the config. Updated Gemma and Mistral attention tests
to:
- Set `config._attn_implementation = "sdpa"` before constructing
attention modules
- Drop `attn_weights` from the return value of the inner attention call
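
A sketch of the test-side fix described above, using Gemma as an illustrative case (config sizes are placeholders):

```python
from transformers import GemmaConfig
from transformers.models.gemma.modeling_gemma import GemmaAttention

config = GemmaConfig(hidden_size=64, num_attention_heads=4, num_key_value_heads=2)
# transformers 5.x: the attention implementation must be set explicitly
# on the config before constructing the module
config._attn_implementation = "sdpa"

attn = GemmaAttention(config, layer_idx=0)
# On the SDPA path the inner attention call no longer yields usable
# attn_weights, so the tests drop them from the unpacked return value
```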

**`XXXFeatureExtractor` classes removed (transformers 5.0, [PR #41174](huggingface/transformers#41174))**
All legacy vision `FeatureExtractor` classes were replaced by
`ImageProcessor` equivalents. Updated in tt-forge-models:
- `detr`: `DetrFeatureExtractor` → `DetrImageProcessor`
- `maskformer`: `MaskFormerFeatureExtractor` →
`MaskFormerImageProcessor`
- `yolos_small`: `YolosFeatureExtractor` → `YolosImageProcessor`

**`encode_plus()` / `batch_encode_plus()` removed in favour of `__call__()` (transformers 5.0)**
The legacy tokenizer encoding methods were formally removed. Changes
made:
- tt-forge-models (`huggyllama`, `mistral`, `roberta`):
`tokenizer.encode_plus(...)` → `tokenizer(...)`
- `examples/pytorch/sdxl-pipeline.py`:
`tokenizer.batch_encode_plus(...)` → `tokenizer(...)`
- `tests/torch/models/llama3/test_llama_step_n300.py`:
`tokenizer.encode_plus(...)` → `tokenizer._encode_plus(...)` (private
method still present in 5.x as the internal implementation; should
ideally be `tokenizer(...)`)
- `tests/torch/quality/image_gen/sdxl/pipeline.py`: replaced the private
`tokenizer._encode_plus(...)` call (which broke in 5.x for list inputs
with `padding="max_length"`) with the public `tokenizer(...)` interface
with explicit `padding="max_length"`, `truncation=True`, and
`return_tensors="pt"`. The old code produced mismatched sequence lengths
for conditioned vs. unconditioned tokens, causing a `torch.cat` shape
mismatch error.
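
A sketch of the equivalent public-interface call used in the sdxl pipeline fix (the checkpoint name is an assumption):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("openai/clip-vit-large-patch14")
prompts = ["a photo of an astronaut riding a horse", ""]  # conditioned + unconditioned

# 4.x: tokenizer.batch_encode_plus(prompts, ...) / tokenizer.encode_plus(text, ...)
# 5.x: __call__ handles single strings and lists alike
inputs = tokenizer(
    prompts,
    padding="max_length",
    max_length=tokenizer.model_max_length,
    truncation=True,
    return_tensors="pt",
)
# padding="max_length" keeps both sequences the same length, avoiding the
# torch.cat shape mismatch described above
```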

**`trust_remote_code` no longer needed for phi3 (transformers 5.x)**
The phi3 model was upstreamed into the official transformers library and
`trust_remote_code=True` is now unnecessary. Removed from
`AutoTokenizer.from_pretrained`, `AutoConfig.from_pretrained`, and
`model_kwargs` in the phi3 loader.
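
That is, roughly (the checkpoint name is an assumption; the loader may use a different one):

```python
from transformers import AutoConfig, AutoTokenizer

# 4.x: AutoTokenizer.from_pretrained(ckpt, trust_remote_code=True) was needed
# 5.x: the in-library phi3 implementation is resolved automatically
ckpt = "microsoft/Phi-3-mini-4k-instruct"
tokenizer = AutoTokenizer.from_pretrained(ckpt)
config = AutoConfig.from_pretrained(ckpt)
```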

**`torch.fx` support dropped (transformers 5.0, [PR #41683](huggingface/transformers#41683))**
`is_torch_fx_available()`, `is_torch_greater_or_equal_than_1_13`, and
all `torch.fx` tracing guards were removed. Updated:
- `deepseek_r1` (deepseekv2) loader in tt-forge-models
- `kimi_k2/modeling_deepseek.py`: removed `is_torch_fx_available` import
and the `_prepare_4d_causal_attention_mask` FX wrap block; replaced
`rope_scaling["type"]` dict access with `.get()` to guard against
missing keys in newer config formats
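
The guard amounts to something along these lines (the fallback key is an assumption based on newer config formats):

```python
rope_scaling = getattr(config, "rope_scaling", None) or {}
# Older configs stored rope_scaling["type"]; newer formats may use
# "rope_type" or omit the key, so direct indexing could raise KeyError
scaling_type = rope_scaling.get("type") or rope_scaling.get("rope_type")
```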

**VLM sub-module path changed (transformers 5.x, [PR #42156](huggingface/transformers#42156))**
Vision-language models no longer expose `model.language_model` directly
at the top level; it is now accessed via `model.model.language_model`.
Updated `mistral/pixtral` loader to add `_get_language_model()` and
`_get_vision_tower()` helpers that handle both paths when building shard
specs.
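
A sketch of the dual-path helper (the attribute probing is an assumption; the real loader may differ):

```python
def _get_language_model(model):
    # transformers 5.x nests the decoder one level deeper
    inner = getattr(model, "model", model)
    if hasattr(inner, "language_model"):
        return inner.language_model
    # transformers 4.x exposed it at the top level
    return model.language_model
```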

**`AutoProcessor` with `trust_remote_code` removed for custom processors (transformers 5.x)**
`AutoProcessor.from_pretrained(trust_remote_code=True)` no longer works
for models with custom processing classes not registered in the
transformers auto-mapping. Updated `openvla_oft` to explicitly
instantiate `PrismaticImageProcessor` and `PrismaticProcessor` from the
local `openvla/pytorch/src/` source.

**`tie_weights()` signature changed (transformers 5.x)**
`PreTrainedModel.tie_weights()` now passes through `**kwargs`. Updated
the `tie_weights` override in
`openvla/pytorch/src/modeling_prismatic.py` to accept and forward
`**kwargs` to avoid a `TypeError` on model init.
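
The fix is essentially a signature change on the override (shown as a method fragment; the real body lives in `modeling_prismatic.py`):

```python
def tie_weights(self, **kwargs):
    # transformers 5.x calls tie_weights(**kwargs); accepting and forwarding
    # the kwargs avoids a TypeError during model initialization
    return super().tie_weights(**kwargs)
```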

**`XLMRobertaSdpaSelfAttention` removed (transformers 5.x)**
The separate SDPA attention class was consolidated into the unified
attention dispatch. Rewrote `XLMRobertaSelfAttentionWithAdapters` in
`sentencizer/pytorch/src/adapter_utils.py` to conform to the new
`forward()` signature using `eager_attention_forward` from transformers.

**`HfFolder.get_token()` removed (huggingface_hub)**
`HfFolder` was removed in recent `huggingface_hub` versions. Updated
`sentencizer/pytorch/src/utils.py` to use `HfApi().token` instead.
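
That is:

```python
# 4.x-era huggingface_hub: token = HfFolder.get_token()
from huggingface_hub import HfApi

token = HfApi().token  # as used in the updated utils.py
```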

**mamba2 JAX loader removed**
`mamba2/causal_lm/jax` was removed as it was non-functional and
incompatible with the pinned EasyDel version used by other JAX models.

#### tt-xla infrastructure changes

- **`transformers` removed from `_JAX_PURGE_SKIP`**
(`tests/runner/requirements.py`): `transformers` was previously excluded
from the `sys.modules` purge that `RequirementsManager` performs after a
per-model pip install. This meant that when an EasyDel model installed
`transformers==4.57.1`, the venv's 5.2.0 stayed cached in memory and the
newly installed version was never visible to imports. Removing
`transformers` from the skip list (keeping only `flax`, which has
genuine module-level imports in JAX infra) ensures the installed version
is correctly used. All JAX infra files were audited to confirm none hold
module-level `transformers` references.
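
The purge boils down to something like this (a simplified sketch; the actual `RequirementsManager` logic may differ):

```python
import sys

_JAX_PURGE_SKIP = ("flax",)  # "transformers" is no longer on this list

def purge(package: str) -> None:
    """Drop a package's cached modules so the next import re-resolves it."""
    if package in _JAX_PURGE_SKIP:
        return
    for name in list(sys.modules):
        if name == package or name.startswith(package + "."):
            del sys.modules[name]

# After a per-model `pip install transformers==4.57.1`:
purge("transformers")  # the next `import transformers` now sees 4.57.1
```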

- **Sparse MLP router output fix**
(`python_package/tt_torch/sparse_mlp.py`): `GptOssTopKRouter` was
updated to return a 3-tuple `(router_logits, router_scores,
router_indices)` instead of 2. Updated all three MoE dispatch paths
(`SparseMLP`, `A2aSparseMLP`, `A2aSparseStackedMlp`) to unpack
accordingly and simplified the weighted-sum logic to use the compact
scores tensor directly, removing a workaround that used `torch.gather` /
one-hot einsum.
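
The dispatch-side change is essentially this unpack (a fragment; names mirror the description above):

```python
# 4.x-era tt_torch code unpacked a 2-tuple from the router:
# router_scores, router_indices = self.router(hidden_states)

# The updated GptOssTopKRouter returns a 3-tuple:
router_logits, router_scores, router_indices = self.router(hidden_states)
# router_scores is the compact per-token weight tensor, so the weighted sum
# consumes it directly (the torch.gather / one-hot einsum workaround is gone)
```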

- **Performance benchmark matrix**
(`.github/workflows/perf-bench-matrix.json`): Updated all PyTorch
benchmark entries from `transformers==4.57.1` to `transformers==5.2.0`.
The `resnet_jax` and `bge_m3_encode` entries are intentionally kept at
`transformers==4.57.1` — `FlaxResNetForImageClassification` was removed
in 5.x, and `FlagEmbedding` (used by bge_m3) is not yet compatible with
5.x.

- **LLM benchmark version check**
(`tests/benchmark/benchmarks/llm_benchmark.py`): Updated
`check_transformers_version()` to require exactly `5.2.0` instead of `<=
4.57.1`. Also removed the now-unnecessary `check_transformers_version()`
guard from `examples/pytorch/llama.py`.

- **Resnet codegen examples skipped**
(`tests/examples/test_examples.py`): Added XFAIL entries for
`jax/codegen/cpp/resnet.py` and `jax/codegen/python/resnet.py` since
`FlaxResNetModel` was removed in transformers 5.x.

- **`surya-ocr` unpinned** (`venv/requirements-dev.txt`): Removed the
`surya-ocr==0.17.0` version pin.

#### tt-forge models PR:
tenstorrent/tt-forge-models#529

### CI tests for reference:
Manual Release test:
https://github.com/tenstorrent/tt-xla/actions/runs/23179435697
Manual Manylinux release test:
https://github.com/tenstorrent/tt-xla/actions/runs/23179426382

### Checklist
- [x] Fix `gpt_oss` failure
- [x] Fix JAX-only CI workflows

---------

Co-authored-by: Vladimir Zeljkovic <vzeljkovic@tenstorrent.com>
ppadjinTT pushed a commit to tenstorrent/tt-xla that referenced this pull request Mar 31, 2026
vzeljkovicTT added a commit to tenstorrent/tt-xla that referenced this pull request Apr 14, 2026