System Info
- transformers version: 4.45.0.dev0
- Platform: Linux-5.15.0-78-generic-x86_64-with-glibc2.35
- Python version: 3.11.9
- Huggingface_hub version: 0.24.6
- Safetensors version: 0.4.4
- Accelerate version: 0.33.0
- Accelerate config: not found
- PyTorch version (GPU?): 2.4.0 (True)
- Tensorflow version (GPU?): not installed (NA)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
- Using distributed or parallel set-up in script?:
- Using GPU in script?:
- GPU type: NVIDIA H100 80GB HBM3
Who can help?
@zu
Information
Tasks
Reproduction
There is a typo in the following lines in LlavaNextProcessor: current_width and current_height are inverted, which can cause errors due to a mismatch between the image feature size computed by the processor and the one computed by the vision branch in LlavaNextForConditionalGeneration. I encountered this issue while running the following example script.
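For illustration, the inversion amounts to roughly the following (a paraphrased sketch of the relevant assignments in _get_unpadded_features, not the exact source):

# Suspected current (inverted) assignments:
current_width = patches_height * scale_height
current_height = patches_width * scale_width
# What they should presumably be:
current_height = patches_height * scale_height
current_width = patches_width * scale_width

Since the rest of the function derives unpadded_features and newline_features from these two values, any non-square patch grid makes the processor disagree with the vision branch.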
Here is a code snippet to reproduce the issue:
from transformers import LlavaNextProcessor
from transformers.models.llava_next.processing_llava_next import select_best_resolution
from transformers.models.llava_next.modeling_llava_next import unpad_image, get_anyres_image_grid_shape
import torch
POSSIBLE_RESOLUTIONS = [
    [336, 672],
    [672, 336],
    [672, 672],
    [1008, 336],
    [336, 1008],
]
processor = LlavaNextProcessor.from_pretrained("llava-hf/llava-v1.6-mistral-7b-hf")
HEIGHT = 500
WIDTH = 316
VISION_MODEL_INPUT_SIZE = 336
PATCH_SIZE = 14
PATCH_DIM = VISION_MODEL_INPUT_SIZE // PATCH_SIZE
# Reproduce pre-processing steps in the processor
height_best_resolution, width_best_resolution = select_best_resolution(
[HEIGHT, WIDTH], POSSIBLE_RESOLUTIONS
)
scale_height = height_best_resolution // VISION_MODEL_INPUT_SIZE
scale_width = width_best_resolution // VISION_MODEL_INPUT_SIZE
patches_height = VISION_MODEL_INPUT_SIZE // PATCH_SIZE
patches_width = VISION_MODEL_INPUT_SIZE // PATCH_SIZE
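# _get_unpadded_features is a private LlavaNextProcessor helper that returns
# (unpadded_features, newline_features) for the given image and grid geometry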
unpadded_features, newline_features = processor._get_unpadded_features(
    HEIGHT, WIDTH, patches_height, patches_width, scale_height, scale_width
)
num_unpad_features_from_processor = unpadded_features
# Reproduce computation of unpadded features in the vision branch
# Equivalent to:
# https://github.com/huggingface/transformers/blob/main/src/transformers/models/llava_next/modeling_llava_next.py#L676-L684
num_patch_height, num_patch_width = get_anyres_image_grid_shape(
(HEIGHT, WIDTH),
POSSIBLE_RESOLUTIONS,
VISION_MODEL_INPUT_SIZE,
)
# Dummy feature map (arbitrary channel dim) laid out over the anyres patch grid
unpad_features_from_vision = unpad_image(
    torch.randn(128, num_patch_height * PATCH_DIM, num_patch_width * PATCH_DIM),
    (HEIGHT, WIDTH),
)
num_unpad_features_from_vision = unpad_features_from_vision.shape[1] * unpad_features_from_vision.shape[2]
# Should be equal
assert num_unpad_features_from_processor == num_unpad_features_from_vision, (
    f"Not equal: from processor: {num_unpad_features_from_processor}, "
    f"from vision: {num_unpad_features_from_vision}"
)
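For reference, working the numbers by hand for this 500x316 image (assuming select_best_resolution picks the 672x336 resolution, i.e. a 2x1 arrangement of 336px tiles and hence a 48x24 patch grid): unpad_image computes a target height of int(500 * 24 / 316) = 37, trims a padding of (48 - 37) // 2 = 5 rows from top and bottom, and leaves a 38x24 grid, i.e. 912 features on the vision side, while the inverted processor math lands on a different count, so the assert fires.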
Expected behavior
No assertion error.