
added fast image processor for ZoeDepth and expanded tests accordingly#38515

Merged
yonigozlan merged 4 commits into huggingface:main from
henrikm11:fast-image-processor-zoedepth
Jun 4, 2025

Conversation

@henrikm11
Contributor

@henrikm11 henrikm11 commented Jun 1, 2025

potential reviewer: @yonigozlan

Member

@yonigozlan yonigozlan left a comment

Hi @henrikm11! Thanks for working on this, it looks almost ready to go! There are still a few things left to simplify in the fast image processor.

Comment on lines +166 to +162
    def _resize(
        self,
        images: "torch.Tensor",
        size: SizeDict,
        keep_aspect_ratio: bool = False,
        ensure_multiple_of: int = 1,
        resample: PILImageResampling = PILImageResampling.BILINEAR,
    ) -> "torch.Tensor":
        """
        Resize an image or batched images to target size `(size["height"], size["width"])`. If `keep_aspect_ratio` is `True`, the image
        is resized to the largest possible size such that the aspect ratio is preserved. If `ensure_multiple_of` is
        set, the image is resized to a size that is a multiple of this value.

        Args:
            images (`torch.Tensor`):
                Images to resize.
            size (`Dict[str, int]`):
                Target size of the output image.
            keep_aspect_ratio (`bool`, *optional*, defaults to `False`):
                If `True`, the image is resized to the largest possible size such that the aspect ratio is preserved.
            ensure_multiple_of (`int`, *optional*, defaults to 1):
                The image is resized to a size that is a multiple of this value.
            interpoation (`F.InterpolationMode`, *optional*, defaults to `InterpolationMode.BILINEAR`):
                Defines the resampling filter to use if resizing the image. Otherwise, the image is resized to size
                specified in `size`.
        """
        if not size.height or not size.width:
            raise ValueError(f"The size dictionary must contain the keys 'height' and 'width'. Got {size}")
        output_size = get_resize_output_image_size(
            images,
            output_size=(size.height, size.width),
            keep_aspect_ratio=keep_aspect_ratio,
            multiple=ensure_multiple_of,
            input_data_format=ChannelDimension.FIRST,
        )
        height, width = output_size

        resample_to_mode = {PILImageResampling.BILINEAR: "bilinear", PILImageResampling.BICUBIC: "bicubic"}
        mode = resample_to_mode[resample]

        resized_images = torch.nn.functional.interpolate(
            images, (int(height), int(width)), mode=mode, align_corners=True
        )

        return resized_images
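The `ensure_multiple_of` behavior can be sketched standalone. The helpers below are hypothetical (the real rounding rules live in `get_resize_output_image_size` and may differ in edge cases): each target dimension is rounded to the nearest multiple of the given value.

```python
def constrain_to_multiple_of(value: float, multiple: int) -> int:
    """Round value to the nearest multiple of `multiple` (hypothetical helper)."""
    rounded = round(value / multiple) * multiple
    if rounded == 0:
        # Never collapse a dimension to zero for tiny inputs.
        rounded = multiple
    return int(rounded)


def output_size(height: int, width: int, multiple: int) -> tuple:
    """Target (height, width) with both dimensions constrained to a multiple."""
    return constrain_to_multiple_of(height, multiple), constrain_to_multiple_of(width, multiple)


print(output_size(512, 672, 32))  # → (512, 672): already multiples of 32
print(output_size(500, 650, 32))  # → (512, 640)
```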
Member

Why not use the resize function from BaseImageProcessorFast?

Contributor Author

Isn't align_corners=True an issue here? AFAIK that's simply not available in torchvision.transforms.functional (it is effectively set to False), and thus one gets significantly different results.
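The concern is that the two conventions sample the source image at different coordinates. A pure-Python sketch of the documented coordinate formulas (assumed to match torch's behavior for `interpolate`; not code from the PR) makes the difference concrete:

```python
def source_coords(out_len: int, in_len: int, align_corners: bool) -> list:
    """Source sampling coordinates for 1-D bilinear resizing under the two
    conventions (sketch of the documented formulas, assumed here)."""
    if align_corners:
        # Corner samples of input and output are aligned exactly.
        scale = (in_len - 1) / (out_len - 1)
        return [i * scale for i in range(out_len)]
    # Otherwise sample centers are aligned (torchvision's behavior).
    scale = in_len / out_len
    return [(i + 0.5) * scale - 0.5 for i in range(out_len)]


print(source_coords(4, 8, align_corners=True))
print(source_coords(4, 8, align_corners=False))  # [0.5, 2.5, 4.5, 6.5]
```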

Member

I see, in this case let's override the resize function from BaseImageProcessorFast. Also, you shouldn't need to define resample_to_mode: just use the interpolation arg in _preprocess instead of resample, as it is already the PILImageResampling converted to a torch InterpolationMode. And there's still no need to override the whole preprocess function.

Contributor Author

@henrikm11 henrikm11 Jun 2, 2025

Indeed, thanks for pointing that out; not even sure why I overrode preprocess in the first place now... All other suggested changes make perfect sense to me, sorry for not being as diligent about cleaning and simplifying the code as I should have been. Will make all the changes tomorrow.

Member

No worries at all!

    def _pad_images(
        self,
        images: "torch.Tensor",
        mode: TorchPaddingMode = TorchPaddingMode.REFLECT,
Member

Let's just have

Suggested change
        mode: TorchPaddingMode = TorchPaddingMode.REFLECT,
        mode="reflect",

So that we can remove TorchPaddingMode altogether.
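For context, `"reflect"` padding mirrors the samples around the edge without repeating the edge sample itself. A minimal 1-D sketch of those semantics (a hypothetical helper, not the processor's `_pad_images`):

```python
def reflect_pad_1d(xs: list, pad: int) -> list:
    """Reflect-pad a 1-D sequence: mirror around each edge without repeating
    the edge sample (what mode="reflect" does per spatial dimension)."""
    left = [xs[i] for i in range(pad, 0, -1)]   # xs[pad], ..., xs[1]
    right = [xs[-2 - i] for i in range(pad)]    # xs[-2], xs[-3], ...
    return left + list(xs) + right


print(reflect_pad_1d([1, 2, 3, 4], 2))  # → [3, 2, 1, 2, 3, 4, 3, 2]
```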

Comment on lines +128 to +164
    @auto_docstring
    def preprocess(
        self,
        images: ImageInput,
        *args,
        **kwargs: Unpack[ZoeDepthFastImageProcessorKwargs],
    ) -> BatchFeature:
        validate_kwargs(captured_kwargs=kwargs.keys(), valid_processor_keys=self.valid_kwargs.__annotations__.keys())
        # Set default kwargs from self. This ensures that if a kwarg is not provided
        # by the user, it gets its default value from the instance, or is set to None.

        for kwarg_name in self.valid_kwargs.__annotations__:
            kwargs.setdefault(kwarg_name, getattr(self, kwarg_name, None))

        # Extract parameters that are only used for preparing the input images
        do_convert_rgb = kwargs.pop("do_convert_rgb")
        input_data_format = kwargs.pop("input_data_format")
        device = kwargs.pop("device")

        # Prepare input images
        images = self._prepare_input_images(
            images=images, do_convert_rgb=do_convert_rgb, input_data_format=input_data_format, device=device
        )
        # return list of torch.Tensor in ChannelDimension.FIRST format
        # Update kwargs that need further processing before being validated
        kwargs = self._further_process_kwargs(**kwargs)

        # Validate kwargs
        self._validate_preprocess_kwargs(**kwargs)

        # Pop kwargs that are not needed in _preprocess
        kwargs.pop("default_to_square")
        kwargs.pop("data_format")
        kwargs.pop("do_center_crop")
        kwargs.pop("crop_size")

        return self._preprocess(images, *args, **kwargs)
Member

I don't think there's a need to override anything here, especially if we use the BaseImageProcessorFast resize. You can just have this to account for ZoeDepthFastImageProcessorKwargs in the signature:

Suggested change
    @auto_docstring
    def preprocess(
        self,
        images: ImageInput,
        *args,
        **kwargs: Unpack[ZoeDepthFastImageProcessorKwargs],
    ) -> BatchFeature:
        validate_kwargs(captured_kwargs=kwargs.keys(), valid_processor_keys=self.valid_kwargs.__annotations__.keys())
        # Set default kwargs from self. This ensures that if a kwarg is not provided
        # by the user, it gets its default value from the instance, or is set to None.
        for kwarg_name in self.valid_kwargs.__annotations__:
            kwargs.setdefault(kwarg_name, getattr(self, kwarg_name, None))
        # Extract parameters that are only used for preparing the input images
        do_convert_rgb = kwargs.pop("do_convert_rgb")
        input_data_format = kwargs.pop("input_data_format")
        device = kwargs.pop("device")
        # Prepare input images
        images = self._prepare_input_images(
            images=images, do_convert_rgb=do_convert_rgb, input_data_format=input_data_format, device=device
        )
        # return list of torch.Tensor in ChannelDimension.FIRST format
        # Update kwargs that need further processing before being validated
        kwargs = self._further_process_kwargs(**kwargs)
        # Validate kwargs
        self._validate_preprocess_kwargs(**kwargs)
        # Pop kwargs that are not needed in _preprocess
        kwargs.pop("default_to_square")
        kwargs.pop("data_format")
        kwargs.pop("do_center_crop")
        kwargs.pop("crop_size")
        return self._preprocess(images, *args, **kwargs)
    @auto_docstring
    def preprocess(self, images: ImageInput, *args, **kwargs: Unpack[ZoeDepthFastImageProcessorKwargs]) -> BatchFeature:
        return super().preprocess(images, **kwargs)
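The `kwargs.setdefault(..., getattr(self, ...))` pattern used in the override being removed here, filling in any kwarg the caller omits from instance defaults, can be sketched with a hypothetical class (not the actual transformers base class):

```python
class Processor:
    """Hypothetical processor illustrating the default-filling pattern."""

    do_resize = True
    size = {"height": 384, "width": 512}
    valid_kwargs = ("do_resize", "size")

    def preprocess(self, **kwargs):
        # Any kwarg the caller omits falls back to the instance default,
        # or None if the instance has no such attribute.
        for name in self.valid_kwargs:
            kwargs.setdefault(name, getattr(self, name, None))
        return kwargs


p = Processor()
print(p.preprocess(do_resize=False))
# → {'do_resize': False, 'size': {'height': 384, 'width': 512}}
```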

        if do_rescale:
            stacked_images = self.rescale(stacked_images, rescale_factor)
        if do_pad:
            stacked_images = self._pad_images(images=stacked_images, mode=PaddingMode.REFLECT)
Member

No need to provide PaddingMode.REFLECT, as there's no way to specify the mode in the slow image processor.

        stacked_images = self._pad_images(images=stacked_images, mode=PaddingMode.REFLECT)
        if do_resize:
            stacked_images = self._resize(stacked_images, size, keep_aspect_ratio, ensure_multiple_of, resample)
        print(stacked_images.dtype)
Member

Suggested change
        print(stacked_images.dtype)

        print(stacked_images.dtype)
        if do_normalize:
            stacked_images = self.normalize(stacked_images, image_mean, image_std)
        print(stacked_images.dtype)
Member

Suggested change
        print(stacked_images.dtype)

Comment on lines +266 to +268
        # Group images by size for further processing
        # Needed in case do_resize is False, or resize returns images with different sizes
        grouped_images, grouped_images_index = group_images_by_shape(resized_images)
Member

Suggested change
        # Group images by size for further processing
        # Needed in case do_resize is False, or resize returns images with different sizes
        grouped_images, grouped_images_index = group_images_by_shape(resized_images)
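The idea behind `group_images_by_shape` can be sketched with a hypothetical helper operating on shape tuples rather than tensors: bucket same-sized images together so each bucket can be stacked and processed as one batch.

```python
from collections import defaultdict


def group_by_shape(shapes: list) -> dict:
    """Bucket image indices by (height, width) so same-sized images can be
    stacked into one batch (sketch of the idea, not the real API)."""
    groups = defaultdict(list)
    for idx, shape in enumerate(shapes):
        groups[shape].append(idx)
    return dict(groups)


print(group_by_shape([(384, 512), (256, 256), (384, 512)]))
# → {(384, 512): [0, 2], (256, 256): [1]}
```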


        return F.pad(images, padding=(pad_width, pad_height), padding_mode=mode)

    @filter_out_non_signature_kwargs()
Member

Suggested change
    @filter_out_non_signature_kwargs()

        modified_dict = self.image_processor_dict
        modified_dict["size"] = 42
        image_processor = image_processing_class(**modified_dict)
        print(self.image_processor_dict)
Member

Suggested change
        print(self.image_processor_dict)

        self.assertEqual(list(pixel_values.shape), [1, 3, 512, 672])
        self.assertTrue(pixel_values.shape[2] % multiple == 0)
        self.assertTrue(pixel_values.shape[3] % multiple == 0)
        self.image_processor_tester.do_pad = True
Member

Suggested change
        self.image_processor_tester.do_pad = True

@henrikm11 henrikm11 force-pushed the fast-image-processor-zoedepth branch from 8fe819e to bde23db Compare June 3, 2025 19:35
Member

@yonigozlan yonigozlan left a comment

Very nice! Two very small things left to change; after that, LGTM!

        size: SizeDict,
        keep_aspect_ratio: bool = False,
        ensure_multiple_of: int = 1,
        interpolation: "F.InterpolationMode" = InterpolationMode.BILINEAR,
Member

I think having InterpolationMode.BILINEAR as a default can cause some issues with the CI when torchvision is not available... Since we have resample = PILImageResampling.BILINEAR as a default class attribute, we shouldn't need to set a default here and can just set it to None. That way we can also remove the `from torchvision.transforms import InterpolationMode` above.

Contributor Author

Removing the default is no problem at all; the import is still required, however, because post_process_depth_estimation uses it.

                If `True`, the image is resized to the largest possible size such that the aspect ratio is preserved.
            ensure_multiple_of (`int`, *optional*, defaults to 1):
                The image is resized to a size that is a multiple of this value.
            interpoation (`F.InterpolationMode`, *optional*, defaults to `InterpolationMode.BILINEAR`):
Member

Suggested change
            interpoation (`F.InterpolationMode`, *optional*, defaults to `InterpolationMode.BILINEAR`):
            interpolation (`F.InterpolationMode`, *optional*, defaults to `InterpolationMode.BILINEAR`):

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@henrikm11 henrikm11 force-pushed the fast-image-processor-zoedepth branch from 64444e5 to be46058 Compare June 4, 2025 20:38
@henrikm11
Contributor Author

henrikm11 commented Jun 4, 2025

The two failed tests are entirely unrelated to the changes I have made; do you have any suggestions for how to deal with this? Even worse, one fails because of a request timeout, the other because the max difference of (I assume random) tensors is 5e-05 rather than 1e-05, neither of which seems to be a fully deterministic condition... maybe you can just rerun the tests @yonigozlan?

@yonigozlan
Member

The two failed tests are entirely unrelated to the changes I have made; do you have any suggestions for how to deal with this? Even worse, one fails because of a request timeout, the other because the max difference of (I assume random) tensors is 5e-05 rather than 1e-05, neither of which seems to be a fully deterministic condition... maybe you can just rerun the tests @yonigozlan?

The CI can be a bit flaky sometimes, running the tests again and merging when they pass!
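The tolerance issue described above is easy to illustrate with `math.isclose`: a maximum difference of 5e-05 passes at an absolute tolerance of 1e-4 but fails at 1e-5 (illustrative numbers only; the real test compares full tensors).

```python
import math

# A max element difference of 5e-05 is within atol=1e-4 but not atol=1e-5.
diff = 5e-05
print(math.isclose(diff, 0.0, abs_tol=1e-4))  # True
print(math.isclose(diff, 0.0, abs_tol=1e-5))  # False
```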

@yonigozlan yonigozlan enabled auto-merge (squash) June 4, 2025 22:46
@yonigozlan yonigozlan merged commit 1fed616 into huggingface:main Jun 4, 2025
20 checks passed
bvantuan pushed a commit to bvantuan/transformers that referenced this pull request Jun 12, 2025
huggingface#38515)

* added fast image processor for ZoeDepth and expanded tests accordingly

* added fast image processor for ZoeDepth and expanded tests accordingly, hopefully fixed repo consistency issue too now

* final edits for zoedept fast image processor

* final minor edit for zoedepth fast imate procesor
@henrikm11 henrikm11 deleted the fast-image-processor-zoedepth branch June 23, 2025 19:24
3 participants