Add ViTPose #30530
Conversation
@ArthurZucker please review whenever you have time! The notebook to run: https://colab.research.google.com/drive/1e8fcby5rhKZWcr9LSN8mNbQ0TU4Dxxpo?usp=sharing Checkpoints:

@ArthurZucker Reminder ping :)
ArthurZucker
left a comment
Wondering if you are certain we need to split these. I think we can just add all in vitpose no? Why do we need to have the backbone separate?
Otherwise good work!
```python
# create image processor
image_processor = VitPoseImageProcessor()

# verify image processor
image = prepare_img()
boxes = [[[412.8, 157.61, 53.05, 138.01], [384.43, 172.21, 15.12, 35.74]]]
pixel_values = image_processor(images=image, boxes=boxes, return_tensors="pt").pixel_values

filepath = hf_hub_download(repo_id="nielsr/test-image", filename="vitpose_batch_data.pt", repo_type="dataset")
original_pixel_values = torch.load(filepath, map_location="cpu")["img"]
assert torch.allclose(pixel_values, original_pixel_values, atol=1e-1)

dataset_index = torch.tensor([0])

with torch.no_grad():
    # first forward pass
    outputs = model(pixel_values, dataset_index=dataset_index)
    output_heatmap = outputs.heatmaps

    # second forward pass (flipped)
    # this is done since the model uses `flip_test=True` in its test config
    pixel_values_flipped = torch.flip(pixel_values, [3])
    outputs_flipped = model(
        pixel_values_flipped,
        dataset_index=dataset_index,
        flip_pairs=torch.tensor([[1, 2], [3, 4], [5, 6], [7, 8], [9, 10], [11, 12], [13, 14], [15, 16]]),
    )
    output_flipped_heatmap = outputs_flipped.heatmaps

outputs.heatmaps = (output_heatmap + output_flipped_heatmap) * 0.5

# verify pose_results
pose_results = image_processor.post_process_pose_estimation(outputs, boxes=boxes)[0]

if model_name == "vitpose-base-simple":
    assert torch.allclose(
        pose_results[1]["keypoints"][0],
        torch.tensor([3.98180511e02, 1.81808380e02]),
        atol=5e-2,
    )
    assert torch.allclose(
        pose_results[1]["scores"][0],
        torch.tensor([8.66642594e-01]),
        atol=5e-2,
    )
elif model_name == "vitpose-base":
    assert torch.allclose(
        pose_results[1]["keypoints"][0],
        torch.tensor([3.9807913e02, 1.8182812e02]),
        atol=5e-2,
    )
    assert torch.allclose(
        pose_results[1]["scores"][0],
        torch.tensor([8.8235235e-01]),
        atol=5e-2,
    )
elif model_name == "vitpose-base-coco-aic-mpii":
    assert torch.allclose(
        pose_results[1]["keypoints"][0],
        torch.tensor([3.98305542e02, 1.81741592e02]),
        atol=5e-2,
    )
    assert torch.allclose(
        pose_results[1]["scores"][0],
        torch.tensor([8.69966745e-01]),
        atol=5e-2,
    )
elif model_name == "vitpose-plus-base":
    assert torch.allclose(
        pose_results[1]["keypoints"][0],
        torch.tensor([3.98201294e02, 1.81728302e02]),
        atol=5e-2,
    )
    assert torch.allclose(
        pose_results[1]["scores"][0],
        torch.tensor([8.75046968e-01]),
        atol=5e-2,
    )
else:
    raise ValueError("Model not supported")
print("Conversion successfully done.")
```
as always let's put this in tests rather than here. it's not part of the conversion
Do you mean erase it in this convert.py and put it in the tests, or keep it in both places?
Yes, let's erase it here and leave it only in tests
I'm not a fan of this, I'd always put these in conversion scripts.... especially since we are still tweaking the architecture and things can break + the number of users using the ViTPose conversion script afterwards is a number very close to 0.
another reason is that you have a single source of truth to showcase to people who want to know how we obtained equivalent logits.
By just adding an integration test, you lose all that information, causing breaking changes such as this one. For models like Pixtral but also SAM, it's unclear to me how equivalent logits were obtained, and I'm not even sure we have them as it's not proven in the conversion script!
I also agree with Niels, but let me know which way to refactor.
That sounds fair to me, thanks for explaining it. I would also add that we typically don't have all checkpoints tested in integration tests, and having this kind of test in the conversion script also allows to iterate faster while porting the model, as we don't have to run tests all the time.
> the number of users using the ViTPose conversion script afterwards is a number very close to 0.
Yeah, but things tend to propagate to newer models too. So maybe it's worth adding a flag to check logits or not, to allow porting other checkpoints as well. What do you think?
Yes adding a boolean flag is fine for me
Okay, sounds good, let's add a flag in that case, it's the best of both worlds 😉
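The agreed-upon flag could look something like the following minimal sketch; the `--verify_logits` flag name and the helper functions are illustrative assumptions, not the PR's final API:

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser()
    parser.add_argument("--model_name", default="vitpose-base-simple")
    parser.add_argument(
        "--verify_logits",
        action="store_true",
        help="Check converted outputs against the original checkpoint's logits.",
    )
    return parser


def convert_checkpoint(model_name: str, verify_logits: bool) -> None:
    print(f"Converting {model_name}...")
    if verify_logits:
        # The torch.allclose checks against the known checkpoints would go here;
        # they are skipped when porting a new checkpoint without reference logits.
        print("Verifying logits against the original implementation...")
    print("Conversion successfully done.")


if __name__ == "__main__":
    args = build_parser().parse_args()
    convert_checkpoint(args.model_name, args.verify_logits)
```

This keeps the verification as the single source of truth in the conversion script while still allowing other checkpoints to be ported without it.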
```python
if self.do_affine_transform:
    new_images = []
    for image, image_boxes in zip(images, boxes):
        for box in image_boxes:
            center, scale = box_to_center_and_scale(
                box,
                image_width=size["width"],
                image_height=size["height"],
                normalize_factor=self.normalize_factor,
            )
            transformed_image = self.affine_transform(
                image, center, scale, rotation=0, size=size, input_data_format=input_data_format
            )
            new_images.append(transformed_image)
    images = new_images

# For batch processing, the number of boxes must be consistent across all images in the batch.
# When using a list input, the number of boxes can vary dynamically per image.
# The image processor creates pixel_values of shape (batch_size*num_persons, num_channels, height, width)

if self.do_rescale:
    images = [
        self.rescale(image=image, scale=rescale_factor, input_data_format=input_data_format)
        for image in images
    ]
if self.do_normalize:
    images = [
        self.normalize(image=image, mean=image_mean, std=image_std, input_data_format=input_data_format)
        for image in images
    ]

images = [
    to_channel_dimension_format(image, data_format, input_channel_dim=input_data_format) for image in images
]
```
why don't we have a single for loop here?
For clarity:

```python
if self.do_rescale:
    images = [
        self.rescale(image=image, scale=rescale_factor, input_data_format=input_data_format)
        for image in images
    ]
if self.do_normalize:
    images = [
        self.normalize(image=image, mean=image_mean, std=image_std, input_data_format=input_data_format)
        for image in images
    ]
```

transform to:

```python
preprocessed_images = []
for image in images:
    ...
    if self.do_normalize:
        image = self.normalize(image, ...)
    if self.do_rescale:
        image = self.rescale(image, ...)
    preprocessed_images.append(image)
...
```
That would make it inconsistent with any of the other image processors we have, e.g. here for ViTImageProcessor
Some models have had this order refactored already (clip, bit, chameleon, llava*, ...). It's not a big change, but I believe it improves the readability of the code.
It is also supposed to be faster 😉
Ok let's refactor the other ones in that case :)
```python
experts = [nn.Linear(hidden_features, part_features) for _ in range(num_experts)]
self.experts = nn.ModuleList(experts)

def forward(self, hidden_state: torch.Tensor, indices: torch.Tensor) -> torch.Tensor:
```
let's go with SwitchTransformersSparseMLP implementation, it should be more efficient see #31173
it's not a must but it's good for both standardization and performances in general!
@ArthurZucker I have read the PR that you linked and understood the concept of skipping unused experts at inference time.
This case is a little different since we have an additional input that controls which experts are used. However, I have slightly changed the code.
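The SwitchTransformers-style sparse dispatch the reviewer points to can be sketched roughly as follows; the class and argument names here are illustrative, not the PR's final implementation. The key point is that only experts actually selected by `indices` are ever run:

```python
import torch
from torch import nn


class SparseExpertMLP(nn.Module):
    """Illustrative sketch: route each token to the expert named by an external
    index tensor, running only the experts that are actually selected, instead
    of looping over every expert unconditionally."""

    def __init__(self, hidden_features: int, part_features: int, num_experts: int):
        super().__init__()
        self.experts = nn.ModuleList(
            nn.Linear(hidden_features, part_features) for _ in range(num_experts)
        )

    def forward(self, hidden_state: torch.Tensor, indices: torch.Tensor) -> torch.Tensor:
        # hidden_state: (num_tokens, hidden_features); indices: (num_tokens,)
        num_tokens = hidden_state.shape[0]
        out_features = self.experts[0].out_features
        output = hidden_state.new_zeros((num_tokens, out_features))
        # Only iterate over experts that actually appear in `indices`.
        for expert_idx in indices.unique():
            mask = indices == expert_idx
            output[mask] = self.experts[expert_idx](hidden_state[mask])
        return output


mlp = SparseExpertMLP(hidden_features=8, part_features=4, num_experts=3)
tokens = torch.randn(5, 8)
indices = torch.tensor([0, 2, 0, 2, 2])  # expert 1 is never selected, so it never runs
out = mlp(tokens, indices)
print(out.shape)  # torch.Size([5, 4])
```

Here the routing indices come from the caller (e.g. a dataset index) rather than from a learned router, which is the extra input this model has to handle.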
```python
hidden_states = attention_output + hidden_states

layer_output = self.layernorm_after(hidden_states)
if self.num_experts == 1:
```
are there any released models that use MoE?
Yes, they have several models that contain MoE parameters.
Looks alright, but we should be using modular transformers. This will make it easier when we refactor attention or whatnot !
Also, you know better than me the other vision models, but let's push standardization as much as we can for things that can be translated to another thing!
@ArthurZucker Thanks for the review, I will mark addressed comments with a thumbs-up reaction. For the question about the backbone part: I think there is no strictly necessary reason for dividing it into two independent architectures (this part was originally designed by Niels). However, I assume it reflects the intention of the authors and of Niels. The paper targets human pose estimation based on a very strong common backbone, switching only the last MLP-like (e.g. head) layer per task and dataset for architectural efficiency. For instance, you can see that many models share the same backbone architecture, so I think that is why Niels built it this way. (@NielsRogge is there a specific reason for dividing out the backbone architecture?) https://github.com/facebookresearch/mae
It's mainly because ViTPose itself is a framework which could leverage various different backbones, so I made it compatible with the
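The shared-backbone-plus-task-head design discussed above can be sketched in a few lines; the class names, the stand-in backbone, and the shapes below are illustrative assumptions, not the PR's actual modules:

```python
import torch
from torch import nn


class SimpleDecoderHead(nn.Module):
    """Illustrative head: upsample backbone features, predict one heatmap per keypoint."""

    def __init__(self, hidden_size: int, num_keypoints: int, scale_factor: int = 4):
        super().__init__()
        self.upsample = nn.Upsample(scale_factor=scale_factor, mode="bilinear", align_corners=False)
        self.conv = nn.Conv2d(hidden_size, num_keypoints, kernel_size=3, padding=1)

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        return self.conv(self.upsample(features))


class PoseEstimator(nn.Module):
    """Any backbone producing a (batch, hidden, h, w) feature map can be plugged in;
    only the lightweight head is task-specific."""

    def __init__(self, backbone: nn.Module, head: nn.Module):
        super().__init__()
        self.backbone = backbone
        self.head = head

    def forward(self, pixel_values: torch.Tensor) -> torch.Tensor:
        return self.head(self.backbone(pixel_values))


# Stand-in backbone (patchify via strided conv); in the PR this would be a ViT
# wrapped by the library's backbone API.
backbone = nn.Conv2d(3, 64, kernel_size=16, stride=16)
model = PoseEstimator(backbone, SimpleDecoderHead(hidden_size=64, num_keypoints=17))
heatmaps = model(torch.randn(1, 3, 256, 192))
print(heatmaps.shape)  # torch.Size([1, 17, 64, 48])
```

Swapping the backbone or the head leaves the other half untouched, which is the efficiency argument made in the paper.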
This reverts commit 2c56a48.
ArthurZucker
left a comment
Thanks for iterating! Just 2 things to check, 1 blocking and the other one is not.
In general @qubvel and @NielsRogge I think we have an effort to make to bring standardization to vision models and both of you know a lot more than me about that!
-> Things like patch embedding, classic decoder layers, etc that are re-implemented everywhere could benefit from that. I might be wrong tho! 😉
Anyways great work everyone! @qubvel feel free to merge once these are addressed and you are happy with the changes!
```python
# When using a list input, the number of boxes can vary dynamically per image.
# The image processor creates pixel_values of shape (batch_size*num_persons, num_channels, height, width)

if self.do_rescale:
```
```python
# add backbone attributes
if not hasattr(self.backbone.config, "hidden_size"):
    raise ValueError("The backbone should have a hidden_size attribute")
if not hasattr(self.backbone.config, "image_size"):
    raise ValueError("The backbone should have an image_size attribute")
if not hasattr(self.backbone.config, "patch_size"):
    raise ValueError("The backbone should have a patch_size attribute")
```
config checks should be done in the config, not sure this has a lot of value / can be removed!
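Moving the checks into the config, as suggested, could look like the following sketch; the class name and attribute list mirror the quoted snippet but are illustrative, not the final code:

```python
from types import SimpleNamespace


class VitPoseConfigSketch:
    """Hedged sketch: validate required backbone attributes once, at config
    construction time, so the model can trust the config instead of re-checking."""

    REQUIRED_BACKBONE_ATTRIBUTES = ("hidden_size", "image_size", "patch_size")

    def __init__(self, backbone_config):
        for attribute in self.REQUIRED_BACKBONE_ATTRIBUTES:
            if not hasattr(backbone_config, attribute):
                raise ValueError(f"The backbone config should have a {attribute} attribute")
        self.backbone_config = backbone_config


backbone_config = SimpleNamespace(hidden_size=768, image_size=(256, 192), patch_size=(16, 16))
config = VitPoseConfigSketch(backbone_config)
print(config.backbone_config.hidden_size)  # 768
```

This fails fast with a clear error at config time, and the model's `__init__` stays free of `hasattr` checks.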
```python
if not hasattr(self.backbone.config, "patch_size"):
    raise ValueError("The backbone should have a patch_size attribute")

self.head = VitPoseSimpleDecoder(config) if config.use_simple_decoder else VitPoseClassicDecoder(config)
```
are both used in the released checkpoints?
@qubvel @NielsRogge Hi, can you check the changes and also the checkpoints? Always thanks for the support.

Hi @SangbumChoi! Thanks for the fixes, I will push the final fixes and merge the PR! Checkpoints can be found here:
```python
initializer_range: float = 0.02,
scale_factor: int = 4,
use_simple_decoder: bool = True,
**kwargs,
```
num_labels is a default argument of PretrainedConfig
What does this PR do?
This PR adds ViTPose as introduced in ViTPose: Simple Vision Transformer Baselines for Human Pose Estimation.
Here's a demo notebook - note that the API might change:
To do: