diff --git a/docs/source/en/model_doc/swinv2.md b/docs/source/en/model_doc/swinv2.md
index a709af9712e3..0f71023e382f 100644
--- a/docs/source/en/model_doc/swinv2.md
+++ b/docs/source/en/model_doc/swinv2.md
@@ -14,37 +14,74 @@ rendered properly in your Markdown viewer.
 
 -->
 
+<div style="float: right;">
+    <div class="flex flex-wrap space-x-1">
+        <img alt="PyTorch" src="https://img.shields.io/badge/PyTorch-DE3412?style=flat&logo=pytorch&logoColor=white">
+    </div>
+</div>
+
 # Swin Transformer V2
 
-<div class="flex flex-wrap space-x-1">
-<img alt="PyTorch" src="https://img.shields.io/badge/PyTorch-DE3412?style=flat&logo=pytorch&logoColor=white">
-</div>
+[Swin Transformer V2](https://huggingface.co/papers/2111.09883) is a 3B parameter model that focuses on how to scale a vision model to billions of parameters. It introduces techniques like residual-post-norm combined with cosine attention for improved training stability, log-spaced continuous position bias to better handle varying image resolutions between pre-training and fine-tuning, and a new pre-training method (SimMIM) to reduce the need for large amounts of labeled data. These improvements enable efficiently training very large models (up to 3 billion parameters) capable of processing high-resolution images.
+
+You can find official Swin Transformer V2 checkpoints under the [Microsoft](https://huggingface.co/microsoft?search_models=swinv2) organization.
+
+> [!TIP]
+> Click on the Swin Transformer V2 models in the right sidebar for more examples of how to apply Swin Transformer V2 to vision tasks.
+
+<hfoptions id="usage">
+<hfoption id="Pipeline">
+
-## Overview
-
-The Swin Transformer V2 model was proposed in [Swin Transformer V2: Scaling Up Capacity and Resolution](https://arxiv.org/abs/2111.09883) by Ze Liu, Han Hu, Yutong Lin, Zhuliang Yao, Zhenda Xie, Yixuan Wei, Jia Ning, Yue Cao, Zheng Zhang, Li Dong, Furu Wei, Baining Guo.
-
+```py
+import torch
+from transformers import pipeline
+
+# image classification pipeline with a small SwinV2 checkpoint in half precision
+pipeline = pipeline(
+    task="image-classification",
+    model="microsoft/swinv2-tiny-patch4-window8-256",
+    torch_dtype=torch.float16,
+    device=0
+)
+pipeline(images="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg")
+```
+
-The abstract from the paper is the following:
-
-*Large-scale NLP models have been shown to significantly improve the performance on language tasks with no signs of saturation. They also demonstrate amazing few-shot capabilities like that of human beings. This paper aims to explore large-scale models in computer vision. We tackle three major issues in training and application of large vision models, including training instability, resolution gaps between pre-training and fine-tuning, and hunger on labelled data. Three main techniques are proposed: 1) a residual-post-norm method combined with cosine attention to improve training stability; 2) A log-spaced continuous position bias method to effectively transfer models pre-trained using low-resolution images to downstream tasks with high-resolution inputs; 3) A self-supervised pre-training method, SimMIM, to reduce the needs of vast labeled images. Through these techniques, this paper successfully trained a 3 billion-parameter Swin Transformer V2 model, which is the largest dense vision model to date, and makes it capable of training with images of up to 1,536×1,536 resolution. It set new performance records on 4 representative vision tasks, including ImageNet-V2 image classification, COCO object detection, ADE20K semantic segmentation, and Kinetics-400 video action classification. Also note our training is much more efficient than that in Google's billion-level visual models, which consumes 40 times less labelled data and 40 times less training time.*
-
-This model was contributed by [nandwalritik](https://huggingface.co/nandwalritik).
-The original code can be found [here](https://github.com/microsoft/Swin-Transformer).
-
+</hfoption>
+<hfoption id="AutoModel">
+
+```py
+import torch
+import requests
+from PIL import Image
+from transformers import AutoModelForImageClassification, AutoImageProcessor
+
-## Resources
-
-A list of official Hugging Face and community (indicated by 🌎) resources to help you get started with Swin Transformer v2.
-
-<PipelineTag pipeline="image-classification"/>
-
+# load the image processor and the classification model
+image_processor = AutoImageProcessor.from_pretrained(
+    "microsoft/swinv2-tiny-patch4-window8-256",
+)
+model = AutoModelForImageClassification.from_pretrained(
+    "microsoft/swinv2-tiny-patch4-window8-256",
+    device_map="auto"
+)
+
+# download an example image and preprocess it
+url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg"
+image = Image.open(requests.get(url, stream=True).raw)
+inputs = image_processor(image, return_tensors="pt").to(model.device)
+
+with torch.no_grad():
+    logits = model(**inputs).logits
+
-- [`Swinv2ForImageClassification`] is supported by this [example script](https://github.com/huggingface/transformers/tree/main/examples/pytorch/image-classification) and [notebook](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/image_classification.ipynb).
-- See also: [Image classification task guide](../tasks/image_classification)
-
+# map the top logit to a human-readable label
+predicted_class_id = logits.argmax(dim=-1).item()
+predicted_class_label = model.config.id2label[predicted_class_id]
+print(f"The predicted class label is: {predicted_class_label}")
+```
+
-Besides that:
-
+</hfoption>
+</hfoptions>
+
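+The SimMIM pre-training method mentioned above is available through [`Swinv2ForMaskedImageModeling`], which reconstructs pixel values from an image with some of its patches masked out. The snippet below is a minimal sketch of that interface; the random patch mask and the `microsoft/swinv2-tiny-patch4-window8-256` checkpoint are only illustrative, not an actual pre-training recipe.
+
+```py
+import torch
+import requests
+from PIL import Image
+from transformers import AutoImageProcessor, Swinv2ForMaskedImageModeling
+
+image_processor = AutoImageProcessor.from_pretrained("microsoft/swinv2-tiny-patch4-window8-256")
+model = Swinv2ForMaskedImageModeling.from_pretrained("microsoft/swinv2-tiny-patch4-window8-256")
+
+url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg"
+image = Image.open(requests.get(url, stream=True).raw)
+pixel_values = image_processor(image, return_tensors="pt").pixel_values
+
+# randomly mask a subset of the image patches (illustrative only)
+num_patches = (model.config.image_size // model.config.patch_size) ** 2
+bool_masked_pos = torch.randint(low=0, high=2, size=(1, num_patches)).bool()
+
+with torch.no_grad():
+    outputs = model(pixel_values, bool_masked_pos=bool_masked_pos)
+
+print(outputs.loss, outputs.reconstruction.shape)
+```
+
+For a full SimMIM-style pre-training loop, see the [image pretraining example script](https://github.com/huggingface/transformers/tree/main/examples/pytorch/image-pretraining).
+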
-- [`Swinv2ForMaskedImageModeling`] is supported by this [example script](https://github.com/huggingface/transformers/tree/main/examples/pytorch/image-pretraining).
-
-If you're interested in submitting a resource to be included here, please feel free to open a Pull Request and we'll review it! The resource should ideally demonstrate something new instead of duplicating an existing resource.
-
+## Notes
+
+- Swin Transformer V2 pads the inputs to support any input height and width divisible by `32`.
+- Swin Transformer V2 can be used as a [backbone](../backbones). When `output_hidden_states=True`, it outputs both `hidden_states` and `reshaped_hidden_states`. The `reshaped_hidden_states` have a shape of `(batch_size, num_channels, height, width)` rather than `(batch_size, sequence_length, num_channels)`.
 
 ## Swinv2Config