feat: implement RAE autoencoder #13046
Conversation
@bytetriper if you could take a look?
Nice work @Ando233, checking.

Off the bat, let's sort out these things and then re-look.
Agree with @kashif. Also, if possible, we can bake all the params into the config so we can enable `.from_pretrained()`, which is more elegant and aligned with diffusers usage. I can help convert our released ckpt to HF format afterwards.
@Ando233 we're happy to provide assistance if needed.
@Ando233 the one remaining thing is the use of the
@bytetriper could you kindly try to run the conversion scripts and upload the diffusers-style weights to your Hugging Face Hub for the checkpoints you have?
Co-authored-by: dg845 <58458699+dg845@users.noreply.github.com>
@dg845 resolved the issues, thanks!
```diff
  self.decoder_embed = nn.Linear(hidden_size, decoder_hidden_size, bias=True)
- self.register_buffer("decoder_pos_embed", torch.zeros(1, num_patches + 1, decoder_hidden_size))
+ self.register_buffer(
+     "decoder_pos_embed", torch.zeros(1, num_patches + 1, decoder_hidden_size), persistent=False
+ )
```
What happens if we directly initialize this to what we're doing in the `initialize_weights()` function? Could we get rid of the explicit device placement in `return x_rec.to(device=z.device)` then?
I guess a simpler solution would be to directly assign the `pos_embed` value (which we are initializing through `initialize_weights()`) and just persist it in the state dict. That way, we can skip explicit device placements like `return x_rec.to(device=z.device)`.
This would require opening PRs to the RAE repos on the Hub, though.
@dg845 WDYT?
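For context on the trade-off being discussed (a minimal sketch with made-up buffer names, not the RAE code): whether a buffer is persistent decides if it must be present in the checkpoint at all, which is why persisting it would require changes to the Hub repos.

```python
import torch
import torch.nn as nn

class Toy(nn.Module):
    def __init__(self):
        super().__init__()
        # Persistent (the default): serialized into the state dict, so Hub
        # checkpoints would have to ship this tensor.
        self.register_buffer("pos_saved", torch.zeros(1, 4))
        # Non-persistent: rebuilt in __init__ and absent from the state dict,
        # so existing checkpoints load without a missing-key error.
        self.register_buffer("pos_rebuilt", torch.ones(1, 4), persistent=False)

keys = set(Toy().state_dict().keys())
print(sorted(keys))  # ['pos_saved']
```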
I think something like

```python
class RAEDecoder(nn.Module):
    ...
    def __init__(...):
        ...
        grid_size = int(num_patches**0.5)
        pos_embed = get_2d_sincos_pos_embed(
            decoder_hidden_size,
            grid_size,
            cls_token=True,
            extra_tokens=1,
            output_type="pt",
        )
        self.register_buffer("decoder_pos_embed", pos_embed.unsqueeze(0), persistent=False)
```
...is generally how we would implement this, see e.g. here:
`diffusers/src/diffusers/models/transformers/transformer_wan.py`, lines 392 to 393 at `ab6040a`
I think this should handle device placement automatically and also avoid the need to change the Hub repo (although maybe the conversion script might need to be changed?).
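As a toy illustration (hypothetical module, not the actual decoder) of why computing the buffer in `__init__` handles device placement: registered buffers move with the module under `.to(...)`, so the forward pass never needs an explicit `x_rec.to(device=z.device)`.

```python
import torch
import torch.nn as nn

class Dec(nn.Module):
    def __init__(self):
        super().__init__()
        # Computed once at init; excluded from the state dict.
        self.register_buffer("decoder_pos_embed", torch.randn(1, 5, 8), persistent=False)

    def forward(self, x):
        # No manual device placement needed: the buffer already lives on
        # whatever device the module was moved to.
        return x + self.decoder_pos_embed

m = Dec()  # m.to("cuda") would carry the buffer along with any parameters
out = m(torch.zeros(1, 5, 8))
print(out.shape)  # torch.Size([1, 5, 8])
```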
For test_model_parallelism specifically, the error I encountered earlier in #13046 (comment) may be the result of an accelerate bug where the decoder_pos_embed buffer doesn't end up on the device_map, so AutoencoderRAE.from_pretrained(..., device_map="auto") doesn't know where to put it and gives that error. I've opened an issue for this at huggingface/accelerate#3956.
Thanks! Then let's prefer this solution.
Thanks a lot @kashif for shipping RAEs!
What does this PR do?
This PR adds a new representation autoencoder implementation, AutoencoderRAE, to diffusers.
- Implements `diffusers.models.autoencoders.autoencoder_rae.AutoencoderRAE` with a frozen pretrained vision encoder (DINOv2 / SigLIP2 / ViT-MAE) and a ViT-MAE-style decoder.
- The decoder implementation is aligned with the RAE-main `GeneralDecoder` parameter structure, enabling loading of existing trained decoder checkpoints (e.g. `model.pt`) without key mismatches when encoder/decoder settings are consistent.
- Adds unit/integration tests under `diffusers/tests/models/autoencoders/test_models_autoencoder_rae.py`.
- Registers exports so users can import directly via `from diffusers import AutoencoderRAE`.
Fixes #13000
Before submitting
documentation guidelines, and here are tips on formatting docstrings.
Usage
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.