Video Generation via Tokens #5

@ClashLuke

Description

If we tokenise the frames of a video with a VQGAN, we can autoregressively predict the next token with our current language model. More specifically, with our current 2-million-token context, we could fit 2048 frames (~34 minutes at 1 FPS) using current state-of-the-art image quantisation models.
This issue is about implementing such a model end-to-end and shipping a working demo.
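A quick sanity check on the numbers above. This is a sketch under assumptions the issue does not state: each frame quantises to a 32×32 VQGAN latent grid (1024 tokens per frame), and "2 million tokens" is read as 2**21:

```python
# Context-budget arithmetic for token-based video generation.
# Assumptions (not stated in the issue): each frame quantises to a
# 32x32 VQGAN latent grid (1024 tokens), and the ~2M-token context
# is exactly 2**21 tokens.

CONTEXT_TOKENS = 2**21        # ~2 million-token context window
TOKENS_PER_FRAME = 32 * 32    # assumed VQGAN latent grid size
FPS = 1                       # frame rate used in the issue

frames = CONTEXT_TOKENS // TOKENS_PER_FRAME   # frames that fit in context
minutes = frames / FPS / 60                   # video length at 1 FPS

print(f"{frames} frames ~= {minutes:.1f} minutes")  # 2048 frames ~= 34.1 minutes
```

Under these assumptions, the budget lands exactly on the 2048 frames (~34 minutes at 1 FPS) quoted above; a coarser latent grid per frame would trade image fidelity for proportionally more frames in context.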

Metadata

Assignees

No one assigned

    Labels

    ML — Requires machine-learning knowledge (can be built up on the fly)
    downstream — Changes code wrapping the core model
    research — Creative project that might fail but could give high returns
