Autoregressive Transformers scale and learn much better and faster when trained to predict the "next resolution"/"next scale", i.e., starting from a very small token map and progressively predicting larger, higher-resolution ones, than when trained to predict the next image token/patch in raster order. This is the core idea of VAR (Visual Autoregressive Modeling: Scalable Image Generation via Next-Scale Prediction - https://arxiv.org/abs/2404.02905).
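
To make the idea concrete, here is a minimal, self-contained PyTorch sketch of next-scale prediction. Everything in it is an illustrative assumption rather than the paper's actual code: the scale schedule SCALES, the model sizes, the learned start token, and the block_causal_mask helper are made up for exposition, and the multi-scale VQ tokenizer that would produce the discrete token maps is stubbed with random integers.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

VOCAB = 4096           # assumed codebook size of the (omitted) VQ tokenizer
DIM = 256              # assumed transformer width
SCALES = [1, 2, 4, 8]  # assumed side lengths of the token maps, coarse -> fine


def block_causal_mask(block_sizes):
    """Attention mask letting each scale attend to itself and all coarser
    scales, but never to finer (future) ones."""
    n = sum(block_sizes)
    mask = torch.full((n, n), float("-inf"))
    i = 0
    for b in block_sizes:
        mask[i:i + b, :i + b] = 0.0
        i += b
    return mask


class NextScaleTransformer(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, DIM)
        self.start = nn.Parameter(torch.zeros(1, 1, DIM))  # query for the 1x1 map
        self.pos = nn.Parameter(torch.zeros(1, sum(s * s for s in SCALES), DIM))
        layer = nn.TransformerEncoderLayer(DIM, nhead=8, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=4)
        self.head = nn.Linear(DIM, VOCAB)

    def forward(self, scale_tokens):
        """scale_tokens: one (B, s*s) LongTensor per scale. The queries for
        scale k are the scale k-1 token embeddings upsampled to scale k, so
        every token within a scale is predicted in parallel (no raster order)."""
        B = scale_tokens[0].shape[0]
        queries = [self.start.expand(B, -1, -1)]
        for s_prev, s_next, toks in zip(SCALES[:-1], SCALES[1:], scale_tokens[:-1]):
            e = self.embed(toks).transpose(1, 2).reshape(B, DIM, s_prev, s_prev)
            e = F.interpolate(e, size=(s_next, s_next), mode="nearest")
            queries.append(e.flatten(2).transpose(1, 2))
        x = torch.cat(queries, dim=1) + self.pos
        h = self.blocks(x, mask=block_causal_mask([s * s for s in SCALES]))
        return self.head(h)  # logits aligned with torch.cat(scale_tokens, dim=1)


model = NextScaleTransformer()
tokens = [torch.randint(0, VOCAB, (2, s * s)) for s in SCALES]  # fake VQ tokens
logits = model(tokens)
loss = F.cross_entropy(logits.reshape(-1, VOCAB),
                       torch.cat(tokens, dim=1).reshape(-1))
print(loss.item())  # one cross-entropy over all scales at once
```

At inference time the same structure means generation proceeds scale by scale, with each whole token map sampled in parallel from its block of logits, which is a large part of why this is faster than token-by-token raster decoding. The real VAR pipeline also involves a multi-scale residual VQ tokenizer and conditioning, both omitted here.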

Related: VAR-CLIP: Text-to-Image Generator with Visual Auto-Regressive Modeling - https://arxiv.org/abs/2408.01181

STAR: Scale-wise Text-to-image generation via Auto-Regressive representations - https://arxiv.org/abs/2406.10797