Apple has recently unveiled Autoregressive Image Models (AIM), a groundbreaking set of vision models that have been pre-trained with an autoregressive objective. These models represent a significant advancement in the field of large-scale vision models, taking inspiration from their textual counterparts, Large Language Models (LLMs), and demonstrating similar scalability.
AIM offers a scalable approach to pre-training vision models without supervision. During pre-training, the researchers used a generative autoregressive objective and introduced technical refinements to adapt it for downstream transfer tasks. They found that the quality of the learned visual features scales with both model capacity and data quantity, and that the value of the pre-training objective correlates with the model's performance on downstream tasks.
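To make the objective concrete, here is a minimal NumPy sketch of autoregressive patch prediction: an image is split into patches in raster order, and for each prefix of patches a model predicts the next patch under a mean-squared-error loss. This is an illustrative simplification, not AIM's actual implementation; the names `patchify` and `predict`, and the choice of plain MSE on raw pixels, are assumptions for the sketch.

```python
import numpy as np

def patchify(img, p):
    # Split an (H, W, C) image into non-overlapping p x p patches,
    # flattened and ordered in raster (row-major) order.
    H, W, C = img.shape
    patches = img.reshape(H // p, p, W // p, p, C).transpose(0, 2, 1, 3, 4)
    return patches.reshape(-1, p * p * C)

def autoregressive_loss(patches, predict):
    # For each prefix patches[0..t], ask the model for the next patch
    # and accumulate the mean-squared error against patches[t + 1].
    loss = 0.0
    for t in range(len(patches) - 1):
        pred = predict(patches[: t + 1])  # model only sees earlier patches
        loss += np.mean((pred - patches[t + 1]) ** 2)
    return loss / (len(patches) - 1)
```

In AIM the `predict` function is a large transformer with causal attention, so all next-patch predictions are computed in a single forward pass rather than in a Python loop; the loop above just makes the prefix-to-next-patch structure of the objective explicit.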
Check out the GitHub repository here.
In a practical demonstration, the team pre-trained a 7-billion-parameter AIM on 2 billion images, achieving an impressive 84.0% accuracy on ImageNet-1k with a frozen trunk. Notably, even at this substantial scale, there were no signs of performance saturation. AIM's pre-training methodology closely mirrors that of LLMs, eliminating the need for image-specific strategies to stabilize training at scale.
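"Frozen trunk" evaluation means the pre-trained encoder's weights are kept fixed and only a lightweight classifier head is trained on its features. The following NumPy sketch shows the idea with a simple linear head, one common way to probe frozen features; the toy `frozen_trunk` encoder, function names, and shapes are all hypothetical stand-ins, not AIM's evaluation code.

```python
import numpy as np

def frozen_trunk(x):
    # Stand-in for a frozen pre-trained encoder: in AIM this would be
    # the pre-trained transformer with its weights held constant.
    W = np.full((8, 4), 0.1)  # fixed weights, never updated
    return np.tanh(x @ W)

def train_linear_probe(X, y, n_classes, lr=0.5, epochs=300):
    # Train only the classifier head; the trunk receives no gradient updates.
    feats = frozen_trunk(X)
    Wc = np.zeros((feats.shape[1], n_classes))
    onehot = np.eye(n_classes)[y]
    for _ in range(epochs):
        logits = feats @ Wc
        p = np.exp(logits - logits.max(axis=1, keepdims=True))
        p /= p.sum(axis=1, keepdims=True)
        Wc -= lr * feats.T @ (p - onehot) / len(X)  # softmax cross-entropy gradient
    return Wc

def probe_accuracy(X, y, Wc):
    preds = (frozen_trunk(X) @ Wc).argmax(axis=1)
    return (preds == y).mean()
```

Because the trunk never changes, the probe's accuracy measures how much linearly usable information the pre-training objective packed into the features, which is why this protocol is a standard yardstick for comparing pre-trained vision models.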
Key characteristics of AIM include its ability to scale to 7 billion parameters with a vanilla transformer implementation, without stability-inducing techniques or extensive hyperparameter tuning. AIM's performance on the pre-training task correlates strongly with downstream performance, surpassing state-of-the-art methods such as MAE and narrowing the gap between generative and joint-embedding pre-training approaches.
Furthermore, the researchers observed no signs of saturation as models scaled, suggesting the potential for further performance improvements with larger models trained over extended schedules. Apple’s Autoregressive Image Models represent a significant step forward in the development of large-scale vision models, offering new possibilities for the future of computer vision.