Generate AI Images in Seconds with New Model from Hugging Face
Hugging Face’s aMUSEd model creates images in seconds, far faster than rivals like Stable Diffusion
At a Glance
- Hugging Face introduced a new AI model called aMUSEd that can generate images within seconds.
- It uses a Masked Image Model architecture rather than latent diffusion, which reduces the number of inference steps.
One of the biggest problems with AI image generation models is speed: It can take several minutes to create an image using ChatGPT or Stable Diffusion. Even Meta CEO Mark Zuckerberg complained about image generation speeds at last year’s Meta Connect.
The team at Hugging Face is trying to speed things up with a new model, aMUSEd, that can create images in mere seconds.
This lightweight text-to-image model is based on a Google model called MUSE. At around 800 million parameters, aMUSEd is small enough to be deployed in on-device applications such as mobile phones.
Its speed comes from how it is built: aMUSEd uses a Masked Image Model (MIM) architecture rather than the latent diffusion approach found in Stable Diffusion and other image generation models. The Hugging Face team said MIM reduces the number of inference steps, which improves both the model’s generation speed and its interpretability. Its small size also helps it run fast.
You can try aMUSEd out for yourself via the demo on Hugging Face. The model is currently available as a research preview under an OpenRAIL license, meaning it can be freely experimented with and modified, and the license is permissive enough for commercial adaptation.
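If you would rather run it locally, the checkpoints can also be loaded through the diffusers library. The snippet below is a minimal sketch, assuming the "amused/amused-512" checkpoint on the Hugging Face Hub, the AmusedPipeline class shipped in recent diffusers releases, and a CUDA GPU.

```python
# Minimal sketch: text-to-image with aMUSEd via diffusers (assumes the
# "amused/amused-512" Hub checkpoint and a recent diffusers release with
# AmusedPipeline; adjust the device if no CUDA GPU is available).
import torch
from diffusers import AmusedPipeline

pipe = AmusedPipeline.from_pretrained(
    "amused/amused-512", torch_dtype=torch.float16
).to("cuda")

prompt = "a serious capybara at work, wearing a suit"
image = pipe(prompt, generator=torch.Generator("cuda").manual_seed(0)).images[0]
image.save("capybara.png")
```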
The team behind aMUSEd openly acknowledges that the quality of the images it generates can still be improved, but chose to release the model to "encourage the community to explore non-diffusion frameworks such as MIM for image generation."
Below are example images from Hugging Face, generated by aMUSEd in just 2.5 seconds, with the following prompts: 'A Pikachu fine dining with a view to the Eiffel Tower' (left) and 'a serious capybara at work, wearing a suit' (right).
Credit: Hugging Face
The model can also do image inpainting zero-shot, which Stable Diffusion XL cannot do, according to the Hugging Face team.
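If the inpainting variant is exposed in diffusers the same way as the text-to-image pipeline (it appears as AmusedInpaintPipeline in recent releases), zero-shot inpainting looks roughly like the sketch below; the image and mask URLs are placeholders, not real assets.

```python
# Rough sketch of zero-shot inpainting (assumes AmusedInpaintPipeline in
# diffusers; the source image and mask URLs below are placeholders).
import torch
from diffusers import AmusedInpaintPipeline
from diffusers.utils import load_image

pipe = AmusedInpaintPipeline.from_pretrained(
    "amused/amused-512", torch_dtype=torch.float16
).to("cuda")

image = load_image("https://example.com/photo.png").resize((512, 512)).convert("RGB")
mask = load_image("https://example.com/mask.png").resize((512, 512)).convert("L")

result = pipe(
    "a vase of flowers on the table",
    image=image,
    mask_image=mask,
    generator=torch.Generator("cuda").manual_seed(0),
).images[0]
result.save("inpainted.png")
```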
How to make AI images in seconds
The MIM method in aMUSEd is similar to techniques used in language modeling: certain parts of the data are hidden (or masked), and the model learns to predict these hidden parts. In aMUSEd’s case, the data is images instead of text.
During training, the Hugging Face team converted input images into a sequence of tokens using a tool called VQGAN (Vector Quantized Generative Adversarial Network). Those image tokens were then partially masked, and the model was trained to predict the masked tokens, conditioned on the unmasked tokens and on the text prompt processed by a text encoder.
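As a rough illustration of that masking step, the idea is to replace a random subset of the VQGAN token codes with a special mask token and ask the transformer to recover them. The function, vocabulary size, and mask ratio below are invented for the example and are not taken from the actual training code.

```python
import torch

# Illustrative only: mask a random fraction of VQGAN image tokens.
# Vocabulary size, mask ratio, and names are invented for this example.
def mask_image_tokens(tokens: torch.LongTensor, mask_token_id: int, mask_ratio: float = 0.5):
    mask = torch.rand(tokens.shape) < mask_ratio
    masked_tokens = tokens.clone()
    masked_tokens[mask] = mask_token_id
    return masked_tokens, mask

tokens = torch.randint(0, 8192, (1, 256))            # codes from a VQGAN encoder
masked_tokens, mask = mask_image_tokens(tokens, mask_token_id=8192)

# During training, the transformer predicts the original codes at the masked
# positions, conditioned on the unmasked codes and the text-encoder embedding:
#   logits = transformer(masked_tokens, text_embeds)   # (1, 256, vocab_size)
#   loss = cross_entropy(logits[mask], tokens[mask])
```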
During inference, the text prompt is converted into a format the model understands using the same text encoder. aMUSEd then begins with a fully masked set of tokens and progressively refines the image: at each refinement step, it predicts values for the masked tokens, keeps the predictions it is most confident about, and continues refining the rest. After a set number of steps, the model’s predictions are passed through the VQGAN decoder to produce the final image.
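The loop below sketches that progressive unmasking in simplified form. The real aMUSEd scheduler uses a more elaborate confidence and masking schedule; `transformer` and `text_embeds` are placeholders for the actual model components.

```python
import torch

# Simplified sketch of the progressive unmasking loop described above.
def generate_tokens(transformer, text_embeds, seq_len=256, mask_token_id=8192, num_steps=12):
    tokens = torch.full((1, seq_len), mask_token_id)           # start fully masked
    for step in range(1, num_steps + 1):
        logits = transformer(tokens, text_embeds)              # (1, seq_len, vocab_size)
        confidence, prediction = logits.softmax(-1).max(-1)
        masked = tokens == mask_token_id
        # Commit enough tokens so that step/num_steps of the image is unmasked.
        num_to_commit = int(seq_len * step / num_steps) - int((~masked).sum())
        if num_to_commit <= 0:
            continue
        confidence[~masked] = -1.0                             # never overwrite kept tokens
        keep = confidence.topk(num_to_commit, dim=-1).indices  # most confident positions
        tokens.scatter_(1, keep, prediction.gather(1, keep))
    return tokens  # finally decoded to pixels by the VQGAN decoder

# Toy stand-ins so the sketch runs; a real run would use aMUSEd's transformer.
toy = lambda toks, txt: torch.randn(toks.shape[0], toks.shape[1], 8192)
out = generate_tokens(toy, text_embeds=None)
```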
The prediction process in action. Credit: Hugging Face
aMUSEd can also be fine-tuned on custom datasets. Hugging Face showed the model fine-tuned with the 8-bit Adam optimizer and float16 precision, a process that used just under 11 GB of GPU VRAM. A training script for fine-tuning the model can be accessed here.
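The setup below is a generic sketch of that memory-saving combination (8-bit Adam from bitsandbytes plus float16 autocast), using a tiny stand-in model and random data; it is not the Hugging Face training script itself.

```python
import torch
import torch.nn as nn
import bitsandbytes as bnb  # provides the 8-bit Adam optimizer

# Generic sketch of the 8-bit Adam + float16 recipe; the nn.Linear and random
# data are stand-ins for aMUSEd's transformer and a real fine-tuning dataset.
model = nn.Linear(512, 512).cuda()
optimizer = bnb.optim.Adam8bit(model.parameters(), lr=1e-4)    # 8-bit optimizer states
scaler = torch.cuda.amp.GradScaler()

for step in range(10):
    batch = torch.randn(8, 512, device="cuda")
    with torch.autocast("cuda", dtype=torch.float16):          # half-precision forward pass
        loss = model(batch).pow(2).mean()
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()
    optimizer.zero_grad()
```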