August 10, 2022
Type ‘move further down the corridor’ and the model repositions the scene.
Apple is throwing its hat into the AI text-to-image ring with GAUDI, an AI model that can generate 3D scenes from text prompts and redraw them from any angle.
Named after Antoni Gaudí, the Spanish architect known for his whimsical designs, Apple’s AI model uses a camera pose decoder to predict possible camera positions in a scene. With those predicted poses, the model can render the 3D scene from essentially any angle.
In a paper, Apple's team showcased GAUDI reconstructing views from interior scans of rooms at a quality level the researchers suggested was on par with existing 3D scene generation techniques.
GAUDI can also generate new camera movements through 3D indoor scenes from text, such as a user prompt to ‘go through the corridor.’
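For readers unfamiliar with the term, a camera movement through a scene can be thought of as a sequence of camera poses, each giving a position and a viewing direction. The Python sketch below is purely illustrative and is not taken from Apple's code: it hand-writes a straight walk down a corridor, whereas GAUDI learns to predict poses with its decoder. The function names `look_at_pose` and `corridor_trajectory`, the coordinates, and the OpenGL-style axis convention are all assumptions made for the example.

```python
import math


def normalize(v):
    """Scale a 3-vector to unit length."""
    n = math.sqrt(sum(c * c for c in v))
    return tuple(c / n for c in v)


def cross(a, b):
    """Cross product of two 3-vectors."""
    return (
        a[1] * b[2] - a[2] * b[1],
        a[2] * b[0] - a[0] * b[2],
        a[0] * b[1] - a[1] * b[0],
    )


def look_at_pose(position, target, up=(0.0, 1.0, 0.0)):
    """Build a 4x4 camera-to-world matrix as nested lists.

    Assumed convention (OpenGL-style): the camera looks along its -Z axis,
    +X points right and +Y points up. The viewing direction must not be
    parallel to `up`, otherwise the cross product is zero.
    """
    forward = normalize(tuple(t - p for t, p in zip(target, position)))
    right = normalize(cross(forward, up))
    true_up = cross(right, forward)
    back = tuple(-c for c in forward)
    # Columns are: right, up, back, camera position.
    return [
        [right[0], true_up[0], back[0], position[0]],
        [right[1], true_up[1], back[1], position[1]],
        [right[2], true_up[2], back[2], position[2]],
        [0.0, 0.0, 0.0, 1.0],
    ]


def corridor_trajectory(start, end, steps):
    """Hypothetical helper: move in a straight line from start to end,
    always looking in the direction of travel. Requires steps >= 2."""
    direction = tuple(e - s for s, e in zip(start, end))
    poses = []
    for i in range(steps):
        t = i / (steps - 1)
        position = tuple(s + t * d for s, d in zip(start, direction))
        target = tuple(p + d for p, d in zip(position, direction))
        poses.append(look_at_pose(position, target))
    return poses


# Five poses at eye height, walking 5 units down the -Z axis.
poses = corridor_trajectory((0.0, 1.6, 0.0), (0.0, 1.6, -5.0), steps=5)
```

A renderer would then draw one image per pose. In a system like the one described, the pose sequence would come from the model rather than from a hand-written path like this one.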
According to their paper, Apple’s researchers believe GAUDI “generalizes” previous 3D generation work that focuses on single objects by removing the assumption that the camera pose distribution can be shared across samples.
"We show that GAUDI obtains state-of-the-art performance in the unconditional generative setting across multiple datasets and allows for conditional generation of 3D scenes given conditioning variables like sparse image observations or text that describes the scene," the authors wrote.
Co-author Miguel Ángel Bautista, a senior research scientist at Apple, said in a tweet that GAUDI tackles “the problem of learning a generative model of 3D scenes parametrized as radiance fields.”
“Very exciting times ahead for the interplay of powerful generative models and 3D data,” he added.
Apple published GAUDI’s repository to GitHub.
GAUDI, meet GauGAN and GFP-GAN
The model’s application is similar to that of GauGAN2, developed by Nvidia. GauGAN2 generates images from text: users type a phrase like ‘winter’ and the model produces images that match the description.
The release of GAUDI comes after researchers from Chinese tech company Tencent published GFP-GAN, a model that can restore damaged and low-resolution pictures.
GFP-GAN combines a proprietary model with a pre-trained StyleGAN2 model from Nvidia to fill in the missing elements of an old image in seconds.