Jazzing up a photo? Google’s AI can do it with just text

Type ‘dog wearing red hat’ and, voilà, the dog in the photo gets a red hat

Ben Wodecki

October 25, 2022

Researchers from Google are introducing UniTune: a generative AI method capable of editing all types of images just through text.

For example, say a user wants to make a photo of a dog look more festive. Type ‘dog wearing a red hat’ into UniTune and a red hat appears on the dog, without the user needing to manipulate any photo editing tools, look for images or add any designs.

UniTune is able to do it because it harnesses text-to-image generators to make objects 'magically' appear as well as enable other design changes, according to the paper, UniTune: Text-Driven Image Editing by Fine Tuning an Image Generation Model on a Single Image.

UniTune takes an image and a textual edit description as inputs. It uses text, which the researchers call an “intuitive interface for art direction,” to make changes while maintaining both visual fidelity and semantic details. No additional inputs, such as sketches or masks, are needed.

Google tested the AI method on its own text-to-image Imagen model, but researchers expect it to work with other large-scale models as well.

Figure 1:

While text-to-image generators such as DALL-E, Imagen and Stable Diffusion excel at creating images, when it comes to editing, these models usually require the user to specify masks and often struggle with edits that depend on the masked portion of the image, the authors said.

Using UniTune, these models can edit arbitrary images in complex cross-domain scenes.

The ability to place objects in a scene or make sweeping edits with text alone “makes UniTune useful by casual users,” who could, for example, speak an edit into a mobile device, the paper stated. “UniTune, like other image generation models, has a great potential to complement and augment human creativity by creating new tools for professionals and empowering non-professionals with the ability to edit images more easily and in a more intuitive manner.”

Google’s unveiling of UniTune follows swiftly on the heels of DreamFusion, a model capable of generating 3D models from text inputs.

How it works

Today, creative folks can use several AI-powered editing tools. They include Luminar (formerly LuminarAI), which is designed for photo editing, and Lunacy, an app that can remove backgrounds and generate text. The likes of Apple’s GAUDI and Nvidia’s GauGAN, meanwhile, can generate scenes from text inputs.

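UniTune works by fine-tuning a text-to-image model on the single image to be edited, paired with a short sequence of tokens, and then sampling from the fine-tuned model using the edit description as the text condition.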

“Using Classifier Free Guidance, the fine-tuned model correctly takes the conditioning into account. When a higher visual fidelity is needed, we also use SDEdit to maintain visual details in the original image,” the authors wrote. “The user is presented with images that use combinations of the parameters mentioned (number of training steps, Classifier Free Guidance, SDEdit) allowing them to pick the most suitable version.”
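To give a rough sense of what those parameters do, here is a minimal Python sketch. It is not code from the paper. The first function shows the standard classifier-free guidance blend, in which a model's text-conditioned noise prediction is pushed away from its unconditional one by a guidance weight. The second builds the kind of parameter grid the authors describe showing to users. The function names, the `sdedit_strength` parameter and every number below are illustrative assumptions, not values reported by Google.

```python
from itertools import product


def classifier_free_guidance(eps_uncond, eps_cond, weight):
    """Blend unconditional and text-conditioned noise predictions.

    weight = 0 ignores the text, weight = 1 uses the conditioned
    prediction as-is, and weight > 1 pushes the sample further
    toward the text description.
    """
    return [u + weight * (c - u) for u, c in zip(eps_uncond, eps_cond)]


def candidate_settings(training_steps, guidance_weights, sdedit_strengths):
    """List every combination of the three knobs named in the quote.

    Each combination would yield one candidate edit for the user to
    compare. All names and values here are hypothetical.
    """
    return [
        {"training_steps": s, "guidance_weight": w, "sdedit_strength": t}
        for s, w, t in product(training_steps, guidance_weights, sdedit_strengths)
    ]


# Toy noise predictions standing in for a real model's output.
guided = classifier_free_guidance([0.0, 1.0], [1.0, 3.0], weight=2.0)

# None means "no SDEdit"; 0.8 is a made-up strength for illustration.
grid = candidate_settings([64, 128], [8.0, 16.0], [None, 0.8])
print(guided, len(grid))
```

In this sketch, a higher guidance weight favors the text edit, while turning on SDEdit (which starts sampling from a noised copy of the original photo rather than from pure noise) favors fidelity to the original. That trade-off is why presenting several combinations and letting the user pick is a reasonable design.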

About the Authors

Ben Wodecki

Assistant Editor
