Type ‘dog wearing red hat’ − and voila − dog in photo gets a red hat

Ben Wodecki, Jr. Editor

October 25, 2022

Researchers from Google are introducing UniTune: a generative AI method capable of editing all types of images just through text.

For example, take a photo of a dog that a user wants to make more festive. Type ‘dog wearing a red hat’ in UniTune and, just like that, a red hat appears on the dog, without the user needing to manipulate any photo editing tools, look for images or add any designs.

UniTune is able to do it because it harnesses text-to-image generators to make objects 'magically' appear as well as enable other design changes, according to the paper, UniTune: Text-Driven Image Editing by Fine Tuning an Image Generation Model on a Single Image.

UniTune takes an image and a textual edit description as inputs. It then uses text as an “intuitive interface for art direction” to make changes while maintaining both visual fidelity and semantic details, the researchers wrote. No additional inputs, such as sketches or masks, are needed.

Google tested the AI method on its own text-to-image Imagen model, but researchers expect it to work with other large-scale models as well.

Figure 1: 2225.jpg

While text-to-image generators such as DALL-E, Imagen and Stable Diffusion excel at creating images, when it comes to editing, these models usually require the user to specify masks and often struggle with edits that depend on the masked portion of the image, the authors said.

Using UniTune, these models can edit arbitrary images in complex cross-domain scenes.

The ability to use the method to place objects in a scene or make sweeping edits “makes UniTune useful by casual users,” for instance by speaking an edit into a mobile device, the paper stated. “UniTune, like other image generation models, has a great potential to complement and augment human creativity by creating new tools for professionals and empowering non-professionals with the ability to edit images more easily and in a more intuitive manner.”

Google’s unveiling of UniTune follows swiftly on the heels of DreamFusion, a model capable of generating 3D models from text inputs.

How it works

Today, creative folks can use several AI-powered editing tools. They include Luminar (formerly LuminarAI), which is designed for photo editing, and Lunacy, an app that can remove backgrounds and generate text. The likes of Apple’s GAUDI and Nvidia’s GauGAN can generate scenes from text inputs.

UniTune works by fine-tuning a text-to-image model on pairs of tokens and then sampling from the fine-tuned model.

“Using Classifier Free Guidance, the fine-tuned model correctly takes the conditioning into account. When a higher visual fidelity is needed, we also use SDEdit to maintain visual details in the original image,” the authors wrote. “The user is presented with images that use combinations of the parameters mentioned (number of training steps, Classifier Free Guidance, SDEdit) allowing them to pick the most suitable version.”
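To make the three parameters in that quote more concrete, here is a minimal Python sketch. It is not the authors' code: the function names, the toy list-of-floats "images" and the example values are all assumptions made for illustration. It shows the commonly used classifier-free guidance combination, an SDEdit-style starting point written in one common variance-preserving form (the paper may use a different noise schedule), and a grid of parameter combinations of the kind a user could choose from.

```python
import itertools
import math
import random


def cfg_combine(eps_uncond, eps_cond, guidance_scale):
    """Classifier-free guidance: move the unconditional noise prediction
    toward the text-conditioned one. A scale of 1.0 returns the conditional
    prediction; larger values follow the text prompt more strongly."""
    return [u + guidance_scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]


def sdedit_start(image, alpha_bar, rng):
    """SDEdit-style starting point: a partially noised copy of the original.
    alpha_bar is a hypothetical signal-retention factor in (0, 1];
    1.0 keeps the original unchanged, values near 0 are almost pure noise."""
    signal = math.sqrt(alpha_bar)
    noise = math.sqrt(1.0 - alpha_bar)
    return [signal * x + noise * rng.gauss(0.0, 1.0) for x in image]


def candidate_settings(train_steps, guidance_scales, alpha_bars):
    """Enumerate every combination of the three knobs so that one edited
    image per combination could be generated and shown to the user."""
    return [
        {"train_steps": s, "guidance_scale": w, "alpha_bar": a}
        for s, w, a in itertools.product(train_steps, guidance_scales, alpha_bars)
    ]


if __name__ == "__main__":
    rng = random.Random(0)
    # Toy stand-ins for model outputs and pixel values (assumed, not real data).
    print(cfg_combine([0.0, 1.0], [1.0, 3.0], guidance_scale=2.0))
    print(sdedit_start([0.2, 0.8], alpha_bar=0.5, rng=rng))
    for setting in candidate_settings([16, 64], [4.0, 8.0], [0.5, 1.0]):
        print(setting)
```

In this sketch, a lower `alpha_bar` discards more of the original image before sampling, which leaves room for larger edits at the cost of fidelity. That is the trade-off the quote describes when it says SDEdit is used where higher visual fidelity is needed.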
