Google’s Imagen AI produces photorealistic images from natural text with frightening fidelity

‘A blue jay standing on a large basket of rainbow macarons.’ Credit: Google

About a month after OpenAI announced DALL-E 2, its latest AI system to create images from text, Google has continued the AI “space race” with its own text-to-image diffusion model, Imagen. Google’s results are extremely, perhaps even scarily, impressive.

Using a standard measure, FID (Fréchet Inception Distance, where lower is better), Google Imagen outpaces OpenAI’s DALL-E 2 with a score of 7.27 on the COCO dataset, even though Imagen was not trained on COCO. Imagen also bests DALL-E 2 and other competing text-to-image methods among human raters. You can read about the full testing results in Google’s research paper.
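For readers curious what an FID score measures, here is a minimal sketch of the underlying Fréchet distance between two sets of image features. It is not Google’s evaluation code: the function name `frechet_distance` and the random test data are illustrative assumptions, and a real FID computation would feed in Inception-v3 activations for real and generated images rather than arbitrary arrays.

```python
import numpy as np


def frechet_distance(feats_real, feats_gen):
    """Frechet distance between Gaussians fitted to two feature sets.

    Each argument is an (n_samples, n_features) array. For a real FID
    score these would be Inception-v3 activations (assumption: any
    feature arrays are accepted here for illustration).
    """
    mu_r = feats_real.mean(axis=0)
    mu_g = feats_gen.mean(axis=0)
    cov_r = np.atleast_2d(np.cov(feats_real, rowvar=False))
    cov_g = np.atleast_2d(np.cov(feats_gen, rowvar=False))

    # Tr((cov_r cov_g)^(1/2)) equals the sum of the square roots of the
    # eigenvalues of cov_r @ cov_g, which are real and non-negative for
    # covariance matrices (up to floating-point noise, hence the clip).
    eigvals = np.linalg.eigvals(cov_r @ cov_g)
    trace_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()

    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(cov_r) + np.trace(cov_g) - 2.0 * trace_sqrt)
```

Two identical feature sets give a distance of roughly zero, and the value grows as the generated features drift away from the real ones, which is why a lower score indicates output that is statistically closer to real photos.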

‘The Toronto skyline with Google brain logo written in fireworks.’

Imagen works by taking a natural language text input, like, ‘A Golden Retriever dog wearing a blue checkered beret and red dotted turtleneck,’ and then using a frozen T5-XXL encoder to turn that input text into embeddings. A ‘conditional diffusion model’ then maps the text embedding into a small 64×64 image. Finally, Imagen uses text-conditional super-resolution diffusion models to upsample the 64×64 image first to 256×256 and then to 1024×1024.
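The cascade described above can be sketched as a simple three-stage pipeline. Imagen’s code is not public, so every function here (`encode_text`, `base_diffusion`, `super_resolve`, `generate`) is a hypothetical stand-in that only tracks image sizes; the sketch shows the order of the stages and how the text embedding is passed to each one, not how the models work internally.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Image:
    height: int
    width: int


def encode_text(prompt: str) -> List[float]:
    # Hypothetical stand-in for the frozen T5-XXL text encoder.
    return [float(len(word)) for word in prompt.split()]


def base_diffusion(embedding: List[float]) -> Image:
    # Hypothetical stand-in for the text-conditional base model,
    # which produces a small 64x64 image.
    return Image(64, 64)


def super_resolve(image: Image, embedding: List[float], size: int) -> Image:
    # Hypothetical stand-in for a text-conditional super-resolution
    # model: it also receives the text embedding, not just the image.
    if size <= image.height:
        raise ValueError("super-resolution must increase the image size")
    return Image(size, size)


def generate(prompt: str) -> List[Image]:
    """Return the image produced at each stage of the cascade."""
    embedding = encode_text(prompt)
    stages = [base_diffusion(embedding)]
    for size in (256, 1024):
        stages.append(super_resolve(stages[-1], embedding, size))
    return stages


stages = generate("A Golden Retriever dog wearing a blue checkered beret")
print([(img.height, img.width) for img in stages])
```

The point of the cascade is that the hard part, matching the image to the text, happens at low resolution, while the later stages focus on adding detail.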

Compared to NVIDIA’s GauGAN2 method from last fall, Imagen is significantly improved in terms of flexibility and results. AI is progressing rapidly. Consider the image below generated from ‘a cute corgi lives in a house made out of sushi.’ It looks believable, like someone really built a dog house from sushi that the corgi, perhaps unsurprisingly, loves.

‘A cute corgi lives in a house made out of sushi.’

It’s a cute creation. Seemingly all of what we’ve seen so far from Imagen is cute. Funny outfits on furry animals, cactuses with sunglasses, swimming teddy bears, royal raccoons, etc. Where are the people?

Whether innocent or ill-intentioned, we know that some users would immediately start typing in all sorts of phrases about people as soon as they had access to Imagen. I’m sure there’d be a lot of text inputs about adorable animals in humorous situations, but there’d also be input text about chefs, athletes, doctors, men, women, children, and much more. What would these people look like? Would doctors mostly be men, would flight attendants mostly be women, and would most people have light skin?

‘A robot couple fine dining with Eiffel Tower in the background.’ What would this couple look like if the text didn’t include the word ‘robot’?

We don’t know how Imagen handles these text strings because Google has elected not to show any people. There are ethical challenges with text-to-image research. If a model can conceivably create just about any image from text, how good is a model at presenting unbiased results? AI models like Imagen are largely trained using datasets scraped from the web. Content on the internet is skewed and biased in ways that we are still trying to understand fully. These biases have negative societal impacts worth considering and, ideally, rectifying. Not just that, but Google used the LAION-400M dataset for Imagen, which is known to ‘contain a wide range of inappropriate content including pornographic imagery, racist slurs, and harmful social stereotypes.’ A subset of the training data was filtered to remove noise and ‘undesirable’ content, but there remains a ‘risk that Imagen has encoded harmful stereotypes and representations, which guides our decision to not release Imagen for public use without further safeguards in place.’

The text strings can become quite complicated. ‘A marble statue of a koala DJ in front of a marble statue of a turntable. The koala is wearing large marble headphones.’

So no, you can’t access Imagen for yourself. On its website, Google lets you click on specific words from a selected group to see results, like ‘a photo of a fuzzy panda wearing a cowboy hat and a black leather jacket playing a guitar on top of a mountain,’ but you can’t search for anything to do with people or potentially problematic actions or items. If you could, you’d find that the model tends to generate images of people with lighter skin tones and reinforce traditional gender roles. Early research also indicates that Imagen reflects cultural biases through its depiction of certain items and events.

‘A Pomeranian is sitting on the Kings throne wearing a crown. Two tiger soldiers are standing next to the throne.’

We know Google is aware of representation issues across its wide range of products and is working on improving realistic skin tone representation and reducing inherent biases. However, AI is still a ‘Wild West’ of sorts. While there are many talented, thoughtful people behind the scenes generating AI models, a model is basically on its own once unleashed. Depending upon the dataset used to train the model, it’s difficult to predict what will happen when users can type in anything they want.

‘A dragon fruit wearing karate belt in the snow.’

It’s not Imagen’s fault, or the fault of any other AI models that have struggled with the same problem. Models are being trained using massive datasets that contain visible and hidden biases, and these problems scale with the model. Even beyond marginalizing specific groups of people, AI models can generate very harmful content. If you asked an illustrator to draw or paint something horrific, many would turn you away in disgust. Text-to-image AI models don’t have moral qualms and will produce anything. It’s a problem, and it’s unclear how it can be addressed.

‘Teddy bears swimming at the Olympics 400m Butterfly event.’

In the meantime, as AI research teams grapple with the societal and moral implications of their extremely impressive work, you can look at eerily realistic photos of skateboarding pandas, but you can’t input your own text. Imagen is not available to the public, and neither is its code. However, you can learn a lot about the project in a new research paper.

All images courtesy of Google

This article comes from DP Review and can be read on the original site.
