Text-to-Image Resources

This page is about image generation guided by text prompts. These tools are based on recent developments in deep neural networks. Several systems have garnered considerable attention, including DALL-E 2, Imagen, and Stable Diffusion.

[Image: A photo of a robot hand drawing, digital art]

Main Announcements

DALL-E 2 post from OpenAI.

Imagen blog post from Google.

Parti post from Google.

Stable Diffusion V1 and V2 posts from Stability AI.

Access to Text-to-Image Systems

There are now a number of sites where you can get either free or paid access to text-to-image generation systems:

DALL-E 2 (pay)

Stable Diffusion (pay)

Midjourney (pay)

Craiyon (free)

Because the Stable Diffusion model has been made public (both code and network weights), there are many free systems that you can install and use if your computer has a good GPU. Here is a link to many Stable Diffusion systems. As of late November 2022, the most popular system seems to be from Automatic1111. The same list of systems also includes a number of Colab notebooks.
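If you do have a suitable GPU, here is a minimal sketch of generating a single image from the publicly released Stable Diffusion weights using the Hugging Face diffusers library. The checkpoint name, prompt, and output filename are just illustrative choices; the installable systems linked above wrap this same kind of pipeline in a friendlier interface.

    # Minimal sketch: one image from the public Stable Diffusion weights.
    # Assumes the "diffusers" and "torch" packages and a CUDA-capable GPU.
    import torch
    from diffusers import StableDiffusionPipeline

    # Illustrative checkpoint; any compatible Stable Diffusion repo works here.
    pipe = StableDiffusionPipeline.from_pretrained(
        "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
    )
    pipe = pipe.to("cuda")

    image = pipe("a photo of a robot hand drawing, digital art").images[0]
    image.save("robot_hand.png")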

[Image: Pulp Fiction played by Muppets]

Image Results

Lexica has tons of generated images (along with their prompts) from the Stable Diffusion Discord.

Many image links scraped from the Wayback Machine: Long Image List

The DALL-E subreddit, with lots of posted images.

Instagram tags.

Twitter tags.

Fairness and Reducing Training Bias

Here are several DALL-E 2 posts about reducing training bias and increasing fairness in image generation: 1 2 3.

[Image: An astronaut riding a horse in a photorealistic style]

Analysis

Excellent analysis of DALL-E's strengths and weaknesses by Swimmer963. Contains a wealth of examples.

Technical discussion of what DALL-E 2 does differently than GLIDE. It is all about the embedding prior (see part iii).

[Image: Mount Everest made of cake, digital art]

Live Demos

Video demo by Karen X Cheng.

Demo by Bakz T. Future.

Neural Network Basics

Text-to-image generation systems are created using neural network techniques. For those of you who are unfamiliar with these methods, I recommend watching a couple of introductory videos from the awesome YouTube channel 3Blue1Brown. These videos assume a basic background in algebra:

What is a neural network?

How do neural networks learn?

Once you have a rough understanding of what a neural network is, I highly recommend playing with this demo to get a more intuitive feel for how these networks behave:

A Neural Network Playground.
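To make the "how do neural networks learn" idea concrete, here is a tiny, self-contained sketch (my own illustration, not taken from the videos) of a small two-layer network learning XOR by gradient descent. The layer sizes, learning rate, and iteration count are arbitrary choices for demonstration.

    # Tiny sketch: a 2-4-1 sigmoid network learning XOR with plain gradient descent.
    import numpy as np

    rng = np.random.default_rng(0)
    X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)  # inputs
    y = np.array([[0], [1], [1], [0]], dtype=float)              # XOR targets

    W1, b1 = rng.normal(size=(2, 4)), np.zeros(4)   # hidden-layer parameters
    W2, b2 = rng.normal(size=(4, 1)), np.zeros(1)   # output-layer parameters

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    lr = 1.0
    for step in range(5000):
        h = sigmoid(X @ W1 + b1)        # forward pass: hidden activations
        p = sigmoid(h @ W2 + b2)        # forward pass: predictions
        dp = (p - y) * p * (1 - p)      # backward pass: gradient at the output
        dh = (dp @ W2.T) * h * (1 - h)  # backward pass: gradient at the hidden layer
        W2 -= lr * h.T @ dp;  b2 -= lr * dp.sum(axis=0)   # gradient descent step
        W1 -= lr * X.T @ dh;  b1 -= lr * dh.sum(axis=0)

    print(np.round(p, 2))  # typically ends up close to [[0], [1], [1], [0]]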

If you want to learn about neural networks beyond these basics, you will need to know probability theory and multivariable calculus. Here are two free online resources for learning more about neural nets:

Neural Networks and Deep Learning by Michael Nielsen.

Dive into Deep Learning

Technical Papers Related to Dall-E 2

If you are comfortable with neural network techniques and want to learn about the inner workings of DALL-E 2, below are links to some of the relevant research papers.

DALL-E 2 - full paper on DALL-E 2 / unCLIP that uses an embedding prior.

GLIDE - immediate precursor to DALL-E 2.

Diffusion beats GAN - image generation by diffusion.

DALL-E - First DALL-E paper.

CLIP - embedding images and text into the same space. This is the approach that has enabled much of the recent text-to-image work.
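As a concrete illustration of the shared embedding idea, here is a short sketch that scores how well a few captions match an image, using the CLIP wrappers in the Hugging Face transformers library. The model name, image file, and captions are assumptions made for the example.

    # Minimal sketch: compare an image against several captions in CLIP's shared space.
    from PIL import Image
    from transformers import CLIPModel, CLIPProcessor

    model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
    processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

    image = Image.open("astronaut.png")  # any local image file
    captions = ["an astronaut riding a horse", "a bowl of fruit", "a robot hand drawing"]

    inputs = processor(text=captions, images=image, return_tensors="pt", padding=True)
    outputs = model(**inputs)

    # Higher values mean the caption's embedding is closer to the image's embedding.
    print(outputs.logits_per_image.softmax(dim=-1))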

Technical Papers from Google

Imagen - main Imagen paper.

Parti - sibling system to Imagen.

Technical Papers Related to Stable Diffusion

Latent Diffusion - network used to create Stable Diffusion.

Retrieval Augmentation - improved synthesis using nearest-neighbor image queries.

Training Data (Image & Text Pairs)

The training data for Stable Diffusion is public, and is known as LAION-5B.

Is the DALL-E 2 training data public? No. Where did it come from? Below is some information.

The DALL-E 2 / unCLIP paper (appendix C, listed above) says their data comes from the CLIP and DALL-E datasets:

When training the encoder, we sample from the CLIP [39] and DALL-E [40] datasets (approximately 650M images in total) with equal probability. When training the decoder, upsamplers, and prior, we use only the DALL-E dataset [40] (approximately 250M images). Incorporating the noisier CLIP dataset while training the generative stack negatively impacted sample quality in our initial evaluations.

The CLIP paper (listed above) describes their data collection process as follows:

Since existing datasets do not adequately reflect this possibility, considering results only on them would underestimate the potential of this line of research. To address this, we constructed a new dataset of 400 million (image, text) pairs collected from a variety of publicly available sources on the Internet. To attempt to cover as broad a set of visual concepts as possible, we search for (image, text) pairs as part of the construction process whose text includes one of a set of 500,000 queries (1). We approximately class balance the results by including up to 20,000 (image, text) pairs per query. The resulting dataset has a similar total word count as the WebText dataset used to train GPT-2. We refer to this dataset as WIT for WebImageText.

(1) The base query list is all words occurring at least 100 times in the English version of Wikipedia. This is augmented with bi-grams with high pointwise mutual information as well as the names of all Wikipedia articles above a certain search volume. Finally all WordNet synsets not already in the query list are added.

Some more information about the training data.

Alternative Text-Based Image Generation Systems

Although DALL-E 2 itself is not available to most of us, there are other text-driven image generation systems that are publicly available. I am not going to provide direct links to these systems because they change so rapidly. I will, however, list a few names that you can look for. Most of these systems require that you know how to set up your own Colab notebook.


Go to Greg Turk's Home Page.