Imagen blog post from Google.
Parti post from Google.
Stable Diffusion V1 and V2 posts from Stability AI.
DALL-E 2 (pay)
Stable Diffusion (pay)
Midjourney (pay)
Craiyon (free)
Because the Stable Diffusion model has been made public (both code and network weights), there are many free systems that you can install and use if your computer has a good GPU. Here is a link to many Stable Diffusion systems. As of late November 2022, the most popular system seems to be from Automatic1111. The same list of systems also includes a number of Colab notebooks.
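If you would rather script image generation yourself instead of installing one of the packaged systems, here is a minimal sketch using the Hugging Face diffusers library (my example, not one of the systems linked above). The model id, prompt, and output file name are placeholders, and it assumes an NVIDIA GPU with enough memory.

# Minimal Stable Diffusion example using the Hugging Face diffusers library.
# The model id, prompt, and output file are placeholders.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",   # Stable Diffusion v1.5 weights
    torch_dtype=torch.float16,          # half precision to fit on consumer GPUs
)
pipe = pipe.to("cuda")                  # requires a CUDA-capable GPU

image = pipe("an astronaut riding a horse on the moon").images[0]
image.save("astronaut.png")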
Many image links scraped from the Wayback Machine: Long Image List
The DALL-E Reddit, with lots of posted images.
Instagram tags. Twitter tags.
Technical discussion of what DALL-E 2 does differently than GLIDE. It is all about the embedding prior (see part iii).
Demo by Bakz T. Future.
What is a neural network?
How do neural networks learn?
Once you have a rough understanding of what a neural network is, I highly recommend playing with this demo to get a more intuitive feeling:
A Neural Network Playground.
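To make these basics concrete, here is a toy sketch of my own (not taken from the linked resources): a tiny two-layer network trained with plain gradient descent to learn XOR, the same kind of small problem you can set up in the playground. The layer sizes, learning rate, and iteration count are arbitrary choices.

# A tiny two-layer neural network trained by gradient descent on XOR (NumPy only).
import numpy as np

rng = np.random.default_rng(0)

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)   # inputs
y = np.array([[0], [1], [1], [0]], dtype=float)               # XOR targets

W1 = rng.normal(size=(2, 4)); b1 = np.zeros(4)   # hidden layer (2 -> 4)
W2 = rng.normal(size=(4, 1)); b2 = np.zeros(1)   # output layer (4 -> 1)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

lr = 1.0
for step in range(10000):
    # forward pass
    h = sigmoid(X @ W1 + b1)
    p = sigmoid(h @ W2 + b2)

    # backward pass: gradients of the squared error, via the chain rule
    dz2 = (p - y) * p * (1 - p)
    dW2 = h.T @ dz2;  db2 = dz2.sum(axis=0)
    dz1 = (dz2 @ W2.T) * h * (1 - h)
    dW1 = X.T @ dz1;  db1 = dz1.sum(axis=0)

    # gradient descent step
    W1 -= lr * dW1;  b1 -= lr * db1
    W2 -= lr * dW2;  b2 -= lr * db2

print(np.round(p, 2))   # should be close to [[0], [1], [1], [0]] (some seeds need more steps)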
If you want to learn more about neural networks beyond these basics, you will need to know probability theory and multi-variable calculus. Here are two free on-line resources for learning more about neural nets:
Neural Networks and Deep Learning by Michael Nielsen.
DALL-E 2 - full paper on DALL-E 2 / unCLIP that uses an embedding prior.
GLIDE - immediate precursor to DALL-E 2.
Diffusion beats GAN - image generation by diffusion.
DALL-E - First DALL-E paper.
CLIP - embedding images and text into the same space. This is the approach that has enabled much of the recent text-to-image work (see the short code sketch after this list).
Parti - sibling system to Imagen.
Retrieval Augmentation - improved synthesis using nearest neighbor image queries.
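As a concrete illustration of what "embedding images and text into the same space" buys you, here is a minimal sketch that scores captions against an image with a pretrained CLIP model. It uses the Hugging Face transformers wrapper (my choice, not the CLIP paper's own code release); the image file and captions are placeholders.

# Score how well each caption matches an image using a pretrained CLIP model.
from PIL import Image
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("example.jpg")   # placeholder: any RGB image
captions = ["a photo of a cat", "a photo of a dog", "a painting of a horse"]

inputs = processor(text=captions, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# The image and each caption are embedded into the same vector space; the
# logits are scaled cosine similarities between those embeddings.
probs = outputs.logits_per_image.softmax(dim=-1)[0]
for caption, p in zip(captions, probs.tolist()):
    print(f"{p:.3f}  {caption}")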
Is the DALL-E 2 training data public? No. So where did the training data come from? Below is some information.
The DALL-E 2 / unCLIP paper (appendix C, listed above) says their data comes from the CLIP and DALL-E datasets:
When training the encoder, we sample from the CLIP [39] and DALL-E [40] datasets (approximately 650M images in total) with equal probability. When training the decoder, upsamplers, and prior, we use only the DALL-E dataset [40] (approximately 250M images). Incorporating the noisier CLIP dataset while training the generative stack negatively impacted sample quality in our initial evaluations.
The CLIP paper (listed above) describes their data collection process as follows:
Since existing datasets do not adequately reflect this possibility, considering results only on them would underestimate the potential of this line of research. To address this, we constructed a new dataset of 400 million (image, text) pairs collected from a variety of publicly available sources on the Internet. To attempt to cover as broad a set of visual concepts as possible, we search for (image, text) pairs as part of the construction process whose text includes one of a set of 500,000 queries (1). We approximately class balance the results by including up to 20,000 (image, text) pairs per query. The resulting dataset has a similar total word count as the WebText dataset used to train GPT-2. We refer to this dataset as WIT for WebImageText.
(1) The base query list is all words occurring at least 100 times in the English version of Wikipedia. This is augmented with bi-grams with high pointwise mutual information as well as the names of all Wikipedia articles above a certain search volume. Finally all WordNet synsets not already in the query list are added.
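As a rough illustration of the bi-gram step in that footnote, here is a toy sketch of my own for ranking bi-grams by pointwise mutual information (PMI); the tiny corpus is a placeholder stand-in for Wikipedia text.

# Toy example: rank bi-grams by pointwise mutual information (PMI).
import math
from collections import Counter

corpus = ("new york is a big city and new york has many people "
          "while san francisco is a smaller city").split()

unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))
n_uni = sum(unigrams.values())
n_bi = sum(bigrams.values())

def pmi(w1, w2):
    # PMI(w1, w2) = log( P(w1, w2) / (P(w1) * P(w2)) )
    return math.log((bigrams[(w1, w2)] / n_bi) /
                    ((unigrams[w1] / n_uni) * (unigrams[w2] / n_uni)))

# Word pairs that occur mostly together (e.g. "san francisco") score high.
for w1, w2 in sorted(bigrams, key=lambda b: pmi(*b), reverse=True)[:5]:
    print(f"{pmi(w1, w2):5.2f}  {w1} {w2}")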