A brief history of diffusion, the tech at the heart of modern image-generating AI


Text-to-image AI exploded this year as technical advances greatly enhanced the fidelity of art that AI systems could create. Controversial as systems like Stable Diffusion and OpenAI’s DALL-E 2 are, platforms including DeviantArt and Canva have adopted them to power creative tools, personalize branding and even ideate new products.

But the tech at the heart of these systems is capable of far more than generating art. Called diffusion, it’s being used by some intrepid research groups to produce music, synthesize DNA sequences and even discover new drugs.

So what is diffusion, exactly, and why is it such a massive leap over the previous state of the art? As the year winds down, it’s worth taking a look at diffusion’s origins and how it advanced over time to become the influential force that it is today. Diffusion’s story isn’t over — refinements on the techniques arrive with each passing month — but the last year or two especially brought remarkable progress.

The birth of diffusion
You might recall the wave of deepfake apps from several years ago, which inserted people's portraits into existing images and videos, replacing the original subjects with realistic-looking substitutes. Using AI, the apps would "insert" a person's face, or in some cases their whole body, into a scene, often convincingly enough to fool someone at first glance.

Most of these apps relied on an AI technology called generative adversarial networks, or GANs for short. GANs consist of two parts: a generator that produces synthetic examples (e.g. images) from random data and a discriminator that attempts to distinguish between the synthetic examples and real examples from a training dataset. (Typical GAN training datasets consist of hundreds to millions of examples of the kind of content the GAN is expected to eventually produce.) Both the generator and discriminator improve at their respective tasks until the discriminator can no longer tell the real examples from the synthesized ones with better than the 50% accuracy expected of chance.
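To make that adversarial tug-of-war concrete, here is a minimal sketch of a GAN training loop in PyTorch. The tiny fully connected networks, the random stand-in "real" data and the hyperparameters are illustrative assumptions, not any production architecture.

```python
# A minimal sketch of the adversarial loop described above, using PyTorch.
# The networks and the random "real" data are illustrative placeholders.
import torch
import torch.nn as nn

latent_dim, data_dim = 16, 64
G = nn.Sequential(nn.Linear(latent_dim, 128), nn.ReLU(), nn.Linear(128, data_dim))
D = nn.Sequential(nn.Linear(data_dim, 128), nn.ReLU(), nn.Linear(128, 1))

opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCEWithLogitsLoss()

for step in range(1_000):
    real = torch.randn(32, data_dim)      # stand-in for a batch of real examples
    fake = G(torch.randn(32, latent_dim))

    # Discriminator: push real examples toward label 1, synthetic toward label 0.
    loss_d = bce(D(real), torch.ones(32, 1)) + bce(D(fake.detach()), torch.zeros(32, 1))
    opt_d.zero_grad(); loss_d.backward(); opt_d.step()

    # Generator: try to make the discriminator label its output as real.
    loss_g = bce(D(fake), torch.ones(32, 1))
    opt_g.zero_grad(); loss_g.backward(); opt_g.step()
```

The tension is visible in the two losses: the discriminator is rewarded for telling real from fake, while the generator is rewarded for fooling it, and the instability described below arises from training both at once.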

[Image: Sand sculptures of Harry Potter and Hogwarts, generated by Stable Diffusion. Image Credits: Stability AI]

Top-performing GANs can create, for example, snapshots of fictional apartment buildings. StyleGAN, a system Nvidia developed a few years back, can generate high-resolution head shots of fictional people by learning attributes like facial pose, freckles and hair. Beyond image generation, GANs have been applied to the 3D modeling space and vector sketches, showing an aptitude for outputting video clips as well as speech and even looping instrument samples in songs.

In practice, though, GANs suffered from a number of shortcomings owing to their architecture. The simultaneous training of generator and discriminator models was inherently unstable; sometimes the generator “collapsed” and outputted lots of similar-seeming samples. GANs also needed lots of data and compute power to run and train, which made them tough to scale.

Enter diffusion.

How diffusion works
Diffusion takes both its inspiration and its name from physics, where it describes the process by which something moves from a region of higher concentration to one of lower concentration, like a sugar cube dissolving in coffee. Sugar granules in coffee are initially concentrated at the top of the liquid but gradually spread throughout it.

Diffusion systems borrow from diffusion in non-equilibrium thermodynamics specifically, where the process increases the entropy, or randomness, of the system over time. Consider a gas: it will eventually spread out to fill an entire space evenly through random motion. Similarly, data like images can be transformed into a simple noise distribution by progressively adding random noise.

Diffusion systems slowly destroy the structure of data by adding noise until there’s nothing left but noise.
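For the curious, here is a minimal sketch of that forward "destruction" process in PyTorch, in the style of denoising diffusion probabilistic models (DDPM). The step count and the linear noise schedule are illustrative assumptions, not tuned constants.

```python
# A minimal sketch of the forward noising process (DDPM-style).
import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)        # per-step noise amounts
alphas_bar = torch.cumprod(1.0 - betas, 0)   # cumulative fraction of signal kept

def noisy_at(x0, t):
    """Sample step t of the forward process for clean data x0."""
    noise = torch.randn_like(x0)
    return alphas_bar[t].sqrt() * x0 + (1 - alphas_bar[t]).sqrt() * noise

x0 = torch.rand(3, 64, 64)                   # stand-in "image" in [0, 1]
early, late = noisy_at(x0, 10), noisy_at(x0, T - 1)
# Early steps are mostly image; by the final step virtually no structure remains.
```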

In physics, diffusion is spontaneous and irreversible — sugar diffused in coffee can’t be restored to cube form. But diffusion systems in machine learning aim to learn a sort of “reverse diffusion” process to restore the destroyed data, gaining the ability to recover the data from noise.
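Here is a sketch of how that reverse process is learned in practice, again assuming a DDPM-style setup: a network is trained to predict the noise that was mixed in at a random step, so that at sampling time it can be subtracted out step by step. The tiny MLP is a placeholder for the large U-Nets real systems use.

```python
# A sketch of training the "reverse diffusion" network (DDPM-style).
import torch
import torch.nn as nn

T, data_dim = 1000, 64
betas = torch.linspace(1e-4, 0.02, T)
alphas_bar = torch.cumprod(1.0 - betas, 0)

model = nn.Sequential(nn.Linear(data_dim + 1, 256), nn.ReLU(), nn.Linear(256, data_dim))
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

for step in range(1_000):
    x0 = torch.randn(32, data_dim)               # stand-in for real training data
    t = torch.randint(0, T, (32,))
    noise = torch.randn_like(x0)
    a = alphas_bar[t].unsqueeze(1)
    xt = a.sqrt() * x0 + (1 - a).sqrt() * noise  # forward process, as above

    # Predict the injected noise from the noisy data and the (scaled) step index.
    pred = model(torch.cat([xt, t.unsqueeze(1) / T], dim=1))
    loss = ((pred - noise) ** 2).mean()
    opt.zero_grad(); loss.backward(); opt.step()
```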


Diffusion systems have been around for nearly a decade. But a relatively recent innovation from OpenAI called CLIP (short for "Contrastive Language-Image Pre-Training") made them much more practical in everyday applications. CLIP classifies data (images, for example) to "score" each step of the diffusion process based on how likely it is to be classified under a given text prompt (e.g. "a sketch of a dog in a flowery lawn").

At the start, the data has a very low CLIP-given score, because it's mostly noise. But as the diffusion system reconstructs data from the noise, the output gradually comes closer to matching the prompt and its CLIP score climbs.
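Here is a minimal sketch of that scoring idea using OpenAI's open source clip package (pip install git+https://github.com/openai/CLIP.git). The prompt and the helper function are illustrative; a guided diffusion sampler would nudge each denoising step toward outputs that raise this score.

```python
# A minimal sketch of CLIP-based scoring; the helper is illustrative.
import torch
import clip

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)
text = clip.tokenize(["a sketch of a dog in a flowery lawn"]).to(device)

def clip_score(images):
    """Cosine similarity between a batch of images (already run through the
    returned `preprocess` transform) and the text prompt."""
    with torch.no_grad():
        img_emb = model.encode_image(images)
        txt_emb = model.encode_text(text)
    img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)
    txt_emb = txt_emb / txt_emb.norm(dim=-1, keepdim=True)
    return (img_emb @ txt_emb.T).squeeze()   # higher means a better prompt match
```

Pure noise scores poorly against the prompt; a partially denoised image that starts to resemble it scores higher, which is what steers the generation.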
