Fun With Diffusion Models!

By: Daniela Fajardo

Overview

In this project, I implemented and used diffusion models for image generation.

Part A: Diffusion Models Exploration

Overview

In part A, I explored diffusion models, implemented sampling loops, and applied them for tasks such as inpainting and creating optical illusions.

Part 0: Image Generation with DeepFloyd IF

My seed is 3037150359.

Using the DeepFloyd IF diffusion model, a two-stage diffusion model trained by Stability AI, I generated 3 images for each of the 3 provided text prompts, varying the number of inference steps across 5, 10, 20, and 40. Below are the results. The generated images align with the provided prompts, and as num_inference_steps increases, the images become more detailed and "artistic".

Inference Steps: 5
Inference Steps: 10
Inference Steps: 20
Inference Steps: 40

1.1 Forward Process

In this step, I implemented the forward process of diffusion, which adds scaled noise to a clean image at a given timestep: x_t = sqrt(ᾱ_t) · x_0 + sqrt(1 − ᾱ_t) · ε, where ε ~ N(0, I). Using a test image, I visualized the results for t = 250, 500, 750, showing how the image becomes progressively noisier as t increases. These results demonstrate the gradual transformation of a clean image into noise, setting the stage for the reverse process to reconstruct the original image.
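A minimal sketch of this step, assuming a precomputed tensor alphas_cumprod holding ᾱ_t for every timestep (the tensor name and function signature are mine; DeepFloyd exposes its own schedule):

    import torch

    def forward_process(x0: torch.Tensor, t: int, alphas_cumprod: torch.Tensor) -> torch.Tensor:
        # x_t = sqrt(abar_t) * x_0 + sqrt(1 - abar_t) * eps, with eps ~ N(0, I)
        abar_t = alphas_cumprod[t]
        eps = torch.randn_like(x0)
        return abar_t.sqrt() * x0 + (1 - abar_t).sqrt() * eps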

Forward Process Visualization

1.2 Classical Denoising

As a classical baseline, I applied Gaussian blur to the noisy images from 1.1 to try to remove the noise; the blur suppresses high-frequency noise but discards image detail along with it. A minimal sketch follows.
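The baseline amounts to a single low-pass filter; torchvision provides one (the kernel size and sigma below are illustrative defaults, not necessarily the values I used):

    import torch
    from torchvision.transforms.functional import gaussian_blur

    def blur_denoise(noisy: torch.Tensor, kernel_size: int = 5, sigma: float = 2.0) -> torch.Tensor:
        # Classical denoising: low-pass filter the noisy image.
        return gaussian_blur(noisy, kernel_size=kernel_size, sigma=sigma)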

Gaussian blur denoising of the noisy Campanile at t=250
Gaussian blur denoising of the noisy Campanile at t=500
Gaussian blur denoising of the noisy Campanile at t=750

1.3 One-Step Denoising

In this step, I used a pretrained UNet from the diffusion model to denoise images in a single step. Using the noisy images from Part 1.1 (t = 250, 500, 750), I first estimated the Gaussian noise in each noisy image with the UNet, then removed that estimate to reconstruct an approximation of the original image. I visualized the results for each timestep, showing the progression from the clean original image to the noisy version and back to the denoised estimate. This demonstrates the model's ability to recover meaningful detail from noisy inputs.
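Solving the forward-process equation for x_0 gives the one-step estimate. A sketch, reusing the alphas_cumprod tensor from above (the real DeepFloyd UNet also takes a prompt embedding, which I omit here):

    import torch

    def one_step_denoise(unet, x_t: torch.Tensor, t: int, alphas_cumprod: torch.Tensor) -> torch.Tensor:
        # x0_hat = (x_t - sqrt(1 - abar_t) * eps_hat) / sqrt(abar_t)
        abar_t = alphas_cumprod[t]
        eps_hat = unet(x_t, t)  # the UNet predicts the noise present in x_t
        return (x_t - (1 - abar_t).sqrt() * eps_hat) / abar_t.sqrt()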

One-step denoising results at t = 250, 500, 750

1.4 Iterative Denoising

In this step, I worked on iterative denoising, a powerful technique used in diffusion models to clean up noisy images step by step. Starting with a heavily distorted image, I gradually refined it over multiple steps, following a schedule of decreasing noise levels. This approach proved much better at reconstructing detailed, high-quality images compared to simpler methods like single-step denoising or blurring.
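Each update blends the current noisy image with the clean estimate. A sketch of one step on a strided schedule, reusing the names from above (the per-step variance term is omitted for brevity):

    def iterative_denoise_step(x_t, x0_hat, t, t_prev, alphas_cumprod):
        # Interpolate between the clean estimate and the current image,
        # weighted by the noise schedule (the DDPM posterior mean).
        abar_t, abar_prev = alphas_cumprod[t], alphas_cumprod[t_prev]
        alpha_t = abar_t / abar_prev  # effective alpha for this stride
        beta_t = 1 - alpha_t
        return (abar_prev.sqrt() * beta_t / (1 - abar_t)) * x0_hat \
             + (alpha_t.sqrt() * (1 - abar_prev) / (1 - abar_t)) * x_t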

Iterative Denoising Results

1.5 Diffusion Model Sampling

In this step, I used the iterative denoising function to generate images from scratch by starting with random noise. By setting the starting point to pure noise and applying the denoising process iteratively, I was able to sample images corresponding to the prompt "a high quality photo."

Generated Images

1.6 Classifier-Free Guidance

In this step, I implemented Classifier-Free Guidance (CFG) to significantly enhance the quality of images generated from random noise. CFG combines two noise estimates: one conditioned on a text prompt and one unconditional. Scaling the difference between them by a guidance factor γ > 1 pushes the sample toward the prompt, trading diversity for quality. The images generated from scratch with the prompt "a high quality photo" and CFG showed much higher quality than in the previous section.
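The combination is a single linear extrapolation; a sketch (the unet call signature is schematic, and gamma = 7.0 is just an example scale):

    def cfg_noise_estimate(unet, x_t, t, cond_emb, uncond_emb, gamma: float = 7.0):
        # eps = eps_uncond + gamma * (eps_cond - eps_uncond); gamma = 1 recovers
        # the plain conditional estimate, gamma > 1 strengthens the prompt.
        eps_cond = unet(x_t, t, cond_emb)
        eps_uncond = unet(x_t, t, uncond_emb)
        return eps_uncond + gamma * (eps_cond - eps_uncond)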

Classifier-Free Guidance Results

1.7 Image-to-Image Translation

In this step, I explored image-to-image translation, where I noised a real image and then denoised it to make gradual edits. Starting with a real test image, I added noise and used the iterative_denoise_cfg function to progressively refine the image, simulating edits based on the prompt "a high quality photo." By varying how much noise was added, I created a series of images that became closer to the original, with the edits becoming subtler as the noise level decreased.
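The whole procedure is short; a sketch, where denoise_loop stands in for iterative_denoise_cfg and the other names mirror the sketches above (all assumptions):

    import torch

    def edit_image(x_orig, i_start, timesteps, alphas_cumprod, denoise_loop):
        # Noise the real image to the timestep at index i_start, then hand it
        # to the CFG denoising loop. Smaller i_start = more noise = larger edits.
        abar = alphas_cumprod[timesteps[i_start]]
        x_t = abar.sqrt() * x_orig + (1 - abar).sqrt() * torch.randn_like(x_orig)
        return denoise_loop(x_t, i_start=i_start)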

Campanile
House
Pyramid
Campanile Results
House Results
Pyramid Results

1.7.1 Editing Hand-Drawn and Web Images

In this step, I experimented with editing non-realistic images, such as hand-drawn sketches and web images, using the same diffusion procedure. By applying noise and denoising iteratively, I projected these images onto the natural image manifold, transforming them into more realistic versions. I started by downloading a web image and applying the denoising edits at varying noise levels, then drew my own sketches and ran them through the same process. The results show how the diffusion model can creatively enhance both hand-drawn and web-based images, gradually making them look more natural and realistic.

Web Image: Venice Painting
Results
Hand-Drawn 1
Results
Hand-Drawn 2
Results

1.7.2 Inpainting

In this step, I implemented inpainting, which lets us modify specific parts of an image while preserving the rest. Using a binary mask, I applied the diffusion model to create new content wherever the mask is active (set to 1), while leaving the other areas untouched. Starting with the original image and the mask, I ran the diffusion denoising loop; after each step, I replaced the pixels where the mask is 0 with the original image noised to the current timestep, so only the masked region receives new content (see the sketch below). I used this technique to inpaint the top of the Campanile and applied similar edits to my own images, experimenting with different masks to create unique results.
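The per-step projection is one line; a sketch assuming a binary mask tensor and a forward_process like the one in 1.1 (names are mine):

    def inpaint_step(x_t, x_orig, mask, t, forward_process):
        # Keep generated content where mask == 1; everywhere else, force the
        # pixels back to the original image noised to the current timestep t.
        return mask * x_t + (1 - mask) * forward_process(x_orig, t)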

Campanile Mask
Results
Pyramid Mask
Results
House Mask
Results

1.7.3 Text Conditioned Image-to-Image Translation

In this step, I extended the image-to-image translation process by conditioning on text prompts, enabling more controlled edits.

"a rocket ship"

Rocket

"a man wearing a hat"

Hat

"a pencil"

Pencil

1.8 Visual Anagrams

In this part of the project, I implemented Visual Anagrams, using diffusion models to generate optical illusions. The idea was to create an image that, when viewed in one orientation, portrays one scene, but when flipped upside down, reveals an entirely different scene. For this, I used two text prompts: one for the original image and one for the flipped image. The process involved denoising the image normally for one prompt and denoising the flipped version of the image for the second prompt. By averaging the noise estimates from both steps, I achieved a hybrid image that displays different scenes depending on its orientation.
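Per step, the two estimates are computed and averaged; a sketch (the flip here is a vertical torch.flip, and the unet signature is schematic):

    import torch

    def anagram_noise_estimate(unet, x_t, t, emb1, emb2):
        # Denoise the image toward prompt 1, and its flipped version toward
        # prompt 2; flip the second estimate back before averaging.
        eps1 = unet(x_t, t, emb1)
        eps2 = torch.flip(unet(torch.flip(x_t, dims=[-2]), t, emb2), dims=[-2])
        return (eps1 + eps2) / 2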

'an oil painting of an old man' and 'an oil painting of people around a campfire'
'a man wearing a hat' and 'an oil painting of a snowy mountain village'
'a photo of a hipster barista' and 'a photo of a dog'

1.9 Hybrid Images

In this part of the project, I implemented Hybrid Images using diffusion models. The goal was to create images that reveal different scenes depending on the viewer's distance. This technique combines low and high-frequency components from two different prompts to produce a composite image with distinct features at varying scales. The process involved using two separate text prompts and obtaining their noise estimates using a diffusion model. For one prompt, the low-frequency components were preserved, while for the other, the high-frequency components were retained. By combining these components, I generated an image that looks like one thing from a distance and transforms into something entirely different when viewed up close.
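A sketch of the composite estimate, using a Gaussian blur as the low-pass filter (the kernel size, sigma, and unet signature are assumptions):

    from torchvision.transforms.functional import gaussian_blur

    def hybrid_noise_estimate(unet, x_t, t, emb_far, emb_near, kernel_size=33, sigma=2.0):
        # Low frequencies follow the 'far' prompt, high frequencies the 'near' one.
        eps_far = unet(x_t, t, emb_far)
        eps_near = unet(x_t, t, emb_near)
        low = gaussian_blur(eps_far, kernel_size=kernel_size, sigma=sigma)
        high = eps_near - gaussian_blur(eps_near, kernel_size=kernel_size, sigma=sigma)
        return low + high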

Looks like a skull from far away but a waterfall from close up
Looks like 'a rocket ship' from far away but 'a photo of the Amalfi coast' from close up
Looks like 'an oil painting of a snowy mountain village' from far away but 'an oil painting of people around a campfire' from close up

Part B: Training Diffusion Models from Scratch

Overview

In part B, I trained my own diffusion model on the MNIST dataset.

Part 1: Training a Single-Step Denoising UNet

1.1 Implementing the UNet

In this part of the project, I implemented the UNet architecture for image denoising in diffusion models.

UNet Architecture Visualization

1.2 Using the UNet to Train a Denoiser

We aim to solve the following denoising problem: given a noisy image z, train a denoiser D_θ that maps z back to a clean image x. The noisy images are generated as z = x + σϵ, where ϵ ~ N(0, I). Below we visualize the noising process over σ = 0.0, 0.2, 0.4, 0.5, 0.6, 0.8, 1.0.
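Building a training pair is one line (the function name is mine):

    import torch

    def add_noise(x: torch.Tensor, sigma: float) -> torch.Tensor:
        # z = x + sigma * eps, with eps ~ N(0, I)
        return x + sigma * torch.randn_like(x)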

Visualization of the noising process over the σ values above

1.2.1 Training

I trained a UNet model to denoise images by mapping noisy inputs to clean versions, using the MNIST dataset. The model was trained for 5 epochs with the Adam optimizer, a hidden dimension of 128, and a batch size of 256. A sketch of the loop follows.
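A condensed version of the training loop under those settings; the learning rate of 1e-4 and the training noise level σ = 0.5 are assumptions rather than values stated above:

    import torch
    from torch.utils.data import DataLoader
    from torchvision import datasets, transforms

    def train_denoiser(unet, sigma: float = 0.5, epochs: int = 5, device: str = "cuda"):
        # Assumes unet is already on `device`.
        data = datasets.MNIST("data", train=True, download=True,
                              transform=transforms.ToTensor())
        loader = DataLoader(data, batch_size=256, shuffle=True)
        opt = torch.optim.Adam(unet.parameters(), lr=1e-4)
        for _ in range(epochs):
            for x, _ in loader:
                x = x.to(device)
                z = x + sigma * torch.randn_like(x)  # noisy input
                loss = torch.nn.functional.mse_loss(unet(z), x)
                opt.zero_grad()
                loss.backward()
                opt.step()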

Training loss curve

Results on digits from the test set after 1 epoch of training
Results on digits from the test set after 5 epochs of training

1.2.2 Out-of-Distribution Testing

Here we see how the denoiser performs on σ values it wasn't trained on.

Out-of-distribution denoising results

Part 2: Training a Diffusion Model

In this section, we changed our UNet to predict the added noise ϵ instead of the clean image x. First I worked on training a diffusion model using a time-conditioned UNet to gradually denoise images and generate realistic samples. Later I introduced class-conditioning to give more control over the generated images.

2.1 Adding Time Conditioning to UNet

I modified the UNet to include time conditioning, embedding the timestep into the network with FCBlocks.
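A sketch of the conditioning block, matching the Linear -> GELU -> Linear shape of an FCBlock (exact sizes and the way its output is injected into the UNet are simplified here):

    import torch
    import torch.nn as nn

    class FCBlock(nn.Module):
        # Maps the normalized scalar timestep t/T to a feature vector that is
        # then broadcast into intermediate UNet activations.
        def __init__(self, in_ch: int, out_ch: int):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(in_ch, out_ch),
                nn.GELU(),
                nn.Linear(out_ch, out_ch),
            )

        def forward(self, t: torch.Tensor) -> torch.Tensor:
            return self.net(t)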


2.2 Training the UNet

The time-conditioned UNet was trained on MNIST, learning to predict noise from noisy images. I used the Adam optimizer and exponential learning rate decay, batch size of 128, and a hidden dimension of 64. I trained the model over 20 epochs.
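The optimizer setup, with the decay chosen so the learning rate falls by 10x over the full run (the 1e-3 initial rate is an assumption):

    import torch

    def make_optimizer(unet: torch.nn.Module, epochs: int = 20, lr: float = 1e-3):
        # Exponential decay with gamma = 0.1 ** (1/epochs); step once per epoch.
        opt = torch.optim.Adam(unet.parameters(), lr=lr)
        sched = torch.optim.lr_scheduler.ExponentialLR(opt, gamma=0.1 ** (1 / epochs))
        return opt, sched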

Time-conditioned UNet training loss curve

2.3 Sampling from the UNet

Starting with pure noise, the trained UNet iteratively denoised images to produce realistic MNIST digits.
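A sketch of the sampling loop, assuming a betas tensor for the variance schedule and a UNet that takes the timestep normalized to [0, 1] (both interface details are assumptions):

    import torch

    @torch.no_grad()
    def sample(unet, betas: torch.Tensor, shape=(16, 1, 28, 28), device="cuda"):
        alphas = 1 - betas
        abar = torch.cumprod(alphas, dim=0)
        x = torch.randn(shape, device=device)    # start from pure noise
        for t in range(len(betas) - 1, -1, -1):
            t_batch = torch.full((shape[0],), t, device=device)
            eps = unet(x, t_batch / len(betas))  # predicted noise
            mean = (x - betas[t] / (1 - abar[t]).sqrt() * eps) / alphas[t].sqrt()
            # Standard DDPM update: add fresh noise except at the final step.
            x = mean + betas[t].sqrt() * torch.randn_like(x) if t > 0 else mean
        return x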

Samples after 5 epochs
Samples after 20 epochs

2.4 Adding Class-Conditioning to UNet

To control which digit was generated, I extended the UNet with class conditioning. This involved adding one-hot encoded class vectors (0-9) into the network, with dropout for occasional unconditional generation.
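A sketch of how the class vectors are prepared, with roughly 10% of them dropped so the model also learns an unconditional mode (the exact drop probability is an assumption):

    import torch
    import torch.nn.functional as F

    def class_condition(labels: torch.Tensor, num_classes: int = 10, p_uncond: float = 0.1):
        # One-hot encode, then zero out a random subset of rows; the all-zero
        # vector serves as the "no class" input (used later for CFG).
        c = F.one_hot(labels, num_classes).float()
        drop = (torch.rand(labels.shape[0], 1) < p_uncond).float()
        return c * (1 - drop)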

Class-conditioned UNet training loss curve

2.5 Sampling from the Class-Conditioned UNet

The class-conditioned UNet generated specific digits with improved accuracy using classifier-free guidance.
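Inside the sampling loop, the guided estimate uses the all-zero class vector as the unconditional input; a sketch (γ = 5.0 is an example guidance scale, and the unet signature is schematic):

    import torch

    def guided_class_estimate(unet, x, t, c, gamma: float = 5.0):
        # Classifier-free guidance with class conditioning: the zero vector
        # plays the role of the "no class" input from training.
        eps_cond = unet(x, t, c)
        eps_uncond = unet(x, t, torch.zeros_like(c))
        return eps_uncond + gamma * (eps_cond - eps_uncond)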

Samples after 5 epochs
Samples after 20 epochs