In this section, I explored text-conditional image-to-image translation, combining the structure of an input image with the creativity of a text prompt.
By varying the noise level (i_start = [1, 3, 5, 7, 10, 20]), I could interpolate between the input image and the text prompt, blending features from both. More noise gives the model freedom to move closer to the text prompt's "location" in the latent space, while less noise retains more of the original image.
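For readers who want to try a similar experiment, here is a minimal sketch using Hugging Face's diffusers img2img pipeline as a stand-in. This is not the model or codebase used for the results in this section: the checkpoint name and image path are placeholders, and the pipeline's strength argument plays the role that i_start plays above, controlling how much noise is injected before the prompt-conditioned denoising begins.

```python
import torch
from diffusers import StableDiffusionImg2ImgPipeline
from PIL import Image

# Off-the-shelf img2img pipeline (a stand-in; not necessarily the
# model used to produce the results described in this section).
pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-1", torch_dtype=torch.float16
).to("cuda")

# Hypothetical input path; substitute your own image.
init_image = Image.open("campanile.jpg").convert("RGB").resize((512, 512))

# `strength` controls how much noise is added before denoising:
# high strength -> lots of noise -> output closer to the text prompt;
# low strength  -> little noise  -> output closer to the input image.
# This mirrors the effect of the i_start noise levels above.
for strength in (0.9, 0.7, 0.5, 0.3):
    out = pipe(prompt="a rocket ship", image=init_image,
               strength=strength).images[0]
    out.save(f"campanile_rocket_s{strength}.png")
```

Sweeping strength (or i_start) produces a sequence of outputs that trace the interpolation from the original image to a pure text-to-image generation.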
I experimented with three combinations: the Campanile paired with the prompt "a rocket ship," a beach scene paired with "a pencil," and the Manhattan skyline paired with "a photo of a dog." The results show creative transformations: the Campanile gradually morphs into a futuristic rocket ship, the beach takes on pencil-sketch textures, and the skyline blends dog-like features into its structure. These experiments highlight the versatility and control offered by text-conditional diffusion models.