I am having an issue computing gradients with respect to the input text embeddings after using the Stable Diffusion model from Hugging Face. I use StableDiffusionPipeline from Hugging Face with the pretrained model: generator = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5"). I have text and use its text embeddings (from the CLIP tokenizer and CLIPTextModel, both from Hugging Face) as input to the generator. I have a loss that is some function of the generated image (for example, the sum of all values in the top half of the image, which I want to minimize towards zero), and I want to update the text embeddings by backpropagating the gradients.

However, I am unable to obtain gradients for the text embeddings even after setting requires_grad=True. I get an error which says: "The .grad attribute of a Tensor that is not a leaf Tensor is being accessed. Its .grad attribute won't be populated during autograd.backward(). If you indeed want the .grad field to be populated for a non-leaf Tensor, use .retain_grad() on the non-leaf Tensor. If you access the non-leaf Tensor by mistake, make sure you access the leaf Tensor instead." I have tried a number of different things, but it still doesn't work. Does anyone know what the problem could be, or how to solve it? Thanks.
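To illustrate, the kind of loss I have in mind is roughly something like this (just a sketch, with image being the decoded output of the pipeline):

# hypothetical example of the loss: sum of all values in the top half of the generated image
# image has shape (batch, channels, height, width)
loss = image[:, :, : image.shape[2] // 2, :].sum()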
The error message points towards accessing the .grad attribute of a non-leaf tensor, as seen in this example:
import torch
import torch.nn as nn

lin1 = nn.Linear(1, 1)
lin2 = nn.Linear(1, 1)
x = torch.randn(1, 1)

intermediate = lin1(x)
out = lin2(intermediate)
out.mean().backward()

# try to access .grad of a non-leaf
print(intermediate.is_leaf)
# False
print(intermediate.grad)
# None
# UserWarning: The .grad attribute of a Tensor that is not a leaf Tensor is being accessed. Its .grad attribute won't be populated during autograd.backward(). If you indeed want the .grad field to be populated for a non-leaf Tensor, use .retain_grad() on the non-leaf Tensor. If you access the non-leaf Tensor by mistake, make sure you access the leaf Tensor instead. See github.com/pytorch/pytorch/pull/30531 for more informations. (Triggered internally at aten/src/ATen/core/TensorBody.h:489.)

# fix it as suggested in the warning
intermediate = lin1(x)
intermediate.retain_grad()
out = lin2(intermediate)
out.mean().backward()
print(intermediate.grad)
# tensor([[0.9695]])
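If you actually want to optimize that tensor rather than just inspect its gradient, another option would be to turn it into a leaf tensor via detach().requires_grad_(). A rough sketch (not specific to your setup):

# detach() cuts the graph back to lin1 and returns a new leaf tensor;
# requires_grad_(True) makes autograd populate its .grad without retain_grad()
intermediate = lin1(x).detach().requires_grad_(True)
out = lin2(intermediate)
out.mean().backward()
print(intermediate.is_leaf)
# True
print(intermediate.grad)
# tensor containing the gradient of the mean w.r.t. intermediate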
I had tried retain_grad() and it didn't work. Below, I provide a small working example of something similar to what I want to do (most of the code is taken from the Hugging Face example). I am missing something, but I still cannot figure out what:
import torch
from transformers import CLIPTextModel, CLIPTokenizer
from diffusers import AutoencoderKL, UNet2DConditionModel, PNDMScheduler
from diffusers import LMSDiscreteScheduler
from tqdm.auto import tqdm
vae = AutoencoderKL.from_pretrained("CompVis/stable-diffusion-v1-4", subfolder="vae")
tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14")
unet = UNet2DConditionModel.from_pretrained("CompVis/stable-diffusion-v1-4", subfolder="unet")
scheduler = LMSDiscreteScheduler(beta_start=0.00085, beta_end=0.012, beta_schedule="scaled_linear", num_train_timesteps=1000)
torch_device = "cuda"
vae.to(torch_device)
text_encoder.to(torch_device)
unet.to(torch_device)
prompt = ["a photograph of an astronaut"]
height = 512 # default height of Stable Diffusion
width = 512 # default width of Stable Diffusion
num_inference_steps = 100 # Number of denoising steps
guidance_scale = 7.5 # Scale for classifier-free guidance
generator = torch.manual_seed(0) # Seed generator to create the initial latent noise
learning_rate = 0.1
batch_size = len(prompt)
text_input = tokenizer(prompt, padding="max_length", max_length=tokenizer.model_max_length, truncation=True, return_tensors="pt")
text_embeddings = text_encoder(text_input.input_ids.to(torch_device))[0]
text_embeddings.retain_grad()
latents = torch.randn(
    (batch_size, unet.in_channels, height // 8, width // 8),
    generator=generator,
)
latents = latents.to(torch_device)
scheduler.set_timesteps(num_inference_steps)
latents = latents * scheduler.init_noise_sigma
scheduler.set_timesteps(num_inference_steps)
unet.train()
for t in tqdm(scheduler.timesteps):
    latent_model_input = scheduler.scale_model_input(latents, timestep=t)

    # predict the noise residual
    with torch.no_grad():
        noise_pred = unet(latent_model_input, t, encoder_hidden_states=text_embeddings).sample

    # compute the previous noisy sample x_t -> x_t-1
    latents = scheduler.step(noise_pred, t, latents).prev_sample
image = vae.decode(latents).sample
loss = sum(sum(sum(sum(image))))
loss.backward()
print(text_embeddings.grad)
# None
print(text_embeddings.requires_grad)
# True
text_embeddings is used in a no_grad context, so why would you expect to see gradients in this tensor?
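If you remove the no_grad context (and keep the retain_grad() call), the computation graph is kept and gradients can flow back to text_embeddings. A rough sketch of your loop (note that keeping the graph across all denoising steps will use a lot of memory):

for t in tqdm(scheduler.timesteps):
    latent_model_input = scheduler.scale_model_input(latents, timestep=t)
    # no torch.no_grad() here, so the graph back to text_embeddings is retained
    noise_pred = unet(latent_model_input, t, encoder_hidden_states=text_embeddings).sample
    # compute the previous noisy sample x_t -> x_t-1
    latents = scheduler.step(noise_pred, t, latents).prev_sample

image = vae.decode(latents).sample
loss = image.sum()
loss.backward()
print(text_embeddings.grad)
# should now contain a gradient instead of None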
Yes, that's right. I was playing around with the code a lot (initially using the StableDiffusionPipeline vs. using the UNet directly), so somehow that line ended up not being commented out. I overlooked that one, thanks!