Backpropagating through an empty tensor does not save time

import time

import torch
from transformers import AutoModelForCausalLM

torch.manual_seed(10)

num_tokens = 500
input = torch.randint(0, 30000, torch.Size([8, num_tokens])).cuda()

model = AutoModelForCausalLM.from_pretrained("gpt2").cuda()

o = model(input_ids=input).logits

##################################
# Uncommenting these two lines makes o an empty tensor
# of shape [8, 0, vocab_size]:
# o[:, :500].detach()
# o = o[:, 500:]
##################################

output = o.sum()

torch.cuda.synchronize()
s = time.time()
output.backward()
torch.cuda.synchronize()
e = time.time()
print(e-s)

I am surprised that uncommenting the two lines between the ### markers does not speed up the backward pass, even though o ends up being an empty tensor. I wonder why that is the case. Thanks!

Calling backward on an empty tensor should raise an error:

import torch
import torch.nn as nn

lin = nn.Linear(10, 10)
x = torch.randn(1, 10)

out = lin(x)
out = out[:, 10:]
print(out)
# tensor([], size=(1, 0), grad_fn=<SliceBackward0>)
out.backward()
# RuntimeError: grad can be implicitly created only for scalar outputs

so your output seems to be valid.

My output is the sum of an empty tensor. Backpropagating from it takes the same time as backpropagating from the sum of a non-empty tensor, and I wonder why that is the case.

Ah, that’s interesting, as it creates a valid tensor:

print(out.sum())
# tensor(0., grad_fn=<SumBackward0>)

which will then allow you to backpropagate and, I would guess, compute all gradients as zeros.
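
To sanity-check that guess, here is a minimal sketch reusing the small nn.Linear example from above: backward still traverses the graph, and the leaf gradients come out as all zeros.

import torch
import torch.nn as nn

torch.manual_seed(0)
lin = nn.Linear(10, 10)
x = torch.randn(1, 10)

out = lin(x)[:, 10:]  # empty slice, shape (1, 0)
out.sum().backward()  # sum() yields a valid scalar, so backward runs

# the graph is still traversed, but every gradient ends up as zero
print(lin.weight.grad.abs().max())  # tensor(0.)
print(lin.bias.grad.abs().max())    # tensor(0.)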

Yeah, but I wonder why this happens and whether there is a way to skip the gradient computation to speed up backprop.

Let me add @albanD as he might know if and how this optimization could be done (I’m unsure if it’s possible or whether it would break other valid use cases).

For now you might want to skip the backward call if an empty tensor is detected.
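
A minimal sketch of such a guard, shown here with a small nn.Linear stand-in rather than the GPT-2 model:

import torch
import torch.nn as nn

model = nn.Linear(10, 10)
x = torch.randn(8, 10)

o = model(x)[:, 10:]  # empty slice, shape (8, 0)

# only call backward when something actually contributes to the loss;
# otherwise all gradients would be zero anyway
if o.numel() > 0:
    o.sum().backward()
else:
    print("empty output, skipping backward")

Note that when backward is skipped, the parameters’ .grad attributes are left untouched (None on the first iteration), which most optimizers simply skip over.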
