Backpropagating through an empty tensor does not save time

import time

import torch
from transformers import AutoModelForCausalLM

torch.manual_seed(10)

num_tokens = 500
input = torch.randint(0, 30000, torch.Size([8, num_tokens])).cuda()

model = AutoModelForCausalLM.from_pretrained("gpt2").cuda()

o = model(input_ids=input).logits

##################################
# Uncommenting these two lines makes o an empty tensor
# of shape [8, 0, vocab_size]:
# o[:, :500].detach()
# o = o[:, 500:]
##################################

output = o.sum()

torch.cuda.synchronize()
s = time.time()
output.backward()
torch.cuda.synchronize()
e = time.time()
print(e-s)

I am surprised that uncommenting the two lines between the ### markers does not speed up the backward pass, even though o ends up being an empty tensor. I wonder why that is the case. Thanks!

Calling backward on an empty tensor should raise an error:

import torch
import torch.nn as nn

lin = nn.Linear(10, 10)
x = torch.randn(1, 10)

out = lin(x)
out = out[:, 10:]
print(out)
# tensor([], size=(1, 0), grad_fn=<SliceBackward0>)
out.backward()
# RuntimeError: grad can be implicitly created only for scalar outputs

so your output seems to be valid.

My output is the sum of an empty tensor. Backpropagating from it takes the same time as backpropagating from the sum of a non-empty tensor, and I wonder why that is the case.

Ah, that’s interesting, as it creates a valid tensor:

print(out.sum())
# tensor(0., grad_fn=<SumBackward0>)

which will then allow you to backpropagate and, I would guess, compute all gradients as zeros.
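
To sanity-check that guess, here is a minimal sketch reusing the small nn.Linear example from above: backward still traverses the graph, and the leaf gradients come out as all zeros.

import torch
import torch.nn as nn

torch.manual_seed(0)
lin = nn.Linear(10, 10)
x = torch.randn(1, 10)

out = lin(x)[:, 10:]  # empty slice, shape (1, 0)
out.sum().backward()  # sum() yields a valid scalar, so backward runs

# the graph is still traversed, but every gradient ends up as zero
print(lin.weight.grad.abs().max())  # tensor(0.)
print(lin.bias.grad.abs().max())    # tensor(0.)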

Yeah, but I wonder why this happens and whether there is a way to skip the gradient computation to speed up backprop.

Let me add @albanD as he might know if and how this optimization could be done (I’m unsure if it’s possible or whether it would break other valid use cases).

For now you might want to skip the backward call if an empty tensor is detected.
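
A minimal sketch of such a guard, shown here with a small nn.Linear stand-in rather than the GPT-2 model:

import torch
import torch.nn as nn

model = nn.Linear(10, 10)
x = torch.randn(8, 10)

o = model(x)[:, 10:]  # empty slice, shape (8, 0)

# only call backward when something actually contributes to the loss;
# otherwise all gradients would be zero anyway
if o.numel() > 0:
    o.sum().backward()
else:
    print("empty output, skipping backward")

Note that when backward is skipped, the parameters’ .grad attributes are left untouched (None on the first iteration), which most optimizers simply skip over.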
