Tensor in forward depending on batch size

I have a zero tensor in the forward pass of my module whose size depends on the batch size. Currently I create the tensor on each pass and send it to the GPU if the input is on the GPU.
Can I somehow put this zero tensor on the GPU once and clear it after each step?
The easiest way would be to declare it as a parameter of my module and switch off requires_grad.
I checked out Dynamic parameter declaration in forward function, which seems to be what I want, but I am unable to send the tensor to the GPU. The tensor does not require a gradient; it is only there to collect the output.
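Roughly what I am doing now (a simplified sketch; the sizes and names are just placeholders):

```python
import torch
import torch.nn as nn

class MyModule(nn.Module):
    def forward(self, x):
        # Allocated on every forward pass, sized by the batch dimension
        out = torch.zeros(x.size(0), 10)
        # Moved to the GPU whenever the input lives there
        out = out.to(x.device)
        # ... fill `out` with the collected outputs ...
        return out
```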

Hi,

If you register the Tensor as a buffer with self.register_buffer("foo", torch.zeros(100)), it will be moved the same way the parameters of the module are. You can then resize_(new_size).zero_() it at the beginning of each forward pass. Note that the resize is a no-op if the Tensor is already big enough, BUT if you reduce the size, the unused memory won't be freed and will remain there until you delete the tensor (or resize it to a bigger size again).
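A minimal sketch of what I mean; `foo` and the sizes here are placeholders:

```python
import torch
import torch.nn as nn

class MyModule(nn.Module):
    def __init__(self):
        super().__init__()
        # Registered as a buffer, so model.cuda() / model.to(device) moves it too
        self.register_buffer("foo", torch.zeros(1, 10))

    def forward(self, x):
        # Resize to the current batch size and clear the previous values
        self.foo.resize_(x.size(0), 10).zero_()
        # ... fill self.foo with the collected outputs ...
        return self.foo
```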

Hi,

Thank you, that solved my problem. Easier than I thought (:

Actually it does not solve my problem. I get the error

autograd’s resize can only change the shape of a given tensor, while preserving the number of elements.

which makes sense, since my tensor collects tensors that require a gradient. Do you have any other ideas?

If you need a new Tensor every time, then you will need to create it every time; there is no way around it. You can use torch.zeros(your_size, device="cuda") to create it directly on the GPU and avoid the copy.

If you want to reuse it, then you can use resize_().zero_() (after potentially detaching the current Tensor). resize_() is more of an advanced function, so maybe the first option above will be good enough for you.
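For example (just a sketch; `x` and the sizes stand in for your actual input and output shapes):

```python
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
x = torch.randn(32, 3, device=device)  # stand-in for the forward input

# Option 1: allocate the collecting tensor directly on the input's device,
# so no CPU -> GPU copy happens at all
out = torch.zeros(x.size(0), 10, device=x.device)

# Option 2: reuse one tensor across calls; detach it first so autograd's
# resize restriction does not apply, then resize and clear it
buf = torch.zeros(1, 10, device=device)
buf = buf.detach()
buf.resize_(x.size(0), 10).zero_()
```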

Okay, thank you. Do you know if this can eventually become a bottleneck? I do not know much about GPU <-> CPU data transfer latency.

Ah sorry, I did not read your answer correctly. It is already created on the GPU :slight_smile:

The first option above that uses device= when creating the Tensor actually does not do any CPU -> GPU copy!
The only thing the "advanced" method with resize buys you is avoiding one allocation, but that is very unlikely to be the bottleneck.
