Hi, I'm looking at the extension-cpp example.
I am wondering why you allocate memory for the output tensors each time you call forward or backward.
Sure, for the forward case you could provide an output tensor.
But for the backward case this is not possible without bad hacks.
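To illustrate what I mean by providing an output tensor, here is a minimal sketch using `torch.add` as a stand-in for the extension's kernel (the actual op in extension-cpp is different, this just shows the pattern):

```python
import torch

a = torch.randn(1024)
b = torch.randn(1024)

# What the example does: a fresh output tensor is allocated on every call.
out_new = torch.add(a, b)

# What I have in mind: the caller preallocates a buffer once and the op
# writes into it, so no allocation happens per call.
buf = torch.empty_like(a)
torch.add(a, b, out=buf)

# Same result, but buf's storage is reused across calls.
ptr = buf.data_ptr()
torch.add(a, b, out=buf)
assert buf.data_ptr() == ptr

# Note: out= is not supported when the inputs require grad, which is
# part of why the backward case seems hard to handle cleanly.
```

This is only a sketch of the forward case; for backward I don't see how to pass a buffer in without the hacks I mentioned.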
Does PyTorch internally take care that old output tensors get reused, or is this not a big issue performance-wise?
In my benchmark, forward is about 1.5% faster when I reuse a buffer instead of allocating a new tensor on each call.