CUDA OOM because of tensor gradients?

Based on the stack trace, you are running out of memory in the forward pass, so I don’t think gradients are the cause. You could add debug print statements to the forward pass of your model to check which layer significantly increases the memory usage, e.g. with forward hooks as sketched below.
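
Here is a minimal sketch of that kind of memory debugging via `torch.cuda.memory_allocated()` in forward hooks (the model is just a placeholder, swap in your own):

```python
import torch
import torch.nn as nn

def add_memory_hooks(model):
    # Print the allocated CUDA memory after each submodule's forward pass
    # so you can spot the layer causing the large increase.
    for name, module in model.named_modules():
        if not name:  # skip the root module
            continue
        def hook(module, inp, out, name=name):
            mem = torch.cuda.memory_allocated() / 1024**2
            print(f"{name}: {mem:.1f} MB allocated")
        module.register_forward_hook(hook)

model = nn.Sequential(
    nn.Linear(1024, 4096),
    nn.ReLU(),
    nn.Linear(4096, 1024),
).cuda()

add_memory_hooks(model)
out = model(torch.randn(64, 1024, device="cuda"))
```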
A forward activation that is stored for the gradient computation could e.g. be huge and use a lot of memory. Once you have an idea which layers (or rather which activations) increase the memory usage significantly, you could consider offloading them to the CPU as described here.
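
Since the link isn’t reproduced here: one built-in option for this is `torch.autograd.graph.save_on_cpu`, which moves the tensors saved for the backward pass to the CPU during the forward pass. A minimal sketch (again with a placeholder model):

```python
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(1024, 4096),
    nn.ReLU(),
    nn.Linear(4096, 1024),
).cuda()

x = torch.randn(64, 1024, device="cuda")

# Activations saved for backward are kept on the CPU (pinned memory
# speeds up the transfers) and copied back to the GPU only when the
# gradients are computed.
with torch.autograd.graph.save_on_cpu(pin_memory=True):
    out = model(x)

out.sum().backward()
```

Note that only the forward pass needs to run inside the context manager; the backward call can happen outside. The trade-off is extra host/device transfer time in exchange for the lower GPU memory usage.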