Optimizer step requires GPU memory

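loss.backward() frees the intermediate activations (unless they are kept with retain_graph=True), not the gradients. optimizer.step() only updates the parameters and leaves the gradients untouched; they are reset only when you call optimizer.zero_grad().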
You can still access the gradients using model.layer.weight.grad.
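
Here is a minimal sketch you can run on the CPU to check this yourself. The tiny nn.Linear model, the random data and the variable names are just assumptions for illustration, so the gradient is read from model.weight.grad rather than model.layer.weight.grad.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical toy setup, not taken from the original question.
model = nn.Linear(4, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
x = torch.randn(8, 4)
target = torch.randn(8, 2)

loss = nn.functional.mse_loss(model(x), target)
loss.backward()  # fills .grad and frees the tensors saved for the backward pass
grad_after_backward = model.weight.grad.clone()

optimizer.step()  # updates the parameters in place, does not touch .grad
grads_survive_step = torch.equal(grad_after_backward, model.weight.grad)

# The graph buffers are gone, so a second backward pass is rejected.
try:
    loss.backward()
    graph_freed = False
except RuntimeError:
    graph_freed = True

optimizer.zero_grad()  # this is the call that resets the gradients
grad = model.weight.grad
grads_cleared = grad is None or bool((grad == 0).all())

print(grads_survive_step, graph_freed, grads_cleared)
```

Depending on the PyTorch version, zero_grad() either sets the gradients to None or fills them with zeros, which is why the last check accepts both.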

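Since Python has function scoping (not block scoping), variables such as the output and loss assigned inside a training loop stay alive after the loop ends and keep their memory allocated while validation runs. You could probably save some memory by creating separate functions for your training and validation as explained in this post (in case you haven’t done it already).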
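
A rough sketch of that structure could look like the following. The function names train_one_epoch and validate, the toy model and the random batches are all hypothetical and only there to make the example runnable.

```python
import torch
import torch.nn as nn


def train_one_epoch(model, optimizer, batches):
    """Run one training pass; output and loss are locals of this function."""
    model.train()
    total = 0.0
    for x, y in batches:
        optimizer.zero_grad()
        output = model(x)
        loss = nn.functional.mse_loss(output, y)
        loss.backward()
        optimizer.step()
        total += loss.item()  # .item() gives a Python float, so no graph is kept
    return total / len(batches)  # the locals are released once the function returns


def validate(model, batches):
    """Evaluate without building a graph, so no activations are stored."""
    model.eval()
    total = 0.0
    with torch.no_grad():
        for x, y in batches:
            total += nn.functional.mse_loss(model(x), y).item()
    return total / len(batches)


torch.manual_seed(0)
model = nn.Linear(4, 2)  # hypothetical toy model
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
batches = [(torch.randn(8, 4), torch.randn(8, 2)) for _ in range(3)]

train_loss = train_one_epoch(model, optimizer, batches)
val_loss = validate(model, batches)
```

After train_one_epoch returns, nothing at module level still refers to the last output or loss tensor, so that memory can be reused during validation.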