To me, what you are doing is a gradient flow. I would suggest having a look at the very good post on Wasserstein gradient flows here: https://www.kernel-operations.io/geomloss/_auto_examples/comparisons/plot_gradient_flows_2D.html#sphx-glr-auto-examples-comparisons-plot-gradient-flows-2d-py
In the gradient flow function, they basically do:
for i in range(Nsteps):
    # Compute loss and gradient
    L_αβ = loss(x, y)
    [g] = torch.autograd.grad(L_αβ, [x])
    x.data -= learning_rate * len(x) * g
This way, you do not need to define an optimizer, since you apply the update step yourself! Hope this helps.
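For reference, here is a minimal self-contained sketch of that loop (not the post's exact code): it assumes geomloss is installed and uses its SamplesLoss Sinkhorn divergence as the loss; the point clouds, step count and learning rate are illustrative placeholders.

import torch
from geomloss import SamplesLoss  # assumed available: pip install geomloss

torch.manual_seed(0)
# Source cloud x (moved by the flow) and fixed target cloud y, both of shape (N, 2).
x = torch.randn(500, 2, requires_grad=True)
y = torch.randn(500, 2) + 2.0

# Debiased Sinkhorn divergence as a proxy for the Wasserstein distance.
loss = SamplesLoss(loss="sinkhorn", p=2, blur=0.05)

learning_rate = 0.05  # illustrative values
Nsteps = 100

for i in range(Nsteps):
    # Compute loss and its gradient with respect to the source points
    L_αβ = loss(x, y)
    [g] = torch.autograd.grad(L_αβ, [x])
    # Manual descent step; the len(x) factor compensates for the 1/N weighting
    # of the samples, so the step size does not shrink as the cloud grows.
    x.data -= learning_rate * len(x) * g

Up to the len(x) rescaling, this is the same update a plain SGD optimizer with that learning rate would apply.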