Loss isn't converging when my loss function depends on the weights of more than one neural net

I am quite new to PyTorch, so apologies in advance if my post is hard to read. I am building a network in PyTorch that solves a system of coupled differential equations. For that, I create one neural net per variable, i.e. NN_x(t) and NN_y(t).

For this system I use a custom loss function that depends on NN_x, NN_y and their derivatives, i.e. loss(NN_x, NN_y, dNN_x, dNN_y).

I initialize the weights of both neural nets with requires_grad=True, so the loss is essentially a function of all the weights of both networks.
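To make this concrete, here is a simplified sketch of how such a loss can be put together (not my exact code: f and g below are placeholder right-hand sides, the sigmoid activation is only illustrative, and the derivative terms here are taken with torch.autograd.grad):

    import torch

    # placeholder right-hand sides of the coupled system, only for illustration
    def f(x, y, t):
        return y

    def g(x, y, t):
        return -x

    def nn_forward(t, W):
        # single-hidden-layer net: t has shape (N, 1), W = [W_in, W_out]
        return torch.sigmoid(t @ W[0]) @ W[1]

    def ode_loss(t, W_x, W_y):
        x = nn_forward(t, W_x)
        y = nn_forward(t, W_y)

        # time derivatives of the network outputs via autograd;
        # create_graph=True keeps them differentiable so loss.backward()
        # can propagate through them (t must have requires_grad=True)
        dx_dt = torch.autograd.grad(x, t, grad_outputs=torch.ones_like(x),
                                    create_graph=True)[0]
        dy_dt = torch.autograd.grad(y, t, grad_outputs=torch.ones_like(y),
                                    create_graph=True)[0]

        # mean squared residuals of both equations
        residual_x = dx_dt - f(x, y, t)
        residual_y = dy_dt - g(x, y, t)
        return (residual_x ** 2).mean() + (residual_y ** 2).mean()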

I call loss.backward() to get the gradients and then update the weights by hand using gradient descent with momentum. However, the loss never converges; I have tried different learning rates but had no luck.

I wanted to know whether this is the right approach, i.e. will loss.backward() compute the gradient with respect to all the weights the loss depends on, that is, the weights of both networks? If yes, why is my loss not converging, and what else can I try to make it converge?

Here is the part of the code where I initialize the weights:

Weight initialization for the single-hidden-layer networks NN_x and NN_y (20 hidden units each):

    import numpy as np
    import torch
    from torch.autograd import Variable

    # Xavier-style initialisation; the tensors are float64 because they come from NumPy
    W_x = [Variable(torch.from_numpy(np.random.normal(loc=0.0, scale=np.sqrt(2 / (1 + 20)), size=(1, 20))),
                    requires_grad=True),
           Variable(torch.from_numpy(np.random.normal(loc=0.0, scale=np.sqrt(2 / (20 + 1)), size=(20, 1))),
                    requires_grad=True)]

    W_y = [Variable(torch.from_numpy(np.random.normal(loc=0.0, scale=np.sqrt(2 / (1 + 20)), size=(1, 20))),
                    requires_grad=True),
           Variable(torch.from_numpy(np.random.normal(loc=0.0, scale=np.sqrt(2 / (20 + 1)), size=(20, 1))),
                    requires_grad=True)]
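The forward pass through each net is then just a couple of matrix multiplications (again, the sigmoid hidden activation and the [0, 1] time grid are only illustrative):

    # time grid: requires_grad so the derivatives dNN_x/dt, dNN_y/dt can be
    # taken with respect to t; float64 to match the NumPy-initialised weights
    t = torch.linspace(0.0, 1.0, 100, dtype=torch.float64).reshape(-1, 1)
    t.requires_grad_(True)

    NN_x = torch.sigmoid(t @ W_x[0]) @ W_x[1]   # shape (100, 1)
    NN_y = torch.sigmoid(t @ W_y[0]) @ W_y[1]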

P.S. I am resetting the gradients to zero (grad.data = 0) after each update step.
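Putting the pieces above together, the update step looks essentially like this (hand-written gradient descent with momentum; lr and beta are just example values that I have been varying):

    lr, beta = 1e-3, 0.9                        # example hyperparameters
    V_x = [torch.zeros_like(w) for w in W_x]    # momentum buffers
    V_y = [torch.zeros_like(w) for w in W_y]

    for step in range(10000):
        loss = ode_loss(t, W_x, W_y)
        loss.backward()                         # gradients w.r.t. every weight in W_x and W_y

        with torch.no_grad():
            for W, V in ((W_x, V_x), (W_y, V_y)):
                for w, v in zip(W, V):
                    v.mul_(beta).add_(w.grad)   # accumulate momentum
                    w -= lr * v                 # descent step
                    w.grad.zero_()              # reset gradients for the next step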