Unsupervised minimization with SGD

Hello everyone, I am trying to solve a minimization problem with SGD in PyTorch.
In particular, I have an objective function (or loss) that looks like this:

where q_theta is parametrized by a fully connected NN and has the form:

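import torch

def objective(p, output):
  # p is one sampled 2D point (x, y); output is the NN prediction at that point
  x, y = p
  a = minA  # center of region A (defined elsewhere)
  b = minB  # center of region B (defined elsewhere)
  r = 0.1

  # Smooth indicators: close to 1 inside the disc of radius r + 0.02, 0 far outside
  XA = 0.5 - 0.5 * torch.tanh(100 * ((x - a[0])**2 + (y - a[1])**2 - (r + 0.02)**2))
  XB = 0.5 - 0.5 * torch.tanh(100 * ((x - b[0])**2 + (y - b[1])**2 - (r + 0.02)**2))
  q = (1 - XA) * ((1 - XB) * output - XB)
  return q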

“output” is the output of the NN, i.e. the only part of this function that depends on the trainable parameters.
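To make the role of XA and XB concrete, here is a small sketch that evaluates the same tanh expression at two points. The centers `minA` and `minB` below are hypothetical values picked only for illustration (the real ones are not shown in this post), and `smooth_indicator` is a helper name introduced here, not part of the original code.

```python
import torch

def smooth_indicator(x, y, center, r=0.1, margin=0.02, sharpness=100.0):
    # Same tanh expression as XA / XB in objective()
    d2 = (x - center[0]) ** 2 + (y - center[1]) ** 2
    return 0.5 - 0.5 * torch.tanh(sharpness * (d2 - (r + margin) ** 2))

# Hypothetical centers, chosen only for illustration
minA = torch.tensor([-0.5, 0.0])
minB = torch.tensor([0.5, 0.0])

at_center_A = smooth_indicator(minA[0], minA[1], minA)  # point at the center of A
far_from_A = smooth_indicator(minB[0], minB[1], minA)   # point at distance 1 from A

print(at_center_A.item(), far_from_A.item())
```

With these constants the indicator is roughly 0.95 at the center of the disc rather than exactly 1, so the factor (1 - XA) suppresses q inside A only approximately.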

Now, my training function looks like this:

optimizer = optim.SGD(model.parameters(), lr=learning_rate)

for e in range(epochs):
  for configuration in total:
    # loop over the array of independently sampled points

    # output is q~, the raw NN prediction
    output = model(configuration)

    # the loss is the objective function defined above
    # (equation 18 in the paper)
    loss = objective(configuration, output).backward()

My model is a simple two-layer fully connected NN with two inputs (x, y) and a single output node, which is the parametrized part of the function.
Each “configuration” is a point in 2D space, sampled independently from a distribution; averaging over these samples approximates the expectation in (18).
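For concreteness, a model matching this description could look like the sketch below. The hidden width of 32 and the tanh activation are assumptions made for illustration; the post does not specify them.

```python
import torch
import torch.nn as nn

hidden = 32  # assumed hidden width, not given in the post

model = nn.Sequential(
    nn.Linear(2, hidden),   # input: one 2D point (x, y)
    nn.Tanh(),              # assumed activation
    nn.Linear(hidden, 1),   # single output node (q~)
)

point = torch.tensor([0.1, -0.3])  # one example configuration
output = model(point)
print(output.shape)
```

A smooth activation such as tanh is a natural choice if the objective later needs derivatives of the output with respect to (x, y), but that is a design assumption rather than something stated above.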

However, the function I get after minimization does not make any sense, and I am not sure I am handling the objective function correctly. Does .backward() take care of the gradient that appears in (18), or should I compute that gradient explicitly with autograd?
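Some background that may help frame the question: in PyTorch, `.backward()` returns None and only fills the `.grad` fields of the parameters with the derivative of the scalar it is called on; the weights change only when `optimizer.step()` is called. If (18) itself contains the gradient of q with respect to the point (x, y), that inner gradient has to be built explicitly, for example with `torch.autograd.grad(..., create_graph=True)`. The sketch below assumes a per-sample loss of the form |∇q|², which is a guess since the formula is not shown here; the architecture, learning rate, and random points are also placeholders.

```python
import torch
import torch.nn as nn
import torch.optim as optim

torch.manual_seed(0)

# Assumed architecture and learning rate, for illustration only
model = nn.Sequential(nn.Linear(2, 16), nn.Tanh(), nn.Linear(16, 1))
optimizer = optim.SGD(model.parameters(), lr=1e-2)

points = torch.rand(64, 2)  # stand-in for the independently sampled configurations
before = [p.detach().clone() for p in model.parameters()]

for point in points:
    # The input must require grad so q can be differentiated w.r.t. (x, y)
    point = point.clone().requires_grad_(True)
    q = model(point).squeeze()  # placeholder: in the post, q would come from objective()

    # Gradient of q w.r.t. the input; create_graph=True keeps it
    # differentiable w.r.t. the network parameters
    (grad_q,) = torch.autograd.grad(q, point, create_graph=True)
    loss = (grad_q ** 2).sum()  # assumed form of the per-sample objective

    optimizer.zero_grad()  # clear gradients from the previous sample
    loss.backward()        # d(loss)/d(parameters) goes into .grad
    optimizer.step()       # the actual SGD update

changed = any(not torch.equal(b, p.detach())
              for b, p in zip(before, model.parameters()))
print(changed)
```

If (18) does not involve a derivative with respect to the input, the `torch.autograd.grad` call can be dropped and the loss computed directly from q; the `zero_grad` / `backward` / `step` sequence stays the same.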