Hello everyone, I am trying to solve a minimization problem using SGD.

In particular, I have an objective function (or loss) that looks like this:

where q_theta is parametrized by a fully connected NN and has the form:

```
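import torch

def objective(p, output):
    # p is a single 2D point (x, y); output is the NN prediction at p
    x, y = p
    a = minA  # defined elsewhere
    b = minB  # defined elsewhere
    r = 0.1
    # Smooth indicators of the discs of radius r + 0.02 around a and b
    XA = 1/2 - 1/2 * torch.tanh(100 * ((x - a[0])**2 + (y - a[1])**2 - (r + 0.02)**2))
    XB = 1/2 - 1/2 * torch.tanh(100 * ((x - b[0])**2 + (y - b[1])**2 - (r + 0.02)**2))
    # q is ~0 inside the disc around a, ~-1 inside the disc around b, ~output elsewhere
    q = (1 - XA) * ((1 - XB) * output - XB)
    return q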
```

“output” is the output of the NN, namely the only part of this function that is parametrized.

Now, my training function looks like this:

```
import torch.optim as optim

optimizer = optim.SGD(model.parameters(), lr=learning_rate)
for e in range(epochs):
    for configuration in total:
        # for each point in the array of independently sampled points
        optimizer.zero_grad()
        # output is q~
        output = model(configuration)
        # loss is the objective function we defined
        # in the paper, the objective function is (18)
        loss = objective(configuration, output)
        # backward() returns None, so call it on its own line to keep the loss value
        loss.backward()
        optimizer.step()
```

My model is a simple two-layer fully connected NN, with an input layer of size 2 (x, y) and one output node corresponding to the parametrized part of the function.
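A minimal sketch of such a model, for concreteness. The hidden width of 32 and the tanh activation are assumptions made only for illustration; neither is stated in the description above.

```python
import torch
import torch.nn as nn

# Assumed values (not from the post): hidden width and activation.
hidden = 32

# Two fully connected layers: 2 inputs (x, y) -> hidden -> 1 output.
model = nn.Sequential(
    nn.Linear(2, hidden),
    nn.Tanh(),
    nn.Linear(hidden, 1),
)

# One 2D point, as fed to the model inside the training loop.
configuration = torch.tensor([0.3, -0.2])
output = model(configuration)  # tensor of shape (1,)
```

With a single point as input, `output` has one element, so anything computed from it by elementwise operations can be passed to `.backward()` without an explicit reduction.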

Note that each “configuration” is a point in 2D space, sampled independently from a distribution. The sample average over these points approximates the expectation in (18).
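To illustrate the sample-average idea in isolation, here is a small sketch. Both the sampling distribution (uniform on the square [-1, 1]^2) and the integrand `per_sample_term` (the squared norm) are placeholders chosen for the example; they are not the distribution or the objective from (18).

```python
import torch

torch.manual_seed(0)

def per_sample_term(points):
    # Placeholder integrand (hypothetical): x^2 + y^2 for each row of points.
    return (points ** 2).sum(dim=1)

# Assumed sampling distribution: uniform on [-1, 1]^2, 1000 independent points.
points = 2 * torch.rand(1000, 2) - 1

# The sample average is a Monte Carlo estimate of the expectation of the integrand.
estimate = per_sample_term(points).mean()
```

For this placeholder the exact expectation is 2/3, so the estimate can be compared against a known value. Averaging over a batch before calling `.backward()` gives the gradient of the sample average, whereas the loop above takes one SGD step per point.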

However, the resulting minimized function does not make any sense. In particular, I am not sure I am handling the objective function correctly. Does calling .backward() already compute the gradient in (18), or should I compute the gradient explicitly with autograd?
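One way to check what `.backward()` does is to compare it with an explicit `torch.autograd.grad` call on a toy problem. The sketch below uses a single `nn.Linear` layer and a trivial loss as stand-ins; they are assumptions for the check, not the model or the objective above.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(2, 1)            # stand-in for the real network
point = torch.tensor([0.3, -0.2])  # one 2D sample

# Route 1: backward() writes d(loss)/d(param) into param.grad and returns None.
loss = model(point).sum()
returned = loss.backward()
grads_backward = [p.grad.clone() for p in model.parameters()]

# Route 2: torch.autograd.grad returns the same gradients as a tuple,
# without touching param.grad. The loss is recomputed to get a fresh graph.
loss = model(point).sum()
grads_explicit = torch.autograd.grad(loss, list(model.parameters()))

same = all(torch.allclose(g1, g2) for g1, g2 in zip(grads_backward, grads_explicit))
```

Both routes differentiate whatever scalar they are called on with respect to the model parameters. So `.backward()` yields the gradient of (18) only if the tensor it is called on is the full expression in (18), including any derivatives with respect to the inputs that appear inside it.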