Hey,

I have a large dataset X of size N that I want to feed into a network NN(X). I'm trying to split it into batches and run them in parallel on CUDA.

Right now, my code is:

```
loss = 0
for j in range(0, X.shape[0], batch_size):
    log_qx = Q.log_prob(X[j:j + batch_size])
    loss += loss_func(trueVal=trueVal[j:j + batch_size],
                      samples=X[j:j + batch_size],
                      log_qx=log_qx,
                      log_px=log_px)
loss.backward(retain_graph=True)
optimizer.step()
```

This is very slow, and I'm wondering whether there are any tricks to improve the efficiency. Thanks!
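
In case it helps anyone reproduce this, here is a minimal self-contained sketch of the kind of loop I mean. Everything in it is a stand-in I made up for illustration, not my real model: the toy data, a diagonal Normal with learnable `mu` and `log_sigma` in place of `Q`, a fixed standard-normal `log_px`, and a simple squared-error `loss_func`. In this variant each batch gets its own `backward()` and `optimizer.step()`, so the graph is freed after every batch and `retain_graph=True` is not needed. I'm not sure this is equivalent to what I want, since it takes one step per batch instead of one step on the summed loss.

```python
import math
import torch

torch.manual_seed(0)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Toy stand-in data (hypothetical sizes), moved to the device once up front.
N, D, batch_size = 1024, 2, 256
X = torch.randn(N, D, device=device)
trueVal = torch.randn(N, device=device)

# Stand-in for log_px: log density of a fixed standard normal, no gradient needed.
log_px = -0.5 * (X ** 2).sum(dim=-1) - 0.5 * D * math.log(2 * math.pi)

# Stand-in for Q: diagonal Normal with learnable mean and log-std.
mu = torch.zeros(D, device=device, requires_grad=True)
log_sigma = torch.zeros(D, device=device, requires_grad=True)
optimizer = torch.optim.Adam([mu, log_sigma], lr=1e-2)


def q_log_prob(x):
    """Log density of x under the stand-in Q, summed over dimensions."""
    q = torch.distributions.Normal(mu, log_sigma.exp())
    return q.log_prob(x).sum(dim=-1)


def loss_func(trueVal, samples, log_qx, log_px):
    """Hypothetical loss; `samples` is unused and kept only to mirror my signature."""
    return ((log_qx - log_px - trueVal) ** 2).mean()


losses = []
for j in range(0, N, batch_size):
    xb = X[j:j + batch_size]
    optimizer.zero_grad()          # clear gradients left over from the previous batch
    loss = loss_func(trueVal=trueVal[j:j + batch_size],
                     samples=xb,
                     log_qx=q_log_prob(xb),
                     log_px=log_px[j:j + batch_size])
    loss.backward()                # graph for this batch is freed here
    optimizer.step()
    losses.append(loss.item())     # .item() syncs with the GPU; fine for a toy run
```

With these toy sizes the loop runs four batches, and `losses` holds one Python float per batch.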