Hi! I am new to working with `torch.autograd` and gradient flow, and I have run into a problem with **accelerating gradient calculation**.

I have a YOLO net $\mathcal{N}$. It receives a video frame at every timestep and provides some outputs ${\mathbf{d}}_{i} = \mathcal{N}(\mathbf{I})$, where $\mathbf{I}$ is the image tensor of the frame and $\mathbf{d}_{i}$ is a detection output, i.e. `[x,y,w,h,...]`.

Assume that there are **20** detection boxes after NMS, i.e. len(${\mathbf{d}}$) = 20. Now I need to calculate $\frac{\partial x_{i}}{\partial \mathbf{I}}$ for `i = 1:20` (see the sketch below for the setup I have in mind).
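
For context, here is a minimal, self-contained sketch of the setup I have in mind (`FakeDetector` and all shapes here are just placeholders to make the snippet runnable, not my real YOLO code):

```
import torch

# Placeholder stand-in for the YOLO net N: it maps a frame to a
# [20, 6] tensor of boxes [x, y, w, h, conf, cls].
class FakeDetector(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.conv = torch.nn.Conv2d(3, 1, kernel_size=3, padding=1)
        self.head = torch.nn.Linear(64 * 64, 20 * 6)

    def forward(self, img):
        feat = self.conv(img).flatten(1)          # [1, 64*64]
        return self.head(feat).reshape(20, 6)     # 20 boxes, 6 values each

model = FakeDetector()                                      # stands in for N
img_tensor = torch.rand(1, 3, 64, 64, requires_grad=True)   # the frame I
detections = model(img_tensor)                              # d_i = N(I)
x_list = [d[0] for d in detections]                         # x coordinate of each box
```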

Since $\mathbf{d}_{i} = \mathcal{N}(\mathbf{I})$ and $x_{i} = \mathbf{d}_{i}[0]$, $x_{i}$ is included in the computation graph containing $\mathbf{I}$, so $\frac{\partial x_{i}}{\partial \mathbf{I}}$ can be obtained by `x_i.backward(retain_graph=True)` or `torch.autograd.grad(x_i, img_tensor, retain_graph=True)`

(if I’m not mistaken.)
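
To double-check that, here is what both routes look like for a single coordinate in the placeholder setup above (`backward` accumulates into `img_tensor.grad`, while `autograd.grad` returns the gradient directly):

```
x_0 = x_list[0]

# Route 1: backward() accumulates the gradient into img_tensor.grad
x_0.backward(retain_graph=True)
grad_via_backward = img_tensor.grad.clone()
img_tensor.grad = None                       # clear it before the next backward

# Route 2: autograd.grad() returns a tuple with one entry per input
grad_via_grad = torch.autograd.grad(x_0, img_tensor, retain_graph=True)[0]

print(torch.allclose(grad_via_backward, grad_via_grad))  # True
```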

Right now I calculate the 20 gradients in a **serial manner**:

```
res = []
for x_i in x_list:
    # one full backward pass per coordinate
    _grad = torch.autograd.grad(x_i, img_tensor, retain_graph=True)
    res.append(_grad[0])
```

If the forward pass of $\mathcal{N}$ takes about `T` seconds (~0.1 s), the backward pass also takes approximately `T` seconds (~0.15 s). However, the serial computation takes `20 x T` seconds! That's terrible.

I wonder if there is a **parallel way** (utilizing the power of the **GPU**, I mean) to shorten the `20 x T`, something like:

```
res = torch.autograd.grad(x_list, [img_tensor] * len(x_list))
# res[0] is res[1]
# True
```

It does not work🤣
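
The closest thing I have found in the docs is the `is_grads_batched=True` flag of `torch.autograd.grad`, which (if I understand it correctly) vmaps the backward pass over a batch of `grad_outputs`, so all 20 gradients would come from a single call instead of a Python loop. Here is a sketch of what I mean, untested on my actual YOLO model (and I gather the flag is still experimental, so some ops may not be supported):

```
x_vec = torch.stack(x_list)                          # shape [20], still in the graph
eye = torch.eye(len(x_list), device=x_vec.device)    # one one-hot "seed" row per coordinate

grads = torch.autograd.grad(
    outputs=x_vec,
    inputs=img_tensor,
    grad_outputs=eye,            # batch of 20 seed vectors
    is_grads_batched=True,       # vmap over the batch dimension
    retain_graph=True,
)[0]                             # shape [20, *img_tensor.shape]
```

Is this the right direction, or is there a better way to batch these backward passes on the GPU?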