torch.autograd.grad: computing gradients of multiple scalars w.r.t. the same input tensor in a batched manner

Hi! I am new to working with torch.autograd and gradient flow. I have run into a problem with accelerating gradient computation.

I have a YOLO net $\mathcal{N}$. At every timestep it receives a video frame and produces outputs $\mathbf{d}_{i} = \mathcal{N}(\mathbf{I})$, where $\mathbf{I}$ is the image tensor of the frame and $\mathbf{d}_{i}$ is a detection output, i.e. [x, y, w, h, ...].

Assume that there are 20 detection boxes after NMS, i.e. len($\mathbf{d}$) = 20. I now need to calculate $\frac{\partial x_{i}}{\partial \mathbf{I}}$ for i = 1:20.

Since $\mathbf{d}_{i} = \mathcal{N}(\mathbf{I})$ and $x_{i} = \mathbf{d}_{i}[0]$, $x_{i}$ is part of the computation graph containing $\mathbf{I}$, so $\frac{\partial x_{i}}{\partial \mathbf{I}}$ can be obtained with x_i.backward(retain_graph=True) or torch.autograd.grad(x_i, img_tensor, retain_graph=True) (if I'm not mistaken).
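To check the "if I'm not mistaken" part, here is a minimal sketch on a toy network. The small `net` below is only a stand-in I made up for illustration, not a YOLO model, and the 8x8 image size is arbitrary. It compares the gradient returned by torch.autograd.grad with the one that backward() accumulates into img_tensor.grad.

```python
import torch

torch.manual_seed(0)

# Toy stand-in for the detector (an assumption, NOT a real YOLO model).
net = torch.nn.Sequential(
    torch.nn.Conv2d(3, 4, kernel_size=3, padding=1),
    torch.nn.ReLU(),
    torch.nn.Flatten(),
    torch.nn.Linear(4 * 8 * 8, 5),
)

img_tensor = torch.rand(1, 3, 8, 8, requires_grad=True)
d = net(img_tensor)   # shape (1, 5), plays the role of one detection
x_i = d[0, 0]         # scalar, plays the role of x_i

# Route 1: functional API, returns a tuple with one gradient per input.
(grad_a,) = torch.autograd.grad(x_i, img_tensor, retain_graph=True)

# Route 2: backward() accumulates the gradient into img_tensor.grad.
x_i.backward(retain_graph=True)
grad_b = img_tensor.grad.clone()

same = torch.allclose(grad_a, grad_b)
print(same)
```

One practical difference: backward() accumulates into .grad (including the parameters of the net), so in a loop over several x_i the .grad field would have to be reset between iterations, while torch.autograd.grad returns a fresh tensor each time.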

Right now I compute the 20 gradients serially:

res = []
for x_i in x_list:
    # one backward pass per box; retain_graph keeps the graph alive for the next iteration
    (_grad,) = torch.autograd.grad(x_i, img_tensor, retain_graph=True)
    res.append(_grad)

If the forward pass of $\mathcal{N}$ takes about T seconds (~0.1 s), a single backward pass takes roughly as long (~0.15 s).

However, the serial computation takes about 20 x T seconds! That's terrible.

I wonder if there is a parallel way, I mean one that uses the power of the GPU, to bring this below 20 x T, like

res = torch.autograd.grad(x_list, [img_tensor] * len(x_list))

# res[0] is res[1]
# True

It does not work🤣
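For reference, here is a minimal reproduction on a toy network, plus one direction that might be worth trying. Everything about the toy setup is an assumption for illustration: `net` is a made-up stand-in rather than my YOLO model, and the shapes are arbitrary. As far as I understand, passing several outputs to torch.autograd.grad backpropagates all of them at once, so the result is the sum of the 20 gradients rather than 20 separate ones. The `is_grads_batched=True` argument of torch.autograd.grad (which I believe is flagged as experimental and relies on vmap) seems intended for this case: each row of `grad_outputs` is treated as a separate backward pass. I have only checked it on this toy example, not on the real detector.

```python
import torch

torch.manual_seed(0)
num_boxes = 20

# Toy stand-in for the detector (an assumption, NOT a real YOLO model).
net = torch.nn.Sequential(
    torch.nn.Conv2d(3, 4, kernel_size=3, padding=1),
    torch.nn.ReLU(),
    torch.nn.Flatten(),
    torch.nn.Linear(4 * 8 * 8, num_boxes * 4),
)

img_tensor = torch.rand(1, 3, 8, 8, requires_grad=True)
d = net(img_tensor).view(num_boxes, 4)  # 20 fake boxes, 4 values each
xs = d[:, 0]                            # shape (20,): the x coordinate of each box
x_list = list(xs.unbind(0))             # 20 scalar tensors

# Serial reference: one backward pass per scalar.
serial = torch.stack([
    torch.autograd.grad(x, img_tensor, retain_graph=True)[0] for x in x_list
])                                      # shape (20, 1, 3, 8, 8)

# Passing all scalars as outputs yields the SUM of their gradients.
(summed,) = torch.autograd.grad(x_list, img_tensor, retain_graph=True)
sums_match = torch.allclose(summed, serial.sum(dim=0), atol=1e-5)

# Batched attempt: each row of the identity matrix selects one x_i.
eye = torch.eye(num_boxes)
(jac,) = torch.autograd.grad(
    xs, img_tensor, grad_outputs=eye, is_grads_batched=True, retain_graph=True
)                                       # shape (20, 1, 3, 8, 8)
batched_matches = torch.allclose(jac, serial, atol=1e-5)

print(sums_match, batched_matches, tuple(jac.shape))
```

Whether this is actually faster than the loop on a real detector probably depends on how well the ops in the network are supported by vmap, and the memory cost grows with the number of boxes, so it would need to be timed on the real model.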