Hello Alexia!

The short answer is to use `requires_grad = True` on each of your `d` input variables, one at a time, and calculate one derivative at a time for each of your `d` output variables. Then you sum the derivatives together.
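
Just as a rough sketch of that one-at-a-time recipe (the toy function `F` and the sample point here are stand-ins for your actual network and inputs, not your code):

```python
import torch

# toy stand-in for your network: a map from (x, y, z) to (u, v, w)
def F(x, y, z):
    u = x * y + z.sin()
    v = y**2 + x * z
    w = z * x.cos()
    return u, v, w

point = (torch.tensor(1.0), torch.tensor(2.0), torch.tensor(3.0))

divergence = 0.0
for i in range(3):
    # fresh leaf copies of the inputs, with requires_grad = True
    # on only the i-th input variable
    inputs = [t.detach().clone() for t in point]
    inputs[i].requires_grad_(True)
    outputs = F(*inputs)
    # one backward() run per diagonal term:  d F_i / d x_i
    outputs[i].backward()
    divergence = divergence + inputs[i].grad

print(divergence)   # d F_u / d x  +  d F_v / d y  +  d F_w / d z
```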

But …

I’m not aware of any easy way to do precisely what you want. The reason is that there is no automatic way to disentangle calculation that is shared by your `d` output variables.

In three-dimensional language, if your three output variables as functions of your three input variables are `F_u (x, y, z)`, `F_v (x, y, z)`, and `F_w (x, y, z)`, and those three functions are completely unrelated to one another, then no efficiency is lost by calculating them separately, and using three separate runs of autograd’s `backward()` to calculate `d F_u / d x`, `d F_v / d y`, and `d F_w / d z` separately.

If `F_u`, `F_v`, and `F_w` have almost all of their calculational work shared, very little efficiency will be lost by using one run of autograd’s `backward()` to calculate all three components (`x`, `y`, and `z`) of the gradient at the same time and discarding the off-diagonal elements. That is, you already had to do almost all of the calculational work needed for the off-diagonal elements, so the *incremental* cost of calculating them (and then not using them) is small.

If `F_u`, `F_v`, and `F_w` have significant shared computation, but also significant independent computation, you will have to disentangle the shared computation by hand if you want to calculate your divergence with maximum efficiency.

If you calculate the three needed partial derivatives independently (using `requires_grad = True` one at a time on each variable), you will be needlessly repeating the shared computation. But if you calculate the full gradient all at once (performing one `backward()` run with `requires_grad = True` set on all the input variables at the same time), you will be needlessly performing the off-diagonal pieces of the “independent” computation.

Note, in a typical multi-layer neural network, most of the computation that leads to your output values will be shared, so, as a practical matter, I would use autograd to calculate the full gradient all at once, sum the diagonal elements to get the divergence, and discard the off-diagonal elements. I would be wasting a little bit of effort calculating the off-diagonal elements, but, because most of the calculational cost was shared in the earlier, upstream layers of the neural network, the wasted work would likely be small.
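
For what it’s worth, here is a rough sketch of that “full gradient all at once” version, using `torch.autograd.functional.jacobian()` (rather than calling `backward()` by hand) so that the shared forward work is done only once, and then summing the diagonal of the resulting Jacobian. Again, `F` and `p` are just toy stand-ins for your network and its input:

```python
import torch

# toy stand-in for your network: maps a length-3 input vector to a
# length-3 output vector
def F(p):
    x, y, z = p
    return torch.stack([x * y + z.sin(), y**2 + x * z, z * x.cos()])

p = torch.tensor([1.0, 2.0, 3.0])

# full 3 x 3 Jacobian from a single shared forward graph
# (autograd still backpropagates once per output component)
jac = torch.autograd.functional.jacobian(F, p)

# the divergence is the trace of the Jacobian; the off-diagonal
# elements are computed but discarded
divergence = jac.diagonal().sum()
print(divergence)
```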

Good luck.

K. Frank