When I run the code above, I get the following error:
AssertionError: nn criterions don’t compute the gradient w.r.t. targets - please mark these variables as volatile or not requiring gradients.

I guess KL expects its second argument not to require gradients. But in JSD, the M term also contains variables that require gradients. Is there an easy way to deal with this, or should I write my own KLD function?
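One way around the assertion, sketched here on the assumption that both inputs are already probability distributions along the last dimension, is to build the KL term from elementwise tensor operations so that autograd tracks both arguments. The names `kl_div_manual` and `js_div_manual` and the `eps` clamp are illustrative choices, not part of the original code.

```python
import torch

def kl_div_manual(p, q, eps=1e-12):
    """KL(p || q), summed over the last dim and averaged over the rest."""
    # Clamp to avoid log(0); eps is an assumed safety margin.
    p = p.clamp_min(eps)
    q = q.clamp_min(eps)
    return (p * (p.log() - q.log())).sum(dim=-1).mean()

def js_div_manual(p, q):
    """Jensen-Shannon divergence; gradients flow through p, q and m."""
    m = 0.5 * (p + q)
    return 0.5 * (kl_div_manual(p, m) + kl_div_manual(q, m))

# Both networks' outputs require gradients, as in the question.
logits_1 = torch.randn(4, 10, requires_grad=True)
logits_2 = torch.randn(4, 10, requires_grad=True)
p = torch.softmax(logits_1, dim=-1)
q = torch.softmax(logits_2, dim=-1)

loss = js_div_manual(p, q)
loss.backward()  # populates logits_1.grad and logits_2.grad
```

Because nothing here goes through an `nn` criterion, there is no restriction on which argument requires gradients.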

Hey, here are a couple of things I thought worth mentioning. First, both code snippets compute:

total_m = 0.5 * (net_1_probs + net_1_probs)

The correct formulation is:

total_m = 0.5 * (net_1_probs + net_2_probs)


Also, based on @jeff-hykin’s and @Aryan_Asadian’s implementations, here’s mine. I find it easier to work with subclasses of nn.Module, as in @Aryan_Asadian’s implementation, because I can attach forward/backward hooks.

Note that I take the softmax before passing p and q to my JSD instance. The implementation also works with matrices, since I flatten both tensors at the start.
I’m also passing log_target=True, which means m must be in log-space. This makes the implementation slightly faster, because m.log() is computed only once. Hope this helps.
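A minimal sketch of a module along the lines described, assuming `p` and `q` are already softmax outputs with the class dimension last. The `reduction="batchmean"` setting and the use of `reshape` for flattening are assumptions made for illustration, not details confirmed by the post.

```python
import torch
import torch.nn as nn

class JSD(nn.Module):
    def __init__(self):
        super().__init__()
        # log_target=True: both input and target are log-probabilities.
        self.kl = nn.KLDivLoss(reduction="batchmean", log_target=True)

    def forward(self, p, q):
        # Flatten leading dims so that each row is one distribution.
        p = p.reshape(-1, p.size(-1))
        q = q.reshape(-1, q.size(-1))
        # Mixture in log-space, computed once and reused for both terms.
        log_m = (0.5 * (p + q)).log()
        # KLDivLoss(input, target) measures KL(target || input).
        return 0.5 * (self.kl(log_m, p.log()) + self.kl(log_m, q.log()))

jsd = JSD()
p = torch.softmax(torch.randn(2, 3, 5), dim=-1)
q = torch.softmax(torch.randn(2, 3, 5), dim=-1)
value = jsd(p, q)
```

With natural logarithms the result should lie between 0 and ln 2. Probabilities that are exactly zero would produce NaN through `log()`, so clamping may be needed in practice.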

@Amin_Jun, you did a wonderful job building on the previous answer. However, your implementation is still slightly problematic: it doesn’t guarantee that the JS divergence stays between 0 and 1. The argument order of PyTorch’s KL-divergence function is counterintuitive: KL(a, b) has to be written as torch.nn.KLDivLoss()(b, a). So the correct one should be:
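As a quick check of the argument order only (this is not a reproduction of the poster's code), the following compares the textbook definition of KL(a || b) with `F.kl_div`. It assumes two arbitrary softmax vectors `a` and `b` and the default `log_target=False`, where the first argument is expected as log-probabilities.

```python
import torch
import torch.nn.functional as F

a = torch.softmax(torch.randn(8), dim=-1)
b = torch.softmax(torch.randn(8), dim=-1)

# Textbook definition: KL(a || b) = sum_i a_i * (log a_i - log b_i)
kl_ab = (a * (a.log() - b.log())).sum()

# F.kl_div(input, target): input is log-probabilities, target is
# probabilities, and the result is KL(target || input).
kl_torch = F.kl_div(b.log(), a, reduction="sum")

same = torch.allclose(kl_ab, kl_torch, atol=1e-6)
```

So the distribution that plays the role of the "true" one goes in the second position, and the first argument must be passed through `log()`.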