Hello, I am attempting to recreate the CURE regularizer (https://arxiv.org/abs/1811.09716) and have a question about how to compute the gradient of the regularization term. Specifically, the regularizer is (as I understand the paper) ‖∇_x ℓ(x + z) − ∇_x ℓ(x)‖², where z is a step of size h in the direction sign(∇_x ℓ(x)),
and is used in conjunction with the standard cross-entropy loss of a DNN classifier. I am following the reference implementation and may be misunderstanding the key line for the gradient computation. Here is how the repo computes the regularizer:
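(The snippet itself didn't paste in, so here is a sketch of the kind of computation being asked about, based on the discussion below — the function name `cure_regularizer` and the variable names `loss_pos`/`loss_orig` are my own, not necessarily the repo's exact code:)

```python
import torch

# Hypothetical sketch (my reading of the approach, not the repo's exact code):
# compute the loss at the original and at a perturbed input, then
# differentiate the *difference of the losses* with respect to the inputs.
def cure_regularizer(model, criterion, inputs, targets, z):
    loss_pos = criterion(model(inputs + z), targets)   # loss at perturbed point
    loss_orig = criterion(model(inputs), targets)      # loss at original point
    # The line in question: torch.autograd.grad of (loss_pos - loss_orig),
    # with create_graph=True so the regularizer itself can be backpropagated.
    (grad_diff,) = torch.autograd.grad(loss_pos - loss_orig, inputs,
                                       create_graph=True)
    # Squared norm of the gradient difference, averaged over the batch.
    return grad_diff.flatten(1).norm(dim=1).pow(2).mean()
```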

Is this the correct statement of the gradient computation involved in the regularization term? At first glance, it appears to compute the gradient of the difference of the losses, rather than the difference of the gradients. But I may be misinterpreting this line completely.

Any comments on whether this is the correct way to compute the regularization term, along with potential fixes or improvements, would be greatly appreciated.

The gradient operator ∇ is itself a linear operator. This means that ∇a − ∇b = ∇(a − b).
In fact, if you write a minus function m(x1, x2) = x1 − x2 and some loss l(m) calculated from it, then the backpropagation step for this node just feeds dl/dm as grad_out into the backward pass of x1 and −dl/dm as grad_out into the backward pass of x2. So the two formulations not only agree mathematically but also computationally, with some efficiency gain when the two losses share common intermediates: the graph that computes those intermediates only has to be backpropagated through once.
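A quick numerical check of this (toy losses of my own choosing; any differentiable functions of the same input would do):

```python
import torch

# Verify the linearity claim: for two losses a and b computed from the same
# input x, grad(a - b, x) equals grad(a, x) - grad(b, x).
x = torch.randn(5, requires_grad=True)
a = (x ** 2).sum()      # stands in for the loss at the perturbed point
b = x.sin().sum()       # stands in for the loss at the original point

# Gradient of the difference: one backward pass through the combined graph.
(g_diff,) = torch.autograd.grad(a - b, x, retain_graph=True)

# Difference of the gradients: two separate backward passes.
(g_a,) = torch.autograd.grad(a, x, retain_graph=True)
(g_b,) = torch.autograd.grad(b, x)

print(torch.allclose(g_diff, g_a - g_b))  # True
```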