Potential bug in ADAM for complex parameters?

In the documentation/code of ADAM here I noticed something that seems like an error to me. Even though torch.view_as_real is called on grad, grad.conj() is called later in the computation of exp_avg_sq. Unless I have misunderstood the code, this seems wrong to me: don't we want to compute the squared magnitude of the gradient here?

Can anyone clarify if this is a bug? Here is the code in question:

        if torch.is_complex(param):
            grad = torch.view_as_real(grad)
            exp_avg = torch.view_as_real(exp_avg)
            exp_avg_sq = torch.view_as_real(exp_avg_sq)
            if amsgrad:
                max_exp_avg_sqs[i] = torch.view_as_real(max_exp_avg_sqs[i])
            param = torch.view_as_real(param)

        # Decay the first and second moment running average coefficient
        exp_avg.lerp_(grad, 1 - beta1)
        exp_avg_sq.mul_(beta2).addcmul_(grad, grad.conj(), value=1 - beta2)

Hi Boris!

I don’t really understand what is going on here, but here’s what I think:

(Note, I’ve never used Adam with complex parameters.)

Okay, if param is complex, we will view_as_real() its grad. That seems
reasonable, given what the rest of the code does.

However, if param is not complex, I don’t believe that param.grad can
ever be complex. In general, I don’t believe that pytorch permits this, and
pytorch doesn’t permit calling .backward() on a complex loss, so you
couldn’t use this as a scheme to generate a complex grad for a real param.
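
Here's a quick check of that claim (a toy snippet of mine, not anything from
the optimizer code): at least in the versions I've tried, calling .backward()
on a complex scalar loss raises an error, so a real param can't pick up a
complex grad this way.

    import torch

    # Toy check (not optimizer code): a complex loss can't be backpropagated
    # implicitly, so a real parameter can't acquire a complex grad this way.
    p = torch.tensor([1.0, 2.0], requires_grad=True)            # real parameter
    loss = (p * torch.tensor([1.0 + 1.0j, 2.0 - 1.0j])).sum()   # complex loss
    try:
        loss.backward()
    except RuntimeError as err:
        print("backward() on a complex loss raised:", err)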

Therefore, at this point, whether or not the is_complex(param) if block was
executed, grad will be real, grad.conj() is a no-op, and you’re just performing
a weighted averaging of grad**2 into exp_avg_sq.
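
A quick way to convince yourself of this (again, just a toy snippet of mine,
not the optimizer code): after view_as_real() the gradient is a plain float
tensor, so conj() leaves it unchanged and the addcmul_() term is just grad**2.

    import torch

    # After view_as_real the "gradient" is a real tensor, so conj() is a
    # no-op and grad * grad.conj() equals grad**2 element-wise.
    g_complex = torch.tensor([1.0 + 2.0j, 3.0 - 1.0j])
    g = torch.view_as_real(g_complex)            # float tensor of shape (2, 2)

    print(g.dtype)                               # torch.float32
    print(torch.equal(g.conj(), g))              # True
    print(torch.equal(g * g.conj(), g ** 2))     # True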

So it looks a bit odd, but I guess it’s correct.

Why might it be coded this way? I suspect that it’s historical. Complex autograd
was designed, to some extent, on the fly. The definition of a complex derivative
wasn’t initially consistent across the various pieces of the implementation, and
the question of differentiating a complex loss hadn’t been sorted out. I suspect
that the grad.conj() was an attempt to do the right thing in the presence of
a complex grad (that I suspect could occur – rightly or wrongly – in some
previous versions of complex autograd), and when the issue was addressed
by other means, nobody noticed (or maybe cared) that grad.conj() was
now a no-op, so it stayed in the code.

(But I could be wrong about all of this. You might want to do the experiment
of taking out the .conj() and see if anything breaks …)

Best.

K. Frank

Hi KFrank,

thank you for the detailed explanation. That would indeed explain things, and it would make the conj() a no-op. Is this documented anywhere? Because the alternative way of thinking about Adam for complex weights would be to not regard the real and imaginary parts as disjoint parameters, but to treat them as one complex parameter. In that case, one would have to estimate the variance via grad * grad.conj(), as implemented, but this is invalidated by viewing the tensors as real first.
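
To make the distinction concrete, here is a small illustrative sketch of my
own (not the actual optimizer code) of the two second-moment estimates for a
single complex gradient value:

    import torch

    g = torch.tensor([3.0 + 4.0j])

    # (a) what the current code effectively does: view as real, then
    #     square each component separately
    g_real = torch.view_as_real(g)           # tensor([[3., 4.]])
    v_decoupled = g_real * g_real            # tensor([[ 9., 16.]])

    # (b) the "one complex parameter" view: a single |g|^2 = g * conj(g)
    #     shared by the real and imaginary parts
    v_coupled = (g * g.conj()).real          # tensor([25.])

    print(v_decoupled)
    print(v_coupled)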

In fact, I found this bug discussion here, which would indeed point to this being a temporary way of disabling it?

Best,
Boris

Hi Boris!

Good find!

Digging through some of the github links, it appears that pull request 62946
initially implemented the fix with grad.conj() and then pull request 80279
reimplemented the fix with torch.view_as_real(grad), but left in the
grad.conj().

I’m still of the opinion that the grad.conj() is unnecessary (but harmless),
so I guess that the grad.conj() is indeed a historical artifact.

Best.

K. Frank

Hi K. Frank,

while I agree that the conj() call is unnecessary if the parameters are viewed as real, what I wonder is why this is done in the first place. Treating the real and imaginary parts as separate parameters seems to affect convergence and lead to subpar performance.

This post here illustrates the issue of treating complex parameters as 2d real ones: Adam (and other) optimizers miscalculate the momentum update for complex variables · Issue #30 · keras-team/tf-keras · GitHub
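
To illustrate the point from that issue with a toy sketch of my own (not code
from the issue or from torch.optim): with the per-component treatment, the
effective Adam denominator depends on the phase of the gradient, whereas with
|g|^2 it depends only on its magnitude:

    import torch

    def denom_decoupled(g):
        # second moment formed per real component (current behavior)
        gr = torch.view_as_real(g)
        return (gr * gr).sqrt()              # |Re g| and |Im g| separately

    def denom_coupled(g):
        # single second moment |g|^2 shared by both components
        return (g * g.conj()).real.sqrt()    # |g| for the whole complex number

    g1 = torch.tensor([5.0 + 0.0j])          # magnitude 5, phase 0
    g2 = torch.tensor([3.0 + 4.0j])          # magnitude 5, different phase

    print(denom_decoupled(g1), denom_decoupled(g2))   # [[5., 0.]] vs [[3., 4.]]
    print(denom_coupled(g1), denom_coupled(g2))       # [5.] and [5.]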

Thanks and best,
Boris