but only the latter gives the same result, while the former consistently leads to a performance degradation. The two statements involve the (currently) unused parameters in the same way. Is there something wrong with the first statement? (x and xb are of size 512 and 2*num_attrs, respectively, in the non-batch dimension.) The PyTorch version is 1.5.0 and the optimizer is Adam.
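For context, the two formulations under comparison are presumably along these lines (a minimal numpy sketch with made-up shapes; the output width of 64 and the exact layer names are assumptions). Note that, with shared weights, a single linear map over the concatenated input is mathematically identical to the sum of two separate linear maps:

```python
import numpy as np

rng = np.random.default_rng(0)
num_attrs = 312
x = rng.normal(size=(4, 512))             # batch of 4, 512 features
xb = rng.normal(size=(4, 2 * num_attrs))  # 2*num_attrs extra features

# Combined weight/bias for the concatenated input [x, xb]
W = rng.normal(size=(512 + 2 * num_attrs, 64))
b = rng.normal(size=(64,))

# Formulation A: one layer over the concatenated input
out_cat = np.concatenate([x, xb], axis=1) @ W + b

# Formulation B: two separate matmuls whose outputs are summed,
# splitting W into the x-part and the xb-part
W1, W2 = W[:512], W[512:]
out_sum = x @ W1 + xb @ W2 + b

assert np.allclose(out_cat, out_sum)
```

So any difference between the two versions must come from something other than the forward computation itself, e.g. how the weights are initialized.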

Good point, thanks. I missed that adding channels increases the fan-in (and so shrinks the initialization scale). However, num_attrs is 312, so math.sqrt((512 + 2*num_attrs) / 512) is just a bit less than 1.5. Trying:

still did not give the same result as in the other two cases. I eventually figured out where the difference was coming from while trying this, though: fc1b was used elsewhere, and the combined gradients helped, so not using it here had a negative impact. Thanks.
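For reference, the arithmetic behind the sqrt((512 + 2*num_attrs) / 512) correction factor discussed above can be sketched as follows (assuming a Kaiming-normal-style init with std sqrt(2 / fan_in); the exact formula depends on the chosen mode and nonlinearity):

```python
import math

num_attrs = 312
fan_in_small = 512                      # fan-in with x alone
fan_in_large = 512 + 2 * num_attrs      # fan-in with [x, xb] concatenated

# Kaiming-normal std for ReLU: sqrt(2 / fan_in)
std_small = math.sqrt(2.0 / fan_in_small)
std_large = math.sqrt(2.0 / fan_in_large)

# Multiplying the larger layer's weights by this factor restores
# the per-weight scale of the smaller layer
scale = math.sqrt(fan_in_large / fan_in_small)
print(scale)  # ~1.49, i.e. "just a bit less than 1.5"

assert abs(std_large * scale - std_small) < 1e-12
```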