Why does PyTorch print "[W accumulate_grad.h:170] Warning: grad and param do not obey the gradient layout contract. This is not an error, but may impair performance."?

I’m confused about the meaning of “performance” in this context. Does it refer to the model’s overall accuracy, or specifically its processing speed?

I noticed that calling tensor.contiguous() after permute eliminates the warning, but it also makes the model run significantly slower. What explains this trade-off?
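
For reference, the pattern I mean looks roughly like this (the shapes are just illustrative):

```python
import torch

x = torch.randn(8, 64, 128)
y = x.permute(0, 2, 1)   # permute returns a non-contiguous view (strides are reordered)
y = y.contiguous()       # materializes a contiguous copy; silences the warning but costs an extra copy
```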

"Performance" here means speed, not model convergence/accuracy.
You should not see any performance difference between manually calling contiguous() and letting the reducer do it, as explained here.
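
If you want to see which parameters are involved in your own model, a minimal sketch like the one below (using a toy nn.Linear as a stand-in) compares each parameter's strides with its gradient's strides; parameters where they differ are the ones the gradient layout contract warning refers to.

```python
import torch
import torch.nn as nn

# Toy model; substitute your own module here.
model = nn.Linear(4, 3)
loss = model(torch.randn(2, 4)).sum()
loss.backward()

for name, param in model.named_parameters():
    if param.grad is not None:
        # For dense, non-overlapping params the contract expects the grad
        # to have the same strides as the param itself.
        same_layout = param.grad.stride() == param.stride()
        print(f"{name}: param strides {param.stride()}, "
              f"grad strides {param.grad.stride()}, match={same_layout}")
```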