Why does PyTorch print "[W accumulate_grad.h:170] Warning: grad and param do not obey the gradient layout contract. This is not an error, but may impair performance."?

I’m confused about the meaning of “performance” in this context. Does it refer to the model’s overall accuracy, or specifically its processing speed?

I noticed that calling tensor.contiguous() after permute eliminates the warning, but it also makes the model run significantly slower. What explains this trade-off?
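
For reference, the pattern I mean looks roughly like this (the shapes are just illustrative):

```python
import torch

x = torch.randn(8, 64, 128)
y = x.permute(0, 2, 1)   # permute returns a non-contiguous view (strides are reordered)
y = y.contiguous()       # materializes a contiguous copy; silences the warning but costs an extra copy
```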

"Performance" here means speed, not model convergence/accuracy.
You should not see any performance difference between manually calling contiguous() and letting the reducer do it, as explained here.
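
If you want to see which parameters are involved in your own model, a minimal sketch like the one below (using a toy nn.Linear as a stand-in) compares each parameter's strides with its gradient's strides; parameters where they differ are the ones the gradient layout contract warning refers to.

```python
import torch
import torch.nn as nn

# Toy model; substitute your own module here.
model = nn.Linear(4, 3)
loss = model(torch.randn(2, 4)).sum()
loss.backward()

for name, param in model.named_parameters():
    if param.grad is not None:
        # For dense, non-overlapping params the contract expects the grad
        # to have the same strides as the param itself.
        same_layout = param.grad.stride() == param.stride()
        print(f"{name}: param strides {param.stride()}, "
              f"grad strides {param.grad.stride()}, match={same_layout}")
```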