Why does PyTorch prefer using NCHW?

IIRC, NHWC allows a much better implementation of convolution layers, which gives a nice boost in perf (because the values for all channels are accessed for every pixel, data locality is much better in that format).
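To make the layout difference concrete, here is a small sketch (my addition, using PyTorch's actual `channels_last` memory format): the logical shape stays NCHW, but the strides show that in NHWC the channel dimension becomes the fastest-varying one, so all channel values for a given pixel sit next to each other in memory.

```python
import torch

# The default 4D layout in PyTorch is contiguous NCHW.
x = torch.randn(8, 3, 224, 224)
print(x.stride())  # (150528, 50176, 224, 1): channel stride is H*W

# Same logical shape, but stored as NHWC under the hood.
y = x.to(memory_format=torch.channels_last)
print(y.shape)     # torch.Size([8, 3, 224, 224]) -- shape is unchanged
print(y.stride())  # (150528, 1, 672, 3): channel stride is 1
```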
The problem is that batchnorm-like layers are so much faster in NCHW that vision models nowadays do not perform that much better in NHWC (for these layers you reduce over the N, H and W dimensions per channel, so having C last breaks the data locality).
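For comparison, a minimal sketch of that batchnorm-style access pattern (my wording, using standard PyTorch reductions):

```python
import torch

x = torch.randn(8, 3, 224, 224)  # contiguous NCHW
# Batchnorm statistics are computed per channel, reducing over N, H and W.
mean = x.mean(dim=(0, 2, 3))               # shape (3,)
var = x.var(dim=(0, 2, 3), unbiased=False) # shape (3,)
# In NCHW each channel is a contiguous H*W block, so this reduction streams
# through memory; in NHWC the same channel's elements are strided by C.
print(mean.shape, var.shape)  # torch.Size([3]) torch.Size([3])
```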

But for the original question, I would agree with @ptrblck: mostly for historical reasons, and because there is no clear reason that the other format is better in general (at least not enough to be worth rewriting all the TH/THC libraries!).
