Depth first in 3D Convs?

I noticed that the PyTorch documentation gives the input dimensions of 3D images as (depth, height, width). In contrast, the TF documentation refers to these dimensions as (dim1, dim2, dim3). Perhaps it doesn't matter, but I was wondering whether there is any difference (e.g., in results or efficiency) between using (depth, height, width) and (height, width, depth), since the latter is more natural (at least for me).


PyTorch uses the channels-first layout by default. You can switch to channels-last, which can speed up these operations if you are using mixed-precision training and your GPU has Tensor Cores. Note that the user-facing shapes are still reported in the default channels-first layout, as described in the tutorial.
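Here is a minimal sketch of what that looks like for a 3D conv. The tensor sizes, channel counts, and variable names are made up for illustration; the only assumption about PyTorch is that `torch.channels_last_3d` is the memory format for 5D inputs. It runs on CPU, so it only shows the layout change, not any speedup.

```python
import torch

# Hypothetical sizes for illustration: (N, C, D, H, W)
x = torch.randn(2, 3, 8, 16, 16)
conv = torch.nn.Conv3d(in_channels=3, out_channels=4, kernel_size=3, padding=1)

# Switch the physical memory layout to channels-last (NDHWC in memory).
x_cl = x.to(memory_format=torch.channels_last_3d)

# The user-facing shape is unchanged; only the strides differ.
default_strides = x.stride()         # channels-first: W is the fastest-moving dim
channels_last_strides = x_cl.stride()  # channels-last: C is the fastest-moving dim

# The conv is called the same way and gives the same values (up to float error).
out_default = conv(x)
out_cl = conv(x_cl)
same_result = torch.allclose(out_default, out_cl, atol=1e-5)
```

In both cases you still index the tensor as (N, C, D, H, W), so the layout choice should only affect performance, not how you write the model.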