Why use DxHxW for 3D input data instead of HxWxD?

For a 3D volume, Conv3d expects its input as NxCxDxHxW. Why not use NxCxHxWxD?

When the size is the same along all three axes, this only amounts to transposing the input anyway.
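To make the transpose point concrete, here is a minimal sketch, assuming a batch of volumes stored as NxCxHxWxD. The tensor sizes and the names `x_hwd`, `w_hwd`, `x_dhw`, `w_dhw` are made up for illustration. It moves the D axis in front of H and W with `permute`, and checks that convolving in either axis order gives the same values once the output axes are permuted back (the weight has to be permuted the same way as the input).

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

# Hypothetical batch of volumes stored as N x C x H x W x D (sizes are arbitrary).
x_hwd = torch.randn(2, 3, 8, 10, 6)
# Weight laid out to match x_hwd: out_channels x in_channels x kH x kW x kD.
w_hwd = torch.randn(4, 3, 3, 3, 3)

# Move the last axis (D) in front of H and W: N x C x D x H x W.
x_dhw = x_hwd.permute(0, 1, 4, 2, 3).contiguous()
w_dhw = w_hwd.permute(0, 1, 4, 2, 3).contiguous()

# conv3d treats the last three axes as the spatial axes, whatever we call them.
out_dhw = F.conv3d(x_dhw, w_dhw)  # N x O x D' x H' x W'
out_hwd = F.conv3d(x_hwd, w_hwd)  # N x O x H' x W' x D'

# Permuting the DHW result back to HWD order matches up to float rounding.
same = torch.allclose(out_dhw.permute(0, 1, 3, 4, 2), out_hwd, atol=1e-4)
print(tuple(out_dhw.shape), tuple(out_hwd.shape), same)
```

So numerically the axis order is just a naming convention here, as long as input and kernel are permuted consistently.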

What would be the advantage of NCHWD?

It’s probably more intuitive since volumes are usually in the format HWD. In TF they simply specify (dim_1, dim_2, dim_3), so I can’t see the reason for using DHW…

Could you explain more about the “usual” format? I.e. does a specific library return volumes as HWD or is there any other convention? I’m currently seeing it more as CHW (PIL) vs. HWC (OpenCV), which might be annoying sometimes but both memory layouts have certain advantages. I’m not sure what the advantage of HWD vs. DHW would be. E.g. is there any expected performance gain using your layout?
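For anyone wondering what the two layouts mean in memory, here is a small pure-Python sketch, assuming plain row-major (C-contiguous) storage. The helper `contiguous_strides` and the volume size (D=4, H=5, W=6) are hypothetical, chosen only for illustration. It says nothing about which layout is faster, it only shows which axis ends up adjacent in memory.

```python
def contiguous_strides(shape):
    """Element strides of a row-major (C-contiguous) array with this shape."""
    strides = []
    step = 1
    # Walk the axes from last to first: the last axis always has stride 1.
    for size in reversed(shape):
        strides.append(step)
        step *= size
    return tuple(reversed(strides))

# Hypothetical volume with D=4 slices of H=5 rows and W=6 columns.
dhw = contiguous_strides((4, 5, 6))  # axes: D, H, W
hwd = contiguous_strides((5, 6, 4))  # axes: H, W, D

print(dhw)  # (30, 6, 1)
print(hwd)  # (24, 4, 1)
```

With DHW, each 2D slice is one contiguous block of H*W elements and neighbours along W are adjacent. With HWD, neighbours along D are adjacent and the elements of a single HxW slice are spread out with a step of D.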