GPU out of memory due to large memory allocation

You are correct that in theory the memory usage should scale linearly. However, your examples cross a cuDNN support boundary: the smaller input (563x1024x1024x3 ≈ 1.77 billion elements) stays below INT_MAX, while the larger one exceeds INT_MAX (2,147,483,647 elements). Since cuDNN currently doesn’t support inputs with more than INT_MAX elements, the larger workload is dispatched to a native “im2col”-style implementation instead, which allocates much more memory to materialize the “col” tensor. We’ve requested support from cuDNN for these cases but don’t have an estimated completion date yet.
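To make the boundary concrete, here is a minimal sketch (plain Python, assuming NCHW layout, a hypothetical 3x3 kernel, stride 1, and same-size output, none of which are specified above) that counts input elements against INT_MAX and estimates how large the “col” buffer of an im2col-style fallback would be:

```python
INT_MAX = 2**31 - 1  # cuDNN's per-tensor element limit discussed above

def conv_element_counts(n, c, h, w, kernel_size, out_h=None, out_w=None):
    """Rough element counts for a conv2d input and for the "col" buffer
    an im2col-style fallback would materialize.
    Assumes stride 1 and same-size output unless out_h/out_w are given."""
    out_h = out_h if out_h is not None else h
    out_w = out_w if out_w is not None else w
    input_elems = n * c * h * w
    # im2col unfolds one kernel patch per output position,
    # each of length c * kH * kW, so the "col" buffer is much larger
    # than the input itself.
    col_elems = n * c * kernel_size * kernel_size * out_h * out_w
    return input_elems, col_elems

# Hypothetical example based on the smaller shape above (563x1024x1024x3),
# written as NCHW; the 3x3 kernel is an assumption for illustration.
inp, col = conv_element_counts(563, 3, 1024, 1024, kernel_size=3)
print(f"input elements: {inp:,} (exceeds INT_MAX: {inp > INT_MAX})")
print(f"col buffer elements: {col:,} (~{col * 4 / 1024**3:.1f} GiB in fp32)")
```

Once the input element count crosses INT_MAX, the cuDNN path is unavailable, and the extra “col” buffer (tens of GiB in this sketch) is what triggers the out-of-memory error rather than the input tensor itself.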

See, e.g., this upstream issue for more details: