Memory problem of testing large 8K images on GPUs

Thanks for noticing this post!

I have an image denoising network (a CNN) that takes an image of size 1*H*W as input and produces a smoothed (denoised) image of the same size 1*H*W. I have four NVIDIA RTX 3090 GPUs and 100 large 8K images of size 1*7680*4320 that I want to run a test on. However, the memory of a single GPU does not seem to be enough for such a high resolution. Below are my requirements and some efforts/questions:

Q1. I want to test all 100 of the large 8K images, i.e., every image should be passed through my network to obtain its denoised (smoothed) version.
Q2. The test on the GPU seems to require a lot of memory (maybe about 40~80GB). One 3090 GPU only has about 24GB of memory, so I cannot run the test on a single 3090.
Q3. I have tried running the test on the CPU. Although this avoids the memory problem, it is far too slow (about 10 minutes per image). Therefore, I would like to run the test on the GPUs.
Q4. The images should not be split into smaller patches or blocks. In other words, I want to pass the complete (whole) 8K images through my network.
Q5. I have activated all four 3090 GPUs and tried torch.nn.DataParallel. However, it seems that only the first GPU is utilized. Is there a way to make full use of all four 3090 GPUs to support testing the large 8K images?

I have also tried torch.cuda.empty_cache(), enabled torch.no_grad(), and so on. However, the memory is still not enough.
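For reference, my test code looks roughly like the following simplified sketch (`load_image` and `save_image` are placeholders for my actual I/O code):

```python
import torch

model = model.eval().cuda()            # the denoising CNN

with torch.no_grad():                  # no autograd bookkeeping during the test
    for path in image_paths:           # the 100 8K images
        img = load_image(path)         # tensor of shape (1, 1, 7680, 4320)
        out = model(img.cuda())        # runs out of memory here
        save_image(out.cpu(), path)    # move the result back to host memory
        torch.cuda.empty_cache()       # tried this, but it does not help
```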

I do not know whether there is some mechanism to test one image across the four GPUs, or to perform the GPU calculations with the help of local host memory.

My mind may be in a muddle now, so the points above may not be well organized.
Any suggestion is welcome.
I am still searching, thinking, and trying …

Thanks a lot for reading such a long post!

DataParallel will clone the model to all available devices and will not reduce the memory requirement, so I’m unsure why a) only a single device was used and b) why it was working at all, since you were previously running out of memory.

Calling empty_cache() won’t reduce the GPU memory usage and will instead only slow down your code. This operation frees the PyTorch-internal CUDA cache so that other processes can use the GPU memory. PyTorch itself will then need to call the expensive (since synchronizing) cudaMalloc again to re-allocate the previously freed memory.
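You can see this behavior directly by comparing the allocated and reserved memory stats (a small standalone example, not your model):

```python
import torch

# ~1 GB tensor: memory_allocated() tracks live tensors,
# memory_reserved() tracks the caching allocator
x = torch.randn(1024, 1024, 256, device='cuda')
print(torch.cuda.memory_allocated(), torch.cuda.memory_reserved())

torch.cuda.empty_cache()     # x is still alive, so nothing can be released
print(torch.cuda.memory_allocated(), torch.cuda.memory_reserved())

del x                        # the memory is returned to the cache...
torch.cuda.empty_cache()     # ...and only now given back to the driver
print(torch.cuda.memory_allocated(), torch.cuda.memory_reserved())
```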

You could try pipeline parallelism, e.g. via pytorch/tau, look into FSDP and CPU offloading, or use checkpointing via torch.utils.checkpoint to trade compute for memory.
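To give an idea of the last suggestion, here is a minimal checkpointing sketch (the layers are just stand-ins for your actual denoising CNN): the checkpointed segments do not store their intermediate activations and recompute them during the backward pass instead.

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

class DenoiseNet(nn.Module):
    # stand-in architecture; replace the stages with your real blocks
    def __init__(self):
        super().__init__()
        self.stage1 = nn.Sequential(nn.Conv2d(1, 32, 3, padding=1), nn.ReLU())
        self.stage2 = nn.Sequential(nn.Conv2d(32, 32, 3, padding=1), nn.ReLU())
        self.head = nn.Conv2d(32, 1, 3, padding=1)

    def forward(self, x):
        # checkpointed stages trade compute for memory: their activations
        # are recomputed in the backward pass instead of being kept around
        x = checkpoint(self.stage1, x)
        x = checkpoint(self.stage2, x)
        return self.head(x)
```

Note that checkpointing only saves memory when gradients are computed; in a pure inference run under no_grad() no activations are stored for the backward pass anyway.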
