I am currently working on a model that takes as input a large volume of shape (1, 200, 300, 300), where the first axis is the channel dimension and the remaining axes are the width, height, and depth of the volume, respectively. The objective is to produce a segmentation map of the same shape as the input.
However, I am facing challenges due to the substantial memory requirements of the activation maps. Even with downsampling and upsampling in the network, the first and last activation maps together come to approximately 12 GB, which exhausts the memory of a single GPU.
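To put rough numbers on where the memory goes (the 64 feature channels and float32 dtype below are assumed example values, not my actual architecture):

```python
# Back-of-the-envelope memory for one activation map at full resolution.
# channels = 64 is an assumed example value, not my actual layer width.
channels = 64
voxels = 200 * 300 * 300          # spatial size of the input volume
bytes_per_val = 4                 # float32

one_map_gb = channels * voxels * bytes_per_val / 1024**3
print(f"{one_map_gb:.1f} GB per activation map")  # ~4.3 GB
```

Two such full-resolution maps (first and last layer) already come to roughly 8.6 GB, and keeping them for the backward pass roughly doubles that, so figures in the 12 GB range are easy to reach.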
While I am aware of patch-based training, I would prefer to explore alternative approaches for several reasons. Layer-wise model parallelism looks promising at first, but it seems highly inefficient here: at any given moment only one GPU would be doing work, applying a handful of convolutions to these large volumes while the others sit idle.
Ideally, I would like to split the input tensor spatially across multiple GPUs, process the pieces in parallel, and then gather the results back into a single tensor. However, I have not found any library or framework that provides this functionality out of the box.
As an example, let’s consider applying a 3x3 convolutional filter with stride 1 and padding 1 to a 1x10x10 image. We can split the image into four overlapping chunks of shape 1x6x6: each 5x5 quadrant plus a one-pixel halo along its interior edges, so the convolution produces correct values at the chunk boundaries (the exterior edges are handled by the usual zero padding). Each chunk goes to a different GPU, every GPU applies the same conv filter to its chunk, and afterward the four 5x5 outputs are stitched back into a single 10x10 tensor.
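A minimal CPU sketch of this halo-based split-and-stitch, with plain NumPy standing in for the per-GPU convolutions (the helper names and chunk layout are my own, just to illustrate the idea):

```python
import numpy as np

def conv3x3_valid(img, kernel):
    """3x3 'valid' cross-correlation, stride 1 (as in deep-learning convs)."""
    h, w = img.shape
    out = np.zeros((h - 2, w - 2))
    for i in range(h - 2):
        for j in range(w - 2):
            out[i, j] = np.sum(img[i:i + 3, j:j + 3] * kernel)
    return out

rng = np.random.default_rng(0)
img = rng.standard_normal((10, 10))
kernel = rng.standard_normal((3, 3))

# Four 6x6 chunks: each 5x5 quadrant plus a 1-pixel halo on its interior edges.
tl, tr = img[0:6, 0:6], img[0:6, 4:10]
bl, br = img[4:10, 0:6], img[4:10, 4:10]

# Zero-pad only the exterior edges (the halo already covers interior edges),
# then run a 'valid' conv: each chunk yields its 5x5 output quadrant.
pad = lambda a, t, b, l, r: np.pad(a, ((t, b), (l, r)))
out_tl = conv3x3_valid(pad(tl, 1, 0, 1, 0), kernel)
out_tr = conv3x3_valid(pad(tr, 1, 0, 0, 1), kernel)
out_bl = conv3x3_valid(pad(bl, 0, 1, 1, 0), kernel)
out_br = conv3x3_valid(pad(br, 0, 1, 0, 1), kernel)

# Stitch the quadrants back together and compare against the full-image conv.
stitched = np.block([[out_tl, out_tr], [out_bl, out_br]])
reference = conv3x3_valid(np.pad(img, 1), kernel)  # 'same' conv on full image
assert np.allclose(stitched, reference)
```

The same bookkeeping generalizes to 3D volumes and to each conv layer of the network; the catch is that with a stack of 3x3 convolutions the halo grows by one voxel per layer, so the chunks must either overlap more or exchange boundary values between layers.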
I would greatly appreciate any guidance or suggestions on how to address this problem more effectively. Thank you in advance for your assistance.