BatchNorm causes "CUDA out of memory"

Hi. My code works well with a small batch size on a single GPU. However, when training with a larger batch size on multiple GPUs, an error is raised in the BN layers, e.g., “CUDA out of memory. Tried to allocate 418.00 MiB (GPU 0; 11.91 GiB total capacity; 11.13 GiB already allocated; 194.56 MiB free; 11.18 GiB reserved in total by PyTorch)”.

Besides, for several reasons, neither a smaller batch size nor other normalization methods (e.g., LayerNorm) are suitable for my use case.

Is there any advice on how to fix this error? Thank you in advance.



Can you give more details about how you are training on multiple GPUs?

For training on multiple GPUs, one option is nn.DataParallel, where each batch of input data is split across the GPUs; after the forward pass, the outputs are gathered and the gradients are reduced on a single GPU (GPU 0 by default), which is often why that GPU runs out of memory first. You can refer to this tutorial link.
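A minimal sketch of that setup (the model and input shapes here are just placeholders, not your actual code): wrap the model in nn.DataParallel so each forward pass splits the batch along dimension 0 across the available GPUs. Note that each replica's BatchNorm then normalizes only over its own sub-batch.

```python
import torch
import torch.nn as nn

# Placeholder model containing a BN layer; substitute your own module here.
model = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1),
    nn.BatchNorm2d(16),  # each GPU replica normalizes its own sub-batch
    nn.ReLU(),
)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
# DataParallel replicates the model on every visible GPU; outputs are
# gathered back on the default device (GPU 0), which also holds the
# replicas' gradients after backward. On a CPU-only machine it simply
# runs the wrapped module unchanged.
model = nn.DataParallel(model).to(device)

x = torch.randn(8, 3, 32, 32, device=device)  # batch dim 0 is split across GPUs
out = model(x)
print(out.shape)  # torch.Size([8, 16, 32, 32])
```

Because the gather step concentrates activations on GPU 0, that device needs more memory headroom than the others, which matches the OOM on GPU 0 in your error message.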
