CUDA out of memory with 8 GPUs using NVIDIA apex

Hello all,

I have a server with 8 x GeForce GTX 1080 Ti GPUs and I am trying to implement distributed training based on this GitHub example: https://github.com/NVIDIA/apex/blob/master/examples/imagenet/main_amp.py

This was meant for ImageNet, but I made a few modifications for CIFAR-10:

  1. I had to eliminate this function because I couldn’t make it work with CIFAR-10, so I skipped it (a possible CIFAR-10 version is sketched after this list):

def fast_collate(batch, memory_format):
    # batch is a list of (PIL image, label) pairs
    imgs = [img[0] for img in batch]
    targets = torch.tensor([target[1] for target in batch], dtype=torch.int64)
    w = imgs[0].size[0]
    h = imgs[0].size[1]
    tensor = torch.zeros((len(imgs), 3, h, w), dtype=torch.uint8).contiguous(memory_format=memory_format)
    for i, img in enumerate(imgs):
        nump_array = np.asarray(img, dtype=np.uint8)
        if nump_array.ndim < 3:
            nump_array = np.expand_dims(nump_array, axis=-1)
        nump_array = np.rollaxis(nump_array, 2)  # HWC -> CHW
        tensor[i] += torch.from_numpy(nump_array)
    return tensor, targets
  2. I changed the model to EfficientNet from https://www.kaggle.com/hmendonca/efficientnet-cifar-10-ignite

  3. Batch size is 125 and image size is 224, which is not excessive for 8 GPUs.

  4. When I train, I get this error on ALL GPUs:

RuntimeError: CUDA out of memory. Tried to allocate 64.00 MiB (GPU 0; 10.92 GiB total capacity; 10.33 GiB already allocated; 43.50 MiB free; 43.60 MiB cached)
Traceback (most recent call last):
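For reference, here is a sketch of what a CIFAR-10 version of fast_collate could look like. This is an assumption on my part: it supposes the torchvision CIFAR10 dataset returns either plain PIL images or already-converted CHW tensors, depending on whether ToTensor() is in the transform pipeline.

import numpy as np
import torch

def fast_collate_cifar10(batch, memory_format):
    # batch: list of (image, label) pairs from torchvision.datasets.CIFAR10.
    # Handles both cases: the transforms leave PIL images, or they already
    # produce CHW tensors (e.g. because ToTensor() is included).
    imgs = [sample[0] for sample in batch]
    targets = torch.tensor([sample[1] for sample in batch], dtype=torch.int64)

    first = imgs[0]
    if isinstance(first, torch.Tensor):
        c, h, w = first.shape
        tensor = torch.zeros((len(imgs), c, h, w), dtype=first.dtype).contiguous(memory_format=memory_format)
        for i, img in enumerate(imgs):
            tensor[i].copy_(img)
    else:
        w, h = first.size  # PIL size is (width, height); 32x32 unless resized
        tensor = torch.zeros((len(imgs), 3, h, w), dtype=torch.uint8).contiguous(memory_format=memory_format)
        for i, img in enumerate(imgs):
            arr = np.asarray(img, dtype=np.uint8)
            if arr.ndim < 3:              # grayscale safety net
                arr = np.expand_dims(arr, axis=-1)
            arr = np.rollaxis(arr, 2)     # HWC -> CHW
            tensor[i] += torch.from_numpy(arr)
    return tensor, targets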

I believe the batch size should be divided between the 8 GPUs. If that is right, why are they running out of memory, and how can I fix it? I don’t see the advantage of using apex if I am still getting this kind of error.

Thanks in advance

Which opt_level are you using?
Does this batch size fit without using amp? If not, do you know the memory requirement for this particular setup?

Currently I am using opt_level O3.
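The relevant amp setup in my script looks roughly like this (a minimal sketch following the apex amp API; the model and optimizer here are just placeholders):

import torch
import torchvision.models as models
from apex import amp

model = models.resnet50().cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)

# O3 is "pure" FP16; O1 is the usually recommended mixed-precision mode
# and is worth trying if O3 misbehaves.
model, optimizer = amp.initialize(model, optimizer, opt_level="O3",
                                  keep_batchnorm_fp32=True)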

I started with one GPU, no distributed training, and I could only train with

image size = 112 (half of 224)

batch_size ≈ 60 (roughly half of the 125 I want)

but now I want to train with image size 224 and batch size 125, and it can’t be that this won’t run on 8 GPUs.

I have another server with two Tesla V100s (32 GB each), and it is actually running right now. It is weird, though: I specified the batch size to be 125, yet each epoch shows 200 iterations.

python -m torch.distributed.launch --nproc_per_node=2 main_amp_cifar10.py -a resnet50 --b 125 --workers 4 --opt-level O3 ./

Epoch: [2][33/200] Time 0.048 (0.050) Speed 5225.336 (5015.339) Loss 0.8833006620 (0.9562) Prec@1 69.200 (66.533) Prec@5 96.800 (96.921)
Epoch: [2][34/200] Time 0.048 (0.050) Speed 5227.816 (5021.342) Loss 0.8718330264 (0.9537) Prec@1 67.600 (66.565) Prec@5 99.600 (97.000)
Epoch: [2][35/200] Time 0.048 (0.050) Speed 5216.194 (5026.707) Loss 1.0144102573 (0.9555) Prec@1 66.000 (66.549) Prec@5 97.200 (97.006)
Epoch: [2][36/200] Time 0.048 (0.050) Speed 5222.039 (5031.935) Loss 0.9020771980 (0.9540) Prec@1 70.400 (66.656) Prec@5 96.400 (96.989)
Epoch: [2][37/200] Time 0.048 (0.050) Speed 5231.011 (5037.116) Loss 0.9450572729 (0.9537) Prec@1 65.600 (66.627) Prec@5 95.200 (96.941)
Epoch: [2][38/200] Time 0.048 (0.050) Speed 5232.084 (5042.061) Loss 1.0129337311 (0.9553) Prec@1 63.200 (66.537) Prec@5 97.200 (96.947)
Epoch: [2][39/200] Time 0.048 (0.050) Speed 5223.324 (5046.551) Loss 0.9215689898 (0.9544) Prec@1 66.400 (66.533) Prec@5 98.000 (96.974)
Epoch: [2][40/200] Time 0.048 (0.049) Speed 5218.380 (5050.709) Loss 1.0108802319 (0.9559) Prec@1 64.400 (66.480) Prec@5 96.000 (96.950)
Epoch: [2][41/200] Time 0.048 (0.049) Speed 5181.661 (5053.824) Loss 0.9646931887 (0.9561) Prec@1 65.200 (66.449) Prec@5 96.400 (96.937)
Epoch: [2][42/200] Time 0.048 (0.049) Speed 5183.026 (5056.825) Loss 0.8261133432 (0.9530) Prec@1 70.000 (66.533) Prec@5 96.800 (96.933)
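As a quick sanity check on that 200 (my assumption being that --b is the per-process batch size and that the sampler splits CIFAR-10’s 50,000 training images across the two processes):

train_images = 50_000   # CIFAR-10 training set size
batch_per_gpu = 125     # --b
num_gpus = 2            # --nproc_per_node

iters_per_epoch = train_images // (batch_per_gpu * num_gpus)
print(iters_per_epoch)  # 200 -- matches the [x/200] in the log above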

So, in conclusion:

it doesn’t run with 8 x GTX 1080 Ti (11 GB each)

it runs with two Tesla V100-PCIE (32 GB each)

Hi, I happen to be using the EfficientNet model, so let me answer that question. I’ll tell you a rule of thumb for computing video memory.

efficientnet-b0 has 5.3M parameters.
For details, please see https://github.com/Lornatang/EfficientNet/blob/master/README.md
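To illustrate that rule with a rough back-of-the-envelope sketch (it only counts weights, gradients, and SGD momentum buffers in FP32, and ignores activations, which are typically what actually blows up the memory at batch 125 with 224x224 inputs):

params = 5.3e6        # efficientnet-b0 parameter count
bytes_per_value = 4   # FP32

weights   = params * bytes_per_value
gradients = params * bytes_per_value
momentum  = params * bytes_per_value   # one extra buffer for SGD with momentum

total_mib = (weights + gradients + momentum) / 1024**2
print(f"{total_mib:.0f} MiB")  # ~61 MiB -- the parameters themselves are tiny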


Thanks… that’s why I was using 8 GPUs, but I still wasn’t able to run it.