Running out of memory with 8 GPUs?


I am training a model using the UNet++ architecture… I have large images that I crop down to make this more manageable, but I feel I should be able to handle more data using the DataParallel setup.

Per the title, I am running on eight 8 GB GPUs on a cluster. I am trying to run a model with an image size of 1024×1024 and a batch size of 3, and I get an out-of-memory error. I estimate that with these settings I would need about 3 GB per batch, not including all the layers in the model. Still, I have ~50-60 GB of memory to play with…
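For reference, a quick back-of-the-envelope check (assuming 3-channel float32 inputs, which the post doesn't specify) shows the input batch itself is tiny; the bulk of that estimated 3 GB would come from intermediate activations and gradients in the network:

```python
# Rough size of the input batch alone (assumption: 3-channel float32 images).
batch, channels, h, w = 3, 3, 1024, 1024
bytes_per_float32 = 4
input_mb = batch * channels * h * w * bytes_per_float32 / 1024**2
print(f"input batch: {input_mb:.0f} MB")  # prints "input batch: 36 MB"
# The remaining memory is consumed by activations, gradients, and
# optimizer state, which scale with the depth/width of the model.
```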

What is going on?

I don’t quite understand this statement, as it seems you expect to use these 8 GPUs as “one” large device?
DataParallel executes a full copy of the model on each device, so the per-GPU memory limitation stays the same; what you gain is the ability to increase the “global” batch size.
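A minimal sketch of this behavior (the tiny conv layer is a hypothetical stand-in for UNet++): `nn.DataParallel` scatters the global batch across the visible GPUs, but each replica still holds the entire model and its activations, so an input that does not fit on a single 8 GB card will still not fit:

```python
import torch
import torch.nn as nn

# Hypothetical toy model standing in for UNet++.
model = nn.Conv2d(3, 8, kernel_size=3, padding=1)

if torch.cuda.device_count() > 1:
    # Replicates the FULL model on every visible GPU and scatters the batch:
    # each GPU processes batch_size // n_gpus samples, but per-GPU memory
    # for the model and its activations is unchanged.
    model = nn.DataParallel(model).cuda()

device = next(model.parameters()).device
x = torch.randn(6, 3, 64, 64, device=device)  # "global" batch of 6
out = model(x)  # with 2 GPUs, each replica would see a chunk of 3 samples
print(out.shape)
```

In other words, multiple GPUs let you raise the global batch size (more samples per step), not the per-sample resolution a single GPU can handle.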

Thanks for that clarification.