Thank you for the additional information, @JamesOwers.
So your error message is very telling:
It says that you have 11GB (!) free and it can’t allocate 5MB - that makes no sense.
See this discussion where I tried to diagnose the non-contiguous memory, only to discover that nvidia will re-allocate fragmented pages of at least 2MB to make contiguous memory. So unless your code somehow allocates memory in a way that consumes only a tiny fraction of each 2MB page, thereby fragmenting the full 12GB of RAM, this shouldn't really happen.
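If it helps the diagnostics, here is a minimal sketch (assuming pytorch>=1.4, where `memory_reserved` exists - older versions call it `memory_cached`) for dumping the allocator's view of a card right before the failing allocation - a large gap between allocated and reserved would hint at fragmentation:

```python
import torch

def report_gpu_mem(device=0):
    # allocated = memory held by live tensors
    # reserved  = memory held by the caching allocator, including fragmented blocks
    alloc = torch.cuda.memory_allocated(device) / 2**20
    reserved = torch.cuda.memory_reserved(device) / 2**20
    print(f"device {device}: allocated={alloc:.1f}MB reserved={reserved:.1f}MB")
    # per-block breakdown - handy for fragmentation questions
    print(torch.cuda.memory_summary(device, abbreviated=True))
```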
So a few things I’d like to suggest in no particular order:
- catch that failure and add a sleep so that the program doesn't exit at the point of failure, then check what `nvidia-smi` says about that card's RAM - what is the reported used/free memory there. This is to double-check whether there is something wrong with the card and it reports wrong numbers (see the sketch after this list).
- Since you said it happens 5% of the time, did you observe whether it perhaps happens with the same specific card? i.e. again a faulty card?
- can you reliably reproduce it when you hit that 5% situation?
- reduce your variable's size by, say, half - does it fit into memory? if not, halve it again, and so on - see what fits
- when that error happens, can you catch it and then try to allocate a simple large tensor, say `torch.zeros()`, of a few GBs? e.g. `torch.ones((n*2**18)).cuda().contiguous()`, where `n` is the number of desired MBs - and adjust `cuda()` to the appropriate `to(...)` to match your setup if needed (a sketch follows this list)
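To make the first and the last bullets more concrete, here is a rough, untested sketch - `train_step()` is just a placeholder for whatever raised the original OOM in your program, and the MB sizes are arbitrary:

```python
import time
import torch

def probe_alloc(n_mb, device="cuda"):
    """Try to allocate roughly n_mb MBs (float32: n_mb * 2**18 elements)."""
    try:
        t = torch.ones((n_mb * 2**18,), device=device).contiguous()
        del t
        torch.cuda.empty_cache()
        return True
    except RuntimeError as e:  # CUDA OOM surfaces as a RuntimeError
        print(f"probe of {n_mb}MB failed: {e}")
        return False

try:
    train_step()  # placeholder: whatever hit the OOM in your code
except RuntimeError as e:
    print(f"OOM caught: {e}")
    # probe with progressively smaller tensors to see what still fits
    for n_mb in (8192, 4096, 2048, 1024, 512):
        if probe_alloc(n_mb):
            print(f"a {n_mb}MB allocation still works")
            break
    # keep the process alive so you can inspect nvidia-smi for this card
    print("sleeping - run nvidia-smi now and compare its used/free numbers")
    time.sleep(600)
```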
My feeling is that your array of cards has a faulty card. That last suggestion could be the key - allocate 10GB of RAM (say 80% of the card’s capacity) and free it right away at the beginning of your program - if it fails, you don’t want to use that card.
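Something along these lines should do - the 80% fraction and the device loop are just placeholders for your setup:

```python
import torch

def card_is_usable(device, fraction=0.8):
    """Sanity check: allocate ~fraction of the card's total memory and free it right away."""
    total = torch.cuda.get_device_properties(device).total_memory
    n_elems = int(total * fraction) // 4  # float32 = 4 bytes per element
    try:
        t = torch.empty(n_elems, dtype=torch.float32, device=device)
        del t
        torch.cuda.empty_cache()
        return True
    except RuntimeError:
        return False

# run this at the very beginning of the program, before any real work
usable = [d for d in range(torch.cuda.device_count()) if card_is_usable(f"cuda:{d}")]
print(f"usable cards: {usable}")
```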