Unable to allocate cuda memory, when there is enough of cached memory

stas · April 24, 2019, 4:13pm

Thank you for the additional information, @JamesOwers.

So your error message is very telling:

It says that you have 11GB (!) free and it can’t allocate 5MB - that makes no sense.

See this discussion where I tried to diagnose the non-contiguous memory just to discover that nvidia will re-allocate fragmented pages of at least 2MB to make contiguous memory. So unless your code somehow allocates memory that it only consumes a tiny fraction of each 2MB page, fragmenting 12GB of RAM this shouldn’t really happen.

So a few things I’d like to suggest in no particular order:

catch that failure and add sleep so that the program doesn’t exit at that point of failure and check what nvidia-smi says about that card’s RAM status - what is the reported used/free memory there. This is to double check that perhaps there is something wrong with the card and that it reports wrong numbers.
Since you said it happens 5% of the time, did you observe that it perhaps happens with the same specific card? i.e. again a faulty card?
can you reliably reproduce when you hit that 5% situation?
reduce your variable size by say half - does it fit into the memory? if not half again and so on - see what fits
when that error happens, can you catch it and then try to allocate a simple large tensor say torch.zeros() of a few GBs? torch.ones((n*2**18)).cuda().contiguous() where n is the number of desired MBs - and adjust cuda() to match your setup if needed to(...)

My feeling is that your array of cards has a faulty card. That last suggestion could be the key - allocate 10GB of RAM (say 80% of the card’s capacity) and free it right away at the beginning of your program - if it fails, you don’t want to use that card.