Hi everyone. I’m implementing an RL algorithm and I’m masking unavailable actions by setting their Q-values really low, so when I apply
max() I never select them. Usually, I do that with
float("-inf") in pure Python. The thing is, in all my experiments, I’m getting a
RuntimeError: CUDA out of memory on the lines where I use this trick, like the example below:
live_next_q_vals = current_q_vals.clone().detach()[:, 1:]
live_next_q_vals[~av_actions_batch[:, 1:]] = float('-inf')
Does anyone know if using
float("-inf") is not recommended with PyTorch tensors? I saw other implementations use large negative numbers, like
-999999, but I didn’t see a reason to do that and it looks like I made the wrong choice, lol.
I should clarify that this only happens after several thousand iterations, not on the first one, but always on lines where I perform masking with float("-inf").
I don’t think the error is raised by the specific value; I assume you would get the same OOM error with any other value.
The OOM might be raised by the indexing, since temporary tensors might need to be created.
Are you seeing an increased memory usage during training or are you close to an OOM from the beginning?
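To make the point about indexing concrete, here is a minimal sketch comparing the boolean-mask assignment from the question with `masked_fill_`, which writes the fill value in place from the mask. The shapes and random data are made up for illustration, and whether this lowers peak memory in the original training script is an assumption that would need to be measured.

```python
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"

# Hypothetical stand-ins for the tensors in the question (shapes are made up).
current_q_vals = torch.randn(4, 6, device=device, requires_grad=True)
av_actions_batch = torch.rand(4, 6, device=device) > 0.3  # True = action available

# Variant from the question: boolean-mask assignment (advanced indexing).
q_indexed = current_q_vals.clone().detach()[:, 1:]
q_indexed[~av_actions_batch[:, 1:]] = float("-inf")

# Alternative: detach first, copy only the needed slice, then fill in place
# wherever the mask is True.
q_filled = current_q_vals.detach()[:, 1:].clone()
q_filled.masked_fill_(~av_actions_batch[:, 1:], float("-inf"))

# Both variants produce the same masked values.
same = torch.equal(q_indexed, q_filled)
```

Both variants accept `float("-inf")` as well as a large negative number, so the choice of fill value is independent of the choice of masking call.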
You’re right @ptrblck, I got an OOM error even when using a large negative value instead of
float("-inf"). The tensor I am manipulating in that code snippet has dimensions 32x60x10x16. I don’t know if that’s big in PyTorch terms.
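For a sense of scale, the size of a 32x60x10x16 tensor can be estimated with a few lines of arithmetic. This sketch assumes float32 elements (4 bytes each); the post does not state the dtype.

```python
import math

shape = (32, 60, 10, 16)   # dimensions given in the post
bytes_per_element = 4      # assumption: float32; the dtype is not stated

num_elements = math.prod(shape)
size_bytes = num_elements * bytes_per_element
size_mib = size_bytes / 2**20

print(num_elements)            # 307200
print(size_bytes)              # 1228800
print(f"{size_mib:.2f} MiB")   # 1.17 MiB
```

Under that assumption a single copy of the tensor is a little over 1 MiB, which is small compared with an 8 GB GPU.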
To answer your question, my training script does not consume all of my GPU memory from the beginning; its memory usage slowly increases over time. It usually starts at 1 GB and goes up until it reaches 3.5 GB. My GPU has 8 GB of memory.
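One way to track this kind of gradual growth is to log the allocator statistics every so often during training. The helper below is a hypothetical sketch (`log_cuda_memory` and its `every` parameter are made-up names); it relies only on `torch.cuda.memory_allocated()` and `torch.cuda.memory_reserved()`, and does nothing when CUDA is unavailable.

```python
import torch

def log_cuda_memory(step, every=1000):
    """Hypothetical helper: report (allocated, reserved) in MiB every `every` steps."""
    if step % every != 0 or not torch.cuda.is_available():
        return None
    # Memory currently occupied by live tensors in this process.
    allocated = torch.cuda.memory_allocated() / 2**20
    # Memory held by PyTorch's caching allocator (live tensors plus cached blocks).
    reserved = torch.cuda.memory_reserved() / 2**20
    print(f"step {step}: allocated={allocated:.1f} MiB, reserved={reserved:.1f} MiB")
    return allocated, reserved

# Usage inside a training loop (the loop body is omitted here).
for step in range(3):
    log_cuda_memory(step, every=1000)
```

If the allocated figure keeps climbing across iterations, some tensors are probably being kept alive between steps; if only the reserved figure grows, the increase is more likely cached blocks.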
The thing is, because GPU utilization is actually low (training a network usually uses 5~20% of my GPU’s processing capacity), I was being greedy and training multiple networks at the same time on the same GPU. So it’s not like a single process actually consumed all my GPU memory.
I have now implemented a few measures to limit GPU memory usage, such as deleting tensors once they are used (especially the big ones), limiting the scope of some tensors by moving code into functions, and calling
torch.cuda.empty_cache() periodically to free memory for other processes. I am still trying to figure out if these things are making a difference.
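The measures described above could look roughly like the sketch below. The function name `masked_max_q` and the tensor shapes are made up for illustration. Note that `torch.cuda.empty_cache()` only returns cached, unused blocks to the driver; it does not free memory still held by live tensors.

```python
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

def masked_max_q(q_vals, available):
    # Temporaries created here go out of scope when the function returns.
    masked = q_vals.detach().masked_fill(~available, float("-inf"))
    return masked.max(dim=-1).values

# Hypothetical data: 8 states, 5 actions, with action 0 unavailable everywhere.
q_vals = torch.randn(8, 5, device=device)
available = torch.ones(8, 5, dtype=torch.bool, device=device)
available[:, 0] = False

best = masked_max_q(q_vals, available)

# Drop references to the large tensors once they are no longer needed.
del q_vals, available

# Release cached, unused blocks so other processes on the same GPU can use them.
if torch.cuda.is_available():
    torch.cuda.empty_cache()
```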