Memory usage during training and validation

I'm observing a difference in memory usage and just want to make sure it's expected:

I am running my model on two GPUs. During training, each GPU uses about 4 GB of memory.
But during validation (i.e. training is not finished yet; I just switch to model.eval() to run validation and switch back to model.train() afterwards), GPU 1's memory usage grows to 7 GB while GPU 2 stays the same.

Then, after validation finishes and I switch back to training, GPU 1's memory usage does not go back to 4 GB… it stays locked at 7 GB…

Is this normal? I would expect it to drop back to 4 GB after validation is done…

What exactly happens in model.eval()? Does it clear the memory after it finishes? How does PyTorch assign memory when it switches from train() to eval()? I am very curious!

training epoch 1:
GPU1 --- 4 GB
GPU2 --- 4 GB

validation:
GPU1 --- 7 GB
GPU2 --- 4 GB

training epoch 2:
GPU1 --- 7 GB (???)
GPU2 --- 4 GB
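
For reference, here is a simplified, self-contained sketch of the pattern I mean (a toy model and random data just to show the train()/eval() switching; it is not my actual code):

```python
import torch
import torch.nn as nn

# Toy stand-ins so the sketch runs; my real model and data are of course larger.
model = nn.Linear(10, 2).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
criterion = nn.CrossEntropyLoss()
train_data = [(torch.randn(8, 10).cuda(), torch.randint(0, 2, (8,)).cuda()) for _ in range(5)]
val_data = [(torch.randn(8, 10).cuda(), torch.randint(0, 2, (8,)).cuda()) for _ in range(5)]

for epoch in range(2):
    model.train()
    for inputs, targets in train_data:
        optimizer.zero_grad()
        loss = criterion(model(inputs), targets)
        loss.backward()
        optimizer.step()

    model.eval()                          # switch to eval mode for validation
    for inputs, targets in val_data:
        val_loss = criterion(model(inputs), targets)   # forward pass only
    model.train()                         # switch back before the next epoch
```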

This is expected, since PyTorch uses a caching allocator to reuse the memory instead of reallocating it (which would be slow).
If you want to release the memory, you would have to delete all tensors that might still be alive from the validation run and call torch.cuda.empty_cache(). Note that this is only necessary if you want other applications to use the memory.
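
Something like this minimal sketch (the deleted tensor names are just placeholders) shows the difference between the memory actually allocated by live tensors and the memory held by the caching allocator:

```python
import torch

device = torch.device('cuda:0')

# Memory occupied by live tensors vs. memory held (cached) by the allocator.
# (On older PyTorch versions, memory_reserved() was called memory_cached().)
print(torch.cuda.memory_allocated(device), torch.cuda.memory_reserved(device))

# After validation: delete the tensors that are still alive from that run
# (outputs, losses you kept around, etc. -- placeholder names here):
# del val_outputs, val_loss

# Release cached, unused blocks back to the driver so other processes can use them.
torch.cuda.empty_cache()

print(torch.cuda.memory_allocated(device), torch.cuda.memory_reserved(device))
```

Also note that tools such as nvidia-smi report the reserved (cached) memory of the process, which is why the number there stays at ~7 GB even though the allocated memory drops again after validation.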

Hi @ptrblck,

Thanks for the answer! Much appreciated!