I am training a CNN model with PyTorch on the GPU. At the start, RAM usage is about 25%, but memory consumption increases slowly during training. After roughly 12 hours, 90% of memory is consumed and I hit an OOM error. Checking memory usage shows that about 80% of it is taken by the Paged Pool and Non-Paged Pool.
My PC has a single 16 GB GPU. The other specifications are as follows:
- Batch size: 256
- RAM: 32 GB
- GPU: 16 GB
- Processor: i7 11th gen
PyTorch config:
- Number of workers: 4
- Training device: CUDA
- Iterations per epoch: ~12,000
Do you use no_grad() when you evaluate your dataset?
Are you running out of host RAM or GPU memory?
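A quick way to tell the two apart is to watch PyTorch's own CUDA allocator counters while training (host RAM you can watch in Task Manager). This is just a diagnostic sketch, not from your code:

```python
import torch

# If these numbers grow over iterations, the leak is on the GPU side;
# if they stay flat while Task Manager shows RAM climbing, it's host memory.
if torch.cuda.is_available():
    print(f"allocated: {torch.cuda.memory_allocated() / 2**20:.0f} MiB")
    print(f"reserved:  {torch.cuda.memory_reserved() / 2**20:.0f} MiB")
else:
    print("CUDA not available here; any growth must be host RAM")
```

You can call this every few hundred iterations inside the training loop and log the values.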
The training code is in the image file.
Emmm… I mean: when training is over and you call model.eval(), you should also wrap evaluation or inference in torch.no_grad() so gradients are not stored during eval.
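A minimal sketch of that pattern (the tiny model here is just a stand-in for your CNN):

```python
import torch
import torch.nn as nn

# Stand-in model for illustration; use your trained CNN instead
model = nn.Sequential(nn.Linear(4, 2))

model.eval()              # disable dropout / freeze batch-norm statistics
with torch.no_grad():     # don't build the autograd graph during inference
    x = torch.randn(8, 4)
    out = model(x)

# Without no_grad(), `out` would carry a grad_fn and keep the whole
# computation graph (and its activations) alive in memory.
print(out.requires_grad)  # prints False
```

Note that model.eval() alone does not turn off gradient tracking; only torch.no_grad() (or torch.inference_mode()) does.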
I am still facing the same issue. The increase in RAM consumption happens during the training period itself: as the iterations within a single epoch progress, the Paged Pool and Non-Paged Pool memory keeps growing over time. At the start, these pools occupy only a few MB.
Sorry, I have no idea about that particular memory usage.
Generally speaking, PyTorch will use GPU memory for the model and tensors when training on CUDA.