Hi, is there any smooth implementation that could adjust the training/testing batch size automatically to fit the GPU RAM without a CUDA out-of-memory crash? The GPUs are on remote servers that are shared with others, so it's difficult to foresee the exact usage.
I don't know of a smooth plug-and-play implementation, but if you had to do this yourself, you could put the relevant code in a try/except block and, if the error is the particular type CUDA raises when it runs out of memory, reduce the batch size. The downside is that this way you won't know when memory has been freed up so you can increase the batch size again. Maybe you ought to try increasing the batch size every few iterations, and if there is an OOM error, revert it back again.
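A minimal sketch of that halve-on-failure, grow-periodically strategy. All names here (`run_step`, `adaptive_steps`, `MAX_FITTING`) are hypothetical, and the "GPU" is simulated with a plain `MemoryError` so the logic is easy to follow; in real PyTorch code you would instead catch the `RuntimeError` whose message contains "out of memory" (or `torch.cuda.OutOfMemoryError` on recent versions) around your forward/backward pass:

```python
MAX_FITTING = 48  # simulated GPU capacity, unknown to the tuner

def run_step(batch_size):
    # Stand-in for one training step; raises when the batch "doesn't fit".
    if batch_size > MAX_FITTING:
        raise MemoryError("simulated CUDA out of memory")
    return batch_size

def adaptive_steps(initial=128, grow_every=4, num_steps=20):
    batch_size = initial
    processed = []
    for step in range(num_steps):
        # Periodically probe a larger batch in case memory freed up.
        if step > 0 and step % grow_every == 0:
            batch_size *= 2
        while True:
            try:
                processed.append(run_step(batch_size))
                break
            except MemoryError:
                # Halve and retry; with real CUDA you would also call
                # torch.cuda.empty_cache() here before retrying.
                batch_size = max(1, batch_size // 2)
    return processed

print(adaptive_steps())
```

The probing step is what keeps you from being stuck at a small batch size forever, at the cost of a wasted OOM attempt every `grow_every` iterations.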
Thanks for the reply. I tried to find something useful in the PyTorch CUDA docs but haven't found a solution yet. Try/except is worth a try, but it would be expensive to find the best parameters that way.