Hey @ptrblck and @cbox, I am currently running into this issue and am wondering how it was resolved.
Did you update any libraries since the last successful run?
I am using the same conda environment, so no change in libraries.
Did you specify any devices using CUDA_VISIBLE_DEVICES?
I am just specifying the device via: device = torch.device('cuda:4')
I am still pretty green here, so I am not really sure what the difference between the two is. However, this is the first time I have run out of memory on one of these particular models (in fact, a different GPU in the cluster is currently training the exact same model).
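For reference, here is a rough sketch of how I understand the two approaches after reading around (the index 4 is just my case, and I have only tried the second option myself):

import os

# Option 1: restrict which physical GPUs the process can see at all.
# This must be set before CUDA is initialized (ideally before importing torch);
# physical GPU 4 is then remapped and shows up inside the process as cuda:0.
os.environ['CUDA_VISIBLE_DEVICES'] = '4'

import torch
device = torch.device('cuda:0')  # under the remapping, this is physical GPU 4

# Option 2 (what I did): leave all GPUs visible and address one by its raw index.
# This only makes sense when CUDA_VISIBLE_DEVICES is left unset.
# device = torch.device('cuda:4')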
Is the GPU completely empty before you run the script?
The GPU I am targeting is idle (10MiB/16280MiB)
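In case it helps anyone double-check from inside Python as well, something like this small sketch should report the free and total memory on the target GPU (again, cuda:4 is just my case):

import torch

# Query free vs. total memory on the target GPU, in MiB.
free_bytes, total_bytes = torch.cuda.mem_get_info(torch.device('cuda:4'))
print(f'free: {free_bytes / 1024**2:.0f} MiB / total: {total_bytes / 1024**2:.0f} MiB')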
I would appreciate any advice. Thanks!
EDIT: I solved the problem using the approach from this earlier post: Out of memory error when resume training even though my GPU is empty - #2 by ptrblck
For anyone hitting a similar problem, the code I used for the fix is as follows:
import torch

# UNet is my own model class; swap in whatever model you are loading.
model_path = 'path/to/model.pt'
device = torch.device('cuda:4')  # the target GPU

model = UNet(n_channels=1, n_classes=1)
# Load the checkpoint onto the CPU first, then move the whole model to the target GPU.
state_dict = torch.load(model_path, map_location='cpu')
model.load_state_dict(state_dict)
model.to(device)
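I believe (though I have not tested it myself) a variant that skips the CPU round trip is to map the checkpoint straight onto the target GPU:

model = UNet(n_channels=1, n_classes=1).to(device)       # build the model on cuda:4 first
state_dict = torch.load(model_path, map_location=device)  # checkpoint tensors also land on cuda:4
model.load_state_dict(state_dict)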
That said, I am still curious why this workaround is necessary in the first place…