Hey @ptrblck and @cbox, I am currently running into this issue and am wondering how it was resolved.
Did you update any libraries since the last successful run?
I am using the same conda environment, so no change in libraries.
Did you specify any devices using CUDA_VISIBLE_DEVICES?
I am just specifying the device via: device = torch.device('cuda:4')
I am still pretty green here, so I am not really sure what the difference between the two is. However, this is the first time I have run out of memory on one of these particular models (in fact, a different GPU in the cluster is currently training the exact same model).
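For reference, here is a rough sketch of how I understand the two approaches after reading around (the index 4 is just my case, and I have only tried the second option myself):

import os

# Option 1: restrict which physical GPUs the process can see at all.
# This must be set before CUDA is initialized (ideally before importing torch);
# physical GPU 4 is then remapped and shows up inside the process as cuda:0.
os.environ['CUDA_VISIBLE_DEVICES'] = '4'

import torch
device = torch.device('cuda:0')  # under the remapping, this is physical GPU 4

# Option 2 (what I did): leave all GPUs visible and address one by its raw index.
# This only makes sense when CUDA_VISIBLE_DEVICES is left unset.
# device = torch.device('cuda:4')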
Is the GPU completely empty before you run the script?
The GPU I am targeting is idle (10MiB/16280MiB)
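In case it helps anyone double-check from inside Python as well, something like this small sketch should report the free and total memory on the target GPU (again, cuda:4 is just my case):

import torch

# Query free vs. total memory on the target GPU, in MiB.
free_bytes, total_bytes = torch.cuda.mem_get_info(torch.device('cuda:4'))
print(f'free: {free_bytes / 1024**2:.0f} MiB / total: {total_bytes / 1024**2:.0f} MiB')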
I would appreciate any advice. Thanks!
EDIT: I solved the problem using the approach from this earlier post: Out of memory error when resume training even though my GPU is empty - #2 by ptrblck
For anyone hitting a similar problem, the code I used for the fix is as follows:
import torch

# UNet is my own model class; swap in whatever model you are loading.
model_path = 'path/to/model.pt'
device = torch.device('cuda:4')  # the target GPU

model = UNet(n_channels=1, n_classes=1)
# Load the checkpoint onto the CPU first, then move the whole model to the target GPU.
state_dict = torch.load(model_path, map_location='cpu')
model.load_state_dict(state_dict)
model.to(device)
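I believe (though I have not tested it myself) a variant that skips the CPU round trip is to map the checkpoint straight onto the target GPU:

model = UNet(n_channels=1, n_classes=1).to(device)       # build the model on cuda:4 first
state_dict = torch.load(model_path, map_location=device)  # checkpoint tensors also land on cuda:4
model.load_state_dict(state_dict)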
That said, I am still curious why this workaround is necessary in the first place…