When loading weights from a file with model.load_state_dict(torch.load(model_file)), the following exception is raised:
THCudaCheck FAIL file=/data/users/soumith/builder/wheel/pytorch-src/torch/lib/THC/generic/THCStorage.c line=79 error=2 : out of memory
Segmentation fault (core dumped)
Previously this ran with no problem; in fact, two training processes are still running (on two other GPUs). It only breaks when I try to start an additional training process.
OK, I think I have found where the problem arises: the model weights saved with torch.save(model.state_dict(), file)
contain device info, and torch.load(model_file) will load the weights directly onto the device recorded in the file rather than onto the CPU. So, if the previously used device is short of memory, this loading process crashes.
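The usual way to force the checkpoint onto the CPU at load time is to pass map_location to torch.load. A minimal sketch of that idea (the actual map_loc used in the traceback below is not shown in the thread, so this is only an assumption about its shape):

    # Sketch: remap every saved storage to CPU instead of the GPU it was saved from.
    # The map_location callable receives (storage, location); returning the storage
    # unchanged keeps it on the CPU.
    state_dict = torch.load(model_file, map_location=lambda storage, loc: storage)
    model.load_state_dict(state_dict)

In the traceback below, however, the load fails even though a map_location (map_loc) is supplied: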
model weights loading...
THCudaCheck FAIL file=/data/users/soumith/builder/wheel/pytorch-src/torch/csrc/generic/serialization.cpp line=145 error=2 : out of memory
Traceback (most recent call last):
File "PTR_evaluation_pytorch.py", line 197, in <module>
model.load_state_dict(torch.load(model_file,map_location=map_loc))
File "/home/David/App/anaconda3/lib/python3.5/site-packages/torch/serialization.py", line 222, in load
return _load(f, map_location, pickle_module)
File "/home/David/App/anaconda3/lib/python3.5/site-packages/torch/serialization.py", line 377, in _load
deserialized_objects[key]._set_from_file(f, offset)
RuntimeError: cuda runtime error (2) : out of memory at /data/users/soumith/builder/wheel/pytorch-src/torch/csrc/generic/serialization.cpp:145
The target device is idle, with over 20GB of memory free.
There was a bug in the serialization where remapping devices still used device memory. This is fixed in master. I am working on binaries of version 0.1.11, and that release will have this fix.
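Until those binaries are available, one possible workaround (my assumption, not something stated in the thread) is to save a CPU-only copy of the state_dict, so that loading it later never has to allocate or remap GPU memory:

    # Workaround sketch (assumption, not from the thread): copy each parameter to
    # CPU before saving, so the checkpoint contains only CPU storages.
    cpu_state = {k: v.cpu() for k, v in model.state_dict().items()}
    torch.save(cpu_state, model_file)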
map_location – a function, torch.device, string or a dict specifying how to remap storage locations
# Pick the target device explicitly, then remap the checkpoint onto it.
if torch.cuda.is_available() and cfg.use_gpu is not None:
    device = torch.device(cfg.use_gpu)
else:
    device = torch.device("cpu")

checkpoint_data = torch.load(checkpoint_path, map_location=device)
model.load_state_dict(checkpoint_data['model'])
optimizer.load_state_dict(checkpoint_data['optimizer'])
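The other map_location forms mentioned in the docs behave the same way; a short sketch (the device names here are only examples):

    # Remap with a string: load every storage onto the CPU.
    checkpoint_data = torch.load(checkpoint_path, map_location="cpu")

    # Remap with a dict: move storages saved on cuda:1 onto cuda:0.
    checkpoint_data = torch.load(checkpoint_path, map_location={"cuda:1": "cuda:0"})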