PyTorch & pickled tensors

While loading a regular PyTorch (1.4.0) tensor (an encoder output generated by another network in another training experiment), I run into the following problem . . .

When pickling the output of an encoder, what is the best practice?

For example, does the output, which is just a vector/tensor, need to be detached and copied prior to pickling?
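For concreteness, the pattern I am assuming is the safe one looks roughly like the sketch below (the encoder, shapes, and file name are made-up placeholders, not my actual code); I would like to know whether the detach()/cpu() step is actually required before serializing:

import torch
import torch.nn as nn

encoder = nn.Linear(16, 8)   # stand-in for the real encoder, only here to make the sketch runnable
x = torch.randn(4, 16)

emb = encoder(x)                  # still attached to the autograd graph
emb_to_save = emb.detach().cpu()  # drop graph references (and GPU storage, if any) before serializing

torch.save(emb_to_save, "emb.pt")

# Later, possibly in a different process / experiment:
emb_loaded = torch.load("emb.pt", map_location="cpu")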

Side notes:

I realize the stack trace below ends in a complaint about a multiprocessing fork/spawn incompatibility . . . but that might have been a side effect of an earlier serialization-recovery attempt. The serialization incompatibility is suspicious because the torch version in my virtualenv has not changed . . .

In case the multiprocessing complaint is actually genuine:
is it possible that the inclusion of TensorBoard's summary writer, as follows, is also a culprit?
self.summary_writer = tf.summary.create_file_writer(self.log_dir)

I can research these questions on my own, but it would be great if someone more familiar with the serialization code could clear up the first suspicion.

============

File ". . . . ", line . . . , in get_emb
x_emb = pickle.load(x_path)
File "/home/ubuntu/ve/trn/lib/python3.6/site-packages/torch/storage.py", line 134, in _load_from_bytes
return torch.load(io.BytesIO(b))
File "/home/ubuntu/ve/trn/lib/python3.6/site-packages/torch/serialization.py", line 529, in load
return _legacy_load(opened_file, map_location, pickle_module, **pickle_load_args)
File "/home/ubuntu/ve/trn/lib/python3.6/site-packages/torch/serialization.py", line 702, in _legacy_load
result = unpickler.load()
File "/home/ubuntu/ve/trn/lib/python3.6/site-packages/torch/serialization.py", line 665, in persistent_load
deserialized_objects[root_key] = restore_location(obj, location)
File "/home/ubuntu/ve/trn/lib/python3.6/site-packages/torch/serialization.py", line 156, in default_restore_location
result = fn(storage, location)
File "/home/ubuntu/ve/trn/lib/python3.6/site-packages/torch/serialization.py", line 135, in _cuda_deserialize
with torch.cuda.device(device):
File "/home/ubuntu/ve/trn/lib/python3.6/site-packages/torch/cuda/__init__.py", line 254, in __enter__
self.prev_idx = torch._C._cuda_getDevice()
File "/home/ubuntu/ve/trn/lib/python3.6/site-packages/torch/cuda/__init__.py", line 195, in _lazy_init
"Cannot re-initialize CUDA in forked subprocess. " + msg)
RuntimeError: Cannot re-initialize CUDA in forked subprocess. To use CUDA with multiprocessing, you must use the 'spawn' start method

The problem persists even if SummaryWriter from torch.utils.tensorboard is used . . .
This could have something to do with not detaching the output tensor from the compute graph prior to saving it . . .
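To check that theory, my plan is to unpickle the embedding once in the main process (where CUDA initialization is allowed) and inspect its state; a rough sketch, with the path being a placeholder:

import pickle
import torch

# Run this in the main process, not inside a worker, so CUDA may initialize
# if the stored tensor really does live on the GPU.
with open("emb.pkl", "rb") as f:
    emb = pickle.load(f)

print(emb.is_cuda)        # True would explain the CUDA re-init attempt in the forked worker
print(emb.requires_grad)  # True would suggest the tensor was saved without detaching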

Are you using pickle manually or are you trying to load the tensor via torch.load?
Also, in the latter case, would map_location='cpu' help?
Based on the error message, it seems you are working with multiprocessing and the tensor loading is trying to re-initialize CUDA. Have you tried using the suggested spawn start method?
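Something along these lines is what I had in mind; the file name and the surrounding script are placeholders, and the two parts are independent options:

import torch
import torch.multiprocessing as mp

if __name__ == "__main__":
    # Option 1: spawn workers instead of forking them, so CUDA can be
    # initialized safely inside the children.
    mp.set_start_method("spawn", force=True)

    # Option 2: if the embedding was written with torch.save, map_location keeps
    # deserialization entirely on the CPU and CUDA is never touched.
    emb = torch.load("emb.pt", map_location="cpu")

Note that map_location only takes effect when you call torch.load on the file yourself; a tensor buried inside a plain pickle stream is restored onto whatever device it was saved from, which matches the _cuda_deserialize frame in your trace.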

I resolved this problem . . . but unfortunately, I don’t recall how I did it. I will have to review the code later and write back.