How to load checkpoints across different versions of pytorch (1.3.1 and 1.6.x) using ppc64le and x86?

Brando_Miranda · September 30, 2020, 3:45pm

As I outlined in "Force installing torchvision (on IBM ppc64le)", I am stuck using old versions of PyTorch and torchvision because of my hardware, e.g. IBM ppc64le architectures.

For this reason, I am having issues when sending and receiving checkpoints between different computers, clusters and my personal Mac. I wonder if there is a way to load models that avoids this issue, e.g. perhaps saving models in both an old and a new format when using 1.6.x. Of course, for the 1.3.1 to 1.6.x direction this is impossible, but at least I was hoping something would work.
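To make the "old format" idea concrete: starting with 1.6, torch.save writes a zip-based file by default, and it accepts a `_use_new_zipfile_serialization=False` keyword that requests the earlier format. Below is a minimal sketch, assuming the 1.6.x machine is the one writing the file and that saving a state_dict (rather than the pickled `f` itself) is acceptable. The `nn.Linear` model, the `f_state_dict` key and the file name are hypothetical stand-ins, and I have not verified that every tensor type survives the trip back to 1.3.1.

```python
import os
import tempfile

import torch
import torch.nn as nn

# Hypothetical stand-in for the real model `f` stored in the checkpoint.
model = nn.Linear(1, 1)

# Move tensors to CPU so the file does not depend on the saving machine's devices.
state = {name: tensor.cpu() for name, tensor in model.state_dict().items()}

# Hypothetical output location for this sketch.
ckpt_path = os.path.join(tempfile.mkdtemp(), "f_legacy.pt")

# On PyTorch >= 1.6 this keyword asks torch.save for the pre-1.6 (non-zip) format.
torch.save({"f_state_dict": state}, ckpt_path, _use_new_zipfile_serialization=False)

# Loading side: rebuild the architecture from code, then restore the weights.
db = torch.load(ckpt_path, map_location="cpu")
restored = nn.Linear(1, 1)
restored.load_state_dict(db["f_state_dict"])
```

This only helps in the 1.6.x to 1.3.1 direction, and the loading side still needs the model class definition available as code.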

Any advice? Of course my ideal solution is that I don’t have to worry about it and I can always load and save my checkpoints and everything I usually pickle uniformly across all my hardware.

The first error I got was a zip jit error:

RuntimeError: /home/miranda9/data/f.pt is a zip archive (did you mean to use torch.jit.load()?)
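The message says the old install found a zip archive where it expected its own format. A quick way to check which container a given checkpoint uses, without importing torch at all, is the standard-library zipfile module. This is only a rough sketch: the function name `checkpoint_container` and the two throwaway demo files are made up for illustration, and a "zip" result does not say whether the file came from torch.save or torch.jit.save.

```python
import pickle
import tempfile
import zipfile
from pathlib import Path


def checkpoint_container(path):
    """Return 'zip' for zip-based files, otherwise 'legacy-or-other'."""
    return "zip" if zipfile.is_zipfile(str(path)) else "legacy-or-other"


# Demo on two throwaway files (hypothetical names, not real checkpoints).
tmp = Path(tempfile.mkdtemp())

zip_file = tmp / "new_style.pt"
with zipfile.ZipFile(str(zip_file), "w") as zf:
    zf.writestr("archive/data.pkl", b"placeholder")

plain_file = tmp / "old_style.pt"
plain_file.write_bytes(pickle.dumps({"f": None}, protocol=2))

print(checkpoint_container(zip_file))    # zip
print(checkpoint_container(plain_file))  # legacy-or-other
```

Running this on each machine before calling torch.load at least tells me whether a file was written in the newer zip layout.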

so I used that (and other pickle libraries):

# %%
import torch
from pathlib import Path


def load(path):
    import torch
    import pickle
    import dill

    path = str(path)
    try:
        db = torch.load(path)
        f = db['f']
    except Exception as e:
        db = torch.jit.load(path)
        f = db  # torch.jit.load returns a ScriptModule, not a dict, so it cannot be indexed
        #with open():
        # db = pickle.load(open(path, "r+"))
        # db = dill.load(open(path, "r+"))
        #raise ValueError(f'FAILED: {e}')
    return db, f

p = "~/data/f.pt"
path = Path(p).expanduser()

db, f = load(path)

Din, nb_examples = 1, 5
x = torch.distributions.Normal(loc=0.0, scale=1.0).sample(sample_shape=(nb_examples, Din))

y = f(x)

print(y)
print('Success!\a')

but I get complaints about the different PyTorch versions that I am forced to use:

Traceback (most recent call last):
  File "hal_pg.py", line 27, in <module>
    db, f = load(path)
  File "hal_pg.py", line 16, in load
    db = torch.jit.load(path)
  File "/home/miranda9/.conda/envs/wmlce-v1.7.0-py3.7/lib/python3.7/site-packages/torch/jit/__init__.py", line 239, in load
    cpp_module = torch._C.import_ir_module(cu, f, map_location, _extra_files)
RuntimeError: version_number <= kMaxSupportedFileFormatVersion INTERNAL ASSERT FAILED at /opt/anaconda/conda-bld/pytorch-base_1581395437985/work/caffe2/serialize/inline_container.cc:131, please report a bug to PyTorch. Attempted to read a PyTorch file with version 3, but the maximum supported version for reading is 1. Your PyTorch installation may be too old. (init at /opt/anaconda/conda-bld/pytorch-base_1581395437985/work/caffe2/serialize/inline_container.cc:131)
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0xbc (0x7fff7b527b9c in /home/miranda9/.conda/envs/wmlce-v1.7.0-py3.7/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #1: caffe2::serialize::PyTorchStreamReader::init() + 0x1d98 (0x7fff1d293c78 in /home/miranda9/.conda/envs/wmlce-v1.7.0-py3.7/lib/python3.7/site-packages/torch/lib/libtorch.so)
frame #2: caffe2::serialize::PyTorchStreamReader::PyTorchStreamReader(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0x88 (0x7fff1d2950d8 in /home/miranda9/.conda/envs/wmlce-v1.7.0-py3.7/lib/python3.7/site-packages/torch/lib/libtorch.so)
frame #3: torch::jit::import_ir_module(std::shared_ptr<torch::jit::script::CompilationUnit>, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, c10::optional<c10::Device>, std::unordered_map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > >&) + 0x64 (0x7fff1e624664 in /home/miranda9/.conda/envs/wmlce-v1.7.0-py3.7/lib/python3.7/site-packages/torch/lib/libtorch.so)
frame #4: <unknown function> + 0x70e210 (0x7fff7c0ae210 in /home/miranda9/.conda/envs/wmlce-v1.7.0-py3.7/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
frame #5: <unknown function> + 0x28efc4 (0x7fff7bc2efc4 in /home/miranda9/.conda/envs/wmlce-v1.7.0-py3.7/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
<omitting python frames>
frame #26: <unknown function> + 0x25280 (0x7fff84b35280 in /lib64/libc.so.6)
frame #27: __libc_start_main + 0xc4 (0x7fff84b35474 in /lib64/libc.so.6)

Any ideas on how to make everything consistent across the clusters? I can’t even open the pickle files.
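One version-neutral route I can think of is to bypass the PyTorch file format entirely and ship only the weights as NumPy arrays. The sketch below assumes the model architecture is available as code on every machine and that only tensors need to move; it does not cover arbitrary pickled objects. The `nn.Linear` model and the file name are hypothetical stand-ins.

```python
import os
import tempfile

import numpy as np
import torch
import torch.nn as nn

# Hypothetical stand-in for the real model `f`.
model = nn.Linear(1, 1)

# Hypothetical output location for this sketch.
npz_path = os.path.join(tempfile.mkdtemp(), "f_weights.npz")

# Saving side: convert every entry of the state_dict to a NumPy array.
arrays = {name: t.detach().cpu().numpy() for name, t in model.state_dict().items()}
np.savez(npz_path, **arrays)

# Loading side (any PyTorch version): rebuild the model, then restore the weights.
loaded = np.load(npz_path)
state = {name: torch.from_numpy(loaded[name]) for name in loaded.files}
restored = nn.Linear(1, 1)
restored.load_state_dict(state)
```

The .npz container is defined by NumPy rather than PyTorch, so the checkpoint no longer depends on which serialization version a particular PyTorch build understands.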
