What is saved_id in the torch.load source code?

valiantljk · October 5, 2018, 10:55pm

I noticed the following function in the torch.load source code:

    def persistent_load(saved_id):
        assert isinstance(saved_id, tuple)
        typename = saved_id[0]
        data = saved_id[1:]

        if typename == 'module':
            # Ignore containers that don't have any sources saved
            if all(data[1:]):
                _check_container_source(*data)
            return data[0]
        elif typename == 'storage':
            data_type, root_key, location, size, view_metadata = data
            if root_key not in deserialized_objects:
                deserialized_objects[root_key] = restore_location(
                    data_type(size), location)
            storage = deserialized_objects[root_key]
            if view_metadata is not None:
                view_key, offset, view_size = view_metadata
                if view_key not in deserialized_objects:
                    deserialized_objects[view_key] = storage[offset:offset + view_size]
                return deserialized_objects[view_key]
            else:
                return storage
        else:
            raise RuntimeError("Unknown saved id type: %s" % saved_id[0])

Can anyone tell me what saved_id is?
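For context, persistent_load is a hook defined by the standard pickle module: whatever a Pickler's persistent_id(obj) returns at save time is handed back to the Unpickler's persistent_load at load time. Below is a minimal sketch of that mechanism using only the standard library. The Storage class, the Saver class, the roundtrip helper, and the tuple layout are hypothetical stand-ins for illustration, not the actual torch.save code.

```python
import io
import pickle


class Storage:
    """Hypothetical stand-in for a tensor storage; not a torch class."""

    def __init__(self, key, values):
        self.key = key
        self.values = values


class Saver(pickle.Pickler):
    def persistent_id(self, obj):
        # For storages, write only a small tuple into the pickle stream.
        # This tuple is what later arrives as saved_id on the load side.
        if isinstance(obj, Storage):
            return ('storage', obj.key, len(obj.values))
        return None  # everything else is pickled normally


def roundtrip(obj, storages):
    """Pickle obj, then unpickle it, recording every saved_id seen."""
    buf = io.BytesIO()
    Saver(buf).dump(obj)
    buf.seek(0)
    seen = []

    class Loader(pickle.Unpickler):
        def persistent_load(self, saved_id):
            seen.append(saved_id)
            typename, key, size = saved_id
            if typename != 'storage':
                raise pickle.UnpicklingError("Unknown saved id type: %s" % typename)
            # Resolve the id back to a real object held outside the pickle.
            return storages[key]

    return Loader(buf).load(), seen


if __name__ == "__main__":
    s = Storage('0', [1.0, 2.0, 3.0])
    restored, ids = roundtrip({'weight': s}, {'0': s})
    print(ids)
```

Read this way, saved_id in the snippet above would be the tuple that the saving side chose to emit for each object it did not want pickled inline, with the first element naming the kind of object ('module' or 'storage').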

valiantljk · October 18, 2018, 9:10pm

The question above is not important to me. The following is what I suspect to be an I/O bottleneck in torch.load when loading a pickle file:

    if f_should_read_directly and f.tell() == 0:
        # legacy_load requires that f has fileno()
        # only if offset is zero we can attempt the legacy tar file loader
        try:
            return legacy_load(f)
        except tarfile.TarError:
            # if not a tarfile, reset file offset and proceed
            f.seek(0)

legacy_load opens the file with tarfile.open, but when the file is not a tar archive this path seems costly and triggers unnecessary I/O operations.

Adding a condition check seems helpful:

    if f_should_read_directly and f.tell() == 0 and tarfile.is_tarfile(fn):

However, using is_tarfile requires _load to accept a new parameter: the filename, fn.
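The idea can be sketched outside of torch as follows. This is only an illustration under stated assumptions: load_with_tar_check, legacy_loader, and modern_loader are hypothetical names, and the two loaders are passed in as plain callables rather than being the real torch internals.

```python
import os
import pickle
import tarfile
import tempfile


def load_with_tar_check(path, legacy_loader, modern_loader):
    """Hypothetical wrapper: only try the tar-based loader on real tar files.

    legacy_loader and modern_loader are callables that take an open binary
    file object; they stand in for the two code paths discussed above.
    """
    if tarfile.is_tarfile(path):
        with open(path, 'rb') as f:
            return legacy_loader(f)
    with open(path, 'rb') as f:
        return modern_loader(f)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        path = os.path.join(d, 'model.pkl')
        with open(path, 'wb') as f:
            pickle.dump({'a': 1}, f)

        def legacy(f):
            # Should not be reached for a plain pickle file.
            raise AssertionError("legacy path taken for a non-tar file")

        print(load_with_tar_check(path, legacy, pickle.load))
```

Note that tarfile.is_tarfile itself has to open the file and read its beginning, so whether the guard is a net win is something to measure rather than assume. On Python 3.9 and later, tarfile.is_tarfile also accepts a file object, which might avoid the extra filename parameter; the caller should then make sure the file offset is reset before reading further.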

Check my benchmark: https://github.com/NERSC/pyprob/blob/distributed/torch_load_bench.py