I’m currently preprocessing a lot of data, and I noticed that some of my data files ended up being far larger than expected.
I then realized that when I take a slice of a tensor, as done in the code below, and save that slice, it does not save only the slice — it actually saves the entire original tensor! When I load the file, though, the tensor has the expected (sliced) shape.
The reason I noticed is that some of my files ended up being 10 GB, even though they only contain a single integer tensor of shape 30000 x 326, which should only take up around 78 MB as int64, or around 10 MB as int8.
I’m assuming that making a copy of that part of the tensor can fix this, but is this a known problem, and what is the normal workaround for it?
for i, a2mfile in enumerate(a2mfiles):
    t1 = time.time()
    msas = read_a2m_gz_file(a2mfile, AA_LIST=AA_LIST, unk_idx=unk_idx, max_seq_len=max_seq_len, min_seq_len=min_seq_len,
                            verbose=True)
    if msas is None:
        continue
    else:
        seq = msas[-1, :]
        msas = msas[:-1, :].to(dtype=torch.int8)
        n_msa = msas.shape[0]
        seq_len = msas.shape[1]
        print("{:}/{:} MSAs = {:}, sequence length {:}, time taken: {:2.2f}s, ETA: {:2.2f}h".format(
            i+1, nfiles, n_msa, seq_len, time.time() - t1, (nfiles - (i+1)) * (time.time() - t0) / (i+1) / 3600))
        if n_msa > max_samples:
            indices = torch.randperm(n_msa)
            msas_s = msas[indices[:max_samples], :]
        else:
            msas_s = msas
        data = {'seq': seq,
                'msas': msas_s,
                'seq_len': seq_len,
                'n_msas_org': n_msa}
        filename = Path(Path(a2mfile).stem).stem
        torch.save(data, "{:}{:}.pt".format(folder_out, filename))
Here is a minimal example showing the behavior more clearly:
import torch
n = 1000000
l = 300
data = torch.ones((n,l),dtype=torch.int64)
data_small = data[:30000,:] #.to(dtype=torch.int8)
torch.save(data_small,'test')
data_small2 = torch.ones_like(data_small)
torch.save(data_small2,'test2')
When I run this minimal example, test ends up at 2.4 GB, while test2 is only 72 MB.
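For what it's worth, calling .clone() on the slice before saving does seem to fix it in my testing, presumably because the slice is a view sharing the full underlying storage, and torch.save serializes that whole storage. Here is a scaled-down sketch of the same experiment (smaller sizes than above, so it runs quickly):

```python
import os
import torch

n, l = 10000, 300
data = torch.ones((n, l), dtype=torch.int64)

# A slice is a view: it shares the full underlying storage with `data`,
# and torch.save serializes that entire storage, not just the view.
view = data[:300, :]

# .clone() copies only the selected elements into fresh storage,
# so the saved file has the size the slice's shape suggests.
copy = view.clone()

torch.save(view, 'test_view.pt')
torch.save(copy, 'test_clone.pt')

print(os.path.getsize('test_view.pt'))   # roughly the full 10000 x 300 x 8 bytes
print(os.path.getsize('test_clone.pt'))  # roughly 300 x 300 x 8 bytes
```

I'd still like to know whether this is the intended workaround, or whether there is a more idiomatic way to detach a slice from its storage before saving.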