I’m currently preprocessing a lot of data, and I noticed that some of my data files ended up being far larger than expected.
I then realized that when I take a slice of a tensor, as done in the code below, and save that slice, it does not save only the slice — it actually saves the entire original tensor! When I load the file, though, the tensor has the expected (sliced) shape.
The reason I noticed is that some of my files ended up being 10 GB, even though they only contain a single integer tensor of shape 30000 x 326, which should only take up around 78 MB as int64, or around 10 MB as int8.
I’m assuming that making a copy of that part of the tensor can fix this, but is this a known problem, and what is the normal workaround for it?
for i, a2mfile in enumerate(a2mfiles):
    t1 = time.time()
    msas = read_a2m_gz_file(a2mfile, AA_LIST=AA_LIST, unk_idx=unk_idx, max_seq_len=max_seq_len, min_seq_len=min_seq_len,
                            verbose=True)
    if msas is None:
        continue
    else:
        seq = msas[-1, :]
        msas = msas[:-1, :].to(dtype=torch.int8)
        n_msa = msas.shape[0]
        seq_len = msas.shape[1]
        print("{:}/{:} MSAs = {:}, sequence length {:}, time taken: {:2.2f}s, ETA: {:2.2f}h".format(
            i+1, nfiles, n_msa, seq_len, time.time() - t1, (nfiles - (i+1)) * (time.time() - t0) / (i+1) / 3600))
        if n_msa > max_samples:
            indices = torch.randperm(n_msa)
            msas_s = msas[indices[:max_samples], :]
        else:
            msas_s = msas
        data = {'seq': seq,
                'msas': msas_s,
                'seq_len': seq_len,
                'n_msas_org': n_msa}
        filename = Path(Path(a2mfile).stem).stem
        torch.save(data, "{:}{:}.pt".format(folder_out, filename))
Here is a minimal example showing the behavior more clearly:
import torch
n = 1000000
l = 300
data = torch.ones((n,l),dtype=torch.int64)
data_small = data[:30000,:] #.to(dtype=torch.int8)
torch.save(data_small,'test')
data_small2 = torch.ones_like(data_small)
torch.save(data_small2,'test2')
When I run this minimal example, test ends up at 2.4 GB, while test2 is only 72 MB.
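For what it's worth, calling .clone() on the slice before saving does seem to fix it in my testing, presumably because the slice is a view sharing the full underlying storage, and torch.save serializes that whole storage. Here is a scaled-down sketch of the same experiment (smaller sizes than above, so it runs quickly):

```python
import os
import torch

n, l = 10000, 300
data = torch.ones((n, l), dtype=torch.int64)

# A slice is a view: it shares the full underlying storage with `data`,
# and torch.save serializes that entire storage, not just the view.
view = data[:300, :]

# .clone() copies only the selected elements into fresh storage,
# so the saved file has the size the slice's shape suggests.
copy = view.clone()

torch.save(view, 'test_view.pt')
torch.save(copy, 'test_clone.pt')

print(os.path.getsize('test_view.pt'))   # roughly the full 10000 x 300 x 8 bytes
print(os.path.getsize('test_clone.pt'))  # roughly 300 x 300 x 8 bytes
```

I'd still like to know whether this is the intended workaround, or whether there is a more idiomatic way to detach a slice from its storage before saving.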