What is the recommended format to save data in PyTorch?

This is mainly out of curiosity (since one can always use the ToTensor() transform in the dataloader).

But when storing data (e.g. pre-processed data, synthetic data, etc.), is there any advantage to saving it as a torch.Tensor rather than as an image file, a PIL image, a NumPy array, etc.?

A fully functional example:

#%%

import torch

from pathlib import Path

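path = Path('~/data/tmp/').expanduser()
# create the directory first, otherwise torch.save fails if it does not exist
path.mkdir(parents=True, exist_ok=True)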

tensor_a = torch.rand(2,3)
tensor_b = torch.rand(1,3)

db = {'a': tensor_a, 'b': tensor_b}

torch.save(db, path/'torch_db')
loaded = torch.load(path/'torch_db')
print( loaded['a'] == tensor_a )
print( loaded['b'] == tensor_b )

Q:

  • If the data is already in tensor format, does that cause issues with the ToTensor() op?

Note: we probably can’t preload things onto the GPU to speed things up, since data loading in the dataloader is subtle due to CUDA multithreading subtleties.


related: Save a tensor to file

It seems that the ToTensor op does cause problems when it is given a tensor. Most likely the best way to save data is in NumPy format, so that the transforms do not complain.

Error:

TypeError: pic should be PIL Image or ndarray. Got <class 'torch.Tensor'>

Sample code that gave the error:


# saving torch tensors

import torch
import torch.nn as nn
import torchvision

from pathlib import Path
from collections import OrderedDict

path = Path('~/data/tmp/').expanduser()
path.mkdir(parents=True, exist_ok=True)

tensor_a = torch.rand(2,3)
tensor_b = torch.rand(1,3)

db = {'a': tensor_a, 'b': tensor_b}

torch.save(db, path/'torch_db')
loaded = torch.load(path/'torch_db')
print( loaded['a'] == tensor_a )
print( loaded['b'] == tensor_b )

# testing if ToTensor() screws things up
lb, ub = -1, 1
N, Din, Dout = 3, 1, 1
x = torch.distributions.Uniform(low=lb, high=ub).sample((N, Din))
print(x)

f = nn.Sequential(OrderedDict([
    ('f1', nn.Linear(Din,Dout)),
    ('out', nn.SELU())
]))
y = f(x)

transform = torchvision.transforms.ToTensor()
# this call raises the TypeError above, since y is already a torch.Tensor
y_proc = transform(y)
print(y_proc)
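
One possible way around the error above, as a minimal sketch: convert only when the input is not already a tensor. The helper name `ensure_tensor` is hypothetical (it is not part of torch or torchvision), and unlike ToTensor() it does not rescale values to [0, 1] or reorder image axes, so it is only a stand-in for plain numeric data like the arrays in this thread.

```python
import numpy as np
import torch


def ensure_tensor(x):
    """Hypothetical helper: pass tensors through, convert anything else."""
    if isinstance(x, torch.Tensor):
        # already a tensor (e.g. loaded with torch.load): nothing to do
        return x
    # NumPy arrays, lists, scalars: go through NumPy, then to a tensor
    return torch.as_tensor(np.asarray(x))


saved = torch.rand(2, 3)
from_tensor = ensure_tensor(saved)
from_array = ensure_tensor(np.zeros((2, 3), dtype=np.float32))

print(from_tensor is saved)
print(type(from_array), from_array.shape)
```

With something like this in the transform pipeline, data saved either as tensors or as NumPy arrays would go through the same code path.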

Reference for saving data in NumPy:


# saving data in numpy

import numpy as np
import pickle
from pathlib import Path

path = Path('~/data/tmp/').expanduser()
path.mkdir(parents=True, exist_ok=True)

lb,ub = -1,1
num_samples = 5
x = np.random.uniform(low=lb,high=ub,size=(1,num_samples))
y = x**2 + x + 2

# using save (to npy), savez (to npz)
np.save(path/'x', x)
np.save(path/'y', y)
np.savez(path/'db', x=x, y=y)
with open(path/'db.pkl', 'wb') as db_file:
    pickle.dump({'x': x, 'y': y}, db_file)

# loading the npy, npz and pkl files
x_loaded = np.load(path/'x.npy')
y_loaded = np.load(path/'y.npy')
db = np.load(path/'db.npz')
with open(path/'db.pkl', 'rb') as db_file:
    db_pkl = pickle.load(db_file)

# False: loading creates a new array object, so compare values instead
print(x is x_loaded)
print(x == x_loaded)
print(x == db['x'])
print(x == db_pkl['x'])
print('done')
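
To connect the NumPy files back to PyTorch, here is a minimal sketch of loading an npz file and feeding it to a DataLoader. It assumes the data fits in memory and is plain numeric data (so ToTensor() is not needed); the temporary directory and the file name `db.npz` are only for illustration.

```python
import tempfile
from pathlib import Path

import numpy as np
import torch
from torch.utils.data import DataLoader, TensorDataset

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp)

    # save synthetic data in NumPy format (float32 to match PyTorch's default dtype)
    x = np.random.uniform(low=-1, high=1, size=(5, 1)).astype(np.float32)
    y = x**2 + x + 2
    np.savez(path / 'db.npz', x=x, y=y)

    # load it back and convert to tensors without going through ToTensor()
    with np.load(path / 'db.npz') as db:
        x_t = torch.from_numpy(db['x'])
        y_t = torch.from_numpy(db['y'])

# wrap the tensors so a DataLoader can batch them
dataset = TensorDataset(x_t, y_t)
loader = DataLoader(dataset, batch_size=2)
batches = list(loader)

for xb, yb in batches:
    print(xb.shape, yb.shape)
```

torch.from_numpy shares memory with the loaded array rather than copying it, which is one reason the NumPy route adds little overhead here.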

reference: https://stackoverflow.com/a/62883249/1601580