How to partially load a big tensor from a file?

kenenbek · June 19, 2020, 4:06pm

I can load a tensor from file like this:
X = torch.load(filename)
This tensor has a shape torch.Size([30000000]).

But I can't load the dataset fully due to memory limits.

How can I load only the first 3000 numbers from the file?
And then the second portion of 3000 numbers, without a full load?

Scott_Hoang · June 19, 2020, 4:18pm

I guess an easy solution is to physically partition your file.
Another option is to read the file as an incoming byte stream: stop reading when you reach the desired size, and resume reading once you need more data.

kenenbek · June 23, 2020, 3:22pm

Thank you, it’s a good idea.
How can I transform a byte stream into a PyTorch tensor?

Scott_Hoang · June 23, 2020, 5:27pm

The tricky part about reading your data as a byte stream is that the stream carries no structure.
Let's say x has shape (B x C x W x H); then the stream contains B * C * W * H elements. You must also know the type of your data (int, float, double, etc.) and how many bytes each element takes in order to read it correctly.
Here is some code to get you started:

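import numpy as np
import torch

C, W, H = 3, 32, 32  # example dimensions; replace with your own

# 'rb' opens the file for reading in binary mode.
with open(<your file path>, 'rb') as f:
    # Assume the data is uint8, i.e. 8 bits (1 byte) per element.
    # Load one batch element: C * W * H elements of 1 byte each.
    batch = np.frombuffer(f.read(1 * C * W * H), dtype=np.uint8)
batch = batch.reshape((C, W, H)).copy()  # copy: frombuffer returns a read-only array
tensor_batch = torch.from_numpy(batch)
# I'm writing this code on the fly. It might have some bugs, but you should get the idea.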
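
To read the first 3000 numbers and then the next 3000, the same idea can be combined with a byte offset. This is a minimal sketch, assuming the file holds raw int64 values with no header (for example, written with `ndarray.tofile`). The helper `read_chunk` and the file name `data.bin` are hypothetical names used only for illustration.

```python
import os
import tempfile

import numpy as np
import torch


def read_chunk(path, start, count, dtype=np.int64):
    """Read `count` items beginning at item index `start` from a raw binary file."""
    itemsize = np.dtype(dtype).itemsize
    with open(path, "rb") as f:
        f.seek(start * itemsize)        # skip the items before the chunk
        raw = f.read(count * itemsize)  # read only the bytes that are needed
    # frombuffer returns a read-only view, so copy before handing it to torch
    return torch.from_numpy(np.frombuffer(raw, dtype=dtype).copy())


# Demo data: a small raw file standing in for the real 30,000,000-element one.
path = os.path.join(tempfile.mkdtemp(), "data.bin")
np.arange(10_000, dtype=np.int64).tofile(path)

first = read_chunk(path, 0, 3000)      # items 0..2999
second = read_chunk(path, 3000, 3000)  # items 3000..5999
```

Only `count * itemsize` bytes are read per call, so memory use depends on the chunk size and not on the file size. This does not apply directly to files written by `torch.save` or `np.save`, which store extra information besides the raw values.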

kenenbek · June 26, 2020, 7:12pm

I asked here

Scott_Hoang · June 26, 2020, 8:03pm

Make sure that there is no metadata saved before the actual data.
Also:
f = open(filepath, 'rb')

kenenbek · June 26, 2020, 8:23pm

You are right, I missed the 'rb' flag.
Do you know which NumPy type corresponds to LongTensor?

Scott_Hoang · June 26, 2020, 9:04pm

it should be np.int64
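
A quick way to check the correspondence is a round trip between the two libraries. This is a small sketch; the variable names are arbitrary.

```python
import numpy as np
import torch

a = np.arange(5, dtype=np.int64)
t = torch.from_numpy(a)  # shares memory with `a`, no copy is made

# torch.int64 (alias torch.long) is the dtype of a LongTensor
print(t.dtype)

back = t.numpy()         # converting back gives an int64 NumPy array again
print(back.dtype)
```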

kenenbek · June 26, 2020, 9:25pm

import numpy as np

# write data
x = np.arange(10, dtype=np.int64)
np.save("1.npy", x)

# open data
f = open("1.npy", "rb")
print(np.frombuffer(f.read(8), dtype=np.int64))
f.close()

print(np.load("1.npy"))

The frombuffer call prints a wrong number.

But np.load("1.npy") works as expected.

Scott_Hoang · June 26, 2020, 9:31pm

The reason is that np.save stores metadata (a header) before the actual data.
I think this is a more elegant solution for your problem
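
One way to avoid parsing the header by hand is to let NumPy do it by memory-mapping the `.npy` file with `np.load(..., mmap_mode="r")` and slicing out only the part that is needed. This is my own suggestion and not necessarily the solution referred to above. The sketch below assumes the data was written with `np.save`; the temporary path is only for the demo.

```python
import os
import tempfile

import numpy as np
import torch

path = os.path.join(tempfile.mkdtemp(), "1.npy")
np.save(path, np.arange(10, dtype=np.int64))

# mmap_mode="r" parses the .npy header and maps the data without reading it all.
mm = np.load(path, mmap_mode="r")

# Slicing touches only the requested items; np.array(...) copies them into memory.
first = torch.from_numpy(np.array(mm[0:3]))
second = torch.from_numpy(np.array(mm[3:6]))
print(first, second)
```

For the original problem the slices would be `mm[0:3000]`, `mm[3000:6000]`, and so on.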