How can I partially load a big tensor from a file?

I can load a tensor from a file like this:
X = torch.load(filename)
This tensor has shape torch.Size([30000000]).

But I can't load the dataset fully due to memory limits.

How can I load only the first 3000 numbers from the file?
And then the second portion of 3000 numbers, without loading the whole file?

I guess an easy solution is to physically partition your file, e.g. as sketched below.
Another option is to read the file as an incoming byte stream: stop reading when you reach the desired size, and resume reading once you need more data.
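
For the partitioning option, here is a minimal sketch. It assumes the tensor can be loaded once in full (e.g. on a machine with enough memory) to do the split; the file names big_tensor.pt and chunk_*.pt are hypothetical:

import torch

# One-time split: save the big tensor as fixed-size chunk files.
x = torch.load("big_tensor.pt")   # shape: torch.Size([30000000])
for i, chunk in enumerate(x.split(3000)):
    torch.save(chunk, f"chunk_{i}.pt")

# Later, load one portion at a time.
first = torch.load("chunk_0.pt")    # first 3000 numbers
second = torch.load("chunk_1.pt")   # next 3000 numbers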

Thank you, it's a good idea.
How can I transform a byte stream into a PyTorch tensor?

The tricky part about reading your data as a byte stream is that there is no data structure.
Let's say x has shape (B, C, W, H); then your byte stream contains B * C * W * H elements. You must also know the type of your data, i.e. int, float, double, etc., and figure out how many bytes each element takes in order to read correctly.
Here is some code to get you started:

import numpy as np
import torch

C, W, H = 3, 32, 32  # example dimensions; use your own

f = open(filepath, 'rb')  # 'rb' means read bytes; filepath is your file path
# Let's assume your matrix is uint8, i.e. 8 bits or 1 byte per element.
# Loading one batch: read C*W*H bytes and interpret them as uint8.
batch = np.frombuffer(f.read(1 * C * W * H), dtype=np.uint8)
batch = batch.reshape((C, W, H))
tensor_batch = torch.from_numpy(batch.copy())  # copy: frombuffer arrays are read-only
f.close()
# I'm writing this code on the fly, so it might have some bugs, but you should get the idea.
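
To read the whole file batch by batch, you can wrap the same idea in a loop. A minimal sketch, assuming uint8 data and a hypothetical path "data.bin":

import numpy as np
import torch

C, W, H = 3, 32, 32          # example dimensions
batch_bytes = 1 * C * W * H  # uint8: 1 byte per element

with open("data.bin", "rb") as f:
    while True:
        buf = f.read(batch_bytes)
        if len(buf) < batch_bytes:
            break  # end of file (ignores a truncated final batch)
        batch = np.frombuffer(buf, dtype=np.uint8).reshape((C, W, H))
        tensor_batch = torch.from_numpy(batch.copy())
        # ... use tensor_batch here ...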


Make sure that there is no metadata saved before the actual data.
Also:
f = open(filepath, 'rb')


You are right, I missed the 'rb' flag.
Do you know which numpy type corresponds to LongTensor?

It should be np.int64.
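
For reference, a quick sanity check of this correspondence:

import torch

t = torch.LongTensor([1, 2, 3])  # LongTensor holds 64-bit integers
print(t.numpy().dtype)           # int64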

import numpy as np

# save data
x = np.arange(10, dtype=np.int64)
np.save("1.npy", x)

# read it back as raw bytes
f = open("1.npy", "rb")
print(np.frombuffer(f.read(8), dtype=np.int64))  # read the first 8 bytes as one int64
f.close()

print(np.load("1.npy"))

The frombuffer line prints some wrong number.
But np.load("1.npy") works as expected.

The reason is that np.save stores metadata (the .npy header) before the actual data, so the first 8 bytes you read are header bytes, not array values.
I think this is a more elegant solution for your problem.
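
For what it's worth, here is a sketch of two ways to partially read a .npy file, assuming the big tensor was saved with np.save to a hypothetical "data.npy". The mmap_mode argument and the np.lib.format helpers are standard numpy; the file name and slice sizes are illustrative:

import numpy as np
import torch

# Option 1: memory-map the file; slicing reads only that portion from disk.
mm = np.load("data.npy", mmap_mode="r")
first = torch.from_numpy(np.array(mm[0:3000]))      # first 3000 numbers
second = torch.from_numpy(np.array(mm[3000:6000]))  # next 3000 numbers

# Option 2: skip the .npy header explicitly, then read raw bytes.
# Assumes a version-1.0 header, which np.save writes by default.
f = open("data.npy", "rb")
np.lib.format.read_magic(f)             # consume the magic string and version
np.lib.format.read_array_header_1_0(f)  # consume the shape/order/dtype header
chunk = np.frombuffer(f.read(8 * 3000), dtype=np.int64)  # first 3000 int64 values
f.close()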