Quickly loading raw binary files

I need to load raw int16 binary files on-the-fly during training. In Lua Torch I could just do the following:

img = torch.ShortStorage(filename)

It appears that that functionality does not exist in PyTorch. So instead I am loading into a numpy array as follows:

img = np.fromfile(filename, 'int16')

The problem is that np.fromfile is extremely slow. For my large data files (>100MB), np.fromfile loads four orders of magnitude slower than the old torch.ShortStorage method. How can I get the fast load speeds in PyTorch?

We don’t support memory-mapping files right now. I’ve opened an issue for that.

Apparently numpy has (fast) memory mapping:

img = np.memmap(filename, dtype='int16', mode='r').__array__()

That gives me a numpy array of int16. I can’t use .numpy() to immediately convert that object into a torch tensor, because conversion from int16 is not supported. That’s okay in my case – I can do the preprocessing in numpy – but thought it was worth pointing out.
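
For anyone landing here later, a minimal sketch of the memory-mapped load, assuming the file is headerless, native-endian int16. The helper name load_raw_int16 and the small demo file are made up for illustration, and np.asarray is used in place of calling .__array__() directly:

```python
import tempfile

import numpy as np


def load_raw_int16(filename):
    """Memory-map a headerless int16 file and return a read-only ndarray view."""
    # mode='r' maps the file read-only; data is paged in lazily on access.
    mm = np.memmap(filename, dtype=np.int16, mode='r')
    # np.asarray returns a plain ndarray over the same mapped buffer (no copy).
    return np.asarray(mm)


# Demo with a small temporary file (hypothetical data, not real training files).
demo = np.arange(-5, 5, dtype=np.int16)
with tempfile.NamedTemporaryFile(suffix='.raw', delete=False) as f:
    demo_path = f.name
demo.tofile(demo_path)

img = load_raw_int16(demo_path)
# Sanity check: the mapped view matches what np.fromfile reads.
same_as_fromfile = np.array_equal(img, np.fromfile(demo_path, dtype=np.int16))
```

Because the mapping is read-only, the returned array cannot be modified in place; copy it first if the preprocessing needs to write into it.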

That’s weird, int16 should be equivalent to torch.ShortTensor. Not sure why it doesn’t work for you. Can you please print an error? Are you sure you’ve used the correct function? .numpy() is a torch method, you won’t find that in numpy. You should use torch.from_numpy.

I meant torch.from_numpy. The error is quite clear:

RuntimeError: can’t convert a given np.ndarray to a tensor - it has an invalid type. The only supported types are: double, float, int64, int32, and uint8.
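
As long as int16 is missing from that list, one possible workaround is to cast to a listed type before calling torch.from_numpy. A minimal sketch, assuming int32 is acceptable downstream (it can hold every int16 value) and that the extra copy is affordable; the helper name to_supported_dtype is made up:

```python
import numpy as np


def to_supported_dtype(arr):
    """Copy an int16 array to int32, one of the types listed in the error message."""
    # astype copies by default, so the result is writable even if arr is a
    # read-only memory-mapped view.
    return arr.astype(np.int32)


raw = np.array([-32768, -1, 0, 1, 32767], dtype=np.int16)
converted = to_supported_dtype(raw)

# With PyTorch installed, the result can then be wrapped without a further copy:
#   import torch
#   tensor = torch.from_numpy(converted)
```

The cast doubles the memory used for the array, so for large files it may be worth converting only the slice that is needed for the current batch.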

Huh, we should add int16 too. I’ll open an issue.