How to make use of all the data if out of memory

Hello! I have this code:

import numpy as np 
import matplotlib.pyplot as plt
import pandas as pd
import torch 
import torch.nn as nn 
import torch.nn.functional as F 
import torch.optim as optim 

f = pd.read_hdf("data.h5")
dt = f.values
data = torch.tensor(dt[0:700000])
data_train_x = data.t()[0:2100].float().cuda()
data_train_y = data.t()[2100].float().cuda()

I want to load the data and use it to train a neural network. When I run this part I get this error:

RuntimeError                              Traceback (most recent call last)
<ipython-input-5-cb8efd990293> in <module>
----> 1 data_train_x = data.t()[0:2100].float().cuda()
      2 data_train_y = data.t()[2100].float().cuda()

RuntimeError: [enforce fail at CPUAllocator.cpp:56] posix_memalign(&data, gAlignment, nbytes) == 0. 12 vs 0

I assume that I am somehow running out of memory. The training data (the x variable) is 700,000 arrays of length 2100 each. The y variable is 700,000 arrays of length 1 each (1 or 0). Is there a way to get around this and use my whole dataset? Maybe load the data during training somehow? I assume I could use fewer than 700,000 samples, but I would like to use all the available data. Any advice? Thank you!

Could you try creating the tensor via torch.from_numpy(dt[0:700000])? Your current code most likely copies the data, which seems to cause this error. How much RAM does your machine have, and how big is the data, i.e. how many features does each sample have and in which format?

Thank you for your reply. I am not totally sure I understand all the questions, but I will try my best. The code I posted is all I do so far. I am not passing the array through numpy; it goes from a pandas DataFrame to a PyTorch tensor directly. I tried going through numpy, too, but I get the same error. The machine has 30 GB of RAM. I tried the same code on another machine, where I can set the amount of memory to use, and it fails at 30 GB there, too. If I increase it to 40 GB it seems to work, so it is clearly a memory problem. My main question is how I can get around this. I assume people training on large datasets (like ImageNet) have the same problem, maybe a lot worse than mine, so I thought there might be smart ways around it. As I said, the data is floats; each input is a 1D tensor of length 2100 and the output is a 1 or 0. In total I have 700,000 such input and output pairs. Please let me know if I didn’t answer any of your questions.
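As a rough sanity check on those numbers, the footprint of one copy of the array can be estimated by hand. The sketch below assumes the values are stored as float64 (8 bytes each), which the thread does not confirm; the helper name gigabytes is made up for illustration.

```python
# Back-of-the-envelope memory estimate for the dataset described above.
# Assumption: values are float64 (8 bytes each); the thread only says "floats".
n_samples = 700_000
n_columns = 2100 + 1  # 2100 features plus one label column


def gigabytes(n_elements, bytes_per_element):
    """Size in decimal gigabytes (1 GB = 1e9 bytes)."""
    return n_elements * bytes_per_element / 1e9


n_elements = n_samples * n_columns
size_float64 = gigabytes(n_elements, 8)
size_float32 = gigabytes(n_elements, 4)
size_float16 = gigabytes(n_elements, 2)

print(f"float64: {size_float64:.2f} GB")
print(f"float32: {size_float32:.2f} GB")
print(f"float16: {size_float16:.2f} GB")
```

Under that assumption a single float64 copy is close to 12 GB. The NumPy array, a second float64 copy made by torch.tensor, and a float32 copy made by .float() would together come to roughly 29 GB, which may be why the code fails at 30 GB and runs at 40 GB.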

@smu226 To answer your question on training with large datasets like ImageNet: such datasets are usually loaded in batches (in PyTorch, typically with a DataLoader). Images are usually stored on disk and loaded one by one during training instead of loading one huge numpy/hdf5 file. If you can afford more RAM, loading everything into memory at once is preferable, since it is much faster (it avoids disk read latency) than loading on the fly for every batch.
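A minimal sketch of that idea, under some assumptions that are not from the thread: the data has been converted to a .npy file (here a small toy file written to a temporary directory), each row holds the features followed by one label, and the class name MemmapRowDataset is hypothetical. The file is memory-mapped, so only the rows requested for a batch are read from disk.

```python
import os
import tempfile

import numpy as np
import torch
from torch.utils.data import DataLoader, Dataset


class MemmapRowDataset(Dataset):
    """Hypothetical dataset: each row is n_features values followed by one label."""

    def __init__(self, path, n_features):
        # mmap_mode="r" maps the file instead of reading it all into RAM.
        self.array = np.load(path, mmap_mode="r")
        self.n_features = n_features

    def __len__(self):
        return self.array.shape[0]

    def __getitem__(self, idx):
        # Copy just this one row into memory, converting to float32.
        row = np.array(self.array[idx], dtype=np.float32)
        x = torch.from_numpy(row[: self.n_features])
        y = torch.from_numpy(row[self.n_features:])
        return x, y


# Toy stand-in for the real data: 1000 rows, 20 features, 1 binary label.
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, "toy.npy")
rng = np.random.default_rng(0)
toy = rng.random((1000, 21))
toy[:, 20] = toy[:, 20] > 0.5
np.save(path, toy)

dataset = MemmapRowDataset(path, n_features=20)
loader = DataLoader(dataset, batch_size=64, shuffle=True)

xb, yb = next(iter(loader))
print(xb.shape, yb.shape, xb.dtype)
```

Shuffled access to a memory-mapped file means random disk reads, so this trades speed for memory, as described above.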

Since you say RAM is the issue here, you can try a lower-precision datatype such as float16, which can save you some memory during the tensor conversion.
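To give a feel for the saving, here is a small sketch on a toy array (1000 x 2100 random values in [0, 1), not the real data). It compares the size of a float32 and a float16 tensor and measures the rounding error that float16 introduces for values in this range.

```python
import numpy as np
import torch

# Toy array standing in for a slice of the real data (float64 by default).
small = np.random.default_rng(0).random((1000, 2100))

as32 = torch.from_numpy(small).float()  # float32 copy
as16 = torch.from_numpy(small).half()   # float16 copy

bytes32 = as32.element_size() * as32.nelement()
bytes16 = as16.element_size() * as16.nelement()

# Largest absolute difference between the two precisions for this toy data.
max_err = (as16.float() - as32).abs().max().item()

print(f"float32: {bytes32 / 1e6:.1f} MB, float16: {bytes16 / 1e6:.1f} MB")
print(f"max rounding error: {max_err:.2e}")
```

float16 halves the memory relative to float32, but it keeps only about three significant decimal digits and overflows above 65504, so whether it is acceptable depends on the range of your features.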

The DataFrame should store the data as numpy arrays.
If you call torch.tensor(np.array(...)), you’ll create a copy of the data, which is why I suggested using torch.from_numpy: it just points to the numpy array and uses the same memory.
Did it still yield the OOM issue?
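The difference can be seen on a tiny array. This sketch modifies the NumPy array after creating both tensors: the torch.from_numpy tensor sees the change because it shares the buffer, while the torch.tensor copy does not.

```python
import numpy as np
import torch

arr = np.zeros((4, 3), dtype=np.float64)

shared = torch.from_numpy(arr)  # no copy: same underlying memory
copied = torch.tensor(arr)      # always allocates a new copy

arr[0, 0] = 1.0  # modify the NumPy array in place

print(shared[0, 0].item())  # reflects the change
print(copied[0, 0].item())  # still the old value
print(shared.data_ptr() == arr.ctypes.data)  # same address
```

Note that a later dtype conversion such as .float() on a float64 tensor still allocates a new float32 tensor, so torch.from_numpy removes only the first copy.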

Regarding the ImageNet question, @smu226 answered it nicely.

Thank you for this! I guess this is what I was looking for. I am not sure how to load the data efficiently during training. Could you please point me towards an example of doing this? Thank you!

Thanks a lot! This works now. About ImageNet (as I already asked in the other reply), could you please point me towards an example of efficiently loading the data during training?

Good to hear it’s working now!
The Data Loading Tutorial might be a good starter.
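In the meantime, here is a rough sketch of the general pattern rather than code from the tutorial: keep the full dataset in CPU memory and move only the current batch to the GPU inside the training loop. The random toy data, the small model, and the hyperparameters are placeholders chosen for illustration, and the code falls back to the CPU if no GPU is available.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Toy stand-in for the real data; it stays on the CPU.
x_cpu = torch.randn(1024, 2100)
y_cpu = torch.randint(0, 2, (1024, 1)).float()

loader = DataLoader(
    TensorDataset(x_cpu, y_cpu),
    batch_size=128,
    shuffle=True,
    pin_memory=(device.type == "cuda"),  # pinned memory speeds up host-to-GPU copies
)

# Placeholder model: a small binary classifier that outputs logits.
model = nn.Sequential(nn.Linear(2100, 32), nn.ReLU(), nn.Linear(32, 1)).to(device)
criterion = nn.BCEWithLogitsLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

n_batches = 0
for xb, yb in loader:
    # Only this batch is moved to the device, not the whole dataset.
    xb = xb.to(device, non_blocking=True)
    yb = yb.to(device, non_blocking=True)

    optimizer.zero_grad()
    loss = criterion(model(xb), yb)
    loss.backward()
    optimizer.step()
    n_batches += 1

last_loss = loss.item()
print(f"{n_batches} batches, last loss {last_loss:.4f}")
```

With this layout the GPU only ever holds the model and one batch, so GPU memory use no longer grows with the size of the dataset.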