Loading huge data functionality


(Yu Yu) #21

Hi, did you solve the “unable to mmap memory: you tried to mmap 0GB” problem? If so, how? I am running into the same issue now.


#22

This approach works well for images but doesn’t fit text. In NLP, the data usually lives in a single file with many lines rather than one image per file. How should the Dataset be customized for that case? (A rough sketch of one option follows below.)
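Not an answer from the thread, but a minimal sketch of one common pattern: index the byte offset of every line once, then let `__getitem__` seek to a single line lazily, so the whole file is never held in memory. The file name `corpus.txt` is just a placeholder.

```python
import torch
from torch.utils.data import Dataset, DataLoader

class LineTextDataset(Dataset):
    """Treat each line of one large text file as a sample, read lazily."""

    def __init__(self, path):
        self.path = path
        # Record the byte offset of every line once, so __getitem__ can
        # seek straight to a line instead of loading the whole file.
        self.offsets = []
        with open(path, "rb") as f:
            offset = 0
            for line in f:
                self.offsets.append(offset)
                offset += len(line)

    def __len__(self):
        return len(self.offsets)

    def __getitem__(self, idx):
        with open(self.path, "rb") as f:
            f.seek(self.offsets[idx])
            line = f.readline().decode("utf-8").rstrip("\n")
        return line  # tokenize / convert to tensors here as needed

# loader = DataLoader(LineTextDataset("corpus.txt"), batch_size=32, shuffle=True)
```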


Custom Dataset error: unable to mmap memory: you tried to mmap 0GB
(Zizhuo Ren) #23

I’m running into the same problem here. Have you solved it yet?


(Will) #24

I have the same problem here. Have you solved it?


(Solomon K ) #25

This has been reported here:


(Vidyasagar Ranganaboina) #27

Hi NgPDat,

Thanks for your response. I am running into a situation where I am trying to load data from many CSV files, each holding part of the dataset. Each CSV file contains, say, 2000 rows, and I have 4000 such files. I am trying to load the data in batches by iterating through the CSV files. I implemented a load_csv() function that returns the contents of a file and wrote a custom Dataset. But when I specify batch size = 50, each iteration loads 50 files into memory and returns a batch of shape 50 × 2000 × 300 (files × rows per file × columns). What changes should I make so that the DataLoader returns only 50 rows rather than the data from 50 CSV files?
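Not a reply from the thread, but a minimal sketch of one way to do this, assuming every file has the same number of rows: make `__len__` count rows across all files and map a global row index to a (file, row) pair in `__getitem__`, so the DataLoader batches individual rows. The class name, `rows_per_file=2000`, and the file layout are placeholders.

```python
import glob
import pandas as pd
from torch.utils.data import Dataset, DataLoader

class CsvRowsDataset(Dataset):
    """Treat every row across many CSV files as one sample."""

    def __init__(self, csv_paths, rows_per_file=2000):
        self.csv_paths = list(csv_paths)
        self.rows_per_file = rows_per_file
        self._cache_idx = None   # index of the currently loaded file
        self._cache_df = None    # its contents

    def __len__(self):
        return len(self.csv_paths) * self.rows_per_file

    def __getitem__(self, idx):
        file_idx, row_idx = divmod(idx, self.rows_per_file)
        # Keep the most recently used file in memory so sequential
        # access does not reopen the CSV for every single row.
        if self._cache_idx != file_idx:
            self._cache_df = pd.read_csv(self.csv_paths[file_idx])
            self._cache_idx = file_idx
        return self._cache_df.iloc[row_idx].to_numpy()

# paths = sorted(glob.glob("data/part_*.csv"))  # hypothetical file layout
# loader = DataLoader(CsvRowsDataset(paths), batch_size=50)
```

With this indexing, batch_size=50 yields 50 rows (shape 50 × 300 here) instead of 50 whole files.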