Loading huge data functionality

Thanks so much. I literally just modified a very small part of the original example code, main.py:

--- a/word_language_model/main.py
+++ b/word_language_model/main.py
@@ -57,8 +57,9 @@ def batchify(data, bsz):
     nbatch = data.size(0) // bsz
     data = data.narrow(0, 0, nbatch * bsz)
     data = data.view(bsz, -1).t().contiguous()
-    if args.cuda:
-        data = data.cuda()
+
+    # if args.cuda:
+    #     data = data.cuda()
     return data
 
 eval_batch_size = 10
@@ -103,6 +104,9 @@ def get_batch(source, i, evaluation=False):
     seq_len = min(args.bptt, len(source) - 1 - i)
     data = Variable(source[i:i+seq_len], volatile=evaluation)
     target = Variable(source[i+1:i+1+seq_len].view(-1))
+    if args.cuda:
+        data = data.cuda()
+        target = target.cuda()
     return data, target

I’ve started running @Morpheus_Hsieh’s script; I’ll try the demo later. Thanks!


Regarding the code snippet that you provided:

import os
from torch.utils.data import Dataset

class MyDataset(Dataset):
    def __init__(self):
        # one file path per sample; load_file is assumed to read a single sample from disk
        self.data_files = sorted(os.listdir('data_dir'))

    def __getitem__(self, idx):
        return load_file(self.data_files[idx])

    def __len__(self):
        return len(self.data_files)

Are the file paths stored in self.data_files supposed to represent each batch of data (or the data per loop iteration) returned by iterating over the loader?

It is one data instance per loop iteration; the DataLoader collates those individual samples into batches for you.
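A minimal sketch of how that plays out, assuming load_file returns one tensor sample (the batch size and worker count here are just illustrative):

from torch.utils.data import DataLoader

dataset = MyDataset()                 # one sample per __getitem__ call
loader = DataLoader(dataset, batch_size=50, shuffle=True, num_workers=4)

for batch in loader:
    # batch is 50 samples collated together; each file is only opened
    # inside __getitem__, one per sample, as it is needed
    ...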

Hi, did you solve the “unable to mmap memory: you tried to mmap 0GB” problem? How did you solve it? I have the same problem now.

This approach is good for images but isn’t a good fit for text. In NLP, the data is usually in one file with many lines, rather than one image per file. So how do you customize a Dataset accordingly?


I’m hitting the same problem here. Have you solved it yet?

I have the same problem here. Have you solved it?

This has been reported here:

Hi NgPDat,

Thanks for your response. I am running into a situation where I am trying to load data from many CSV files, each containing part of the data. Each CSV file contains, say, 2,000 rows, and I have 4,000 such files. I am trying to load the data in batches by iterating through the CSV files. I implemented a load_csv() function that returns the contents of a file and wrote a custom Dataset. But when I specify a batch size of 50, each iteration loads 50 whole files into memory and returns 50 × 2000 × 300 values (files × rows per file × number of columns). What changes should I make so that the DataLoader returns 50 rows rather than the data from 50 CSV files?
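A rough sketch of one way to index rows instead of whole files, assuming every CSV has the same number of rows (the class name, csv_paths and rows_per_file are made-up placeholders):

import csv
from torch.utils.data import Dataset

class CsvRowDataset(Dataset):
    """Hypothetical sketch: index individual rows across many CSV files."""

    def __init__(self, csv_paths, rows_per_file=2000):
        self.csv_paths = csv_paths
        self.rows_per_file = rows_per_file

    def __len__(self):
        # total number of rows across all files
        return len(self.csv_paths) * self.rows_per_file

    def __getitem__(self, idx):
        # map a global row index to (file, row-within-file)
        file_idx, row_idx = divmod(idx, self.rows_per_file)
        with open(self.csv_paths[file_idx], newline='') as f:
            for i, row in enumerate(csv.reader(f)):
                if i == row_idx:
                    return row
        raise IndexError(idx)

With batch_size=50, a DataLoader over this would yield 50 individual rows per iteration rather than 50 whole files.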

By doing this, is the data loaded asynchronously while the GPU is training? Or is it sequential, i.e. does it wait for the GPU training step to complete and then load the next batch?

If you use multiple workers in your DataLoader, each worker will load a batch in the background using multiprocessing while the GPU is busy.
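For illustration (the batch size and number of workers here are arbitrary, and my_dataset stands in for whatever map-style Dataset you already have):

from torch.utils.data import DataLoader

loader = DataLoader(my_dataset, batch_size=32, shuffle=True,
                    num_workers=4,      # 4 background worker processes
                    pin_memory=True)    # speeds up host-to-GPU copies

for batch in loader:
    # while the GPU works on this batch, the workers are already
    # preparing the next ones in the background
    ...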


Out of curiosity, have you tried using ConcatDataset? I noticed it seemed to be the answer to a similar question someone had.
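A quick sketch of what that looks like (part_a, part_b and part_c stand in for ordinary Datasets, each covering one chunk of the data):

from torch.utils.data import ConcatDataset, DataLoader

combined = ConcatDataset([part_a, part_b, part_c])
loader = DataLoader(combined, batch_size=50, shuffle=True)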

Hi, all my data is stored in a single multi-data.pt file, which contains all of my training and validation inputs and labels. Do you have any idea how I can write a Dataset class without loading the whole multi-data.pt file into memory?

What does .pt stand for? Do you have a reader that can open it “out of core”, i.e. load into memory only the part of the file that actually needs to be read? If that reader is a library, make sure it is thread/process safe (hint: I, and others, learned the hard way that HDF5 is not).

While HDF5 has these features (out-of-core reading/writing), it does not work with multiprocessing.

I am using memory-mapped files (from numpy), together with fast.ai and a custom “loader” that works with fast.ai: https://forums.fast.ai/t/out-of-core-data-block-itemlist-backed-up-by-memmap-files/39566 Hope it helps.
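For plain PyTorch, a rough sketch of the memmap idea (the file name, shape and dtype here are made-up assumptions; nothing is read into RAM up front, pages are fetched on access):

import numpy as np
import torch
from torch.utils.data import Dataset

class MemmapDataset(Dataset):
    def __init__(self, path='features.dat', n_rows=1_000_000, n_cols=300):
        # opening the memmap does not load the data; rows are paged in lazily
        self.data = np.memmap(path, dtype='float32', mode='r',
                              shape=(n_rows, n_cols))

    def __getitem__(self, idx):
        # copy() detaches the row from the memmap before handing it to torch
        return torch.from_numpy(self.data[idx].copy())

    def __len__(self):
        return len(self.data)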

It is a PyTorch file. It can simply be loaded with torch.load, but I don’t want to load it all into memory.

For those trying to load text data efficiently, you could leverage linecache and subprocess.
This works for the case where you have one huge file, let’s say 100+ GB, in which every row is one training example.

import csv
import linecache
import subprocess
from torch.utils.data import Dataset

class LazyTextDataset(Dataset):
    def __init__(self, filename):
        self._filename = filename
        # count the lines without reading the whole file into memory
        self._total_data = int(
            subprocess.check_output(['wc', '-l', filename]).split()[0])

    def __getitem__(self, idx):
        # linecache fetches a single line; idx + 1 because lines are 1-based
        line = linecache.getline(self._filename, idx + 1)
        return next(csv.reader([line]))

    def __len__(self):
        return self._total_data
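A usage sketch for reference (the file name, batch size and worker count are illustrative):

from torch.utils.data import DataLoader

dataset = LazyTextDataset('train.csv')
loader = DataLoader(dataset, batch_size=128, num_workers=4)

for batch in loader:
    # each sample is the parsed fields of one CSV line; default collation
    # assumes every line has the same number of fields
    ...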

Hi @lan2720, have you figured out a way to load a large text file? I have a huge CSV file of 5.9 GB, in which each line is a piece of text with its label; after converting the whole file to embeddings it consumes over 50 GB of memory, which is not practical for me.

I could save each line to its own file, the way images are stored, but I suspect that approach suits image data rather than text data.

Related question
Hi, I think I’ve run into a problem related to what you wrote. Could you please take a look?

I am trying to load 2 terabytes of data consisting of 1,200 images, where each image tensor is about 2 GB in size. I am using a typical DataLoader setup with a batch size of 12 across six GPUs. It tries to load all the images into RAM and then batch them onto the GPUs. This wastes resources, because the GPUs just sit waiting for all the images to load. Is there a better way to handle this, so that the GPUs can start computing as soon as a batch is loaded rather than waiting?
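A minimal sketch of the lazy-loading pattern from the earlier replies applied to this case, assuming each image was saved as its own tensor file with the same shape (the glob pattern, worker count and batch size are illustrative):

import glob
import torch
from torch.utils.data import Dataset, DataLoader

class LazyImageTensorDataset(Dataset):
    """Load one pre-saved image tensor per sample, only when requested."""

    def __init__(self, pattern='images/*.pt'):
        self.paths = sorted(glob.glob(pattern))

    def __getitem__(self, idx):
        # only this one image tensor is read from disk, not the whole set
        return torch.load(self.paths[idx])

    def __len__(self):
        return len(self.paths)

# workers prefetch batches in the background, so the GPUs can start as
# soon as the first batch is ready instead of waiting for everything
loader = DataLoader(LazyImageTensorDataset(), batch_size=12,
                    shuffle=True, num_workers=6, pin_memory=True)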