Loading huge data functionality


(Xia Yandi) #1

Do you have any plan on implementing big data files loading functionality?

Suppose I have 300G data files for training, and I can’t load them all into memory.

For now, I am using TensorFlow, and they provide producer/consumer style data loading:
https://www.tensorflow.org/how_tos/reading_data/

With it, I don’t need to read all data set into memory at once, and I can load data in parallel fashion.

Any plan on similar functionality?

Thanks.


(Ng P Dat) #2

You can already do that with Pytorchnet.
Concretely, you pass a list of data files into tnt.ListDataset, then wrap it with torch.utils.DataLoader.
Example code:

def load_func(line):
    # a line in 'list.txt"

    # Implement how you load a single piece of data here

    # assuming you already load data into src and target respectively
    return {'src': src, 'target': target} # you can return a tuple or whatever you want it to

def batchify(batch):
    # batch will contain a list of {'src', 'target'}, or how you return it in load_func.

    # Implement method to batch the list above into Tensor here

    # assuming you already have two tensor containing batched Tensor for src and target
    return {'src': batch_src, 'target': batch_target} # you can return a tuple or whatever you want it to


dataset = ListDataset('list.txt', load_func) #list.txt contain list of datafiles, one per line
dataset = DataLoader(dataset=dataset, batch_size=50, num_workers=8, collate_fn=batchify) #This will load data when needed, in parallel, up to <num_workers> thread.

for x in dataset: #iterate dataset
    print(x)

There are surely other way to do it. Hope this helps.


(Adam Paszke) #3

Just define a Dataset object, that only loads a list of files in __init__, and loads them every time __getindex__ is called. Then, wrap it in a torch.utils.DataLoader with multiple workers, and you’ll have your files loaded lazily in parallel.

class MyDataset(torch.utils.Dataset):
    def __init__(self):
        self.data_files = os.listdir('data_dir')
        sort(self.data_files)

    def __getindex__(self, idx):
        return load_file(self.data_files[idx])

    def __len__(self):
        return len(self.data_files)


dset = MyDataset()
loader = torch.utils.DataLoader(dset, num_workers=8)

DataLoaders - Multiple files, and multiple rows per column with lazy evaluation
(Xia Yandi) #4

That is so cool! Thanks!


(Morpheus Hsieh) #5

Sorry, new to PyTorch. How might one adapt the above method to pytorch/examples/word_language_model/? Currently, it seems to load the entire data into cuda, which is causing OOM errors.


(Adam Paszke) #6

It would require rewriting the whole data loading part. It would need to go over all of it to gather all the tokens, and then lazily load only the batches you request. You’d need to add a proper torch.utils.Dataset subclass that does it.


(Morpheus Hsieh) #7

Thanks. I did something that may sound stupid.
I removed data.cuda() from batchify() and added
data = data.cuda()
target = target.cuda()
in get_batch().
It seems to be running now, but I wonder if this is a quick (and correct) fix.


(Adam Paszke) #8

Yeah, that’s correct too!


(Morpheus Hsieh) #9

That is a relief! However, I am noticing a huge and growing amount of memory usage now, after running for 40+ hours. Is there anything that I am missing? Perhaps every time when get_batch() is run, it is creating Variables and they are kept in memory after the batch is finished? Thanks.


(Adam Paszke) #10

Is that CPU memory or GPU memory? Everything should get freed from time to time. Are you just running the example with that single modification?


(Xia Yandi) #11

I wrote something follow your instruction. But it doesn’t work for me.

Here is what I do:

    def _load_hdf5_file(hdf5_file):
        f = h5py.File(hdf5_file, "r")
        data = []
        for key in f.keys():
            data.append(f[key])
        return tuple(data)


    class HDF5Dataset(Dataset):
        def __init__(self, data_files):
            self.data_files = sorted(data_files)

        def __getitem__(self, index):
            return _load_hdf5_file(self.data_files[index])

        def __len__(self):
            return len(self.data_files)

    train_set = HDF5Dataset(train_files) # there is only one file in train_files, i.e. train_files = ["foo_1"]
    train_loader = DataLoader(dataset=train_set,
                              batch_size=train_batch_size,
                              shuffle=True,
                              num_workers=2)

And during iteration, I got this error:

    Traceback (most recent call last):
      File "/usr/lib/python2.7/multiprocessing/util.py", line 274, in _run_finalizers
      File "/usr/lib/python2.7/multiprocessing/util.py", line 207, in __call__
      File "/usr/lib/python2.7/shutil.py", line 239, in rmtree
      File "/usr/lib/python2.7/shutil.py", line 237, in rmtree
    OSError: [Errno 24] Too many open files: '/tmp/pymp-Y6oJsO'
    Process Process-1:
    Traceback (most recent call last):
      File "/usr/lib/python2.7/multiprocessing/process.py", line 258, in _bootstrap
      File "/usr/lib/python2.7/multiprocessing/process.py", line 114, in run
      File "/home/ts-yandixia01/.local/lib/python2.7/site-packages/torch/utils/data/dataloader.py", line 36, in _worker_loop
      File "/usr/lib/python2.7/multiprocessing/queues.py", line 392, in put
      File "/home/ts-yandixia01/.local/lib/python2.7/site-packages/torch/multiprocessing/queue.py", line 17, in send
      File "/usr/lib/python2.7/pickle.py", line 224, in dump
      File "/usr/lib/python2.7/pickle.py", line 286, in save
      File "/usr/lib/python2.7/pickle.py", line 554, in save_tuple
      File "/usr/lib/python2.7/pickle.py", line 286, in save
      File "/usr/lib/python2.7/pickle.py", line 606, in save_list
      File "/usr/lib/python2.7/pickle.py", line 639, in _batch_appends
      File "/usr/lib/python2.7/pickle.py", line 286, in save
      File "/usr/lib/python2.7/pickle.py", line 606, in save_list
      File "/usr/lib/python2.7/pickle.py", line 639, in _batch_appends
      File "/usr/lib/python2.7/pickle.py", line 286, in save
      File "/usr/lib/python2.7/pickle.py", line 606, in save_list
      File "/usr/lib/python2.7/pickle.py", line 639, in _batch_appends
      File "/usr/lib/python2.7/pickle.py", line 286, in save
      File "/usr/lib/python2.7/multiprocessing/forking.py", line 67, in dispatcher
      File "/usr/lib/python2.7/pickle.py", line 401, in save_reduce
      File "/usr/lib/python2.7/pickle.py", line 286, in save
      File "/usr/lib/python2.7/pickle.py", line 554, in save_tuple
      File "/usr/lib/python2.7/pickle.py", line 286, in save
      File "/usr/lib/python2.7/multiprocessing/forking.py", line 66, in dispatcher
      File "/home/ts-yandixia01/.local/lib/python2.7/site-packages/torch/multiprocessing/reductions.py", line 116, in reduce_storage
      File "/usr/lib/python2.7/multiprocessing/reduction.py", line 145, in reduce_handle
    OSError: [Errno 24] Too many open files

And the program never terminates.

Did I do anything wrong? Thanks

By the way, I don’t know if it is appropriate to ask here, how can I post python style code here like you did? Thanks.


(Adam Paszke) #12

It seems that your system allows only for a small number of open files. Can you try adding torch.multiprocessing.set_sharing_strategy('file_system') at the top of your script and try again?

Just append python after the three backticks to add syntax highlighting.


(Xia Yandi) #13

I added the line, and I got this error:

Process Process-1:
Traceback (most recent call last):
  File "/usr/lib/python2.7/multiprocessing/process.py", line 258, in _bootstrap
    self.run()
  File "/usr/lib/python2.7/multiprocessing/process.py", line 114, in run
    self._target(*self._args, **self._kwargs)
  File "/home/ts-yandixia01/.local/lib/python2.7/site-packages/torch/utils/data/dataloader.py", line 36, in _worker_loop
    data_queue.put((idx, samples))
  File "/usr/lib/python2.7/multiprocessing/queues.py", line 392, in put
    return send(obj)
  File "/home/ts-yandixia01/.local/lib/python2.7/site-packages/torch/multiprocessing/queue.py", line 17, in send
    ForkingPickler(buf, pickle.HIGHEST_PROTOCOL).dump(obj)
  File "/usr/lib/python2.7/pickle.py", line 224, in dump
    self.save(obj)
  File "/usr/lib/python2.7/pickle.py", line 286, in save
    f(self, obj) # Call unbound method with explicit self
  File "/usr/lib/python2.7/pickle.py", line 554, in save_tuple
    save(element)
  File "/usr/lib/python2.7/pickle.py", line 286, in save
    f(self, obj) # Call unbound method with explicit self
  File "/usr/lib/python2.7/pickle.py", line 606, in save_list
    self._batch_appends(iter(obj))
  File "/usr/lib/python2.7/pickle.py", line 639, in _batch_appends
    save(x)
  File "/usr/lib/python2.7/pickle.py", line 286, in save
    f(self, obj) # Call unbound method with explicit self
  File "/usr/lib/python2.7/pickle.py", line 606, in save_list
    self._batch_appends(iter(obj))
  File "/usr/lib/python2.7/pickle.py", line 639, in _batch_appends
    save(x)
  File "/usr/lib/python2.7/pickle.py", line 286, in save
    f(self, obj) # Call unbound method with explicit self
  File "/usr/lib/python2.7/pickle.py", line 606, in save_list
    self._batch_appends(iter(obj))
  File "/usr/lib/python2.7/pickle.py", line 639, in _batch_appends
    save(x)
  File "/usr/lib/python2.7/pickle.py", line 286, in save
    f(self, obj) # Call unbound method with explicit self
  File "/usr/lib/python2.7/multiprocessing/forking.py", line 67, in dispatcher
    self.save_reduce(obj=obj, *rv)
  File "/usr/lib/python2.7/pickle.py", line 401, in save_reduce
    save(args)
  File "/usr/lib/python2.7/pickle.py", line 286, in save
    f(self, obj) # Call unbound method with explicit self
  File "/usr/lib/python2.7/pickle.py", line 554, in save_tuple
    save(element)
  File "/usr/lib/python2.7/pickle.py", line 286, in save
    f(self, obj) # Call unbound method with explicit self
  File "/usr/lib/python2.7/multiprocessing/forking.py", line 66, in dispatcher
    rv = reduce(obj)
  File "/home/ts-yandixia01/.local/lib/python2.7/site-packages/torch/multiprocessing/reductions.py", line 109, in reduce_storage
    metadata = storage._share_filename_()
RuntimeError: $ Torch: unable to mmap memory: you tried to mmap 0GB. at /home/soumith/local/builder/wheel/pytorch-src/torch/lib/TH/THAllocator.c:317

Thanks


(Morpheus Hsieh) #14

That was CPU memory that I noticed using htop. Besides the above modification, I used another corpus which is about 9 GB. Anyway, it finished without OOM so I guess it’s fine :stuck_out_tongue:.


(Adam Paszke) #15

That was probably an out of memory error. If the data or the code is public (or if you could just isolate the data loading into a separate script), I could run it myself and make sure it doesn’t leak. But we have some tests for that.


(Xia Yandi) #16

Hi,

I just wrote a simple demo code to reproduce the error. The code is here:

The data is randomly generated, but everything is the same as mine except the actual values.

You could run main.py to reproduce the error.

I don’t know if this will be useful, my system is:
Ubuntu 16.04, TitanX, cuda 8.0, pip installed pytorch

Thanks!


(Morpheus Hsieh) #17

Thanks so much. I literally just modified a very small part of the original example code main.py

--- a/word_language_model/main.py
+++ b/word_language_model/main.py
@@ -57,8 +57,9 @@ def batchify(data, bsz):
     nbatch = data.size(0) // bsz
     data = data.narrow(0, 0, nbatch * bsz)
     data = data.view(bsz, -1).t().contiguous()
-    if args.cuda:
-        data = data.cuda()
+
+    # if args.cuda:
+    #     data = data.cuda()
     return data
 
 eval_batch_size = 10
@@ -103,6 +104,9 @@ def get_batch(source, i, evaluation=False):
     seq_len = min(args.bptt, len(source) - 1 - i)
     data = Variable(source[i:i+seq_len], volatile=evaluation)
     target = Variable(source[i+1:i+1+seq_len].view(-1))
+    if args.cuda:
+        data = data.cuda()
+        target = target.cuda()
     return data, target

(Adam Paszke) #18

I’ve started a script running @Morpheus_Hsieh’s script, I’ll try the demo later. Thanks!


(itzjustricky) #19

for the code snippet that you provided.

class MyDataset(torch.utils.Dataset):
    def __init__(self):
        self.data_files = os.listdir('data_dir')
        sort(self.data_files)

    def __getindex__(self, idx):
        return load_file(self.data_files[idx])

    def __len__(self):
        return len(self.data_files)

Are the file paths stored in self.data_files suppose to represent each batch of data (or data per loop) returned by iterating loader?


#20

it is data per instance of the loop