Loading huge data functionality

NgPDat · February 5, 2017, 7:22am

You can already do that with Pytorchnet.
Concretely, you pass a list of data files into tnt.ListDataset, then wrap it with torch.utils.DataLoader.
Example code:

def load_func(line):
    # a line in 'list.txt"

    # Implement how you load a single piece of data here

    # assuming you already load data into src and target respectively
    return {'src': src, 'target': target} # you can return a tuple or whatever you want it to

def batchify(batch):
    # batch will contain a list of {'src', 'target'}, or how you return it in load_func.

    # Implement method to batch the list above into Tensor here

    # assuming you already have two tensor containing batched Tensor for src and target
    return {'src': batch_src, 'target': batch_target} # you can return a tuple or whatever you want it to


dataset = ListDataset('list.txt', load_func) #list.txt contain list of datafiles, one per line
dataset = DataLoader(dataset=dataset, batch_size=50, num_workers=8, collate_fn=batchify) #This will load data when needed, in parallel, up to <num_workers> thread.

for x in dataset: #iterate dataset
    print(x)

There are surely other way to do it. Hope this helps.