You can already do that with torchnet (the `tnt` package). Concretely, you pass a list of data files to `tnt.ListDataset`, then wrap it in a `torch.utils.data.DataLoader`.
Example code:

```python
import torchnet as tnt
from torch.utils.data import DataLoader

def load_func(line):
    # 'line' is one line of 'list.txt'.
    # Implement how you load a single piece of data here,
    # assuming you load the data into src and target respectively.
    return {'src': src, 'target': target}  # you can also return a tuple or whatever you want

def batchify(batch):
    # 'batch' is a list of {'src', 'target'} dicts (or whatever load_func returns).
    # Implement how to collate that list into Tensors here,
    # assuming batch_src and batch_target hold the batched Tensors.
    return {'src': batch_src, 'target': batch_target}  # you can also return a tuple or whatever you want

# list.txt contains the list of data files, one per line
dataset = tnt.dataset.ListDataset('list.txt', load_func)

# This loads data only when needed, in parallel, with up to <num_workers> worker processes
loader = DataLoader(dataset=dataset, batch_size=50, num_workers=8, collate_fn=batchify)

for x in loader:  # iterate over batches
    print(x)
```
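To make the pattern concrete, here is a self-contained toy version of the two callbacks. It uses plain Python objects instead of Tensors, and the tab-separated file format is an assumption for illustration only; adapt both functions to your actual data:

```python
def load_func(line):
    # 'line' is one line of list.txt: the path to a single data file.
    # Hypothetical format: each data file holds one "src<TAB>target" pair.
    path = line.strip()
    with open(path) as f:
        src, target = f.read().rstrip('\n').split('\t')
    return {'src': src, 'target': target}

def batchify(batch):
    # Toy collate: gather each field into a list.
    # A real collate_fn would stack (or pad and stack) Tensors instead.
    return {
        'src': [example['src'] for example in batch],
        'target': [example['target'] for example in batch],
    }
```

With real data you would replace the list comprehensions in `batchify` with something like `torch.stack`, or pad-and-stack for variable-length sequences.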
There are surely other ways to do it. Hope this helps.