I’m working with a dataset where each sample has shape 120x1024. I’m hoping to use a large batch size of 4096 (so each batch is 4096x120x1024), but I’m experiencing very slow data loading even with num_workers=20.
Below is an example of my code. I’m getting 4 iterations/second on my machine. This is a bit too slow to finish training in a reasonable amount of time.
I’ve narrowed the bottleneck down to the `torch.stack()` call in `collate`, which seems to be about 10x slower than everything else. I believe the data is being copied into a contiguous block, hence the slowdown?
Is there any way to speed up dataloading in my case?
```python
import torch
from torch.utils.data import Dataset, DataLoader, RandomSampler

class MyDataset(Dataset):
    def __len__(self):
        return 100_000  # placeholder; the original snippet omits __len__

    def __getitem__(self, index):
        return torch.ones([120, 1024])

def collate(batch):
    data = torch.stack([b for b in batch])
    return data

train_dataset = MyDataset()
train_sampler = RandomSampler(train_dataset)
train_dataloader = DataLoader(
    train_dataset, sampler=train_sampler, batch_size=4096,
    num_workers=20, pin_memory=True, collate_fn=collate)
```
Note that too many workers might slow down your system, so you should test different values for your current setup.
That being said, to avoid the `torch.stack` call, you could use a `BatchSampler` to pass a batch of indices to `Dataset.__getitem__`, preallocate the final tensor via `torch.empty(batch_size, 120, 1024)`, and copy the tensors into it.
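A minimal sketch of that idea (the class name and dataset length are illustrative; passing `batch_size=None` disables automatic batching, so each list of indices yielded by the `BatchSampler` goes straight to `__getitem__`):

```python
import torch
from torch.utils.data import Dataset, DataLoader, RandomSampler, BatchSampler

class BatchIndexDataset(Dataset):
    def __len__(self):
        return 100_000  # illustrative length

    def __getitem__(self, indices):
        # receives a whole list of indices from the BatchSampler
        out = torch.empty(len(indices), 120, 1024)  # preallocate once
        for i, idx in enumerate(indices):
            out[i] = torch.ones(120, 1024)  # stand-in for real loading
        return out

dataset = BatchIndexDataset()
sampler = BatchSampler(RandomSampler(dataset), batch_size=4096, drop_last=False)
# batch_size=None: the DataLoader performs no extra collation or stacking
loader = DataLoader(dataset, sampler=sampler, batch_size=None,
                    num_workers=20, pin_memory=True)
```

Each element yielded by `loader` is then already a `(4096, 120, 1024)` tensor, built without an intermediate `torch.stack`.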
Thanks @ptrblck! The idea of copying the data into the preallocated tensor helps (it’s now 2x faster) but it seems to get slowed down because I need to do a for-loop copy.
```python
for i, b in enumerate(batch):
    final_tensor[i] = b
```
I think if I try to construct a (batch_size, 120, 1024) tensor at any point, it will be slow, so it seems like I actually need to loop over individual small tensors (1, 120, 1024)? I’m not sure how to use `BatchSampler` here such that it would help.
I think the approach should be right, since you would need the loop at some point anyway to load and process each image, no?
In the standard approach, the `DataLoader` will load each image one by one and use the `collate_fn` to create the batch. Now you could push this loop into `__getitem__` to load each sample in the loop and copy the data into the preallocated tensor.
Let me know if I misunderstood the use case.
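Concretely, pushing the loop into `__getitem__` might look like this sketch (here `load_sample` is a hypothetical stand-in for however a single sample is actually loaded and processed):

```python
import torch
from torch.utils.data import Dataset

def load_sample(idx):
    # hypothetical per-sample loading/processing
    return torch.ones(120, 1024)

class BatchedDataset(Dataset):
    def __len__(self):
        return 100_000  # illustrative length

    def __getitem__(self, indices):
        # `indices` is a list produced by a BatchSampler; the per-sample
        # loop now runs here instead of in a custom collate_fn
        batch = torch.empty(len(indices), 120, 1024)
        for i, idx in enumerate(indices):
            batch[i] = load_sample(idx)
        return batch
```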
If we push the for-loop into the `__getitem__` function, does that mean a single worker does the loop? Would that be faster than just doing the for-loop in `collate`?
I was wondering if it is possible to assign a row index of the preallocated tensor to each worker and have them copy the data into that index of the tensor in parallel.
Yes, but also in the default setup, where you are using a single index, a single worker will create the batch, so there shouldn’t be a difference, I assume (each worker creates its own batch).
Multiple workers do not create the same batch; each worker builds its own batch.
There is a feature request to let multiple workers work on the same batch, but I don’t think it’s ready yet.