Hi, I am trying to use the DataLoader’s built-in num_workers option to parallelize my batch processing, but I can’t see any significant gains from it. I wrote a small toy example to measure the effect of num_workers on the runtime of my code. Here is my code:

import numpy as np

import time

import torch
from torch.utils.data import TensorDataset
from torch.utils.data import DataLoader

num_workers = 0

T = 10
num_samples = 500000
num_features = 100
batch_size = 1024
num_epochs = 20

X = np.random.uniform(size=(num_samples,T,num_features))
y = np.random.uniform(size=(num_samples))

X = torch.Tensor(X)
y = torch.Tensor(y)

dataset = TensorDataset(X,y)

dataloader = DataLoader(dataset, batch_size=batch_size,shuffle=True,num_workers=num_workers)

# time.clock() was removed in Python 3.8; perf_counter() measures wall-clock time
tic = time.perf_counter()
for epoch in range(num_epochs):
    for batch_idx, (x,target) in enumerate(dataloader):
        print("==> Epoch:",epoch)

toc = time.perf_counter()
print("Run Time:",toc-tic)

I ran this code with various combinations of num_workers and batch size and recorded the run time:

Batch Size = 128, Num_workers = 0, Run Time = 26.278781
Batch Size = 128, Num_workers = 2, Run Time = 44.073031
Batch Size = 128, Num_workers = 4, Run Time = 45.135034
Batch Size = 128, Num_workers = 128, Run Time = 102.223168

Batch Size = 256, Num_workers = 0, Run Time = 28.837365
Batch Size = 256, Num_workers = 2, Run Time = 28.169192
Batch Size = 256, Num_workers = 4, Run Time = 29.175953
Batch Size = 256, Num_workers = 128, Run Time = 85.561222

Batch Size = 1024, Num_workers = 0, Run Time = 35.239104
Batch Size = 1024, Num_workers = 2, Run Time = 14.877822
Batch Size = 1024, Num_workers = 4, Run Time = 17.189713
Batch Size = 1024, Num_workers = 128, Run Time = 73.567457

Can someone explain how num_workers affects the runtime? Ideally, increasing num_workers should decrease the data loading time. So why does the run time increase in some cases and decrease in others?

In my experience, the useful number of workers depends on the number of cores in your CPU. DataLoader workers run as separate processes, and when you start more of them than your CPU can handle, scheduling becomes a bottleneck: the system might spend more time scheduling the workers than doing any useful work.
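To make the core-count point concrete, here is a minimal sketch for checking how many cores the current process can use and clamping a requested num_workers value to that number. The helper names `usable_cpu_count` and `cap_workers` are hypothetical and not part of PyTorch, and capping at the core count is only a rule of thumb, not a guaranteed optimum.

```python
import os


def usable_cpu_count():
    """Return the number of CPU cores this process is allowed to run on."""
    # os.sched_getaffinity is only available on some platforms (e.g. Linux);
    # it takes the CPU affinity mask of the process into account.
    if hasattr(os, "sched_getaffinity"):
        return len(os.sched_getaffinity(0))
    # Fallback: total logical cores. os.cpu_count() may return None.
    return os.cpu_count() or 1


def cap_workers(requested, cores=None):
    """Clamp a requested num_workers value to the range [0, cores].

    Hypothetical helper for illustration; not a PyTorch function.
    """
    if cores is None:
        cores = usable_cpu_count()
    return max(0, min(requested, cores))


if __name__ == "__main__":
    print("usable cores:", usable_cpu_count())
    print("num_workers for a request of 128:", cap_workers(128))
```

The capped value could then be passed as `num_workers` to the DataLoader instead of a number like 128.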


With a batch size of 128, the performance drops when I use num_workers = 2. How would you explain that? I am pretty sure my CPU has more than 2 cores.
Also, why does the performance depend on the batch size? Shouldn’t it be independent of it?


I think the time should depend on the batch size as well as on the number of cores, the speed of each core, I/O speed, RAM, cache limits, and so on. Don’t you think so?

The number of cores limits how many workers can run simultaneously. The batch size matters because the dataloader can return a batch only once every image in that batch has been loaded.
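One way to see why batch size could matter is to count how many batches the dataloader has to hand over per epoch. The sketch below uses the 500000 samples from the toy example and assumes the DataLoader default of keeping the final partial batch. It also assumes, without having measured it, that each batch carries a roughly fixed hand-off cost when workers are used, so smaller batches would pay that cost more often. The function name `batches_per_epoch` is made up for illustration.

```python
import math

NUM_SAMPLES = 500_000  # num_samples from the toy example in the question


def batches_per_epoch(num_samples, batch_size):
    """Number of batches in one epoch when the last partial batch is kept."""
    return math.ceil(num_samples / batch_size)


if __name__ == "__main__":
    for batch_size in (128, 256, 1024):
        n = batches_per_epoch(NUM_SAMPLES, batch_size)
        print(f"batch_size={batch_size:5d} -> {n} batches per epoch")
```

With a batch size of 128 there are 3907 batches per epoch, against 489 at a batch size of 1024, so any fixed per-batch cost is paid about eight times as often.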

I haven’t looked closely enough to compare the effect of batch size versus the number of workers. But from a naive, straightforward view, the performance seems to depend on both. Let’s see if any other experts or experienced users have a different view on this.