Data loading time is nearly proportional to batch size

I noticed something odd: after I doubled the batch size in a training run, the data loading time almost doubled as well (about 1.8 to 1.9 times the original).
I measure the data loading time as the interval between the end of one iteration and the beginning of the next, i.e. the time the for loop spends fetching the next batch.
I simplified the logic as follows:

import ... 

end = time.time()
for (input, label) in loader:
	data_load_time = time.time() - end   # time spent waiting for this batch
	output = model(input)
	loss = crit(output, label)
	end = time.time()                    # end of this iteration

I ran experiments on a single GPU to compare batch sizes 32 and 64:

gpu 1 batch 32
Time 2.505s (2.551s) Speed 12.8 samples/s Data 1.808s (1.849s)
gpu 1 bch 64 (cpu 5, worker 24)
Time 4.707s (4.897s) Speed 13.6 samples/s Data 3.453s (3.599s)

It shows that with the larger batch size, the data loading time nearly doubles (around 1.9 times: 3.453 s vs. 1.808 s).
With multiple workers, batches should be loaded ahead of time and be ready when they are needed.
So with enough workers, a larger batch should not add much delay compared with a small batch, since the loading is handled in parallel by multiple CPU workers.

This causes trouble. It means more GPU resources (either more GPUs or more GPU memory) will not help reduce the training time, because the data loading time is proportional to the batch size. Even if you have enough GPUs to handle more data at the same time, data loading will hold everything up.

Multiple workers try to load the data in the background, while the model training is executed.
If the workers are not fast enough or bottlenecked by e.g. a local (spinning) HDD, they won’t be able to preload the next batch, which seems to be the case for your training routine.
Have a look at this post for more information.
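
To make the "workers are not fast enough" case concrete, here is a rough back-of-the-envelope model, not a description of how the DataLoader is actually implemented. It assumes each worker builds one whole batch at a time, that workers do not contend for a shared resource such as a disk, and that loading fully overlaps with compute. The function name and all parameters are hypothetical.

```python
def steady_state_wait(batch_size, per_sample_s, num_workers, compute_s):
    """Idealized time (seconds) the training loop waits for data per iteration.

    Assumptions (not taken from the thread): each worker builds a whole
    batch by itself, workers do not slow each other down, and loading
    overlaps perfectly with the compute of earlier iterations.
    """
    batch_build_s = batch_size * per_sample_s      # one worker, one batch
    batch_period_s = batch_build_s / num_workers   # a batch arrives this often
    # The loop only waits when batches arrive slower than it consumes them.
    return max(0.0, batch_period_s - compute_s)
```

With made-up numbers, `steady_state_wait(32, 0.02, 4, 0.5)` gives no wait at all, while `steady_state_wait(32, 0.1, 4, 0.5)` leaves the loop waiting on every iteration. If the workers share one slow disk, the division by `num_workers` is too optimistic and the wait grows with the batch size.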

It makes sense that IOPS could limit the reading speed, as the data consists of lots of images.

Here is what I think:
Let’s assume the IOPS is a fixed number N.
If the data reading workload is within the drive’s capability (let’s say the corresponding batch size is n), the data loading time should be similar for every batch size below n.
In other words, we can assume that the batch-size/time curve is linear (proportional) when batch > n and flat when the batch is small, batch <= n.
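
The hypothesis above can be written down as a small piecewise function. This is only a sketch of the assumed curve; `n` and `flat_time_s` are unknowns of the hypothesis, and the function name is made up for illustration.

```python
def hypothesized_load_time(batch_size, n, flat_time_s):
    """Expected data loading time under the flat-then-linear hypothesis.

    n           -- largest batch size the storage can serve without delay
                   (unknown, hypothetical)
    flat_time_s -- loading time observed for any batch size <= n
    """
    if batch_size <= n:
        return flat_time_s                    # flat region
    return flat_time_s * batch_size / n       # proportional region
```

If this model holds, halving the batch size should stop reducing the loading time once the batch size drops to n or below.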

So I did a test with varying batch size numbers:

gpu 1 batch 32
Time 2.505s (2.551s) Speed 12.8 samples/s Data 1.808s (1.849s)
gpu 1 bch 64 (cpu 5, worker 24)
Time 4.707s (4.897s) Speed 13.6 samples/s Data 3.453s (3.599s)
bch 16
Time 1.349s (1.351s) Speed 11.9 samples/s Data 0.900s (0.917s)
proportional to batchsize??

bch 8
Time 0.814s (0.832s) Speed 9.8 samples/s Data 0.471s (0.476s)

bch 4
Time 0.542s (0.590s) Speed 7.4 samples/s Data 0.242s (0.259s)

It is surprising that even with a batch size as small as 4, the reading time still follows the linear trend almost exactly instead of flattening out.

Does this mean the reading is so slow that even a batch size of 4 is already large enough to make data loading the bottleneck?
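
One way to check the linear trend is to divide each logged data time by its batch size. The sketch below uses the first (current-value) number of the `Data` column from the logs above; the variable and function names are made up for illustration.

```python
# (batch size, data loading time in seconds), copied from the logs above
measurements = [(4, 0.242), (8, 0.471), (16, 0.900), (32, 1.808), (64, 3.453)]

def per_sample_times(rows):
    """Return (batch size, data loading seconds per sample) pairs."""
    return [(batch, seconds / batch) for batch, seconds in rows]

for batch, seconds in per_sample_times(measurements):
    print(f"batch {batch:>2}: {seconds * 1000:.1f} ms per sample")
```

The per-sample cost stays between roughly 54 and 61 ms across a 16x change in batch size, so there is no visible flat region in this range.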

BTW, I am working on a cluster. The shared file system could be causing this issue.

That might be the case; you could profile the data reading from your storage separately to get an idea of the max. throughput.
The shared file system sounds interesting. Is it a network drive? If so, this would most likely limit the loading speed.
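
A minimal sketch of such a storage-only profile, using just the standard library so that no image decoding is involved. The function name is hypothetical and `paths` is assumed to be a list of file paths you supply. Note that a second run over the same files may be served from the OS page cache and look much faster than the storage really is.

```python
import time

def measure_read_throughput(paths, clock=time.perf_counter):
    """Read every file in `paths` fully; return (files per second, MB per second).

    Only raw reads are timed (no decoding), so this approximates what the
    storage can deliver to a single sequential reader.
    """
    total_bytes = 0
    start = clock()
    for path in paths:
        with open(path, "rb") as f:
            total_bytes += len(f.read())
    elapsed = clock() - start
    if elapsed <= 0:
        raise ValueError("elapsed time too small to measure")
    return len(paths) / elapsed, total_bytes / elapsed / 1e6
```

Calling it on a few hundred training images gives a per-file rate that can be compared with the per-sample loading time seen during training.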

I have profiled the data reading with a plain Python for loop.
For brevity, only the loop section is shown.

	for i, db in enumerate(train_dataset.db):
		st = time.time()
		image_file = db['image']
		data_numpy = cv2.imread(image_file)   # flags, if any, were cut off in the post
		smplRd_tm.update(time.time() - st)    # record the read time of this image
		if i % 100 == 0:                      # print the timing every 100 images
			print('Time cost {tm.val:.6f}s ({tm.avg:.6f})'.format(tm=smplRd_tm))
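
The `smplRd_tm` object in the loop above is not defined in the post. A minimal sketch of a meter that would support the `.update()`, `.val` and `.avg` usage seen there could look like this; the class name and its internals are assumptions, not the poster's actual code.

```python
class AverageMeter:
    """Tracks the most recent value and the running average."""

    def __init__(self):
        self.val = 0.0    # most recent value
        self.sum = 0.0    # sum of all values seen so far
        self.count = 0    # number of values seen so far
        self.avg = 0.0    # running average

    def update(self, val, n=1):
        """Record `val`, optionally weighted as `n` identical observations."""
        self.val = val
        self.sum += val * n
        self.count += n
        self.avg = self.sum / self.count

smplRd_tm = AverageMeter()
```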

The images are read one after another without any further processing, so this roughly measures how fast each image is read in from the file system. What I get is:

Time cost 0.009754s (0.022841)
Time cost 0.015513s (0.022823)
Time cost 0.034945s (0.022812)
Time cost 0.012610s (0.022782)
Time cost 0.017248s (0.022775)
Time cost 0.030351s (0.022780)
Time cost 0.030921s (0.022800)

So each image read costs roughly 0.02 s.
I found that the bottleneck comes from saving debug images in the training loop.
I reduced the saving frequency and it is much better now.
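
A possible explanation for why this looked like data loading time: if the debug saving ran after the `end = time.time()` line in the real loop (an assumption, since the full loop is not shown), its cost would be counted in the next iteration's data loading time, and it would grow with the batch size if images are saved per sample. The sketch below times the three phases separately and throttles the debug output. `run_epoch`, `step_fn`, `debug_fn` and `debug_every` are hypothetical names.

```python
import time

def run_epoch(loader, step_fn, debug_fn, debug_every=100, clock=time.perf_counter):
    """Run one pass over `loader`, timing data wait, compute and debug output.

    step_fn(batch)  -- forward/backward work for one batch (hypothetical)
    debug_fn(batch) -- e.g. saving debug images (hypothetical)
    """
    totals = {"data": 0.0, "compute": 0.0, "debug": 0.0}
    end = clock()
    for step, batch in enumerate(loader):
        t0 = clock()
        totals["data"] += t0 - end       # only the wait for the next batch
        step_fn(batch)
        t1 = clock()
        totals["compute"] += t1 - t0
        if step % debug_every == 0:      # debug output only every N steps
            debug_fn(batch)
        end = clock()                    # taken after the debug work
        totals["debug"] += end - t1
    return totals
```

With the phases separated like this, an expensive `debug_fn` shows up under `"debug"` instead of inflating the `"data"` number.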