I realize that to some extent this comes down to experimentation, but are there any general guidelines on how to choose the num_workers for a DataLoader object? Should num_workers be equal to the batch size? Or the number of CPU cores in my machine? Or to the number of GPUs in my data-parallelized model? Is there a tradeoff with using more workers due to overhead? Also, is there ever a reason to leave num_workers as 0 instead of setting it at least to 1?
Having more workers will increase the memory usage and that’s the most serious overhead. I’d just experiment and launch approximately as many as are needed to saturate the training. It depends on the batch size, but I wouldn’t set it to the same number - each worker loads a single batch and returns it only once it’s ready.
num_workers=0 means the main process does the data loading itself whenever a batch is needed; num_workers=1 behaves the same as any n > 0, but you'll only have a single worker, so it might be slow.
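To illustrate the difference, a minimal sketch with toy tensors (the dataset, shapes, and batch size here are made up for the example):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Toy dataset: 100 samples of 10 features each.
dataset = TensorDataset(torch.randn(100, 10), torch.randint(0, 2, (100,)))

# num_workers=0: each batch is assembled in the main process,
# inline with (and blocking) the training loop.
loader_inline = DataLoader(dataset, batch_size=25, num_workers=0)

# num_workers=2: two worker subprocesses prefetch batches in the
# background while the main process trains.
loader_prefetch = DataLoader(dataset, batch_size=25, num_workers=2)
```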
I use multiple subprocesses to load data (num_workers=8), and as epochs go on I notice that the RAM (but not GPU) memory usage increases.
I thought maybe I could kill the subprocesses after a few epochs and then spawn new ones to continue training, but I don't know how to kill the subprocesses from the main process.
When I set num_workers=0, the RAM (but not GPU) memory usage remains stable as epochs increase.
Can you give me some suggestions or instructions about this problem?
Thank you so much.
Are you sure that memory usage is the most serious overhead? What about IO usage?
Setting too many workers might cause seriously high IO usage, which can become very inefficient.
I would love to get your advice about the recommended way to deal with my data - I feed my CNN with large batches (256/512/1024…) of small 50x50 patches. I intend to use an ImageFolder dataset with a DataLoader for that, but I'm afraid it would be very inefficient to load lots of small images from disk at high frequency.
I experimented with this a bit. I found that we should use the formula:
num_worker = 4 * num_GPU
Factors of 2 and 8 also work well, but a lower factor (< 2) significantly reduces overall performance. The number of workers has no impact on GPU memory allocation. Also, machines nowadays have many CPU cores and few GPUs (< 8), so the above formula is practical.
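A sketch of that heuristic as a helper function (num_GPU would come from torch.cuda.device_count() in PyTorch; capping at the logical CPU count is my own addition, since launching more workers than cores rarely helps):

```python
import os

def suggested_num_workers(num_gpus, factor=4, cpu_count=None):
    """Heuristic from the post above: num_workers = factor * num_GPU,
    capped at the number of logical CPUs, with a floor of 1."""
    if cpu_count is None:
        cpu_count = os.cpu_count() or 1
    return max(1, min(factor * num_gpus, cpu_count))

print(suggested_num_workers(2, cpu_count=32))  # 8
```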
Is it right to estimate this from data throughput? The data loaded by the CPU per unit time should match the data processed by the GPU per unit time:
entry_KB * batch_size * num_worker = num_GPU * GPU_throughput
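Solving that balance equation for num_worker, as a sketch (all the throughput numbers in the test values are made up for illustration):

```python
import math

def balanced_num_workers(entry_kb, batch_size, num_gpu, gpu_throughput_kb):
    """Solve entry_KB * batch_size * num_worker = num_GPU * GPU_throughput
    for num_worker, rounding up so loading keeps pace with the GPU."""
    return math.ceil(num_gpu * gpu_throughput_kb / (entry_kb * batch_size))
```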
If the dataset is small, like CIFAR-10, why doesn't the whole dataset stay on the GPU the whole time? Why would the number of workers matter?
The more data you put into the GPU memory, the less memory is available for the model. If your model and data are small, it shouldn't be a problem. Otherwise I would rather use the DataLoader to load and push the samples onto the GPU than make my model smaller.
For example, if one worker takes 1.5s to load a single batch and one GPU iteration takes 0.5s, would 3 workers be optimal in your opinion?
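Under the simple model where each worker needs t_load seconds per batch and the GPU consumes a batch every t_gpu seconds, you'd need roughly ceil(t_load / t_gpu) staggered workers to keep the GPU fed, which is 3 for these numbers (a back-of-envelope estimate, not a guarantee):

```python
import math

def workers_to_saturate(t_load, t_gpu):
    """Workers needed so one batch is ready every t_gpu seconds on
    average, assuming each worker takes t_load seconds per batch."""
    return math.ceil(t_load / t_gpu)

print(workers_to_saturate(1.5, 0.5))  # 3
```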
Are you saying that if the data and model are both small the dataloader class isn’t the right thing to use?
I don’t think it’s ever possible to tell if it’s optimal… just try things, and once it stops improving, use that.
If your dataset is really small and you don’t need batching, you can just push the data onto the GPU and simply apply your training procedure.
However, since I like the concept of a DataLoader, I would still use one in such a use case, just to be able to easily extend the dataset and use batching, shuffling, etc.
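A minimal sketch of that all-on-GPU approach, with manual shuffling and batching instead of a DataLoader (toy data; the shapes and batch size are arbitrary):

```python
import torch

# Tiny dataset: move everything to the GPU once, if one is available.
device = "cuda" if torch.cuda.is_available() else "cpu"
X = torch.randn(1000, 10, device=device)
y = torch.randint(0, 2, (1000,), device=device)

for epoch in range(2):
    # Shuffle by permuting indices on-device each epoch.
    perm = torch.randperm(len(X), device=device)
    for i in range(0, len(X), 256):
        idx = perm[i:i + 256]
        xb, yb = X[idx], y[idx]  # batching + shuffling, no DataLoader
        # forward/backward step would go here
```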
You could check how many CPUs and cores you have with lscpu if you want an initial guess without doing benchmarking…
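If you'd rather stay in Python than parse lscpu output, the standard library exposes the same initial guess (os.sched_getaffinity is Linux-only, hence the fallback):

```python
import os

# All logical CPUs on the machine.
logical = os.cpu_count() or 1

try:
    # CPUs this process may actually use (e.g. under cgroups or
    # taskset); Linux-only, so fall back to the logical count.
    usable = len(os.sched_getaffinity(0))
except AttributeError:
    usable = logical

print(logical, usable)
```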
As far as I know: num_workers is the subprocess count. If pin_memory is not True, the workers only increase CPU (host) RAM usage, not GPU memory. If pin_memory is True, GPU memory usage would increase as well.
Correct me if you have a different opinion.
I’m not sure about the increase in GPU memory.
As I understand it, pinned memory is used as a staging area on the host (CPU) side.
With pin_memory=True, the data is copied directly into pinned memory and from there to the GPU.
With pin_memory=False, the data is allocated in pageable memory, transferred to pinned memory, and then to the GPU.
See the NVIDIA devblog on pinned memory.
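As a sketch of how these options are typically combined (toy data; pin_memory only pays off when a GPU is present, so it's gated on availability here):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

use_cuda = torch.cuda.is_available()
device = torch.device("cuda" if use_cuda else "cpu")

# Toy stand-in for a real dataset: 64 images of 3x50x50.
dataset = TensorDataset(torch.randn(64, 3, 50, 50), torch.randint(0, 10, (64,)))

loader = DataLoader(
    dataset,
    batch_size=16,
    num_workers=2,        # two loader subprocesses (on Windows/macOS, wrap
                          # the iteration in `if __name__ == "__main__":`)
    pin_memory=use_cuda,  # stage batches in page-locked host memory
)

for x, y in loader:
    # With a pinned source, non_blocking=True lets the host-to-device
    # copy overlap with GPU compute.
    x = x.to(device, non_blocking=True)
    y = y.to(device, non_blocking=True)
```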
Just wanted to mention something I noticed: setting num_workers=1 gave me a “cuda runtime error (2): out of memory” exception, and increasing it helped.
Also, for some unknown reason, I noticed that increasing num_workers gives me NaN in my loss.
Has anyone met the situation where setting num_workers=4 makes training stop? Recently I tested an RFBNet project and found that when I set num_workers=4, training stops at epoch 2. However, num_workers=0 works fine.
Not sure what the reason is, but I quite often get a MemoryError exception when using num_workers != 0. Could somebody describe how this process usually works? Does it copy the dataset instance (including all its properties) into each subprocess? Or does it use threads?
It seems that during training the amount of free RAM keeps decreasing. I am using a custom dataset that generates images from strokes (Quick Draw Doodles data), and probably the problem is that the dataset doesn’t work well in a multiprocessing setting. Could somebody give me advice on how to implement a multiprocessing-ready dataset?
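One frequent cause of this symptom (an assumption on my part, not a diagnosis of your case) is holding the samples in large Python lists: with fork-based workers, Python's reference counting writes to the shared pages and triggers copy-on-write, so RAM appears to grow every epoch. Keeping the raw data in a single numpy array and rasterizing per item in __getitem__ avoids that. A sketch with a toy rasterizer (the 50x50 canvas and the point-scatter "rendering" are placeholders for real stroke drawing):

```python
import numpy as np
import torch
from torch.utils.data import Dataset

class StrokeDataset(Dataset):
    """Keep raw strokes in one numpy array (not Python lists) so
    fork-based DataLoader workers don't balloon RAM via copy-on-write
    on refcounted Python objects; build each image lazily per item."""

    def __init__(self, strokes):
        # strokes: float array of shape (N, points_per_sample, 2),
        # with coordinates normalized to [0, 1].
        self.strokes = np.asarray(strokes, dtype=np.float32)

    def __len__(self):
        return len(self.strokes)

    def __getitem__(self, idx):
        pts = self.strokes[idx]
        # Toy "rasterization": scatter the points into a 50x50 canvas.
        img = np.zeros((50, 50), dtype=np.float32)
        xy = np.clip((pts * 49).astype(int), 0, 49)
        img[xy[:, 1], xy[:, 0]] = 1.0
        return torch.from_numpy(img)
```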
What’s num_GPU? How do I get it on Google Colab?