Training with num_workers=2 is noticeably faster than with num_workers=1; however, there is little difference between num_workers=2 and 4.
With num_workers=4 there is still plenty of CPU RAM and GPU RAM left, yet num_workers=8 doesn't work. It is very strange.
num_workers spawns worker processes (not threads) that load images into CPU RAM. If the batch size stays at 32, I expected the CPU RAM needed not to change; however, more RAM is used for num_workers=4.
The following is the error for num_workers=8:
Analyzing anchors... anchors/target = 5.86, Best Possible Recall (BPR) = 1.0000
Image sizes 640 train, 640 test
Using 4 dataloader workers
Logging results to runs\train\exp7
Starting training for 100 epochs...
Epoch gpu_mem box obj cls total targets img_size
0%| | 0/319 [00:00<?, ?it/s]Note: NumExpr detected 16 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 8.
NumExpr defaulting to 8 threads.
0%| | 0/319 [00:02<?, ?it/s]
Traceback (most recent call last):
  File "D:/code_python/har_hailiang/har_hdd/algo/train.py", line 288, in train
    pred = model(imgs)  # forward
  File "C:\ProgramData\Anaconda3\lib\site-packages\torch\nn\modules\module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "D:\code_python\har_hailiang\har_hdd\algo\models\yolo.py", line 122, in forward
    return self.forward_once(x, profile)  # single-scale inference, train
  File "D:\code_python\har_hailiang\har_hdd\algo\models\yolo.py", line 138, in forward_once
    x = m(x)  # run
  File "C:\ProgramData\Anaconda3\lib\site-packages\torch\nn\modules\module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "D:\code_python\har_hailiang\har_hdd\algo\models\common.py", line 35, in forward
    return self.act(self.bn(self.conv(x)))
  File "C:\ProgramData\Anaconda3\lib\site-packages\torch\nn\modules\module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "C:\ProgramData\Anaconda3\lib\site-packages\torch\nn\modules\conv.py", line 423, in forward
    return self._conv_forward(input, self.weight)
  File "C:\ProgramData\Anaconda3\lib\site-packages\torch\nn\modules\conv.py", line 419, in _conv_forward
    return F.conv2d(input, weight, self.bias, self.stride,
RuntimeError: CUDA out of memory. Tried to allocate 3.85 GiB (GPU 0; 24.00 GiB total capacity; 1.47 GiB already allocated; 20.35 GiB free; 1.54 GiB reserved in total by PyTorch)
python-BaseException
Why is there a CUDA out-of-memory issue when you increase num_workers? There shouldn't be an OOM error when num_workers is increased, as it isn't related to the GPU at all.
Can you check whether you are changing the batch size or the model input when increasing num_workers?
Hi,
Could it be that you are doing some image transforms on the GPU? Transforms are usually performed by the dataloader workers, so if you move data to the GPU somewhere in your dataset code, increasing the number of workers will increase the number of dataloader replicas, and the allocated GPU memory will grow as well.
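To illustrate the point above, here is a minimal sketch (with a hypothetical ToyDataset, not the poster's actual code) showing where the device transfer should live. Moving samples to the GPU inside `__getitem__` would make every worker process allocate CUDA memory; the usual pattern is to keep workers on the CPU and transfer once per batch in the training loop.

```python
import torch
from torch.utils.data import Dataset, DataLoader

class ToyDataset(Dataset):
    """Hypothetical dataset illustrating where the device transfer belongs."""
    def __init__(self, n=64):
        self.data = torch.randn(n, 3, 32, 32)

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        x = self.data[idx]
        # Anti-pattern: `x = x.to('cuda')` here would make EVERY worker
        # process create its own CUDA context and GPU allocations.
        return x  # keep samples on the CPU inside the workers

loader = DataLoader(ToyDataset(), batch_size=8, num_workers=0)

for batch in loader:
    # Correct place for the transfer: once per batch, in the main process.
    # batch = batch.to('cuda', non_blocking=True)  # uncomment on a GPU machine
    assert batch.device.type == "cpu"
```

With the transfer in the loop, raising num_workers only adds CPU-side processes and should not touch GPU memory.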
It is enough to have 2 workers to prepare the data and fully utilize the GPU.
It is not clear from your data. Actually, the amount of used memory (both CPU and GPU) increases when you increase the number of workers. Also, the first two columns show 16 GB of CPU RAM and the last two show 48 GB. Are those two different machines?
Not exactly true. Workers prepare data in advance so it can be pushed to the GPU as fast as possible, keeping the GPU fully utilized. All those workers need memory to hold the data they have read from disk and processed, so increasing the number of workers increases the memory footprint.
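This can be put into rough numbers. By default a PyTorch DataLoader worker keeps up to prefetch_factor=2 batches in flight, so a back-of-the-envelope estimate of host RAM used by prefetching (ignoring any extra copies made by collate functions or augmentation, so this is a lower bound) looks like:

```python
def prefetch_ram_bytes(num_workers, batch_size, sample_bytes, prefetch_factor=2):
    """Rough host-RAM footprint of DataLoader prefetching: each worker keeps
    up to `prefetch_factor` batches in flight (PyTorch's default is 2)."""
    return num_workers * prefetch_factor * batch_size * sample_bytes

# A 640x640 RGB float32 image is 640 * 640 * 3 * 4 bytes, about 4.9 MB.
sample = 640 * 640 * 3 * 4
for w in (1, 2, 4, 8):
    print(f"num_workers={w}: ~{prefetch_ram_bytes(w, 32, sample) / 2**30:.2f} GiB")
```

The estimate scales linearly with num_workers, which matches the observation that RAM use grows as workers are added even though the batch size stays at 32.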
For the first two columns, the RAM is 16 GB. When I changed num_workers=4, there was an OOM issue, so I added another 32 GB of RAM to my machine. All four columns were therefore run on the same machine.
Your explanation above is the only possibility we could think of.
What is your machine configuration?
What is your test data? For example, what is the time needed for each iteration/batch?
I don’t work on this task and always use Google Colab. My explanation is just how I understand the topic from reading the documentation, other discussions, and my own experience.
As for debugging your issue, I would suggest checking your data pipeline to understand why it is so memory hungry; there may be an opportunity to optimize it. Also, it seems that 2 workers are actually enough for the task, since the overall training time doesn’t improve much when the number of workers is increased further.
Thank you!
I am training YOLOv3 with an input image size of 640×640.
I agree with you.
With 2 workers, the CPU and GPU might be fully used. However, I would like to know more details about how the GPU and CPU are used, and where the hardware bottleneck is.
In my case:
during inference, running a video through the network (forward only) takes around 20 ms per frame;
during training, one batch takes around 0.73 s = 730 ms, which includes both the forward and the backward pass. If the forward pass takes 20 ms, the backward pass would take around 710 ms. 20 ms vs 710 ms seems unreasonable for a powerful 3090 GPU. This is my major concern about the data in my table.
I would like to see your numbers. I think the ratio of forward to backward time on your machine should be similar to mine, and the absolute forward and backward times should also be comparable between our machines.
I would like to elaborate a little on this. First, I don’t think it is as simple as 730 ms - 20 ms = 710 ms for the backward pass, unless those 730 ms measure only the GPU part of the work. Second, you almost always have two phases of work:
1. preparing the data (reading, resizing, and augmenting images), usually done by the CPU;
2. training the neural network (the forward and backward passes), usually done by the GPU.
To measure the time spent on these tasks, you would be better off measuring the "data" time and the "model" time separately. These separate measurements will help you understand where most of the work actually takes place.
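A minimal sketch of such a split measurement (the `timed_epoch` helper and `step_fn` callback are illustrative names, not part of any framework): time spent waiting on the loader iterator is "data" time, and time spent inside the training step is "model" time.

```python
import time

def timed_epoch(loader, step_fn):
    """Split wall time into 'data' (waiting on the loader) and 'model' (step_fn)."""
    data_t = model_t = 0.0
    t0 = time.perf_counter()
    for batch in loader:           # time spent blocking here is data-pipeline time
        t1 = time.perf_counter()
        data_t += t1 - t0
        step_fn(batch)             # forward/backward; call torch.cuda.synchronize()
                                   # inside step_fn if the step runs on a GPU
        t0 = time.perf_counter()
        model_t += t0 - t1
    return data_t, model_t

# Toy usage: a plain list stands in for the DataLoader, sleep for the model step.
data_t, model_t = timed_epoch([1, 2, 3], lambda b: time.sleep(0.01))
```

If `data_t` dominates, the pipeline (and thus num_workers) is the bottleneck; if `model_t` dominates, adding workers will not help.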
Consider my example. I am training a regular convnet classifier for a Kaggle competition with a ResNet200d backbone. My setup is a Google Colab V100 16 GB GPU, and I use an image size of 768×768. Training runs at ~17 images per second and validation at ~37 images per second. I am not measuring the "data" part separately, so I cannot tell you the exact amount of time spent on the GPU.
And you can only compare those numbers when GPU utilization is the same and data processing is not a bottleneck.
A valid way to measure the time used by the GPU is to try it on synthetic data: create tensors in memory and measure the GPU time.
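A sketch of such a synthetic-data benchmark (using a small stand-in convolution rather than the actual YOLOv3 model, and falling back to CPU when no GPU is present). Note the `torch.cuda.synchronize()` calls: CUDA kernels launch asynchronously, so timing without synchronizing measures only the launch, not the work.

```python
import time
import torch
import torch.nn as nn

device = "cuda" if torch.cuda.is_available() else "cpu"
model = nn.Conv2d(3, 16, 3, padding=1).to(device)   # stand-in for the real network
x = torch.randn(8, 3, 64, 64, device=device)        # synthetic batch, no disk I/O

def bench(fn, iters=10):
    """Average wall time of fn(), with CUDA synchronization around the loop."""
    if device == "cuda":
        torch.cuda.synchronize()
    t0 = time.perf_counter()
    for _ in range(iters):
        fn()
    if device == "cuda":
        torch.cuda.synchronize()
    return (time.perf_counter() - t0) / iters

fwd = bench(lambda: model(x))                       # forward only

def fwd_bwd():
    model.zero_grad()
    model(x).sum().backward()                       # forward + backward

total = bench(fwd_bwd)
print(f"forward {fwd * 1e3:.2f} ms, forward+backward {total * 1e3:.2f} ms")
```

Comparing these two numbers on synthetic tensors isolates the GPU cost of the backward pass from any data-pipeline effects.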
I did another interesting experiment on another computer:
CPU i9-9900K, 32 GB RAM, GPU 2080 Ti (11 GB).
I trained the same YOLOv3 code, but on the COCO dataset.
The batch size is 8, and there are 122,218 images in total, split into 15,278 batches. With num_workers=8, it trains successfully, and the training time per batch is around 0.28 s.
This experiment confirms my doubts:
the training speed on my 3090 could be faster; there might be some hardware or software bottleneck;
my 3090 computer has much more hardware resource than my 2080 Ti (11 GB) computer, yet on the 3090 machine num_workers can be set to 4 but not 8; there must be something I should explore;
on the 2080 Ti computer, the code trains correctly.
On the 3090 computer, once the training and validation data are loaded, the code does not make progress. I clicked the "Pause Program" button in PyCharm and found that the code stops in the following function:
C:\ProgramData\Anaconda3\Lib\multiprocessing\connection.py:
def _exhaustive_wait(handles, timeout):
    # Return ALL handles which are currently signalled.  (Only
    # returning the first signalled might create starvation issues.)
    L = list(handles)
    ready = []
    while L:
        res = _winapi.WaitForMultipleObjects(L, False, timeout)
        if res == WAIT_TIMEOUT:
            break
        elif WAIT_OBJECT_0 <= res < WAIT_OBJECT_0 + len(L):
            res -= WAIT_OBJECT_0
        elif WAIT_ABANDONED_0 <= res < WAIT_ABANDONED_0 + len(L):
            res -= WAIT_ABANDONED_0
        else:
            raise RuntimeError('Should not get here')
        ready.append(L[res])
        L = L[res+1:]
        timeout = 0
    return ready