Num_workers questions

I tried the following:

| items | num_workers=1 | num_workers=2 | num_workers=4 | num_workers=8 |
|---|---|---|---|---|
| CPU | 10700K | 10700K | 10700K | 10700K |
| CPU RAM (GB) | 16 | 16 | 48 | 48 |
| CPU RAM used (GB) | 9.5 | 9.5 | 12 | CUDA out of memory |
| GPU | 3090 | 3090 | 3090 | 3090 |
| GPU RAM (GB) | 24 | 24 | 24 | 24 |
| GPU RAM used (GB) | 17.5 | 17.5 | 17.9 | CUDA out of memory |
| GPU power (W) | 257 | 357 | 357 | CUDA out of memory |
| image count | 10180 | 10180 | 10180 | 10180 |
| epochs | 100 | 100 | 100 | 100 |
| batch size | 32 | 32 | 32 | 32 |
| batch count | 319 | 319 | 319 | 319 |
| total training time (hours) | 15 | 6.459 | 6.05 | – |
| time per epoch (minutes) | 9 | 3.875 | 3.6333 | – |
| average time per batch (seconds) | 1.69 | 0.73 | 0.617 | – |

I have a few questions:

  1. Training with num_workers=2 is much faster than with num_workers=1, but there is only a small difference between num_workers=2 and num_workers=4.
  2. With num_workers=4 there is still plenty of CPU RAM and GPU RAM left, yet num_workers=8 does not work. This is very strange.
  3. num_workers assigns a few worker processes to load images into CPU RAM. If the batch size stays at 32, the CPU RAM needed should not change; however, more RAM is used with num_workers=4.

The following is the issue for num_workers=8:

Analyzing anchors... anchors/target = 5.86, Best Possible Recall (BPR) = 1.0000
Image sizes 640 train, 640 test
Using 4 dataloader workers
Logging results to runs\train\exp7
Starting training for 100 epochs...
     Epoch   gpu_mem       box       obj       cls     total   targets  img_size
  0%|          | 0/319 [00:00<?, ?it/s]Note: NumExpr detected 16 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 8.
NumExpr defaulting to 8 threads.
  0%|          | 0/319 [00:02<?, ?it/s]
Traceback (most recent call last):
  File "D:/code_python/har_hailiang/har_hdd/algo/train.py", line 288, in train
    pred = model(imgs)  # forward
  File "C:\ProgramData\Anaconda3\lib\site-packages\torch\nn\modules\module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "D:\code_python\har_hailiang\har_hdd\algo\models\yolo.py", line 122, in forward
    return self.forward_once(x, profile)  # single-scale inference, train
  File "D:\code_python\har_hailiang\har_hdd\algo\models\yolo.py", line 138, in forward_once
    x = m(x)  # run
  File "C:\ProgramData\Anaconda3\lib\site-packages\torch\nn\modules\module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "D:\code_python\har_hailiang\har_hdd\algo\models\common.py", line 35, in forward
    return self.act(self.bn(self.conv(x)))
  File "C:\ProgramData\Anaconda3\lib\site-packages\torch\nn\modules\module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "C:\ProgramData\Anaconda3\lib\site-packages\torch\nn\modules\conv.py", line 423, in forward
    return self._conv_forward(input, self.weight)
  File "C:\ProgramData\Anaconda3\lib\site-packages\torch\nn\modules\conv.py", line 419, in _conv_forward
    return F.conv2d(input, weight, self.bias, self.stride,
RuntimeError: CUDA out of memory. Tried to allocate 3.85 GiB (GPU 0; 24.00 GiB total capacity; 1.47 GiB already allocated; 20.35 GiB free; 1.54 GiB reserved in total by PyTorch)
python-BaseException

Why is there a CUDA out of memory issue when you increase num_workers? There shouldn’t be an OOM error when num_workers is increased, as it is not related to the GPU at all.

Can you check whether you are changing the batch size or the model input size when increasing num_workers?

@surya00060 ,
Thank you!
I have the same doubts as you do. However, I tried many times and encountered the same issue I pasted above.

I confirm: I have tried many times with only num_workers changed and no other code changes.

By the way, I was doing the above training on Windows 10, not Linux.
I expect the above issue could be easily reproduced.

Hi,
Could it be that you are doing some image transforms on the GPU? Transforms are usually performed by the dataloader workers, so if you move data to the GPU somewhere in your dataset code, increasing the number of workers will increase the number of dataloader replicas, and the allocated GPU memory will increase as well.
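
For illustration, here is a minimal sketch of that pattern with a hypothetical dataset (ExampleDataset is made up, not the actual YOLOv3 dataset); the point is where the `.cuda()` call lives:

```python
import torch
from torch.utils.data import Dataset, DataLoader

class ExampleDataset(Dataset):           # hypothetical dataset, for illustration only
    def __len__(self):
        return 64

    def __getitem__(self, idx):
        img = torch.rand(3, 640, 640)    # stands in for reading/augmenting one image
        # Anti-pattern: returning img.cuda() here would make every dataloader worker
        # create its own CUDA context and GPU buffers, so GPU memory usage grows
        # with num_workers.
        return img                       # keep samples on the CPU inside the dataset

if __name__ == "__main__":               # guard needed for multiprocessing workers on Windows
    loader = DataLoader(ExampleDataset(), batch_size=32, num_workers=2)
    for imgs in loader:
        imgs = imgs.cuda(non_blocking=True)   # move the whole batch to the GPU in the training loop
        break
```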

Having 2 workers is enough to prepare the data and fully utilize the GPU.

It is not clear from your data. Actually, the amount of used memory (both CPU and GPU) increases when you increase the number of workers. Also, the first two columns show 16 GB of CPU RAM and the last two show 48 GB. Are those two different machines?

Not exactly true. Workers prepare data in advance so that it can be pushed to the GPU as fast as possible and the GPU stays busy. All those workers need memory to hold the data they have read from disk and processed, so increasing the number of workers increases the memory footprint.
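
As a rough back-of-the-envelope illustration (the tensor size and the default prefetch_factor=2 are assumptions, not measurements from the setup above):

```python
# Back-of-the-envelope estimate of host RAM taken by prefetched batches alone.
# Recent PyTorch DataLoaders prefetch `prefetch_factor` batches per worker (2 by default).
bytes_per_image = 3 * 640 * 640 * 4        # float32 CHW tensor at 640x640
batch_size = 32
prefetch_factor = 2

for num_workers in (1, 2, 4, 8):
    prefetched = num_workers * prefetch_factor * batch_size * bytes_per_image
    print(f"num_workers={num_workers}: ~{prefetched / 1024**3:.2f} GiB of prefetched batches")

# On top of this, each worker is a separate process that imports the dataset code and
# holds its own image-decoding/augmentation buffers, which is often the bigger contribution.
```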

@Alexey_Demyanchuk ,
Thank you!

For the first two columns, the RAM is 16 GB. When I changed num_workers to 4, there was an OOM issue, so I added another 32 GB of RAM to my machine. So all four columns were run on the same machine.

Your explanation above is the only possibility we could think of.

What is your machine configuration?
What is your test data? For example, what is the time needed for each iteration/batch?

I don’t work on this task and always just use Google Colab. My explanation is just how I understand the topic from reading the documentation, other discussions, and my own experience.

As for debugging your issue, I would suggest checking your data pipeline to understand why it is so memory hungry. There might be an opportunity to optimize it. Also, it seems like 2 workers are actually enough for the task, as you can see that the overall training time doesn’t improve much when you increase the number of workers further.

@Alexey_Demyanchuk ,

Thank you!
I am training YOLOv3 with an input image size of 640×640.

I agree with you.

With 2 workers, the CPU and GPU might be fully used. However, I would like to know more details about how the GPU and CPU are used, and where the hardware bottleneck is.

In my case:
During inference, if I run a video through the network (forward only), the time needed is around 20 ms per frame.
During training, the time needed for one batch is around 0.73 s = 730 ms, which includes both the forward and the backward pass. If the forward pass takes 20 ms, the backward pass takes around 710 ms. 20 ms vs. 710 ms seems unreasonable for a powerful 3090 GPU. This is my major concern about the data in my table.
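
For reference, here is a minimal sketch of how forward and backward could be timed separately; the model here is a small placeholder, not the actual YOLOv3, and the synchronize calls matter because CUDA kernels run asynchronously:

```python
import time
import torch
import torch.nn as nn

# Small placeholder model and batch, not the actual YOLOv3 setup.
model = nn.Sequential(
    nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
    nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
).cuda()
imgs = torch.rand(32, 3, 640, 640, device="cuda")

torch.cuda.synchronize()                 # CUDA kernels run asynchronously, so
t0 = time.perf_counter()                 # synchronize before and after each phase
pred = model(imgs)                       # forward
torch.cuda.synchronize()
t1 = time.perf_counter()
pred.sum().backward()                    # backward (placeholder loss)
torch.cuda.synchronize()
t2 = time.perf_counter()

print(f"forward:  {(t1 - t0) * 1000:.1f} ms")
print(f"backward: {(t2 - t1) * 1000:.1f} ms")
```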

I would like to see your numbers. I think the ratio of forward to backward time on your machine should be similar to mine. Furthermore, the forward and backward times themselves should be comparable between your setup and mine.

I would like to elaborate a little on this. First, I don’t think it is as simple as 730 ms - 20 ms = 710 ms for the backward pass, unless those 730 ms measure only the GPU part of the work. Second, you almost always have two phases of work:

  1. You prepare the data (reading, resizing, and augmenting images happens here); this work is usually done by the CPU.
  2. You train the neural network (the forward and backward passes); this work is usually done by the GPU.

To measure the time spent on each of those tasks, you would be better off measuring the “data” time and the “model” time separately, as in the sketch below. These separate measurements will help you understand where most of the work actually takes place.
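
A minimal sketch of that split timing, with a tiny placeholder dataset/model/optimizer so it runs on its own; in practice you would reuse the real YOLOv3 dataset, model, loss, and optimizer from the training script:

```python
import time
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

if __name__ == "__main__":                            # guard needed for workers on Windows
    # Placeholder data/model/optimizer so the sketch runs on its own.
    dataset = TensorDataset(torch.rand(64, 3, 640, 640), torch.randint(0, 10, (64,)))
    loader = DataLoader(dataset, batch_size=32, num_workers=2)
    model = nn.Sequential(
        nn.Conv2d(3, 16, 3, stride=4), nn.ReLU(),
        nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(16, 10),
    ).cuda()
    criterion = nn.CrossEntropyLoss()
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

    data_time, model_time, n_batches = 0.0, 0.0, 0
    end = time.perf_counter()
    for imgs, targets in loader:
        data_time += time.perf_counter() - end        # time spent waiting for the next batch
        step_start = time.perf_counter()

        imgs = imgs.cuda(non_blocking=True)
        targets = targets.cuda(non_blocking=True)
        loss = criterion(model(imgs), targets)        # forward
        loss.backward()                               # backward
        optimizer.step()
        optimizer.zero_grad()
        torch.cuda.synchronize()                      # wait for the GPU so the timing is real

        model_time += time.perf_counter() - step_start
        n_batches += 1
        end = time.perf_counter()

    print(f"avg data time:  {data_time / n_batches * 1000:.1f} ms/batch")
    print(f"avg model time: {model_time / n_batches * 1000:.1f} ms/batch")
```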

Consider my example. I am training a regular convnet classifier for a Kaggle competition with a ResNet200d backbone. My setup is Google Colab with a V100 16 GB GPU, and I use an image size of 768x768. Training runs at ~17 images per second and validation at ~37 images per second. I am not measuring the “data” part separately, so I cannot tell you the exact amount of time spent on the GPU.

And you can only compare those numbers when GPU utilization is the same and data processing is not a bottleneck.

A valid way to measure the time used by the GPU would be to try it on synthetic data: create tensors in memory and measure the GPU time.
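
A minimal sketch of that idea, using random tensors instead of the dataloader (the model is a small placeholder, not the actual YOLOv3):

```python
import time
import torch
import torch.nn as nn

# Small placeholder model; in practice, build the real YOLOv3 model here.
model = nn.Sequential(
    nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
    nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
batch = torch.rand(32, 3, 640, 640, device="cuda")   # synthetic batch, no dataloader involved

# Warm-up step so one-time CUDA/allocator overhead is excluded from the measurement.
model(batch).sum().backward()
optimizer.step()
optimizer.zero_grad()
torch.cuda.synchronize()

n_iters = 20
t0 = time.perf_counter()
for _ in range(n_iters):
    loss = model(batch).sum()     # forward (placeholder loss)
    loss.backward()               # backward
    optimizer.step()
    optimizer.zero_grad()
torch.cuda.synchronize()
elapsed = time.perf_counter() - t0
print(f"GPU-only time per training step: {elapsed / n_iters * 1000:.1f} ms")
```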

@Alexey_Demyanchuk
Thank you!

I did another interesting experiment on another computer of mine:
CPU i9-9900K, 32 GB RAM, GPU 2080 Ti (11 GB).
I trained the same YOLOv3 code, but with the COCO dataset.
The batch size is 8, and there are 122218 images in total, split into 15278 batches. With num_workers=8 it trains successfully, and the training time per batch is around 0.28 s.

This experiment confirms my doubts:
  1. Training on my 3090 could be faster; there might be some hardware or software bottleneck.
  2. My 3090 computer has much more hardware resources than my 2080 Ti (11 GB) computer; however, on the 3090 computer num_workers can be set to 4 but not 8. There must be something I should explore.

It would be more reliable to compare those two different machines on the same task.

@Alexey_Demyanchuk ,
Thank you for your reply!

I did the following test:

| items | 2080 Ti machine | 3090 machine |
|---|---|---|
| num_workers | 8 | 8 |
| CPU | i9-9900K (8 cores, 16 threads) | i7-10700K (8 cores, 16 threads) |
| GPU | 2080 Ti (11 GB) | 3090 (24 GB) |
| CPU RAM used | 9.5 GB | unknown |
| GPU RAM used | 17.5 GB | unknown |
| epochs | 300 | 300 |
| dataset | COCO | COCO |
| image count | 122218 | 122218 |
| batch size | 8 | 8 |
| code | yolov3 | the same yolov3 |

On the 2080 Ti computer, the code trains correctly.
On the 3090 computer, once the training and validation data are loaded, the code does not make any progress. I clicked the “Pause Program” button in PyCharm and found that the code stops in the following code:

C:\ProgramData\Anaconda3\Lib\multiprocessing\connection.py:
    def _exhaustive_wait(handles, timeout):
        # Return ALL handles which are currently signalled.  (Only
        # returning the first signalled might create starvation issues.)
        L = list(handles)
        ready = []
        while L:
            res = _winapi.WaitForMultipleObjects(L, False, timeout)
            if res == WAIT_TIMEOUT:
                break
            elif WAIT_OBJECT_0 <= res < WAIT_OBJECT_0 + len(L):
                res -= WAIT_OBJECT_0
            elif WAIT_ABANDONED_0 <= res < WAIT_ABANDONED_0 + len(L):
                res -= WAIT_ABANDONED_0
            else:
                raise RuntimeError('Should not get here')
            ready.append(L[res])
            L = L[res+1:]
            timeout = 0
        return ready

It seems that the code is waiting on the worker processes?

Hi,

I am a bit lost, because at the beginning of the thread your code worked on the 3090 machine.
It is hard to find the problem, sorry.

@Alexey_Demyanchuk ,

In the beginning I did work on the 3090 machine; however, I couldn’t find a resolution for the issue, so I then tried it on the 2080 Ti machine.

After testing the code on the 2080 Ti, there is no issue on the 2080 Ti, but the issue still exists on the 3090 machine. It is the same issue as at the beginning.

Are you using the newest version of PyTorch on the 3090 machine? If not, maybe consider updating?