Program blocks when DataLoader num_workers > 0

I am using the latest version of YOLOv5 to train on the COCO images. Training runs successfully with num_workers = 0, but with num_workers > 0 the program blocks and spends a long time acquiring data. If I train on only one GPU, there is no problem even with num_workers > 0.

I train on one machine with four Titan Xp GPUs, using PyTorch 1.7.1, CUDA 10.1, and Python 3.8.5 in an Anaconda virtual environment.

I have this problem not only in the YOLOv5 project; it sometimes also happens in other projects. The position where it blocks is shown in the figure below:
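For anyone trying to reproduce this, here is a minimal sketch of how the blocking position could be captured as text rather than a screenshot. It uses only Python's standard `faulthandler` module. The helper names `watch_for_hang` and `stop_watching` are hypothetical, not part of YOLOv5 or PyTorch. It also only dumps the stacks of the process that calls it, so the DataLoader worker processes would not be covered.

```python
import faulthandler
import sys


def watch_for_hang(timeout_s, out=None):
    """Arm a watchdog (hypothetical helper): if the process is still running
    after `timeout_s` seconds, dump the stack of every thread to `out`.

    `out` must be a real file with a file descriptor; defaults to sys.stderr.
    """
    if out is None:
        out = sys.stderr
    # exit=False: only print the tracebacks, do not kill the process.
    faulthandler.dump_traceback_later(timeout_s, repeat=False, file=out, exit=False)


def stop_watching():
    """Disarm the watchdog, e.g. once the training loop has made progress."""
    faulthandler.cancel_dump_traceback_later()
```

A possible use would be to call `watch_for_hang(300)` just before the training loop and `stop_watching()` after the first few batches have been fetched. If the data loading blocks, the traceback printed to stderr shows which call the main process is waiting in.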

Does anyone have the same problem? Thanks a lot.

cc @VitalyFedyunin re: DataLoader