How to set DataLoader parameters for complicated data augmentation?

I have to run a complicated data augmentation inside the DataLoader (similar to https://github.com/yjxiong/tsn-pytorch/blob/master/dataset.py).

trn_loader = DataLoader(train_data, batch_size=128, shuffle=True, pin_memory=True, num_workers=4)

As shown below, the time the CPU spends preparing data is much longer (about 20 times) than the GPU training time. The problem is that CPU usage is low, staying between 1% and 5%.

How should I change the DataLoader parameters to keep the CPUs running at high usage and save time?

Epoch: [1][36/170] TotalTime(s) 6.600 DataTime(s) 6.450
Epoch: [1][37/170] TotalTime(s) 0.162 DataTime(s) 0.000
Epoch: [1][38/170] TotalTime(s) 0.912 DataTime(s) 0.823
Epoch: [1][39/170] TotalTime(s) 0.101 DataTime(s) 0.000
Epoch: [1][40/170] TotalTime(s) 6.662 DataTime(s) 6.577
Epoch: [1][41/170] TotalTime(s) 0.106 DataTime(s) 0.000
Epoch: [1][42/170] TotalTime(s) 1.084 DataTime(s) 0.991
Epoch: [1][43/170] TotalTime(s) 0.106 DataTime(s) 0.000
Epoch: [1][44/170] TotalTime(s) 6.280 DataTime(s) 6.197
Epoch: [1][45/170] TotalTime(s) 0.104 DataTime(s) 0.000
Epoch: [1][46/170] TotalTime(s) 0.159 DataTime(s) 0.072
Epoch: [1][47/170] TotalTime(s) 3.550 DataTime(s) 3.454
Epoch: [1][48/170] TotalTime(s) 9.151 DataTime(s) 9.057
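
For reference, the timings above come from a training loop instrumented roughly like this (a minimal sketch; model, criterion and optimizer stand in for my actual training code):

import time
import torch

# Minimal sketch of how TotalTime / DataTime above are measured per iteration.
# model, criterion and optimizer are placeholders for the real training objects.
end = time.time()
for i, (inputs, targets) in enumerate(trn_loader):
    data_time = time.time() - end                 # time spent waiting for the next batch

    inputs = inputs.cuda(non_blocking=True)
    targets = targets.cuda(non_blocking=True)

    loss = criterion(model(inputs), targets)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    torch.cuda.synchronize()                      # include GPU work in the wall-clock time
    total_time = time.time() - end
    print('Epoch: [1][%d/%d] TotalTime(s) %.3f DataTime(s) %.3f'
          % (i, len(trn_loader), total_time, data_time))
    end = time.time()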

GPU: P100
CPU: 8 cores
RAM: 64GB
Data: 9GB
PyTorch: 0.4.1

PS: If I set num_workers=0, CPU usage is around 300% for the first 50 iterations. After that it slowly oscillates between 0% and 200%.

If your data augmentation relies heavily on PIL, you could try to install PIL-SIMD as a drop-in replacement, which should speed up your code.
Also, could you profile your code to see where the bottleneck in your data loading is?
If you are I/O bound, you could move the data onto an SSD if that's not already the case.
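
A quick way to check is to time the Dataset directly, outside of the DataLoader, e.g. with cProfile (just a sketch; train_data is your Dataset instance and the sample count is arbitrary):

import cProfile
import pstats

# Profile a handful of samples drawn directly from the Dataset so that the
# per-call cost of the augmentation shows up in the stats.
profiler = cProfile.Profile()
profiler.enable()
for idx in range(32):
    _ = train_data[idx]
profiler.disable()

pstats.Stats(profiler).sort_stats('cumulative').print_stats(20)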

Thank you for your reply. I have installed PIL-SIMD.

In my case, the bottleneck is neither PIL nor I/O. I am sorry for not mentioning this at first. The input is 9-channel data. I need to build 10 more channels from the 9 original ones, then send the 19-channel ndarray to the transform (only ToTensor and normalization). A rough sketch of what my __getitem__ does is shown after the list below.

There are two reasons why I do not write those new channels to disk before training:
1) It is a prototype. I am testing different new channels; sometimes I even change the number of new channels from one hour to the next.
2) I want to build an end-to-end model.
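
Roughly, this is the structure of my Dataset (a simplified sketch; derive_extra_channels is only a stand-in for the real channel computation, and transform is the ToTensor + Normalize step):

import numpy as np
from torch.utils.data import Dataset

def derive_extra_channels(x):
    # Stand-in for the real computation: build 10 extra channels from the
    # 9 original ones with plain NumPy (this is the CPU-bound part).
    diffs = x[1:] - x[:-1]                                # (8, H, W)
    mean = x.mean(axis=0, keepdims=True)                  # (1, H, W)
    return np.concatenate([diffs, mean, mean], axis=0)    # (10, H, W)

class MultiChannelDataset(Dataset):
    def __init__(self, samples, transform=None):
        self.samples = samples      # each sample is a (9, H, W) ndarray
        self.transform = transform  # only ToTensor + Normalize

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        x = self.samples[idx]                                       # (9, H, W)
        x = np.concatenate([x, derive_extra_channels(x)], axis=0)   # (19, H, W)
        if self.transform is not None:
            x = self.transform(x)
        return x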

15146 dpnt 20 0 19.186g 3.044g 1.237g S 57.2 1.2 17:49.63 python
9050 dpnt 20 0 19.186g 1.887g 83936 R 6.6 0.8 0:00.20 python
9051 dpnt 20 0 19.186g 1.887g 83936 R 6.2 0.8 0:00.19 python
9052 dpnt 20 0 19.186g 1.887g 83936 R 5.9 0.8 0:00.18 python
9053 dpnt 20 0 19.186g 1.887g 83936 R 5.9 0.8 0:00.18 python

Above is the top output while training. As you can see, the worker processes each use only about 6% of a CPU. Is there a way to force the workers to run at higher usage, like 50% or 100%?