I know the first batch can be very slow, but I ran into the following situation:
Read data: 1.64876580238
iter 60021 (epoch 8), train_loss = 2.436, time/batch = 0.396
Read data: 0.0616140365601
iter 60022 (epoch 8), train_loss = 2.617, time/batch = 0.098
Whenever reading the data takes a long time, the network time is also higher. The times are measured after synchronizing. My data loader uses multi-threaded loading.
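For reference, a minimal sketch of the timing pattern described above (measuring only after synchronizing, so queued asynchronous work is included). The `sync` hook is where something like `torch.cuda.synchronize` would go in a real GPU setup; the `read_batch` and `train_step` functions here are stand-ins I made up for illustration:

```python
import time

def timed(fn, sync=None):
    """Run fn() and return (result, elapsed seconds).

    sync() is called before starting and before stopping the clock, so
    any asynchronously queued work is counted. In a CUDA setup this
    would be torch.cuda.synchronize (hypothetical here).
    """
    if sync is not None:
        sync()
    start = time.perf_counter()
    result = fn()
    if sync is not None:
        sync()
    return result, time.perf_counter() - start

# Simulated stand-ins for data loading and a training step (assumptions).
def read_batch():
    time.sleep(0.01)            # pretend I/O
    return [0.0] * 8

def train_step(batch):
    time.sleep(0.02)            # pretend forward/backward
    return 2.4                  # pretend loss

batch, read_t = timed(read_batch)
loss, step_t = timed(lambda: train_step(batch))
print(f"Read data: {read_t:.3f}")
print(f"train_loss = {loss:.3f}, time/batch = {step_t:.3f}")
```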
Does anyone have any idea?
That’s a tough one. There are many possible reasons: a background process that sleeps most of the time but wakes up and consumes a lot of resources for a while, Python’s garbage collection, some kind of cleanup in the OS, or even the hardware. Does this happen to you very often?
It may be solved if I change my data loading strategy. (The read-data time is occasionally long because I’m periodically dumping tasks into the pool.) But every time the read-data time is long, the network also becomes slow, so I’m curious what causes that. Could heavy I/O slow down the network computation? I thought the network computation time was mostly determined by the GPU.
It is, but the CPU still needs its share of time to queue the kernels for the GPU. If the Python process is suspended, it can’t give the GPU any work to do.
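The effect can be illustrated with a toy producer/consumer analogy (not real CUDA): a worker thread plays the GPU draining a queue of "kernels", and the main thread plays the Python process that enqueues them. While the producer is stalled (e.g. blocked on I/O), the worker just sits idle:

```python
import queue
import threading
import time

work = queue.Queue()
idle_time = 0.0

def gpu_like_worker():
    """Drain the work queue, tracking how long we wait for new items."""
    global idle_time
    while True:
        start = time.perf_counter()
        item = work.get()       # blocks while nothing is queued
        idle_time += time.perf_counter() - start
        if item is None:        # sentinel: shut down
            break

t = threading.Thread(target=gpu_like_worker)
t.start()

work.put("kernel-1")
time.sleep(0.05)                # producer busy elsewhere (e.g. blocked on I/O)
work.put("kernel-2")            # the "GPU" sat idle this whole time
work.put(None)
t.join()
print(f"worker idle for ~{idle_time:.2f}s while the producer was busy")
```

The same thing happens in training: if data loading blocks the process that launches kernels, the measured "network" time grows even though the GPU itself is no slower.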