The training becomes slower over time. Does anyone have an idea what the issue might be?
[plot: speed vs. iteration]
The grad graph size is consistent. This conclusion was drawn by calling the following function (with loss.grad_fn as the input argument) to check whether more nodes are being added to the graph. The output here is always 11.
def calc_num_node_in_grad_fn(grad_fn):
    result = 0
    if grad_fn is not None:
        result += 1
        if hasattr(grad_fn, 'next_functions'):
            # each entry is a (next_grad_fn, input_index) tuple
            for next_fn, _ in grad_fn.next_functions:
                result += calc_num_node_in_grad_fn(next_fn)
    return result
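For reference, a self-contained sketch of this counting approach on a tiny toy graph (the toy tensors are illustrative, not from the original post; note that this simple recursion counts a shared node once per path, so on graphs with shared nodes the total can overcount unless you track visited nodes):

import torch

def calc_num_node_in_grad_fn(grad_fn):
    # recursively count nodes reachable from grad_fn
    result = 0
    if grad_fn is not None:
        result += 1
        if hasattr(grad_fn, 'next_functions'):
            # each entry is a (next_grad_fn, input_index) tuple
            for next_fn, _ in grad_fn.next_functions:
                result += calc_num_node_in_grad_fn(next_fn)
    return result

# toy graph: sum -> mul -> leaf accumulation
x = torch.randn(3, requires_grad=True)
loss = (x * 2).sum()
num_nodes = calc_num_node_in_grad_fn(loss.grad_fn)
print(num_nodes)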
Memory usage is not increasing. nvidia-smi is called every 30 minutes, and the reported memory usage is always consistent.
PyTorch version: 1.4.
Thanks very much
No forward_hook or backward_hook is registered. The network is Faster R-CNN on top of EfficientDet's backbone.
I checked the time cost of the forward pass, backward pass, parameter update, and data loading, and found that the forward time becomes large. The network is based on EfficientDet's backbone + Faster R-CNN. SyncBN is used here. Drop-connect is also applied by masking the output with all zeros, which should have no dependency on earlier iterations.
[plot: data loading time]
Is the forward or loss calculation object-dependent, i.e. does the computation increase with each new candidate?
I’m not familiar with your code, but could the model be outputting more candidates over time, which would then have to be filtered out during the forward pass? The backward, update, and data loading times might just be constant plus noise.
Thanks for your reply. It happens with Faster R-CNN and with RetinaNet as well. The computation should theoretically be consistent across iterations.
How did you profile the different parts of the code?
Could you post the code snippet used for the profiling, please?
I just call time.time() before model(feature), call time.time() again afterwards to capture the elapsed time, and then print it out.
If you are using the GPU, you need to synchronize the code before starting and stopping the timer via torch.cuda.synchronize(), since CUDA operations are executed asynchronously.
If you don’t synchronize manually, the next blocking operation will accumulate the timings of the previous ops, so your results might point you at the wrong bottleneck.
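A minimal sketch of such a synchronized timing helper (timed_forward and the toy Linear model are hypothetical, used only for illustration; the synchronize calls are skipped on CPU-only machines):

import time
import torch

def timed_forward(model, inputs):
    # drain any pending CUDA kernels so the timer starts from an idle GPU
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    start = time.perf_counter()
    out = model(inputs)
    # wait for the forward kernels to finish before reading the clock
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    return out, time.perf_counter() - start

model = torch.nn.Linear(8, 4)
x = torch.randn(2, 8)
out, elapsed = timed_forward(model, x)

Without the second synchronize, model(inputs) merely queues kernels on the GPU and returns immediately, so the measured time would mostly reflect kernel launch overhead rather than the actual forward pass.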