Training becomes slower gradually

Training gradually becomes slower as it runs. Does anyone have any idea what the issue could be? Here is what I have checked:

  1. Speed vs. iteration (plot attached).

  2. The size of the autograd graph is constant. I verified this by calling the following function on loss.grad_fn to check whether more nodes are added to the graph over time (see also the usage sketch after this list). The output is always 11.

    def calc_num_node_in_grad_fn(grad_fn):
        # Recursively count the nodes in the autograd graph rooted at grad_fn.
        result = 0
        if grad_fn is not None:
            result += 1
            if hasattr(grad_fn, 'next_functions'):
                # next_functions is a tuple of (Function, input_index) pairs.
                for f, _ in grad_fn.next_functions:
                    result += calc_num_node_in_grad_fn(f)
        return result
  3. Memory usage does not increase. nvidia-smi is called every 30 minutes, and the reported memory usage stays constant.

  4. PyTorch version: 1.4.
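For reference, here is a minimal sketch (not the original code) of how the two checks above could be run together inside the training loop; the logging function and interval are placeholders, and torch.cuda.memory_allocated() is used as a programmatic alternative to polling nvidia-smi (note that it reports memory occupied by tensors, not the total reserved by the caching allocator that nvidia-smi shows).

    import torch

    def log_graph_and_memory(loss, step, interval=100):
        # Periodically report the autograd graph size and the GPU memory
        # currently occupied by tensors, to confirm neither grows over time.
        if step % interval == 0:
            num_nodes = calc_num_node_in_grad_fn(loss.grad_fn)
            allocated_mb = torch.cuda.memory_allocated() / 1024 ** 2
            print(f'step {step}: graph nodes = {num_nodes}, '
                  f'allocated = {allocated_mb:.1f} MB')

This would be called right after the loss is computed, e.g. just before loss.backward().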

Thanks very much

No forward_hook or backward_hook is used. The network is Faster R-CNN on top of the EfficientDet backbone.

I checked the time cost of the forward pass, the backward pass, the parameter update, and data loading, and found that the forward time is what grows. The network is an EfficientDet backbone + Faster R-CNN. SyncBN is used. Drop-connect is also applied by masking the output as all zeros, which should have no dependency on previous iterations (a rough sketch of this masking follows the list below). The measured timings (plots attached):

  1. forward time

  2. backward time

  3. update time

  4. data loading time
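Regarding the drop-connect detail above, below is a rough sketch of the kind of masking described, i.e. zeroing a block's output with some probability; the function name and drop probability are illustrative only and not taken from the actual code.

    import torch

    def drop_connect(x, drop_prob=0.2, training=True):
        # With probability drop_prob, replace the block output with all zeros.
        # The mask depends only on the current iteration, so it carries no
        # state across iterations.
        if not training or drop_prob == 0.0:
            return x
        if torch.rand(1).item() < drop_prob:
            return torch.zeros_like(x)
        return x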

Is the forward pass or the loss calculation object-dependent, i.e. does the computation increase with each new candidate?
I’m not familiar with your code, but could the model be outputting more candidates over time, which would then have to be filtered out during the forward pass? The backward, update, and data loading times might just be constant plus noise.

Thanks for your reply. The network is Faster R-CNN, and the same happens with RetinaNet as well. Theoretically the computation should be consistent across iterations.

How did you profile the different parts of the code?
Could you post the code snippet used for the profiling, please?

I just call time.time() before model(feature) and again afterwards to capture the elapsed time, then print it out.

If you are using the GPU, you would need to synchronize the code before starting and stopping the timer via torch.cuda.synchronize(), since CUDA operations are executed asynchronously.
If you don’t manually synchronize, the next blocking operation will accumulate the timings from the previous ops, so that your results might give you the wrong bottlenecks.
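As an illustration of where the synchronization points go, here is a minimal sketch for timing the forward pass; model and feature stand in for the actual network and input from the snippet above.

    import time
    import torch

    # Make sure all previously queued CUDA kernels have finished, so the timer
    # does not start while older work is still running on the GPU.
    torch.cuda.synchronize()
    start = time.time()

    output = model(feature)

    # Wait for the forward pass itself to finish before stopping the timer;
    # without this, time.time() only measures the (fast) kernel launches.
    torch.cuda.synchronize()
    print(f'forward time: {time.time() - start:.4f} s')

The same pattern would apply to the backward pass and the optimizer step if they are timed separately.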
