I need to profile the backward pass of a model running on a GPU, and see how much time each layer's gradient computation took, along with the achieved TFLOPs during the operation. The problem is that if I use a profiler such as Nsight Systems, I cannot tell which kernel ran for which layer, because I cannot annotate the backward pass using NVTX. Is there some way the backward pass can be profiled?
autograd.profiler should give you the runtime for the backward functions. If you spot a bottleneck, you could run Nsight Systems in isolation on this particular module.
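As a minimal sketch of that approach (the model and tensor sizes here are made up for illustration), `torch.autograd.profiler` records each backward function as its own event, so per-layer gradient timings show up as separate rows; on a GPU you would additionally enable CUDA timing (e.g. via the newer `torch.profiler`) to get device-side times:

```python
import torch

# Toy model; layer sizes are arbitrary for illustration.
model = torch.nn.Sequential(
    torch.nn.Linear(64, 64),
    torch.nn.ReLU(),
    torch.nn.Linear(64, 10),
)
x = torch.randn(32, 64)

# Record both forward and backward ops.
with torch.autograd.profiler.profile() as prof:
    out = model(x)
    out.sum().backward()

# Backward nodes appear as their own entries (e.g. AddmmBackward0),
# so they can be attributed to individual layers.
print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=10))
```

The table lists each op's self and total time, which is usually enough to spot which layer's backward dominates before dropping down to Nsight Systems.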
Hi, I have the same issue profiling the backward pass of each layer. Can you give me some hints for solving this problem? Thanks for any code or suggestions.
If you just want to profile the backward pass and get its runtime, this code snippet might be helpful:
```python
import time

import torch


def profile(module, input):
    # Warmup: run the forward and backward passes a few times
    # so CUDA kernels are compiled and caches are warm.
    for _ in range(50):
        output = module(input)
    g0 = torch.rand_like(output)
    for _ in range(50):
        output = module(input)
        output.backward(g0)

    nb_iters = 100

    # Profile forward pass
    torch.cuda.synchronize()
    start = time.time()
    for _ in range(nb_iters):
        output = module(input)
    torch.cuda.synchronize()
    end = time.time()
    fwd_time = (end - start) / nb_iters

    # Profile forward + backward pass
    torch.cuda.synchronize()
    start = time.time()
    for _ in range(nb_iters):
        output = module(input)
        # Reset the gradient so backward doesn't accumulate
        # (assumes the module has a .weight parameter).
        module.weight.grad = None
        output.backward(g0)
    torch.cuda.synchronize()
    end = time.time()
    all_time = (end - start) / nb_iters

    # Backward time is the combined time minus the forward-only time.
    bwd_time = all_time - fwd_time
    return fwd_time, bwd_time
```