How can I profile the backward pass of a model?

I need to profile the backward pass of a model running on a GPU. I want to see how much time each layer's gradient computation takes, along with the achieved TFLOPs during the operation. The problem is that if I use a profiler such as Nsight Systems, I cannot tell which kernel ran for which layer, because I cannot annotate the backward pass using NVTX. Is there some way in which the backward pass can be profiled?
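For reference, this is roughly what I am doing now (the model and shapes are just placeholders): the forward pass of each layer can be wrapped in an NVTX range, but the backward kernels are launched later by autograd, outside of any range I push, so Nsight Systems cannot attribute them to a layer.

import torch
import torch.nn as nn

model = nn.Linear(1024, 1024).cuda()
x = torch.randn(64, 1024, device="cuda")

# The forward pass can be annotated per layer ...
torch.cuda.nvtx.range_push("linear_forward")
out = model(x)
torch.cuda.nvtx.range_pop()

# ... but the backward kernels run here, outside the range pushed above,
# so they show up unannotated in the Nsight Systems timeline.
out.sum().backward()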


autograd.profiler should give you the runtime for the backward functions. If you spot a bottleneck, you could run Nsight Systems in isolation on this particular backward call.
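For example (a minimal sketch; the model and shapes are arbitrary), torch.autograd.profiler.profile records the backward ops under their own names, so their CUDA time can be matched to the corresponding layers:

import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(1024, 4096), nn.ReLU(), nn.Linear(4096, 1024)).cuda()
x = torch.randn(64, 1024, device="cuda")

with torch.autograd.profiler.profile(use_cuda=True) as prof:
    out = model(x)
    out.sum().backward()

# Backward ops appear with "Backward" in their names, so their CUDA time
# can be attributed to the layers that produced them.
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=20))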

Hi, I have the same issue with profiling the backward pass of each layer. Can you give me some hints on how to solve this problem? Thanks for any code or suggestions.

If you just want to profile the backward pass of a layer and get its runtime, this code snippet might be helpful:

import time

import torch


def profile(module, input):
    # Warmup: forward only
    for _ in range(50):
        output = module(input)

    # Warmup: forward + backward with a fixed upstream gradient
    g0 = torch.rand_like(output)
    for _ in range(50):
        output = module(input)
        output.backward(g0)

    nb_iters = 100

    # Time the forward pass alone
    torch.cuda.synchronize()
    start = time.time()
    for _ in range(nb_iters):
        output = module(input)
    torch.cuda.synchronize()
    end = time.time()
    fwd_time = (end - start) / nb_iters

    # Time forward + backward, then subtract the forward time
    torch.cuda.synchronize()
    start = time.time()
    for _ in range(nb_iters):
        output = module(input)
        # Reset the weight gradient so backward does not accumulate into it
        module.weight.grad = None
        output.backward(g0)
    torch.cuda.synchronize()
    end = time.time()
    all_time = (end - start) / nb_iters
    bwd_time = all_time - fwd_time

    return fwd_time, bwd_time
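The helper above could then be used on a single layer like this (the layer and shapes are arbitrary; note that it assumes the module has a .weight parameter and that the input itself does not require grad):

import torch
import torch.nn as nn

layer = nn.Linear(1024, 4096).cuda()
x = torch.randn(64, 1024, device="cuda")

fwd, bwd = profile(layer, x)
print(f"forward: {fwd * 1e3:.3f} ms, backward: {bwd * 1e3:.3f} ms")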