Using PyTorch profiler

mahmoodn · December 17, 2025, 2:44pm

Hi,

I have used the profiler in the top-level file like this:

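    # Imports used by the snippet (from torch.profiler)
    from torch.profiler import profile, record_function, ProfilerActivity

    # Start sut before loadgen starts
    with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
                 record_shapes=True,
                 profile_memory=True,
                 with_stack=True) as prof:
        with record_function("llama_inference_run"):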
            sut.start()
            lgSUT = lg.ConstructSUT(sut.issue_queries, sut.flush_queries)
            log.info("Starting Benchmark run")
            lg.StartTestWithLogSettings(
                lgSUT,
                sut.qsl,
                settings,
                log_settings,
                args.audit_conf)
    prof.export_chrome_trace("llama3_trace3.json")
    # Stop sut after completion
    sut.stop()

Note that I have specified profile_memory and with_stack for more in-depth analysis. I opened the output file on the perfetto.dev website, and the figure shows information about a CUDA kernel.

Additionally, the Python process has three threads.

The call stack looks a bit weird to me. I still don’t understand how that CUDA kernel gets called. The file names in the stack, such as subprocess and threading, are too high level. Or maybe I am misunderstanding the call stack.

Additionally, I don’t know where the result of profile_memory appears in the figure or in the output file.
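One check I can think of is to scan the exported JSON directly instead of relying on the Perfetto view. The sketch below is only a guess at the format: it assumes memory allocations are written as trace events named "[memory]" with a "Total Allocated" field under args, which I have not confirmed for my trace. The helper names load_trace and summarize_memory_events are made up for this example.

```python
import json


def load_trace(path):
    """Load a Chrome-trace JSON file from disk."""
    with open(path) as f:
        return json.load(f)


def summarize_memory_events(trace, name="[memory]"):
    """Count events called `name` and return the largest 'Total Allocated' seen.

    ASSUMPTION: the event name "[memory]" and the args key "Total Allocated"
    are what I expect the profiler to emit; adjust them if the trace differs.
    """
    # A Chrome trace is either a bare list of events or a dict with "traceEvents".
    events = trace.get("traceEvents", []) if isinstance(trace, dict) else trace
    count = 0
    peak_allocated = None
    for ev in events:
        if ev.get("name") != name:
            continue
        count += 1
        total = ev.get("args", {}).get("Total Allocated")
        if isinstance(total, (int, float)):
            if peak_allocated is None or total > peak_allocated:
                peak_allocated = total
    return count, peak_allocated
```

Calling summarize_memory_events(load_trace("llama3_trace3.json")) should return a count of zero if no such events exist, which would at least tell me whether the memory data made it into the file at all.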

Any thoughts on that?