Tools to visualize/debug parallelism?

Are there any tools that can help to visualize and debug parallelism problems in my models?

For example:

  • I would love to see a chart showing when asynchronous kernels actually started and finished execution, whether there are any data dependencies between them, whether a kernel launch was delayed because of memory pressure, and so on.
  • I would like to make sure that my code is not accidentally causing CPU/GPU synchronization by converting the result of a GPU computation into a plain Python/NumPy type (or by performing any other operation that causes synchronization).
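To make the second point concrete, here is a minimal sketch (assuming PyTorch; the tensor sizes and the loss function are made up for illustration) of the kind of accidental synchronization I mean — keeping the result as a tensor lets the GPU run ahead, while converting it to a Python float forces the CPU to wait:

```python
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
pred = torch.randn(1024, device=device)
target = torch.randn(1024, device=device)

# A toy loss, computed entirely on the device.
loss = ((pred - target) ** 2).mean()

# Accumulating into a device tensor does NOT synchronize:
running = torch.zeros((), device=device)
running += loss  # stays on the device, launch is asynchronous on CUDA

# Converting to a plain Python number DOES synchronize:
value = loss.item()  # implicit device-to-host copy; CPU blocks until the kernel finishes
```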

See the "Automatic differentiation package – torch.autograd" page in the PyTorch master documentation: the profiler can export a Chrome trace that shows the CPU/GPU timeline, and `emit_nvtx` gives you a low-level CUDA timeline (viewable in NVIDIA's profiling tools).
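As a minimal sketch of the workflow (using the newer `torch.profiler` API; the model and input shapes are placeholders), you can record a few iterations and export a Chrome trace, then open it at `chrome://tracing`:

```python
import torch
from torch.profiler import profile, ProfilerActivity

# Profile CPU activity always, and CUDA kernels when a GPU is present.
activities = [ProfilerActivity.CPU]
if torch.cuda.is_available():
    activities.append(ProfilerActivity.CUDA)

model = torch.nn.Linear(128, 128)
x = torch.randn(32, 128)
if torch.cuda.is_available():
    model, x = model.cuda(), x.cuda()

with profile(activities=activities) as prof:
    for _ in range(5):
        y = model(x)

# Must be called after the context manager exits.
prof.export_chrome_trace("trace.json")  # load this file in chrome://tracing
```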

Thanks, I’ve exported a Chrome trace from the profiler, but I’m not sure how to interpret it. What are those “cpu_to_cuda” event lines? What are the “Outgoing flow” and “Preceding/Following events” entries that appear in the lower panel when I click on something?

Also, this does not seem to help with my second problem (unintentional synchronization).
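For the unintentional-synchronization problem specifically, recent PyTorch versions (1.10+) have `torch.cuda.set_sync_debug_mode`, which warns or errors on every implicitly synchronizing call. A minimal sketch (only meaningful on a CUDA machine, hence the guard):

```python
import torch

if torch.cuda.is_available():
    # "warn" prints a warning at each synchronizing call; "error" raises instead.
    torch.cuda.set_sync_debug_mode("warn")

    x = torch.randn(1000, device="cuda")
    idx = x.nonzero()   # result size must be known on the CPU -> synchronizes, warns
    s = float(x.sum())  # device-to-host copy of a scalar -> synchronizes, warns

    # Restore the default so later code is unaffected.
    torch.cuda.set_sync_debug_mode("default")
```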

I don’t know exactly how it links events; CPU “commands” are linked to the CUDA kernel launches they trigger, and dependencies between them are somehow reflected in those flow arrows.

Hm, you would normally see gaps in the GPU timeline if you have bad cross-device interactions.
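One way to see this effect without the profiler is to compare launch time against completion time: an asynchronous kernel launch returns almost immediately, and anything that forces a wait (a sync point, a `.item()` call) would show up in the trace as a CPU-side stall and a corresponding gap before the next kernel. A rough sketch, assuming a CUDA machine:

```python
import time
import torch

if torch.cuda.is_available():
    x = torch.randn(4096, 4096, device="cuda")

    # The launch returns quickly; the kernel keeps running in the background.
    t0 = time.perf_counter()
    y = x @ x
    launch_ms = (time.perf_counter() - t0) * 1e3

    # Explicitly waiting exposes the real kernel duration.
    t0 = time.perf_counter()
    torch.cuda.synchronize()
    wait_ms = (time.perf_counter() - t0) * 1e3

    print(f"launch returned in {launch_ms:.2f} ms, wait took {wait_ms:.2f} ms")
```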