Scaling of the PyTorch profiler with a large number of nodes

Hey there,

I’m using the PyTorch profiler (stepping it with profiler.step()) to analyze my code for several fairly large models on 1, 2, 4, 8, and 16 nodes with 8 GPUs each. I export the traces to JSON files and evaluate them with my own parser and additionally with TensorBoard.
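
Roughly, my setup looks like this (a simplified sketch of the idea; the schedule values, file names, and the train_step/train_loader names are placeholders, not my exact code):

```python
import os
import torch
from torch.profiler import profile, schedule, ProfilerActivity

rank = int(os.environ.get("RANK", 0))

# Profile only a few steps per cycle so each trace stays manageable;
# these schedule values are placeholders, not my actual settings.
prof_schedule = schedule(wait=1, warmup=1, active=3, repeat=1)

def save_trace(prof):
    # One JSON trace per rank/GPU, as described above.
    prof.export_chrome_trace(f"trace_rank{rank}.json")

with profile(
    activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
    schedule=prof_schedule,
    on_trace_ready=save_trace,
) as prof:
    for step, batch in enumerate(train_loader):  # train_loader: placeholder
        train_step(batch)                        # placeholder for the real training step
        prof.step()                              # advance the profiler schedule
```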

I’m seeing strange behaviour and exploding runtimes with 8 and especially 16 nodes. The profiler stores one trace per GPU, so with 64 GPUs I end up with 64 JSON files, and I average the times across them. I run each test multiple times.
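
The averaging is essentially this (again only a sketch of the idea, not my actual parser; the file pattern and the "kernel" category filter are placeholders, and the category string may differ between PyTorch versions):

```python
import glob
import json

# Sum the durations (microseconds) of GPU kernel events in each
# per-rank Chrome trace and average the totals across the files.
per_gpu_totals = []
for path in glob.glob("trace_rank*.json"):
    with open(path) as f:
        events = json.load(f).get("traceEvents", [])
    total_us = sum(e.get("dur", 0) for e in events if e.get("cat") == "kernel")
    per_gpu_totals.append(total_us)

avg_ms = sum(per_gpu_totals) / len(per_gpu_totals) / 1e3
print(f"average total kernel time per GPU: {avg_ms:.1f} ms")
```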

Is it possible that the profiler doesn’t scale well with a large number of GPUs? I doubt it, to be honest, but I’m not sure and am looking for explanations for this behaviour.

Would appreciate any kind of help!