Scaling of the PyTorch profiler with a large number of nodes

Hey there,

I’m using the PyTorch profiler (stepping it with profiler.step()) to analyze my code for several fairly large models on 1, 2, 4, 8, and 16 nodes with 8 GPUs each. I export the traces to JSON files and evaluate them with my own parser and additionally with TensorBoard.
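
Roughly, my setup looks like this (a simplified sketch of the idea; the schedule values, file names, and the train_step/train_loader names are placeholders, not my exact code):

```python
import os
import torch
from torch.profiler import profile, schedule, ProfilerActivity

rank = int(os.environ.get("RANK", 0))

# Profile only a few steps per cycle so each trace stays manageable;
# these schedule values are placeholders, not my actual settings.
prof_schedule = schedule(wait=1, warmup=1, active=3, repeat=1)

def save_trace(prof):
    # One JSON trace per rank/GPU, as described above.
    prof.export_chrome_trace(f"trace_rank{rank}.json")

with profile(
    activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
    schedule=prof_schedule,
    on_trace_ready=save_trace,
) as prof:
    for step, batch in enumerate(train_loader):  # train_loader: placeholder
        train_step(batch)                        # placeholder for the real training step
        prof.step()                              # advance the profiler schedule
```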

I’m seeing strange behaviour and exploding runtimes with 8 and especially 16 nodes. The profiler stores one trace per GPU, so with 64 GPUs I end up with 64 JSON files, and I average the times across them. I run each test multiple times.
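
The averaging is essentially this (again only a sketch of the idea, not my actual parser; the file pattern and the "kernel" category filter are placeholders, and the category string may differ between PyTorch versions):

```python
import glob
import json

# Sum the durations (microseconds) of GPU kernel events in each
# per-rank Chrome trace and average the totals across the files.
per_gpu_totals = []
for path in glob.glob("trace_rank*.json"):
    with open(path) as f:
        events = json.load(f).get("traceEvents", [])
    total_us = sum(e.get("dur", 0) for e in events if e.get("cat") == "kernel")
    per_gpu_totals.append(total_us)

avg_ms = sum(per_gpu_totals) / len(per_gpu_totals) / 1e3
print(f"average total kernel time per GPU: {avg_ms:.1f} ms")
```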

Is it possible that the profiler doesn’t scale well with a large number of GPUs? I doubt it, to be honest, but I’m not sure and am looking for explanations for this behaviour.

Would appreciate any kind of help!