Hey there,
I’m using the PyTorch profiler (calling profiler.step() each iteration) to analyze my code for several fairly large models on 1, 2, 4, 8, and 16 nodes with 8 GPUs each. I store the trace data in JSON files and evaluate them both with my own parser and with TensorBoard.
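Roughly how I set it up per rank (the schedule values, output path, and the loader/train_step names here are just placeholders, not my exact code):

```python
import os
from torch.profiler import profile, schedule, ProfilerActivity

rank = int(os.environ.get("RANK", "0"))  # one trace file per GPU/rank

prof = profile(
    activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
    schedule=schedule(wait=1, warmup=1, active=3, repeat=1),
    on_trace_ready=lambda p: p.export_chrome_trace(f"trace_rank{rank}.json"),
)

prof.start()
for step, batch in enumerate(loader):   # loader/train_step are placeholders
    train_step(batch)                   # forward + backward + optimizer step
    prof.step()                         # advance the profiler schedule
prof.stop()
```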
I’m seeing strange behaviour and exploding runtimes at 8 and especially 16 nodes. The profiler writes one trace per GPU, so with 64 GPUs I end up with 64 JSON files and average the times across them. I run each test multiple times.
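By “average the times” I mean something along these lines over the per-rank JSON traces (the file pattern is a placeholder, and my actual parser does more than just summing event durations):

```python
import json
from glob import glob
from statistics import mean

def total_event_time_us(path):
    # Sum durations of all complete ("X") events in one Chrome trace file.
    with open(path) as f:
        events = json.load(f)["traceEvents"]
    return sum(ev.get("dur", 0) for ev in events if ev.get("ph") == "X")

totals = [total_event_time_us(p) for p in sorted(glob("trace_rank*.json"))]
print(f"{len(totals)} ranks, mean total event time: {mean(totals) / 1e3:.2f} ms")
```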
Is it possible that the profiler doesn’t scale well with a large number of GPUs? I doubt it, to be honest, but I’m not sure and am searching for an explanation for this behaviour.
Would appreciate any kind of help!