How to profile ENTIRE PyTorch code when GPUs are present?

I want to profile my entire PyTorch training and evaluation code. I am using custom dataloaders (e.g. the torchmeta library) and novel PyTorch libraries (e.g. the higher library), and I see a very significant slowdown compared with what other libraries report, despite using better GPUs (a V100 vs. a Titan Xp). Their runs take 2.5 hours while mine takes 16 hours or more.

Instead of sharing the code, I want to profile the two ENTIRE scripts and pinpoint what is slowing things down by comparing the profilers' output.

Unfortunately, there are a lot of profilers and it is hard to choose one. What is worse, most examples seem focused on profiling a specific model and do not include the dataloader. For me, the dataloader and the entire code are crucial.
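To make concrete what I mean by including the dataloader, here is a rough sketch of the kind of split I would like a profiler to give me: time spent waiting for the next batch versus time spent in the per-batch work. The names `time_loader_vs_step`, `fake_loader`, and `fake_step` are hypothetical and only for illustration; it uses nothing but the standard library and works with any iterable. I assume that on a GPU the step time is only meaningful if the step waits for the device to finish (for example by calling `torch.cuda.synchronize()` inside it), since CUDA kernels are launched asynchronously.

```python
import time


def time_loader_vs_step(loader, step_fn, max_batches=None):
    """Return (seconds waiting on loader, seconds in step_fn, batches seen)."""
    load_s = 0.0
    step_s = 0.0
    n = 0
    it = iter(loader)
    while max_batches is None or n < max_batches:
        t0 = time.perf_counter()
        try:
            batch = next(it)  # time spent here is data loading
        except StopIteration:
            break
        t1 = time.perf_counter()
        step_fn(batch)  # time spent here is the per-batch work
        t2 = time.perf_counter()
        load_s += t1 - t0
        step_s += t2 - t1
        n += 1
    return load_s, step_s, n


def fake_loader(num_batches=5, delay=0.01):
    # Hypothetical stand-in for a slow dataloader.
    for i in range(num_batches):
        time.sleep(delay)
        yield i


def fake_step(batch):
    # Hypothetical stand-in for forward/backward/optimizer work.
    time.sleep(0.002)


if __name__ == "__main__":
    load_s, step_s, n = time_loader_vs_step(fake_loader(), fake_step)
    print(f"batches={n} loader={load_s:.3f}s step={step_s:.3f}s")
```

If the loader time dominates, the slowdown would be in data loading rather than in the model, which is exactly what model-only profiling examples would miss.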

These are the resources I’ve found:

python -m cProfile -s cumtime > profile.txt

Which one is recommended for profiling the entire code so that it works even when GPUs are present? Is:

python -m cProfile -s cumtime > profile.txt  

the best way to do this?
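For reference, this is roughly what I understand the whole-script cProfile approach to be when done from inside Python instead of from the command line. It is only a sketch: `profile_call` and `main` are hypothetical names, and `main` stands in for my real training and evaluation entry point. My understanding is that cProfile only measures host-side Python time, so because CUDA kernels run asynchronously, GPU time may end up attributed to whichever call happens to wait on the device (e.g. `.item()` or `.cpu()`), which is part of why I am unsure it is the right tool.

```python
import cProfile
import io
import pstats


def profile_call(fn, *args, top=10, **kwargs):
    """Run fn under cProfile; return (result, report sorted by cumulative time)."""
    profiler = cProfile.Profile()
    profiler.enable()
    try:
        result = fn(*args, **kwargs)
    finally:
        profiler.disable()
    buf = io.StringIO()
    stats = pstats.Stats(profiler, stream=buf)
    # Same ordering as `-s cumtime` on the command line.
    stats.sort_stats("cumulative").print_stats(top)
    return result, buf.getvalue()


def main():
    # Hypothetical stand-in for the real training + eval entry point.
    total = 0
    for i in range(100_000):
        total += i * i
    return total


if __name__ == "__main__":
    _, report = profile_call(main)
    print(report)
```

Running this on both scripts and comparing the top entries by cumulative time is the comparison I have in mind.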

(By the way, profiling seems better than randomly changing my code until it speeds up.)