I am trying to run a profiling script for PyTorch on MS WSL 2.0 with Ubuntu 20.04.
WSL is on the newest version (wsl --update). I am running the stable conda PyTorch 1.11 build with CUDA 11.3 from the PyTorch website. My GPU is a GTX 1650 Ti.
I ran my script and it finished without error. Then I tried to profile it using PyTorch’s bottleneck profiling tool, python -m torch.utils.bottleneck run.py. If I run for a small number of epochs, the script again finishes fine. But when I do a longer run, I get the message Killed after the script runs “through” the autograd profiler. The command dmesg gives this output at the end:
[ 1224.321233] oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=/,mems_allowed=0,global_oom,task_memcg=/,task=python,pid=295,uid=1000
[ 1224.321421] Out of memory: Killed process 295 (python) total-vm:55369308kB, anon-rss:15107852kB, file-rss:0kB, shmem-rss:353072kB, UID:1000 pgtables:39908kB oom_score_adj:0
[ 1224.746786] oom_reaper: reaped process 295 (python), now anon-rss:0kB, file-rss:0kB, shmem-rss:353936kB
So, when using the profiler, there seems to be a memory error. Is this related to the profiler saving too much data in memory? If so, this might be a common problem for longer runs?
I appreciate any help with this issue. It would be quite nice to profile a longer run in order to determine the bottlenecks / expensive operations in my PyTorch code.
I think your explanation makes sense and it seems the process uses ~15GB before it gets killed.
Unfortunately, I don’t know if there is a workaround other than to reduce the profiling time or to increase the RAM.
Thanks for the reply! OK, so you would also suspect that this is a common error for long / large runs with the profiler? I’d be interested to know the reason behind it. I surmise the profiler is saving a lot of intermediate data, which at some point simply exceeds memory. But why does it need to save all of that data? All it should need to store are the stats after each loop and, at the end, return their averages, or is it more complicated?
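As a rough illustration of why memory grows with run length (this is a simplified sketch, not the actual profiler internals): a profiler that can build a timeline has to keep one record per event, while a stats-only profiler could get away with a constant-size running aggregate:

```python
# Simplified sketch, NOT the real PyTorch profiler internals: contrast
# storing every event (needed for timelines) with constant-size stats.

class TimelineProfiler:
    """Keeps one record per event; memory grows with run length."""
    def __init__(self):
        self.events = []  # one (name, start, duration) entry per op call

    def record(self, name, start, duration):
        self.events.append((name, start, duration))


class StatsProfiler:
    """Keeps only count and total per op name; memory stays bounded."""
    def __init__(self):
        self.totals = {}  # name -> (count, total_duration)

    def record(self, name, start, duration):
        count, total = self.totals.get(name, (0, 0.0))
        self.totals[name] = (count + 1, total + duration)

    def average(self, name):
        count, total = self.totals[name]
        return total / count
```

For a 1,000,000-step run the first variant holds 1,000,000 records, the second a handful of dictionary entries, which is roughly the trade-off behind the OOM you are seeing.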
There also seems to be an issue with Nvidia CUPTI (CUDA Profiling Tools Interface). Each time I run the script, I get a warning:
Running your script with the autograd profiler...
WARNING:2022-06-01 13:37:49 513:513 init.cpp:129] function status failed with error CUPTI_ERROR_NOT_INITIALIZED (15)
WARNING:2022-06-01 13:37:49 513:513 init.cpp:130] CUPTI initialization failed - CUDA profiler activities will be missing
This indicates that CUPTI is not running on WSL and therefore the profiling does not work. However, if I run on pure Windows, it works (but much more slowly, due to the Windows overhead in worker creation and other suspected difficulties with multiprocess data loading). Digging further, I found that the support matrix for WSL (CUDA on WSL :: CUDA Toolkit Documentation) says that profiling tools are not yet supported. So it seems there is no possibility of getting it running based on CUPTI for now.
Therefore, is there another way to get an accurate profiling result without using CUPTI? For instance, if I use the regular cProfile profiler and call torch.cuda.synchronize in my loop, is it somehow possible to get the actual CUDA running times?
It is more complicated, as profilers record events from the entire run and are able to create timelines from them, not only aggregate statistics. E.g. the built-in profiler (Kineto) as well as Nsight Systems would show the execution timelines, the kernel launches, etc.
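For long runs, the built-in profiler keeps this tractable via scheduling: torch.profiler.schedule(wait=..., warmup=..., active=..., repeat=...) records only a few steps per cycle, so the amount of stored data stays bounded. A minimal pure-Python sketch of that wait/warmup/active idea (the parameter names mirror the real API, but this implementation is purely illustrative):

```python
from enum import Enum

class ProfilerAction(Enum):
    NONE = 0     # step is skipped entirely
    WARMUP = 1   # profiler runs but results are discarded
    RECORD = 2   # step is recorded into the trace

def make_schedule(wait, warmup, active, repeat=0):
    """Sketch of wait/warmup/active scheduling: out of each cycle of
    (wait + warmup + active) steps, only `active` steps are recorded,
    bounding stored profiling data regardless of total run length.
    With repeat > 0, recording stops after that many cycles."""
    cycle = wait + warmup + active
    def schedule(step):
        if repeat and step >= cycle * repeat:
            return ProfilerAction.NONE
        pos = step % cycle
        if pos < wait:
            return ProfilerAction.NONE
        if pos < wait + warmup:
            return ProfilerAction.WARMUP
        return ProfilerAction.RECORD
    return schedule
```

With e.g. wait=1, warmup=1, active=2, only two of every four training steps end up in the trace, which is why a scheduled profile of a long run does not accumulate memory the way an unscheduled one does.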
Yes, you could use this manual profiling approach by synchronizing the code manually before starting and stopping timers.
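A minimal sketch of such manual timing (the `sync` parameter is a hypothetical hook of this sketch, not a torch API; you would pass torch.cuda.synchronize for CUDA code, assuming torch with CUDA is available):

```python
import time

def timed(fn, *args, sync=None, warmup=2, iters=10):
    """Average wall-clock time of fn(*args) over `iters` runs.

    `sync` is called before starting and before stopping the timer so that
    asynchronously launched GPU work has actually finished when the clock
    is read; pass torch.cuda.synchronize when timing CUDA code.
    """
    for _ in range(warmup):  # warm up caches / autotuning before timing
        fn(*args)
    if sync is not None:
        sync()
    start = time.perf_counter()
    for _ in range(iters):
        fn(*args)
    if sync is not None:
        sync()
    return (time.perf_counter() - start) / iters
```

Without the synchronize calls, cProfile or time.perf_counter would mostly measure the (fast, asynchronous) kernel launches rather than the actual CUDA execution time, since CUDA operations in PyTorch are executed asynchronously.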