PyTorch profiling script killed by Ubuntu on Microsoft WSL

I am trying to run a profiling script for PyTorch on MS WSL 2.0 with Ubuntu 20.04.
WSL is on the newest version (wsl --update). I am running the stable conda PyTorch build with CUDA 11.3 from the PyTorch website (PyTorch 1.11). My GPU is a GTX 1650 Ti.

I ran my script and it finished without error. Then I tried to profile it using PyTorch’s bottleneck profiling tool: python -m torch.utils.bottleneck run.py. If I run for a small number of epochs, the script again finishes fine. But when I do a longer run, I get the message Killed after the script runs “through” the autograd profiler. The command dmesg gives this output at the end:

[ 1224.321233] oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=/,mems_allowed=0,global_oom,task_memcg=/,task=python,pid=295,uid=1000
[ 1224.321421] Out of memory: Killed process 295 (python) total-vm:55369308kB, anon-rss:15107852kB, file-rss:0kB, shmem-rss:353072kB, UID:1000 pgtables:39908kB oom_score_adj:0
[ 1224.746786] oom_reaper: reaped process 295 (python), now anon-rss:0kB, file-rss:0kB, shmem-rss:353936kB

So, when using the profiler, there seems to be a memory error. Is this related to the profiler saving too much data in memory? If so, is it a common problem that occurs for runs that are too long?

I appreciate any help with this issue. It would be quite nice to be able to do a longer profiling run, in order to determine the bottlenecks / expensive operations in my PyTorch code.

Thanks! Best, JZ

I think your explanation makes sense and it seems the process uses ~15GB before it gets killed.
Unfortunately, I don’t know if there is a workaround other than to reduce the profiling time or to increase the RAM.
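
If the goal is to reduce the profiling time, one option might be the torch.profiler API, which can restrict recording to a small window of steps via a schedule, so only a few iterations are held in memory. This is just a sketch (not verified on WSL); n_steps and train_step() are placeholders for your own loop:

import torch
from torch.profiler import profile, schedule, ProfilerActivity

# record only 3 active steps after 1 wait and 1 warmup step,
# instead of tracing the whole run
with profile(
    activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
    schedule=schedule(wait=1, warmup=1, active=3, repeat=1),
) as prof:
    for step in range(n_steps):   # n_steps: length of your training loop (placeholder)
        train_step()              # placeholder for one iteration of your script
        prof.step()               # advance the profiler schedule

print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))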

Thanks for the reply! OK, so you would also suspect that this is a common error when making too long / large runs with the profiler? I’d be interested to know the reason behind it. I surmise the profiler is saving a lot of intermediate data, which at some point simply exceeds memory. But why does it need to save all that data? All it should need to store are the stats after each loop and, at the end, return their averages; or is it more complicated?

EDIT

There also seems to be an issue with NVIDIA CUPTI (CUDA Profiling Tools Interface). Each time I run the script, I get a warning:

Running your script with the autograd profiler...
WARNING:2022-06-01 13:37:49 513:513 init.cpp:129] function status failed with error CUPTI_ERROR_NOT_INITIALIZED (15)
WARNING:2022-06-01 13:37:49 513:513 init.cpp:130] CUPTI initialization failed - CUDA profiler activities will be missing

This indicates that CUPTI is not running on WSL, and therefore the profiling does not work. However, if I run on pure Windows, it works (but much slower, due to the Windows overhead in worker creation and other suspected difficulties with multiprocessing data loading). Digging further, I found in the support matrix for WSL (CUDA on WSL :: CUDA Toolkit Documentation) that profiling tools are not yet supported. So it seems there is no way to get it running based on CUPTI for now.

Therefore, is there another way to get an accurate profiling result without using CUPTI? For instance, if I use the regular cProfile profiler and call torch.cuda.synchronize in my loop, is it somehow possible to get the actual CUDA running times?

Thanks!

It is more complicated, as profilers trace the entire run and can create timelines from it, not only general statistics. E.g. the built-in profiler (Kineto) as well as Nsight Systems would show the execution timelines, the kernel launches, etc.
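
As a minimal sketch (model and batch stand in for whatever your script runs), the built-in profiler can export such a timeline for inspection in chrome://tracing or TensorBoard:

from torch.profiler import profile, ProfilerActivity

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    _ = model(batch)   # region to trace; placeholder for your workload

# write a Chrome trace containing the execution timeline (ops, kernel launches)
prof.export_chrome_trace("trace.json")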

Yes, you could use this manual profiling approach by synchronizing the code manually before starting and stopping timers.
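
A minimal sketch of that approach (host-side wall clock, so it includes Python overhead; model and batch are placeholders):

import time
import torch

torch.cuda.synchronize()   # make sure pending GPU work is finished
t0 = time.perf_counter()

_ = model(batch)           # region to measure

torch.cuda.synchronize()   # wait for the kernels launched above
t1 = time.perf_counter()
print(f"{(t1 - t0) * 1e3:.3f} ms")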

Thanks for the reply, again.

So, here is a small example:

import torch
import torch.nn as nn
import numpy as np

n_runs = 100  # number of runs
n_loop = 1000 # forward passes per run

# generate model and move it to the GPU
model = nn.Sequential(
    nn.Linear(32,32),
    nn.ReLU(),
    nn.Linear(32,32),
    nn.ReLU(),
    nn.Linear(32,1)
    ).cuda()

# generate events
start = torch.cuda.Event(enable_timing=True)
end   = torch.cuda.Event(enable_timing=True)

means = []; stds = []
for _ in range(n_runs):
    
    timesb = []
    for _ in range(n_loop): 

        # generate batch on the GPU
        batch = torch.rand(12, 32, device='cuda')

        # record event timing
        start.record()
        _ = model(batch)
        end.record()

        # synchronization
        torch.cuda.synchronize()

        # append the elapsed time (elapsed_time returns milliseconds)
        timesb.append(start.elapsed_time(end))
        
    # append means and stds
    means.append(np.mean(timesb))
    stds.append(np.std(timesb))
    
# the factor 1000 converts the millisecond readings to microseconds
print('%0.2fus +- %0.2fus'%(1000*np.mean(means),1000*np.mean(stds)))
>> 105.65us +- 16.71us

But is there an easy way to record the timings of each individual operation (e.g. Linear and ReLU) and return them? Or would all of them have to be wrapped in separate (start…end) events?
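
To make the second option concrete, here is a rough, untested sketch of what I imagine: attaching forward hooks so that each submodule gets its own pair of events (names like events, pre_hook, post_hook are just made up for illustration):

# sketch: wrap each submodule in its own CUDA events via forward hooks
events = {}

def pre_hook(module, inp):
    ev = (torch.cuda.Event(enable_timing=True), torch.cuda.Event(enable_timing=True))
    events[module] = ev
    ev[0].record()

def post_hook(module, inp, out):
    events[module][1].record()

for m in model:   # nn.Sequential iterates over its submodules
    m.register_forward_pre_hook(pre_hook)
    m.register_forward_hook(post_hook)

_ = model(batch)
torch.cuda.synchronize()
for m, (s, e) in events.items():
    print(type(m).__name__, '%.3f ms' % s.elapsed_time(e))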