Hello,
I am trying to optimize my workflow so that I can run a sweep of experiments on a cluster of GPUs. I have already trained my model, and am only interested in running inference with various data pre-processing configs. I've already attempted the following optimizations (a rough sketch of how they fit together is shown after this list):
- Setting pin_memory=True in my DataLoader
- Setting num_workers > 0 in my DataLoader
- Passing non_blocking=True when pushing data from host to device
- Using different CUDA streams for computations that can run concurrently on the GPU
- Experimenting with the batch size
- Removing CPU-GPU sync operations like .item() and .any()
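For reference, here is a rough sketch of how those settings are wired together in my inference loop (simplified; my_dataset and model are placeholders for the real objects built from my experiment configs):

```python
import torch
from torch.utils.data import DataLoader

device = torch.device("cuda")

loader = DataLoader(
    my_dataset,            # placeholder for the real dataset
    batch_size=64,         # experimented with different values
    num_workers=4,         # > 0 so pre-processing runs in worker processes
    pin_memory=True,       # page-locked host memory enables async HtoD copies
)

copy_stream = torch.cuda.Stream()

model.eval()
with torch.no_grad():
    for batch in loader:
        # Issue the host-to-device copy on a side stream (simplified here;
        # the real code overlaps this with compute from the previous batch)
        with torch.cuda.stream(copy_stream):
            batch = batch.to(device, non_blocking=True)
        torch.cuda.current_stream().wait_stream(copy_stream)
        out = model(batch)
        # no .item()/.any() on the results here, to avoid forcing a CPU-GPU sync
```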
These attempts have provided some speedups, but experiment runtimes are still not suitable for what I want to study. I've tried to use torch.utils.bottleneck to identify where the bottleneck really is. The following are the top six items when sorting by self_cuda_time_total:
| Name | Self CPU % | Self CPU | CPU total % | CPU total | CPU time avg | Self CUDA | Self CUDA % | CUDA total | CUDA time avg | CPU Mem | Self CPU Mem | CUDA Mem | Self CUDA Mem | # of Calls |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| aten::cudnn_convolution | 0.26% | 34.981ms | 1.27% | 171.466ms | 204.126us | 755.697ms | 62.63% | 779.273ms | 927.706us | 0 b | 0 b | 1.50 Gb | 1.50 Gb | 840 |
| void cudnn::cnn::conv2d_grouped_direct_kernel<false,... | 0.00% | 0.000us | 0.00% | 0.000us | 0.000us | 755.697ms | 62.63% | 755.697ms | 899.639us | 0 b | 0 b | 0 b | 0 b | 840 |
| aten::copy_ | 0.18% | 24.274ms | 1.99% | 269.458ms | 79.863us | 248.706ms | 20.61% | 262.179ms | 77.706us | 0 b | 0 b | 0 b | 0 b | 3374 |
| Memcpy HtoD (Pinned -> Device) | 0.00% | 0.000us | 0.00% | 0.000us | 0.000us | 133.730ms | 11.08% | 133.730ms | 9.552ms | 0 b | 0 b | 0 b | 0 b | 14 |
| Memcpy DtoD (Device -> Device) | 0.00% | 0.000us | 0.00% | 0.000us | 0.000us | 114.976ms | 9.53% | 114.976ms | 54.750us | 0 b | 0 b | 0 b | 0 b | 2100 |
| cudaLaunchKernel | 12.35% | 1.674s | 12.36% | 1.674s | 67.111us | 93.706ms | 7.77% | 93.706ms | 3.757us | 0 b | 0 b | 0 b | 0 b | 24941 |
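(For completeness: a table like the one above can also be collected directly with torch.profiler and sorted by the same column. This is just a minimal sketch; loader, model, and device are the same placeholders as in the earlier snippet.)

```python
import torch
from torch.profiler import profile, ProfilerActivity

# Profile the inference loop on both CPU and CUDA
with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    with torch.no_grad():
        for batch in loader:
            out = model(batch.to(device, non_blocking=True))

# Sort the aggregated results the same way as torch.utils.bottleneck's table above
print(prof.key_averages().table(sort_by="self_cuda_time_total", row_limit=6))
```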
It seems that, since cudnn_convolution is the most prevalent kernel, my application is GPU compute bound. After perusing the forums, it looks like my only way to get real speedups from here is to use torch.compile. There are large chunks of my code that use many torch operations, so torch.compile seems attractive for these leaf functions. However, when I try using the decorator (or manual torch.compile() calls in my app) on these functions, I get the following error (a simplified sketch of how I apply it is shown after the traceback):
torch._dynamo.exc.BackendCompilerFailed: backend='inductor' raised:
KeyError: 'getpwuid(): uid not found: 23568'
Set TORCH_LOGS="+dynamo" and TORCHDYNAMO_VERBOSE=1 for more information
You can suppress this exception and fall back to eager by setting:
import torch._dynamo
torch._dynamo.config.suppress_errors = True
[2024-12-31 19:20:32,115] torch._dynamo.utils: [INFO] TorchDynamo compilation metrics:
[2024-12-31 19:20:32,115] torch._dynamo.utils: [INFO] Function, Runtimes (s)
[2024-12-31 19:20:32,115] torch._dynamo.utils: [INFO] _compile.<locals>.compile_inner, 0.0000
[2024-12-31 19:20:32,115] torch._dynamo.utils: [INFO] OutputGraph.call_user_compiler, 0.0000
[2024-12-31 19:20:32,115] torch._dynamo.utils: [INFO] create_aot_dispatcher_function, 2.6140
[2024-12-31 19:20:32,115] torch._dynamo.utils: [INFO] compile_fx.<locals>.fw_compiler_base, 0.0000
[2024-12-31 19:20:32,115] torch._dynamo.utils: [INFO] GraphLowering.run, 0.0712
[2024-12-31 19:20:32,115] torch._dynamo.utils: [INFO] GraphLowering.compile_to_module, 0.0000
[2024-12-31 19:20:32,115] torch._dynamo.utils: [INFO] Scheduler.__init__, 0.3985
[2024-12-31 19:20:32,115] torch._dynamo.utils: [INFO] Scheduler.codegen, 0.0000
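For context, this is roughly how I'm applying torch.compile (simplified sketch; preprocess_on_gpu is a placeholder for one of my leaf functions, not my real code):

```python
import torch

@torch.compile  # decorator form; I see the same error with a manual torch.compile(fn) call
def preprocess_on_gpu(x: torch.Tensor) -> torch.Tensor:
    # Placeholder for a chunk of pure-torch pre-processing ops from my pipeline
    x = torch.nn.functional.interpolate(x, scale_factor=0.5, mode="bilinear")
    return (x - x.mean(dim=(2, 3), keepdim=True)) / (x.std(dim=(2, 3), keepdim=True) + 1e-6)

# Manual form I also tried on the model itself:
# compiled_model = torch.compile(model, backend="inductor")
```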
This is strange because running id does reveal that uid=23568 exists. The GPUs in my cluster are NVIDIA A100s, and I'm using a Docker image with PyTorch version 2.1.0+cu121 and CUDA version 12.4. My dataset is too large to load in its entirety onto my GPU (the dataset is around 350GB and the GPU has only 8GB of VRAM). Any ideas on how I can get torch.compile to work in my application?
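In case it helps with diagnosing this, here is a minimal check that I believe reproduces the failing lookup. This assumes the KeyError comes from a pwd/getpwuid lookup somewhere inside the inductor backend, which is just my guess from the error message, not something I've confirmed:

```python
import os
import pwd

uid = os.getuid()
print(uid)  # prints 23568 inside my container

# pwd.getpwuid raises KeyError ("getpwuid(): uid not found: ...") when the uid
# has no entry in /etc/passwd, even though the numeric uid itself is valid.
try:
    print(pwd.getpwuid(uid))
except KeyError as e:
    print("no passwd entry for this uid:", e)
```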