torch.compile from within a Docker image

Hello,

I am trying to optimize my workflow so that I can run a sweep of experiments on a cluster of GPUs. I have already trained my model, and am just interested in running inference with various data pre-processing configs. I've already attempted the following optimizations (a rough sketch of how they fit together follows the list):

  • Setting pin_memory=True in my DataLoader
  • Setting num_workers > 0 in my DataLoader
  • Providing the option non_blocking=True when pushing data from host to device
  • Using different cuda streams for computations that can run concurrently on the GPU
  • Experimenting with the batch size
  • Removing CPU-GPU sync operations like .item() and .any()
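
A rough sketch of how these settings fit together in my inference loop; the dataset and model here are stand-ins for my real data and trained network:

import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

device = torch.device("cuda")
# stand-ins for my real dataset and trained model
dataset = TensorDataset(torch.randn(256, 3, 64, 64))
model = nn.Conv2d(3, 8, kernel_size=3).to(device).eval()

# pinned host memory + worker processes on the loading side
loader = DataLoader(dataset, batch_size=32, num_workers=4, pin_memory=True)
side_stream = torch.cuda.Stream()

with torch.no_grad():
    for (batch,) in loader:
        # async host-to-device copy; only truly asynchronous from pinned memory
        batch = batch.to(device, non_blocking=True)
        out = model(batch)
        # queue independent work on a second stream so it can overlap
        side_stream.wait_stream(torch.cuda.current_stream())
        with torch.cuda.stream(side_stream):
            stats = out.abs().mean()  # example of independent follow-up work
torch.cuda.synchronize()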

These attempts have provided some speedups, but the experiment runtimes are still too long for the sweep I want to run. I've used torch.utils.bottleneck to try to identify where the real bottleneck is. The following are the top six items when sorted by self_cuda_time_total:

-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg       CPU Mem  Self CPU Mem      CUDA Mem  Self CUDA Mem    # of Calls
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
                                aten::cudnn_convolution         0.26%      34.981ms         1.27%     171.466ms     204.126us     755.697ms        62.63%     779.273ms     927.706us           0 b           0 b       1.50 Gb       1.50 Gb           840
void cudnn::cnn::conv2d_grouped_direct_kernel<false,...         0.00%       0.000us         0.00%       0.000us       0.000us     755.697ms        62.63%     755.697ms     899.639us           0 b           0 b           0 b           0 b           840
                                            aten::copy_         0.18%      24.274ms         1.99%     269.458ms      79.863us     248.706ms        20.61%     262.179ms      77.706us           0 b           0 b           0 b           0 b          3374
                         Memcpy HtoD (Pinned -> Device)         0.00%       0.000us         0.00%       0.000us       0.000us     133.730ms        11.08%     133.730ms       9.552ms           0 b           0 b           0 b           0 b            14
                         Memcpy DtoD (Device -> Device)         0.00%       0.000us         0.00%       0.000us       0.000us     114.976ms         9.53%     114.976ms      54.750us           0 b           0 b           0 b           0 b          2100
                                       cudaLaunchKernel        12.35%        1.674s        12.36%        1.674s      67.111us      93.706ms         7.77%      93.706ms       3.757us           0 b           0 b           0 b           0 b         24941
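
(For reference, the table above came from running the bottleneck wrapper over my inference script, along the lines of python -m torch.utils.bottleneck run_inference.py <args>, where the script name and arguments stand in for my own.)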

It seems that, since cudnn_convolution is the most prevalent kernel, my application is compute-bound on the GPU. After perusing the forums, it seemed like my only real path to further speedups was torch.compile. There are large chunks of my code that use many torch operations, so torch.compile seems attractive for these leaf functions. However, when I try using the decorator (or manual torch.compile() calls in my app) on these functions, I get the following error:

torch._dynamo.exc.BackendCompilerFailed: backend='inductor' raised:
KeyError: 'getpwuid(): uid not found: 23568'

Set TORCH_LOGS="+dynamo" and TORCHDYNAMO_VERBOSE=1 for more information


You can suppress this exception and fall back to eager by setting:
    import torch._dynamo
    torch._dynamo.config.suppress_errors = True

[2024-12-31 19:20:32,115] torch._dynamo.utils: [INFO] TorchDynamo compilation metrics:
[2024-12-31 19:20:32,115] torch._dynamo.utils: [INFO] Function, Runtimes (s)
[2024-12-31 19:20:32,115] torch._dynamo.utils: [INFO] _compile.<locals>.compile_inner, 0.0000
[2024-12-31 19:20:32,115] torch._dynamo.utils: [INFO] OutputGraph.call_user_compiler, 0.0000
[2024-12-31 19:20:32,115] torch._dynamo.utils: [INFO] create_aot_dispatcher_function, 2.6140
[2024-12-31 19:20:32,115] torch._dynamo.utils: [INFO] compile_fx.<locals>.fw_compiler_base, 0.0000
[2024-12-31 19:20:32,115] torch._dynamo.utils: [INFO] GraphLowering.run, 0.0712
[2024-12-31 19:20:32,115] torch._dynamo.utils: [INFO] GraphLowering.compile_to_module, 0.0000
[2024-12-31 19:20:32,115] torch._dynamo.utils: [INFO] Scheduler.__init__, 0.3985
[2024-12-31 19:20:32,115] torch._dynamo.utils: [INFO] Scheduler.codegen, 0.0000
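
For context, this is roughly how I'm applying it; the function below is a stand-in for one of my tensor-heavy pre-processing helpers:

import torch

@torch.compile
def normalize(x: torch.Tensor) -> torch.Tensor:
    # stand-in for one of my pre-processing leaf functions
    return (x - x.mean(dim=0)) / (x.std(dim=0) + 1e-6)

# equivalent manual call instead of the decorator:
# normalize = torch.compile(normalize)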

This is strange because running id does reveal that uid=23568 exists. The GPUs in my cluster are NVIDIA A100s, and I'm using a Docker image with PyTorch version 2.1.0+cu121 and CUDA version 12.4. My dataset is too large to load in its entirety onto my GPU (the dataset is around 350 GB and the GPU has only 8 GB of VRAM). Any ideas on how I can get torch.compile to work in my application?

Turns out this was a simple file permissions error. For anyone else who runs into this issue, the following describes my understanding of the problem and my own solution.

It seems that torch.compile needs to set up a per-user cache directory, and building its path involves looking up the user name for the uid the container runs as. Since I created my Docker image on a machine different from the one where it runs, that uid has no entry in the image's /etc/passwd, the lookup fails, and torch.compile errors out with the KeyError above. My hacky solution was simply to set the USER in my Dockerfile to the uid my image runs as, i.e. by adding the following lines to the bottom of my Dockerfile:

# create a passwd entry for the uid the cluster runs my container as
RUN useradd -u 23568 myExecutionPointUser
# make that user the image's default user
USER myExecutionPointUser
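
For anyone curious where the lookup happens: my understanding is that Inductor derives its on-disk cache path from the current user name, roughly like the sketch below (the exact path construction may differ between versions):

import getpass

# torch._inductor keeps its compile cache in a per-user directory along the
# lines of "/tmp/torchinductor_<username>". getpass.getuser() falls back to a
# passwd lookup (pwd.getpwuid) when USER/LOGNAME/etc. are unset, and that
# lookup raises KeyError: 'getpwuid(): uid not found: <uid>' for a uid with no
# /etc/passwd entry.
cache_dir = f"/tmp/torchinductor_{getpass.getuser()}"
print(cache_dir)

Creating the passwd entry with useradd is what makes that lookup succeed. I would guess that exporting USER or LOGNAME in the container, or pointing TORCHINDUCTOR_CACHE_DIR at a writable path, could also work, but I haven't verified either.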

I can't choose which user my Docker image runs as on the cluster (I can only see what it is), so this seems to be the only solution for me, and it does rely on being able to modify the Docker image I use. I'm now able to use torch.compile. It's not giving me the performance benefits I was expecting, but I believe that is a separate issue (I hope). Not sure how to do this myself, but this thread can be marked as resolved/answered.