Hello,
I am trying to optimize my workflow so that I can run a sweep of experiments on a cluster of GPUs. I have already trained my model, and am only interested in running inference with various data pre-processing configs. I've already attempted the following optimizations (a rough sketch of how they fit together is shown after this list):
- Setting pin_memory=True in my DataLoader
- Setting num_workers > 0 in my DataLoader
- Passing non_blocking=True when pushing data from host to device
- Using different CUDA streams for computations that can run concurrently on the GPU
- Experimenting with the batch size
- Removing CPU-GPU sync operations like .item() and .any()
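For reference, here is a rough sketch of how those settings are wired together in my inference loop (simplified; my_dataset and model are placeholders for the real objects built from my experiment configs):

```python
import torch
from torch.utils.data import DataLoader

device = torch.device("cuda")

loader = DataLoader(
    my_dataset,            # placeholder for the real dataset
    batch_size=64,         # experimented with different values
    num_workers=4,         # > 0 so pre-processing runs in worker processes
    pin_memory=True,       # page-locked host memory enables async HtoD copies
)

copy_stream = torch.cuda.Stream()

model.eval()
with torch.no_grad():
    for batch in loader:
        # Issue the host-to-device copy on a side stream (simplified here;
        # the real code overlaps this with compute from the previous batch)
        with torch.cuda.stream(copy_stream):
            batch = batch.to(device, non_blocking=True)
        torch.cuda.current_stream().wait_stream(copy_stream)
        out = model(batch)
        # no .item()/.any() on the results here, to avoid forcing a CPU-GPU sync
```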
These attempts have provided some speedups, but experiment runtimes are still not suitable for what I want to study. I've tried to use torch.utils.bottleneck to identify where the bottleneck really is. The following are the top six items when sorting by self_cuda_time_total:
| Name | Self CPU % | Self CPU | CPU total % | CPU total | CPU time avg | Self CUDA | Self CUDA % | CUDA total | CUDA time avg | CPU Mem | Self CPU Mem | CUDA Mem | Self CUDA Mem | # of Calls |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| aten::cudnn_convolution | 0.26% | 34.981ms | 1.27% | 171.466ms | 204.126us | 755.697ms | 62.63% | 779.273ms | 927.706us | 0 b | 0 b | 1.50 Gb | 1.50 Gb | 840 |
| void cudnn::cnn::conv2d_grouped_direct_kernel<false,... | 0.00% | 0.000us | 0.00% | 0.000us | 0.000us | 755.697ms | 62.63% | 755.697ms | 899.639us | 0 b | 0 b | 0 b | 0 b | 840 |
| aten::copy_ | 0.18% | 24.274ms | 1.99% | 269.458ms | 79.863us | 248.706ms | 20.61% | 262.179ms | 77.706us | 0 b | 0 b | 0 b | 0 b | 3374 |
| Memcpy HtoD (Pinned -> Device) | 0.00% | 0.000us | 0.00% | 0.000us | 0.000us | 133.730ms | 11.08% | 133.730ms | 9.552ms | 0 b | 0 b | 0 b | 0 b | 14 |
| Memcpy DtoD (Device -> Device) | 0.00% | 0.000us | 0.00% | 0.000us | 0.000us | 114.976ms | 9.53% | 114.976ms | 54.750us | 0 b | 0 b | 0 b | 0 b | 2100 |
| cudaLaunchKernel | 12.35% | 1.674s | 12.36% | 1.674s | 67.111us | 93.706ms | 7.77% | 93.706ms | 3.757us | 0 b | 0 b | 0 b | 0 b | 24941 |
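(For completeness: a table like the one above can also be collected directly with torch.profiler and sorted by the same column. This is just a minimal sketch; loader, model, and device are the same placeholders as in the earlier snippet.)

```python
import torch
from torch.profiler import profile, ProfilerActivity

# Profile the inference loop on both CPU and CUDA
with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    with torch.no_grad():
        for batch in loader:
            out = model(batch.to(device, non_blocking=True))

# Sort the aggregated results the same way as torch.utils.bottleneck's table above
print(prof.key_averages().table(sort_by="self_cuda_time_total", row_limit=6))
```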
It seems that, since cudnn_convolution is the most prevalent kernel, my application is GPU compute bound. After perusing the forums, it looks like my only way to get real speedups from here is to use torch.compile. There are large chunks of my code that use many torch operations, so torch.compile seems attractive for these leaf functions. However, when I try using the decorator (or manual torch.compile() calls in my app) on these functions, I get the following error (a simplified sketch of how I apply it is shown after the traceback):
torch._dynamo.exc.BackendCompilerFailed: backend='inductor' raised:
KeyError: 'getpwuid(): uid not found: 23568'
Set TORCH_LOGS="+dynamo" and TORCHDYNAMO_VERBOSE=1 for more information
You can suppress this exception and fall back to eager by setting:
import torch._dynamo
torch._dynamo.config.suppress_errors = True
[2024-12-31 19:20:32,115] torch._dynamo.utils: [INFO] TorchDynamo compilation metrics:
[2024-12-31 19:20:32,115] torch._dynamo.utils: [INFO] Function, Runtimes (s)
[2024-12-31 19:20:32,115] torch._dynamo.utils: [INFO] _compile.<locals>.compile_inner, 0.0000
[2024-12-31 19:20:32,115] torch._dynamo.utils: [INFO] OutputGraph.call_user_compiler, 0.0000
[2024-12-31 19:20:32,115] torch._dynamo.utils: [INFO] create_aot_dispatcher_function, 2.6140
[2024-12-31 19:20:32,115] torch._dynamo.utils: [INFO] compile_fx.<locals>.fw_compiler_base, 0.0000
[2024-12-31 19:20:32,115] torch._dynamo.utils: [INFO] GraphLowering.run, 0.0712
[2024-12-31 19:20:32,115] torch._dynamo.utils: [INFO] GraphLowering.compile_to_module, 0.0000
[2024-12-31 19:20:32,115] torch._dynamo.utils: [INFO] Scheduler.__init__, 0.3985
[2024-12-31 19:20:32,115] torch._dynamo.utils: [INFO] Scheduler.codegen, 0.0000
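For context, this is roughly how I'm applying torch.compile (simplified sketch; preprocess_on_gpu is a placeholder for one of my leaf functions, not my real code):

```python
import torch

@torch.compile  # decorator form; I see the same error with a manual torch.compile(fn) call
def preprocess_on_gpu(x: torch.Tensor) -> torch.Tensor:
    # Placeholder for a chunk of pure-torch pre-processing ops from my pipeline
    x = torch.nn.functional.interpolate(x, scale_factor=0.5, mode="bilinear")
    return (x - x.mean(dim=(2, 3), keepdim=True)) / (x.std(dim=(2, 3), keepdim=True) + 1e-6)

# Manual form I also tried on the model itself:
# compiled_model = torch.compile(model, backend="inductor")
```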
This is strange because running id does reveal that uid=23568 exists. The GPUs in my cluster are NVIDIA A100s, and I'm using a Docker image with PyTorch version 2.1.0+cu121 and CUDA version 12.4. My dataset is too large to load in its entirety onto my GPU (the dataset is around 350GB and the GPU has only 8GB of VRAM). Any ideas on how I can get torch.compile to work in my application?
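In case it helps with diagnosing this, here is a minimal check that I believe reproduces the failing lookup. This assumes the KeyError comes from a pwd/getpwuid lookup somewhere inside the inductor backend, which is just my guess from the error message, not something I've confirmed:

```python
import os
import pwd

uid = os.getuid()
print(uid)  # prints 23568 inside my container

# pwd.getpwuid raises KeyError ("getpwuid(): uid not found: ...") when the uid
# has no entry in /etc/passwd, even though the numeric uid itself is valid.
try:
    print(pwd.getpwuid(uid))
except KeyError as e:
    print("no passwd entry for this uid:", e)
```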