Synopsis: Training and inference on a GPU are dramatically slower than running everything on the CPU.
Setup:
- Training a highly customized Transformer model on an Azure VM (Standard NC6s v3: 6 vCPUs, 112 GiB memory) with a Tesla V100 (driver 550.54.15, CUDA 12.4).
- The dataset is small (about 1 GB), with shape [12000, 51, 48], trained in mini-batches of size 256.
- All data are loaded into tensors and moved to GPU memory via .to(device) before the model is even instantiated.
- The model is sent .to(device) and then compiled via model = torch.compile(model, mode="reduce-overhead").
- Training is invoked under torch.autocast(device_type=device, dtype=torch.bfloat16, enabled=False); a minimal sketch of this setup follows below.
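For reference, here is a minimal sketch of what the setup above looks like in code. The model, dataset, and training loop are placeholders standing in for the actual customized Transformer, not the real implementation:

```python
import torch
import torch.nn as nn

device = "cuda" if torch.cuda.is_available() else "cpu"

# Placeholder data matching the shapes described above: [12000, 51, 48].
# Everything is moved to GPU memory up front, before the model exists.
X = torch.randn(12000, 51, 48).to(device)
y = torch.randn(12000, 48).to(device)
loader = torch.utils.data.DataLoader(
    torch.utils.data.TensorDataset(X, y), batch_size=256, shuffle=True
)

# Placeholder model standing in for the customized Transformer.
model = nn.Sequential(nn.Flatten(), nn.Linear(51 * 48, 48)).to(device)
model = torch.compile(model, mode="reduce-overhead")

optimizer = torch.optim.Adam(model.parameters())
loss_fn = nn.MSELoss()

# Training wrapped in autocast; note enabled=False, mirroring the setup above.
with torch.autocast(device_type=device, dtype=torch.bfloat16, enabled=False):
    for xb, yb in loader:
        optimizer.zero_grad()
        loss = loss_fn(model(xb), yb)
        loss.backward()
        optimizer.step()
```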
Situation: Using PyTorch's built-in profiler, I am seeing a dramatic slowdown when computing on the GPU/CUDA rather than doing everything on the CPU. Given the setup above, I do not believe this has anything to do with data flowing between the CPU and GPU, but I could be wrong.
The two tables below were produced from exactly the same dataset on the same machine. The only difference between the CPU run and the GPU/CUDA run is where the tensors live (both are torch.Tensors sent .to(device)) and where the compute occurs. The profiler was invoked roughly as in the sketch below.
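For context, the profiling looks roughly like this (a sketch, not the exact code; train_one_epoch is a hypothetical stand-in for the actual training/inference call). The model_inference label in the tables comes from the record_function block:

```python
from torch.profiler import profile, record_function, ProfilerActivity

# Profile CPU activity, plus CUDA activity when running on the GPU.
activities = [ProfilerActivity.CPU]
if device == "cuda":
    activities.append(ProfilerActivity.CUDA)

with profile(activities=activities, record_shapes=True) as prof:
    with record_function("model_inference"):
        train_one_epoch(model, loader)  # hypothetical stand-in for the real call

# Print a summary table like the ones below.
print(prof.key_averages().table(sort_by="self_cpu_time_total", row_limit=10))
```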
What’s Weird: model_inference is 27.5x slower on the GPU than on the CPU, and model_inference is recorded three times in the CUDA version…
CPU Runtime
Name | Self CPU % | Self CPU | CPU total % | CPU total | CPU time avg | # of Calls |
---|---|---|---|---|---|---|
model_inference | 1.90% | 496.389ms | 100.00% | 26.178s | 26.178s | 1 |
aten::empty | 0.55% | 144.920ms | 0.55% | 144.920ms | 19.006us | 7625 |
aten::random_ | 0.00% | 12.914us | 0.00% | 12.914us | 12.914us | 1 |
aten::item | 0.00% | 3.033us | 0.00% | 4.424us | 4.424us | 1 |
aten::_local_scalar_dense | 0.00% | 1.391us | 0.00% | 1.391us | 1.391us | 1 |
enumerate(DataLoader)#SingleProcessDataLoaderIter.… | 0.64% | 167.173ms | 1.34% | 349.861ms | 9.207ms | 38 |
aten::randperm | 0.00% | 436.807us | 0.00% | 876.528us | 219.132us | 4 |
aten::scalar_tensor | 0.00% | 7.450us | 0.00% | 7.450us | 3.725us | 2 |
aten::resize_ | 0.01% | 1.711ms | 0.01% | 1.711ms | 2.554us | 670 |
aten::resolve_conj | 0.01% | 2.982ms | 0.01% | 2.982ms | 0.326us | 9141 |
Self CPU time total: 26.178s
GPU/CUDA Runtime
Name | Self CPU % | Self CPU | CPU total % | CPU total | CPU time avg | Self CUDA | Self CUDA % | CUDA total | CUDA time avg | # of Calls |
---|---|---|---|---|---|---|---|---|---|---|
model_inference | 0.00% | 0.00us | 0.00% | 0.00us | 0.00us | 1439.411s | 95.27% | 1439.411s | 719.705s | 2 |
GraphLowering.run (dynamo_timed) | 0.00% | 0.00us | 0.00% | 0.00us | 0.00us | 59.987s | 3.97% | 59.987s | 810.641ms | 74 |
CachingAutotuner.benchmark_all_configs (dynamo_timed… | 0.00% | 0.00us | 0.00% | 0.00us | 0.00us | 5.193s | 0.34% | 5.193s | 144.253ms | 36 |
aten::fill_ | 0.13% | 950.923ms | 0.17% | 1.249s | 28.885us | 3.996s | 0.26% | 3.996s | 92.384us | 43249 |
aten::zero_ | 0.13% | 947.515ms | 0.30% | 2.238s | 49.680us | 0.000us | 0.00% | 3.994s | 88.611us | 45049 |
void at::native::vectorized_elementwise_kernel<4, at… | 0.00% | 0.00us | 0.00% | 0.00us | 0.00us | 3.956s | 0.26% | 3.956s | 287.800us | 13747 |
CachingAutotuner.benchmark_all_configs (dynamo_timed… | 0.36% | 2.618s | 0.71% | 5.23s | 145.287ms | 0.000us | 0.00% | 3.737s | 103.808ms | 36 |
model_inference | 0.43% | 3.184s | 99.58% | 731.765s | 731.765s | 0.000us | 0.00% | 2.663s | 2.663s | 1 |
Torch-Compiled Region | 0.00% | 11.368ms | 0.52% | 3.824s | 103.364ms | 0.000us | 0.00% | 2.298s | 62.111ms | 37 |
CompiledFunction | 0.10% | 768.567ms | 0.52% | 3.812s | 103.032ms | 158.932ms | 0.01% | 2.298s | 62.111ms | 37 |
Self CPU time total: 734.872s
Self CUDA time total: 1510.825s
Any insight into what’s causing this would be greatly appreciated.