Compiling the model results in 20X slowdown

Hi,
I’m using the default settings for model compilation.

In my case, compiling the model results in a 20X slowdown. I left two runs going for the same amount of time (one compiled and one not), and the results are:

compiled: 873 steps in 8 hours
not-compiled: 16 256 steps in 8 hours
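
For reference, these step counts work out to roughly an 18.6x difference in throughput, close to the 20X in the title. A quick check of the arithmetic (the variable and function names are only for illustration):

```python
# Step counts reported above, both measured over the same 8-hour window.
HOURS = 8
compiled_steps = 873
eager_steps = 16_256  # the "not-compiled" run


def seconds_per_step(steps: int, hours: float = HOURS) -> float:
    """Average wall-clock seconds per training step."""
    return hours * 3600 / steps


slowdown = eager_steps / compiled_steps

print(f"compiled:     {seconds_per_step(compiled_steps):.2f} s/step")
print(f"not-compiled: {seconds_per_step(eager_steps):.2f} s/step")
print(f"slowdown:     {slowdown:.1f}x")
```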

On every forward pass, I’m passing a tensor of exactly the same dimensions (BS x padded-len).

For the compiled model, the first step of the first batch takes a very long time (about 5 minutes; is that the compilation time?).
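
Note that the s/it figure in the progress lines below is close to elapsed time divided by step count (for example 05:20 over 2 steps is about 160 s/it), so a slow first step inflates it for a while. The timestamps point to two separate costs: about 293 s for the first step, then roughly 27 to 32 s between later steps. A minimal sketch for measuring the two separately, assuming a hypothetical zero-argument `step_fn` that runs one training step (it is not part of the original code):

```python
import time
from typing import Callable, Tuple


def time_first_and_steady(step_fn: Callable[[], object],
                          n_steps: int = 10) -> Tuple[float, float]:
    """Return (first_call_seconds, mean_seconds_of_the_remaining_calls).

    step_fn is a hypothetical callable that runs one training step.
    On a GPU it should end with torch.cuda.synchronize(), otherwise the
    timer only measures how long it takes to queue the kernels.
    """
    if n_steps < 2:
        raise ValueError("need at least 2 steps to separate first from steady")
    durations = []
    for _ in range(n_steps):
        start = time.perf_counter()
        step_fn()
        durations.append(time.perf_counter() - start)
    first = durations[0]
    steady = sum(durations[1:]) / (n_steps - 1)
    return first, steady


if __name__ == "__main__":
    # Stand-in for a compiled model: slow on the first call, fast afterwards.
    state = {"calls": 0}

    def fake_step() -> None:
        state["calls"] += 1
        time.sleep(0.05 if state["calls"] == 1 else 0.001)

    first, steady = time_first_and_steady(fake_step, n_steps=5)
    print(f"first step: {first:.3f} s, steady state: {steady:.3f} s/step")
```

If the steady-state number is still far above the not-compiled one, the slowdown is not just one-off compilation time.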

Compiled model:
2023/08/03 23:55:33 Epoch 0: 0%| | 1/6103 [04:53<497:03:28, 293.25s/it]
2023/08/03 23:56:01 Epoch 0: 0%| | 2/6103 [05:20<271:54:04, 160.44s/it, v_num=-332]
2023/08/03 23:56:28 Epoch 0: 0%| | 3/6103 [05:47<196:25:00, 115.92s/it, v_num=-332]
2023/08/03 23:57:00 Epoch 0: 0%| | 4/6103 [06:19<160:49:52, 94.93s/it, v_num=-332]

Not-compiled model:
2023/08/03 23:57:41 Epoch 0: 0%| | 1/6103 [00:07<12:09:16, 7.17s/it, v_num=-334]
2023/08/03 23:57:42 Epoch 0: 0%| | 2/6103 [00:08<7:23:44, 4.36s/it, v_num=-334]
2023/08/03 23:57:44 Epoch 0: 0%| | 3/6103 [00:10<5:48:41, 3.43s/it, v_num=-334]
2023/08/03 23:57:45 Epoch 0: 0%| | 4/6103 [00:11<5:00:26, 2.96s/it, v_num=-334]
2023/08/03 23:57:47 Epoch 0: 0%| | 5/6103 [00:13<4:32:01, 2.68s/it, v_num=-334]
2023/08/03 23:57:48 Epoch 0: 0%| | 6/6103 [00:14<4:12:42, 2.49s/it, v_num=-334]
2023/08/03 23:57:50 Epoch 0: 0%| | 7/6103 [00:16<3:58:51, 2.35s/it, v_num=-334]

Hardware for the runs:
V100 32GB, 1 GPU

Hi @mer,

I also had some problems with model compilation. Most of them got solved by using the latest nightly. Could you try it?

Best,
Thorsten

Sure, I’ll try it and post an update in about an hour.

Edit: it might take a little longer :sweat_smile:

Installing collected packages: pytorch-triton, torch, torchvision, torchaudio
ERROR: Could not install packages due to an OSError: [Errno 38] Function not implemented: '/usr/local/lib/python3.10/site-packages/triton/__init__.py'

@thorstenwagner

My torch packages:

Singularity> pip freeze | grep torch
pytorch-lightning==2.0.6
pytorch-metric-learning==2.3.0
pytorch-ranger==0.1.1
pytorch-triton==2.1.0+440fd1bf20
torch==2.1.0.dev20230620+cu121
torch-optimizer==0.3.0
torchaudio==2.1.0.dev20230620+cu121
torchmetrics==0.11.4
torchvision==0.16.0.dev20230620+cu121

Results - all run for more than 1 hour on an A100 40GB:
compiled -

Epoch 0: 2%|▏ | 135/6103 [1:14:59<55:15:02, 33.33s/it, v_num=-348]

compiled, dynamic=True -

Epoch 0: 2%|▏ | 107/6103 [1:14:52<69:55:47, 41.99s/it, v_num=-350]

not-compiled -

Epoch 1: 87%|████████▋ | 5337/6103 [34:55<05:00, 2.55it/s, v_num=-349]
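
The progress bars above mix two units (s/it for the compiled runs, it/s for the not-compiled one), which makes the gap easy to misread. A small sketch that converts them to seconds per iteration; the helper name is made up for illustration, and the rates are the progress bar's running figures, so treat the ratios as rough:

```python
import re

# Matches progress-bar rates such as "33.33s/it" or "2.55it/s".
_RATE = re.compile(r"(\d+(?:\.\d+)?)\s*(s/it|it/s)")


def seconds_per_iteration(rate: str) -> float:
    """Convert a progress-bar rate string to seconds per iteration."""
    match = _RATE.fullmatch(rate.strip())
    if match is None:
        raise ValueError(f"unrecognised rate: {rate!r}")
    value = float(match.group(1))
    if value == 0:
        raise ValueError("rate must be non-zero")
    return value if match.group(2) == "s/it" else 1.0 / value


# Rates copied from the three progress lines above.
rates = {
    "compiled": "33.33s/it",
    "compiled, dynamic=True": "41.99s/it",
    "not-compiled": "2.55it/s",
}

baseline = seconds_per_iteration(rates["not-compiled"])
for name, rate in rates.items():
    sec = seconds_per_iteration(rate)
    print(f"{name:>24}: {sec:6.2f} s/it ({sec / baseline:.0f}x not-compiled)")
```

On these numbers the compiled runs come out at roughly 85x and 107x the per-step time of the not-compiled run.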

Here is my env:

Singularity> python -m torch.utils.collect_env
Collecting environment information...
PyTorch version: 2.1.0.dev20230620+cu121
Is debug build: False
CUDA used to build PyTorch: 12.1
ROCM used to build PyTorch: N/A

OS: Debian GNU/Linux 11 (bullseye) (x86_64)
GCC version: (Debian 10.2.1-6) 10.2.1 20210110
Clang version: Could not collect
CMake version: version 3.27.0
Libc version: glibc-2.31

Python version: 3.10.10 (main, Mar 23 2023, 03:59:34) [GCC 10.2.1 20210110] (64-bit runtime)
Python platform: Linux-5.14.0-162.18.1.el9_1.x86_64-x86_64-with-glibc2.31
Is CUDA available: True
CUDA runtime version: Could not collect
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration: GPU 0: NVIDIA A100-SXM4-40GB
Nvidia driver version: 525.85.12
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

Do you compile it in ‘reduce-overhead’ mode?

Like

model = torch.compile(model, mode="reduce-overhead")

After about 2 hours:

Epoch 0: 6%|▌ | 359/6103 [2:10:40<34:50:47, 21.84s/it, v_num=-544] - reduce-overhead
Epoch 0: 4%|▍ | 255/6103 [2:10:39<49:56:24, 30.74s/it, v_num=-543] - default
Epoch 0: 6%|▌ | 378/6103 [2:10:39<32:58:47, 20.74s/it, v_num=-542] - max-autotune

It’s still nowhere near the speed of the not-compiled model :/

You are still on an older nightly release from June, so could you try to update to the latest nightly release or are you still running into the installation issues?
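
The build date of a nightly is embedded in its version string, so it can be checked quickly. A minimal sketch, assuming the `.devYYYYMMDD` pattern seen in the `pip freeze` output above (the helper name is hypothetical):

```python
import re
from datetime import date
from typing import Optional


def nightly_build_date(version: str) -> Optional[date]:
    """Extract the build date from a version like '2.1.0.dev20230620+cu121'.

    Returns None when the string has no '.devYYYYMMDD' part.
    """
    match = re.search(r"\.dev(\d{4})(\d{2})(\d{2})", version)
    if match is None:
        return None
    year, month, day = map(int, match.groups())
    return date(year, month, day)


build = nightly_build_date("2.1.0.dev20230620+cu121")
print(build)  # 2023-06-20

# Age relative to the date on the first training logs in this thread.
age = date(2023, 8, 3) - build
print(f"{age.days} days old")
```

In a live environment you would pass `torch.__version__` instead of the hard-coded string.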

Hi @ptrblck,
First of all, thanks for all the responses to all of the threads, not only this one. You have helped me and others using PyTorch soooo much.

I have updated the packages. Previously, I just used

pip3 install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/cu121

and it downloaded those (June) versions. I have now managed to run the install again and get the latest versions, and the issue seems to be resolved.

After 5 mins, the “not-compiled” was the fastest, but after 35 mins the “max-autotune” was actually about 15% faster. I’ll try to post an update later.

Thanks so much.


Compilation doesn't happen when you run opt_m = torch.compile(model); it happens afterwards, the first time you run opt_m(batch). The naming is confusing: compile should really be called a JIT, but that name was already taken XD