Sudden crippling training slowdown since PyTorch 1.9 on GKE

Hi all!
After more than 2 days of fighting a weird drop in training speed I'm back at square one, so I declare myself officially lost!

On the 15th of June I ran my DistilMBERT fine-tuning script for the last time. Back then, speed was ~6 batches/sec (batch size 16); this week, with the same code, same dataset and same hyperparams, one batch takes 2 full seconds, which makes training at that speed impossible.

Since it might be relevant to the problem, here is my current setup: we run on GCP, with a GKE node that has a Tesla V100 (link). On that node we install the nvidia driver (version 450.51.06) using this daemonset, and afterwards we create pods on it.
In other words, we have a server with the GPU card and we create VMs on top of it.

So, after comparing as much as I could, this is what changed in the last month:

  1. New Pytorch version: 1.9.0 got released
  2. The CUDA version on the GKE node went from 10.2 to 11.0
  3. Not sure if the nvidia drivers changed, but that seems highly unlikely

Torch was installed without pinning any version, so I'm guessing that 1 month ago pip pulled in 1.8.0 compiled against CUDA 10.2.
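
For reference, a quick way to check which build actually ended up installed inside the pod (the values in the comments are just examples):

import torch

# Which wheel is installed and which CUDA toolkit it was built against
print("torch:", torch.__version__)             # e.g. 1.9.0+cu102 vs 1.9.0+cu111
print("built with CUDA:", torch.version.cuda)  # runtime bundled in the wheel
print("cuDNN:", torch.backends.cudnn.version())
print("GPU:", torch.cuda.get_device_name(0))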

Things I have tried so far:

  • torch==1.9.0+cu111 → same issue
  • torch==1.8.0+cu111 → same issue
  • torch==1.8.0 (CUDA 10.2) → same issue
  • upgrading CUDA driver on the VM to 11.1* → same issue
  • upgrading CUDA driver on the VM to 11.2* → same issue
  • disabling the cuDNN backend and increasing the batch size as much as possible (to 48), as suggested in GitHub issue 47908 (The speed of pytorch with cudatoolkit 11.0 is slower than cudatoolkit 10.2) → this led to a minor speed improvement on the first iterations (around 2 batches/sec), but after a few iterations it went back to the current slow speed

* I’m not entirely sure if this has any impact at all, since Pytorch ships with its own CUDA + cuDNN compiled in, right? Even more, 1 month ago I was not even installing CUDA on the VM, only consuming the nvidia drivers that came through the node.
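
To double-check that assumption, something along these lines compares the driver the node exposes with the CUDA version the wheel was built against (nvidia-smi has to be visible inside the pod for this to work):

import subprocess
import torch

# The pip wheel bundles its own CUDA runtime + cuDNN; the node only needs to
# expose a driver new enough for that runtime (NVIDIA documents the minimum
# driver per CUDA version).
driver = subprocess.check_output(
    ["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"]
).decode().strip()
print("node driver:", driver)
print("wheel built with CUDA:", torch.version.cuda)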

Things I can’t do for now:

  • Install CUDA 10.2 on the VM
  • Downgrade CUDA on the GKE node server

My main question here is: could the update of the CUDA version on the node where the VM is created cause an issue like this?

I’m not sure what type of logs or screenshots I should share to provide more info, so please feel free to request anything!

Thanks in advance :slight_smile:

I’m interested in trying to reproduce this on 10.2/11.x. Do you have a minimal example handy (e.g., just the model with random data) that demonstrates the issue?

The driver itself probably wouldn’t be responsible for the issue. But the switch from a CUDA 10.2 torch build to an 11.x one very well might be the cause.

Yes, sure! I’ll be able to share something on Monday.
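
In the meantime, here is a rough sketch of the kind of standalone benchmark I have in mind (not my actual training script; it assumes the transformers library, the distilbert-base-multilingual-cased checkpoint and random token IDs):

import time
import torch
from transformers import AutoModelForSequenceClassification

device = torch.device("cuda")
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-multilingual-cased", num_labels=2
).to(device)
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)

# Random data with the same batch size (16) as the real job
batch_size, seq_len = 16, 128
input_ids = torch.randint(0, model.config.vocab_size, (batch_size, seq_len), device=device)
attention_mask = torch.ones_like(input_ids)
labels = torch.randint(0, 2, (batch_size,), device=device)

model.train()
torch.cuda.synchronize()
start = time.time()
steps = 50
for _ in range(steps):
    optimizer.zero_grad()
    out = model(input_ids=input_ids, attention_mask=attention_mask, labels=labels)
    out.loss.backward()
    optimizer.step()
torch.cuda.synchronize()
print(f"{steps / (time.time() - start):.2f} batches/sec")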

I’m starting to believe my code (which was actually developed 1.5 years ago) might not be optimal for the current version of Pytorch. I might be accumulating something somewhere that, with the latest updates, causes slowdowns, I don’t know.

A BIG update though: with torch==1.9.0+cu111 on the server with CUDA 11.0, if I set the following:

torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = True # in the past this was False
torch.backends.cudnn.enabled = False # does it make sense to disable after setting benchmark True?

The speed is normal again :thinking:
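
To figure out which of those three flags actually matters, one option is to time a few steps under each combination, roughly like this (my_step is a placeholder for one forward/backward/optimizer step, e.g. the loop body of a repro script):

import itertools
import time
import torch

def time_step(step_fn, warmup=5, iters=20):
    # step_fn runs one forward + backward + optimizer step
    for _ in range(warmup):
        step_fn()
    torch.cuda.synchronize()
    t0 = time.time()
    for _ in range(iters):
        step_fn()
    torch.cuda.synchronize()
    return iters / (time.time() - t0)

# With enabled=False the benchmark/deterministic flags should be irrelevant,
# so toggling enabled and benchmark separately shows which switch matters
for enabled, benchmark in itertools.product([True, False], repeat=2):
    torch.backends.cudnn.enabled = enabled
    torch.backends.cudnn.benchmark = benchmark
    # print(enabled, benchmark, time_step(my_step))  # my_step: hypothetical step function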

I’m not sure the settings make sense, especially cudnn.enabled = False: with cuDNN disabled entirely, the deterministic and benchmark flags shouldn’t matter at all. That being said, I’m not even sure what fraction of operations in a transformer like DistilMBERT would be dispatched to cuDNN, which should really only be doing the heavy lifting for convolution-heavy models.
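
One way to settle that would be to profile a few steps with torch.profiler and look at which CUDA kernels dominate. A rough sketch, where run_one_step() stands in for one forward/backward pass of your loop:

import torch
from torch.profiler import profile, ProfilerActivity

def run_one_step():
    pass  # placeholder for one forward/backward pass of the training loop

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    for _ in range(5):
        run_one_step()

# If cuDNN were the culprit you'd expect cudnn-backed kernels near the top;
# a transformer like DistilMBERT should mostly be matmuls (cuBLAS) instead
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=20))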