After more than two days of fighting a weird decrease in training speed, I'm back at square one and I declare myself officially lost!
On June 15th I last ran a training script for a DistilMBERT fine-tune. Back then, speed was ~6 batches/sec (batch size 16); this week, with the same code, same dataset and same hyperparameters, a single batch takes 2 full seconds, which makes training impractical at that speed.
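For reference, this is roughly how I measure throughput (a minimal sketch; `dummy_step` is a stand-in for my real forward/backward pass):

```python
import time

def measure_throughput(step_fn, n_batches=50):
    """Time n_batches calls to step_fn and return batches/sec."""
    start = time.perf_counter()
    for _ in range(n_batches):
        step_fn()
    elapsed = time.perf_counter() - start
    return n_batches / elapsed

def dummy_step():
    # Placeholder for the real training step.
    time.sleep(0.01)

print(f"{measure_throughput(dummy_step):.1f} batches/sec")
```

In the real script I call `torch.cuda.synchronize()` before reading the clock, since CUDA kernel launches are asynchronous and wall-clock timing is misleading otherwise.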
Since it might be relevant to the problem, here is my current setup: on GCP we run a GKE node with a Tesla V100 (link), on which we install the nvidia driver (version: 450.51.06) using this daemonset, and afterwards we create pods on that node.
In other words, we have a server with the GPU card and we create VMs on it.
So, after comparing as much as I could, this is what changed in the last month:
- New Pytorch version: 1.9.0 got released
- The CUDA driver on the GKE node server went from 10.2 to 11.0
- Not sure if the nvidia drivers changed, but it is highly unlikely
Torch was installed without pinning any version, so I'm guessing that a month ago it installed 1.8.0 compiled with CUDA 10.2.
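To confirm what pip actually resolved, I now check the build info like this (nothing project-specific, just standard torch attributes):

```python
import torch

print(torch.__version__)               # wheel version, e.g. "1.9.0+cu102"
print(torch.version.cuda)              # CUDA version the wheel was built against
print(torch.backends.cudnn.version())  # bundled cuDNN version (None on CPU-only builds)
print(torch.cuda.is_available())       # whether the driver/GPU is visible at all
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))
```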
Things I have tried so far:
- torch==1.9.0+cu111 → same issue
- torch==1.8.0+cu111 → same issue
- torch==1.8.0 (CUDA 10.2) → same issue
- upgrading CUDA driver on the VM to 11.1* → same issue
- upgrading CUDA driver on the VM to 11.2* → same issue
- disabling the cudnn backend and increasing batch size as much as possible (to 48) as suggested in git issue 47908 (The speed of pytorch with cudatoolkit 11.0 is slower than cudatoolkit 10.2) → this led to a minor speed improvement in the first iterations (around 2 batches/sec), but after a few iterations it dropped back to the current slow speed
* I’m not entirely sure whether this has any impact at all, since PyTorch ships with its own compiled CUDA + cuDNN, right? What's more, a month ago I wasn't even installing CUDA on the VM, only consuming the nvidia drivers that came through the node.
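For completeness, this is what I mean by "disabling the cudnn backend" (plus the opposite knob, cuDNN autotuning, which I understand can help when input shapes are fixed; both are standard `torch.backends.cudnn` flags):

```python
import torch

# What I tried: turn off the cuDNN backend entirely.
torch.backends.cudnn.enabled = False

# The other direction may also be worth a try: keep cuDNN on and
# let it autotune convolution algorithms for fixed input shapes.
torch.backends.cudnn.enabled = True
torch.backends.cudnn.benchmark = True
```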
Things I can’t do for now:
- Install CUDA 10.2 on the VM
- Downgrade CUDA on the GKE node server
My main question here is: could the update of the CUDA drivers on the node where the VMs are created cause this kind of slowdown?
I’m not sure what kind of logs or screenshots I should share to provide more info, so please feel free to request anything!
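For example, if a profiler trace would help, I can capture one along these lines (a sketch using `torch.profiler`, available since 1.8.1; `train_step` is a placeholder for my real step):

```python
import torch
from torch.profiler import profile, ProfilerActivity

def train_step():
    # Placeholder for the real forward/backward pass.
    x = torch.randn(16, 128, requires_grad=True)
    (x @ x.t()).sum().backward()

activities = [ProfilerActivity.CPU]
if torch.cuda.is_available():
    activities.append(ProfilerActivity.CUDA)

with profile(activities=activities) as prof:
    for _ in range(5):
        train_step()

print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=10))
```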
Thanks in advance