We’re seeing a large performance drop when going from:
We assumed it would be addressed in September (it’s the summer after all) but we’re seeing the same drop with this image:
In particular, when we train Llama 2 using the 23.07 image (CUDA 12.1) we see approximately 3,500 tokens per second, while with the 23.08 and 23.09 images (CUDA 12.2) we see about 2,800 tokens per second, a drop of roughly 20%. We are training on 3x NVIDIA A6000 Ada Lovelace cards.
The only thing we install on top of the base Docker image is the Flash Attention module from Tri Dao’s repo.
What’s the best way for us to fix this?
Is there a standard benchmark built into the container that we could run to get a more reliable comparison?
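
In the meantime, here is a minimal sketch of how a like-for-like throughput measurement could be taken in each container. It is not our actual training loop: `measure_tokens_per_second`, `relative_drop`, and their parameters are hypothetical names chosen for illustration, and `step_fn` stands in for whatever single training step is being compared.

```python
import time


def measure_tokens_per_second(step_fn, tokens_per_step, warmup_steps=3,
                              timed_steps=10, sync_fn=None):
    """Return the average tokens/second over `timed_steps` calls of `step_fn`.

    step_fn:         hypothetical callable that runs one training step.
    tokens_per_step: tokens processed per step (batch size * sequence length).
    sync_fn:         optional callable that waits for pending GPU work, so
                     that asynchronous kernels are included in the timing.
    """
    if tokens_per_step <= 0 or timed_steps <= 0:
        raise ValueError("tokens_per_step and timed_steps must be positive")

    # Warm-up steps are excluded so one-time costs (allocation, autotuning)
    # do not distort the comparison between images.
    for _ in range(warmup_steps):
        step_fn()
    if sync_fn is not None:
        sync_fn()

    start = time.perf_counter()
    for _ in range(timed_steps):
        step_fn()
    if sync_fn is not None:
        sync_fn()
    elapsed = time.perf_counter() - start

    return tokens_per_step * timed_steps / elapsed


def relative_drop(baseline, current):
    """Fractional slowdown of `current` relative to `baseline`."""
    return (baseline - current) / baseline


if __name__ == "__main__":
    # The figures from this post: 3,500 vs 2,800 tokens per second.
    print(f"drop: {relative_drop(3500, 2800):.0%}")
```

With PyTorch, passing `sync_fn=torch.cuda.synchronize` should make the timing cover the GPU work rather than just the kernel launches. Running the same script with the same batch size and sequence length in the 23.07 and 23.09 containers would at least rule out differences in how we measure.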