Docker Image Performance Drop between 23.07 and 23.09

Hey there,

We’re seeing a large performance drop when going from:

nvcr.io/nvidia/pytorch:23.07-py3

to

nvcr.io/nvidia/pytorch:23.08-py3

We assumed it would be addressed in the September release (it’s the summer, after all), but we’re seeing the same drop with this image:

nvcr.io/nvidia/pytorch:23.09-py3

In particular, when we train Llama2 with the 23.07 image (CUDA 12.1) we see roughly 3,500 tokens per second during training, while with the 23.08 and 23.09 images (CUDA 12.2) we see about 2,800 tokens per second. We are training on 3x NVIDIA A6000 Ada Lovelace cards.

The only thing we install on top of the base Docker image is the FlashAttention package from Tri Dao’s repo.
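
For reference, this is roughly the version dump we run in each container to see what actually changed between tags (a minimal sketch; we’re assuming `flash_attn` exposes `__version__`):

```python
# Quick sanity check run inside each container tag to compare stacks.
import torch
import flash_attn  # Tri Dao's package, installed on top of the base image

print("torch       :", torch.__version__)
print("CUDA runtime:", torch.version.cuda)
print("cuDNN       :", torch.backends.cudnn.version())
print("flash_attn  :", flash_attn.__version__)
print("device      :", torch.cuda.get_device_name(0))
```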

What’s the best way for us to fix this?

Is there a standard benchmark built in the container that we could run to give us a better comparison?
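
If there isn’t one, would a crude self-contained check like the one below be a reasonable stand-in? (A rough sketch we put together to rule out clock/driver differences; the matrix size and iteration count are arbitrary, not anything from our training run.)

```python
# Crude GEMM throughput check to compare containers; sizes are arbitrary.
import torch

n, iters = 8192, 50
a = torch.randn(n, n, device="cuda", dtype=torch.float16)
b = torch.randn(n, n, device="cuda", dtype=torch.float16)

for _ in range(5):  # warm-up so startup cost doesn't skew the timing
    a @ b
torch.cuda.synchronize()

start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)
start.record()
for _ in range(iters):
    a @ b
end.record()
torch.cuda.synchronize()

seconds = start.elapsed_time(end) / 1000
print(f"{2 * n**3 * iters / seconds / 1e12:.1f} TFLOP/s")  # 2*n^3 FLOPs per matmul
```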

Thanks!

Could you post a minimal and executable code snippet reproducing the regression?
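
Even a bare timing loop around the FlashAttention call would help narrow it down, e.g. a sketch along these lines (shapes, dtype, and iteration counts are placeholders; adjust them to match your actual training setup):

```python
# Placeholder repro: time a bare flash_attn_func forward+backward in each container.
import torch
from flash_attn import flash_attn_func

B, S, H, D = 2, 4096, 32, 128  # batch, seq len, heads, head dim (assumed values)
q, k, v = (torch.randn(B, S, H, D, device="cuda", dtype=torch.bfloat16, requires_grad=True)
           for _ in range(3))

def step():
    out = flash_attn_func(q, k, v, causal=True)
    out.sum().backward()

for _ in range(10):  # warm-up
    step()
torch.cuda.synchronize()

start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)
start.record()
for _ in range(50):
    step()
end.record()
torch.cuda.synchronize()
print(f"{start.elapsed_time(end) / 50:.2f} ms per fwd+bwd")
```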