Fine-tuning GPT-2 on multiple GPUs and still not enough memory

You would need to check the stack trace to narrow down which layer and operation are failing.
Use blocking launches via CUDA_LAUNCH_BLOCKING=1 to make sure the stack trace points to the failing layer.
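For example, either export the variable in your terminal (`CUDA_LAUNCH_BLOCKING=1 python train.py`, where train.py stands in for your script) or set it at the very top of the script (a minimal sketch):

```python
import os

# Must be set before CUDA is initialized, so put it at the very top of the
# script, before importing torch. With blocking launches every kernel runs
# synchronously, so the stack trace points to the op that actually failed.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

import torch  # import after setting the env variable
```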
My guess is that an embedding layer is failing due to an invalid (out-of-bounds) input index.
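If that's the case, the failure would look roughly like this (a minimal sketch with placeholder sizes, not taken from your setup):

```python
import torch
import torch.nn as nn

# nn.Embedding accepts indices in [0, num_embeddings); anything outside
# that range triggers a device-side assert on the GPU.
vocab_size = 50257  # GPT-2's vocabulary size
emb = nn.Embedding(vocab_size, 768).to("cuda")

valid = torch.randint(0, vocab_size, (2, 8), device="cuda")
out = emb(valid)  # works

invalid = torch.full((2, 8), vocab_size, device="cuda")  # out of bounds
out = emb(invalid)  # raises "CUDA error: device-side assert triggered"
```

You could check the inputs via `print(input_ids.min(), input_ids.max())` (input_ids being your token tensor) before the forward pass to see if any index is out of bounds for the embedding.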

Your code is posted, but it's not executable, so it doesn't help in debugging.