Extreme single thread cpu kernel usage while training on GPU

That’s an interesting observation!
Based on your comment:

> When using the autocast (mixed precision) context manager, we can see that the CPU usage is mainly concentrated in a single core. When setting `use_autocast=False` however, the CPU usage is more spread across several cores.

it seems you are using autocast on the CPU and are then observing this behavior?
Also, it seems you are “mainly” seeing single-core usage, but the other cores are still being used?
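For reference, here is a minimal sketch of what I'd expect the autocast path to look like in a GPU training loop (the model, optimizer, and data below are placeholders, and `use_autocast` is just the flag from your comment). The `device_type` argument decides whether the CPU or CUDA autocast path is taken:

```python
import torch
from torch import nn

# Placeholder setup for illustration only
device = "cuda" if torch.cuda.is_available() else "cpu"
model = nn.Linear(128, 10).to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
# GradScaler is a no-op passthrough when disabled
scaler = torch.cuda.amp.GradScaler(enabled=(device == "cuda"))

use_autocast = True  # the flag referenced in the quoted comment

x = torch.randn(64, 128, device=device)
target = torch.randint(0, 10, (64,), device=device)

optimizer.zero_grad()
# device_type selects the autocast backend; passing "cpu" here while
# the model actually runs on the GPU would be unexpected.
with torch.autocast(device_type=device, enabled=use_autocast):
    loss = nn.functional.cross_entropy(model(x), target)
scaler.scale(loss).backward()
scaler.step(optimizer)
scaler.update()
```

If your code happens to pass `device_type="cpu"` while the model itself runs on the GPU, that could be related to the utilization pattern you are seeing.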