GPU vs. CPU Performance: Unusual Attention Head Effects

Hi there!

I ran a small experiment to see how much a GPU actually speeds up training of a transformer-based language model compared to a CPU, and how that speedup changes as I vary the number of attention heads.

Here’s what I found:
As the attention head count increases from 1 to 128 (with n_embed fixed at 128), the GPU speedup relative to the CPU drops from roughly 12x down to 1.23x.
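
For context, here's a minimal sketch of the kind of timing loop I mean (this is not the exact notebook code; the batch size, sequence length, iteration counts, and the single-attention-layer setup are placeholder choices for illustration):

```python
# Minimal sketch: time one training step of a single multi-head attention layer
# on CPU vs GPU while sweeping the head count, with n_embed fixed at 128.
# batch size, seq_len, and iters are placeholder values, not the notebook's.
import time
import torch
import torch.nn as nn

def time_step(device, n_heads, n_embed=128, seq_len=256, batch=32, iters=20):
    attn = nn.MultiheadAttention(n_embed, n_heads, batch_first=True).to(device)
    opt = torch.optim.SGD(attn.parameters(), lr=1e-3)
    x = torch.randn(batch, seq_len, n_embed, device=device)

    # Warm-up so CUDA kernel launch / allocator overhead isn't timed.
    for _ in range(3):
        out, _ = attn(x, x, x)
        out.sum().backward()
        opt.step()
        opt.zero_grad()
    if device.type == "cuda":
        torch.cuda.synchronize()

    start = time.perf_counter()
    for _ in range(iters):
        out, _ = attn(x, x, x)
        out.sum().backward()
        opt.step()
        opt.zero_grad()
    if device.type == "cuda":
        torch.cuda.synchronize()
    return (time.perf_counter() - start) / iters

cpu = torch.device("cpu")
gpu = torch.device("cuda") if torch.cuda.is_available() else None
for h in [1, 2, 4, 8, 16, 32, 64, 128]:
    t_cpu = time_step(cpu, h)
    if gpu is not None:
        t_gpu = time_step(gpu, h)
        print(f"heads={h:3d}  cpu={t_cpu*1e3:7.1f} ms  "
              f"gpu={t_gpu*1e3:7.1f} ms  speedup={t_cpu/t_gpu:5.2f}x")
```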

Ideally my brain expects something like a linear trend, which clearly isn't the case, presumably because of underlying details I don't understand. But seeing the GPU gain go from 12x down to 1.23x, I'm CONFUSED!

I'm also curious to find a model configuration where the GPU gain is close to 1x relative to the CPU, i.e. where the two perform equally (which doesn't make much sense to me).
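
In case it's useful, this is roughly how I'd go hunting for that break-even configuration, reusing the hypothetical time_step() helper from the sketch above (the batch sizes and sequence lengths below are placeholder guesses, not tuned values):

```python
# Hypothetical search for the break-even point: shrink the workload and watch
# where the GPU speedup approaches ~1x. Reuses time_step(), cpu, and gpu from
# the sketch above.
if gpu is not None:
    for batch, seq_len in [(1, 32), (1, 64), (4, 64), (8, 128), (32, 256)]:
        t_cpu = time_step(cpu, n_heads=128, seq_len=seq_len, batch=batch)
        t_gpu = time_step(gpu, n_heads=128, seq_len=seq_len, batch=batch)
        print(f"batch={batch:3d} seq={seq_len:4d}  speedup={t_cpu / t_gpu:5.2f}x")
```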

Here's a link to the Google Colab notebook: Google Colab

If anyone would like to share their thinking on what's behind this trend, I'd really appreciate it.
Thanks!