Optimal matrix sizes for utilizing tensor cores when running Transformer-decoder type models?

What are the optimal matrix sizes for utilizing Tensor Cores when running transformer-decoder type models (e.g. Mistral7B)? Should both the sequence length and batch size be a multiple of 8, 64, or 256? The vocab size is 32000, which is already divisible by 256.

Is there a reference for these recommendations? I’m looking specifically for the Ada Lovelace architecture (we are running on A6000 Ada Lovelace cards), but I also need the values for the H100 (Hopper architecture).

Enabling Tensor Cores for float32 matmuls in PyTorch requires this to be set:

torch.set_float32_matmul_precision("high")

but I can’t tell if the Tensor Cores are actually being used. Is there a way of checking, e.g. some sort of debugging/logging that we can enable?
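
For context, this is roughly the kind of check I have in mind (a minimal sketch using torch.profiler; I’m assuming the CUDA kernel names are the thing to inspect, and that the exact names depend on the cuBLAS/CUTLASS version):

import torch
from torch.profiler import profile, ProfilerActivity

torch.set_float32_matmul_precision("high")  # allow TF32 Tensor Core matmuls for float32

a = torch.randn(4096, 4096, device="cuda")
b = torch.randn(4096, 4096, device="cuda")

with profile(activities=[ProfilerActivity.CUDA]) as prof:
    c = a @ b
    torch.cuda.synchronize()

# List the CUDA kernels that actually ran; Tensor Core GEMMs usually show up
# with names containing "gemm" plus an opcode/tile tag (e.g. "tf32", "16816"),
# but I'm not sure what to expect on Ada/Hopper.
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))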

Thanks!

Are all of these points generated by ChatGPT or did you verify them? E.g. I don’t understand how blocking launches would help here and nvprof was deprecated a long time ago.


More info here:

Tensor Cores can be used for... 

cuBLAS version ≥ 11.0
cuDNN version ≥ 7.6.3

Data type    Tensor Core usage
INT8         Always, but most efficient with multiples of 16; on A100, multiples of 128.
FP16         Always, but most efficient with multiples of 8; on A100, multiples of 64.
TF32         Always, but most efficient with multiples of 4; on A100, multiples of 32.
FP64         Always, but most efficient with multiples of 2; on A100, multiples of 16.

Presumably bfloat16 would match up with the FP16 recommendations here (?)

I wish there were more detail behind these recommendations; where do the Ada Lovelace chips fit into this?

I probably just need to learn how to use the PyTorch profiler and work this out via trial and error…
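
Something like this might already answer it empirically (a sketch using torch.utils.benchmark; the shapes 4095 vs. 4096 are just an illustration, not anything specific to Mistral):

import torch
from torch.utils import benchmark

def time_matmul(n, dtype=torch.float16):
    # Time an n x n GEMM; Tensor Cores should kick in automatically for fp16/bf16.
    a = torch.randn(n, n, device="cuda", dtype=dtype)
    b = torch.randn(n, n, device="cuda", dtype=dtype)
    t = benchmark.Timer(stmt="a @ b", globals={"a": a, "b": b})
    return t.timeit(100).median

# Compare an "unaligned" size against the next multiple of 8/64.
for n in (4095, 4096):
    print(n, time_matmul(n))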

And of course there is this famous example (nanoGPT got a ~25% speedup just from padding the vocab size from 50257 up to 50304, the nearest multiple of 64):

https://twitter.com/karpathy/status/1621578354024677377
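
The padding itself is trivial; a quick check of the arithmetic (round_up is just a hypothetical helper here):

def round_up(x, multiple):
    # Smallest multiple of `multiple` that is >= x.
    return ((x + multiple - 1) // multiple) * multiple

print(round_up(50257, 64))  # 50304 -- the padded vocab size from the tweet
print(round_up(32000, 64))  # 32000 -- Mistral's vocab is already a multiple of 64 (and 256)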