What are the optimal matrix sizes for utilizing Tensor Cores when running transformer-decoder-style models (e.g. Mistral 7B)? Should both the sequence length and batch size be a multiple of 8, 64, or 256? The vocab size is 32000, which is already a multiple of 256.
Is there a reference for these values? I’m looking specifically at the Ada Lovelace architecture (we are running on the A6000 Ada Lovelace cards), but I also need the values for the H100 (Hopper architecture).
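To make the question concrete, here is a sketch of the padding I have in mind. `pad_to_multiple` is just an illustrative helper, not a real API, and the 64 comes from the FP16/A100 row of the table quoted further down, which may not be the right value for Ada/Hopper:

```python
def pad_to_multiple(n: int, multiple: int) -> int:
    """Round n up to the nearest multiple (illustrative helper, not a real API)."""
    return ((n + multiple - 1) // multiple) * multiple

# Assuming the FP16 "multiples of 8; on A100, multiples of 64" rule
# also applies here (unverified for Ada/Hopper):
batch_size, seq_len, vocab_size = 3, 1000, 32000
print(pad_to_multiple(batch_size, 64))  # -> 64
print(pad_to_multiple(seq_len, 64))     # -> 1024
print(vocab_size % 256)                 # -> 0, i.e. 32000 is already a multiple of 256
```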
Enabling Tensor Cores for float32 matmuls in PyTorch requires this to be set:

```python
torch.set_float32_matmul_precision("high")
```
but I can’t tell whether the Tensor Cores are actually being used. Is there a way of checking, i.e. some sort of debugging/logging that we can enable?
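The closest I’ve gotten is profiling a matmul with `torch.profiler` and eyeballing the CUDA kernel names. A sketch below; the name substrings to look for are a guess on my part, not something I’ve seen documented:

```python
import torch
from torch.profiler import profile, ProfilerActivity

a = torch.randn(4096, 4096, device="cuda", dtype=torch.float16)
b = torch.randn(4096, 4096, device="cuda", dtype=torch.float16)

with profile(activities=[ProfilerActivity.CUDA]) as prof:
    torch.matmul(a, b)
    torch.cuda.synchronize()

# Inspect the CUDA kernel names that actually ran. My (unverified) assumption:
# Tensor Core GEMM kernels show up with names containing substrings such as
# "s16816gemm", "tensorop", or an arch tag like "sm80"/"sm90".
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))
```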
Were all of these points generated by ChatGPT, or did you verify them? E.g., I don’t understand how blocking launches would help here, and nvprof was deprecated a long time ago.
Tensor Cores can be used (with cuBLAS version ≥ 11.0 and cuDNN version ≥ 7.6.3) as follows:

| Data type | Tensor Cores can be used... |
|---|---|
| INT8 | Always, but most efficient with multiples of 16; on A100, multiples of 128 |
| FP16 | Always, but most efficient with multiples of 8; on A100, multiples of 64 |
| TF32 | Always, but most efficient with multiples of 4; on A100, multiples of 32 |
| FP64 | Always, but most efficient with multiples of 2; on A100, multiples of 16 |
Presumably bfloat16 would match the FP16 recommendations here?

I wish there were more info on these recommendations; where do the Ada Lovelace chips fall?

I probably just need to learn the PyTorch profiler and work this out via trial and error…
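For reference, this is the kind of trial-and-error measurement I mean: a rough benchmark sketch using CUDA events, comparing a dimension that is a multiple of 8 against one that is not. The sizes are arbitrary and nothing here is a verified recommendation:

```python
import torch

def time_matmul(n: int, dtype=torch.float16, iters: int = 50) -> float:
    """Average time (ms) of an n x n matmul, measured with CUDA events."""
    a = torch.randn(n, n, device="cuda", dtype=dtype)
    b = torch.randn(n, n, device="cuda", dtype=dtype)
    for _ in range(5):  # warm-up so cuBLAS heuristics/caches settle
        torch.matmul(a, b)
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    torch.cuda.synchronize()
    start.record()
    for _ in range(iters):
        torch.matmul(a, b)
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / iters

# Compare a Tensor-Core-friendly size against an off-by-one size.
print("4096:", time_matmul(4096), "ms")
print("4095:", time_matmul(4095), "ms")
```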