Hi,
I’m a happy owner of a 4090, and I’m about to train a simple classification network on 2 million 224x224 images (vision-only). The model is fairly large, and I want to speed up my training setup.
I’ve been in deep learning for a while, but I still haven’t found a comprehensive guide on how to train models really fast, so here is what I know:
- use fp16 (GradScaler): roughly 2x reduction in memory consumption and some speed-up, depending on the architecture
- how do I use tensor cores? I have no idea if they’re being used at all, how to enable them, or how to monitor them. Are they enabled automatically when I use fp16?
- I’ve also tried torch.compile; it brought a ~10% speed improvement. As I understand it, PyTorch doesn’t support flash attention on cards other than the A100/H100 (is that actually true, though?)
That’s it, really. I also have some assumptions about what might work,
and here I’d like you to share your personal experience:
- Is there any benefit to using bfloat16 instead of float16?
- Have you tried using memory_format=torch.channels_last?
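If it helps, this is what I’d be trying for those two points, as far as I understand the API (a sketch with the same toy model, not my real setup):

```python
import torch
import torch.nn as nn

device = "cuda" if torch.cuda.is_available() else "cpu"
model = nn.Sequential(
    nn.Conv2d(3, 8, 3), nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(8, 10)
).to(device)

# channels_last: convert both the model and the inputs, so convolutions
# can hit the faster NHWC kernels under mixed precision.
model = model.to(memory_format=torch.channels_last)
x = torch.randn(2, 3, 224, 224, device=device).to(
    memory_format=torch.channels_last
)

# bfloat16 has the same exponent range as fp32, so (unlike float16)
# no GradScaler should be needed.
with torch.autocast(device_type=device, dtype=torch.bfloat16):
    out = model(x)

print(out.shape)
```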
And maybe you know some other tips and tricks for speeding up training?
Worth mentioning: I don’t have any CPU bottleneck, that’s for sure (I’m using a Ryzen 9 5950X with a Samsung 980 Pro 2TB NVMe).
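The way I convinced myself of that is a rough check (hypothetical tensor dataset standing in for my real image folder): time draining the real DataLoader against the same loop fed a cached batch, and compare.

```python
import time
import torch
from torch.utils.data import DataLoader, TensorDataset

# Hypothetical stand-in dataset; in my case it's 2M JPEGs on the NVMe.
ds = TensorDataset(torch.randn(64, 3, 32, 32), torch.randint(0, 10, (64,)))
loader = DataLoader(ds, batch_size=16, num_workers=0)

def drain(batches):
    """Iterate over batches doing no work, so only loading cost is timed."""
    t0 = time.perf_counter()
    for x, y in batches:
        pass  # the real loop would run forward/backward here
    return time.perf_counter() - t0

real = drain(loader)
cached = drain([next(iter(loader))] * len(loader))  # same count, zero I/O
print(f"loader: {real:.4f}s, cached: {cached:.4f}s")
```

If the two numbers are close, the input pipeline isn’t the limiting factor and the GPU is.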