I am training a rather simple deep learning network (~800 KB):
- LSTM with 2 layers and 128 neurons
- Feed-forward with 2 linear layers and 64 neurons
- Final feed-forward with 2 linear layers and 64 neurons
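Roughly, the model looks like this (a simplified sketch, not my exact code; the class name, activations, and output size are placeholders):

```python
import torch
import torch.nn as nn

class Net(nn.Module):
    def __init__(self, n_features, out_dim=1):   # out_dim is a placeholder
        super().__init__()
        self.lstm = nn.LSTM(input_size=n_features, hidden_size=128,
                            num_layers=2, batch_first=True)
        self.ff = nn.Sequential(
            nn.Linear(128, 64), nn.ReLU(),
            nn.Linear(64, 64), nn.ReLU(),
        )
        self.head = nn.Sequential(
            nn.Linear(64, 64), nn.ReLU(),
            nn.Linear(64, out_dim),
        )

    def forward(self, x):            # x: (batch, seq_len, n_features)
        out, _ = self.lstm(x)
        out = out[:, -1, :]          # take the last time step
        return self.head(self.ff(out))
```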
However, I am seeing low GPU utilization, around 45% and ~140 W, according to nvidia-smi on my RTX 6000.
I am not sure why, since there should not be a data bottleneck: all of the data is loaded into memory and the tensors are sent straight to the GPU, so the GPU does the shuffling and slicing into batches (I previously did this on the CPU, but the CPU was maxed out and became the bottleneck).
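The batching is done roughly like this (simplified sketch; shapes and names are placeholders, not my exact code):

```python
import torch

device = torch.device("cuda")
batch_size = 1024
num_epochs = 10                       # placeholder

# Placeholder shapes; the real data is ~2M inputs with ~1000 features each,
# loaded once and kept resident on the GPU.
seq_len, n_features = 10, 100
features = torch.randn(100_000, seq_len, n_features, device=device)
targets = torch.randn(100_000, 1, device=device)

for epoch in range(num_epochs):
    perm = torch.randperm(features.size(0), device=device)   # shuffle on the GPU
    for start in range(0, features.size(0), batch_size):
        idx = perm[start:start + batch_size]
        xb, yb = features[idx], targets[idx]                  # slice on the GPU
        # forward/backward step goes here
```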
With 2 million inputs (~1000 features each), this uses about 18 GB of the 24 GB of GPU memory on my RTX 6000.
I am using a batch size of 1024 and each epoch takes about 60 seconds. Epoch time seems to scale inversely with batch size: 512 took about 120 seconds and 2048 took about 33 seconds.
My forward/backward pass is the standard PyTorch implementation.
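Concretely, the step inside the batch loop is just the usual pattern (sketch; the loss function and learning rate are placeholders, `Net` and `device` come from the sketches above):

```python
import torch
import torch.nn as nn

model = Net(n_features).to(device)          # Net/n_features from the sketches above
criterion = nn.MSELoss()                    # placeholder loss
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# inside the batch loop:
optimizer.zero_grad()
pred = model(xb)
loss = criterion(pred, yb)
loss.backward()
optimizer.step()
```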
When I added torch.cuda.amp.autocast/GradScaler and torch.backends.cudnn.benchmark = True (see the sketch below), my utilization and wattage went down, but the time per epoch stayed the same.
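This is roughly how I added them (sketch of the modified step; everything else is unchanged):

```python
import torch

torch.backends.cudnn.benchmark = True
scaler = torch.cuda.amp.GradScaler()

# inside the batch loop, replacing the plain step above:
optimizer.zero_grad()
with torch.cuda.amp.autocast():
    pred = model(xb)
    loss = criterion(pred, yb)
scaler.scale(loss).backward()
scaler.step(optimizer)
scaler.update()
```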
I am wondering if you can help me troubleshoot, as I need to train this model as fast as possible. Down the line I will also run into the issue of having too much data to fit on the GPU (it will still fit in RAM), so any suggestions for an efficient solution there would also be appreciated. I previously used a Dataset/DataLoader, but it was really slow, and even slower with pin_memory=True and num_workers > 0.
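For reference, the earlier DataLoader setup was roughly this (sketch; the worker count is just an example of what I tried, and the tensors are placeholders for the real CPU data):

```python
import torch
from torch.utils.data import TensorDataset, DataLoader

features_cpu = torch.randn(100_000, 10, 100)   # placeholder CPU tensors
targets_cpu = torch.randn(100_000, 1)

dataset = TensorDataset(features_cpu, targets_cpu)
loader = DataLoader(dataset, batch_size=1024, shuffle=True,
                    pin_memory=True, num_workers=4)

for xb, yb in loader:
    xb = xb.to("cuda", non_blocking=True)
    yb = yb.to("cuda", non_blocking=True)
    # training step as above
```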
Thanks! Let me know if any further information is required.