Hello, I’m trying to make sure my PyTorch code is optimized for training runtime and memory as much as possible, but I’m not sure what lower-level things I should be looking out for. I’ve run a benchmark with resnet152 on 224x224 images from a custom image dataset mapping to 33 classes (all one-hot) on an AWS Tesla K80 (p2 instance), and I’m noticing a few things:
- I can’t get above a batch size of 50 without a CUDA out-of-memory error
- at a batch size of 50, I average 18.3 images/second during training
Do these numbers sound reasonable? Let me know if posting my training loop would help. I’ve also tried things like torch.cuda.synchronize() and cuDNN benchmark mode (torch.backends.cudnn.benchmark = True), but neither seems to make a difference.
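In case it matters, here is roughly how I'm measuring throughput. This is a minimal sketch, not my actual loop: the tiny conv+linear model is a stand-in for resnet152, the data is random, and batch_size/steps are placeholders, but the timing pattern (warm-up step, then torch.cuda.synchronize() before starting and stopping the clock, since CUDA kernels launch asynchronously) is the part I'm asking about:

```python
import time
import torch
import torch.nn as nn

torch.backends.cudnn.benchmark = True  # let cuDNN autotune conv algorithms

device = "cuda" if torch.cuda.is_available() else "cpu"

# Stand-in for resnet152: one conv + linear head mapping to 33 classes.
model = nn.Sequential(
    nn.Conv2d(3, 8, kernel_size=3, stride=2),
    nn.Flatten(),
    nn.LazyLinear(33),
).to(device)
criterion = nn.CrossEntropyLoss()
opt = torch.optim.SGD(model.parameters(), lr=0.01)

batch_size = 50  # placeholder matching the benchmark above
images = torch.randn(batch_size, 3, 224, 224, device=device)
labels = torch.randint(0, 33, (batch_size,), device=device)

# One warm-up step so cuDNN's algorithm search isn't included in the timing.
opt.zero_grad()
criterion(model(images), labels).backward()
opt.step()

if device == "cuda":
    torch.cuda.synchronize()  # drain queued kernels before starting the clock
start = time.perf_counter()
steps = 3
for _ in range(steps):
    opt.zero_grad()
    loss = criterion(model(images), labels)
    loss.backward()
    opt.step()
if device == "cuda":
    torch.cuda.synchronize()  # kernels are async; sync before stopping the clock
elapsed = time.perf_counter() - start

print(f"{steps * batch_size / elapsed:.1f} images/sec")
```

On real hardware I run many more steps and use fixed-size batches from my DataLoader, but the structure is the same.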