This would be expected depending on the input activation size, as the intermediate forward activations could use the majority of the memory. You could check this post for an estimate of the memory usage of a ResNet architecture.
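To get a feeling for the ratio, here is a rough back-of-the-envelope sketch in plain Python. The layer list, batch size, and the `estimate_memory` helper are made up for illustration (this is not a real ResNet), and it assumes float32 tensors and only counts the conv outputs and conv parameters. Gradients, optimizer states, normalization/activation outputs, and any library workspace are ignored, so treat the numbers as a lower bound.

```python
BYTES_PER_FLOAT32 = 4


def conv_output_hw(h, w, kernel, stride, padding):
    """Spatial output size of a 2D convolution (dilation=1)."""
    out_h = (h + 2 * padding - kernel) // stride + 1
    out_w = (w + 2 * padding - kernel) // stride + 1
    return out_h, out_w


def estimate_memory(batch, in_shape, layers):
    """Return (param_bytes, activation_bytes) for a plain stack of conv layers.

    in_shape: (channels, height, width) of a single sample.
    layers:   list of (out_channels, kernel, stride, padding) tuples.
    """
    in_c, h, w = in_shape
    param_elems = 0
    act_elems = 0
    for out_c, kernel, stride, padding in layers:
        # weight + bias of this conv layer
        param_elems += in_c * out_c * kernel * kernel + out_c
        # output activation kept around for the backward pass
        h, w = conv_output_hw(h, w, kernel, stride, padding)
        act_elems += batch * out_c * h * w
        in_c = out_c
    return param_elems * BYTES_PER_FLOAT32, act_elems * BYTES_PER_FLOAT32


# Hypothetical conv stack, loosely shaped like the first stages of a CNN.
layers = [(64, 7, 2, 3), (64, 3, 1, 1), (128, 3, 2, 1), (256, 3, 2, 1)]
param_bytes, act_bytes = estimate_memory(batch=64, in_shape=(3, 224, 224), layers=layers)

print(f"parameters : {param_bytes / 1024**2:.2f} MiB")
print(f"activations: {act_bytes / 1024**2:.2f} MiB")
```

Even in this small toy example the stored activations are a few hundred times larger than the parameters, and they scale linearly with the batch size as well as with the spatial input size, while the parameter memory stays constant.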