The actual memory usage will depend on your setup.
For example, the size of the CUDA context varies across GPU architectures and CUDA runtimes. It also varies depending on whether CUDA's lazy module loading is enabled. We've enabled it by default starting with the PyTorch binaries shipping with CUDA >= 11.7. With lazy loading, a small context is created at init time and the device kernel code is loaded into the context lazily, the first time each kernel is called. If your workflow uses dynamic shapes, the context size could thus grow as new kernels are loaded.
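To get a rough idea of the context overhead on your own machine, something like the following could work. It is only a sketch: `non_allocator_bytes` and `measure_context_overhead` are helper names I made up for illustration, and the result will also include memory used by any other process on the same GPU, so treat it as an upper bound rather than an exact context size.

```python
import os

# CUDA_MODULE_LOADING must be set before CUDA is initialized.
# "LAZY" loads kernels on first use, "EAGER" loads them up front.
os.environ.setdefault("CUDA_MODULE_LOADING", "LAZY")


def non_allocator_bytes(free_bytes, total_bytes, reserved_bytes):
    """Device memory in use that PyTorch's caching allocator does not track.

    Hypothetical helper: (total - free) is everything in use on the device,
    and subtracting the allocator's reserved memory leaves the CUDA context
    plus anything held by other processes.
    """
    return (total_bytes - free_bytes) - reserved_bytes


def measure_context_overhead(device=0):
    """Return the approximate non-allocator memory in bytes, or None without a GPU."""
    try:
        import torch  # imported here so the env var above is already set
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    # A tiny allocation forces the CUDA context to be created.
    torch.ones(1, device=f"cuda:{device}")
    torch.cuda.synchronize(device)
    free, total = torch.cuda.mem_get_info(device)
    reserved = torch.cuda.memory_reserved(device)
    return non_allocator_bytes(free, total, reserved)


if __name__ == "__main__":
    overhead = measure_context_overhead()
    if overhead is None:
        print("No CUDA device available")
    else:
        print(f"Approx. non-allocator memory: {overhead / 1024**2:.1f} MiB")
```

Running this once right after startup and again after a few forward/backward passes should show the growth from lazily loaded kernels, assuming nothing else is using the device.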
Also, depending on your model, you might use torch.backends.cudnn.benchmark = True, which profiles the available kernels for your current use case and selects the fastest one whose workspace fits into your device memory.
As you can see, a lot of these factors depend on your actual setup. While a theoretical memory usage can be calculated from the number of parameters and the intermediate activations (this post gives you an example), you should add an expected overhead for the aforementioned points.
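A back-of-the-envelope version of that calculation might look like the sketch below. The function name, its parameters, and the defaults are my own assumptions for illustration: float32 values (4 bytes each), one gradient per parameter, two optimizer states per parameter (as an Adam-style optimizer would keep), and an `overhead_bytes` value you have to guess or measure yourself for the context and cuDNN workspace.

```python
def estimate_training_memory_bytes(
    num_params,
    activation_elems,
    bytes_per_elem=4,              # assumption: float32
    optimizer_states_per_param=2,  # assumption: Adam-style (two states per parameter)
    overhead_bytes=0,              # assumption: CUDA context, cuDNN workspace, etc.
):
    """Rough theoretical training memory; real usage will differ per setup."""
    weights = num_params * bytes_per_elem
    grads = num_params * bytes_per_elem
    optimizer = num_params * optimizer_states_per_param * bytes_per_elem
    activations = activation_elems * bytes_per_elem
    return weights + grads + optimizer + activations + overhead_bytes


if __name__ == "__main__":
    # Hypothetical model: 1M parameters, 500k stored activation values,
    # plus a guessed 500 MiB of setup-dependent overhead.
    total = estimate_training_memory_bytes(
        1_000_000, 500_000, overhead_bytes=500 * 1024**2
    )
    print(f"{total / 1024**2:.1f} MiB")
```

The number this gives is a lower-bound style estimate; allocator fragmentation and temporary buffers are not modeled at all.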