From testing experience, the first tensor pushed to the GPU takes roughly 700-800 MiB of GPU VRAM. You can then allocate more tensors on the GPU without any change in VRAM usage, until you exceed the space pre-allocated by that first tensor.
x = torch.tensor(1).cuda()             # ~770 MiB allocated
z = torch.zeros((1000)).cuda()         # VRAM not affected
d = torch.zeros((1000000000)).cuda()   # VRAM grows again, by roughly numel * itemsize bytes for the dtype
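For reference, here is a minimal sketch of how the usage can be measured (assuming a reasonably recent PyTorch). As far as I understand, the torch.cuda counters report far less than nvidia-smi does, because the bulk of the 700+ MiB is the CUDA context itself rather than tensor storage:

import torch

x = torch.tensor(1).cuda()
print(torch.cuda.memory_allocated())  # bytes occupied by live tensors (tiny here)
print(torch.cuda.memory_reserved())   # bytes held by PyTorch's caching allocator (rounded up)
# nvidia-smi shows several hundred MiB more than either figure;
# that difference is the per-process CUDA context, not tensor storage.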
How do I create the first tensor without the 700+ MiB of VRAM baggage? This would be especially useful for inference/deployment. Most deep-learning applications have statically sized inputs; for example, a batch of images of shape (10, 3, 200, 200) in float32 requires just 4.8 MB of VRAM. It does not make any sense in deployment to waste so much space on your model inputs.
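To spell out the arithmetic behind that 4.8 MB figure (the same elements-times-itemsize rule as above):

import torch

batch = torch.zeros((10, 3, 200, 200), dtype=torch.float32)
print(batch.nelement() * batch.element_size())
# 10 * 3 * 200 * 200 = 1,200,000 elements * 4 bytes each = 4,800,000 bytes = 4.8 MB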
Is there a specific way to initialise a PyTorch tensor, similar to how we do it in PyCUDA/CUDA with allocate-and-copy? It would be awesome if there were a way to let developers specify the nbytes of their tensor, like in CUDA.
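The closest idiom I have found is to pre-allocate a device tensor with torch.empty and reuse it via Tensor.copy_, which at least gives allocate-and-copy semantics for the tensor itself. A sketch (the shape and dtype are just examples, and this does not remove the context overhead described above):

import torch

# Allocate a fixed device buffer once, analogous to cudaMalloc;
# its footprint is numel * itemsize bytes, plus allocator rounding.
gpu_buf = torch.empty((10, 3, 200, 200), dtype=torch.float32, device='cuda')

# Reuse it for every batch, analogous to cudaMemcpy host-to-device.
cpu_batch = torch.zeros((10, 3, 200, 200), dtype=torch.float32).pin_memory()
gpu_buf.copy_(cpu_batch, non_blocking=True)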
Hopefully there is already a way to do this in PyTorch; otherwise, only implementing the CUDA code yourself can ensure CORRECT memory allocation for your inputs. A PyCUDA GpuArray does not transfer well to a PyTorch tensor (the other way around works, though), which leads to the need to code every single function in PyCUDA from scratch.
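For completeness, one way I believe the working direction (PyTorch tensor to PyCUDA) can be done is by sharing the raw device pointer; an untested sketch, assuming a pycuda version recent enough to ship pycuda.autoprimaryctx (the GPUArray only borrows the memory here, so the tensor must be kept alive while the array is in use):

import numpy as np
import pycuda.autoprimaryctx  # retains the primary context that PyTorch also uses
import pycuda.gpuarray as gpuarray
import torch

t = torch.zeros(1000, device='cuda')
# Wrap the tensor's existing allocation as a GPUArray without copying;
# GPUArray does not own this memory and will not free it.
g = gpuarray.GPUArray(tuple(t.shape), dtype=np.float32, gpudata=t.data_ptr())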