Expect a potentially large performance penalty due to the additional data movement.
You could offload data to the CPU as described here, or you could try writing a custom allocator with offloading capabilities using torch.cuda.CUDAPluggableAllocator.
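As a rough illustration of the first option, here is a minimal sketch that assumes the CPU offloading meant above is something like torch.autograd.graph.save_on_cpu, which keeps the tensors saved for the backward pass in host memory. The toy model, the tensor shapes, and the CPU fallback are my own assumptions for the example, not part of the original suggestion.

```python
import torch
from torch.autograd.graph import save_on_cpu

# Assumption: fall back to the CPU so the sketch also runs without a GPU.
device = "cuda" if torch.cuda.is_available() else "cpu"

# Hypothetical toy model and input, only here to have something to differentiate.
model = torch.nn.Sequential(
    torch.nn.Linear(64, 64),
    torch.nn.ReLU(),
    torch.nn.Linear(64, 1),
).to(device)
x = torch.randn(8, 64, device=device)

# Tensors saved for backward inside this context are stored in host memory
# and copied back to their original device when backward needs them.
# Pinned memory only makes sense when a GPU is actually in use.
with save_on_cpu(pin_memory=(device == "cuda")):
    loss = model(x).sum()

loss.backward()

# Every parameter should have received a gradient despite the offloading.
grads_ok = all(p.grad is not None for p in model.parameters())
print(grads_ok)
```

Each saved tensor is copied to the host during forward and back to the device during backward, which is where the extra data movement and the slowdown come from.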