Instead of checking the reserved memory (which includes the allocator cache) and subtracting it from the memory reported by nvidia-smi, I would just use torch.cuda.memory_allocated(), which returns only the memory currently occupied by tensors.
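A minimal sketch of the difference between the two counters (the helper name is my own; it assumes torch is installed and falls back gracefully when no GPU is present):

```python
import torch

def gpu_mem_report():
    """Return (allocated, reserved) in bytes, or None without a GPU."""
    if not torch.cuda.is_available():
        return None
    # memory_allocated: bytes currently occupied by live tensors
    allocated = torch.cuda.memory_allocated()
    # memory_reserved: bytes held by the caching allocator, i.e.
    # allocated memory plus cached blocks kept for future allocations
    reserved = torch.cuda.memory_reserved()
    return allocated, reserved

if __name__ == "__main__":
    report = gpu_mem_report()
    if report is None:
        print("No CUDA device available")
    else:
        allocated, reserved = report
        print(f"allocated={allocated} reserved={reserved}")
```

Note that nvidia-smi additionally reports the CUDA context overhead, so its number will be larger than both of these counters.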