Tracemalloc inside Modules?

I have a system that works fine on one GPU. When I switched to two GPUs I ran into problems with tensors ending up on different GPUs, even though I was using register_buffer. To fix this I call to() inside my module, self.var1.to(device=torch.cuda.device_of(self.var2).idx), and I suspect this call might be the source of the memory leak.

However, the tracemalloc traceback only reaches output = net(inputs) in my own code, so I can't see what is actually happening inside my custom module. Any input on whether I'm using to() incorrectly, or on how to get tracemalloc to trace into my actual code, would be greatly appreciated. Thanks!
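For context, here is a minimal sketch of the pattern I'm using inside the module (var1, var2, and the shapes are stand-ins for my real buffers):

import torch
import torch.nn as nn

class MyModule(nn.Module):
    def __init__(self):
        super(MyModule, self).__init__()
        self.register_buffer('var1', torch.zeros(16))
        self.register_buffer('var2', torch.zeros(16))

    def forward(self, x):
        # After DataParallel replication, pull var1 onto whatever GPU var2 landed on.
        # device_of(self.var2).idx gives var2's device index; to() returns a new tensor.
        var1 = self.var1.to(device=torch.cuda.device_of(self.var2).idx)
        return x + var1

The following is what my trace looks like.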

6346 memory blocks: 1255.4 KiB
File "/home/administrator/PyTor4-1/lib/python3.6/site-packages/torch/nn/parallel/replicate.py", line 29
    replica.__dict__ = module.__dict__.copy()
File "/home/administrator/PyTor4-1/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 127
    return replicate(module, device_ids)
File "/home/administrator/PyTor4-1/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 122
    replicas = self.replicate(self.module, self.device_ids[:len(inputs)])
File "/home/administrator/PyTor4-1/lib/python3.6/site-packages/torch/nn/modules/module.py", line 477
    result = self.forward(*input, **kwargs)
File "/home/administrator/cifarLike/wide-resnet.pytorch/main.py", line 921
    output = net(inputs)
File "/home/administrator/cifarLike/wide-resnet.pytorch/main.py", line 2
    import torch
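For completeness, the snapshot is collected roughly like this (the frame depth passed to start() is a placeholder, not necessarily my exact value):

import tracemalloc

tracemalloc.start(25)  # nframes: how many stack frames tracemalloc records per allocation

# ... training loop runs here ...

snapshot = tracemalloc.take_snapshot()
top_stats = snapshot.statistics('traceback')
for stat in top_stats[:1]:
    print("%s memory blocks: %.1f KiB" % (stat.count, stat.size / 1024))
    for line in stat.traceback.format():
        print(line)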