Iterating over tensors and calling .to("cuda") one at a time causes a lot of overhead while the GPU sits mostly idle. Is there a way to tell PyTorch to move a batch of tensors to the GPU at once?
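For context, here is a minimal sketch of the pattern in question (tensor names and shapes are made up). The second variant overlaps the host-to-device copies by pinning host memory and passing non_blocking=True:

```python
import torch

# Illustrative batch of CPU tensors (names/shapes are made up)
tensors = {f"t{i}": torch.randn(1024, 1024) for i in range(8)}

# Slow pattern: each .to("cuda") is a separate, blocking copy
gpu_tensors = {k: v.to("cuda") for k, v in tensors.items()}

# Faster: pin host memory so the copies can run asynchronously,
# then issue them all with non_blocking=True
gpu_tensors = {
    k: v.pin_memory().to("cuda", non_blocking=True)
    for k, v in tensors.items()
}
torch.cuda.synchronize()  # ensure the async copies have finished (e.g. before timing)
```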
I think you could use TensorDict, but I'm unsure how allocation on the GPU happens under the hood.
https://pytorch.org/tensordict/main/overview.html
Yes, tensordict will execute that somewhat faster by using non_blocking data transfers. Happy to look at the profile if you're willing to share one taken with that lib!
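For what it's worth, a minimal sketch of what that could look like (key names and shapes are made up; see the TensorDict overview linked above):

```python
import torch
from tensordict import TensorDict

# Illustrative batch of CPU tensors; key names and shapes are made up
batch = TensorDict(
    {
        "images": torch.randn(64, 3, 224, 224),
        "labels": torch.randint(0, 10, (64,)),
    },
    batch_size=[64],
)

# One call moves every tensor in the dict to the GPU;
# per the reply above, tensordict issues the copies
# with non_blocking data transfer under the hood
batch_cuda = batch.to("cuda")
```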