Performance questions

I’m curious about best practices regarding PyTorch performance. When it comes to PyTorch on the CPU, I think I have a rough understanding of how one should write code – basically, treat it like NumPy. Operations that can be expressed with built-in PyTorch functions are going to be faster than, e.g., native Python loops. As I understand it, this is because the heavy lifting is done in optimised C code.
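
For example, on the CPU this is the kind of difference I have in mind (a toy comparison; the size is picked arbitrarily):

```python
import time
import torch

x = torch.rand(100000)

# summing with a native Python loop
start = time.time()
total = 0.0
for v in x:
    total += v
print('python loop: {:.4f}s'.format(time.time() - start))

# the same reduction with a built-in PyTorch function
start = time.time()
total = x.sum()
print('torch.sum:   {:.4f}s'.format(time.time() - start))
```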

My confusion arises when the GPU comes into play:

  1. What is the effect of using native Python functions on Variables that live on the GPU? Is it ‘doubly bad’, in that the data is not only going through slow Python code but also has to be shuffled between the GPU and the CPU? If not, how does it work?
  2. How can I work out when I am being inefficient with the GPU? I can use e.g. line_profiler to see which functions are taking a long time, but how do I work out why something is slow (e.g. the underlying algorithm, or data transfer)? (There’s a toy timing sketch after this list showing roughly what I do now.)
  3. How do I take advantage of pinned memory? When do I need to?
  4. If I have intermediate Variables for some network, is there a penalty to e.g. creating a new Variable on every forward pass? How can I structure my class, inheriting from nn.Module, such that these Variables are moved to the GPU when I call .cuda()? Or do I just have to pass an extra flag to __init__ and manually check everything? (The second sketch after this list shows roughly what I mean.)
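
To make question 2 concrete, here is roughly how I time things at the moment (a toy sketch, assuming a CUDA device is available; I’m not sure whether this measures the computation itself, a transfer, or something else entirely):

```python
import time
import torch
from torch.autograd import Variable

# a reasonably large matrix so the GPU has something to chew on
x = Variable(torch.rand(4096, 4096)).cuda()

start = time.time()
y = torch.mm(x, x)  # matrix multiply on the GPU
print('mm took {:.4f}s'.format(time.time() - start))
```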

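And for question 4, here is a stripped-down sketch of what I currently do (the class and names are made up purely for illustration; the extra flag in __init__ is the part I’d like to avoid):

```python
import torch
import torch.nn as nn
from torch.autograd import Variable

class MyNet(nn.Module):
    def __init__(self, use_cuda=False):
        super(MyNet, self).__init__()
        self.use_cuda = use_cuda      # extra flag I currently pass in by hand
        self.fc = nn.Linear(10, 10)

    def forward(self, x):
        # intermediate Variable, recreated on every forward pass
        mask = Variable(torch.ones(x.size()))
        if self.use_cuda:             # manual check, since .cuda() on the module doesn't seem to touch this
            mask = mask.cuda()
        return self.fc(x * mask)
```

At the moment I construct this as MyNet(use_cuda=True) and also call .cuda() on it, which feels redundant.
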
Sorry for all the questions!