I converted a Jupyter notebook from colab.research.google.com to run on my local machine. Unsurprisingly, it takes much longer. However, when profiling the Python code with cProfile, I find that 40% of that time is spent in torch.Tensor.cuda(), which basically copies your objects from RAM to memory on the GPU. Is there any way of speeding this up short of “buy a new faster machine”, that is, one with a different and faster interconnect between RAM and the GPU? The GPU is a GeForce GTX 1660, which I assume is a discrete card. The machine is new and I haven’t opened it up and looked around much.
Interestingly, I have a routine that measures TFLOPS, and the 1660 is faster by some 1/3 than whatever default GPU I’m getting in an unpaid Colab account. I wonder if paying the $9.99/month results in faster GPUs or interconnects? Although of course that’s not where my current bottleneck is. The two training steps of the code in question take 1.5 hrs on Colab, and much longer locally.
I’m not sure how you are profiling your code, but note that CUDA operations are executed asynchronously, so you would need to synchronize the code via torch.cuda.synchronize() before starting and stopping the timer.
If your device is busy with other workloads (e.g. the model training), the tensor copy operation might accumulate the time of the previously queued operations, and your profiling would be invalid.
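A minimal sketch of what synchronizing around the timer could look like. The helper name `timed` and the tensor size are made up for illustration and are not from the original script; the code falls back to the CPU if no GPU is available.

```python
import time

import torch


def timed(fn, *args, **kwargs):
    """Run fn and return (result, seconds), synchronizing around the timer.

    Hypothetical helper for illustration only.
    """
    use_cuda = torch.cuda.is_available()
    if use_cuda:
        torch.cuda.synchronize()  # wait for GPU work queued before the timer starts
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    if use_cuda:
        torch.cuda.synchronize()  # wait for the work fn itself launched
    elapsed = time.perf_counter() - start
    return result, elapsed


# Falls back to the CPU so the sketch also runs on a machine without a GPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

x = torch.randn(1024, 1024)          # arbitrary example tensor in host memory
x_dev, copy_s = timed(x.to, device)  # time only the host-to-device copy
print(f"copy to {device}: {copy_s * 1e3:.3f} ms")
```

Timed this way, the copy is charged only for its own work and not for kernels that were still running when it was called.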
I’ve modified (hacked) the following so that it will run locally:
If need be, I could post my version somewhere.
Then I run locally with:
ipython -m cProfile FlowerML-2.py
As I was googling the problem, I noticed a reply of yours to someone somewhere to the effect that asynchronous operations were being added to PyTorch (probably long ago), but that the person asking would need to build PyTorch for him or herself.
… ah, found it (your earlier conversation):
I’m a little confused - immediately after the call to .cuda() the same objects are operated upon, presumably on the GPU and happening under-the-hood within pytorch, and so I was assuming this would need to happen synchronously.
Not all of the above necessarily addresses what you asked. I’m not at all new to programming, but relatively new to ML and PyTorch.
You cannot reduce the time using torch.cuda.synchronize(); it is only used to synchronize the code in order to get valid profiling times. As described before, CUDA operations are executed asynchronously, so profiling them without synchronization yields invalid results.
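To illustrate the difference, here is a rough sketch that times the same matrix multiplication with and without waiting for the GPU. The matrix size and variable names are arbitrary assumptions; on a CPU-only machine both numbers will be nearly identical, since CPU ops run synchronously.

```python
import time

import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")


def sync():
    # Only meaningful on a GPU; CPU ops are already synchronous.
    if device.type == "cuda":
        torch.cuda.synchronize()


a = torch.randn(1024, 1024, device=device)  # arbitrary example size
sync()  # make sure the allocation and random fill have finished

start = time.perf_counter()
b = a @ a                                  # on a GPU this only queues the kernel
launch_s = time.perf_counter() - start     # time until the call returned
sync()                                     # now wait for the kernel to finish
total_s = time.perf_counter() - start      # time until the result is ready

print(f"call returned after {launch_s * 1e3:.3f} ms")
print(f"result ready after  {total_s * 1e3:.3f} ms")
```

On a GPU the first number is typically much smaller than the second. The remaining time does not disappear; without the synchronization it is simply charged to whichever later call has to wait for the queue to drain, which can be a tensor copy.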