I am running into a very strange problem and have to ask someone who might be familiar with it.
Briefly: the GPU shows 0% gpu-util even though a lot of GPU memory is occupied (see the figure below).
The code I am running is protonets, which uses the torchnet framework. I have a Quadro P4000 on my local desktop and a Tesla V100 on the server cluster. The code runs well, with about 50% gpu-util, on my local desktop, but cannot reach even 1% gpu-util on the cluster with the V100. As a result, the code on the cluster is at least 10 times slower than on the local machine.
After a lot of configuration work, both the cluster node and my local desktop have the same setup: Python 3.6.5, CUDA 9.0, PyTorch 0.4.0. So the configuration is probably not the issue.
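For reference, this is roughly how I checked the environment on both machines (a sketch, not my exact commands):

```python
import torch

print(torch.__version__)               # 0.4.0 on both machines
print(torch.version.cuda)              # 9.0 on both machines
print(torch.backends.cudnn.version())
print(torch.cuda.get_device_name(0))   # Quadro P4000 locally, Tesla V100 on the cluster
```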
I am pretty sure I have applied cuda() to all the relevant models and tensors, and you can see that they are indeed loaded into GPU memory (2503/16160 MiB in the picture). But why don't they actually run their computation on the GPU?
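This sketch shows the kind of check I did; the model and input here are just stand-ins (a tiny nn.Linear and a random tensor), not the actual protonets objects:

```python
import torch
import torch.nn as nn

# Stand-ins for the real protonets model and input batch.
model = nn.Linear(10, 10).cuda()
x = torch.randn(4, 10).cuda()

print(next(model.parameters()).is_cuda)  # True -> parameters live on the GPU
print(x.is_cuda)                         # True -> input lives on the GPU
print(torch.cuda.memory_allocated())     # non-zero, consistent with nvidia-smi
```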
I initially doubted whether torchnet supports the V100 at all. But after running a simple example, I found that it ran well on the V100.
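Even a plain GPU sanity check along these lines (a sketch, not the exact script I used) runs at full speed on the V100:

```python
import time
import torch

# Plain CUDA compute check: a loop of large matmuls should push gpu-util
# close to 100% while it runs.
x = torch.randn(4096, 4096).cuda()
y = torch.randn(4096, 4096).cuda()

torch.cuda.synchronize()
start = time.time()
for _ in range(100):
    z = torch.mm(x, y)
torch.cuda.synchronize()
print('100 matmuls took %.3f s' % (time.time() - start))
```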
PS: I didn't change anything in the protonets source code. It just doesn't work on the V100, while it works well on the P4000.