[SOLVED] Make Sure That PyTorch Is Using the GPU to Compute

I am adding my profiling results from torch.utils.bottleneck. I am not sure what I can read from the results. In particular, if CUDA time is larger than CPU time, does that mean the GPU is being utilised? Thanks!

--------------------------------------------------------------------------------
  autograd profiler output (CPU mode)
--------------------------------------------------------------------------------
        top 15 events sorted by cpu_time_total

---------  ---------------  ---------------  ---------------  ---------------  ---------------
Name              CPU time        CUDA time            Calls        CPU total       CUDA total
---------  ---------------  ---------------  ---------------  ---------------  ---------------
stack        1995016.802us          0.000us                1    1995016.802us          0.000us
stack        1433562.687us          0.000us                1    1433562.687us          0.000us
stack        1418816.239us          0.000us                1    1418816.239us          0.000us
stack        1208400.125us          0.000us                1    1208400.125us          0.000us
stack        1109156.949us          0.000us                1    1109156.949us          0.000us
stack        1043755.894us          0.000us                1    1043755.894us          0.000us
stack         989006.451us          0.000us                1     989006.451us          0.000us
stack         988511.989us          0.000us                1     988511.989us          0.000us
stack         984434.292us          0.000us                1     984434.292us          0.000us
stack         980338.307us          0.000us                1     980338.307us          0.000us
stack         976940.691us          0.000us                1     976940.691us          0.000us
stack         955838.942us          0.000us                1     955838.942us          0.000us
stack         955763.458us          0.000us                1     955763.458us          0.000us
stack         952211.930us          0.000us                1     952211.930us          0.000us
stack         951751.424us          0.000us                1     951751.424us          0.000us

--------------------------------------------------------------------------------
  autograd profiler output (CUDA mode)
--------------------------------------------------------------------------------
        top 15 events sorted by cpu_time_total

	Because the autograd profiler uses the CUDA event API,
	the CUDA time column reports approximately max(cuda_time, cpu_time).
	Please ignore this output if your code does not use CUDA.

---------  ---------------  ---------------  ---------------  ---------------  ---------------
Name              CPU time        CUDA time            Calls        CPU total       CUDA total
---------  ---------------  ---------------  ---------------  ---------------  ---------------
stack        1348676.702us    1348687.500us                1    1348676.702us    1348687.500us
stack        1325784.279us    1325796.875us                1    1325784.279us    1325796.875us
stack        1301842.419us    1301843.750us                1    1301842.419us    1301843.750us
stack        1271585.903us    1271609.375us                1    1271585.903us    1271609.375us
stack        1269943.439us    1269953.125us                1    1269943.439us    1269953.125us
stack        1184606.802us    1184597.656us                1    1184606.802us    1184597.656us
stack        1176057.135us    1176062.500us                1    1176057.135us    1176062.500us
stack        1108025.533us    1108031.250us                1    1108025.533us    1108031.250us
stack        1095250.413us    1095257.812us                1    1095250.413us    1095257.812us
stack        1082371.450us    1082375.000us                1    1082371.450us    1082375.000us
stack        1080302.317us    1080312.500us                1    1080302.317us    1080312.500us
stack        1028030.105us    1028039.062us                1    1028030.105us    1028039.062us
stack        1015617.116us    1015625.000us                1    1015617.116us    1015625.000us
stack         861592.872us     861601.562us                1     861592.872us     861601.562us
stack         860586.499us     860593.750us                1     860586.499us     860593.750us

I found that the bottleneck is the DataLoader. I implemented my own DataLoader according to this code, and timed it like this:

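import time

for epoch in range(epochs):
    time1 = time.time()  # start of the epoch, before the loader yields anything
    print(time1)
    for data in loader:
        time2 = time.time()  # moment this batch arrived from the loader
        print(time2)
        ...  # forward and backward pass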

time2-time1 is extremely large (~15s), while everything else executed inside the inner loop, including forward and backprop, takes <1s.

The initial step might take more time, as the workers will be spawned and start to create the next batch. Once you are inside the loop, the following iterations should be faster.
Have a look at the ImageNet example to see how to check the data loading time for the following iterations.
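
A minimal sketch of that kind of measurement, loosely in the spirit of the ImageNet example but not copied from it: it records how long each iteration waits for the loader separately from how long the training step takes. The helper name `time_data_loading` and the `step` callback are hypothetical, and `loader` can be any iterable, such as a DataLoader.

```python
import time


def time_data_loading(loader, step=None):
    """Return (data_times, step_times) in seconds for one pass over `loader`.

    `loader` is any iterable (for example a torch DataLoader).
    `step` is a hypothetical callback that runs forward/backward on one batch.
    """
    data_times, step_times = [], []
    end = time.perf_counter()
    for batch in loader:
        # Time spent waiting for the loader to hand over this batch.
        fetched = time.perf_counter()
        data_times.append(fetched - end)
        if step is not None:
            step(batch)
        # Time spent in the training step itself.
        end = time.perf_counter()
        step_times.append(end - fetched)
    return data_times, step_times
```

If the first entry of `data_times` is large and the rest are small, the cost is mostly worker startup. If `step` runs on the GPU, calling `torch.cuda.synchronize()` at the end of `step` should make the step time meaningful, since CUDA kernels are launched asynchronously.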

OK, thanks for the confirmation. The thing that bothers me is that PyTorch seems to run slower than Keras, given the same dataset and a similar network size, when the batch_size is larger. However, it does have an advantage with a smaller batch_size.

What kind of model and dataset are you using?
Also, how large is the speed difference?

So what happens if one of my tensors is on the CPU and the other is on the GPU? For example, say I forgot to call .to(device) on everything, assuming I am doing:

    use_cuda = torch.cuda.is_available()
    device = torch.device("cuda" if use_cuda else "cpu")

Would that tensor be moved to the GPU, or what would happen?

What is the best way to make sure everything is truly using the GPU? Do I need to worry that I might have forgotten to call .to(device) on something?

Is that equivalent to:

    use_cuda = torch.cuda.is_available()
    device = torch.device("cuda" if use_cuda else "cpu")
    model = MyModel()
    model.to(device)

?

Is there some sort of internal flag I can check to see if things are properly placed on the GPU?

If some parameters are not located on the device where they are expected to be, you’ll get an error.
To check all parameters, you could run something like:

for p in model.parameters():
    print(p.device)

Note that this only checks the parameters, so you might also want to run the same check on the buffers.
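
A small sketch that covers both parameters and buffers in one pass. The helper name `devices_in_use` and the toy model are made up for illustration; `model.parameters()` and `model.buffers()` are the standard `nn.Module` methods.

```python
import torch
from torch import nn


def devices_in_use(model):
    """Collect the device of every parameter and buffer of `model`."""
    devices = set()
    for p in model.parameters():
        devices.add(str(p.device))
    # Buffers (e.g. BatchNorm running statistics) are not parameters,
    # so they need their own loop.
    for b in model.buffers():
        devices.add(str(b.device))
    return devices


device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
# Toy model, assumed only for this example.
model = nn.Sequential(nn.Linear(4, 8), nn.BatchNorm1d(8)).to(device)
print(devices_in_use(model))
```

If the returned set contains more than one entry, part of the model was left behind on another device.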

It seems those checks are unnecessary, since PyTorch's own checking disallows computing things where one tensor is on the GPU and the other is on the CPU. Right?

That’s correct. If you run your code and an operation mixes tensors on the GPU and the CPU, you’ll get an error.
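
A quick way to see this for yourself, as a sketch: the helper name `try_mixed_device_add` is hypothetical, and it returns None on a machine without CUDA because the mismatch cannot be reproduced there. The exact wording of the error depends on the PyTorch version.

```python
import torch


def try_mixed_device_add():
    """Add a CPU tensor to a CUDA tensor and return the error message.

    Returns None when no CUDA device is available, since the mismatch
    cannot be reproduced on a CPU-only machine.
    """
    if not torch.cuda.is_available():
        return None
    a = torch.ones(3)                 # lives on the CPU
    b = torch.ones(3, device="cuda")  # lives on the GPU
    try:
        a + b
    except RuntimeError as err:
        return str(err)
    return "no error raised"


print(try_mixed_device_add())
```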

If you are using Anaconda, right-click Anaconda Navigator, choose "Run with graphics processor", and pick the graphics processor that you want to use. Then launch your IDE.
Once the IDE is open, run torch.cuda.is_available() to check.
If it returns True, it worked.
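
A slightly fuller version of that check, as a sketch. The helper name `describe_cuda` is made up; the `torch.cuda` calls it uses are the standard ones.

```python
import torch


def describe_cuda():
    """Return a short summary of the CUDA devices PyTorch can see."""
    info = {
        "cuda_available": torch.cuda.is_available(),
        "device_count": torch.cuda.device_count(),
    }
    if info["cuda_available"]:
        # Only query the device when one is actually present.
        info["current_device"] = torch.cuda.current_device()
        info["device_name"] = torch.cuda.get_device_name(info["current_device"])
    return info


print(describe_cuda())
```

Note that this only tells you a GPU is visible to PyTorch. Whether your model and data actually sit on it still depends on calling .to(device) on them.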
