[SOLVED] Make Sure That PyTorch Is Using the GPU To Compute

Hi, when I try this code, the second one fails with the message: Segmentation fault (core dumped). When I add CUDA_VISIBLE_DEVICES=1, it works; it only fails when I use CUDA_VISIBLE_DEVICES=0. Can you please tell me why and give any suggestions?

I’m trying to implement the methods at the beginning of this thread as follows:

model = model.cuda()


import time
start = time.time()
train_loss = []
train_accu = []
i = 0
for epoch in range(20):
    for data, target in train_loader:
        data, target = (Variable(data).double()).cuda(), (Variable(target).long()).cuda()
        optimizer.zero_grad()   # clear gradients left over from the previous step
        output = model(data.view(batch_size,1,64,64))
        loss = F.nll_loss(output, target) # Negative log likelihood (expects log-softmax outputs).
        loss.backward()    # calc gradients
        train_loss.append(loss.data[0]) # Record the loss
        optimizer.step()   # update parameters
        prediction = output.data.max(1)[1]   # first column has actual prob.
        accuracy = (prediction.eq(target.data).sum()/batch_size)*100
        if i % 10 == 0:
            print('Epoch:',str(epoch),'Train Step: {}\tLoss: {:.3f}\tAccuracy: {:.3f}'.format(i, loss.data[0], accuracy))
        i += 1
end = time.time()
print('TRAIN TIME:')

But when I train, I just get a constant accuracy of 0%. Am I missing some place where I need to call .cuda()?

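prediction.eq(target.data) returns a byte tensor/variable. Summing it gives an integer, and dividing that integer by the batch size is an integer division, so the result truncates to zero unless every prediction in the batch is correct.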

Try it with

accuracy = (prediction.eq(target.data).float().sum()/batch_size)*100
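For readers on a recent PyTorch release, here is a minimal sketch of the same accuracy computation. The helper name `batch_accuracy` and the toy tensors are made up for illustration. In newer releases `/` on integer tensors should already perform true division, so the explicit `.float()` is mostly a safeguard that also keeps the snippet correct on older versions.

```python
import torch

def batch_accuracy(output, target):
    """Percentage of rows in `output` whose arg-max matches `target`."""
    prediction = output.argmax(dim=1)       # index of the highest score per row
    correct = prediction.eq(target)         # element-wise comparison
    # Convert to float before averaging so the result is never truncated.
    return correct.float().mean().item() * 100.0

# Toy scores for 4 samples and 2 classes (illustrative values only).
output = torch.tensor([[0.1, 0.9],
                       [0.8, 0.2],
                       [0.3, 0.7],
                       [0.6, 0.4]])
target = torch.tensor([1, 0, 0, 0])

accuracy = batch_accuracy(output, target)
print(accuracy)
```

Three of the four toy predictions match the targets, so the helper reports a value between 0 and 100 rather than collapsing to 0.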

Ah yes it would wouldn’t it! Worked beautifully, thanks!

Would it matter that I’ve called .cuda() on data before turning it into a Variable, or should I be doing Variable(data).double().cuda()?

Both should work equally well.
I would recommend switching to PyTorch 0.4, as the Variable and Tensor classes are merged in that release.
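As a rough sketch of what the training step looks like once Variable is no longer needed (PyTorch 0.4 or later): the tiny model, the random data, and the hyperparameters below are placeholders, not the poster's actual setup. Tensors are moved with `.to(device)` and the scalar loss is read with `loss.item()` instead of `loss.data[0]`.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Fall back to the CPU so the sketch also runs on machines without a GPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
torch.manual_seed(0)

# Placeholder model: a small fully connected classifier with log-softmax output.
model = nn.Sequential(
    nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 3), nn.LogSoftmax(dim=1)
).to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

# Placeholder batch: 32 samples, 8 features, 3 classes.
data = torch.randn(32, 8)
target = torch.randint(0, 3, (32,))

train_loss = []
for step in range(5):
    x, y = data.to(device), target.to(device)  # plain tensors, no Variable wrapper
    optimizer.zero_grad()                      # reset gradients from the last step
    output = model(x)
    loss = F.nll_loss(output, y)
    loss.backward()
    optimizer.step()
    train_loss.append(loss.item())             # Python float, replaces loss.data[0]

print(train_loss)
```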

Hello, I have the same issue and I don’t know how to solve it. Could you help me, please?


I am struggling with running PyTorch on the GPU. I created a simple fully connected network, set batch_size very large so that all the data is fed in at once, and moved my model, X and y to the GPU using to('cuda'). Training takes a long time compared to Keras on the GPU, and takes about as long as when I set os.environ["CUDA_VISIBLE_DEVICES"]="-1" so that training runs on the CPU. I wonder if I am missing an important step for running PyTorch on the GPU.

In fact, I did observe a timing difference for a CNN: the GPU runs faster than the CPU. However, I cannot reproduce it for a fully connected network. The size of the network does not change the conclusion.

Is there any test code for a fully connected deep network running on GPU? All examples on the web that I can find are CNNs.

I have added my profiling results from torch.utils.bottleneck. I am not sure what I can read from the results. In particular, if CUDA time is bigger than CPU time, does that mean the GPU is utilised? Thanks!

  autograd profiler output (CPU mode)
        top 15 events sorted by cpu_time_total

---------  ---------------  ---------------  ---------------  ---------------  ---------------
Name              CPU time        CUDA time            Calls        CPU total       CUDA total
---------  ---------------  ---------------  ---------------  ---------------  ---------------
stack        1995016.802us          0.000us                1    1995016.802us          0.000us
stack        1433562.687us          0.000us                1    1433562.687us          0.000us
stack        1418816.239us          0.000us                1    1418816.239us          0.000us
stack        1208400.125us          0.000us                1    1208400.125us          0.000us
stack        1109156.949us          0.000us                1    1109156.949us          0.000us
stack        1043755.894us          0.000us                1    1043755.894us          0.000us
stack         989006.451us          0.000us                1     989006.451us          0.000us
stack         988511.989us          0.000us                1     988511.989us          0.000us
stack         984434.292us          0.000us                1     984434.292us          0.000us
stack         980338.307us          0.000us                1     980338.307us          0.000us
stack         976940.691us          0.000us                1     976940.691us          0.000us
stack         955838.942us          0.000us                1     955838.942us          0.000us
stack         955763.458us          0.000us                1     955763.458us          0.000us
stack         952211.930us          0.000us                1     952211.930us          0.000us
stack         951751.424us          0.000us                1     951751.424us          0.000us

  autograd profiler output (CUDA mode)
        top 15 events sorted by cpu_time_total

	Because the autograd profiler uses the CUDA event API,
	the CUDA time column reports approximately max(cuda_time, cpu_time).
	Please ignore this output if your code does not use CUDA.

---------  ---------------  ---------------  ---------------  ---------------  ---------------
Name              CPU time        CUDA time            Calls        CPU total       CUDA total
---------  ---------------  ---------------  ---------------  ---------------  ---------------
stack        1348676.702us    1348687.500us                1    1348676.702us    1348687.500us
stack        1325784.279us    1325796.875us                1    1325784.279us    1325796.875us
stack        1301842.419us    1301843.750us                1    1301842.419us    1301843.750us
stack        1271585.903us    1271609.375us                1    1271585.903us    1271609.375us
stack        1269943.439us    1269953.125us                1    1269943.439us    1269953.125us
stack        1184606.802us    1184597.656us                1    1184606.802us    1184597.656us
stack        1176057.135us    1176062.500us                1    1176057.135us    1176062.500us
stack        1108025.533us    1108031.250us                1    1108025.533us    1108031.250us
stack        1095250.413us    1095257.812us                1    1095250.413us    1095257.812us
stack        1082371.450us    1082375.000us                1    1082371.450us    1082375.000us
stack        1080302.317us    1080312.500us                1    1080302.317us    1080312.500us
stack        1028030.105us    1028039.062us                1    1028030.105us    1028039.062us
stack        1015617.116us    1015625.000us                1    1015617.116us    1015625.000us
stack         861592.872us     861601.562us                1     861592.872us     861601.562us
stack         860586.499us     860593.750us                1     860586.499us     860593.750us
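Regarding test code for a fully connected network: below is a minimal timing sketch, not a rigorous benchmark. The layer width, batch size, and step count are arbitrary, and `time_forward_backward` is a made-up helper name. CUDA kernels are launched asynchronously, so the sketch calls `torch.cuda.synchronize()` before reading the clock; without that, the GPU timing would mostly measure launch overhead.

```python
import time
import torch
import torch.nn as nn

def time_forward_backward(device, steps=3, batch_size=256, width=512):
    """Average seconds per forward+backward pass of a small MLP on `device`."""
    model = nn.Sequential(
        nn.Linear(width, width), nn.ReLU(),
        nn.Linear(width, width), nn.ReLU(),
        nn.Linear(width, 10),
    ).to(device)
    x = torch.randn(batch_size, width, device=device)
    y = torch.randint(0, 10, (batch_size,), device=device)
    loss_fn = nn.CrossEntropyLoss()

    # Warm-up pass so one-time initialisation cost is not included.
    loss_fn(model(x), y).backward()
    if device.type == "cuda":
        torch.cuda.synchronize()

    start = time.perf_counter()
    for _ in range(steps):
        model.zero_grad()
        loss_fn(model(x), y).backward()
    if device.type == "cuda":
        torch.cuda.synchronize()  # wait for queued GPU work before stopping the clock
    return (time.perf_counter() - start) / steps

timings = {"cpu": time_forward_backward(torch.device("cpu"))}
if torch.cuda.is_available():
    timings["cuda"] = time_forward_backward(torch.device("cuda"))
print(timings)
```

With small layers or small batches the GPU may not come out ahead, since transfer and launch overhead can dominate the actual compute.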

I found that the bottleneck is the DataLoader - I implemented my own DataLoader according to this code

import time
for epoch in range(epochs):
    time1 = time.time()   # timestamp at the start of the epoch
    print(time1)
    for data in loader:
        time2 = time.time()   # timestamp when a batch arrives
        print(time2)

time2-time1 is extremely large (~15s), while all the other operations inside the inner loop, including forward and backprop, take <1s.

The initial step might take more time, as the workers have to be spawned before they can start creating the next batch. Once you are inside the loop, the following iterations should be faster.
Have a look at the ImageNet example to see how to check the data loading time for the following iterations.
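A minimal sketch of that idea, measuring how long each iteration waits for its batch. This is not the ImageNet example's exact code; the synthetic dataset and the names `data_times` and `end` are placeholders. `num_workers=0` is used only to keep the sketch self-contained; with worker processes, the first iteration would additionally include the start-up cost mentioned above.

```python
import time
import torch
from torch.utils.data import DataLoader, TensorDataset

# Synthetic stand-in for a real dataset: 64 samples with 8 features each.
dataset = TensorDataset(torch.randn(64, 8), torch.randint(0, 2, (64,)))
loader = DataLoader(dataset, batch_size=16, num_workers=0)

data_times = []
end = time.perf_counter()
for data, target in loader:
    # Time between finishing the previous iteration and receiving this batch.
    data_times.append(time.perf_counter() - end)

    # ... forward pass, backward pass, and optimizer step would go here ...

    end = time.perf_counter()

print(["{:.6f}s".format(t) for t in data_times])
```

If the later entries in `data_times` stay large compared to the compute time per step, data loading rather than the GPU is the limiting factor.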

OK, thanks for the confirmation. What bothers me is that PyTorch seems to run slower than Keras on the same dataset with a similar network size when the batch_size is large. However, it does have an advantage with a smaller batch_size.

What kind of model and dataset are you using?
Also, how large is the speed difference?

So what happens if one of my tensors is on the CPU and the other is on the GPU? For example, say I forgot to call .to(device) on everything, assuming I am doing:

    use_cuda = torch.cuda.is_available()
    device = torch.device("cuda" if use_cuda else "cpu")

Would that tensor be moved to the GPU, or what would happen?

What is the best way to make sure everything is truly using the GPU? Do I need to worry that I might have forgotten to call .to(device) on something?

Is that equivalent to:

    use_cuda = torch.cuda.is_available()
    device = torch.device("cuda" if use_cuda else "cpu")
    model = MyModel()


Is there some sort of internal flag I can check to see if things are properly placed on the GPU?

If some parameters are not located on the device where they are expected to be, you’ll get an error.
To check all parameters, you could run something like:

for p in model.parameters():
    print(p.device)   # device each parameter lives on

Note that this only checks the parameters, so you might also want to use it for buffers.
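One way to cover both parameters and buffers in a single check is sketched below. `devices_in_use` is a hypothetical helper, not a PyTorch API, and the small model is only there to have a buffer-carrying layer (BatchNorm keeps its running statistics as buffers).

```python
import torch
import torch.nn as nn

def devices_in_use(model):
    """Return the set of device names holding the model's parameters and buffers."""
    tensors = list(model.parameters()) + list(model.buffers())
    return {str(t.device) for t in tensors}

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Placeholder model: Linear has parameters, BatchNorm1d also has buffers.
model = nn.Sequential(nn.Linear(4, 4), nn.BatchNorm1d(4)).to(device)

print(devices_in_use(model))
```

If the returned set contains more than one entry, some part of the model was left behind on another device.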

It seems those checks are unnecessary, since your type checking disallows computing things where one tensor is on the GPU and the other is on the CPU. Right?

That’s correct. If you run your code and some operations are using tensors on the GPU and CPU, you’ll get an error.
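A small sketch of that behaviour, for anyone who wants to see it for themselves. The variable names are illustrative. On a machine without a GPU both tensors end up on the CPU, so no error is raised there.

```python
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

a = torch.ones(2, 2)                  # deliberately left on the CPU
b = torch.ones(2, 2, device=device)   # on the GPU when one is available

try:
    c = a + b                         # mixing devices is not done implicitly
    mismatch_error = None
except RuntimeError as err:
    mismatch_error = err              # raised when a and b live on different devices

print(mismatch_error)
```

Nothing is moved silently: the operation either runs with both operands on one device or fails, which is why a forgotten `.to(device)` usually shows up immediately.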

If you are using Anaconda, right-click Anaconda Navigator, choose "Run with graphics processor", and pick the graphics processor that you want to use. Then launch your IDE.
After the IDE has opened, run torch.cuda.is_available() to check.
If it returns True, that means it worked.
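A slightly fuller version of that check, as a sketch: besides `torch.cuda.is_available()`, it lists the GPUs that PyTorch can see, which is also affected by CUDA_VISIBLE_DEVICES. `describe_cuda` is a made-up helper name.

```python
import torch

def describe_cuda():
    """Collect whether CUDA is usable and the names of the visible GPUs."""
    info = {"available": torch.cuda.is_available(), "devices": []}
    if info["available"]:
        for index in range(torch.cuda.device_count()):
            info["devices"].append(torch.cuda.get_device_name(index))
    return info

cuda_info = describe_cuda()
print(cuda_info)
```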
