Hi, when I try this code, the second one fails with: Segmentation fault (core dumped). It works when I add CUDA_VISIBLE_DEVICES=1; it only fails when I use CUDA_VISIBLE_DEVICES=0. Can you please tell me why and give any suggestions?
I’m trying to implement the methods at the beginning of this thread as follows:
model = model.cuda()
torch.backends.cudnn.benchmark=True
import time
start = time.time()
model.train()
train_loss = []
train_accu = []
i = 0
for epoch in range(20):
    for data, target in train_loader:
        data, target = (Variable(data).double()).cuda(), (Variable(target).long()).cuda()
        optimizer.zero_grad()
        output = model(data.view(batch_size,1,64,64))
        loss = F.nll_loss(output, target) # Negative log likelihood (goes with softmax).
        loss.backward() # calc gradients
        train_loss.append(loss.data[0]) # Calculating the loss
        optimizer.step() # update gradients
        prediction = output.data.max(1)[1] # first column has actual prob.
        accuracy = (prediction.eq(target.data).sum()/batch_size)*100
        train_accu.append(accuracy)
        if i % 10 == 0:
            print('Epoch:',str(epoch),'Train Step: {}\tLoss: {:.3f}\tAccuracy: {:.3f}'.format(i, loss.data[0], accuracy))
        i += 1
end = time.time()
print('TRAIN TIME:')
print('%.2gs'%(end-start))
But when I train, I just get a constant accuracy of 0%. Am I missing some part where I need to cast to .cuda() ?
prediction.eq(target.data) returns a byte tensor/variable. Summing it up and dividing the result by the batch size would lead to zero.
Try it with
accuracy = (prediction.eq(target.data).float().sum()/batch_size)*100
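For reference, here is a minimal sketch of the same fix written for current PyTorch, where Variable is no longer needed. The helper name batch_accuracy and the toy tensors are made up for illustration; the only point is the explicit .float() cast before reducing.

```python
import torch

def batch_accuracy(output, target):
    """Percentage of correct predictions in one batch."""
    prediction = output.max(1)[1]       # index of the largest score in each row
    correct = prediction.eq(target)     # element-wise comparison, not a float tensor
    # Cast to float before reducing so the result is a true fraction.
    return correct.float().mean().item() * 100

# Toy batch (hypothetical values): 4 samples, 3 classes.
output = torch.tensor([[0.1, 0.7, 0.2],
                       [0.8, 0.1, 0.1],
                       [0.2, 0.2, 0.6],
                       [0.3, 0.4, 0.3]])
target = torch.tensor([1, 0, 2, 0])

accuracy = batch_accuracy(output, target)
print('Accuracy: {:.1f}%'.format(accuracy))
```

Using .mean() on the float tensor also avoids having to divide by batch_size by hand, which matters when the last batch is smaller than the others.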
Ah yes it would wouldn’t it! Worked beautifully, thanks!
Would it matter that I’ve called .cuda() on data before turning it into a Variable, or should I be doing Variable(data).double().cuda()?
Both should work equally well.
I would recommend switching to PyTorch 0.4, as Variable and Tensor are merged in this release.
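As a rough sketch of what one training step looks like once the two classes are merged (PyTorch 0.4 or later): tensors are used directly, .to(device) replaces the .cuda() calls, and loss.item() replaces loss.data[0]. The tiny model and the random batch below are stand-ins I made up, not the model from the question.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Hypothetical stand-in for the model in the question (64x64 inputs, 10 classes).
model = nn.Sequential(
    nn.Flatten(),
    nn.Linear(64 * 64, 10),
    nn.LogSoftmax(dim=1),   # nll_loss expects log-probabilities
).to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

# Random stand-in batch; in real code this comes from train_loader.
data = torch.randn(8, 1, 64, 64)
target = torch.randint(0, 10, (8,))

# No Variable wrapper needed: move plain tensors to the device.
data, target = data.to(device), target.to(device)

optimizer.zero_grad()
output = model(data)
loss = F.nll_loss(output, target)
loss.backward()
optimizer.step()

loss_value = loss.item()    # Python float, replaces loss.data[0]
print('Loss: {:.3f}'.format(loss_value))
```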
Hello, I have the same issue and I don’t know how to solve it. Could you help me, please?
Hi,
I am struggling with running PyTorch on GPU. I created a simple fully connected network, set batch_size very large to make sure all the data is fed in the first batch, and moved my model, X and y to the GPU using to('cuda'). Training takes a long time compared to Keras on GPU, and about the same time as when I set os.environ["CUDA_VISIBLE_DEVICES"]="-1" so that training runs on the CPU. I wonder if I am missing an important step to run PyTorch on GPU.
In fact, I did observe a timing difference for a CNN: the GPU runs faster than the CPU. However, I cannot reproduce that for a fully connected network, and the size of the network does not change the conclusion.
Is there any test code for a fully connected deep network running on GPU? All examples on the web that I can find are CNNs.
I add my profiling results from torch.utils.bottleneck. I am not sure what I can read from the results; in particular, if CUDA time is bigger than CPU time, does it mean the GPU is utilised? Thanks!
--------------------------------------------------------------------------------
autograd profiler output (CPU mode)
--------------------------------------------------------------------------------
top 15 events sorted by cpu_time_total
---------  ---------------  ---------------  ---------------  ---------------  ---------------
Name              CPU time        CUDA time            Calls        CPU total       CUDA total
---------  ---------------  ---------------  ---------------  ---------------  ---------------
stack        1995016.802us          0.000us                1    1995016.802us          0.000us
stack        1433562.687us          0.000us                1    1433562.687us          0.000us
stack        1418816.239us          0.000us                1    1418816.239us          0.000us
stack        1208400.125us          0.000us                1    1208400.125us          0.000us
stack        1109156.949us          0.000us                1    1109156.949us          0.000us
stack        1043755.894us          0.000us                1    1043755.894us          0.000us
stack         989006.451us          0.000us                1     989006.451us          0.000us
stack         988511.989us          0.000us                1     988511.989us          0.000us
stack         984434.292us          0.000us                1     984434.292us          0.000us
stack         980338.307us          0.000us                1     980338.307us          0.000us
stack         976940.691us          0.000us                1     976940.691us          0.000us
stack         955838.942us          0.000us                1     955838.942us          0.000us
stack         955763.458us          0.000us                1     955763.458us          0.000us
stack         952211.930us          0.000us                1     952211.930us          0.000us
stack         951751.424us          0.000us                1     951751.424us          0.000us
--------------------------------------------------------------------------------
autograd profiler output (CUDA mode)
--------------------------------------------------------------------------------
top 15 events sorted by cpu_time_total
Because the autograd profiler uses the CUDA event API,
the CUDA time column reports approximately max(cuda_time, cpu_time).
Please ignore this output if your code does not use CUDA.
---------  ---------------  ---------------  ---------------  ---------------  ---------------
Name              CPU time        CUDA time            Calls        CPU total       CUDA total
---------  ---------------  ---------------  ---------------  ---------------  ---------------
stack        1348676.702us    1348687.500us                1    1348676.702us    1348687.500us
stack        1325784.279us    1325796.875us                1    1325784.279us    1325796.875us
stack        1301842.419us    1301843.750us                1    1301842.419us    1301843.750us
stack        1271585.903us    1271609.375us                1    1271585.903us    1271609.375us
stack        1269943.439us    1269953.125us                1    1269943.439us    1269953.125us
stack        1184606.802us    1184597.656us                1    1184606.802us    1184597.656us
stack        1176057.135us    1176062.500us                1    1176057.135us    1176062.500us
stack        1108025.533us    1108031.250us                1    1108025.533us    1108031.250us
stack        1095250.413us    1095257.812us                1    1095250.413us    1095257.812us
stack        1082371.450us    1082375.000us                1    1082371.450us    1082375.000us
stack        1080302.317us    1080312.500us                1    1080302.317us    1080312.500us
stack        1028030.105us    1028039.062us                1    1028030.105us    1028039.062us
stack        1015617.116us    1015625.000us                1    1015617.116us    1015625.000us
stack         861592.872us     861601.562us                1     861592.872us     861601.562us
stack         860586.499us     860593.750us                1     860586.499us     860593.750us
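Since the question asks for test code with a fully connected network, here is a small sketch that times a training step on the CPU and, if one is available, on the GPU. The layer sizes, batch size and the function name time_training_step are arbitrary choices of mine. CUDA kernels are launched asynchronously, so the sketch calls torch.cuda.synchronize() before reading the clock; without that the GPU timing would mostly measure launch overhead.

```python
import time
import torch
import torch.nn as nn

def time_training_step(device, steps=5, batch_size=256):
    """Average seconds per forward/backward/update step on the given device."""
    model = nn.Sequential(
        nn.Linear(512, 512), nn.ReLU(),
        nn.Linear(512, 10),
    ).to(device)
    x = torch.randn(batch_size, 512, device=device)
    y = torch.randint(0, 10, (batch_size,), device=device)
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
    loss_fn = nn.CrossEntropyLoss()

    def step():
        optimizer.zero_grad()
        loss_fn(model(x), y).backward()
        optimizer.step()

    step()                                  # warm-up, excluded from the timing
    if device.type == "cuda":
        torch.cuda.synchronize()            # wait for queued GPU work to finish
    start = time.perf_counter()
    for _ in range(steps):
        step()
    if device.type == "cuda":
        torch.cuda.synchronize()
    return (time.perf_counter() - start) / steps

timings = {"cpu": time_training_step(torch.device("cpu"))}
if torch.cuda.is_available():
    timings["cuda"] = time_training_step(torch.device("cuda"))

for name, seconds in timings.items():
    print('{}: {:.6f}s per step'.format(name, seconds))
```

For a network this small the GPU may well not be faster than the CPU, so it is worth increasing the layer sizes and batch_size before drawing conclusions.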
I found that the bottleneck is the DataLoader. I implemented my own DataLoader according to this code
for epoch in range(epochs):
    print (time1)
    for data in loader:
        print (time2)
        ....
time2-time1 is extremely large (~15s), while everything else inside the inner loop, including forward and backprop, takes <1s.
The initial step might take more time, as the workers will be spawned and start to create the next batch. Once you are inside the loop the following iterations should be faster.
Have a look at the ImageNet example to see, how to check the data loading time for the following iterations.
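A minimal sketch of that kind of measurement, loosely in the spirit of the ImageNet example but not copied from it: the clock is reset at the end of every iteration, so each entry is the time spent waiting for the next batch. The random TensorDataset and the list name data_times are placeholders.

```python
import time
import torch
from torch.utils.data import DataLoader, TensorDataset

# Placeholder dataset: 64 random samples with 16 features and a binary label.
dataset = TensorDataset(torch.randn(64, 16), torch.randint(0, 2, (64,)))
# num_workers=0 keeps the sketch runnable as a plain script on every platform;
# raise it in real code to load batches in background workers.
loader = DataLoader(dataset, batch_size=16, shuffle=True, num_workers=0)

data_times = []
end = time.perf_counter()
for data, target in loader:
    # Time spent waiting for this batch to arrive.
    data_times.append(time.perf_counter() - end)

    # ... forward pass, backward pass and optimizer step would go here ...

    # Restart the clock after the compute part of the iteration.
    end = time.perf_counter()

for step, seconds in enumerate(data_times):
    print('batch {}: waited {:.6f}s for data'.format(step, seconds))
```

If only the first entry is large and the rest are small, the cost is the one-off start-up described above rather than a steady data loading bottleneck.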
OK, thanks for the confirmation. The thing that bothers me is that Pytorch seems to run slower than Keras given the same dataset and similar network size if the batch_size is larger. However, it does have an advantage with smaller batch_size.
What kind of model and dataset are you using?
Also, how large is the speed difference?
So what happens if one of my tensors is on CPU and the other is on GPU, e.g. say I forgot to call .to(device) on everything, assuming I am doing:
use_cuda = torch.cuda.is_available()
device = torch.device("cuda" if use_cuda else "cpu")
would that tensor be moved to GPU or what would happen?
What is the best way to make sure everything is truly using the GPU? Do I need to worry that I might have forgotten to call .to(device) on something?
is that equivalent to:
use_cuda = torch.cuda.is_available()
device = torch.device("cuda" if use_cuda else "cpu")
model = MyModel()
model.to(device)
?
is there some sort of internal flag I can check to see if things are properly placed in GPU?
If some parameters are not located on the device while they are expected to, you’ll get an error.
To check all parameters, you could run something like:
for p in model.parameters():
print(p.device)
Note that this only checks the parameters, so you might also want to use it for buffers.
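One way to cover both in a single pass, as a sketch: collect the devices of all parameters and buffers into a set. The helper name devices_used and the small model are made up for this example; BatchNorm1d is included only because it registers buffers (its running statistics).

```python
import itertools
import torch
import torch.nn as nn

def devices_used(model):
    """Set of devices that hold the model's parameters and buffers."""
    tensors = itertools.chain(model.parameters(), model.buffers())
    return {t.device for t in tensors}

# Hypothetical model; BatchNorm1d keeps running_mean / running_var as buffers.
model = nn.Sequential(nn.Linear(4, 4), nn.BatchNorm1d(4))

found = devices_used(model)
print(found)
```

A set with more than one entry means the model is split across devices, which is worth a second look unless that was intended.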
It seems those checks are unnecessary, since your type checking disallows computing things where one tensor is on GPU and the other is on CPU. Right?
That’s correct. If you run your code and some operations are using tensors on the GPU and CPU, you’ll get an error.
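A small sketch to see this for yourself, assuming a CUDA device is present (without one the function just returns None). The function name try_mixed_device_add is made up. The tensors have more than one element on purpose, since zero-dimensional CPU tensors are treated like Python scalars and can be combined with CUDA tensors.

```python
import torch

def try_mixed_device_add():
    """Add a CPU tensor to a CUDA tensor and return the resulting error message.

    Returns None when no GPU is available, so the sketch still runs on CPU-only machines.
    """
    if not torch.cuda.is_available():
        return None
    cpu_tensor = torch.ones(3)                  # lives on the CPU
    gpu_tensor = torch.ones(3, device="cuda")   # lives on the GPU
    try:
        cpu_tensor + gpu_tensor
    except RuntimeError as err:
        return str(err)
    return "no error raised"

message = try_mixed_device_add()
print(message)
```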
If you are using Anaconda, right-click Anaconda Navigator, choose "Run with graphics processor", and pick the graphics processor you want to use. Then launch your IDE.
Once the IDE is open, run torch.cuda.is_available() to check.
If it returns True, it worked.