Issue with GPU training in cifar10_tutorial with PyTorch!

Dear sir,
I am new to this and ran into an issue with GPU training in cifar10_tutorial. The CPU training code works fine!

My environment:
pytorch 0.2.0 py27hc03bea1_4cu80 [cuda80] soumith
torchvision 0.1.9 py27hdb88a65_1 soumith

code:
In [30]: outputs = net.cuda(Variable(images))

I get:

RuntimeError Traceback (most recent call last)
in <module>()
----> 1 outputs = net.cuda(Variable(images))

/home/john/anaconda2/lib/python2.7/site-packages/torch/nn/modules/module.pyc in cuda(self, device_id)
    145         copied to that device
    146         """
--> 147         return self._apply(lambda t: t.cuda(device_id))
    148 
    149     def cpu(self):

/home/john/anaconda2/lib/python2.7/site-packages/torch/nn/modules/module.pyc in _apply(self, fn)
    116     def _apply(self, fn):
    117         for module in self.children():
--> 118             module._apply(fn)
    119 
    120         for param in self._parameters.values():

/home/john/anaconda2/lib/python2.7/site-packages/torch/nn/modules/module.pyc in _apply(self, fn)
    122                 # Variables stored in modules are graph leaves, and we don't
    123                 # want to create copy nodes, so we have to unpack the data.
--> 124                 param.data = fn(param.data)
    125                 if param._grad is not None:
    126                     param._grad.data = fn(param._grad.data)

/home/john/anaconda2/lib/python2.7/site-packages/torch/nn/modules/module.pyc in <lambda>(t)
    145         copied to that device
    146         """
--> 147         return self._apply(lambda t: t.cuda(device_id))
    148 
    149     def cpu(self):

/home/john/anaconda2/lib/python2.7/site-packages/torch/_utils.pyc in _cuda(self, device, async)
     51     if device is None:
     52         device = torch.cuda.current_device()
---> 53     if self.get_device() == device:
     54         return self
     55     else:

/home/john/anaconda2/lib/python2.7/site-packages/torch/autograd/variable.pyc in __bool__(self)
    121             return False
    122         raise RuntimeError("bool value of Variable objects containing non-empty " +
--> 123                            torch.typename(self.data) + " is ambiguous")
    124 
    125     __nonzero__ = __bool__

RuntimeError: bool value of Variable objects containing non-empty torch.ByteTensor is ambiguous

Has anybody tried the tutorial? Is this a known bug? How can I fix it?
Thanks
John

Can you provide a link to that tutorial?

My CIFAR-10 example is fully working on the GPU:
https://github.com/QuantScientist/Deep-Learning-Boot-Camp/blob/master/day%2002%20PyTORCH%20and%20PyCUDA/PyTorch/21-PyTorch-CIFAR-10-Custom-data-loader-from-scratch.ipynb

http://pytorch.org/tutorials/beginner/blitz/cifar10_tutorial.html
Maybe my code has a problem. Can anyone provide working code so I can test my GPU?

I just provided you with a link to my code above …

Thanks, and sorry, but how do I get it?
I used wget https://github.com/QuantScientist/Deep-Learning-Boot-Camp/blob/master/day%2002%20PyTORCH%20and%20PyCUDA/PyTorch/21-PyTorch-CIFAR-10-Custom-data-loader-from-scratch.ipynb

Then I tried to open it, and it claims the file is not in JSON format.

Where did you copy the line output = net.cuda(Variable(images)) from? net.cuda() takes an optional device id, not your input Variable, which is why the device comparison inside _cuda() raises that RuntimeError.
If you want to train your model on the GPU, you should follow this part of the tutorial.

net.cuda()
images = Variable(images.cuda())
output = net(images)
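
To run the whole training loop on the GPU, the labels have to be moved as well. Here is a minimal sketch of the modified loop, assuming the tutorial's Net, trainloader, criterion, and optimizer are already defined exactly as in the CPU version:

from torch.autograd import Variable

net = Net()
net.cuda()  # move all model parameters to the GPU

for epoch in range(2):
    for i, data in enumerate(trainloader, 0):
        inputs, labels = data
        # move the batch to the GPU before wrapping it in Variables
        inputs, labels = Variable(inputs.cuda()), Variable(labels.cuda())

        optimizer.zero_grad()

        # forward + backward + optimize (outputs are already on the GPU,
        # no extra .cuda() call is needed here)
        outputs = net(inputs)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()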

I get it, but this is not the same one. Thank you very much!

Hi Allenye0119, thank you very much for the reply, but that part is too brief.

I searched the internet but cannot find good documentation for it. I already did:
net = Net()
net.cuda()

# forward + backward + optimize

    outputs = net(inputs).cuda()  <== should I change this or leave it alone?

If you can, could you provide a guide on how to modify http://pytorch.org/tutorials/beginner/blitz/cifar10_tutorial.html so it can train on the GPU?
Thanks

Thanks everybody, I think I am starting to figure it out. I will follow QuantScientist (Solomon K)'s example. This case can be closed.

Dear Quant, I ran your notebook, but it takes forever. How do I know it is running successfully on my GPU?
How can I debug it?
I think my GPU stats show more than yours; is that a problem?

Every 0.1s: nvidia-settings -q GPUUtilization -q us…    Tue Sep 5 19:19:41 2017
Attribute 'GPUUtilization' (john-GS63VR-6RF:0[gpu:0]): graphics=3, memory=1, video=0, PCIe=1
Attribute 'UsedDedicatedGPUMemory' (john-GS63VR-6RF:0[gpu:0]): 264.
'UsedDedicatedGPUMemory' is an integer attribute.
'UsedDedicatedGPUMemory' is a read-only attribute.
'UsedDedicatedGPUMemory' can use the following target types: GPU.
[the rest of the repeated watch output was garbled in the paste]

The CUDA Trick cell takes a long time on my system and seems to run forever!
I have 01:00.0 VGA compatible controller: NVIDIA Corporation Device 1c20 (rev a1),
and I have tried NVIDIA drivers 375/384, all the same.
I use conda:
john@john-GS63VR-6RF ~ $ conda list | grep -i cuda
accelerate_cudalib 2.0 0
cuda80 1.0 0 soumith
cudatoolkit 7.0 1
pytorch 0.2.0 py27hc03bea1_4cu80 [cuda80] soumith
Why does the CUDA Trick take so long on my laptop?

My apologies, there was a line there that should have been commented out; that is why it ran forever.

use_cuda = torch.cuda.is_available()
# use_cuda = False

FloatTensor = torch.cuda.FloatTensor if use_cuda else torch.FloatTensor
LongTensor = torch.cuda.LongTensor if use_cuda else torch.LongTensor
Tensor = FloatTensor


# if torch.cuda.is_available():    
#     print("WARNING: You have a CUDA device, so you should probably run with --cuda")
    
# ! watch -n 1 nvidia-smi
# ! nvidia-smi -l 1

# nvidia-settings -q GPUUtilization -q useddedicatedgpumemory
# You can also use:
# ! watch -n0.1 "nvidia-settings -q GPUUtilization -q useddedicatedgpumemory"

# ! pip install git+https://github.com/wookayin/gpustat.git@master
    
# ! watch --color -n1.0 gpustat
# ! gpustat
# ! watch -n 5 nvidia-smi --format=csv --query-gpu=power.draw,utilization.gpu,fan.speed,temperature.gpu
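
A quick way to confirm that the flags above really put tensors on the GPU is something like this (a minimal sketch using standard torch calls, not part of the notebook itself):

import torch

use_cuda = torch.cuda.is_available()
FloatTensor = torch.cuda.FloatTensor if use_cuda else torch.FloatTensor

x = FloatTensor(4, 3).zero_()            # lives on the GPU when use_cuda is True
print(use_cuda)                          # True if a CUDA device is visible
print(x.is_cuda)                         # True when the tensor is actually on the GPU
if use_cuda:
    print(torch.cuda.current_device())   # index of the CUDA device in use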

I updated the code.

Dear Quant,
I have another issue: where can I get trainLabels.csv?
DATA_ROOT = '/home/john/Downloads/data'
IMG_PATH = DATA_ROOT + '/train/'
IMG_EXT = '.png'
IMG_DATA_LABELS = DATA_ROOT + '/trainLabels.csv'

I put the original training data in train:
john@john-GS63VR-6RF ~/Downloads/data/train $ ls -l
total 181876
-rw-r--r-- 1 john john      158 Mar 30  2009 batches.meta
-rw-r--r-- 1 john john 31035704 Mar 30  2009 data_batch_1
-rw-r--r-- 1 john john 31035320 Mar 30  2009 data_batch_2
-rw-r--r-- 1 john john 31035999 Mar 30  2009 data_batch_3
-rw-r--r-- 1 john john 31035696 Mar 30  2009 data_batch_4
-rw-r--r-- 1 john john 31035623 Mar 30  2009 data_batch_5
-rw-r--r-- 1 john john       88 Jun  4  2009 readme.html
-rw-r--r-- 1 john john 31035526 Mar 30  2009 test_batch

But where do I get trainLabels.csv?
It throws an error later on.
Thanks
John

Dear John,
At the top of the notebook it is stated:

For this to work:
Download the data from https://www.kaggle.com/c/cifar-10/data
Remove the header from the CSV BEFORE running this code
In the training images folder, copy 1.png to 0.png and add the same label for it inside the training labels file.

Did you get the data from Kaggle?
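
If it helps, here is a minimal sketch of that preparation, assuming the Kaggle archives were extracted so the training images sit in DATA_ROOT + '/train/' and trainLabels.csv has the usual id,label header (this is just an illustration, not code from the notebook):

import shutil

DATA_ROOT = '/home/john/Downloads/data'
LABELS_CSV = DATA_ROOT + '/trainLabels.csv'

# read the Kaggle labels file and drop the "id,label" header row
with open(LABELS_CSV) as f:
    lines = f.readlines()
rows = lines[1:]                         # everything after the header

# the first data row is "1,<label>"; reuse its label for the extra 0.png entry
first_label = rows[0].strip().split(',')[1]

with open(LABELS_CSV, 'w') as f:
    f.write('0,' + first_label + '\n')   # label for the copied 0.png
    f.writelines(rows)

# copy 1.png to 0.png inside the extracted training-image folder
shutil.copy(DATA_ROOT + '/train/1.png', DATA_ROOT + '/train/0.png')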