Some variables are not affected by .cuda()

Hey there!

I’m starting out with PyTorch, so I wanted to implement a neural language model.

Everything was going OK until I started facing problems when trying to work with the GPU.

In fact, I have a typical model that embeds the input, runs an RNN (LSTM), then applies an output projection xW+b followed by a softmax.

My model looks like this:

    import torch
    import torch.nn as nn
    import torch.nn.functional as F
    from torch import autograd

    class RnnLm(nn.Module):
        def __init__(self, params):
            super().__init__()
            self.params = params
            
            self.embedding = nn.Embedding(num_embeddings=params.vocab_size, 
                                          embedding_dim=params.embed_dim)
            
            self.cell = nn.LSTM(input_size=params.embed_dim, 
                                hidden_size=params.hidden_size,
                                batch_first=True)
            
            self.out_w = autograd.Variable(torch.randn(params.hidden_size, params.vocab_size))
            self.out_b = autograd.Variable(torch.randn(params.vocab_size))
        
        def _embed_data(self, src):
            """Embeds a list of words 
            """
            src_var = autograd.Variable(src)
            embedded = self.embedding(src_var)
            return embedded
            
        def forward(self, inputs):
            # inputs: LongTensor [batch_size x time_steps]
            # emb_inputs: [bs x ts x emb_size]
            emb_inputs = self._embed_data(inputs) 
            log("Input: %s ; Embedded: %s "% (str(inputs.size()), str(emb_inputs.size())))
            

            # Running the RNN
            # o: [bs x ts x h_size]
            # h: [n_layers x bs x h_size]
            # c: [n_layers x bs x h_size]
            o, (h, c) = self.cell(emb_inputs)
            o = o.contiguous()
            self.o = o
            log("Outputs: %s" % str(o.size()))
            log("h %s" % str(h.size()))
            log("c %s" % str(c.size()))
            
            
            # Output projection
            # oo: [bs*ts x h_size]
            # logits: [bs*ts x vocab_size]
            oo = o.view(-1, self.params.hidden_size)
            
            logits = oo @ self.out_w
            logits = logits + self.out_b.expand_as(logits)
            
            # Softmax
            prediction = F.log_softmax(logits)
            
            return prediction

The whole code can be seen here: https://github.com/pltrdy/pytorchwork/blob/master/rnn_lm.ipynb (it’s quite experimental, i.e. messy).
Trying to work with the GPU, I create a “model” object and then call model.cuda().
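Roughly what I’m doing (a simplified sketch; params and batch stand in for what the notebook actually builds):

    model = RnnLm(params)
    model.cuda()               # moves the registered sub-modules (embedding, LSTM) to the GPU
    batch = batch.cuda()       # the input LongTensor is moved explicitly
    prediction = model(batch)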
The problem then comes from out_w and out_b, which are not CUDA tensors:

print("data type: oo: %s; out_w: %s" % (str(type(oo.data)), str(type(self.out_w.data))))

Returns:

    data type: oo: <class 'torch.cuda.FloatTensor'>; out_w: <class 'torch.FloatTensor'>

oo's type is OK, but out_w should be a torch.cuda.FloatTensor too.

Obviously, I could add some .cuda() calls for out_w and out_b in RnnLm.__init__, but that’s fixing the problem without learning anything.
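i.e. something like this (a sketch of the quick fix I mean, inside __init__):

    # Quick fix: build the weights directly as CUDA tensors when the model is created.
    self.out_w = autograd.Variable(torch.randn(params.hidden_size, params.vocab_size).cuda())
    self.out_b = autograd.Variable(torch.randn(params.vocab_size).cuda())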

Thanks for any help or suggestions.

If out_w and out_b are parameters of your layer, you should declare them as nn.Parameter rather than autograd.Variable. Making this change will make nn behave as expected with respect to sending the weights to CUDA.
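For example, a minimal sketch of just the two lines in __init__ that would change (keeping everything else as posted above):

    # Inside RnnLm.__init__: registering the weights as nn.Parameter makes them
    # part of the module, so model.cuda(), model.parameters(), the optimizer,
    # and state_dict() all see them.
    self.out_w = nn.Parameter(torch.randn(params.hidden_size, params.vocab_size))
    self.out_b = nn.Parameter(torch.randn(params.vocab_size))

An nn.Parameter is a Variable that the module registers automatically when it is assigned as an attribute, which is why .cuda() then moves it along with the embedding and LSTM weights.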

Oh, OK. I had just never seen it :confused:
Anyway, thank you for such a fast answer to a dummy question :slight_smile:

The first error is solved, but now I have a segfault. It may be related to .contiguous() (note that I don’t need .contiguous() when I don’t use CUDA).

Note as well that my data initially comes from NumPy.
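For context, the conversion is roughly this (a simplified sketch; the real shapes and values come from the corpus):

    import numpy as np
    import torch

    arr = np.random.randint(0, 100, size=(4, 7))  # placeholder integer array
    batch = torch.from_numpy(arr)                 # LongTensor sharing memory with arr
    batch = batch.cuda()                          # copied onto the GPU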

A segfault? That’s unexpected.
Do you have a stack trace that would give more information about where it occurs?

Well, a segfault doesn’t print a Python traceback. At least, by putting a print before each instruction (that’s cheap, I know), I found that it occurs at the .backward() call.

Interestingly, some iterations work fine, then at some point it segfaults.


Edit: also note that the size of the output looks constant, i.e. it is always the same iteration that fails.

I was wondering if you could use gdb to try to get more information:
Run gdb --args python your_script.py --your args.
After it has started, type run.
Once it stops due to the segfault, type bt and paste here the backtrace that it prints.

Nice, I was looking for the --args python trick; I didn’t know about it.

The output is:

Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7fffad893700 (LWP 17317)]
torch::autograd::GradBuffer::addGrad(unsigned long, std::shared_ptr<torch::autograd::Variable>&&) (
    this=this@entry=0x7fffad892c40, pos=pos@entry=0, 
    var=var@entry=<unknown type in /home/moses/anaconda3/lib/python3.6/site-packages/torch/_C.cpython-36m-x86_64-linux-gnu.so, CU 0x6e92e4, DIE 0x6fcb24>) at torch/csrc/autograd/grad_buffer.cpp:17
17	torch/csrc/autograd/grad_buffer.cpp: No such file or directory.
(gdb) bt
#0  torch::autograd::GradBuffer::addGrad(unsigned long, std::shared_ptr<torch::autograd::Variable>&&) (
    this=this@entry=0x7fffad892c40, pos=pos@entry=0, 
    var=var@entry=<unknown type in /home/moses/anaconda3/lib/python3.6/site-packages/torch/_C.cpython-36m-x86_64-linux-gnu.so, CU 0x6e92e4, DIE 0x6fcb24>) at torch/csrc/autograd/grad_buffer.cpp:17
#1  0x00007fffed45c9a1 in torch::autograd::Engine::evaluate_function (this=this@entry=0x7fffedcd7ce0 <engine>, 
    task=...) from /home/<user>/anaconda3/lib/python3.6/site-packages/torch/_C.cpython-36m-x86_64-linux-gnu.so
#2  0x00007fffed45dd1a in torch::autograd::Engine::thread_main (this=this@entry=0x7fffedcd7ce0 <engine>, queue=...)
   from /home/<user>/anaconda3/lib/python3.6/site-packages/torch/_C.cpython-36m-x86_64-linux-gnu.so
#3  0x00007fffed46e87a in PythonEngine::thread_main (this=0x7fffedcd7ce0 <engine>, queue=...)
   from /home/<user>/anaconda3/lib/python3.6/site-packages/torch/_C.cpython-36m-x86_64-linux-gnu.so
#4  0x00007ffff652d870 in ?? () from /home/<user>/anaconda3/bin/../lib/libstdc++.so.6
#5  0x00007ffff7474184 in start_thread (arg=0x7fffad893700) at pthread_create.c:312
#6  0x00007ffff688cbed in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111
(gdb) 

Thanks for your help!