Backward is too slow

Hi,
I have installed PyTorch 0.4.1 with CUDA 8 via conda.
The back-prop looks too slow.
Here is the report from torch.utils.bottleneck:


Here is my test code for the main training process (run in debug mode):

optimizer = optim.SGD([syn0, syn1], lr=alpha)
Lossfunc = nn.BCELoss(reduction='sum').cuda()

start1 = time.time()
for _ in range(100):
    word_in1 = torch.cuda.LongTensor(word_in)
    word_out1 = torch.cuda.LongTensor(word_out)
    label = torch.cuda.DoubleTensor(train_label)
    emb_u = nn.functional.embedding(word_in1, syn0)
    emb_v = nn.functional.embedding(word_out1, syn1)
    outs = torch.sigmoid(torch.sum(torch.mul(emb_u, emb_v), dim=-1))
    loss = Lossfunc(outs, label)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
print(time.time() - start1)

I found that the three optimizer calls (.zero_grad(), .backward(), .step()) occupy most of the time.
So what should I do next?

Hi,

Why do you think it is too slow?
Running the backward should take between 1x and 2x as long as the forward pass.
Then the gradient step depends on the size of your weights.

If your Embedding layers are very large compared to the rest of the net, you can do sparse updates by using sparse=True for them (see the docs), and by using an optimizer that supports sparse updates like SGD or SparseAdam.
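
For illustration, a minimal sketch of sparse embedding updates (the sizes and names here are made up, not taken from the code above):

import torch
import torch.nn as nn
import torch.optim as optim

vocab_size, dim, batch = 10000, 100, 32  # hypothetical sizes

# sparse=True makes the embedding emit a sparse gradient,
# so only the rows that were actually looked up get updated.
emb = nn.Embedding(vocab_size, dim, sparse=True)

# SGD (without momentum) and SparseAdam both accept sparse gradients.
opt = optim.SGD(emb.parameters(), lr=0.025)

idx = torch.randint(0, vocab_size, (batch,))
loss = emb(idx).sum()

opt.zero_grad()
loss.backward()   # emb.weight.grad is a sparse tensor here
opt.step()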

Thank you for your reply.
The reason I think it’s slow is that the same training process takes two minutes with numpy on the CPU, but an hour with PyTorch on the GPU.
Here is my numpy code:

z = np.dot(syn0[context_word], syn1[word_out].T)
p = expit(z)
g = alpha * (label - p)
neu1e = syn1[x_]
syn1[x_] += np.outer(g, syn0[context_word])
syn0[context_word] += np.dot(g, neu1e)

There must be something wrong with my pytorch code.

According to my test, the backward pass takes about 170x as long as the forward pass: 5.8 s vs 0.035 s over 100 iterations.
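
For reference: CUDA kernels run asynchronously, so a plain time.time() around the forward or backward call can attribute time to the wrong phase. A rough sketch of how the two could be compared with explicit synchronization (sizes are placeholders):

import time
import torch

def timed(fn, iters=100):
    # Flush previously queued CUDA work so it doesn't leak into this measurement.
    torch.cuda.synchronize()
    start = time.time()
    for _ in range(iters):
        fn()
    torch.cuda.synchronize()
    return time.time() - start

syn0 = torch.randn(2829, 100, requires_grad=True, device='cuda')
idx = torch.randint(0, 2829, (32,), device='cuda')

def forward():
    return torch.nn.functional.embedding(idx, syn0).sum()

print('forward only      :', timed(lambda: forward()))
print('forward + backward:', timed(lambda: forward().backward()))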

Hi,

Could you send me a full script that runs with all the sizes being the ones you use, replacing your data with random tensors?
I can guess two possible causes here:

  • The graph expands across iterations and thus the traversal for the backward becomes dead slow. I need the full running code to check that.
  • Given the simplicity of your graph and the fact that it’s mainly Embedding from what I can see, you might be hitting a very bad worst case. Here again, try sparse=True for the embedding; it was made for that purpose.

Thank you very much!
Here is my code:

import torch.optim as optim
import torch
import torch.nn as nn
import numpy as np
import os
import time
if __name__ == '__main__':
    os.environ['CUDA_VISIBLE_DEVICES'] = '0'
    syn0 = torch.randn((2829, 100), requires_grad=True, device='cuda')
    syn1 = torch.randn((2829, 100), requires_grad=True, device='cuda')
    optimizer = optim.SGD([syn0, syn1], lr=0.025)
    Lossfunc = nn.BCELoss(reduction='sum').cuda()
    start1 = time.time()
    for index, _ in enumerate(range(40000)):
        word_in = np.random.randint(low=0, high=2829, size=32)
        word_out = np.random.randint(low=0, high=2829, size=32)
        if index % 10000 == 0:
            print('%d of 40000 (%.2f%%)' % (index, index / 400.0))
        word_in1 = torch.cuda.LongTensor(word_in)
        word_out1 = torch.cuda.LongTensor(word_out)
        label = torch.cuda.FloatTensor([1] + [0]*31)
        emb_u = nn.functional.embedding(word_in1, syn0, sparse=True)
        emb_v = nn.functional.embedding(word_out1, syn1, sparse=True)
        outs = torch.sigmoid(torch.sum(torch.mul(emb_u, emb_v), dim=-1))
        loss = Lossfunc(outs, label)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    print(time.time() - start1)

When I replace my data with random tensors, the backward pass becomes about 2x slower than the forward pass, just like you said.

But it’s still very slow: it took about 1 minute to run 40000 iterations on a Tesla M40. Is there any mistake in my code? If not, can you give me some suggestions to speed it up? Should I use torch.set_num_threads() or torch.multiprocessing?

Thanks again!

Hi,

Some observations:

  • My first point does not happen, so no problem on that side.
  • sparse=True makes the performance slightly worse here. This is expected, as your embedding is not that big.
  • In my tests, I moved all the data generation out of the loop, but that does not change much.
  • Here each forward-backward-update takes <1ms on my machine. I don’t think you can expect it to be much faster; it’s just that the outer loop is very long. Try removing the Python if statement in your loop and you will actually see the difference in runtime.
  • The GPU usage is actually quite low; increasing the batch size to 128 still gives me a runtime of <1ms per iteration. So if you want this to run faster, increase the batch size (see the sketch after this list).
  • torch.set_num_threads will only change CPU core usage for heavy operations, but you don’t do any such operation on the CPU here.
  • torch.multiprocessing would allow you to do multi-CPU/multi-GPU, but you can’t even fully use the single GPU you already have, so there is little hope of improving things on that side.
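
As a rough sketch of what "bigger batches, less Python overhead" could look like (the batch size of 1024 and the pre-generated random indices are arbitrary choices, not a recommendation for your real data):

import torch
import torch.nn as nn
import torch.optim as optim

batch = 1024                      # much larger batch, arbitrary choice
steps = 40000 // (batch // 32)    # same total number of samples as before

syn0 = torch.randn(2829, 100, requires_grad=True, device='cuda')
syn1 = torch.randn(2829, 100, requires_grad=True, device='cuda')
optimizer = optim.SGD([syn0, syn1], lr=0.025)
lossfunc = nn.BCELoss(reduction='sum')

# Pre-generate all indices and the labels outside the loop.
word_in = torch.randint(0, 2829, (steps, batch), device='cuda')
word_out = torch.randint(0, 2829, (steps, batch), device='cuda')
label = torch.zeros(batch, device='cuda')
label[0] = 1

for i in range(steps):
    emb_u = nn.functional.embedding(word_in[i], syn0)
    emb_v = nn.functional.embedding(word_out[i], syn1)
    outs = torch.sigmoid((emb_u * emb_v).sum(dim=-1))
    loss = lossfunc(outs, label)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()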

Thank you! Your advice is very helpful to me.

The GPU usage is indeed low, but when I switch to the CPU and remove all the “.cuda” calls, it becomes about 50x slower. Why?

Well, because even with low utilization the GPU is much faster than the CPU, especially for such ops.

But the following numpy code runs much faster for the same training process, using only the CPU (4 s vs 65 s):

syn0 = np.random.uniform(low=-0.5/100, high=0.5/100, size=(2829, 100))
syn1 = np.zeros(shape=(2829, 100))
start1 = time.time()
for index, _ in enumerate(range(40000)):
    x_ = np.random.randint(low=0, high=2829, size=32)
    if index % 10000 == 0:
        print('%d of 40000 (%.2f%%)' % (index, index / 400.0))
    context_word = np.random.randint(low=0, high=2829, size=1)[0]
    label = np.array([1] + [0]*5)
    z = np.dot(syn0[context_word], syn1[x_].T)
    p = expit(z)
    g = 0.025 * (label - p)
    neu1e = syn1[x_]
    syn1[x_] += np.outer(g, syn0[context_word])
    syn0[context_word] += np.dot(g, neu1e)
print(time.time() - start1)

I want to use the autograd functionality provided by PyTorch, but it slows things down severely.
I tried increasing the batch size, and the performance drops correspondingly.

Hi,

It is expected that there is some overhead from the autograd engine, especially for such a small graph, but it looks like a bit too much in this case.
I’m not super fluent in numpy code, but it looks like:

  • Your context word is of size 1, while in the pytorch code it’s of size batch_size=32.
  • Your label is of size 6, while in the pytorch code it’s of size 32.
  • What is the expit function doing?
  • Have you tried replacing each op in your numpy code with its torch counterpart (see the sketch after this list)? This should give a similar runtime on CPU and a speedup on GPU if the ops are big enough.
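
For illustration, a rough sketch of the manual numpy update rewritten with torch ops and no autograd (shapes made consistent here: one context word, 32 output words, a label of size 32; expit is assumed to be scipy.special.expit, the logistic sigmoid, so torch.sigmoid takes its place):

import torch

alpha = 0.025
syn0 = torch.empty(2829, 100, device='cuda').uniform_(-0.005, 0.005)
syn1 = torch.zeros(2829, 100, device='cuda')

x_ = torch.randint(0, 2829, (32,), device='cuda')
context_word = int(torch.randint(0, 2829, (1,)))
label = torch.zeros(32, device='cuda')
label[0] = 1

z = syn1[x_].mv(syn0[context_word])            # np.dot(syn0[cw], syn1[x_].T)
p = torch.sigmoid(z)                           # expit(z)
g = alpha * (label - p)
neu1e = syn1[x_]                               # a copy, like numpy fancy indexing
syn1[x_] += torch.ger(g, syn0[context_word])   # np.outer(g, syn0[cw])
syn0[context_word] += g.matmul(neu1e)          # np.dot(g, neu1e)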

Hi~ I also encountered this problem. My code is like this:

# fd_prob: [batch_size, tgt_len, vocab_size], means words probability distribution
# bd_hyp: [batch_size, infer_len], means another output labels
# fd_bd_attn: [batch_size, tgt_len, infer_len], means edit probability distribution
# fd_p_gen: [batch_size, tgt_len, 1], means copy mode probability
batch_size, tgt_len, _ = fd_prob.size()
_, infer_len = bd_hyp.size()
# incorporate copy mode
for i in range(batch_size):
    for j in range(tgt_len):
        for k in range(infer_len):
            fd_prob[i][j][bd_hyp[i][k]] += (1 - fd_p_gen[i][j][0]) * fd_bd_attn[i][j][k]
loss = criterion(fd_prob, ground_truth)
loss.backward()

I modified the output distribution with a 3-level nested loop containing lots of indexing, and I found that the forward pass takes less than 1 s while the backward takes more than 1.5 min, which is unacceptable. I think it’s due to all the indexing. Have you found a solution, or is there an elegant way to do this?
Also see: Indexing is very slow for backpropagation

Hi,

This is expected. Each operation that you do adds a node to the computational graph. You’re creating a huge graph here, so the backward pass is going to be very slow. You will need to parallelize your operations using builtin functions and/or masking.
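
For illustration, one way the triple loop above could be collapsed into a single call (a sketch with invented shapes, not the poster’s actual fix): scatter_add over the vocabulary dimension does out[i, j, index[i, j, k]] += src[i, j, k] in one vectorized op.

import torch

# Invented shapes, just to make the sketch runnable.
batch_size, tgt_len, infer_len, vocab_size = 4, 7, 5, 100

fd_prob    = torch.rand(batch_size, tgt_len, vocab_size, requires_grad=True)
fd_p_gen   = torch.rand(batch_size, tgt_len, 1)
fd_bd_attn = torch.rand(batch_size, tgt_len, infer_len)
bd_hyp     = torch.randint(0, vocab_size, (batch_size, infer_len))

# Copy-mode mass to add for every (i, j, k) triple.
copy_mass = (1 - fd_p_gen) * fd_bd_attn                            # [B, T, K]

# Broadcast the hypothesis indices over the tgt_len dimension.
index = bd_hyp.unsqueeze(1).expand(-1, tgt_len, -1).contiguous()   # [B, T, K]

# One out-of-place scatter_add replaces the Python triple loop:
# out[i, j, index[i, j, k]] += copy_mass[i, j, k]
out = fd_prob.scatter_add(2, index, copy_mass)
loss = out.sum()   # stand-in for criterion(out, ground_truth)
loss.backward()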

Thank you for your quick reply. I fixed the problem by using scatter_add_ for this operation.

Could you provide your modified code? I have met the same problem.

I have met the same problem too. Could you please provide the modified code?