Performance of torch.sparse

Hi, I am new to PyTorch. I recently implemented the SDNE model in PyTorch, but I got some weird results.
The input is one row of a graph’s adjacency list. Here I post only the model and the strange results.

import torch
import torch.sparse as ts
import torch.nn as nn
import numpy as np
import time

class SDNE(nn.Module):

    def __init__(self,  num_units, k, d):
        super(SDNE, self).__init__()

        # auto_encoder
        auto_encoder = list()
        auto_encoder.append(nn.Linear(num_units[0], num_units[1]))

        for i in np.arange(1, k):
            auto_encoder.append(nn.Linear(num_units[i], num_units[i+1]))
        auto_encoder.append(nn.Linear(num_units[k], d))

        self.auto_encoder = nn.Sequential(*auto_encoder)

        # auto_decoder
        auto_decoder = list()
        auto_decoder.append(nn.Linear(d, num_units[k]))
        for i in np.arange(0, k):
            auto_decoder.append(nn.Linear(num_units[k - i], num_units[k-i-1]))

        # the encoder was already wrapped in nn.Sequential above
        self.auto_decoder = nn.Sequential(*auto_decoder)

    def forward(self, x):
        start = time.time()
        y = self.auto_encoder(x)

        end_time = time.time()
        print("encoder time : " + str(time.time() - start))
        x_hat = self.auto_decoder(y)

        print("decoder time: " + str(time.time() - end_time))
        return  y

This model has 5 layers. The number of features in each layer is 4841716, 48, 20, 48, 4841716.
When I feed it dense tensors with about 5 million elements, the result is:

encoder time : 0.14040541648864746
decoder time: 0.007679462432861328
encoder time : 0.042407989501953125
decoder time: 0.004119396209716797
backward time = 0.11002826690673828
encoder time : 0.037224769592285156
decoder time: 0.00419163703918457
encoder time : 0.0371246337890625
decoder time: 0.004086017608642578
backward time = 0.1155850887298584
encoder time : 0.037232398986816406
decoder time: 0.004139423370361328
encoder time : 0.03712105751037598
decoder time: 0.0040395259857177734
backward time = 0.11581850051879883

Then I feed it a sparse input that has only 100 non-zero elements but the same size as before, and I get the following results:

encoder time is 0.113931
decoder time is 0.001139
encoder time is 0.046264
decoder time is 0.000139

My questions are:

  1. Why does the first round of training always take more time than the rest?
  2. Sparse input should perform better because it skips unnecessary operations, but the results don’t show that. Why?

Thanks to anyone who can reply.


  1. This is not worrying and could be caused by a few things. The first CUDA op is always going to be slow because we lazily initialize CUDA only when it is needed. Also, since you don’t sync before entering the forward, there might be some ops from before your main loop that are not completed (large data transfers or preprocessing).
  2. The thing is that GPUs are really (really) good at number crunching and really bad at logic. Working with sparse tensors turns a lot of things into much more logic than just number crunching. Unless your matrix is very, very sparse (<1% non-zero values), there is no chance that the sparse version is going to be faster.
    Note that this is true if you have large enough workloads. For small ones, everything is simply going to be hidden by the cost of launching the task on the GPU, which is non-negligible.
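To make point 1 concrete, here is a minimal sketch of timing with a warm-up call and explicit synchronization. The helper name `timed` and the small layer sizes are made up for illustration and are not taken from the model above; the code falls back to the CPU when no GPU is available.

```python
import time
import torch

def timed(fn, *args, warmup=1, repeats=3):
    """Return (last result, average seconds per call) for fn(*args)."""
    use_cuda = torch.cuda.is_available()
    # Warm-up calls absorb one-time costs such as lazy CUDA initialization.
    for _ in range(warmup):
        fn(*args)
    if use_cuda:
        # Wait for all queued GPU work before starting the clock.
        torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(repeats):
        out = fn(*args)
    if use_cuda:
        # Wait for the timed kernels to actually finish.
        torch.cuda.synchronize()
    return out, (time.perf_counter() - start) / repeats

device = "cuda" if torch.cuda.is_available() else "cpu"
layer = torch.nn.Linear(1000, 48).to(device)   # illustrative sizes only
x = torch.randn(8, 1000, device=device)
y, seconds = timed(layer, x)
print("forward time:", seconds)
```

Without the two `torch.cuda.synchronize()` calls, `time.time()` mostly measures how long it takes to queue the kernels, not how long they run.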

I think I get your point. By the way, is the workload in this case still too small?

It is hard to give a definite answer on that. It will depend a lot on the GPU you have and the ops you’re doing.
Linear layers going from 48 to 20 dimensions definitely are too small (but the cost of moving everything back to the CPU just to do this op is even higher).
A Linear layer with 5M inputs should be OK for most recent GPUs.
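Since it depends so much on the hardware, the simplest check is probably to measure both variants yourself. Below is a rough sketch, assuming a single input row with 100 non-zero entries multiplied by a dense weight matrix; `n_features` is kept much smaller than 5M so it runs anywhere, and the names `average_seconds`, `n_features`, `n_hidden` and `nnz` are only illustrative.

```python
import time
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"

def average_seconds(fn, repeats=5):
    """Average wall-clock seconds per call, with warm-up and GPU sync."""
    fn()  # warm-up call
    if device == "cuda":
        torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(repeats):
        fn()
    if device == "cuda":
        torch.cuda.synchronize()
    return (time.perf_counter() - start) / repeats

n_features, n_hidden, nnz = 100_000, 48, 100  # illustrative sizes

# Build one sparse row with nnz distinct non-zero columns.
cols = torch.randperm(n_features)[:nnz]
indices = torch.stack([torch.zeros(nnz, dtype=torch.long), cols])
values = torch.randn(nnz)
x_sparse = torch.sparse_coo_tensor(indices, values, (1, n_features)).coalesce().to(device)
x_dense = x_sparse.to_dense()
weight = torch.randn(n_features, n_hidden, device=device)

# Both paths compute the same product; only the input layout differs.
out_dense = x_dense @ weight
out_sparse = torch.sparse.mm(x_sparse, weight)

dense_s = average_seconds(lambda: x_dense @ weight)
sparse_s = average_seconds(lambda: torch.sparse.mm(x_sparse, weight))
print("dense :", dense_s)
print("sparse:", sparse_s)
```

Rerunning this with different values of `n_features` and `nnz` on your own GPU should show where (or whether) the sparse path starts to pay off.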

Thanks. It helps a lot. The GPUs I am using are Nvidia P100 and K20m.