Better way to forward sparse matrix

I have a very sparse dataset stored as a scipy sparse csr_matrix, and it is too large to convert to a single dense numpy array. For now I can only extract part of it, convert that part to a numpy array and then to a tensor, and forward the tensor. But the csr_matrix-to-numpy-array step is still awfully time-consuming. I wonder whether there is a better way to feed the sparse matrix.

There seems to be experimental support for sparse matrices in PyTorch. I’ve never used them before but maybe this will be helpful - torch.sparse

EDIT: You might want to have a look at this discussion on GitHub regarding the state of sparse tensors in PyTorch.

Thank you for your timely reply. I read the torch.sparse section of the PyTorch documentation before posting but wasn’t aware of the GitHub discussion.

Right now I have a solution as below, which is quite fast:


import torch


def spy_sparse2torch_sparse(data):
    """Convert a scipy sparse CSR matrix to a sparse torch tensor.

    :param data: a scipy sparse csr_matrix
    :return: a sparse torch tensor
    """
    samples, features = data.shape
    # COO format exposes the (row, col) index pairs directly
    coo_data = data.tocoo()
    indices = torch.LongTensor([coo_data.row, coo_data.col])
    # Take the values from the COO matrix so they line up with the indices
    values = torch.from_numpy(coo_data.data).float()
    return torch.sparse.FloatTensor(indices, values, [samples, features])
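As a quick sanity check, the same conversion steps can be exercised on a tiny matrix. This is a sketch using the newer `torch.sparse_coo_tensor` constructor, which replaces the `torch.sparse.FloatTensor` constructor used in the function above in more recent PyTorch versions:

```python
import numpy as np
import torch
from scipy import sparse

# Tiny 2x3 CSR matrix: [[0, 1, 0], [2, 0, 3]]
csr = sparse.csr_matrix(np.array([[0., 1., 0.],
                                  [2., 0., 3.]]))

# Same steps as spy_sparse2torch_sparse, via the modern constructor
coo = csr.tocoo()
indices = torch.tensor(np.vstack([coo.row, coo.col]), dtype=torch.long)
t = torch.sparse_coo_tensor(indices, torch.from_numpy(coo.data).float(),
                            csr.shape)

# Densifying this tiny tensor recovers the original matrix
print(t.to_dense())
```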

But it is still not very helpful. When I call print(t[0]), it raises RuntimeError: Sparse tensors do not have strides. How should I extract a mini-batch from it, then?

Calling the .to_dense() method is not an option because it raises RuntimeError: $ Torch: not enough memory: you tried to allocate 141GB. Buy new RAM! at /pytorch/aten/src/TH/THGeneral.c:218

I’m able to reproduce the same error when I run similar code on PyTorch 0.4.1. I can’t help you out here. Maybe @albanD or @smth have some insights.

Hi,

That conversion should work.
At the moment you cannot access elements of sparse tensors by indexing, but you can access their indices and values directly.
You can also perform some pointwise operations on them and use them in matrix-matrix multiplications.
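To illustrate what is and isn't available, here is a small sketch using the current-API names (`torch.sparse_coo_tensor` and `torch.sparse.mm`; the thread itself predates these, so treat the exact calls as an assumption for newer PyTorch):

```python
import torch

# A tiny 3x4 sparse tensor with three non-zero entries
indices = torch.tensor([[0, 1, 2],   # row indices
                        [1, 0, 3]])  # column indices
values = torch.tensor([1.0, 2.0, 3.0])
t = torch.sparse_coo_tensor(indices, values, (3, 4)).coalesce()

# Indexing like t[0] raises "Sparse tensors do not have strides",
# but the indices and values are directly accessible:
print(t.indices())  # the 2 x nnz index matrix
print(t.values())   # the nnz values

# Sparse x dense matrix multiplication works:
dense = torch.ones(4, 2)
print(torch.sparse.mm(t, dense))
```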

What do you want to do with them? What are the operations your net needs to be able to do with them that are not available?

Thanks a lot!

I need to sample a mini-batch from the whole dataset, feed that mini-batch to a classifier, and update the classifier’s weights. If mini-batch sampling is possible, I can finish the task.

My PyTorch version is 0.4.0. If there is some mechanism to do that, it would be great.

Hi,

I am afraid functions like .index_select() are not available for sparse tensors at the moment, and you would need them to get a mini-batch from your dataset.
You could potentially keep your data as a scipy matrix, extract the mini-batch from the scipy matrix (I expect this is possible, but I don’t know), and then convert the mini-batch to a torch (sparse) tensor just before feeding it to your net.
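The suggestion above can be sketched as follows. The matrix here is a small random stand-in for the real dataset, and the batch size is arbitrary; the point is that CSR supports cheap row slicing, so only the mini-batch ever gets densified:

```python
import numpy as np
import torch
from scipy import sparse

# Stand-in for the large sparse dataset: a small random CSR matrix
rng = np.random.default_rng(0)
data = sparse.random(100, 50, density=0.05, format="csr", random_state=0)

# Sample a mini-batch of row indices and slice the scipy matrix;
# CSR row slicing is fast, so this stays cheap for huge datasets
batch_idx = rng.choice(data.shape[0], size=8, replace=False)
batch = data[batch_idx]          # still a scipy csr_matrix, 8 x 50

# Only the mini-batch is densified, so memory is bounded by the
# batch size rather than the full dataset
batch_tensor = torch.from_numpy(batch.toarray()).float()
print(batch_tensor.shape)  # torch.Size([8, 50])
```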


@11130 you could think of contributing to this thread - Sparse tensor use cases