Optimizing diagonal stripe code

I need to extract a diagonal stripe from a matrix. Say I have a matrix of size KxN, where K and N are arbitrary and K > N. Given a matrix:

[[ 0  1  2]
 [ 3  4  5]
 [ 6  7  8]
 [ 9 10 11]]

From it I need to extract a diagonal stripe, i.e. a smaller matrix created by truncating the original one (x marks the entries that are dropped):

[[ 0  x  x]
 [ 3  4  x]
 [ x  7  8]
 [ x  x  11]]

So the result matrix is:

[[ 0  4  8]
 [ 3  7  11]]

Now, an additional problem that I face is that I actually have a tensor of, say, size [150, 182, 91]. The first dimension is just the batch size, and the matrix I am interested in is the 182x91 one (the sizes here are just examples).

I need to run a function on the 182x91 matrix for each of the 150 batch entries separately.

Well, I do have a solution for this, but the code makes my whole model at least 20 times slower (credits for the code to layog from StackOverflow). Here is the piece of code that does it in PyTorch 0.4 (CUDA 9.1):

In [1]: import torch

In [2]: def stripe(a):
   ...:     i, j = a.size()
   ...:     assert(i > j)
   ...:     out = torch.zeros((i - j, j))
   ...:     # this is probably the bottleneck part
   ...:     for diag in range(0, i - j):
   ...:         out[diag] = torch.diag(a, -diag)
   ...:     return out

In [3]: a = torch.randn((182, 91)).cuda()

In [5]: output = stripe(a)

In [6]: output.size()
Out[6]: torch.Size([91, 91])

In [7]: a = torch.randn((150, 182, 91))
# we map the stripe function over the first dimension of the tensor using torch.unbind
# this is a potential bottleneck
In [8]: output = list(map(stripe, torch.unbind(a, 0)))

In [9]: output = torch.stack(output, 0)

In [10]: output.size()
Out[10]: torch.Size([150, 91, 91])

Any potential for optimization here?

For reference, if that helps, here is the same implementation done in numpy:

>>> import numpy as np
>>> 
>>> def stripe(a):
...    a = np.asanyarray(a)
...    i, j = a.shape
...    assert i >= j
...    k, l = a.strides
...    return np.lib.stride_tricks.as_strided(a, (i-j+1, j), (k, k+l))
... 
>>> a = np.arange(24).reshape(6, 4)
>>> a
array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11],
       [12, 13, 14, 15],
       [16, 17, 18, 19],
       [20, 21, 22, 23]])
>>> stripe(a)
array([[ 0,  5, 10, 15],
       [ 4,  9, 14, 19],
       [ 8, 13, 18, 23]])

Hi,

You can use the torch.as_strided() function, which has the same semantics as the numpy one.
So your numpy stripe implementation should work for torch tensors as well.

So I would use it instead of the for loop in stripe? How do I get the strides then?
Thank you!

To get the strides, use the stride function: t.stride().
In your original post, you said that the numpy implementation you provided is what you want. You can implement this exact function in PyTorch by using t.size() (to replace a.shape), t.stride() (to replace a.strides) and torch.as_strided() (to replace np.lib.stride_tricks.as_strided()). Then you can use it the same way you would use the numpy version.
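For example, a direct port might look like this (just a sketch, assuming a contiguous 2-D input; note that PyTorch strides are counted in elements rather than bytes, but the arithmetic stays the same):

import torch

def stripe(a):
    # same shape/stride arithmetic as the numpy version
    i, j = a.size()      # replaces a.shape
    assert i >= j
    k, l = a.stride()    # replaces a.strides
    return torch.as_strided(a, (i - j + 1, j), (k, k + l))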

Thank you. I have changed the stripe function and it returns a proper stripe:

def stripe(a):
    i, j = a.size()
    assert i >= j
    k, l = a.stride()
    return torch.as_strided(a, (i - j, j), (k, k+1))

a = torch.randn((182, 91)).cuda()

output = stripe(a)
# output.size()
# torch.Size([91, 91])

a = torch.randn((150, 182, 91))
output = list(map(stripe, torch.unbind(a, 0)))
output = torch.stack(output, 0)

# output.size()
# torch.Size([150, 91, 91])

Now I am facing some obscure PyTorch 0.4 error after using that stripe function, raised when the model computes the backward pass on the loss:

Traceback (most recent call last):
  File "runner.py", line 305, in <module>
    main()
  File "runner.py", line 249, in main
    loss = model.update(batch)
  File "J:\PyCharmProjects\tac-self-attention\model\rnn.py", line 67, in update
    loss.backward()
  File "J:\Anaconda_Python3_6\envs\cuda2\lib\site-packages\torch\tensor.py", line 93, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "J:\Anaconda_Python3_6\envs\cuda2\lib\site-packages\torch\autograd\__init__.py", line 89, in backward
    allow_unreachable=True)  # allow_unreachable flag
RuntimeError: Tensor: invalid storage offset at c:\programdata\miniconda3\conda-bld\pytorch_1524546371102\work\aten\src\thc\generic/THCTensor.c:759

This doesn’t occur when using the version with .diag(). Any ideas what causes it?

Hi,

I’m afraid the combined use of unbind, your stripe function and stack makes some storage offset become negative.
Why not use a batch stripe version instead (which should be much faster as well)?

def batch_stripe(a):
    b, i, j = a.size()
    assert i >= j
    b_s, k, l = a.stride()
    return torch.as_strided(a, (b, i - j, j), (b_s, k, k+1))
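For instance, with the sizes from your example, a quick sanity check might look like:

a = torch.randn(150, 182, 91).cuda()
output = batch_stripe(a)
# output.size() should be torch.Size([150, 91, 91]), the same result as the
# unbind/stack version, but as a single strided view instead of 150 separate calls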

Yes, exactly what I was looking for from the very beginning. Thanks a lot! Using your batch_stripe() function brings back the fast training speed I had before using stack and diag.

I’ve got another question: is it possible, instead of getting:

[[ 0  x  x]
 [ 3  4  x]
 [ x  7  8]
 [ x  x  11]]

to get a reverse stripe like this:

[[ x   x   2]
 [ x   4   5]
 [ 6   7   x]
 [ 9   x   x]]

I can’t figure out how to properly modify the function above. Thank you!

I guess you could, by shifting the beginning of a (linearize the last two dimensions and narrow), and then setting the strides to (k, k-1) for the last two dimensions.

Thank you, could you elaborate more? I don’t understand how that would be done.

I have a potential implementation in numpy (but I guess it is different from what you suggested):

>>> def reverse_stripe(a):
...     a = np.asanyarray(a)
...     i, j = a.shape
...     assert i >= j
...     k, l = a.strides
...     return np.lib.stride_tricks.as_strided(a[j:], (i-j, j), (k, l-k))
... 
>>> a = np.arange(24).reshape(6, 4)
>>> reverse_stripe(a)
array([[12,  9,  6,  3],
       [16, 13, 10,  7],
       [20, 17, 14, 11]])

If I port that directly to PyTorch with:

def batch_stripe(a):
    b, i, j = a.size()
    assert i > j
    b_s, k, l = a.stride()

    # left top to right bottom
    # return torch.as_strided(a, (b, i - j, j), (b_s, k, k + 1))

    # left bottom to right top
    return torch.as_strided(a[j:], (b, i - j, j), (b_s, k, l - k))

I am again getting the error from before:

 File "J:\Anaconda_Python3_6\envs\cuda2\lib\site-packages\torch\autograd\__init__.py", line 89, in backward
    allow_unreachable=True)  # allow_unreachable flag
RuntimeError: Tensor: invalid storage offset at c:\programdata\miniconda3\conda-bld\pytorch_1524546371102\work\aten\src\thc\generic/THCTensor.c:759

I guess some parameter is not tracked properly again, perhaps because I am slicing a?

l - k is going to be negative here, no? PyTorch does not allow negative strides.
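For a contiguous tensor the strides of the last two dimensions are (91, 1) in your case, so l - k = -90. A quick check:

a = torch.randn(150, 182, 91)
b_s, k, l = a.stride()
print(k, l, l - k)   # prints 91 1 -90 for a contiguous tensor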

It seems to run just fine: I get the output I need using l - k, and the error only occurs later when the loss is computed. I don’t think it’s negative, since the matrix is flipped using a[j:] (not sure about this part).

No solution found yet. Does anyone have a suggestion on how to get the reverse stripe, or how to fix the “unreachable” error in my solution? Thank you!

Yeah, when I do it in numpy and convert back to a tensor, I get an error that negative strides are involved, which is not supported:

def batch_stripe_numpy(a):

    # this doesn't work in the current PyTorch version,
    # since negative strides are not supported

    a = a.cpu().detach().numpy()
    b, i, j = a.shape
    assert i >= j
    b_s, k, l = a.strides
    strided_result = np.lib.stride_tricks.as_strided(a[j - 1:], (b, i - j, j), (b_s, k, l - k))
    return torch.from_numpy(strided_result).type(torch.FloatTensor).to("cuda")

Can you explain what you mean here:

I guess you could by shifting the beginning of a by linearizing the last two dimensions + narrow. Then set the strides to (k, k-1) for the last two dimensions.

I don’t understand your suggestion; maybe a code snippet of what you mean would help? Thank you!

Another attempt, doing it purely in numpy:

def reverse_stripe(a):
    a = a.cpu().detach().numpy()
    *sh, i, j = a.shape
    assert i > j
    *st, k, m = a.strides
    strided_result = np.lib.stride_tricks.as_strided(a[..., j - 1:, :], (*sh, i - j + 1, j), (*st, k, m - k))
    return torch.from_numpy(strided_result).type(torch.FloatTensor).to("cuda")

Here is the error message:

ValueError: some of the strides of a given numpy array are negative. This is currently not supported, but will be added in future releases.

I see that there is a way to flip the matrix, which would allow me to use the original implementation. The flip function is defined in https://github.com/pytorch/pytorch/issues/229:

def flip(x, dim):
    dim = x.dim() + dim if dim < 0 else dim
    indices = [slice(None)] * x.dim()
    indices[dim] = torch.arange(x.size(dim) - 1, -1, -1,
                                dtype=torch.long, device=x.device)
    return x[tuple(indices)]

So I can do:

def batch_stripe(a):
    b, i, j = a.size()
    assert i >= j
    b_s, k, l = a.stride()
    return torch.as_strided(a, (b, i - j, j), (b_s, k, k+1))
# this simulates numpy stripe(a[..., ::-1, :])[..., ::-1, :] ???
output = flip(batch_stripe(flip(a, 3)), 3)

but when I set dim to 3 I get this:
RuntimeError: dimension out of range (expected to be in range of [-3, 2], but got 3)

Is dim actually referring to tensor rank in this case in PyTorch? dim=2 seems to work, but I am not sure if that is correct.

Not sure why you would want to set the argument dim=3 when the tensor a has 3 dimensions and thus only has dimensions 0, 1, 2. Maybe this is what you want?

import torch


def flip(x, dim):
    indices = [slice(None)] * x.dim()
    indices[dim] = torch.arange(x.size(dim) - 1, -1, -1,
                                dtype=torch.long, device=x.device)
    return x[tuple(indices)]


def batch_stripe(a):
    b, i, j = a.size()
    assert i >= j
    b_s, k, l = a.stride()
    return torch.as_strided(a, (b, i-j+1, j), (b_s, k, k+l))


if __name__ == '__main__':
    a = torch.arange(24).view(2,4,3)
    print('original tensor:')
    print(a)
    print('inverse stripe:')
    print(batch_stripe(flip(a, -1)))
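On this toy input the last print should show [[2, 4, 6], [5, 7, 9]] for the first 4x3 matrix and [[14, 16, 18], [17, 19, 21]] for the second, i.e. each row is one anti-diagonal read from the top-right corner. If you need the columns back in their original left-to-right order, you can flip the result along the last dimension again.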

Thanks, that seems to cover my use case!

Have to reopen, since I have some issues with this on PyTorch 0.4.1 with CUDA 9.2 / cuDNN 7.1.4. Previously I ran 0.4.0 with CUDA 9.0 and cuDNN 7.0.5.

Basically, flip is already supported natively in 0.4.1, so the code looks like the following:

import torch

def batch_stripe(a):
    b, i, j = a.size()
    assert i >= j
    b_s, k, l = a.stride()
    return torch.as_strided(a, (b, i-j, j), (b_s, k, k+l))

if __name__ == '__main__':
    a = torch.arange(24).view(2,4,3)
    print('original tensor:')
    print(a)
    print('inverse stripe:')
    flipped = torch.flip(a, [2])
    print(batch_stripe(flipped))

Unfortunately, I am getting the following error in my code:

RuntimeError: cuda runtime error (59) device-side assert triggered at /opt/.../THCTensorCopy.cpp:21

And it points to the line that does the flip when I use CUDA_LAUNCH_BLOCKING=0. If I use CUDA_LAUNCH_BLOCKING=1, then it points to some unrelated part of the code. I see that this only happens with cuDNN 7.1 and not cuDNN 7.0.

This works fine with the small code example above, but fails in my project. Any ideas? Maybe there is a better way of doing the batch_stripe in PyTorch 0.4.1?

I will try to provide an example that can reproduce the error, but the codebase is too big for that right now.

Got it working; it was a dimension mismatch after all. Unfortunately, the CUDA error pointed to something more or less unrelated.