Index_select() becomes slow with very long columns

Hi, I just found that calling index_select() on long columns (dim=1) are much slower than doing it on long rows (dim=0). In fact, for large enough matrices, it’s much faster to first transpose the matrix and call index_select() on rows!

I wonder if it’s a known performance issue and if there’s any way to mitigate the problem?

Here’s the code I used to test. (I’m using Python 2.7.12, PyTorch 0.2.0.post1, and GTX 1080). It builds a 256*N matrix (for various N) and rearranges its rows. Then it does the same for columns.

import time
import numpy as np
import torch

DIM = 256
idxs = np.random.permutation(DIM)
idxs = torch.LongTensor(idxs).cuda()

def do_test(dim, sz, trans=False):
    if dim == 0:
        # Rearrange 256 rows, each size 'sz'.
        A = torch.cuda.FloatTensor(DIM, sz)
    else:
        # Rearrange 256 columns, each size 'sz'.
        A = torch.cuda.FloatTensor(sz, DIM)
    A.uniform_(-1.0, 1.0)

    B = A.new(A.shape)

    torch.cuda.synchronize()

    T0 = time.time()
    for step in xrange(10):
        if trans:
            T = A.t().clone()
            T.index_select(dim=1-dim, index=idxs, out=B)
        else:
            A.index_select(dim=dim, index=idxs, out=B)

    torch.cuda.synchronize()
    T1 = time.time()
    print '  %6d : Elapsed %.3f ms' % (sz, (T1 - T0) / 10 * 1000.0)

sizes = [100, 200, 500, 1000, 2000, 5000, 10000, 20000, 50000, 100000]

print 'Rearranging rows:'
for sz in sizes: do_test(0, sz)
print 'Rearranging columns:'
for sz in sizes: do_test(1, sz)
print 'Rearranging columns (with transpose):'
for sz in sizes: do_test(1, sz, True)

Result:

Rearranging rows:
     100 : Elapsed 0.051 ms
     200 : Elapsed 0.049 ms
     500 : Elapsed 0.056 ms
    1000 : Elapsed 0.066 ms
    2000 : Elapsed 0.084 ms
    5000 : Elapsed 0.142 ms
   10000 : Elapsed 0.235 ms
   20000 : Elapsed 0.419 ms
   50000 : Elapsed 0.999 ms
  100000 : Elapsed 1.937 ms

Rearranging columns:
     100 : Elapsed 0.048 ms
     200 : Elapsed 0.051 ms
     500 : Elapsed 0.065 ms
    1000 : Elapsed 0.111 ms
    2000 : Elapsed 0.298 ms
    5000 : Elapsed 1.166 ms
   10000 : Elapsed 2.953 ms
   20000 : Elapsed 7.699 ms
   50000 : Elapsed 22.267 ms
  100000 : Elapsed 44.207 ms

Rearranging columns (with transpose):
     100 : Elapsed 0.051 ms
     200 : Elapsed 0.054 ms
     500 : Elapsed 0.065 ms
    1000 : Elapsed 0.081 ms
    2000 : Elapsed 0.118 ms
    5000 : Elapsed 0.236 ms
   10000 : Elapsed 0.489 ms
   20000 : Elapsed 1.472 ms
   50000 : Elapsed 8.336 ms
  100000 : Elapsed 14.236 ms

As you can see, index_select(dim=1) is much slower than dim=0: more importantly, it grows faster than O(N): N=100000 is about 400 times slower than N=1000, and first transposing the matrix is about three times faster (for N=100000), even considering the time spent on transposing.

Does anyone know what’s going on here?

1 Like