Speed up the implementation without a for loop

I have a tensor A, a tensor B, and an index tensor C. For example:
A = [[1, 2, 3], [4, 5, 6]]
B = [[10, 20, 30], [40, 50, 60]]
C = [0, 0]

I would like to use B to update A:
A[C[0]] += B[0]
A[C[1]] += B[1]

So far, I have been using a for loop to achieve this, which is very slow. The for-loop snippet is:

for i, c_i in enumerate(C):
    A[c_i] += B[i]

Any suggestions on how to get rid of the for loop or otherwise speed up the implementation? Thanks a lot. The above is just a simplified version; in practice, the A and B tensors are very large.
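
For reference, here is a self-contained version of the toy example above (using torch.tensor to build the tensors):

import torch

A = torch.tensor([[1., 2., 3.], [4., 5., 6.]])
B = torch.tensor([[10., 20., 30.], [40., 50., 60.]])
C = torch.tensor([0, 0])

# the loop-based update: row C[i] of A accumulates row i of B
for i, c_i in enumerate(C):
    A[c_i] += B[i]

print(A)   # tensor([[51., 72., 93.],
           #         [ 4.,  5.,  6.]])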

Hello Jet!

I am not aware of any way to do this in pytorch without a loop.

Furthermore, I doubt that there is such a way. My reasoning runs
as follows:

The purpose of using “tensor” operations (whatever those are) is
to be able to do things “in parallel” (whatever that may mean). But
the calculation you describe is inherently sequential.
Also, I don’t think it is possible to describe a useful semantics for
this type of operation that isn’t inherently sequential.

If all of the indices in your index tensor C were guaranteed to be
distinct, then one could parallelize the operations. But your specific
example, C = [0, 0], provides a good counter-example.

Noting that C[0] = C[1] = 0, we can rewrite:

A[C[0]] += B[0]
A[C[1]] += B[1]

as:

A0tmpa = A[0]
A0tmpa += B[0]
A[0] = A0tmpa
A0tmpb = A[0]   # because C[1] = C[0] = 0
A0tmpb += B[1]
A[0] = A0tmpb

But suppose – in order to parallelize it – we wrote it as:

A0tmpa = A[0]
A0tmpb = A[0]   # because C[1] = C[0] = 0
A0tmpa += B[0]
A0tmpb += B[1]
A[0] = A0tmpa
A[0] = A0tmpb

In this case, you would get a different result, and probably not the
result you would expect or want.
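
Here is a small runnable sketch of that “parallel reads first” ordering,
using the tensors from the example at the top of the thread (the clone()
calls are needed so that the temporaries are real copies rather than
views of A):

import torch

A = torch.tensor([[1., 2., 3.], [4., 5., 6.]])
B = torch.tensor([[10., 20., 30.], [40., 50., 60.]])

A0tmpa = A[0].clone()   # both reads happen before either write
A0tmpb = A[0].clone()
A0tmpa += B[0]
A0tmpb += B[1]
A[0] = A0tmpa
A[0] = A0tmpb           # overwrites the first write

print(A[0])   # tensor([41., 52., 63.]), not the expected [51., 72., 93.]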

Or just switching two lines would screw things up:

A0tmpa = A[0]
A0tmpa += B[0]
A0tmpb = A[0]   # because C[1] = C[0] = 0; moved up, so this read happens
                # before B[0] has been written back to A[0]
A[0] = A0tmpa
A0tmpb += B[1]
A[0] = A0tmpb   # final A[0] is missing B[0]'s contribution

So the sequence of operations (which, if enforced, will break
parallelization) matters.

Here’s a script that shows – in a somewhat contrived way by
using floating-point round-off – that even if you perform each of
A[C[0]] += B[0] and A[C[1]] += B[1] “atomically,” the order
of the two atomic operations matters:

import torch
torch.__version__

delt = 16777216.0  # to get some round-off error

A = torch.FloatTensor ([[1, 2, 3], [4, 5, 6]])
D = A.clone()
B = torch.FloatTensor ([[delt, delt, delt], [-delt, -delt, -delt]])
C = torch.LongTensor ([0, 0])

A
A[C[0]] += B[0]
A
A[C[1]] += B[1]
A

D
D[C[1]] += B[1]
D
D[C[0]] += B[0]
D

And here is the output:

>>> import torch
>>> torch.__version__
'0.3.0b0+591e73e'
>>>
>>> delt = 16777216.0  # to get some round-off error
>>>
>>> A = torch.FloatTensor ([[1, 2, 3], [4, 5, 6]])
>>> D = A.clone()
>>> B = torch.FloatTensor ([[delt, delt, delt], [-delt, -delt, -delt]])
>>> C = torch.LongTensor ([0, 0])
>>>
>>> A

 1  2  3
 4  5  6
[torch.FloatTensor of size 2x3]

>>> A[C[0]] += B[0]
>>> A

 1.6777e+07  1.6777e+07  1.6777e+07
 4.0000e+00  5.0000e+00  6.0000e+00
[torch.FloatTensor of size 2x3]

>>> A[C[1]] += B[1]
>>> A

 0  2  4
 4  5  6
[torch.FloatTensor of size 2x3]

>>>
>>> D

 1  2  3
 4  5  6
[torch.FloatTensor of size 2x3]

>>> D[C[1]] += B[1]
>>> D

-1.6777e+07 -1.6777e+07 -1.6777e+07
 4.0000e+00  5.0000e+00  6.0000e+00
[torch.FloatTensor of size 2x3]

>>> D[C[0]] += B[0]
>>> D

 1  2  3
 4  5  6
[torch.FloatTensor of size 2x3]

And indeed, the results of iterating over the elements of C in
different orders differ.

One could imagine adding a function to pytorch that does what
you want so that you wouldn’t have to write an explicit loop. But
because of the relatively strict sequencing required to give this
operation useful semantics, such a pytorch function would pretty
much have to use a loop (or loop equivalent) under the hood,
so it’s unlikely it would buy you any significant speed-up.

Best.

K. Frank

And this operation is called scatter_add: it uses atomicAdd on CUDA, while on the CPU it is a loop, but a C++ loop != a Python loop.

A.scatter_add_(0, C.view(2, -1).expand(-1, 3), B)
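
For anyone on a recent pytorch build, here is a self-contained sketch that
checks the loop-free updates against the original loop (index_add_() is an
extra suggestion of mine; for this row-wise pattern it avoids expanding the
index):

import torch

A = torch.tensor([[1., 2., 3.], [4., 5., 6.]])
B = torch.tensor([[10., 20., 30.], [40., 50., 60.]])
C = torch.tensor([0, 0])

# reference: the original python loop
A_loop = A.clone()
for i, c_i in enumerate(C):
    A_loop[c_i] += B[i]

# scatter_add_(): the index is expanded to B's shape, so column j of
# row i of B is added to column j of row C[i] of A
A_scatter = A.clone()
A_scatter.scatter_add_(0, C.view(-1, 1).expand_as(B), B)

# index_add_(): adds row i of B to row C[i] of A directly
A_index = A.clone()
A_index.index_add_(0, C, B)

print(torch.equal(A_loop, A_scatter))   # True
print(torch.equal(A_loop, A_index))     # True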

P.S. Regarding your rounding example: many CUDA ops are non-deterministic, so we live with it and try not to sum billions with ones in float32.

Hello Alex!

Thanks for that – you’re absolutely right. And I agree that the use
of “atomicAdd” (assuming it means what it sounds like) eliminates
the need for explicit sequencing (such as with a for loop). Multiple
cuda operations can attempt to update a given element of A
“simultaneously,” and they should be able to do so on a first-come,
first-served basis, without any conflicts.

Sadly I can’t try this, as I’m stuck several layers down in the pytorch
archeological strata with version 0.3.0, which lacks this function.

Yes, that’s fair. The purpose of my admittedly-contrived example was
to show that when C contains duplicate indices there is an ordering
issue, at least in principle.

As an aside, for scatter(), where the ordering issue is fully real (not
just a difference in floating-point round-off), the documentation warns
of this:

Moreover, as for gather(), the values of index must be between 0 and self.size(dim) - 1 inclusive, and all values in a row along the specified dimension dim must be unique.

(Although – at least in version 0.3.0 – unique indices are neither
enforced nor checked for.)
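
For concreteness, here is a small sketch of the duplicate-index situation
with the in-place scatter_() call (which of the two source rows ends up
in A is, in principle, unspecified):

import torch

A = torch.zeros(2, 3)
B = torch.tensor([[10., 20., 30.], [40., 50., 60.]])
C = torch.tensor([0, 0])

# duplicate indices: both rows of B target row 0 of A, and scatter_()
# overwrites rather than accumulates, so one row silently wins
A.scatter_(0, C.view(-1, 1).expand_as(B), B)

print(A)   # row 0 holds values from B[0] or B[1]; row 1 stays zero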

Best.

K. Frank