So I am using PyTorch for some numerical computation, and my problem can’t be vectorized because NestedTensor is not yet available in a stable PyTorch release… Currently I am using Python’s map function to do some tensor calculations. I have two questions:

Is there a more efficient way to do this parallel computation in PyTorch? For example, TensorFlow has tf.map_fn; is there anything similar in PyTorch?

Should I decide to use CUDA for the computation, will using map slow down my algorithm?

I thought map ran in parallel? So Python’s map doesn’t really help on the CUDA side of PyTorch?
It’s indeed not feasible to solve my problem with existing functions while NestedTensor is unavailable… The issue is that I have to work with a list of tensors that all have the same number of dimensions but different sizes, which makes map the only solution I can see.
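To illustrate the constraint (a minimal sketch, not from the thread): tensors with the same number of dimensions but different sizes cannot be stacked into one batch, which is what forces the per-element loop in the first place.

```python
import torch

# Two matrices with the same number of dimensions but different sizes.
ts = [torch.randn(2, 2), torch.randn(3, 3)]

try:
    torch.stack(ts)  # batching requires identical shapes
except RuntimeError:
    print("cannot stack ragged tensors into a single batch")
```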

Calling a function for each element… that’s so slow.
Is there any plan to implement a torch.map_fn in the future? From what I’ve heard, tf.map_fn definitely gets good speed. NestedTensor will solve my problem once it’s implemented, but a torch.map_fn would definitely help in a broader sense.

But as mentioned above, there is very little you cannot do with PyTorch-only functions now. If you share a code sample with your function implemented with a for-loop, we might be able to help.

An example of my use case: let’s say we have a list A of square two-dimensional matrices, and a list B of two-dimensional matrices with the same list length as A. I’m trying to write a function that computes the list of quadratic forms B[i].T @ A[i] @ B[i].

e.g.

A = [torch.randn(i, i) for i in range(1, 2001)]
B = [torch.randn(i, 3) for i in range(1, 2001)]
[B[i].T @ A[i] @ B[i] for i in range(2000)]  # this is the target!
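One partial workaround (my own sketch, not something PyTorch provides out of the box): group the matrices by size and compute each group with a single batched matmul. This only pays off when sizes repeat; in the example above every size is unique, so it degenerates back to a loop, but it is worth knowing for less extreme size distributions.

```python
import torch
from collections import defaultdict

def quadratic_forms_bucketed(A, B):
    # Group indices by matrix size so each group can be stacked and batched.
    buckets = defaultdict(list)
    for idx, a in enumerate(A):
        buckets[a.shape[0]].append(idx)
    out = [None] * len(A)
    for idxs in buckets.values():
        a = torch.stack([A[i] for i in idxs])  # (k, n, n)
        b = torch.stack([B[i] for i in idxs])  # (k, n, 3)
        res = b.transpose(1, 2) @ a @ b        # (k, 3, 3), batched B.T @ A @ B
        for j, i in enumerate(idxs):
            out[i] = res[j]
    return out

# Sizes repeat here, so two batched matmuls replace six separate ones.
A = [torch.randn(3, 3) for _ in range(4)] + [torch.randn(5, 5) for _ in range(2)]
B = [torch.randn(a.shape[0], 3) for a in A]
C = quadratic_forms_bucketed(A, B)
```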

See my update with an example. I would say padding with zeros could be very wasteful… that’s one example; a second example would be computing the inverse, which would be even worse in that situation.

Right, but such tasks amount to a very large number of small operations.
So you won’t be able to run this on the GPU, as it will be super slow.
You will have to run these one by one on the CPU in a for-loop. Moving that for-loop from Python to C++ will speed it up, but won’t give you "good" perf I’m afraid…

You can use the C++ inline extensions to try this:

import torch
from torch.utils import cpp_extension
import time

cpp_source = """
std::vector<torch::Tensor> test_fn(std::vector<torch::Tensor> A, std::vector<torch::Tensor> B) {
  std::vector<torch::Tensor> result(A.size());
  for (size_t i = 0; i < A.size(); ++i) {
    result[i] = A[i].mm(B[i]);
  }
  return result;
}
"""

def py_version(A, B):
    C = []
    for a, b in zip(A, B):
        C.append(a.mm(b))
    return C

mod = cpp_extension.load_inline(
    name="mod",
    cpp_sources=cpp_source,
    functions=["test_fn"],
)

for max_val in [10, 100, 1000, 1500]:
    print("")
    print("TESTING FOR {}".format(max_val))
    A = [torch.randn(i, i) for i in range(1, max_val)]
    B = [torch.randn(i, 3) for i in range(1, max_val)]

    t1 = time.time()
    res = mod.test_fn(A, B)
    cpp_time = time.time() - t1

    t1 = time.time()
    res = py_version(A, B)
    py_time = time.time() - t1

    print("cpp time: {}".format(cpp_time))
    print("py time: {}".format(py_time))
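As a side note (my own addition, not from the original post): timing a single call with time.time() is noisy. torch.utils.benchmark, available in recent PyTorch releases, handles warm-up and repeated runs. A sketch for the Python version:

```python
import torch
from torch.utils import benchmark

def py_version(A, B):
    return [a.mm(b) for a, b in zip(A, B)]

A = [torch.randn(i, i) for i in range(1, 100)]
B = [torch.randn(i, 3) for i in range(1, 100)]

t = benchmark.Timer(
    stmt="py_version(A, B)",
    globals={"py_version": py_version, "A": A, "B": B},
)
m = t.timeit(50)  # runs the statement 50 times after warm-up
print(m.median)   # median time per run, in seconds
```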

Well, that depends on how large each op is. If they are big enough, we already parallelize each op.
In the case where they are super small, yes, you would gain from parallelization on top.

But to get back to the main question:

we don’t have a map

I don’t think there is any plan to make one

It would be a lot of work

It would have limited benefit, as it would only help very small, CPU-only tasks.