Fast way to use `map` in PyTorch?

kaiseryet · February 24, 2020, 3:25am

So I am using PyTorch for some numerical calculation, and my problem can’t be vectorized because NestedTensor is not yet available in a stable PyTorch release… Currently, I am using the map function to do some tensor calculations. Here are two questions:

Is there a more efficient way to do the parallel computation in PyTorch? e.g., I know there is a tf.map_fn in TensorFlow, is there anything similar in PyTorch?
Should I decide to use CUDA for computation, is the usage of map function gonna slow my algorithm?

Thanks.

albanD · February 24, 2020, 3:11pm

Hi,

I’m afraid there is no map in pytorch.
If all the operations are very small, single threaded CPU will be the fastest I’m afraid.

If you can share your problem, maybe we can help you achieve some parallelization using the existing functions though.

kaiseryet · February 24, 2020, 4:11pm

I thought map runs in parallel? So Python map doesn’t really work on PyTorch CUDA end?
It’s indeed not feasible to run my problem using existing function if NestedTensor is not available… The issue is that I have to make a list of tensors of different sizes but the same dimensions, this makes map the only possible solution.

albanD · February 24, 2020, 4:32pm

Depending on the op you need to do, you could pack them into a single Tensor with enough info to do your function.

I thought map runs in parallel?

This is not enough to get good performance if the function needs to be applied to each entry I’m afraid.

So Python map doesn’t really work on PyTorch CUDA end?

You can still use it; it’s just that it will call the Python function for every element, which is not going to be optimal.
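To make the point about Python's built-in map concrete, here is a minimal sketch. The helper name quad_form and the small matrix sizes are made up for illustration. It records which thread each call runs on, and shows that map is lazy and then calls the function once per element, in order, on the calling thread.

```python
import threading

import torch

# Small ragged lists, shaped like the ones discussed in this thread.
A = [torch.randn(n, n) for n in range(1, 6)]
B = [torch.randn(n, 3) for n in range(1, 6)]

seen = []  # (matrix size, thread id) for every call, in call order


def quad_form(a, b):
    # Hypothetical helper: log the call, then compute B.T @ A @ B.
    seen.append((a.shape[0], threading.get_ident()))
    return b.T @ a @ b


lazy = map(quad_form, A, B)   # map is lazy: nothing has been computed yet
calls_before = len(seen)
via_map = list(lazy)          # quad_form now runs once per pair, in order

print(calls_before, [size for size, _ in seen])
print({tid for _, tid in seen} == {threading.get_ident()})
```

Any parallelism here comes from inside each individual PyTorch op, not from map itself.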

kaiseryet · February 24, 2020, 7:19pm

call function for each element… omg that’s so slow
Is there any plan to maybe implement a torch.map_fn in the future? I mean, tf.map_fn definitely gets some good speed from what I’ve heard. After NestedTensor gets implemented my problem will be solved, but creating a .map_fn would definitely help in a broad sense.

Yaroslav_Bulatov · February 24, 2020, 7:32pm

What would help here is an analogue of Jax’s pmap. I think @ezyang is experimenting with related vmap

albanD · February 24, 2020, 7:38pm

Well, for best perf, you want to move your function into the inner loop in cpp or cuda code.
Which you cannot do using a python function.

Also remember that NestedTensors are built on top of pytorch. So if you can do your code with NestedTensor, you can do it without

kaiseryet · February 24, 2020, 9:18pm

Too complicated for my purpose…

albanD · February 24, 2020, 11:10pm

It is for most use cases.

But as mentioned above, there is very little you cannot do with pytorch-only functions now. If you want to share a code sample with your function implemented with a for-loop, we might be able to help.

kaiseryet · February 26, 2020, 12:49pm

An example of my purpose: let’s say we have a list of square two-dimensional matrices, called A, and a list of two-dimensional matrices, B, of the same list length as A. I’m trying to write a function to find the list of quadratic forms B.T@A@B.

e.g.

A = [torch.randn(i, i) for i in range(2001)]
B = [torch.randn(i, 3) for i in range(2001)]
[B[i].T@A[i]@B[i] for i in range(2001)] # this is the target!

albanD · February 26, 2020, 3:28pm

And the sizes are wildly different?
Is packing all the A matrices in a single Tensor with 0 padding very wasteful in your case?
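For reference, a minimal sketch of what such zero-padded packing could look like for the quadratic form B.T@A@B. The sizes (1 to 32) and the names A_pad, B_pad and fill_ratio are chosen for illustration only. Zero padding leaves each 3x3 result unchanged, because the padded rows and columns only contribute zeros to the products.

```python
import torch

torch.manual_seed(0)
sizes = list(range(1, 33))
# float64 keeps the comparison with the plain loop numerically tight.
A = [torch.randn(n, n, dtype=torch.float64) for n in sizes]
B = [torch.randn(n, 3, dtype=torch.float64) for n in sizes]

n_max = max(sizes)
A_pad = torch.zeros(len(sizes), n_max, n_max, dtype=torch.float64)
B_pad = torch.zeros(len(sizes), n_max, 3, dtype=torch.float64)
for k, (a, b) in enumerate(zip(A, B)):
    n = a.shape[0]
    A_pad[k, :n, :n] = a   # copy into the top-left corner, rest stays zero
    B_pad[k, :n] = b

# One batched expression instead of a Python-level loop over matmuls.
batched = B_pad.transpose(1, 2) @ A_pad @ B_pad          # (len(sizes), 3, 3)
looped = torch.stack([b.T @ a @ b for a, b in zip(A, B)])

# Fraction of the padded A tensor that holds real data.
fill_ratio = sum(n * n for n in sizes) / A_pad.numel()
print(torch.allclose(batched, looped), round(fill_ratio, 3))
```

With sizes growing linearly like this, only about a third of the padded tensor holds real data, which is the waste being asked about.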

kaiseryet · February 26, 2020, 3:32pm

See my updated post with an example. I would say padding with 0 might well be very wasteful… That’s one example; a second example could be finding the inverse, which will be even worse in that situation.

albanD · February 26, 2020, 4:07pm

Right but for such tasks, you want to do a very large number of small tasks.
So you won’t be able to run this on the GPU as it will be super slow.
And you will have to run this one by one on the CPU in a for-loop. Moving that for loop from python to cpp will speed it up but won’t give you "good" perf I’m afraid…

You can use the cpp inline extensions to try this:

import torch
from torch.utils import cpp_extension
import time

cpp_source = """
std::vector<torch::Tensor> test_fn(std::vector<torch::Tensor> A, std::vector<torch::Tensor> B) {
    std::vector<torch::Tensor> result(A.size());

    for (size_t i = 0; i < A.size(); ++i) {
        result[i] = A[i].mm(B[i]);
    }

    return result;
}
"""

def py_version(A, B):
    C = []
    for a, b in zip(A, B):
        C.append(a.mm(b))
    return C

mod = cpp_extension.load_inline(
    name="mod",
    cpp_sources=cpp_source,
    functions=["test_fn",],
)


for max_val in [10, 100, 1000, 1500]:
    print("")
    print("TESTING FOR {}".format(max_val))
    A = [torch.randn(i, i) for i in range(1, max_val)]
    B = [torch.randn(i, 3) for i in range(1, max_val)]

    t1 = time.time()
    res = mod.test_fn(A, B)
    cpp_time = time.time() - t1
    t1 = time.time()
    res = py_version(A, B)
    py_time = time.time() - t1

    print("cpp time: {}".format(cpp_time))
    print("py time: {}".format(py_time))

kaiseryet · February 26, 2020, 4:08pm

finally had to bring up cpp…

when will NestedTensor be implemented in the stable release though? @cpuhrsch pls?

albanD · February 26, 2020, 4:10pm

My example is more to show that you most likely don’t want cpp:

TESTING FOR 1500
cpp time: 0.2952902317047119
py time: 0.27726316452026367

These are the timings I get for a simple mm with the inputs you gave.

cc @cpuhrsch for the nested tensor release

kaiseryet · February 26, 2020, 4:11pm

sorry, got scared when I read cpp.

But wait, isn’t map supposed to be faster than for due to parallelisation?

albanD · February 26, 2020, 4:21pm

Well that depends on how large each op is. If they are big enough, we already parallelize each op.
In the case where they are super super small, yes you would gain from a parallelization on top.

But to get back to the main question:

- We don’t have a map.
- I don’t think there is any plan to make one:
  - It would be a lot of work.
  - It would have limited benefit, as it would only help very small, CPU-only tasks.
- If your ops are small, you don’t want to use CUDA.
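To check the map-versus-for question on your own machine, here is a rough sketch using torch.utils.benchmark, which is assumed to be available in your PyTorch version. The function names with_loop and with_map, the list sizes and the run count are arbitrary choices for illustration.

```python
import torch
from torch.utils import benchmark

A = [torch.randn(n, n) for n in range(1, 200)]
B = [torch.randn(n, 3) for n in range(1, 200)]


def with_loop(A, B):
    # Plain Python loop over the pairs.
    return [a.mm(b) for a, b in zip(A, B)]


def with_map(A, B):
    # Built-in map: still one Python-level call per pair.
    return list(map(torch.mm, A, B))


timings = {}
for fn in (with_loop, with_map):
    timer = benchmark.Timer(
        stmt="fn(A, B)",
        globals={"fn": fn, "A": A, "B": B},
    )
    # Mean wall-clock seconds per call over 10 runs.
    timings[fn.__name__] = timer.timeit(10).mean

for name, seconds in timings.items():
    print(f"{name}: {seconds:.6f} s")
```

Both versions issue the same sequence of mm calls, so any gap between them is Python call overhead rather than parallelism.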

kaiseryet · February 26, 2020, 4:23pm

okay fair enough. I’m just giving an example here.

hypnagogic · May 24, 2020, 12:11am

Just a quick thought from a cursory skim of this problem. I was wondering whether a purely multiprocessing flow may help (on CPUs) here:

chunk the lists (A, B)
use a multiprocessing.Pool.map(func, zip(A, B))

something like (quick pseudocode)

def chunks(a: List, b: List, n: int) -> Generator: ...  # yield chunks of (a, b), n items at a time

nproc = os.cpu_count() - 1
with Pool(processes=nproc) as pool:
  func = functools.partial(process_matrices, *params)
  pool.map(func, zip(a_chunk, b_chunk))
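A runnable sketch of the pseudocode above, under some assumptions: process_matrices is taken to compute the quadratic form B.T@A@B from earlier in the thread, the wrapper parallel_quad_forms and its default chunk_size are made up for illustration, and each worker stacks its 3x3 results so that only one tensor per chunk is sent back.

```python
import os
from multiprocessing import Pool

import torch


def chunks(a, b, n):
    """Yield (a_chunk, b_chunk) pairs, each holding at most n matrices."""
    for start in range(0, len(a), n):
        yield a[start:start + n], b[start:start + n]


def process_matrices(chunk):
    """Compute B.T @ A @ B for every pair in one chunk (runs in a worker)."""
    a_chunk, b_chunk = chunk
    # Stack the 3x3 results so each chunk returns a single tensor.
    return torch.stack([b.T @ a @ b for a, b in zip(a_chunk, b_chunk)])


def parallel_quad_forms(A, B, chunk_size=16, nproc=None):
    """Hypothetical wrapper: spread the chunks over a process pool."""
    if nproc is None:
        nproc = max(1, (os.cpu_count() or 2) - 1)
    with Pool(processes=nproc) as pool:
        parts = pool.map(process_matrices, chunks(A, B, chunk_size))
    return torch.cat(parts)


if __name__ == "__main__":
    A = [torch.randn(n, n) for n in range(1, 65)]
    B = [torch.randn(n, 3) for n in range(1, 65)]
    out = parallel_quad_forms(A, B)
    print(out.shape)  # one 3x3 result per (A, B) pair
```

Sending tensors between processes has its own cost, so this is only likely to pay off when each chunk carries enough work to outweigh the transfer. It is worth timing against the plain loop before adopting it.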