Running multiple basic matrix operations (add/sum etc) on GPU asynchronously?

TL;DR: I want to subtract a list of vectors from a matrix asynchronously.

I have a matrix of size N x 512 and a second matrix of size M x 512, where M < N always.

What I want is to subtract each of the M row vectors from the whole matrix element-wise, asynchronously, and keep all M results (one N x 512 result per vector).

When M = 1 the case is simple and here is an example:

import numpy as np
import torch

a = torch.from_numpy(np.arange(50).reshape(5, 10))
b = torch.from_numpy(np.arange(10).reshape(1, 10))
a.sub(b)  # b is broadcast across the 5 rows of a

which gives the correct output:

tensor([[ 0,  0,  0,  0,  0,  0,  0,  0,  0,  0],
        [10, 10, 10, 10, 10, 10, 10, 10, 10, 10],
        [20, 20, 20, 20, 20, 20, 20, 20, 20, 20],
        [30, 30, 30, 30, 30, 30, 30, 30, 30, 30],
        [40, 40, 40, 40, 40, 40, 40, 40, 40, 40]])

Is there any way for me to then do this for M > 1 and save the results in a non-sequential manner?
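
For reference, here is a minimal sketch of one possible approach. It assumes all M results fit in memory at once and lets broadcasting perform the M subtractions in a single call instead of a Python loop. The sizes and the names N, M, D, looped and batched are hypothetical and chosen only for illustration; the real case would be N x 512 and M x 512.

```python
import torch

# Hypothetical small sizes for illustration only.
N, M, D = 5, 3, 10
a = torch.arange(N * D).reshape(N, D)  # the matrix, shape (N, D)
b = torch.arange(M * D).reshape(M, D)  # the M vectors, shape (M, D)

# Sequential baseline: one (N, D) result per vector, stacked to (M, N, D).
looped = torch.stack([a - b[i] for i in range(M)])

# Broadcast version: (1, N, D) - (M, 1, D) -> (M, N, D) in a single call.
batched = a.unsqueeze(0) - b.unsqueeze(1)

# Both versions produce the same M results.
assert batched.shape == (M, N, D)
assert torch.equal(batched, looped)
```

Here batched[i] is the result of subtracting the i-th vector from the matrix. If a and b were created on a CUDA device, the same broadcast expression should apply unchanged, though whether the full M x N x 512 result fits in GPU memory depends on the actual sizes.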