Parallelizing linear algebra and intensive computations

Good evening,

I have been implementing some raycasting operations using NumPy and thought about moving them to PyTorch to make use of GPU parallelization.
However, I am struggling to use some functions, like the cross product, over lists/batches of tensors.
For example, I want to compute the cross product between 10 vectors and 6000 other vectors.
With NumPy I would broadcast them and get something along these lines:

pvec = np.cross(directions[None, :, :], v0v2[:, None, :])

With PyTorch this seems to be a problem, as torch.cross requires tensors of the same size, and broadcasting apparently is not available for this method.
Any idea how to do something similar efficiently?

Also, what would be a good way to run some computations in parallel?
For example, the same 10 operations for a lot of vectors.
In CUDA I see how the kernel would work, executing the same code across threads, but directly in PyTorch it does not seem so clear. Is it even possible?

Yours Justin

You could manually broadcast the tensors as shown in this example:

import numpy as np
import torch

# numpy: broadcasting handles the pairwise combinations directly
directions = np.random.randn(3, 3)
v0v2 = np.random.randn(3, 3)
pvec = np.cross(directions[None, :, :], v0v2[:, None, :])

# PyTorch with manual broadcasting: expand both operands to the same shape
d, v = torch.from_numpy(directions[None, :, :]), torch.from_numpy(v0v2[:, None, :])
d = d.expand(v.size(0), -1, -1)  # (1, 3, 3) -> (3, 3, 3)
v = v.expand(-1, d.size(1), -1)  # (3, 1, 3) -> (3, 3, 3)
p = torch.cross(d, v, dim=2)

# Compare against the numpy result
print(torch.allclose(torch.from_numpy(pvec), p))
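
As a side note, assuming you are on a reasonably recent PyTorch release, torch.linalg.cross broadcasts its inputs the same way np.cross does, so the manual expand can be skipped. A minimal sketch with the shapes from your question (10 directions, 6000 vectors):

# torch.linalg.cross broadcasts, so no expand is needed (recent PyTorch versions)
d = torch.randn(10, 3)
v = torch.randn(6000, 3)
p = torch.linalg.cross(d[None, :, :], v[:, None, :], dim=-1)  # (6000, 10, 3)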
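
Regarding the second question: in PyTorch you normally do not write the kernel yourself. You phrase "the same operation for a lot of vectors" as one expression over batched tensors, and the backend launches the CUDA kernels for you once the tensors live on the GPU. A minimal sketch, with made-up shapes matching your example:

import torch

device = "cuda" if torch.cuda.is_available() else "cpu"

directions = torch.randn(10, 3, device=device)  # 10 ray directions
vectors = torch.randn(6000, 3, device=device)   # 6000 vectors to process

# All 10 x 6000 dot products in a single batched matrix multiply,
# instead of looping over vectors as you would in a handwritten kernel.
dots = vectors @ directions.T                   # (6000, 10)

# Element-wise ops broadcast the same way, so the whole pipeline stays on the GPU.
scaled = dots.unsqueeze(-1) * directions        # (6000, 10, 3)

If a computation is awkward to express through broadcasting alone, recent releases also provide torch.vmap (torch.func.vmap) to map a per-sample function over a batch dimension, but for raycasting-style math plain broadcasting is usually enough.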