How to parallelize code, sorta like a CUDA kernel

I have 1-dimensional input data, 10000 features long.

I’m currently iterating over the data, examining it in overlapping chunks of 3, i.e. indices 0:3, 1:4, 2:5, …, 9997:10000. With this ‘window’ of 3 inputs, I run the data through a standard MLP with one (linear) hidden layer that outputs a single value. It’s important that the MLPs do not share weights: there are literally 10000 − 2 = 9998 copies of the network, each with its own unique weights, producing 9998 outputs in total since there is no padding.

I’m currently using an nn.ModuleList to store these, and in my .forward() pass I iterate over each module, passing in an appropriately indexed slice of my input vector. The net is training and validation loss is going down, which is nice… but I was wondering if there was a better (parallelizable?) PyTorch way of doing this w/o writing my own CUDA kernel?
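Roughly, my current setup looks like this (a simplified sketch; the class name, hidden size, and init are placeholders of mine):

```python
import torch
import torch.nn as nn

class WindowedMLPs(nn.Module):
    def __init__(self, n_features=10000, window=3, hidden=16):
        super().__init__()
        self.window = window
        n_windows = n_features - window + 1  # 9998 windows, no padding
        # One independent MLP per window; hidden layer kept linear as described
        self.mlps = nn.ModuleList(
            nn.Sequential(nn.Linear(window, hidden), nn.Linear(hidden, 1))
            for _ in range(n_windows)
        )

    def forward(self, x):  # x: (batch, n_features)
        outs = [
            mlp(x[:, i : i + self.window])  # slice window i: (batch, window)
            for i, mlp in enumerate(self.mlps)
        ]
        return torch.cat(outs, dim=1)  # (batch, n_windows)
```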

Create the weights as self.weight = nn.Parameter(torch.randn(...)) and then use torch.bmm to do batched matrix multiplies.
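For reference, torch.bmm multiplies matched pairs of matrices along a leading batch dimension, so each window gets its own weight matrix in a single call (sizes here are just illustrative):

```python
import torch

A = torch.randn(9998, 1, 3)    # one (1, 3) input row per window
B = torch.randn(9998, 3, 16)   # one (3, 16) weight matrix per window
out = torch.bmm(A, B)          # (9998, 1, 16): independent matmul per window
```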

This is probably a much better approach than doing a 10000-iteration for-loop in Python.

Here’s an example of manually specifying nn.Parameter weights and using the functional interface, rather than using nn Modules.
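Something along these lines (a minimal sketch; the class name, hidden size, and the naive randn init are my own assumptions, and the hidden layer is kept linear to match your description):

```python
import torch
import torch.nn as nn

class ParallelWindowedMLPs(nn.Module):
    def __init__(self, n_features=10000, window=3, hidden=16):
        super().__init__()
        self.window = window
        n_windows = n_features - window + 1  # 9998
        # One independent set of weights per window, stored as batched tensors.
        # Naive init for brevity; scale/init scheme is up to you.
        self.w1 = nn.Parameter(torch.randn(n_windows, window, hidden) * 0.1)
        self.b1 = nn.Parameter(torch.zeros(n_windows, 1, hidden))
        self.w2 = nn.Parameter(torch.randn(n_windows, hidden, 1) * 0.1)
        self.b2 = nn.Parameter(torch.zeros(n_windows, 1, 1))

    def forward(self, x):  # x: (batch, n_features)
        # Extract all sliding windows at once: (batch, n_windows, window)
        windows = x.unfold(1, self.window, 1)
        # bmm wants the window dim leading: (n_windows, batch, window)
        windows = windows.transpose(0, 1)
        h = torch.bmm(windows, self.w1) + self.b1  # (n_windows, batch, hidden)
        # Insert a nonlinearity here if you ever want a non-linear hidden layer
        out = torch.bmm(h, self.w2) + self.b2      # (n_windows, batch, 1)
        return out.squeeze(-1).transpose(0, 1)     # (batch, n_windows)
```

Calling this with an input of shape (batch, 10000) returns (batch, 9998), the same as the loop version, but all 9998 per-window matmuls run as a single batched kernel on the GPU instead of 9998 separate Python-dispatched ops.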