I have 1-dimensional input data, 10000 features long.
I’m currently iterating over the data, examining it in sliding windows of 3, i.e. indices 0:3, 1:4, 2:5, …, 9997:10000. For each window of 3 inputs, I run the data through a standard MLP with a single hidden layer, which outputs one value. It’s important that the MLPs do not share weights: there are literally 10000 − 2 = 9998 copies of the network, each with its own unique weights, producing a total of 9998 outputs (since there is no padding).
I’m currently storing these in a ModuleList, and in my .forward() pass I iterate over the modules, passing each one the appropriately indexed slice of my input vector. The net is training and validation loss is going down, which is nice… but I was wondering if there is a better (parallelizable?) PyTorch way of doing this without writing my own CUDA kernel?
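One possible answer sketch, for context: the per-window loop can be collapsed into a few batched tensor ops using `Tensor.unfold` to build the sliding windows and `torch.einsum` to apply a separate weight matrix per window position (a "locally connected" layer with no weight sharing). The class name, hidden size, and the ReLU between the two layers below are my own assumptions, not from the post; drop the ReLU if your hidden layer really is purely linear.

```python
import torch
import torch.nn as nn


class LocallyConnected1d(nn.Module):
    """N = length - kernel + 1 independent (kernel -> hidden -> 1) MLPs,
    one per sliding window, evaluated in a single vectorized pass."""

    def __init__(self, length=10000, kernel=3, hidden=8):
        super().__init__()
        n = length - kernel + 1  # number of windows (no padding)
        self.kernel = kernel
        # One unshared weight matrix and bias per window position.
        self.w1 = nn.Parameter(torch.randn(n, kernel, hidden) * kernel ** -0.5)
        self.b1 = nn.Parameter(torch.zeros(n, hidden))
        self.w2 = nn.Parameter(torch.randn(n, hidden, 1) * hidden ** -0.5)
        self.b2 = nn.Parameter(torch.zeros(n, 1))

    def forward(self, x):
        # x: (batch, length) -> windows: (batch, n, kernel); this is a view, no copy
        windows = x.unfold(1, self.kernel, 1)
        # Per-position matmul: (batch, n, kernel) x (n, kernel, hidden) -> (batch, n, hidden)
        h = torch.einsum('bnk,nkh->bnh', windows, self.w1) + self.b1
        h = torch.relu(h)  # assumption: remove if the hidden layer is purely linear
        # (batch, n, hidden) x (n, hidden, 1) -> (batch, n, 1)
        out = torch.einsum('bnh,nho->bno', h, self.w2) + self.b2
        return out.squeeze(-1)  # (batch, n)


# quick shape check on a small example
m = LocallyConnected1d(length=100, kernel=3, hidden=4)
y = m(torch.randn(2, 100))
print(tuple(y.shape))
```

Since every window is handled by one einsum instead of 9998 separate module calls, the whole layer runs as a handful of CUDA kernels and parallelizes across window positions for free.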