Partial connectivity between neural network layers

Hi,

I want to create a neural network layer such that the neurons in this layer are not fully connected to the neurons in the layer below.

For example, there are two adjacent neuron layers with 1000 neurons and 300 neurons. Let's name the first layer A and the second layer B. The output of layer A serves as the input of layer B. Neurons 1:3 in layer B are connected to neurons 1:10 in layer A, neurons 4:6 in layer B are connected to neurons 11:20 in layer A, and so on.
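For concreteness, the connectivity I have in mind could be described by a block-diagonal mask like the one below (sizes taken from the example above; the variable names are only for illustration):

```python
import torch

n_in, n_out = 1000, 300            # layer A and layer B sizes
in_per_group, out_per_group = 10, 3
n_groups = n_out // out_per_group  # 100 groups

# mask[j, i] == 1 iff neuron j in layer B is connected to neuron i in layer A
mask = torch.zeros(n_out, n_in)
for g in range(n_groups):
    rows = slice(g * out_per_group, (g + 1) * out_per_group)
    cols = slice(g * in_per_group, (g + 1) * in_per_group)
    mask[rows, cols] = 1.0
```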

Now, since the two layers are only partially connected, the output of every group of 3 neurons in layer B can be computed in parallel on the GPU: I could compute the outputs of neurons 1:3, 4:6, 7:9, … in parallel. This would be straightforward to do in CUDA C code.

Is this possible in PyTorch, either with existing functions or by writing a new C function? Mainly, I want to use the autograd package so that I don't have to compute gradients myself, without losing any speed. I would also like the code to be able to use more than one GPU (if available to me).

Thank you for any guidance.

Sorry, I had intermittent internet which resulted in multiple postings. Is it possible to delete the other post?

Regards,
Shirin

This should be possible by using multiple nn.Linear modules and concatenating their outputs with torch.cat. To achieve parallelism, you may look into torch.multiprocessing.
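A minimal sketch of that suggestion, assuming the 1000 → 300 sizes and the 10-in / 3-out grouping from the original post (module and variable names are my own):

```python
import torch
import torch.nn as nn

class GroupedLinear(nn.Module):
    """Layer B: one small nn.Linear per group of 3 output neurons."""
    def __init__(self, n_groups=100, in_per_group=10, out_per_group=3):
        super().__init__()
        self.in_per_group = in_per_group
        self.linears = nn.ModuleList(
            nn.Linear(in_per_group, out_per_group) for _ in range(n_groups)
        )

    def forward(self, x):                               # x: (batch, 1000)
        chunks = x.split(self.in_per_group, dim=1)      # 100 chunks of (batch, 10)
        outs = [lin(c) for lin, c in zip(self.linears, chunks)]
        return torch.cat(outs, dim=1)                   # (batch, 300)

layer_b = GroupedLinear()
out = layer_b(torch.randn(8, 1000))
```

Since this is built entirely from standard ops, autograd handles the gradients, and the module can be wrapped in nn.DataParallel for multi-GPU use. The drawback is the Python loop over 100 small nn.Linear calls, which launches many small kernels.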

torch.multiprocessing would launch a separate process for each set of sub-neurons (neurons 1:3, 4:6, … in layer B). Even though these processes may execute on the GPU, they would still be treated as separate processes by the GPU scheduler. I think a more native way to do this would be to launch a single CUDA kernel (with a large number of threads) that processes the sets of sub-neurons (neurons 1:3, 4:6, … in layer B) in separate threads but as a single process. In that case I could also avoid having to do any concatenation myself.

I have not measured the speed of the two approaches, but I think the torch.multiprocessing approach might be less efficient. Please correct me if I am wrong.

I agree that for the CUDA kernel approach to work, my partial connectivity needs to have some structure; it cannot be completely random. And this is true for my case.
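For the record, here is roughly the single-launch formulation I have in mind. This is not from the reply above, just one way to exploit the regular block structure in plain PyTorch: torch.bmm processes all groups in one batched matmul instead of a Python loop, and no concatenation is needed.

```python
import torch
import torch.nn as nn

class BlockLinear(nn.Module):
    """All groups computed in one batched matmul."""
    def __init__(self, n_groups=100, in_per_group=10, out_per_group=3):
        super().__init__()
        self.weight = nn.Parameter(
            torch.randn(n_groups, in_per_group, out_per_group) * 0.1
        )
        self.bias = nn.Parameter(torch.zeros(n_groups, out_per_group))

    def forward(self, x):                                  # x: (batch, 1000)
        b = x.size(0)
        x = x.view(b, -1, self.weight.size(1))             # (batch, 100, 10)
        x = x.transpose(0, 1)                              # (100, batch, 10)
        out = torch.bmm(x, self.weight)                    # (100, batch, 3)
        out = out + self.bias.unsqueeze(1)
        return out.transpose(0, 1).reshape(b, -1)          # (batch, 300)
```

Because everything is still expressed with differentiable tensor ops, autograd handles the backward pass here as well.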