Parallel training of a network with multiple branches

Please forgive me for hijacking this thread, but I have the same question and would very much appreciate more detail, and especially some concrete syntax.

In the big picture, I'm looking to define something like the network pictured in this post, but with the arrows reversed. In other words, I want the input to be a set of identically sized tensors, each of which passes through one or more layers of its own subnet (learning a local representation) before feeding into one larger layer for further processing. The target hardware is a GPU, and as the original poster noted, looping over the subnets seems like it will be inefficient.
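
For concreteness, here is roughly the loop-based formulation I'd like to avoid (just a sketch; the branch depth and the names/sizes `n_branches`, `in_features`, `hidden`, and `shared_out` are placeholders):

```python
import torch
import torch.nn as nn

class BranchedNet(nn.Module):
    """Each input vector goes through its own small MLP ("branch"),
    then the branch outputs are concatenated and fed to a shared layer."""
    def __init__(self, n_branches, in_features, hidden, shared_out):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Sequential(
                nn.Linear(in_features, hidden), nn.ReLU(),
                nn.Linear(hidden, hidden), nn.ReLU(),
            )
            for _ in range(n_branches)
        )
        self.shared = nn.Linear(n_branches * hidden, shared_out)

    def forward(self, x):                  # x: (batch, n_branches, in_features)
        # Python-level loop over the branches -- the part I worry is slow on a GPU
        outs = [branch(x[:, i, :]) for i, branch in enumerate(self.branches)]
        return self.shared(torch.cat(outs, dim=1))
```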

In my specific case I have a set of voxels of interest (so not a full cube), and within each voxel a specific x, y, z coordinate (real values) plus a 10-bit one-hot encoding for the point in the voxel (also stored as reals, since a tensor can have only one dtype). That gives an input size of (N, 13), where N is fixed per model and on the order of 50-100. Ideally I want each 13-element vector to be processed by its own fully connected, multi-layer subnet before everything feeds into a single larger layer.
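
One loop-free alternative I've been toying with for this shape is to stack each layer's per-branch weights into a single (N, in, out) parameter and apply them all at once with a batched matmul via `einsum`. The `BranchwiseLinear` name and the hidden/output sizes below are just placeholders:

```python
import torch
import torch.nn as nn

class BranchwiseLinear(nn.Module):
    """A fully-connected layer applied independently to each of the N branches,
    implemented as one batched matmul instead of a Python loop."""
    def __init__(self, n_branches, in_features, out_features):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(n_branches, in_features, out_features) * 0.02)
        self.bias = nn.Parameter(torch.zeros(n_branches, out_features))

    def forward(self, x):                  # x: (batch, n_branches, in_features)
        # independent matmul for every branch n
        return torch.einsum('bni,nio->bno', x, self.weight) + self.bias

# e.g. two branch-wise layers followed by one shared layer
n_branches, hidden = 64, 32
net = nn.Sequential(
    BranchwiseLinear(n_branches, 13, hidden), nn.ReLU(),
    BranchwiseLinear(n_branches, hidden, hidden), nn.ReLU(),
    nn.Flatten(),                          # (batch, n_branches * hidden)
    nn.Linear(n_branches * hidden, 128),   # shared layer
)
out = net(torch.randn(8, n_branches, 13))  # -> (8, 128)
```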

The use of convolutional layers is intriguing, but I am not clear on the syntax required to define non-overlapping subnets without forcing the subnets' outputs to be smaller than their inputs, which is what I see in most examples of convolutional layers.
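
If I understand grouped convolutions correctly (please correct me if not), a kernel_size=1 Conv1d with groups=N should give exactly these non-overlapping subnets: each branch's 13 input channels connect only to its own block of output channels, and that block can be as wide as I like rather than smaller than the input. Here is what I think the syntax would look like (sizes are placeholders again):

```python
import torch
import torch.nn as nn

n_branches, in_features, hidden = 64, 13, 32

# Treat the (N, 13) input as N*13 channels of length-1 "signals".
# groups=n_branches connects each branch's 13 channels only to its own
# `hidden` output channels, i.e. N independent fully-connected layers.
net = nn.Sequential(
    nn.Conv1d(n_branches * in_features, n_branches * hidden, kernel_size=1, groups=n_branches),
    nn.ReLU(),
    nn.Conv1d(n_branches * hidden, n_branches * hidden, kernel_size=1, groups=n_branches),
    nn.ReLU(),
    nn.Flatten(),                              # (batch, N * hidden)
    nn.Linear(n_branches * hidden, 128),       # shared layer
)

x = torch.randn(8, n_branches, in_features)    # (batch, N, 13)
x = x.reshape(8, n_branches * in_features, 1)  # (batch, N*13, length=1)
out = net(x)                                   # -> (8, 128)
```

If that's right, this should be equivalent (up to initialization) to the loop and einsum versions above, while avoiding the Python-level loop over subnets.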

I'd be interested to read constructive suggestions and comments on any of this.