Idiomatic way of using the same model on subparts of the input?

I am trying to implement the following architecture, where all the Net B blocks are the same network, so they share their weights and their gradients are accumulated together at the update.

Each Net B takes as input a subpart of the output of Net A (the subparts may overlap). The final output is the concatenation of all the Net B outputs.
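
Here is a minimal sketch of the forward pass I have in mind, using the C++ front-end. Net A and Net B are stand-ins (a small Sequential and a single Linear layer), and the subparts are described as (start, length) slices along dimension 1; all of these are placeholders for the real networks and indexing.

```cpp
#include <torch/torch.h>
#include <utility>
#include <vector>

// Placeholder for the real Net B.
struct NetBImpl : torch::nn::Module {
  NetBImpl(int64_t in, int64_t out) {
    fc = register_module("fc", torch::nn::Linear(in, out));
  }
  torch::Tensor forward(torch::Tensor x) { return fc->forward(x); }
  torch::nn::Linear fc{nullptr};
};
TORCH_MODULE(NetB);

struct ModelImpl : torch::nn::Module {
  // slices: (start, length) of each subpart of Net A's output along dim 1.
  ModelImpl(int64_t in, int64_t hidden, int64_t sub, int64_t out,
            std::vector<std::pair<int64_t, int64_t>> slices)
      : slices_(std::move(slices)) {
    // Placeholder for the real Net A.
    a = register_module("a", torch::nn::Sequential(
            torch::nn::Linear(in, hidden), torch::nn::ReLU()));
    // A single NetB instance: reusing it on every subpart is what makes the
    // weights (and therefore the accumulated gradients) shared.
    b = register_module("b", NetB(sub, out));
  }

  torch::Tensor forward(torch::Tensor x) {
    auto h = a->forward(x);  // [batch, hidden]
    std::vector<torch::Tensor> outs;
    for (const auto& s : slices_) {
      // Overlapping subparts are fine: autograd accumulates the gradient
      // contribution of every call into NetB's shared parameters.
      auto part = h.narrow(/*dim=*/1, /*start=*/s.first, /*length=*/s.second);
      outs.push_back(b->forward(part));
    }
    return torch::cat(outs, /*dim=*/1);
  }

  torch::nn::Sequential a{nullptr};
  NetB b{nullptr};
  std::vector<std::pair<int64_t, int64_t>> slices_;
};
TORCH_MODULE(Model);
```

This loops over the subparts on the host, which is the part I suspect could be done more efficiently on the GPU.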
There is also a subtlety in the training process: a different learning rate is used for each item of a mini-batch, but those learning rates are only known after all items of the batch have been processed, so the per-item gradients need to be stored until the update. (I am training both Net A and Net B.)
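
For the delayed per-item update, the most direct approach I can think of is to run each item separately, call backward, clone the gradients into a buffer, and apply the weighted update once the learning rates are known. In this sketch, the mse_loss and the compute_lrs callback are placeholders, and `Model` is the holder from the sketch above.

```cpp
#include <torch/torch.h>
#include <functional>
#include <vector>

// inputs[i] / targets[i] are the items of one mini-batch (each with its own
// leading batch dimension of 1, e.g. [1, in]). compute_lrs stands for whatever
// procedure yields the per-item learning rates once the batch has been seen.
void train_batch(
    Model& model,
    const std::vector<torch::Tensor>& inputs,
    const std::vector<torch::Tensor>& targets,
    const std::function<std::vector<double>(const std::vector<torch::Tensor>&)>&
        compute_lrs) {
  auto params = model->parameters();
  std::vector<std::vector<torch::Tensor>> per_item_grads;  // [item][param]
  std::vector<torch::Tensor> outputs;

  for (size_t i = 0; i < inputs.size(); ++i) {
    // Clear gradients left over from the previous item.
    for (auto& p : params) {
      if (p.grad().defined()) p.grad().zero_();
    }
    auto out = model->forward(inputs[i]);
    auto loss = torch::mse_loss(out, targets[i]);  // placeholder loss
    loss.backward();

    // Save a detached copy of this item's gradients.
    std::vector<torch::Tensor> grads;
    for (auto& p : params) {
      grads.push_back(p.grad().defined() ? p.grad().detach().clone()
                                         : torch::zeros_like(p));
    }
    per_item_grads.push_back(std::move(grads));
    outputs.push_back(out.detach());
  }

  // The learning rates only become available now.
  std::vector<double> lrs = compute_lrs(outputs);

  // Manual SGD-style update with a different learning rate per item.
  torch::NoGradGuard no_grad;
  for (size_t i = 0; i < per_item_grads.size(); ++i) {
    for (size_t j = 0; j < params.size(); ++j) {
      params[j].add_(per_item_grads[i][j], /*alpha=*/-lrs[i]);
    }
  }
}
```

Since gradients are linear, an equivalent alternative for a plain SGD-style step would be to keep the per-item losses (retaining their graphs), and once the learning rates are known call backward once on sum_i lr_i * loss_i, at the cost of holding the whole batch's graphs in memory. I am not sure which of the two is the better fit for the GPU, hence the question.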

I am looking for the best way to implement this architecture so that it can be ported to the GPU and handles gradients efficiently. The closest feature I have found is this, which is not what I am looking for.
(I am using the C++ front-end.)