While working on a question-answering (MRC) problem, I implemented two different architectures that independently produce two tensors (probability distributions over the tokens). Both tensors have shape (batch_size, 512), and I want the final output to have shape (batch_size, 512) as well. How can I combine the two tensors using trainable weights and then train the model on the final prediction?
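
To make the question concrete, here is a minimal sketch of one way such a combination could look. It is only an illustration under my own assumptions, not a settled design: the class name WeightedMix and the parameter raw_alpha are hypothetical, and a single scalar mixing weight passed through a sigmoid is just one option (a per-token weight or a small gating layer would be alternatives).

```python
import torch
import torch.nn as nn


class WeightedMix(nn.Module):
    """Convex combination of two probability tensors with one trainable weight.

    Hypothetical helper for illustration; not part of PyTorch.
    """

    def __init__(self):
        super().__init__()
        # Unconstrained parameter; sigmoid maps it to a mixing weight in (0, 1).
        self.raw_alpha = nn.Parameter(torch.zeros(1))

    def forward(self, p1, p2):
        alpha = torch.sigmoid(self.raw_alpha)
        # A convex combination of two distributions is again a distribution,
        # so each row of the result still sums to 1.
        return alpha * p1 + (1.0 - alpha) * p2


torch.manual_seed(0)
# Stand-ins for the two architectures' outputs, shape (batch_size, 512).
p1 = torch.softmax(torch.randn(4, 512), dim=-1)
p2 = torch.softmax(torch.randn(4, 512), dim=-1)

mixer = WeightedMix()
mixed = mixer(p1, p2)  # shape (4, 512)
```

Because raw_alpha is an nn.Parameter, it would be picked up by the optimizer together with the rest of the model as long as the mixer is registered as a submodule.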

Additional Information:

In the forward function of my model, I use BERT to encode the 512 tokens. Each encoding is 768-dimensional. The encodings are passed to an nn.Linear(768, 1) layer, which outputs a tensor of shape (batch_size, 512, 1). Separately, I have another model built on top of the BERT encodings that also yields a tensor of shape (batch_size, 512, 1). I want to combine these two tensors into a single tensor of shape (batch_size, 512, 1) that can be trained against the output logits of the same shape using CrossEntropyLoss. Also, since both output tensors are first normalized (softmax) and then combined, please suggest a way to compute the CrossEntropyLoss from the final output tensor.
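
As a rough sketch of the loss step, under assumptions that may not match my real setup: nn.CrossEntropyLoss applies log-softmax internally, so feeding it already-normalized probabilities would normalize twice. One alternative is to take the log of the combined probabilities and use nn.NLLLoss. The random tensors below stand in for the two heads' outputs, raw_alpha is a hypothetical mixing parameter, and the target is assumed to be one gold token index per example rather than a tensor of shape (batch_size, 512, 1).

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
batch_size, seq_len = 4, 512

# Stand-ins for the two heads' raw scores, shape (batch_size, 512, 1).
scores_a = torch.randn(batch_size, seq_len, 1, requires_grad=True)
scores_b = torch.randn(batch_size, seq_len, 1, requires_grad=True)

# Hypothetical trainable mixing parameter (sigmoid keeps the weight in (0, 1)).
raw_alpha = nn.Parameter(torch.zeros(1))

# Drop the trailing dimension and normalize over the 512 token positions.
probs_a = torch.softmax(scores_a.squeeze(-1), dim=-1)
probs_b = torch.softmax(scores_b.squeeze(-1), dim=-1)

alpha = torch.sigmoid(raw_alpha)
mixed = alpha * probs_a + (1.0 - alpha) * probs_b  # (batch_size, 512)

# Assumed target format: the index of the gold token for each example.
target = torch.randint(0, seq_len, (batch_size,))

# NLLLoss expects log-probabilities; clamp to avoid log(0).
log_probs = torch.log(mixed.clamp_min(1e-12))
loss = nn.NLLLoss()(log_probs, target)
loss.backward()  # gradients reach both heads and raw_alpha
```

If a (batch_size, 512, 1) output is needed elsewhere, mixed.unsqueeze(-1) restores the trailing dimension after the loss is computed.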

Please share a PyTorch code snippet if possible.