How to implement sum fusion for a two-stream CNN

I have implemented a two-stream CNN for action recognition. The first stream is a spatial model for spatial features, and the second one is for temporal features (optical flow).
I trained each model separately. As the next step, I want to apply sum fusion and then predict the class using softmax.
My questions are:
1- For the fusion, I took the weights of each stream's classifier and did a simple tensor addition. Is that the correct way to do it?
2- For the class prediction in the last step using softmax, should I train with both the spatial and the optical flow data? I am quite confused.
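To make the questions concrete, here is a minimal sketch of one possible reading of sum fusion. It assumes the fusion is applied to the per-class output scores of the two classifiers (not to their parameters) and that both streams output one score per class in the same class order. The names `sum_fusion`, `softmax`, `spatial_scores`, and `temporal_scores` and the numbers are hypothetical and only for illustration. It uses plain Python, so it does not depend on any framework.

```python
import math

def softmax(scores):
    """Turn a list of raw class scores into probabilities."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def sum_fusion(spatial_scores, temporal_scores):
    """Element-wise sum of the per-class scores of the two streams."""
    if len(spatial_scores) != len(temporal_scores):
        raise ValueError("both streams must output one score per class")
    return [s + t for s, t in zip(spatial_scores, temporal_scores)]

# Hypothetical raw scores for one video clip and 3 classes (made-up values).
spatial_scores = [2.0, 0.5, -1.0]    # assumed output of the spatial stream
temporal_scores = [0.1, 1.5, 0.2]    # assumed output of the optical flow stream

fused = sum_fusion(spatial_scores, temporal_scores)
probs = softmax(fused)
predicted_class = max(range(len(probs)), key=probs.__getitem__)
```

In this sketch nothing is trained after the fusion: the two separately trained streams are only combined at prediction time. Whether that is enough, or whether the fused model needs further training on both inputs, is what question 2 is asking.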
I have read papers, but they do not go into detail, and the code I found on GitHub did not make sense to me.
Any help is welcome. Thank you in advance.