These two should be equivalent even though they define different computational graphs (potentially performing the broadcasting in a different way).
That means that, during training, because of floating-point precision, they can end up giving noticeably different results, as small numerical errors are amplified over the course of training.
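For instance, here is a minimal sketch of how such a small numerical discrepancy can show up. The shapes are arbitrary and chosen only for illustration, and on some backends the difference may be exactly zero:

```python
import torch

torch.manual_seed(0)

# Illustrative shapes only.
batch, n, m, p = 8, 32, 64, 16
a = torch.randn(batch, n, m)
w = torch.randn(m, p)

# matmul broadcasts the 2-D weight across the batch dimension itself...
out_matmul = torch.matmul(a, w)

# ...while bmm requires two 3-D inputs, so we expand w explicitly.
out_bmm = torch.bmm(a, w.unsqueeze(0).expand(batch, m, p))

# The two results are mathematically identical, but the underlying
# kernels may accumulate floating-point error in a different order,
# so the difference can be small but nonzero.
print((out_matmul - out_bmm).abs().max())
```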
Do you have a small code sample that reproduces the behavior you describe?
Unfortunately, I do not have a simple code sample showing the behavior; I only have a working version of a more developed model.
I will try to make one and post it here as soon as possible.
Meanwhile, in your opinion, which function should normally be used in this case?
Both are correct. You can think of switching from one to the other as having the same effect as changing the random seed you set at the beginning of your script: all the individual numbers you get will be different, but if your model is robust, both versions should converge to a similar solution in terms of performance.
What is the reason behind having both matmul and bmm when they can potentially do the exact same task? How exactly do they differ from one another? Is there any advantage to using one over the other?
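For instance, both calls below appear to compute the same batched product, while matmul also accepts inputs that bmm rejects (a small sketch; the shapes are arbitrary):

```python
import torch

a = torch.randn(4, 5, 6)
b = torch.randn(4, 6, 7)
w = torch.randn(6, 7)

# Both compute the same batched matrix product for 3-D inputs.
out1 = torch.bmm(a, b)       # bmm: strictly two 3-D tensors
out2 = torch.matmul(a, b)    # matmul: handles this case too

# matmul is more general: it also accepts 1-D/2-D operands and
# broadcasts batch dimensions, which bmm does not.
out3 = torch.matmul(a, w)    # w is broadcast across the batch: (4, 5, 7)
# torch.bmm(a, w)            # would raise an error: bmm expects 3-D inputs
```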