Why are PyTorch's broadcasting semantics so weird? Also, why do we need broadcasting semantics at all?

Please let me know why it is that way.

You can check the note about broadcasting here. The main point is its first sentence: “Many PyTorch operations support NumPy Broadcasting Semantics.” We follow NumPy’s semantics because many users are already used to them.
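
As for why broadcasting is useful at all: it lets you combine tensors of different shapes, for example adding a bias of shape `(3,)` to a batch of shape `(2, 3)`, without manually expanding the smaller one first. In case it helps, here is a rough sketch of the shape rule in plain Python. The function `broadcast_shape` is a hypothetical helper written for illustration, not part of PyTorch or NumPy; it only mimics the rule that shapes are compared from the trailing dimension, and two dimensions are compatible when they are equal, one of them is 1, or one is missing.

```python
from itertools import zip_longest


def broadcast_shape(shape_a, shape_b):
    """Illustrative helper (not a PyTorch API): compute the broadcast shape.

    Raises ValueError if the two shapes cannot be broadcast together.
    """
    result = []
    # Compare dimensions starting from the trailing one.
    # A dimension missing from the shorter shape is treated as size 1.
    pairs = zip_longest(reversed(shape_a), reversed(shape_b), fillvalue=1)
    for dim_a, dim_b in pairs:
        if dim_a == dim_b or dim_b == 1:
            result.append(dim_a)
        elif dim_a == 1:
            result.append(dim_b)
        else:
            raise ValueError(
                f"shapes {shape_a} and {shape_b} are not broadcastable"
            )
    return tuple(reversed(result))


print(broadcast_shape((2, 3), (3,)))      # bias added to a batch
print(broadcast_shape((3, 1), (1, 4)))    # outer-product style
print(broadcast_shape((5, 3, 4, 1), (3, 1, 1)))
```

Something like `(2, 3)` with `(2,)` fails under this rule, because the trailing dimensions are 3 and 2 and neither is 1. That is usually the part that feels weird at first: alignment happens from the right, not from the left.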