I was going through some code (specifically a feature-visualization implementation) and noticed that it applies torch.cumsum to the features. I understand the cumsum operation itself, but I'm confused about what benefit it brings to the final result in terms of accuracy, speed, stability, etc. In other words, why apply torch.cumsum to an attention matrix?
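
For reference, here is a minimal sketch of the kind of operation I mean (the tensor shape and the dim argument are my assumptions, since I'm not showing the original code):

```python
import torch

# Toy attention matrix: (batch, heads, query_len, key_len)
attn = torch.softmax(torch.randn(1, 2, 4, 4), dim=-1)

# Cumulative sum along the key dimension (dim=-1 is an assumption;
# the code I was reading may use a different axis)
attn_cumsum = torch.cumsum(attn, dim=-1)

print(attn[0, 0])         # each row sums to 1 after softmax
print(attn_cumsum[0, 0])  # each row becomes a running total, ending at ~1
```

Is the point of this just to visualize how attention mass accumulates across positions, or does it actually change the model's behavior in some way?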