Negative training loss

I tried to implement Mamba (from "Mamba: Linear-Time Sequence Modeling with Selective State Spaces"), but I am getting a negative training loss. Any ideas on what could cause this and how to fix it?