Optimization issue with MSE loss in PyTorch

I made a classic matrix factorization model for a movie recommendation system in Keras, using a batch size of 128, Stochastic Gradient Descent, and MSE loss on the MovieLens 20M dataset. It reached a minimum loss of 0.59 within 5 epochs. I then recreated the exact same model in PyTorch (the architecture was identical, and I also checked that the total parameter counts matched, just for confirmation).

The loss I get for each batch in PyTorch is around 0.007, whereas in Keras it came down from 1 to about 0.5. When I multiply the PyTorch loss by 128 it lands in the range of 0.9, so I have been comparing the PyTorch loss after multiplying it by 128. The issue is that the PyTorch model takes forever to train: after more than 40 epochs the loss won't go below 0.78 (after multiplying by 128), and the results are poor compared to the Keras model.

Can anyone please explain what I did wrong? My guess is that I made a mistake in the training loop. It would also help if someone could show what a training loop with Stochastic Gradient Descent and batch size 128 is supposed to look like in PyTorch without the default DataLoader, because I guess it does not support CSV as of now. Thanks in advance.
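
For reference, here is a minimal check (with toy numbers, not my actual data) of how PyTorch's `nn.MSELoss` reductions behave. The default `reduction='mean'` already averages over the batch, the same way Keras' mse reports a per-sample average, while `reduction='sum'` gives mean times batch size:

```python
import torch
import torch.nn as nn

pred = torch.tensor([2.0, 4.0, 6.0, 8.0])
target = torch.tensor([1.0, 3.0, 5.0, 7.0])

# Each squared error is 1.0, so mean = 1.0 and sum = 4.0.
mean_loss = nn.MSELoss()(pred, target)                # default reduction='mean'
sum_loss = nn.MSELoss(reduction='sum')(pred, target)  # mean * batch size

print(mean_loss.item())  # 1.0 — already batch-averaged, comparable to Keras' mse
print(sum_loss.item())   # 4.0
```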

The link to the Keras model code is: https://gist.github.com/Yash-567/344ad748be4c4d3df1344eb506e38d58

The link to the PyTorch implementation is: https://gist.github.com/Yash-567/3da2cc10ffc261f565d5cfa0b040f544
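
For context, this is roughly the kind of training loop I am asking about: a minimal sketch of matrix factorization trained with SGD and batch size 128, without a DataLoader. The model class, embedding dimension, and random toy data here are my own placeholders (in practice the user/item/rating arrays would be read from the MovieLens CSV, e.g. with pandas), so this is an illustration under those assumptions, not my actual gist:

```python
import numpy as np
import torch
import torch.nn as nn

torch.manual_seed(0)
np.random.seed(0)

# Placeholder data; in practice something like:
#   df = pd.read_csv("ratings.csv")
# and take the user, item, and rating columns as numpy arrays.
n_users, n_items, n_ratings = 100, 50, 2048
users = np.random.randint(0, n_users, n_ratings)
items = np.random.randint(0, n_items, n_ratings)
ratings = np.random.uniform(0.5, 5.0, n_ratings).astype(np.float32)

class MF(nn.Module):
    """Classic matrix factorization: rating ~ user_factors . item_factors."""
    def __init__(self, n_users, n_items, dim=16):
        super().__init__()
        self.user_emb = nn.Embedding(n_users, dim)
        self.item_emb = nn.Embedding(n_items, dim)
        # small initial weights, roughly like Keras' default embedding init
        nn.init.normal_(self.user_emb.weight, std=0.1)
        nn.init.normal_(self.item_emb.weight, std=0.1)

    def forward(self, u, i):
        # dot product of user and item factors -> predicted rating
        return (self.user_emb(u) * self.item_emb(i)).sum(dim=1)

model = MF(n_users, n_items)
criterion = nn.MSELoss()  # reduction='mean': batch average, like Keras' mse
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
batch_size = 128

epoch_losses = []
for epoch in range(5):
    perm = np.random.permutation(n_ratings)  # reshuffle every epoch
    total, n_batches = 0.0, 0
    for start in range(0, n_ratings, batch_size):
        idx = perm[start:start + batch_size]
        u = torch.from_numpy(users[idx]).long()
        i = torch.from_numpy(items[idx]).long()
        r = torch.from_numpy(ratings[idx])
        optimizer.zero_grad()  # clear gradients from the previous step
        loss = criterion(model(u, i), r)
        loss.backward()
        optimizer.step()
        total += loss.item()
        n_batches += 1
    epoch_losses.append(total / n_batches)
    print(f"epoch {epoch}: mean batch loss {epoch_losses[-1]:.4f}")
```

Note that the reported loss here is already the per-batch mean, so no multiplication by the batch size should be needed to compare it with Keras.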