How can I implement a squared-loss penalty in PyTorch so that it runs fast and supports different parameter ranges?

How can I implement the squared-loss penalty described in this thesis so that it runs fast?

Multipass Deep Q-Networks (wits.ac.za)
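
To make the question concrete, here is a minimal sketch of one possible reading, not checked against the thesis: the penalty is taken to be the squared distance by which each action parameter falls outside its own `[low, high]` range, averaged over the batch. The function name `squared_loss_penalty`, the tensor shapes, and the example bounds are all assumptions made for illustration. It uses only broadcasting and `torch.clamp`, so there is no Python loop over parameters.

```python
import torch

def squared_loss_penalty(params, low, high):
    """Hypothetical helper (assumed definition, not taken from the thesis).

    params: (batch, n_params) tensor of action parameters
    low, high: (n_params,) tensors with per-parameter bounds
    """
    # Amount by which each parameter falls below / above its range (0 if inside).
    below = torch.clamp(low - params, min=0.0)
    above = torch.clamp(params - high, min=0.0)
    # Sum the squared violations per sample, then average over the batch.
    return (below.pow(2) + above.pow(2)).sum(dim=1).mean()

# Example with three parameters that each have a different range (made-up values).
low = torch.tensor([-1.0, 0.0, -180.0])
high = torch.tensor([1.0, 100.0, 180.0])
params = torch.tensor(
    [[0.5, 120.0, -200.0],
     [-2.0, 50.0, 0.0]],
    requires_grad=True,
)

penalty = squared_loss_penalty(params, low, high)
penalty.backward()  # gradients are zero for in-range parameters
```

The bounds broadcast across the batch dimension, so the same call works for any number of parameters and ranges, and it runs on the GPU if all three tensors are on the same device.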