Most of these functions are implemented in C/CUDA. For reference, here is the C implementation of NLLLoss: