Having trouble understanding what loss is currently in use

I was going through this Hugging Face code and I am having trouble understanding what loss the model is currently using. Although I know most seq2seq models use cross-entropy loss, I don't see the definition anywhere in the code:

huggingface/transformers/blob/aca6288ff42cebded5421020f0ff088adeb446dd/examples/language-modeling/run_clm.py

Actually, I want to train the model with a new custom loss. I have trained a baseline model and it's working fine.

Thank you

The transformers library wires the loss into the model's forward; for Llama, it is indeed CrossEntropyLoss on the next token.
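As a rough sketch of the idea (not the library's exact code): when you pass `labels`, the logits are shifted by one position against the labels and fed to cross-entropy, with -100 as the ignore index for masked positions.

```python
import torch
import torch.nn.functional as F

def next_token_loss(logits: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    # logits: (batch, seq_len, vocab_size); labels: (batch, seq_len)
    # Shift so that tokens at positions < n predict the token at position n.
    shift_logits = logits[:, :-1, :].contiguous()
    shift_labels = labels[:, 1:].contiguous()
    # Flatten and compute cross-entropy; label -100 marks ignored positions.
    return F.cross_entropy(
        shift_logits.view(-1, shift_logits.size(-1)),
        shift_labels.view(-1),
        ignore_index=-100,
    )
```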

I am not sure I would recommend following that style in your own modelling, though.

For easier-to-understand implementations, A. Karpathy's NanoGPT or Lightning.ai's LitGPT (disclosure: I do some work for them) might be good choices.

Best regards

Thomas

Thanks for the quick reply. Currently I am using pre-trained Llama models from Hugging Face, and I want to fine-tune Llama with a weighted loss function. Any idea how I can integrate it into the transformers library? I found some links related to that, but it does not seem to work.

You could subclass the model and reimplement the forward with your modification, along the lines of the sketch below.
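A minimal sketch, assuming you want per-vocabulary-class weights; the `WeightedLlamaForCausalLM` name, the `class_weights` attribute, and the checkpoint name are my own examples, not part of the library:

```python
import torch
import torch.nn.functional as F
from transformers import LlamaForCausalLM
from transformers.modeling_outputs import CausalLMOutputWithPast

class WeightedLlamaForCausalLM(LlamaForCausalLM):
    """Hypothetical subclass replacing the built-in loss with a weighted one."""

    def forward(self, input_ids=None, attention_mask=None, labels=None, **kwargs):
        # Call the parent forward without labels so it skips its own loss.
        outputs = super().forward(
            input_ids=input_ids, attention_mask=attention_mask, **kwargs
        )
        loss = None
        if labels is not None:
            # Same shift as the stock next-token loss ...
            shift_logits = outputs.logits[:, :-1, :].contiguous()
            shift_labels = labels[:, 1:].contiguous()
            # ... but with per-class weights; `self.class_weights` is an
            # assumed (vocab_size,) tensor that you set after loading the
            # model (it is a plain attribute, so move it to the right device).
            loss = F.cross_entropy(
                shift_logits.view(-1, shift_logits.size(-1)),
                shift_labels.view(-1),
                weight=self.class_weights.to(shift_logits.device),
                ignore_index=-100,
            )
        return CausalLMOutputWithPast(
            loss=loss,
            logits=outputs.logits,
            past_key_values=outputs.past_key_values,
            hidden_states=outputs.hidden_states,
            attentions=outputs.attentions,
        )
```

Usage would then look something like this, after which the model should work with the Trainer as usual, since it still returns the loss in the standard output object:

```python
model = WeightedLlamaForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")
model.class_weights = torch.ones(model.config.vocab_size)  # adjust weights as needed
```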

However, I do think that, fundamentally, transformers is a library targeted at people using the models as-is. The two repos that I linked (which both have A. Karpathy's earlier MinGPT as a reference point) deliberately do things differently: there, it is intended that you take and modify the code rather than use them just as a library.

Best regards

Thomas