Huggingface's GPT2: implement causal attention?

Hi !

I am trying to train huggingface's implementation of the GPT2 model from scratch (meaning I am using their architecture but not their pre-trained weights), but I noticed by looking into the code here that there doesn't seem to be an implementation of a causal mask.
I could write an ugly for loop and feed each of my sequences one token at a time to the network, which would be super inefficient. I could also chop up each of my examples token by token, pad them, and feed them as a batch, which is probably faster but doesn't feel very satisfying.

Have any of you worked closely with huggingface's transformers before? Do you know if there is an implementation of the causal mask that I missed, or another way to do what I am describing?

PS: Yes, I have already read huggingface's blog post on training from scratch, but it's mostly incomplete and the relevant parts concerning training are left out.


Hi, I think the causal masking you are referring to happens here. The bias parameter is a lower triangular matrix of shape (max_ctx_len, max_ctx_len). The linked line slices this matrix down to the appropriate sequence length for the current input, so the mask is applied automatically and you don't need to feed tokens one at a time.
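To make this concrete, here is a minimal sketch of that masking scheme (my own simplified reconstruction, not the actual transformers code; the names `bias` and `causal_attention` and the small `max_ctx_len` are illustrative): a lower-triangular buffer is built once for the maximum context length, then sliced to the current sequence length inside attention.

```python
import torch

# Precomputed lower-triangular mask for the maximum context length.
# GPT-2 uses max_ctx_len = 1024; a small value is used here for illustration.
max_ctx_len = 8
# bias[0, 0, i, j] == 1 iff j <= i (each position attends only to the past)
bias = torch.tril(torch.ones(max_ctx_len, max_ctx_len)).view(
    1, 1, max_ctx_len, max_ctx_len
)

def causal_attention(q, k, v):
    """Scaled dot-product attention with the sliced causal mask.

    q, k, v: tensors of shape (batch, heads, seq_len, head_dim).
    """
    seq_len = q.size(-2)
    scores = q @ k.transpose(-2, -1) / (q.size(-1) ** 0.5)
    # Slice the precomputed mask down to the current sequence length,
    # as the linked line in the GPT-2 source does.
    mask = bias[:, :, :seq_len, :seq_len]
    scores = scores.masked_fill(mask == 0, float("-inf"))
    weights = torch.softmax(scores, dim=-1)
    return weights @ v

q = torch.randn(1, 2, 5, 4)
k = torch.randn(1, 2, 5, 4)
v = torch.randn(1, 2, 5, 4)
out = causal_attention(q, k, v)
print(out.shape)  # torch.Size([1, 2, 5, 4])
```

Because future positions are masked to `-inf` before the softmax, the output at position `t` depends only on positions `0..t`, which is exactly what feeding tokens one at a time would give you, but in a single batched pass.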


Hardcoding a 1024 * 1024 matrix seems very inefficient.
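For scale, a back-of-the-envelope sketch (my own numbers, not from the thread): stored one byte per entry, a 1024 x 1024 mask is about 1 MiB, which is tiny next to GPT-2's parameter count, and you can always build the mask on the fly for the current sequence length instead.

```python
import torch

# Memory cost of the precomputed mask, assuming one byte per entry (uint8).
max_ctx_len = 1024
mask_bytes = max_ctx_len * max_ctx_len
print(mask_bytes / 2**20)  # 1.0 (MiB)

# Alternative: construct the causal mask per forward pass, sized to the
# actual sequence length, so nothing larger than needed is ever allocated.
def causal_mask(seq_len):
    return torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))

print(causal_mask(3))
```

Precomputing trades that small, fixed allocation for avoiding a `tril` construction on every forward pass; either choice gives identical attention results.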