Huggingface's GPT2: implement causal attention?

Hi !

I am trying to train huggingface's implementation of the GPT2 model from scratch (meaning I am using their architecture but not their pre-trained weights), but I noticed by looking into the code here that there doesn't seem to be an implementation of a causal mask.
I could write an ugly for loop and feed each of my sequences one token at a time to the network, which would be super inefficient. I could also chop up each of my examples token by token, pad them, and feed them as a batch, which is probably faster but doesn't feel very satisfying.

Have any of you worked closely with huggingface's transformers before? Do you know if there is an implementation of the causal mask that I missed, or another way to do what I am describing?

PS: Yes, I have already read huggingface's blog post on training from scratch, but it's mostly incomplete and the relevant parts concerning training are left out.


Hi, I think the causal masking you are referring to happens here. The bias parameter is a lower triangular matrix of shape (max_ctx_len, max_ctx_len). The linked line slices this matrix down to the appropriate sequence length for the current input, so the mask is applied automatically and you don't need to feed tokens one at a time.
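To make this concrete, here is a minimal sketch of that masking scheme (my own simplified reconstruction, not the actual transformers code; the names `bias` and `causal_attention` and the small `max_ctx_len` are illustrative): a lower-triangular buffer is built once for the maximum context length, then sliced to the current sequence length inside attention.

```python
import torch

# Precomputed lower-triangular mask for the maximum context length.
# GPT-2 uses max_ctx_len = 1024; a small value is used here for illustration.
max_ctx_len = 8
# bias[0, 0, i, j] == 1 iff j <= i (each position attends only to the past)
bias = torch.tril(torch.ones(max_ctx_len, max_ctx_len)).view(
    1, 1, max_ctx_len, max_ctx_len
)

def causal_attention(q, k, v):
    """Scaled dot-product attention with the sliced causal mask.

    q, k, v: tensors of shape (batch, heads, seq_len, head_dim).
    """
    seq_len = q.size(-2)
    scores = q @ k.transpose(-2, -1) / (q.size(-1) ** 0.5)
    # Slice the precomputed mask down to the current sequence length,
    # as the linked line in the GPT-2 source does.
    mask = bias[:, :, :seq_len, :seq_len]
    scores = scores.masked_fill(mask == 0, float("-inf"))
    weights = torch.softmax(scores, dim=-1)
    return weights @ v

q = torch.randn(1, 2, 5, 4)
k = torch.randn(1, 2, 5, 4)
v = torch.randn(1, 2, 5, 4)
out = causal_attention(q, k, v)
print(out.shape)  # torch.Size([1, 2, 5, 4])
```

Because future positions are masked to `-inf` before the softmax, the output at position `t` depends only on positions `0..t`, which is exactly what feeding tokens one at a time would give you, but in a single batched pass.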


Hardcoding a 1024 * 1024 matrix seems very inefficient.
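For scale, a back-of-the-envelope sketch (my own numbers, not from the thread): stored one byte per entry, a 1024 x 1024 mask is about 1 MiB, which is tiny next to GPT-2's parameter count, and you can always build the mask on the fly for the current sequence length instead.

```python
import torch

# Memory cost of the precomputed mask, assuming one byte per entry (uint8).
max_ctx_len = 1024
mask_bytes = max_ctx_len * max_ctx_len
print(mask_bytes / 2**20)  # 1.0 (MiB)

# Alternative: construct the causal mask per forward pass, sized to the
# actual sequence length, so nothing larger than needed is ever allocated.
def causal_mask(seq_len):
    return torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))

print(causal_mask(3))
```

Precomputing trades that small, fixed allocation for avoiding a `tril` construction on every forward pass; either choice gives identical attention results.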