param.grad is None when using loss.backward() for XLNet

In the below code snippet, when I try and iterate through model.parameters() in order to obtain the param.grad data, I get a

AttributeError: 'NoneType' object has no attribute 'data'

signifying that the backward pass, which is done via loss.backward(), did not store the gradient. This code worked for BERT and Electra, so I’m not sure why it’s not working with XLNet. Any help would be appreciated.

import torch
from transformers import XLNetForTokenClassification

def xln(input_ids, labels, epochs = 2):

    learning_rate = 0.02
    model = XLNetForTokenClassification.from_pretrained('xlnet-base-cased', num_labels = 2)
    # zero out the gradients
    model.zero_grad()
    for epoch in range(epochs):
        for index, sequences in enumerate(input_ids):

            outputs = model(torch.tensor(sequences).unsqueeze(0),
                            labels = torch.tensor(labels[index]).unsqueeze(0))

            # this loss variable points to the entire model graph
            loss = outputs.loss
            # compute gradients; they should now be available via model.parameters()
            loss.backward()
            for param in model.parameters():
                # raises AttributeError for any parameter whose .grad is None
                param.data -= learning_rate * param.grad.data
            # zero out the gradients
            model.zero_grad()
    return model
model = xln(x_inputs,  y_labels)

The majority of the parameters have a valid gradient, while some don’t and you can check it via:

for name, param in model.named_parameters():
    if param.grad is None:
        print(name, 'is None')
    else:
        print('param {}: {}'.format(name, param.grad.abs().sum()))

The parameters whose .grad is None are:

transformer.mask_emb is None
transformer.layer.0.rel_attn.r_s_bias is None
transformer.layer.0.rel_attn.seg_embed is None

I’m not familiar with this model implementation, but I guess this is on purpose.
@stas would most likely be familiar with the implementation.
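As a stopgap for the manual update loop in the question, one option is to skip any parameter whose .grad is None. The snippet below is only a minimal sketch of that guard: TinyModel and its `unused` parameter are hypothetical stand-ins for a model with parameters that did not take part in the forward pass, not XLNet itself.

```python
import torch
from torch import nn

class TinyModel(nn.Module):
    """Hypothetical stand-in: `unused` never enters the forward pass."""
    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(4, 2)
        self.unused = nn.Parameter(torch.ones(3))

    def forward(self, x):
        return self.linear(x)

torch.manual_seed(0)
model = TinyModel()
learning_rate = 0.02

loss = model(torch.randn(8, 4)).pow(2).mean()
model.zero_grad()
loss.backward()

skipped = []
with torch.no_grad():  # update the weights without recording the update in autograd
    for name, param in model.named_parameters():
        if param.grad is None:
            # this parameter was not part of the graph for this loss
            skipped.append(name)
            continue
        param -= learning_rate * param.grad

print(skipped)
```

Using torch.no_grad() with an in-place update is the usual alternative to touching param.data directly. A built-in optimizer such as torch.optim.SGD also leaves parameters without a gradient untouched.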

Ok, so the mask is obviously NOT a part of the gradient, but how are the segment embedding and its respective bias not differentiable? According to the XLNet paper:

Architecturally, different from BERT that adds an absolute segment embedding to the word embedding at each position, we extend the idea of relative encodings from Transformer-XL to also encode the segments. Given a pair of positions i and j in the sequence, if i and j are from the same segment, we use a segment encoding s_ij = s_+ or otherwise s_ij = s_−, where s_+ and s_− are learnable model parameters for each attention head. In other words, we only consider whether the two positions are within the same segment, as opposed to considering which specific segments they are from. This is consistent with the core idea of relative encodings; i.e., only modeling the relationships between positions. When i attends to j, the segment encoding s_ij is used to compute an attention weight a_ij = (q_i + b)^T s_ij, where q_i is the query vector as in a standard attention operation and b is a learnable head-specific bias vector.

It clearly states that the relative segment embeddings and the relative segment bias are learnable parameters. Are they, by default, only learnable during pre-training, with the learning “turned off” during fine-tuning?
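To make the quoted formula concrete, here is a small single-head sketch of a_ij = (q_i + b)^T s_ij in PyTorch. The dimension, the segment ids and all tensor names are illustrative assumptions, and this is not the Hugging Face implementation.

```python
import torch

torch.manual_seed(0)
d = 4                                   # per-head dimension (illustrative)
segment_ids = torch.tensor([0, 0, 1])   # hypothetical: two tokens in segment 0, one in segment 1

s_plus = torch.randn(d, requires_grad=True)    # s_+ : same-segment encoding
s_minus = torch.randn(d, requires_grad=True)   # s_- : different-segment encoding
b = torch.randn(d, requires_grad=True)         # head-specific bias vector
q = torch.randn(len(segment_ids), d)           # query vectors q_i

# same[i, j] is True when positions i and j share a segment
same = segment_ids[:, None] == segment_ids[None, :]
# s[i, j] = s_+ if same segment, else s_-
s = torch.where(same[..., None], s_plus, s_minus)
# a[i, j] = (q_i + b)^T s_ij
a = torch.einsum("id,ijd->ij", q + b, s)

a.sum().backward()
print(a.shape, s_plus.grad is not None, s_minus.grad is not None, b.grad is not None)
```

In this sketch all three tensors receive a gradient because the segment term is actually computed. If a forward pass never evaluates that term, autograd has nothing to propagate to them.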

@Ben_Nicholl, may I suggest that this question probably belongs in the huggingface/transformers GitHub issues, as this doesn’t have much to do with PyTorch.

You will be able to see quickly in the model code that some parameters don’t always participate in the graph, and then, of course, they won’t have the grad set. There are quite a few conditionals, so unless you hit the right combination of inputs that includes those parameters, this is what you get. I hope it makes sense.
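The effect can be reproduced with a toy module. This is only a sketch under the assumption that an optional input gates the branch: the Gated class, its seg_embed parameter and the segment_ids argument are made up for illustration and are not the XLNet code.

```python
import torch
from torch import nn

class Gated(nn.Module):
    """Toy module whose `seg_embed` is only used when segment ids are passed."""
    def __init__(self):
        super().__init__()
        self.proj = nn.Linear(4, 4)
        self.seg_embed = nn.Parameter(torch.randn(2, 4))  # hypothetical name

    def forward(self, x, segment_ids=None):
        out = self.proj(x)
        if segment_ids is not None:
            # only this branch puts seg_embed into the autograd graph
            out = out + self.seg_embed[segment_ids]
        return out

def none_grad_names(model, **kwargs):
    """Run one forward/backward pass and list parameters whose .grad is None."""
    model.zero_grad(set_to_none=True)
    model(torch.randn(3, 4), **kwargs).sum().backward()
    return [n for n, p in model.named_parameters() if p.grad is None]

model = Gated()
without_segments = none_grad_names(model)
with_segments = none_grad_names(model, segment_ids=torch.tensor([0, 1, 0]))
print(without_segments)
print(with_segments)
```

The parameter stays learnable in both cases. It simply receives no gradient on a pass where the branch that uses it is skipped.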

I am not familiar with this particular model myself, but a quick search showed that the params @ptrblck identified are all enclosed in some if.