In the code snippet below, when I iterate through model.parameters() to read param.grad.data, I get:

AttributeError: ‘NoneType’ object has no attribute ‘data’

which suggests that the backward pass, done via loss.backward(), did not store a gradient for at least one parameter. This code worked for BERT and Electra, so I'm not sure why it's not working with XLNet. Any help would be appreciated.

import torch
from transformers import XLNetForTokenClassification

def xln(input_ids, labels, epochs=2):
    learning_rate = 0.02
    model = XLNetForTokenClassification.from_pretrained(
        'xlnet-base-cased',
        num_labels=2,
        output_hidden_states=True,
        output_attentions=True,
    )
    model.train()
    # zero the gradients before training
    model.zero_grad()
    for epoch in range(epochs):
        for index, sequences in enumerate(input_ids):
            outputs = model(
                torch.tensor(sequences).unsqueeze(0),
                token_type_ids=None,
                attention_mask=None,
                labels=torch.tensor(labels[index]).unsqueeze(0),
            )
            # the loss is connected to the model's computation graph
            loss = outputs.loss
            # compute gradients; param.grad should now be set for model.parameters()
            loss.backward()
            for param in model.parameters():
                print('param', param.grad)
                # this is the line that raises AttributeError when param.grad is None
                param.data -= learning_rate * param.grad.data
            # zero the gradients for the next step
            model.zero_grad()
    return model

model = xln(x_inputs, y_labels)

OK, so the mask is obviously NOT part of the gradient, but how are the segment embedding and its respective bias not differentiable? According to the XLNet paper:

Architecturally, different from BERT that adds an absolute segment embedding to the word embedding at each position, we extend the idea of relative encodings from Transformer-XL to also encode the segments. Given a pair of positions i and j in the sequence, if i and j are from the same segment, we use a segment encoding s_ij = s_+ or otherwise s_ij = s_−, where s_+ and s_− are learnable model parameters for each attention head. In other words, we only consider whether the two positions are within the same segment, as opposed to considering which specific segments they are from. This is consistent with the core idea of relative encodings; i.e., only modeling the relationships between positions. When i attends to j, the segment encoding s_ij is used to compute an attention weight a_ij = (q_i + b)^⊤ s_ij, where q_i is the query vector as in a standard attention operation and b is a learnable head-specific bias vector.

It clearly states that the relative segment embeddings and the relative segment bias are learnable parameters. Are they, by default, only learnable during pre-training, with learning “turned off” during fine-tuning?

You will be able to see quickly in modeling_xlnet.py that some parameters don’t always participate in the graph, and then, of course, they won’t have their grad set. There are quite a few conditionals, so unless you hit the right combination of inputs that includes those parameters, this is what you get. I hope that makes sense.
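To make that concrete without loading XLNet, here is a minimal sketch in plain PyTorch. ToyModel, seg_embed, segment_ids and params_without_grad are hypothetical names made up for this illustration; they only mimic the pattern of a parameter that sits behind a conditional and are not taken from modeling_xlnet.py.

```python
import torch
import torch.nn as nn


class ToyModel(nn.Module):
    """Hypothetical stand-in: `seg_embed` is only used when `segment_ids` is given."""

    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(4, 2)
        self.seg_embed = nn.Parameter(torch.randn(2, 4))

    def forward(self, x, segment_ids=None):
        # The parameter enters the graph only inside this branch.
        if segment_ids is not None:
            x = x + self.seg_embed[segment_ids]
        return self.linear(x)


def params_without_grad(model):
    """Names of parameters whose .grad is still None after backward()."""
    return [name for name, p in model.named_parameters() if p.grad is None]


torch.manual_seed(0)
x = torch.randn(3, 4)

# Case 1: the optional input is omitted, so seg_embed never joins the graph.
model = ToyModel()
model(x).sum().backward()
skipped = params_without_grad(model)

# Case 2: a fresh model, this time with the optional input supplied.
model = ToyModel()
model(x, segment_ids=torch.tensor([0, 1, 0])).sum().backward()
skipped_with_segments = params_without_grad(model)

print(skipped, skipped_with_segments)
```

In the first case only seg_embed is reported, in the second nothing is. If XLNet follows the same pattern, as the reply suggests, then calling it with token_type_ids=None would plausibly leave the segment parameters out of the graph; they would still be learnable, just not used in that particular forward pass.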

I am not familiar with this particular model myself, but from a quick search, the params @ptrblck identified all appear to be enclosed in some if.
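If the goal is just to keep a hand-written update loop from crashing, one option is to skip any parameter whose grad is None. The following is a sketch under that assumption, not a recommendation over torch.optim; sgd_step and the two toy linear layers are hypothetical names used only for illustration.

```python
import torch
import torch.nn as nn


def sgd_step(model, learning_rate):
    """Manual SGD update; returns the names of parameters that had no gradient."""
    skipped = []
    with torch.no_grad():  # the update itself must not be tracked by autograd
        for name, param in model.named_parameters():
            if param.grad is None:
                # This parameter did not take part in the last forward pass.
                skipped.append(name)
                continue
            param -= learning_rate * param.grad
    return skipped


torch.manual_seed(0)
used = nn.Linear(3, 1)
unused = nn.Linear(3, 1)  # registered on the model but never called
toy = nn.ModuleDict({"used": used, "unused": unused})

before = used.weight.detach().clone()
used(torch.ones(2, 3)).sum().backward()
skipped = sgd_step(toy, learning_rate=0.02)
print(skipped)
```

Returning the skipped names makes it easy to check which parameters a given combination of inputs leaves out of the graph, rather than silently ignoring them.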