Element 0 of tensors does not require grad and does not have a grad_fn

Hi @ptrblck, first of all, thank you so much for replying to all these questions and in many other threads as well.

I’m also facing the same problem. I have read all the replies, but couldn’t find one that matches my case. So here’s mine; I hope you can help me too.

I’m trying to train the DETR code taken from this notebook:

import torch
from torch import nn
from torchvision.models import resnet50

class DETRdemo(nn.Module):
    """
    Demo DETR implementation.

    Demo implementation of DETR in minimal number of lines, with the
    following differences wrt DETR in the paper:
    * learned positional encoding (instead of sine)
    * positional encoding is passed at input (instead of attention)
    * fc bbox predictor (instead of MLP)
    The model achieves ~40 AP on COCO val5k and runs at ~28 FPS on Tesla V100.
    Only batch size 1 supported.
    """
    def __init__(self, num_classes, hidden_dim=256, nheads=8,
                 num_encoder_layers=6, num_decoder_layers=6):
        super().__init__()

        # create ResNet-50 backbone
        self.backbone = resnet50()
        del self.backbone.fc

        # create conversion layer
        self.conv = nn.Conv2d(2048, hidden_dim, 1)

        # create a default PyTorch transformer
        self.transformer = nn.Transformer(
            hidden_dim, nheads, num_encoder_layers, num_decoder_layers)

        # prediction heads, one extra class for predicting non-empty slots
        # note that in baseline DETR linear_bbox layer is 3-layer MLP
        self.linear_class = nn.Linear(hidden_dim, num_classes + 1)
        self.linear_bbox = nn.Linear(hidden_dim, 4)

        # output positional encodings (object queries)
        self.query_pos = nn.Parameter(torch.rand(100, hidden_dim))

        # spatial positional encodings
        # note that in baseline DETR we use sine positional encodings
        self.row_embed = nn.Parameter(torch.rand(50, hidden_dim // 2))
        self.col_embed = nn.Parameter(torch.rand(50, hidden_dim // 2))

    def forward(self, inputs):
        # propagate inputs through ResNet-50 up to avg-pool layer
        x = self.backbone.conv1(inputs)
        x = self.backbone.bn1(x)
        x = self.backbone.relu(x)
        x = self.backbone.maxpool(x)

        x = self.backbone.layer1(x)
        x = self.backbone.layer2(x)
        x = self.backbone.layer3(x)
        x = self.backbone.layer4(x)

        # convert from 2048 to 256 feature planes for the transformer
        h = self.conv(x)

        # construct positional encodings
        H, W = h.shape[-2:]
        pos = torch.cat([
            self.col_embed[:W].unsqueeze(0).repeat(H, 1, 1),
            self.row_embed[:H].unsqueeze(1).repeat(1, W, 1),
        ], dim=-1).flatten(0, 1).unsqueeze(1)

        # propagate through the transformer
        h = self.transformer(pos + 0.1 * h.flatten(2).permute(2, 0, 1),
                             self.query_pos.unsqueeze(1)).transpose(0, 1)

        print(self.linear_class(h).shape)
        print(self.linear_bbox(h).sigmoid().shape)

        # finally project transformer outputs to class labels and bounding boxes
        # return {'pred_logits': self.linear_class(h),
        #         'pred_boxes': self.linear_bbox(h).sigmoid()}
        return self.linear_class(h)  # <-- I modified it to return only the logits

Here is my training loop:

for x, y in data_loader:
    x = x.to(device)
    y = y.to(device)

    if train_mode:
        self.optimizer.zero_grad()

    pred = self.model(x)
    
    loss = self.criterion(pred, y) ## self.criterion = nn.MSELoss()

    if train_mode:
        print(123) # <-- printed
        loss.backward() # <-- RuntimeError: element 0 of tensors does not require grad and does not have a grad_fn
        print(456)
        self.optimizer.step() 

Your code works fine using this code snippet:

model = DETRdemo(10)
criterion = nn.MSELoss()
data = torch.randn(1, 3, 224, 224)
target = torch.randn(1, 100, 11)
out = model(data)
loss = criterion(out, target)
loss.backward()

(I’m not sure about the validity of the output shape and I just created a matching target for this example.)
Could you check if you’ve disabled gradient calculation globally, e.g. via torch.autograd.set_grad_enabled(False)?
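For context, here is a minimal sketch (toy model and shapes, not your actual code) showing how globally disabled gradient calculation produces exactly this error:

import torch
import torch.nn as nn

torch.autograd.set_grad_enabled(False)  # e.g. left over somewhere in a notebook

model = nn.Linear(2, 1)
loss = model(torch.randn(1, 2)).sum()
print(loss.requires_grad)  # False: no grad_fn was recorded
loss.backward()  # RuntimeError: element 0 of tensors does not require grad ...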


Facepalm. Yes, my problem was that torch.autograd.set_grad_enabled(False) was set in the notebook. Thank you @ptrblck.

Hello guys

I am trying to train an SE-ResNeXt model using pretrainedmodels.

I want to fine-tune only the last layer of the model, so that it has three output classes instead of 1000. I have tried many things that are suggested here, but nothing seems to work for me and I keep getting the same error:

element 0 of tensors does not require grad and does not have a grad_fn

It only goes away if I do not set param.requires_grad = False. But I need the model to keep the pretrained weights in all the layers.

Any suggestions would be very helpful thank you :grinning:

@ptrblck I don’t know what I’ve done wrong; obviously I don’t have much experience, and I’ve tried almost everything you’ve suggested in your answers. So your help is more than appreciated.

model = pretrainedmodels.__dict__["se_resnext50_32x4d"](pretrained="imagenet", num_classes=1000)

for param in model.parameters():
    param.requires_grad = False

num_ftrs = model.last_linear.in_features
model = nn.Sequential(*list(model.children())[:-1])
model.fc = nn.Linear(num_ftrs, 3)

import torch.optim as optim
acc_list = []
running_loss = 0.0

opt = optim.Adam(model.fc.parameters())
criterion = nn.CrossEntropyLoss()

train_dl = DataLoader(trainset, batch_size=64)
val_dl = DataLoader(valset, batch_size=64)
total_step = len(train_dl)
eval_accu = []
val_correct = 0
val_total = 0
val_running_loss = 0.0

for epoch in range(3):  # loop over the dataset 
    model.train(True)
    for i, (inputs,labels) in enumerate(train_dl, 0):

        opt.zero_grad()
       
        outputs = model(inputs)
        loss = criterion(outputs, torch.max(labels, 1)[1])
        loss.backward()
        opt.step()
       
        total = labels.size(0)
        _, predicted = torch.max(outputs.data, 1)
        
        correct = (predicted == labels).sum().item()
        acc_list.append(correct / total)

        # print statistics
        running_loss += loss.item()
    model.train(False)   
    with torch.no_grad():
        model.eval()
        for data in val_dl:
            images, labels = data
           
            outputs = model(images)
           
            val_loss = criterion(outputs, torch.max(labels, 1)[1])
            
            _, predicted = torch.max(outputs.data, 1)
            val_total += labels.size(0)
            
            val_correct = (predicted == labels).sum().item()
            eval_accu.append(val_correct / val_total)
            val_running_loss += val_loss.item()
            print('Epoch [{}/{}], Step [{}/{}], Loss: {:.3f}, Accuracy: {:.2f}%, Val_Accuracy: {:.2f}%'
                  .format(epoch + 1, 3, i + 1, total_step, running_loss / 64, (correct / 64) * 100, (val_correct / 64) * 100))
        
    

print('Finished Training')

My exact error is:

Traceback (most recent call last):
  File "seresnext_model.py", line 442, in <module>
    loss.backward()
  File "...", line 195, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "...", line 99, in backward
    allow_unreachable=True)  # allow_unreachable flag
RuntimeError: element 0 of tensors does not require grad and does not have a grad_fn

I don’t know how the pretrained model is implemented, but these lines of code look a bit wrong:

num_ftrs = model.last_linear.in_features
model = nn.Sequential(*list(model.children())[:-1])
model.fc = nn.Linear(num_ftrs,3)

In the first line you read model.last_linear.in_features, and you then assign a new layer using these in_features to model.fc.
Assuming that last_linear is a real layer that is used in the forward pass, this most likely means that fc is a new attribute which is never used.
If that’s the case, assign the new nn.Linear layer to model.last_linear instead and it should work.
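A minimal sketch of that fix, keeping your freezing logic but replacing the layer the model actually uses:

import torch
import torch.nn as nn
import pretrainedmodels

model = pretrainedmodels.__dict__["se_resnext50_32x4d"](pretrained="imagenet", num_classes=1000)

# freeze the pretrained backbone
for param in model.parameters():
    param.requires_grad = False

# replace the real classification head; the new layer's parameters
# require gradients by default, so loss.backward() has a path to follow
num_ftrs = model.last_linear.in_features
model.last_linear = nn.Linear(num_ftrs, 3)

# optimize only the new head
opt = torch.optim.Adam(model.last_linear.parameters())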


That did the trick, thank you very much for your help :smiley: You are a savior.

I’m getting the same error message and I can’t work out why. I’m using the code on two different computers, one with and one without CUDA. The error only happens on the one using the CPU, and everything worked just fine before I added the .to(device) calls.

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

model.train()
dataset = TensorDataset(Tensor(trainX), Tensor(trainY))
trainloader = DataLoader(dataset, batch_size=batch_size, pin_memory=True)

for e in range(epochs):
    for idx, (images, labels) in enumerate(trainloader):
        optimizer.zero_grad()
        output = model(images.to(device))
        loss = criterion(output, labels.to(device))
        loss.backward()
        optimizer.step()

The error message is RuntimeError: element 0 of tensors does not require grad and does not have a grad_fn and happens at the line loss.backward()

Usually these errors can happen if you are detaching some tensors from the computation graph, as described here. Could you check if this might be the case?
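As a toy illustration (made-up layer, not your code), any .detach(), .data access, or numpy round-trip cuts the graph in the same way:

import torch
import torch.nn as nn

model = nn.Linear(4, 2)
out = model(torch.randn(1, 4)).detach()  # the computation graph ends here
loss = out.sum()
print(loss.grad_fn)  # None -> loss.backward() raises the RuntimeError above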

Hi, I am having the same issue, but for EfficientNet, not ResNet. Can you please help me?

import torch
import torch.nn as nn
import torch.optim as optim
from efficientnet_pytorch import EfficientNet

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = EfficientNet.from_pretrained('efficientnet-b0')

for param in model.parameters():
    param.requires_grad = False

model.classifier_layer = nn.Sequential(
    nn.Linear(1280, 512),
    nn.BatchNorm1d(512),
    nn.Dropout(0.2),
    nn.Linear(512, 256),
    nn.Linear(256, 2)
)

criterion = nn.CrossEntropyLoss()

optimizer = optim.Adam(model.parameters(), lr=1e-4)
model.to(device)

epochs = 300
steps = 0
running_loss = 0
print_every = 10
train_losses, test_losses = [], []

for epoch in range(epochs):
    for inputs, labels in trainloader:
        steps += 1
        inputs, labels = inputs.to(device), labels.to(device)
        optimizer.zero_grad()

        logps = model.forward(inputs)
        loss = criterion(logps, labels)

        loss.backward()
        optimizer.step()

        running_loss += loss.item()

        if steps % print_every == 0:
            test_loss = 0
            accuracy = 0
            model.eval()
            with torch.no_grad():
                for inputs, labels in testloader:
                    inputs, labels = inputs.to(device), labels.to(device)

                    logps = model.forward(inputs)
                    batch_loss = criterion(logps, labels)
                    test_loss += batch_loss.item()

                    ps = torch.exp(logps)
                    top_p, top_class = ps.topk(1, dim=1)
                    equals = top_class == labels.view(*top_class.shape)
                    accuracy += torch.mean(equals.type(torch.FloatTensor)).item()

            train_losses.append(running_loss / len(trainloader))
            test_losses.append(test_loss / len(testloader))

            print(f"Epoch {epoch+1}/{epochs}.. "
                  f"Train loss: {running_loss/print_every:.3f}.. "
                  f"Test loss: {test_loss/len(testloader):.3f}.. "
                  f"Test accuracy: {accuracy/len(testloader):.3f}")

            running_loss = 0
            model.train()

torch.save(model, 'mymodelAA_new.pth')

Could you check if model.classifier_layer exists before you assign the nn.Sequential container to it?
If it doesn’t, note that this layer will never be used in the forward pass, and since you are freezing all other parameters of the model, you’ll encounter this error.

Thank you for your answer. Just now I changed the name model.classifier_layer to model._fc and it works. Am I fixing it in the right way? Could it be because EfficientNet uses _fc rather than classifier_layer?

This sounds reasonable. You could check it either by printing the model (print(model)), which shows all layers and should thus also show the _fc layer, or by looking at the source code of the implementation, which shows the initialization of all layers as well as their usage.
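For example, a quick check along these lines (a sketch; the attribute names depend on the package version):

from efficientnet_pytorch import EfficientNet

model = EfficientNet.from_pretrained('efficientnet-b0')
print(hasattr(model, 'classifier_layer'))  # False: the model never defines or uses it
print(model._fc)  # Linear(in_features=1280, out_features=1000, bias=True)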

Thank you, it seems like it is:

)
(_bn1): BatchNorm2d(1280, eps=0.001, momentum=0.010000000000000009, affine=True, track_running_stats=True)
(_avg_pooling): AdaptiveAvgPool2d(output_size=1)
(_dropout): Dropout(p=0.2, inplace=False)
(_fc): Linear(in_features=1280, out_features=1000, bias=True)
(_swish): MemoryEfficientSwish()
)

So yeah, now that I use _fc it works. Thank you.

Hi, sorry, I have a question about something like a confusion matrix: how can I compute one here? Is there any module in torch for that?

There might be 3rd-party libraries built on PyTorch which provide an implementation to calculate the confusion matrix, but I would just use common libraries, such as scikit-learn, and pass the predictions as well as the targets to it as numpy arrays.
E.g. take a look at sklearn.metrics.confusion_matrix to avoid “reinventing the wheel”. :wink:
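A minimal sketch with dummy tensors standing in for your predictions and targets:

import torch
from sklearn.metrics import confusion_matrix

logits = torch.randn(8, 2)           # stand-in for the model outputs
labels = torch.randint(0, 2, (8,))   # stand-in for the ground-truth classes

preds = logits.argmax(dim=1)
cm = confusion_matrix(labels.numpy(), preds.numpy())
print(cm)  # rows: true classes, columns: predicted classes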

@ptrblck
I am facing the same situation.

# Model.
model = my_model()

criterium = nn.MSELoss()

# Adam optimizer with learning rate 0.1 and L2 regularization with weight 1e-4.
optimizer = torch.optim.Adam(model.parameters(),lr=0.1, weight_decay=1e-4)
# Set gradient to 0.
optimizer.zero_grad()

# Feed forward.
pred = model(data)
pred_max = torch.max(pred)
pred_min = torch.min(pred)
pred = 255* (pred - depth_min) / (pred_max - pred_min )

# Loss calculation.
loss = criterium(pred , target)

# Gradient calculation.
loss.backward()

The run stops at loss.backward().

  1. It seems this issue is caused by the detach() inside torch.max().
     For my case, how could I find the max and min values before the loss function?
  2. Are there other functions that will cause a detach()?

Thanks

torch.max is not detaching the output values from the computation graph, only the indices.
Your code works fine using random input tensors:

# Model.
model = nn.Linear(1, 1)

criterium = nn.MSELoss()

# Adam optimizer with learning rate 0.1 and L2 regularization with weight 1e-4.
optimizer = torch.optim.Adam(model.parameters(),lr=0.1, weight_decay=1e-4)
# Set gradient to 0.
optimizer.zero_grad()

# Feed forward.
data = torch.randn(1, 1)
pred = model(data)
pred_max = torch.max(pred)
pred_min = torch.min(pred)
depth_min = 1
pred = 255* (pred - depth_min) / (pred_max - pred_min )

# Loss calculation.
target = torch.randn(1, 1)
loss = criterium(pred , target)

# Gradient calculation.
loss.backward()
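As a quick sanity check (a sketch with toy tensors), you can verify that the values returned by torch.max keep their history, while only the indices are non-differentiable:

import torch

x = torch.randn(4, requires_grad=True)
val = torch.max(x)         # the value stays attached to the graph
print(val.requires_grad)   # True
print(val.grad_fn)         # a Max backward node

vals, idx = torch.max(x.view(2, 2), dim=1)
print(vals.requires_grad)  # True
print(idx.dtype)           # torch.int64: indices carry no gradient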

@ptrblck
Thank you for your quick reply, I really appreciate it.

I tested the code on another machine with a different GPU and it works, but the 2080 GPU gives the above issue.
Could something else cause this?

Different PyTorch versions could have already fixed such issues, but using a GPU wouldn’t change the behavior of Autograd.

Are there other operations that may implicitly detach a tensor from the graph?