How to create bounding boxes on images for training

There is also a small issue I’ve missed before.
Since you are using an nn.Sequential module, you would have to create a custom Flatten module to reshape the conv output to fit the linear layer:

import torch
import torch.nn as nn

class Flatten(nn.Module):
    def __init__(self):
        super(Flatten, self).__init__()

    def forward(self, x):
        # flatten all dimensions except the batch dimension
        return x.view(x.size(0), -1)

class Net(nn.Module):
    def __init__(self, nb_classes):
        super(Net, self).__init__()
        self.base = nn.Sequential(
            nn.Conv2d(3, 6, 5),
            nn.MaxPool2d(2, 2),
            nn.Conv2d(6, 16, 5),
            Flatten(),
            nn.Linear(6*6*16, 157),
            nn.Linear(157, 84),
            nn.Linear(84, 4)
        )

        # two heads: class logits and bounding box regression
        self.out_labels = nn.Linear(4, nb_classes)
        self.out_bbox = nn.Linear(4, 4)

    def forward(self, x):
        x = self.base(x)
        x = x.view(x.size(0), -1)
        print(x.shape)
        x_labels = self.out_labels(x)
        x_bbox = self.out_bbox(x)
        return x_labels, x_bbox

net = Net(nb_classes=4)
x = torch.randn(1, 3, 24, 24)
output = net(x)
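Side note: on more recent PyTorch releases (1.2+) an nn.Flatten module is built in, so the custom Flatten above could also be replaced by it, e.g.:

self.base = nn.Sequential(
    nn.Conv2d(3, 6, 5),
    nn.MaxPool2d(2, 2),
    nn.Conv2d(6, 16, 5),
    nn.Flatten(),            # flattens everything except the batch dimension
    nn.Linear(6*6*16, 157),
    nn.Linear(157, 84),
    nn.Linear(84, 4)
)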

Thank you for your reply. I have made the adjustments you recommended; however, this is the output I am receiving:

tensor(nan, grad_fn=<NllLossBackward>)
tensor(nan, grad_fn=<MseLossBackward>)
[1,     1] loss: nan

The full code is:

class Net(nn.Module):
    def __init__(self, nb_classes):
        super(Net, self).__init__()
        self.base = nn.Sequential(
            nn.Conv2d(3, 6, 5),
            nn.ReLU(),
            nn.MaxPool2d(2, 2),
            nn.Conv2d(6, 16, 5),
            nn.ReLU(),
            Flatten(),
            nn.ReLU(),
            nn.Linear(1256000, 157),
            nn.Linear(157, 84),
            nn.Linear(84, 4)
        )

        self.out_labels = nn.Linear(4, nb_classes)
        self.out_bbox = nn.Linear(4, 4)

    def forward(self, x):
        x = self.base(x)
        print(x.shape)
        x = x.view(x.size(0), -1)
        x_labels = self.out_labels(x)
        x_bbox = self.out_bbox(x)
        return x_labels, x_bbox

net = Net(nb_classes=4)
x = torch.randn(1, 3, 512, 640)
output = net(x)
criterion_label = nn.CrossEntropyLoss()
criterion_bbox = nn.MSELoss()
optimizer = optim.SGD(net.parameters(), lr=0.001, momentum=0.9)

for epoch in range(1):

    running_loss = 0.0
    for i, data in enumerate(trainloader, 0):
        images, labels, bbox = data
        images = Variable(images)
        labels = Variable(labels)
        bbox = Variable(bbox).float()
        
        optimizer.zero_grad()
        outputs_labels, outputs_bbox = net(images)
        loss_label = criterion_label(outputs_labels, labels)
        print(loss_label)
        loss_bbox = criterion_bbox(outputs_bbox, bbox)
        print(loss_bbox)
        loss = loss_label + loss_bbox
        loss.backward()
        optimizer.step()

        running_loss += loss.item()

        print('[%d, %5d] loss: %.3f' % (epoch + 1, i + 1, running_loss ))
        running_loss = 0.0

print('Finished Training')

I managed to fix the issue. It was caused by missing values in the csv file. Thank you for your help.
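For reference, a quick way to spot such missing values before training, assuming the annotations are loaded with pandas ('annotations.csv' is just a placeholder for the actual file):

import pandas as pd

df = pd.read_csv('annotations.csv')   # placeholder file name
print(df.isna().sum())                # number of missing values per column
df = df.dropna()                      # drop (or fill) the incomplete rows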

When I run the following:

model = net
    
for epoch in range(1):
    logs = {}
    for phase in ['train', 'validation']:
        if phase == 'train':
            model.train()
        else:
            model.eval()

        running_loss = 0.0
        running_corrects = 0

        for images, labels, bbox in dataloaders[phase]:
                
            images = Variable(images)
            labels = Variable(labels)
            bbox = Variable(bbox).float()

            outputs_labels, outputs_bbox = model(images)
            loss_label = criterion_label(outputs_labels, labels)
            loss_bbox = criterion_bbox(outputs_bbox, bbox)
            loss = loss_label + loss_bbox
            
            print('[epoch: %d] label_loss: %.3f bbox_loss: %.3f loss: %.3f' % (epoch + 1, loss_label, loss_bbox, loss))
                
            if phase == 'train':
                optimizer.zero_grad()
                loss.backward()
                optimizer.step()

            running_loss += loss.detach() * images.size(0)

        epoch_loss = running_loss / len(dataloaders[phase].dataset)
            
        logs[prefix + 'log loss'] = epoch_loss.item()

I get the following output:

[epoch: 1] label_loss: 1.427 bbox_loss: 75252.742 loss: 75254.172
[epoch: 1] label_loss: 24.633 bbox_loss: 5334.305 loss: 5358.938
[epoch: 1] label_loss: 3218504960.000 bbox_loss: 2294486760540814180352.000 loss: 2294486760540814180352.000
[epoch: 1] label_loss: nan bbox_loss: nan loss: nan
[epoch: 1] label_loss: nan bbox_loss: nan loss: nan
[epoch: 1] label_loss: nan bbox_loss: nan loss: nan
[epoch: 1] label_loss: nan bbox_loss: nan loss: nan
[epoch: 1] label_loss: nan bbox_loss: nan loss: nan

I can’t seem to figure out what I may be doing wrong.

Could you check the range of your bbox targets in general and the values of bbox in the third iteration?
Maybe normalizing them for the loss calculation and denormalizing for prediction would help.
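A rough sketch of what I mean, assuming the four values are pixel coordinates (e.g. [x1, y1, x2, y2] or [x, y, w, h]) and the image width/height are known (img_w and img_h are placeholders here):

# normalize the pixel-coordinate targets to [0, 1] for the loss calculation
scale = torch.tensor([img_w, img_h, img_w, img_h], dtype=torch.float32)
loss_bbox = criterion_bbox(outputs_bbox, bbox / scale)

# denormalize the predictions to get pixel coordinates back
pred_bbox_pixels = outputs_bbox.detach() * scale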

Sorry, I’m not exactly sure what you mean by ‘range of your bbox targets in general’.

By that I mean what range the values are in, i.e. are they bound to a specific range, e.g. [0, 1], or could they be arbitrarily high or low values?

I’m not sure if this is what you meant:

        loss_bbox = criterion_bbox(output_bbox, bbox)
        print(bbox.shape)
torch.Size([1, 4])
torch.Size([1, 4])
torch.Size([1, 4])

No, I meant the values themselves:

print(bbox.min(), bbox.max())

Could you add it to your training loop and check if some values look “strange”?

These are the values that I get:

tensor(33.) tensor(213.)
tensor(34.) tensor(553.)
tensor(47.) tensor(519.)
tensor(20.) tensor(338.)
tensor(45.) tensor(454.)
tensor(21.) tensor(215.)
tensor(47.) tensor(527.)
tensor(0.) tensor(0.)
tensor(20.) tensor(213.)
tensor(22.) tensor(429.)

I don’t think I see anything strange as such. The values (I believe) correspond to the values in the csv file. Is this what you meant about normalizing the bbox values? If so, would I normalize them to match the range of the images?

Will this sample code above work for multiple bounding-boxes per image?

I use exactly the same concept for a pre-trained ResNet-50 architecture, which looks like:

import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision import models

class ResNet50(nn.Module):
    def __init__(self, num_classes=3):
        super(ResNet50, self).__init__()
        resnet = models.resnet50(pretrained=True)
        layers = list(resnet.children())[:8]
        self.features1 = nn.Sequential(*layers[:6])
        self.features2 = nn.Sequential(*layers[6:])
        # classification head
        self.classifier = nn.Sequential(nn.BatchNorm1d(2048),
                                        nn.Linear(2048, num_classes))
        # bounding box regression head
        self.bb = nn.Sequential(nn.BatchNorm1d(2048),
                                nn.Linear(2048, 4))

    def forward(self, x):
        x = self.features1(x)
        x = self.features2(x)
        x = F.relu(x)
        x = nn.AdaptiveAvgPool2d((1, 1))(x)
        x = x.view(x.shape[0], -1)
        return self.classifier(x), self.bb(x)

model = ResNet50().cuda()
parameters = filter(lambda p: p.requires_grad, model.parameters())
optimizer = torch.optim.Adam(parameters, lr=0.006)
criterion = nn.CrossEntropyLoss()

And bounding boxes in this format:

inputs, targets = next(iter(trainloader)) # batch-size=2
print(targets)

tensor([[0.0000, 1.0000, 0.6492, 0.6117, 0.0203, 0.0219],
        [0.0000, 1.0000, 0.7113, 0.4547, 0.0102, 0.0109],
        [1.0000, 0.0000, 0.6271, 0.6268, 0.0073, 0.0068],
        [1.0000, 0.0000, 0.6039, 0.6328, 0.0078, 0.0083],
        [1.0000, 0.0000, 0.4901, 0.6349, 0.0063, 0.0073],
        [1.0000, 1.0000, 0.6044, 0.6117, 0.0057, 0.0057],
        [1.0000, 0.0000, 0.6096, 0.6057, 0.0057, 0.0062]])

Column 0 - bounding box image index
Column 1 - class label {0, 1, 2}
Columns 2-5 - bounding box coordinates


for i in range(epochs):

    model.train()
    total, total_loss = 0.0, 0.0

    for xs, ys in train_dl:

        xs = xs.cuda().float()
        ys_idx = ys[:, 0].cuda()
        ys_class = ys[:, 1].cuda()
        ys_bb = ys[:, 2:].cuda().float()

        print(f"Target Classes: {ys_class}\nTarget Classes Shape: {ys_class.shape}")
        pred_class, pred_bb = model(xs)
        pred = torch.max(pred_class, 1)[1]
        print(f"Predicted Classes: {pred}\nPredicted Classes Shape: {pred.shape}")

        loss_class = criterion(pred.float(), ys_class.long())

Then at the loss function, I get an error:

Target Classes:  tensor([1., 1., 0., 0., 0., 1., 0.], device='cuda:0')
Target Classes:  torch.Size([7])
Passes classifier
Predicted Classes:  tensor([3, 1], device='cuda:0')
Predicted Classes:  torch.Size([2])
IndexError: Dimension out of range (expected to be in range of [-1, 0], but got 1)

nn.CrossEntropyLoss expects raw logits as the model outputs in the shape [batch_size, nb_classes, *]. In your current code snippet you are applying torch.max on the model output and are storing the indices in pred. This will detach the tensor from the computation graph (your model won’t be trained) and will also remove the needed nb_classes dimension, so you should most likely pass pred_class directly to criterion.
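Something like this, using the names from your loop:

pred_class, pred_bb = model(xs)

# pass the raw logits (shape [batch_size, nb_classes]) and the class indices to the criterion
loss_class = criterion(pred_class, ys_class.long())

# torch.max is only needed afterwards, e.g. to compute the accuracy
pred = torch.max(pred_class, 1)[1]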


@ptrblck Still, the error remains the same: the model predicts 1 bounding box per image, whereas the target has 7 bounding boxes (there could be any number of bboxes in an image) for a batch size of 2.

Target Classes:  tensor([1., 1., 0., 0., 0., 1., 0.], device='cuda:0')
Target Classes Shape:  torch.Size([7])

Predicted Classes:  tensor([[-0.0507, -0.4791, -0.9245],
       [ 0.0292,  0.4890,  0.9582]], device='cuda:0', grad_fn=<AddmmBackward>)
Predicted Classes Shape:  torch.Size([2, 3])

When I try to relate this to a standard YOLOv3 model, I still don’t get the idea. I wonder whether the error I am getting has something to do with the regression layer. I’d really appreciate the guidance! The only difference in the above model I can think of is that my model is making predictions per image instead of per bbox.

I’m not sure what’s exactly creating the issue and would need more information about the input and all output shapes.
Based on your previous post I assume that “Predicted Classes” refers to the output of self.classifier?
If so, then the target shape also doesn’t match, as you would be returning logits for 3 classes and for 2 samples, while the target seems to contain the class indices for 7 samples.
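For reference, for a plain multi-class classification nn.CrossEntropyLoss expects the logits and targets with matching batch sizes:

batch_size, nb_classes = 2, 3
logits = torch.randn(batch_size, nb_classes)            # model output: [N, C]
target = torch.randint(0, nb_classes, (batch_size,))    # class indices: [N]
loss = nn.CrossEntropyLoss()(logits, target)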