How to create bounding boxes on images for training

There is also a small issue I’ve missed before.
Since you are using an nn.Sequential module, you would have to create a custom Flatten module to reshape the conv output to fit the linear layer:

import torch
import torch.nn as nn

class Flatten(nn.Module):
    def __init__(self):
        super(Flatten, self).__init__()

    def forward(self, x):
        # flatten all dimensions except the batch dimension
        return x.view(x.size(0), -1)

class Net(nn.Module):
    def __init__(self, nb_classes):
        super(Net, self).__init__()
        self.base = nn.Sequential(
            nn.Conv2d(3, 6, 5),
            nn.MaxPool2d(2, 2),
            nn.Conv2d(6, 16, 5),
            Flatten(),
            nn.Linear(6*6*16, 157),
            nn.Linear(157, 84),
            nn.Linear(84, 4)
        )

        # two heads: class logits and bounding box regression
        self.out_labels = nn.Linear(4, nb_classes)
        self.out_bbox = nn.Linear(4, 4)

    def forward(self, x):
        x = self.base(x)
        x = x.view(x.size(0), -1)
        print(x.shape)
        x_labels = self.out_labels(x)
        x_bbox = self.out_bbox(x)
        return x_labels, x_bbox

net = Net(nb_classes=4)
x = torch.randn(1, 3, 24, 24)
output = net(x)
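Side note: on more recent PyTorch releases (1.2+) an nn.Flatten module is built in, so the custom Flatten above could also be replaced by it, e.g.:

self.base = nn.Sequential(
    nn.Conv2d(3, 6, 5),
    nn.MaxPool2d(2, 2),
    nn.Conv2d(6, 16, 5),
    nn.Flatten(),            # flattens everything except the batch dimension
    nn.Linear(6*6*16, 157),
    nn.Linear(157, 84),
    nn.Linear(84, 4)
)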

Thank you for your reply. I have made the adjustments you recommended; however, this is the output I am receiving:

tensor(nan, grad_fn=<NllLossBackward>)
tensor(nan, grad_fn=<MseLossBackward>)
[1,     1] loss: nan

The full code is:

class Net(nn.Module):
    def __init__(self, nb_classes):
        super(Net, self).__init__()
        self.base = nn.Sequential(
            nn.Conv2d(3, 6, 5),
            nn.ReLU(),
            nn.MaxPool2d(2, 2),
            nn.Conv2d(6, 16, 5),
            nn.ReLU(),
            Flatten(),
            nn.ReLU(),
            nn.Linear(1256000, 157),
            nn.Linear(157, 84),
            nn.Linear(84, 4)
        )

        self.out_labels = nn.Linear(4, nb_classes)
        self.out_bbox = nn.Linear(4, 4)

    def forward(self, x):
        x = self.base(x)
        print(x.shape)
        x = x.view(x.size(0), -1)
        x_labels = self.out_labels(x)
        x_bbox = self.out_bbox(x)
        return x_labels, x_bbox

net = Net(nb_classes=4)
x = torch.randn(1, 3, 512, 640)
output = net(x)
criterion_label = nn.CrossEntropyLoss()
criterion_bbox = nn.MSELoss()
optimizer = optim.SGD(net.parameters(), lr=0.001, momentum=0.9)

for epoch in range(1):

    running_loss = 0.0
    for i, data in enumerate(trainloader, 0):
        images, labels, bbox = data
        images = Variable(images)
        labels = Variable(labels)
        bbox = Variable(bbox).float()
        
        optimizer.zero_grad()
        outputs_labels, outputs_bbox = net(images)
        loss_label = criterion_label(outputs_labels, labels)
        print(loss_label)
        loss_bbox = criterion_bbox(outputs_bbox, bbox)
        print(loss_bbox)
        loss = loss_label + loss_bbox
        loss.backward()
        optimizer.step()

        running_loss += loss.item()

        print('[%d, %5d] loss: %.3f' % (epoch + 1, i + 1, running_loss ))
        running_loss = 0.0

print('Finished Training')

I managed to fix the issue. It was caused by missing values in the csv file. Thank you for your help.
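For reference, a quick way to spot such missing values before training, assuming the annotations are loaded with pandas ('annotations.csv' is just a placeholder for the actual file):

import pandas as pd

df = pd.read_csv('annotations.csv')   # placeholder file name
print(df.isna().sum())                # number of missing values per column
df = df.dropna()                      # drop (or fill) the incomplete rows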

When I run the following:

model = net
    
for epoch in range(1):
    logs = {}
    for phase in ['train', 'validation']:
        if phase == 'train':
            model.train()
        else:
            model.eval()

        running_loss = 0.0
        running_corrects = 0

        for images, labels, bbox in dataloaders[phase]:
                
            images = Variable(images)
            labels = Variable(labels)
            bbox = Variable(bbox).float()

            outputs_labels, outputs_bbox = model(images)
            loss_label = criterion_label(outputs_labels, labels)
            loss_bbox = criterion_bbox(outputs_bbox, bbox)
            loss = loss_label + loss_bbox
            
            print('[epoch: %d] label_loss: %.3f bbox_loss: %.3f loss: %.3f' % (epoch + 1, loss_label, loss_bbox, loss))
                
            if phase == 'train':
                optimizer.zero_grad()
                loss.backward()
                optimizer.step()

            running_loss += loss.detach() * images.size(0)

        epoch_loss = running_loss / len(dataloaders[phase].dataset)
            
        logs[prefix + 'log loss'] = epoch_loss.item()

I get the following output:

[epoch: 1] label_loss: 1.427 bbox_loss: 75252.742 loss: 75254.172
[epoch: 1] label_loss: 24.633 bbox_loss: 5334.305 loss: 5358.938
[epoch: 1] label_loss: 3218504960.000 bbox_loss: 2294486760540814180352.000 loss: 2294486760540814180352.000
[epoch: 1] label_loss: nan bbox_loss: nan loss: nan
[epoch: 1] label_loss: nan bbox_loss: nan loss: nan
[epoch: 1] label_loss: nan bbox_loss: nan loss: nan
[epoch: 1] label_loss: nan bbox_loss: nan loss: nan
[epoch: 1] label_loss: nan bbox_loss: nan loss: nan

I can’t seem to figure out what I may be doing wrong.

Could you check the range of your bbox targets in general and the values of bbox in the third iteration?
Maybe normalizing them for the loss calculation and denormalizing for prediction would help.
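A rough sketch of what I mean, assuming the four values are pixel coordinates (e.g. [x1, y1, x2, y2] or [x, y, w, h]) and the image width/height are known (img_w and img_h are placeholders here):

# normalize the pixel-coordinate targets to [0, 1] for the loss calculation
scale = torch.tensor([img_w, img_h, img_w, img_h], dtype=torch.float32)
loss_bbox = criterion_bbox(outputs_bbox, bbox / scale)

# denormalize the predictions to get pixel coordinates back
pred_bbox_pixels = outputs_bbox.detach() * scale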

Sorry, I’m not exactly sure what you mean by ‘range of your bbox targets in general’.

By that I mean what range the values are in, i.e. are they bound to a specific range, e.g. [0, 1], or could they be arbitrarily high or low values?

I’m not sure if this is what you meant:

        loss_bbox = criterion_bbox(output_bbox, bbox)
        print(bbox.shape)
torch.Size([1, 4])
torch.Size([1, 4])
torch.Size([1, 4])

No, I meant the values themselves:

print(bbox.min(), bbox.max())

Could you add it to your training loop and check if some values look “strange”?

These are the values that I get:

tensor(33.) tensor(213.)
tensor(34.) tensor(553.)
tensor(47.) tensor(519.)
tensor(20.) tensor(338.)
tensor(45.) tensor(454.)
tensor(21.) tensor(215.)
tensor(47.) tensor(527.)
tensor(0.) tensor(0.)
tensor(20.) tensor(213.)
tensor(22.) tensor(429.)

I don’t think I see anything strange as such. The values (I believe) correspond to the values in the csv file. Is this what you meant about normalizing the bbox values? If so, would I normalize them to match the range of the images?

Will this sample code above work for multiple bounding-boxes per image?

I use exactly the same concept for a pre-trained ResNet-50 architecture, which looks like:

import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision import models

class ResNet50(nn.Module):
    def __init__(self, num_classes=3):
        super(ResNet50, self).__init__()
        resnet = models.resnet50(pretrained=True)
        layers = list(resnet.children())[:8]
        self.features1 = nn.Sequential(*layers[:6])
        self.features2 = nn.Sequential(*layers[6:])
        # classification head
        self.classifier = nn.Sequential(nn.BatchNorm1d(2048),
                                        nn.Linear(2048, num_classes))
        # bounding box regression head
        self.bb = nn.Sequential(nn.BatchNorm1d(2048),
                                nn.Linear(2048, 4))

    def forward(self, x):
        x = self.features1(x)
        x = self.features2(x)
        x = F.relu(x)
        x = nn.AdaptiveAvgPool2d((1, 1))(x)
        x = x.view(x.shape[0], -1)
        return self.classifier(x), self.bb(x)

model = ResNet50().cuda()
parameters = filter(lambda p: p.requires_grad, model.parameters())
optimizer = torch.optim.Adam(parameters, lr=0.006)
criterion = nn.CrossEntropyLoss()

And bounding boxes in this format:

inputs, targets = next(iter(trainloader)) # batch-size=2
print(targets)

tensor([[0.0000, 1.0000, 0.6492, 0.6117, 0.0203, 0.0219],
        [0.0000, 1.0000, 0.7113, 0.4547, 0.0102, 0.0109],
        [1.0000, 0.0000, 0.6271, 0.6268, 0.0073, 0.0068],
        [1.0000, 0.0000, 0.6039, 0.6328, 0.0078, 0.0083],
        [1.0000, 0.0000, 0.4901, 0.6349, 0.0063, 0.0073],
        [1.0000, 1.0000, 0.6044, 0.6117, 0.0057, 0.0057],
        [1.0000, 0.0000, 0.6096, 0.6057, 0.0057, 0.0062]])

Column 0 - bounding box image index
Column 1 - class label {0, 1, 2}
Columns 2-5 - bounding box coordinates


for i in range(epochs):

    model.train()
    total, total_loss = 0.0, 0.0

    for xs, ys in train_dl:

        xs = xs.cuda().float()
        ys_idx = ys[:, 0].cuda()
        ys_class = ys[:, 1].cuda()
        ys_bb = ys[:, 2:].cuda().float()

        print(f"Target Classes: {ys_class}\nTarget Classes Shape: {ys_class.shape}")
        pred_class, pred_bb = model(xs)
        pred = torch.max(pred_class, 1)[1]
        print(f"Predicted Classes: {pred}\nPredicted Classes Shape: {pred.shape}")

        loss_class = criterion(pred.float(), ys_class.long())

Then at the loss function, I get an error:

Target Classes:  tensor([1., 1., 0., 0., 0., 1., 0.], device='cuda:0')
Target Classes:  torch.Size([7])
Passes classifier
Predicted Classes:  tensor([3, 1], device='cuda:0')
Predicted Classes:  torch.Size([2])
IndexError: Dimension out of range (expected to be in range of [-1, 0], but got 1)

nn.CrossEntropyLoss expects raw logits as the model outputs in the shape [batch_size, nb_classes, *]. In your current code snippet you are applying torch.max on the model output and are storing the indices in pred. This will detach the tensor from the computation graph (your model won’t be trained) and will also remove the needed nb_classes dimension, so you should most likely pass pred_class directly to criterion.
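Something like this, using the names from your loop:

pred_class, pred_bb = model(xs)

# pass the raw logits (shape [batch_size, nb_classes]) and the class indices to the criterion
loss_class = criterion(pred_class, ys_class.long())

# torch.max is only needed afterwards, e.g. to compute the accuracy
pred = torch.max(pred_class, 1)[1]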


@ptrblck Still, the error remains the same: the model predicts 1 bounding box per image, whereas the target has 7 bounding boxes (there could be any number of bboxes in an image) for a batch size of 2.

Target Classes:  tensor([1., 1., 0., 0., 0., 1., 0.], device='cuda:0')
Target Classes Shape:  torch.Size([7])

Predicted Classes:  tensor([[-0.0507, -0.4791, -0.9245],
       [ 0.0292,  0.4890,  0.9582]], device='cuda:0', grad_fn=<AddmmBackward>)
Predicted Classes Shape:  torch.Size([2, 3])

When I try to relate this to a standard YOLOv3 model, I still don’t get the idea. I wonder whether the error I am getting has something to do with the regression layer. I’d really appreciate the guidance! The only difference in the above model I can think of is that my model is making predictions per image instead of per bbox.

I’m not sure what’s exactly creating the issue and would need more information about the input and all output shapes.
Based on your previous post I assume that “Predicted Classes” refers to the output of self.classifier?
If so, then the target shape also doesn’t match, as you would be returning logits for 3 classes and for 2 samples, while the target seems to contain the class indices for 7 samples.
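For reference, for a plain multi-class classification nn.CrossEntropyLoss expects the logits and targets with matching batch sizes:

batch_size, nb_classes = 2, 3
logits = torch.randn(batch_size, nb_classes)            # model output: [N, C]
target = torch.randint(0, nb_classes, (batch_size,))    # class indices: [N]
loss = nn.CrossEntropyLoss()(logits, target)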