How to create bounding boxes on images for training

I have created a custom dataloader for the KAIST pedestrian detection dataset. The following is the dataset:

class DataSet(Dataset):
    def __init__(self, csv_path, root_dir):
        self.to_tensor = transforms.ToTensor()
        self.data_info = pd.read_csv(csv_path)
        self.root_dir = root_dir
        self.image_arr = np.asarray(self.data_info.iloc[:, 0])
        self.label_arr = np.asarray(self.data_info.iloc[:, 18])
        self.data_len = len(self.data_info.index)

    def __getitem__(self, index):
        single_image_name = self.image_arr[index]
        img_as_img =
        img_as_tensor = self.to_tensor(img_as_img)
        single_image_label = self.label_arr[index]
        return (img_as_tensor, single_image_label)

    def __len__(self):
        return self.data_len

if __name__ == "__main__":
    # Call dataset
    trainset =  \
        DataSet(csv_path = 'train/images/annotations.csv',
               root_dir = 'train/images/')
    testset = \
        DataSet(csv_path = 'test/images/test_annotations.csv',
               root_dir = 'train/images/')

    trainloader =,
    testloader =,

classes = ('', 'person', 'cyclist','people', 'person?')

To view the image with its associated class I am using the following method:

import matplotlib.pyplot as plt
import numpy as np
import torchvision

# functions to show an image
def imshow(img):
    img = img      # unnormalize
    npimg = img.numpy()
    plt.imshow(np.transpose(npimg, (1, 2, 0)))

# get some random training images
dataiter = iter(trainloader)
images, labels = next(dataiter)

# show images
imshow(torchvision.utils.make_grid(images))
# print labels
print(' '.join('%5s' % classes[labels[j]] for j in range(4)))

What I would like to do now is to add bounding boxes on the images for training. The values for the bounding boxes are stored in the csv in the following format:
19, 220, 27, 46
66, 210, 27, 49

I have followed a few examples online, but haven’t managed to get it working. Any advice would be greatly appreciated.
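
For the visual part of the question, a minimal sketch with matplotlib could look like the following. It assumes each csv row is (x, y, width, height) in pixels with (x, y) as the top-left corner, and it uses a blank 512 x 640 placeholder instead of a real KAIST frame; draw_boxes is a hypothetical helper name, not part of any library.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch also runs without a display
import matplotlib.pyplot as plt
import matplotlib.patches as patches
import numpy as np


def draw_boxes(ax, image, boxes, color="red"):
    """Show an H x W x 3 image on `ax` and outline each (x, y, w, h) box."""
    ax.imshow(image)
    for x, y, w, h in boxes:
        # Rectangle takes an anchor point plus width and height. With imshow's
        # default origin the anchor is the top-left corner in pixel coordinates.
        rect = patches.Rectangle((x, y), w, h, fill=False,
                                 edgecolor=color, linewidth=2)
        ax.add_patch(rect)
    return ax


# Placeholder image (assumed frame size: 512 rows x 640 columns).
image = np.zeros((512, 640, 3), dtype=np.uint8)
boxes = [(19, 220, 27, 46), (66, 210, 27, 49)]  # the two rows quoted above

fig, ax = plt.subplots()
draw_boxes(ax, image, boxes)
```

In an interactive session, dropping the Agg line and calling plt.show() would display the figure. If the csv actually stores corner coordinates (x1, y1, x2, y2), the width and height would have to be computed first.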

Are all bounding boxes stored in a single csv file with the corresponding image file name?
If so, you could just load the corresponding line(s) for the current sample you are loading in __getitem__.
Could you explain a bit what you’ve tried so far and where you got stuck?

Yes, the bounding box values are stored in the same csv file. What I wanted to do (not sure if this is the right approach) is load the images, labels and bounding boxes into a dataloader for training. I tried some basic examples using matplotlib and OpenCV. What I’m stuck on is how to load the data so that the network knows it is a bounding box. Is there a basic formula or method for something like this? If my approach is wrong, please do let me know. I’m sorry if this is confusing.

These are some of the examples I was referring to:

Your approach sounds fine.
You could load a single image, label and the corresponding bounding box in the __getitem__ of your custom Dataset.
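
A rough sketch of such a Dataset is shown here. The column names (image, label, x, y, w, h) and the class name BoxDataset are assumptions made for illustration, since the real csv layout is not shown in the thread, and the demo builds a throwaway image and csv in a temp folder so that it runs on its own.

```python
import os
import tempfile

import numpy as np
import pandas as pd
import torch
from PIL import Image
from torch.utils.data import Dataset


class BoxDataset(Dataset):
    """Returns (image, label, bbox) per csv row; column names are hypothetical."""

    def __init__(self, csv_path, root_dir):
        self.data_info = pd.read_csv(csv_path)
        self.root_dir = root_dir

    def __len__(self):
        return len(self.data_info)

    def __getitem__(self, index):
        row = self.data_info.iloc[index]
        img = Image.open(os.path.join(self.root_dir, row["image"])).convert("RGB")
        # HWC uint8 -> CHW float32 in [0, 1], the layout transforms.ToTensor() produces.
        img_tensor = torch.from_numpy(np.asarray(img).copy()).permute(2, 0, 1).float() / 255.0
        label = torch.tensor(int(row["label"]), dtype=torch.long)
        bbox = torch.tensor(
            [float(row["x"]), float(row["y"]), float(row["w"]), float(row["h"])],
            dtype=torch.float32,
        )
        return img_tensor, label, bbox


# Tiny demo: one blank image and a one-row csv in a temporary folder.
tmp_dir = tempfile.mkdtemp()
Image.new("RGB", (640, 512)).save(os.path.join(tmp_dir, "frame0.png"))
pd.DataFrame(
    [{"image": "frame0.png", "label": 1, "x": 19, "y": 220, "w": 27, "h": 46}]
).to_csv(os.path.join(tmp_dir, "annotations.csv"), index=False)

dataset = BoxDataset(os.path.join(tmp_dir, "annotations.csv"), tmp_dir)
image, label, bbox = dataset[0]
```

This sketch assumes exactly one box per row. Images with a varying number of boxes would not stack into a batch with the default collate function and would need extra handling.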

I think the easiest way would be to treat this task as a regression use case, i.e. you would provide the coordinates of your bounding boxes as the labels and use a criterion like nn.MSELoss to train your model. The model’s output would be floating point numbers in the desired interval (e.g. normalized to [-1, 1]).
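
To illustrate the normalization, here is one possible way to map pixel boxes into [-1, 1] and back. It assumes the boxes are (x, y, width, height) and the images are 640 x 512; normalize_bbox and denormalize_bbox are hypothetical helper names.

```python
def normalize_bbox(bbox, img_w, img_h):
    """Map pixel (x, y, w, h) into [-1, 1]; x and w scale by width, y and h by height."""
    x, y, w, h = bbox
    return (
        2.0 * x / img_w - 1.0,
        2.0 * y / img_h - 1.0,
        2.0 * w / img_w - 1.0,
        2.0 * h / img_h - 1.0,
    )


def denormalize_bbox(bbox, img_w, img_h):
    """Inverse of normalize_bbox: map values in [-1, 1] back to pixels."""
    x, y, w, h = bbox
    return (
        (x + 1.0) * img_w / 2.0,
        (y + 1.0) * img_h / 2.0,
        (w + 1.0) * img_w / 2.0,
        (h + 1.0) * img_h / 2.0,
    )


IMG_W, IMG_H = 640, 512          # assumed image size
pixel_box = (19, 220, 27, 46)    # first row quoted in the question
norm_box = normalize_bbox(pixel_box, IMG_W, IMG_H)
restored = denormalize_bbox(norm_box, IMG_W, IMG_H)
```

Keeping the targets in a small range like this tends to make the MSE loss better behaved than regressing raw pixel values, and the inverse mapping recovers pixel coordinates from the model output at inference time.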

Thank you for your reply. I have added self.label_arr = np.asarray(self.data_info.iloc[:, 3:]) in __init__ and single_image_bbox = self.label_arr[index] in __getitem__, which now returns the images, labels and bounding boxes.

I have also implemented the nn.MSELoss criterion as you advised. So for the training, I have the following:

class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.conv1 = nn.Conv2d(3, 6, 5)
        self.pool = nn.MaxPool2d(2, 2)
        self.conv2 = nn.Conv2d(6, 16, 5)
        self.fc1 = nn.Linear(16 * 125 * 157, 157) 
        self.fc2 = nn.Linear(157, 84)
        self.fc3 = nn.Linear(84, 10)

    def forward(self, x):
        x = self.pool(F.relu(self.conv1(x)))
        x = self.pool(F.relu(self.conv2(x)))
        x = x.view(x.size(0), -1)
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = self.fc3(x) 
        return x
net = Net()
criterion = torch.nn.MSELoss()
optimizer = optim.SGD(net.parameters(), lr=0.001, momentum=0.9)

for epoch in range(1):
    losses = utils.AverageMeter()
    running_loss = 0.0
    for i, data in enumerate(trainloader, 0):
        images, labels, bbox = data
        images = Variable(images)
        labels = Variable(labels)
        bbox = Variable(bbox)
        outputs = net(images)
        labels = labels.float()
        loss = criterion(outputs, bbox)
        running_loss += loss.item()
        #print('[%d, %2d] loss: %.3f' % (epoch + 1, i + 1, running_loss))
        running_loss = 0.0

print('Finished Training')

But I receive the following error:

RuntimeError: Expected object of type torch.FloatTensor but found type torch.DoubleTensor for argument #2 'target'

Is there something that I am missing or have done improperly?

Numpy uses float64 by default if I’m not mistaken, so you should convert your target to float32 before passing it to the loss function using bbox = bbox.float().
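
A small sketch of the dtype issue, using made-up box values and a zero tensor as a stand-in for the model output:

```python
import numpy as np
import torch

# NumPy creates float64 arrays from Python floats by default.
bbox_np = np.asarray([[19.0, 220.0, 27.0, 46.0], [66.0, 210.0, 27.0, 49.0]])
bbox = torch.from_numpy(bbox_np)   # keeps the NumPy dtype: torch.float64
bbox32 = bbox.float()              # cast to torch.float32

prediction = torch.zeros(2, 4)     # stand-in for the model output (float32)
loss = torch.nn.MSELoss()(prediction, bbox32)
```

The cast could equally be done once inside the Dataset, so that every batch already arrives as float32.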

As a small side note: Variables have been deprecated since PyTorch 0.4.0, so if you are using a newer version (which is highly recommended :wink: ), you can just use tensors instead.

Thank you very much for your quick reply. That seems to have worked but now I am getting the following error:

RuntimeError: input and target shapes do not match: input [4 x 10], target [4 x 4]

Does this have something to do with my neural network?

I managed to figure this out. It was caused by the output size of my network. I fixed it by changing self.fc3 = nn.Linear(84, 10) to self.fc3 = nn.Linear(84, 4).

I have now managed to do this, but I am a bit confused. If the labels are the bounding boxes, how do I train the network to predict whether it is detecting a person, cyclist etc.? Originally I was training the network with images and labels (person, cyclist etc.), but now the labels are the bounding boxes. So how would I go about training the network on what it is detecting inside the bounding boxes? I’m sorry if this is a bit confusing.

You could use different heads in your model and the corresponding loss functions for the different targets.
I.e. your classification head could be a linear layer which outputs the class logits, while the bboxes would be provided by another linear layer on top of the model. This would also mean that your model would return more than one output tensor.

I have set the two loss functions as:

criterion_label = nn.CrossEntropyLoss()
criterion_bbox = nn.MSELoss()

The training epoch is set as follows:

for epoch in range(1):

    running_loss = 0.0
    for i, data in enumerate(trainloader, 0):
        images, labels, bbox = data
        images = Variable(images)
        labels = Variable(labels)
        bbox = Variable(bbox).float()
        outputs = net(images)
        loss_label= criterion_label(outputs, labels)
        loss_bbox = criterion_bbox(outputs, bbox)
        loss = loss_label + loss_bbox
        running_loss += loss.item()
        print('[%d, %2d] loss: %.3f' % (epoch + 1, i + 1, running_loss))
        running_loss = 0.0

However, this is the output that I get:

[1,  1] loss: nan
[1,  2] loss: nan
[1,  3] loss: nan
[1,  4] loss: nan
[1,  5] loss: nan
[1,  6] loss: nan
[1,  7] loss: nan
[1,  8] loss: nan
[1,  9] loss: nan
[1, 10] loss: nan

I’m not sure if this is what you meant.

The idea of using different criteria is correct. However, your model should also return separate outputs for the classification and regression tasks:

output_label, output_bbox = net(images)
loss_label = criterion_label(output_label, labels)
loss_bbox = criterion_bbox(output_bbox, bbox)
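
Put together, one full training step with both losses might look roughly like this. TwoHeadNet is a hypothetical toy model, the batch is random data standing in for the DataLoader output, and the adaptive pooling layer is only there to keep the sketch small.

```python
import torch
import torch.nn as nn
import torch.optim as optim

torch.manual_seed(0)


class TwoHeadNet(nn.Module):
    """Hypothetical toy model: shared features, one linear head per task."""

    def __init__(self, nb_classes):
        super().__init__()
        self.base = nn.Sequential(
            nn.Conv2d(3, 8, 3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(4),  # fixed 4x4 map regardless of input size
            nn.Flatten(),
        )
        self.out_labels = nn.Linear(8 * 4 * 4, nb_classes)
        self.out_bbox = nn.Linear(8 * 4 * 4, 4)

    def forward(self, x):
        features = self.base(x)
        return self.out_labels(features), self.out_bbox(features)


net = TwoHeadNet(nb_classes=5)
criterion_label = nn.CrossEntropyLoss()
criterion_bbox = nn.MSELoss()
optimizer = optim.SGD(net.parameters(), lr=0.001, momentum=0.9)

# Random stand-ins for one batch from the DataLoader.
images = torch.randn(4, 3, 32, 32)
labels = torch.randint(0, 5, (4,))   # int64 class indices
bbox = torch.rand(4, 4) * 2 - 1      # float32 boxes already scaled to [-1, 1]

optimizer.zero_grad()
output_label, output_bbox = net(images)
loss_label = criterion_label(output_label, labels)
loss_bbox = criterion_bbox(output_bbox, bbox)
loss = loss_label + loss_bbox        # simple unweighted sum of both losses
loss.backward()
optimizer.step()
```

The unweighted sum is just the simplest choice. If one loss dominates, a weighting factor on either term is a common adjustment.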

I have just made the updates and I get the following error:

ValueError                                Traceback (most recent call last)
<ipython-input-25-a174178ab267> in <module>
      8         bbox = Variable(bbox).float()
      9         optimizer.zero_grad()
---> 10         output_label, output_bbox = net(images)
     11         loss_label = criterion_label(output_label, labels)
     12         loss_bbox = criterion_bbox(output_bbox, bbox)

ValueError: too many values to unpack (expected 2)

You would have to adapt the model as described in my last post:

Here is a small dummy example:

import torch
import torch.nn as nn

class MyModel(nn.Module):
    def __init__(self, nb_classes):
        super(MyModel, self).__init__()
        # Two conv blocks; each 2x2 max-pool halves the spatial size (24 -> 12 -> 6)
        self.base = nn.Sequential(
            nn.Conv2d(3, 6, 3, 1, 1),
            nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(6, 12, 3, 1, 1),
            nn.ReLU(),
            nn.MaxPool2d(2)
        )
        self.out_labels = nn.Linear(12*6*6, nb_classes)
        self.out_bbox = nn.Linear(12*6*6, 4)

    def forward(self, x):
        x = self.base(x)
        x = x.view(x.size(0), -1)  # flatten to [batch_size, 12*6*6]
        x_labels = self.out_labels(x)
        x_bbox = self.out_bbox(x)
        return x_labels, x_bbox

model = MyModel(nb_classes=10)
x = torch.randn(1, 3, 24, 24)
output_labels, output_bbox = model(x)

Sorry, would this value be the images from the dataloader?

Yes, I just used some random input to demonstrate the code usage.
You should of course use your image tensors with probably a different shape.

This is the model that I have created based on your example:

class Net(nn.Module):
    def __init__(self, nb_classes):
        super(Net, self).__init__()
        self.base = nn.Sequential(
            nn.Conv2d(3, 6, 5),
            nn.MaxPool2d(2, 2),
            nn.Conv2d(6, 16, 5),
            nn.Linear(16 * 125 * 157, 157),
            nn.Linear(157, 84),
            nn.Linear(84, 4)
        )
        self.out_labels = nn.Linear(3*512*640, nb_classes)
        self.out_bbox = nn.Linear(3*512*640, 4)

    def forward(self, x):
        x = self.base(x)
        x = x.view(x.size(0), -1) 
        x_labels = self.out_labels(x)
        x_bbox = self.out_bbox(x)
        return x_labels, x_bbox

net = Net(nb_classes=4)
x = ()
output_labels, output_bbox = net(x)

However, I am getting the following error:

TypeError: conv2d(): argument 'input' (position 1) must be Tensor, not tuple

There are some issues in your code:

  • self.base doesn’t have any non-linearities, so you should add e.g. nn.ReLU() between the layers
  • self.base outputs an activation of shape [batch_size, 4], so out_labels and out_bbox should have in_features=4
  • x is defined as an empty tuple. Initialize it as a random tensor (as shown in my example code) or use your data tensors instead.
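
One way these three points could be applied is sketched below. It deviates from the posted model in two places, both of which are assumptions rather than part of the advice: an nn.AdaptiveAvgPool2d layer shrinks the feature map so the first linear layer stays small, and the final nn.Linear(84, 4) is dropped from the base so that both heads read the 84 features directly.

```python
import torch
import torch.nn as nn


class Net(nn.Module):
    def __init__(self, nb_classes):
        super().__init__()
        self.base = nn.Sequential(
            nn.Conv2d(3, 6, 5),
            nn.ReLU(),                       # non-linearity between layers
            nn.MaxPool2d(2, 2),
            nn.Conv2d(6, 16, 5),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d((8, 8)),    # assumption: fixed 8x8 feature map
            nn.Flatten(),                    # [batch, 16, 8, 8] -> [batch, 1024]
            nn.Linear(16 * 8 * 8, 157),
            nn.ReLU(),
            nn.Linear(157, 84),
            nn.ReLU(),
        )
        # in_features of both heads must match the width of the base output.
        self.out_labels = nn.Linear(84, nb_classes)
        self.out_bbox = nn.Linear(84, 4)

    def forward(self, x):
        features = self.base(x)
        return self.out_labels(features), self.out_bbox(features)


net = Net(nb_classes=4)
x = torch.randn(2, 3, 512, 640)  # random stand-in for a batch of images
output_labels, output_bbox = net(x)
```

Because the flattening happens inside the base via nn.Flatten, the forward method no longer needs the x.view call.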

Thank you so very much. I seem to have got it working. I am testing it using a sample of my full dataset, so the loss is looking quite high. I will have to test it using the full dataset to make sure that it does indeed work as intended. Again, thank you for all your assistance. I really appreciate it. :grinning:

I know you helped me solve this problem some time ago, but I have only managed to get back to it recently and I am not getting the results that I was expecting. So for:

  • self.base outputs an activation of shape [batch_size, 4], so out_labels and out_bbox should have in_features=4

For this point do you mean that I should change to:

    self.out_labels = nn.Linear(4, nb_classes)
    self.out_bbox = nn.Linear(4, 4)

Because when I do that, I get the following error:

RuntimeError: size mismatch, m1: [4 x 314000], m2: [4 x 4] at /pytorch/aten/src/TH/generic/THTensorMath.cpp:940
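
The m1 shape in this message says that the tensor reaching the heads has 314000 features per sample, which equals 16 * 125 * 157. That would be consistent with the heads receiving the flattened convolution output rather than a 4-feature activation, although the current model is not shown, so this is a guess. The sketch below traces the shapes for a 512 x 640 input, assuming two conv + pool blocks as in the first Net; conv_stack is a hypothetical name.

```python
import torch
import torch.nn as nn

# Assumed conv part of the model: two 5x5 convs, each followed by 2x2 pooling.
conv_stack = nn.Sequential(
    nn.Conv2d(3, 6, 5), nn.ReLU(), nn.MaxPool2d(2, 2),
    nn.Conv2d(6, 16, 5), nn.ReLU(), nn.MaxPool2d(2, 2),
)

# Push a dummy image through once to read off the feature map shape.
with torch.no_grad():
    feature_map = conv_stack(torch.zeros(1, 3, 512, 640))

flat = feature_map.flatten(1)      # [batch, channels * height * width]
n_features = flat.size(1)

# Heads whose in_features match what actually arrives after flattening.
out_labels = nn.Linear(n_features, 4)
out_bbox = nn.Linear(n_features, 4)
x_labels = out_labels(flat)
x_bbox = out_bbox(flat)
```

Under that assumption, either the heads take in_features=314000, or the intermediate linear layers are applied after the flatten step so that the heads really do see a small activation.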