My model doesn't converge, and I don't know if I am calculating the loss correctly

I am trying to build a CNN model that, given an input image, outputs a set of curves (a list of lists of points).
For this task, I took inspiration from Single Shot Detection (SSD) networks. My model outputs a 32x32 grid, where each grid cell predicts whether or not there is a point in that cell.
Each cell outputs 3 scores and 3 points. The scores indicate, respectively, whether there is a point in that cell, whether this point has a previous neighbor, and whether this point has a next neighbor. The 3 points are the location of the point that lies in this cell, the location of its previous neighbor, and the location of its next neighbor.
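
To make this encoding concrete, here is a minimal pure-Python sketch of how one curve could be turned into per-cell targets. It assumes point coordinates are normalized to [0, 1]; the helper names `point_to_cell` and `encode_curve` are hypothetical and only illustrate the idea, they are not part of my actual code.

```python
import math

GRID_SIZE = 32  # cells per side, as described above


def point_to_cell(a, b, grid_size=GRID_SIZE):
    """Map a point with coordinates in [0, 1] to a flat cell index."""
    # Clamp so that a coordinate of exactly 1.0 falls in the last cell.
    ca = min(int(math.floor(a * grid_size)), grid_size - 1)
    cb = min(int(math.floor(b * grid_size)), grid_size - 1)
    return ca * grid_size + cb


def encode_curve(curve, grid_size=GRID_SIZE):
    """Return {cell: (scores, points)} for one curve given as a list of (a, b) pairs."""
    targets = {}
    for k, point in enumerate(curve):
        prev_pt = curve[k - 1] if k > 0 else None
        next_pt = curve[k + 1] if k + 1 < len(curve) else None
        # Scores: point present, has previous neighbor, has next neighbor.
        scores = (1, int(prev_pt is not None), int(next_pt is not None))
        targets[point_to_cell(point[0], point[1], grid_size)] = (
            scores,
            (point, prev_pt, next_pt),
        )
    return targets


curve = [(0.0, 0.0), (0.5, 0.5), (1.0, 1.0)]
encoded = encode_curve(curve)
```

The first point of a curve gets scores (1, 0, 1) and the last gets (1, 1, 0), which is the pattern the three score channels are meant to capture.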

Based on this SSD tutorial “”, I used a pre-trained VGG16 architecture and added 2 convolutional layers at the end to make my predictions.
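
As a rough illustration of what such a prediction head could look like, here is a sketch under stated assumptions: the class name `CurveHead`, the hidden width of 256, and the kernel sizes are all hypothetical, and a random tensor stands in for the 512-channel VGG16 feature map. It only shows how 9 output channels per cell can be split into the score and location tensors with the shapes my loss expects.

```python
import torch
import torch.nn as nn


class CurveHead(nn.Module):
    """Hypothetical head: two conv layers on top of backbone feature maps."""

    def __init__(self, in_channels=512, hidden_channels=256):
        super().__init__()
        self.conv1 = nn.Conv2d(in_channels, hidden_channels, kernel_size=3, padding=1)
        # 9 output channels per cell: 3 scores + 3 points * 2 coordinates.
        self.conv2 = nn.Conv2d(hidden_channels, 9, kernel_size=1)

    def forward(self, features):
        n, _, h, w = features.shape
        out = self.conv2(torch.relu(self.conv1(features)))  # (N, 9, H, W)
        scores = out[:, :3]                                  # (N, 3, H, W), raw logits
        locs = out[:, 3:].reshape(n, 3, 2, h, w)             # (N, 3, 2, H, W)
        locs = locs.permute(0, 1, 3, 4, 2).contiguous()      # (N, 3, H, W, 2)
        return locs, scores


# Random stand-in for a batch of 2 backbone feature maps on a 32x32 grid.
features = torch.randn(2, 512, 32, 32)
locs, scores = CurveHead()(features)
print(locs.shape, scores.shape)
```

The scores are left as raw logits here because BCEWithLogitsLoss applies the sigmoid itself.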

For the loss calculation, I am using L1Loss for the points’ locations, and BCEWithLogitsLoss for the scores.
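
The combination described above can be sketched on dummy tensors as follows. This is a simplified illustration, not my full loss: the tensors are random, a single cell is marked as positive by hand, and the guard for the case with no positive cells is an assumption added for the sketch (L1Loss with mean reduction over an empty selection would otherwise give NaN).

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

n, grid = 2, 32
pred_scores = torch.randn(n, 3, grid, grid)   # raw logits
pred_locs = torch.rand(n, 3, grid, grid, 2)   # predicted coordinates
true_scores = torch.zeros(n, 3, grid, grid)
true_locs = torch.zeros(n, 3, grid, grid, 2)

# Mark one positive cell by hand: the point (0.14, 0.23) lies in cell (4, 7).
true_scores[0, 0, 4, 7] = 1.0
true_locs[0, 0, 4, 7] = torch.tensor([0.14, 0.23])

bce = nn.BCEWithLogitsLoss(reduction="mean")
l1 = nn.L1Loss(reduction="mean")

# The score loss is averaged over every cell, positive or not.
score_loss = bce(pred_scores, true_scores)

# The location loss is only computed where a target point exists.
positive = true_scores == 1                   # (N, 3, 32, 32) boolean mask
if positive.any():
    loc_loss = l1(pred_locs[positive], true_locs[positive])
else:
    loc_loss = pred_locs.sum() * 0.0          # keeps the graph, contributes nothing

total_loss = score_loss + loc_loss
print(score_loss.item(), loc_loss.item(), total_loss.item())
```

Printing the two terms separately, as in the last line, makes it easier to see which of them dominates the total.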

Sadly, my model isn't converging, and I am afraid that I may have implemented something wrong. I am fairly new to PyTorch and deep learning, so I don't even know if this model makes sense. I am training for 200 epochs, and the loss has been stuck at around 0.65 since epoch 20.

Sorry for the long question, and sorry if my explanation is confusing.

Here is the code for my loss calculation:

import torch
import torch.nn as nn


class MyLoss(nn.Module):
    def __init__(self, grid_size=32):
        super(MyLoss, self).__init__()
        #  grid_size => The size of the grid, in cells
        self.grid_size = grid_size
        self.cell_size = 1.0/self.grid_size

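        # BCEWithLogitsLoss expects raw logits (no sigmoid applied beforehand)
        self.logist_bce = nn.BCEWithLogitsLoss(reduction="mean")
        # Note: despite the attribute name, this is plain L1 loss, not nn.SmoothL1Loss
        self.smooth_l1 = nn.L1Loss(reduction="mean")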

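    def forward(self, predicted_locs, predicted_scores, targets):
        """
        Forward propagation.

        :param predicted_locs: a tensor of dimensions (N, 3, 32, 32, 2)
        :param predicted_scores: a tensor of dimensions (N, 3, 32, 32), raw logits
        :param targets: for each image, a list of curves; each curve is a tensor of
            dimensions (n_points, 2), with coordinates assumed to lie in [0, 1]
        :return: scalar loss (score loss + location loss)
        """
        batch_size = predicted_locs.size(0)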

        assert self.grid_size == predicted_locs.size(2) == predicted_scores.size(2)
        assert self.grid_size == predicted_locs.size(3) == predicted_scores.size(3)

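        # Build the targets on the same device as the predictions
        device = predicted_locs.device
        true_locs = torch.zeros((batch_size, 3, self.grid_size, self.grid_size, 2),
                                dtype=torch.float, device=device)  # (N, 3, 32, 32, 2)
        true_scores = torch.zeros((batch_size, 3, self.grid_size, self.grid_size),
                                  dtype=torch.float, device=device)  # (N, 3, 32, 32)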

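        # torch.set_printoptions(threshold=5000)
        # For each image
        for i in range(batch_size):
            # For each curve in the image; row is an (n_points, 2) tensor
            for row in targets[i]:
                cells = torch.floor(row/self.cell_size).type(torch.long)  # (n_points, 2)
                # Clamp so that a coordinate of exactly 1.0 stays inside the grid
                cells = cells.clamp(0, self.grid_size - 1)
                cells = cells[:, 0]*self.grid_size + cells[:, 1]  # (n_points), flat cell index
                # Channel 0: the point that lies in the cell
                true_scores[i][0].view(self.grid_size**2)[cells] = 1
                true_locs[i][0].view(self.grid_size**2, 2)[cells] = row
                # Channel 1: previous neighbor (every point except the first has one)
                true_scores[i][1].view(self.grid_size**2)[cells[1:]] = 1
                true_locs[i][1].view(self.grid_size**2, 2)[cells[1:]] = row[0:-1]
                # Channel 2: next neighbor (every point except the last has one)
                true_scores[i][2].view(self.grid_size**2)[cells[0:-1]] = 1
                true_locs[i][2].view(self.grid_size**2, 2)[cells[0:-1]] = row[1:]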

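        # Location loss only over cells that contain a target; score loss over all cells
        positive = true_scores == 1  # (N, 3, 32, 32) boolean mask
        loc_loss = self.smooth_l1(predicted_locs[positive], true_locs[positive])
        score_loss = self.logist_bce(predicted_scores, true_scores)
        return score_loss + loc_loss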