Predicting X Y Coordinates - Output converges to the same number and gets stuck

Hello All,
Probably a newbie mistake but I’ve run out of ideas, so hoping someone can point me in the right direction?

My goal is for the model to predict the X, Y co-ordinates of a point on an image based on features in that image. The issue is that the outputs are converging to the same co-ordinates for all images, and getting stuck there.

I have managed to overfit on the training data using both the Adam and SGD optimizers, so I know it's learning something. I have also tried different models, e.g. ResNet18, and created my own Conv2d network to simplify things, but ran into the same problem.

The current model is a ResNet34 with an AdaBound optimizer. I have about 500 images and have augmented them to create about 3500, all scaled down and greyscaled (with the typical 20% held out for validation).

model_ft = models.resnet34(pretrained=False)
num_of_channels = 1  # greyscale input
model_ft.conv1 = nn.Conv2d(num_of_channels, 64, kernel_size=3, stride=1, padding=0, bias=False)

num_ftrs = model_ft.fc.in_features  # 512 for ResNet34
model_ft.fc = nn.Linear(num_ftrs, 2)
model_ft = model_ft.to(device)  # device is 'cuda:0'

As you can see, I changed the final layer to output only two values, one for X and one for Y. The AdaBound optimizer is:

optimizer_ft = optim2.AdaBound(
    model_ft.parameters(),
    lr=1e-4,
    betas=(0.9, 0.999),
    final_lr=0.1,
    eps=1e-8,
    amsbound=False,
)

I have written my own loss function, which computes the mean Euclidean distance between the actual and predicted co-ordinates:

def loss_function(outputs, actuals):
    total = 0
    for i in range(0, len(outputs)):
        x = (outputs[i][0] - actuals[i][0])
        y = (outputs[i][1] - actuals[i][1])
        total += torch.sqrt(x**2 + y**2)
    mean = 1.0 / len(outputs) * total
    return mean
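For reference, the per-sample loop can be written as a single vectorized expression; this is a sketch of an equivalent form (the function name `loss_function_vectorized` is mine), using `torch.linalg.norm` for the per-row Euclidean distance:

```python
import torch

def loss_function_vectorized(outputs, actuals):
    # mean Euclidean distance between predicted and actual (x, y) pairs;
    # norm over dim=1 gives one distance per sample, then average the batch
    return torch.linalg.norm(outputs - actuals, dim=1).mean()

preds = torch.tensor([[3.0, 4.0], [0.0, 0.0]])
targets = torch.zeros(2, 2)
print(loss_function_vectorized(preds, targets))  # distances 5.0 and 0.0 -> mean 2.5
```

Besides being faster, the vectorized form avoids any chance of the Python loop accidentally breaking the autograd graph.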

And the main training loop is as follows:

for i in range(0, length, BATCH_SIZE):
    if phase == 'train':
        inputs = training_batch[i:i+BATCH_SIZE].view(-1, 1, scaled_h, scaled_w)
        actuals = training_label[i:i+BATCH_SIZE]
    else:
        inputs = validation_batch[i:i+BATCH_SIZE].view(-1, 1, scaled_h, scaled_w)
        actuals = validation_label[i:i+BATCH_SIZE]
    inputs = inputs.to(device)
    actuals = actuals.to(device)
    # zero the parameter gradients
    optimizer_ft.zero_grad()

    with torch.set_grad_enabled(phase == 'train'):
        outputs = model(inputs)
        loss = loss_function(outputs, actuals)
        # backward + optimize only if in training phase
        if phase == 'train':
            loss.backward()
            optimizer_ft.step()

Here is a sample output of a single batch after it has trained for about 20 epochs (as you can see, all the predictions are the same):

tensor([[153.2917, 93.2165],
        [153.2917, 93.2165],
        [153.2917, 93.2165],
        [153.2917, 93.2165],
        [153.2917, 93.2165]], device='cuda:0', grad_fn=<...>)

tensor([[138., 173.],
        [175., 125.],
        [270., 182.],
        [174.,  76.],
        [282.,  36.]], device='cuda:0')

The Loss is: tensor(64.1859, device='cuda:0', grad_fn=<...>)

Right, hopefully the above has given enough insight/clues into what I'm not understanding.

I have a sneaking feeling that there aren't enough features in the image to get the accuracy I want.



So, I just read a post here from about 4 years ago, and maybe I'm asking too much.

Maybe a better approach, as that post suggests, would be to split the image into a grid and ask the model for the probability that the point lies in each cell? It won't be as precise, but then again I suppose the final co-ordinates were always going to be an approximation anyway.
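To make the grid idea concrete, here is a minimal sketch of the target conversion it would need (the helper names, the 320×240 image size, and the 8×8 grid are all my assumptions for illustration): each (x, y) label becomes a cell index for classification, and the reverse mapping gives back the cell centre, which shows exactly how much precision is lost.

```python
def xy_to_cell(x, y, img_w, img_h, grid_size):
    # map a pixel co-ordinate to the index of the grid cell it falls in
    col = min(int(x / img_w * grid_size), grid_size - 1)
    row = min(int(y / img_h * grid_size), grid_size - 1)
    return row * grid_size + col

def cell_to_xy(cell, img_w, img_h, grid_size):
    # approximate inverse: the centre of the predicted cell
    row, col = divmod(cell, grid_size)
    x = (col + 0.5) * img_w / grid_size
    y = (row + 0.5) * img_h / grid_size
    return x, y

# e.g. the first actual label above, on a hypothetical 320x240 image, 8x8 grid
cell = xy_to_cell(138.0, 173.0, 320, 240, 8)
print(cell)                          # -> 43
print(cell_to_xy(cell, 320, 240, 8))  # -> (140.0, 165.0), the centre of that cell
```

The model's final layer would then output `grid_size * grid_size` logits trained with a standard classification loss instead of two raw co-ordinates.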

What was misleading is that my current approach started off so well; the actual predictions I was getting on test data were wrong, but they were not unreasonable, and you could see the logic.

Any thoughts would be appreciated!