Detecting arbitrary quadrilateral corners

My problem is:

Given an image of some quadrilateral object, find the coordinates of the corners.

I have an idea about how I might start on this problem, but I would love some feedback, suggestions or critiques from the community.

(image not real data, but just to illustrate the problem clearly)

So I’m thinking to use an off-the-shelf vision classifier, resize the image to fit its input, and modify the output to predict 8 values, which will be interpreted as 4 (x, y) coordinate pairs.

So, somewhat inspired by YOLO regression, I’m thinking to replace the last layer of, say, mnasnet with an 8-output linear layer, pass the outputs of that layer through tanh to map them to the (-1, +1) range (a plain sigmoid only covers (0, 1), so it would need rescaling), and interpret these values as normalised coordinates wrt. the centre.

E.g. say the left-most corner in the attached image is 25% of the width of the image across from the left and 30% of the way down. Mapping this to the range (-1, +1), taking the centre as the reference point, and taking the directions “up” and “left” to be positive, the corner sits a quarter of the width left of centre, which is half of the half-width, so the x label would be 0.5. Likewise it is 20% of the height above centre, so the y label would be 0.4.

Other corners would be labelled similarly, so for the top right corner, “up” and “right” would be taken to be positive.
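To make the labelling convention concrete, here is a rough sketch of an encoder and decoder. The function names and the `SIGNS` table are hypothetical, and the sign choices for the two bottom corners are my assumption, extending the "positive points away from the centre" pattern described for the top two:

```python
# Per-slot signs applied to the centre-relative offset. Pixel y grows
# downward, so "up is positive" corresponds to a y sign of -1.
# The two bottom entries are an assumed extension of the stated pattern.
SIGNS = {
    "top_left": (-1.0, -1.0),
    "top_right": (1.0, -1.0),
    "bottom_right": (1.0, 1.0),
    "bottom_left": (-1.0, 1.0),
}


def encode_corner(x_px, y_px, width, height, slot):
    """Map a pixel coordinate to a label in [-1, +1] for the given slot."""
    # Offset from the image centre, scaled so the image edges land on -1 / +1.
    dx = (x_px - width / 2.0) / (width / 2.0)
    dy = (y_px - height / 2.0) / (height / 2.0)
    sx, sy = SIGNS[slot]
    return sx * dx, sy * dy


def decode_corner(x_label, y_label, width, height, slot):
    """Inverse of encode_corner: recover pixel coordinates from a label."""
    sx, sy = SIGNS[slot]
    x_px = width / 2.0 + sx * x_label * (width / 2.0)
    y_px = height / 2.0 + sy * y_label * (height / 2.0)
    return x_px, y_px


# The worked example: 25% across, 30% down, in a 100x100 image.
x_label, y_label = encode_corner(25, 30, 100, 100, "top_left")
```

One thing this makes visible: a corner that lands on the "wrong" side of the centre for its slot simply gets a negative label, so the encoding still round-trips.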

So then the problem becomes a simple multi-output regression problem (8 targets per image) and I can use mean squared error for the loss function (torch.nn.MSELoss).
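For reference, a tiny sketch of how that loss would be applied to a batch of 8-value predictions. The tensors here are dummy values standing in for model outputs and encoded labels, not real data:

```python
import torch
from torch import nn

criterion = nn.MSELoss()  # default reduction averages over all elements

# Dummy stand-ins: a batch of 2 predictions and 2 label vectors, 8 values each.
pred = torch.zeros(2, 8, requires_grad=True)
target = torch.full((2, 8), 0.5)

loss = criterion(pred, target)
loss.backward()  # gradients flow back to whatever produced pred
```

With the default reduction, every coordinate of every corner contributes equally to the loss.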

Now these are arbitrary quadrilaterals, so there may be “corner cases” in which none or multiple corners will be in any given quadrant of the image. In order to deterministically label each datum, I’m thinking to first assign whichever corner is closest (euclidean distance) to the top-left corner of the image as the “top left” corner, then label whichever of the remaining three corners is closest to the top right of the image as the “top right” corner and so on until all four corners are assigned.
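The greedy assignment could look roughly like this. `assign_corners` is a made-up name, and the order after "top right" (bottom right, then bottom left) is an assumption, since only the first two steps are spelled out:

```python
import math


def assign_corners(points, width, height):
    """Greedy labelling: each image corner, in a fixed order, claims the
    nearest point that has not been assigned yet."""
    targets = [
        ("top_left", (0.0, 0.0)),
        ("top_right", (float(width), 0.0)),
        ("bottom_right", (float(width), float(height))),  # assumed order
        ("bottom_left", (0.0, float(height))),            # assumed order
    ]
    remaining = list(points)
    labelled = {}
    for name, (tx, ty) in targets:
        # Euclidean distance from each unassigned point to this image corner.
        nearest = min(remaining, key=lambda p: math.hypot(p[0] - tx, p[1] - ty))
        labelled[name] = nearest
        remaining.remove(nearest)
    return labelled


# A "corner case": all four corners sit in the left half of a 100x100 image.
labels = assign_corners([(10, 10), (30, 20), (10, 40), (30, 45)], 100, 100)
```

Exact distance ties are broken by input order here, so the labelling is only deterministic if the input points arrive in a consistent order.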

Does that sound like a reasonable approach?

It also occurred to me that I could use polar coordinates (angle, distance), labelling the corners as the first, second, third and fourth encountered in a clockwise rotation about the centre. This feels a little less arbitrary, somehow. Thoughts?
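A sketch of the polar alternative, under two assumptions of mine: the rotation is about the image centre, and the sweep starts at the "straight up" direction. The name `polar_labels` is hypothetical:

```python
import math


def polar_labels(points, width, height):
    """Return (angle, distance) pairs sorted clockwise, starting from 'up'.

    Angles are in radians in [0, 2*pi); distances are in pixels.
    """
    cx, cy = width / 2.0, height / 2.0
    labels = []
    for x, y in points:
        dx, dy = x - cx, y - cy
        # Pixel y grows downward, so -dy points up. atan2(dx, -dy) is 0 for
        # straight up and increases clockwise as seen on screen.
        angle = math.atan2(dx, -dy) % (2.0 * math.pi)
        labels.append((angle, math.hypot(dx, dy)))
    return sorted(labels)


ordered = polar_labels([(20, 15), (80, 10), (85, 80), (15, 90)], 100, 100)
```

One caveat with this version: a corner sitting close to the "straight up" direction can flip between first and last under small perturbations, so the ordering has its own discontinuity.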