I am fairly new to PyTorch and so far only have experience training simple neural network architectures. I have been wondering whether it is possible to implement affine image transformations (rotation, scaling, translation, etc.) and use the MSE for image registration, ultimately estimating the transformation matrix via gradient descent. There are two parts to this that I am not sure how to approach.

The transformation matrix is a pixel-wise operation. Is there a good way to express it as a matrix multiplication for an entire image?

Why do you say that the transformation matrix is a pixel-wise operation? If there’s an affine transformation on the whole image, it has the form x' = M*x + b, where x is a pixel coordinate and the same M and b are used for the whole image. Each pixel will have a different result, but that’s only because we feed different inputs…
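To make the x' = M*x + b point concrete, here is a minimal sketch that applies one M and one b to every pixel coordinate of an image with a single matrix multiplication. The image size and the values of M and b are made up for illustration.

```python
import torch

H, W = 4, 5  # small hypothetical image size

# Pixel coordinates of the whole image, one (x, y) pair per column: shape (2, H*W).
ys, xs = torch.meshgrid(
    torch.arange(H, dtype=torch.float32),
    torch.arange(W, dtype=torch.float32),
    indexing="ij",
)
coords = torch.stack([xs.flatten(), ys.flatten()])  # (2, H*W)

# One M and one b for the entire image (here: scale by 2, then shift by (1, -1)).
M = torch.tensor([[2.0, 0.0],
                  [0.0, 2.0]])
b = torch.tensor([[1.0],
                  [-1.0]])

# x' = M x + b for all pixels at once; b broadcasts across the columns.
new_coords = M @ coords + b  # (2, H*W)
```

Each column of `new_coords` differs only because each column of `coords` differs; M and b are shared.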

As I understand your problem, you want to predict the affine transformation that was applied to an original image so that you can undo it. Training is easy enough: take some input images, apply random transforms to them, then predict the components of each transform (specifically the M and b from above). However, I don’t know if it makes much sense in general. How do you know which image is the original? Why should the rotated version, for example, be a transformation of the original and not the original itself?
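A rough sketch of how such training pairs could be generated, assuming PyTorch's `F.affine_grid` and `F.grid_sample` for the warping. The helper name `random_affine_batch` and the parameter ranges are hypothetical choices, not something from this thread.

```python
import torch
import torch.nn.functional as F

def random_affine_batch(images, max_angle=0.3, max_shift=0.2, max_log_scale=0.2):
    """Warp each image with a random affine transform; return (warped, theta).

    The ranges are arbitrary illustrative values. Angles are in radians and
    shifts are in the normalized [-1, 1] coordinates used by affine_grid.
    """
    n = images.shape[0]
    angle = (torch.rand(n) * 2 - 1) * max_angle
    scale = torch.exp((torch.rand(n) * 2 - 1) * max_log_scale)
    shift = (torch.rand(n, 2) * 2 - 1) * max_shift
    cos, sin = torch.cos(angle) * scale, torch.sin(angle) * scale
    # theta[i] = [M | b], the 2x3 regression target for image i.
    theta = torch.stack([
        torch.stack([cos, -sin, shift[:, 0]], dim=1),
        torch.stack([sin, cos, shift[:, 1]], dim=1),
    ], dim=1)  # (N, 2, 3)
    grid = F.affine_grid(theta, images.shape, align_corners=False)
    warped = F.grid_sample(images, grid, align_corners=False)
    return warped, theta

images = torch.rand(8, 1, 32, 32)  # placeholder batch, (N, C, H, W)
warped, theta = random_affine_batch(images)
```

A network would then take `warped` as input and regress `theta`, for example with an MSE loss on the six entries.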

One use case I can think of is that you have affine distortions or deformations (as described in the Matlab page for affine transforms) for a specific lens / camera setup and your dataset is only images taken with that lens / camera setup. But then you only have to do calibration once and apply the resulting correction on every image.

I am quite new to image processing as well, so please do correct me if I am wrong. In case of affine transformations, aren’t the matrices for rotation/scaling/etc 3 * 3? Where each row corresponds to a dot product operation on the x-coordinate, y-coordinate and the intensity respectively. If my image were say 1000 * 1000 * 1 (and mind you, these are just the intensities, excluding the x, y coordinates), how would I go about multiplying a 3*3 matrix with it? Fairly basic from what you say, but I am probably missing something very obvious here…
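For what it's worth, the usual convention is that the 3 * 3 matrix acts on homogeneous pixel coordinates (x, y, 1), not on the intensities. The third row is just (0, 0, 1), which lets the translation be folded into the same matrix. The image is then resampled at the transformed coordinates. Below is a minimal sketch, assuming `F.affine_grid` and `F.grid_sample`; the helper `affine_matrix` is hypothetical.

```python
import math
import torch
import torch.nn.functional as F

def affine_matrix(angle_rad, scale, tx, ty):
    """3x3 homogeneous matrix: rotate and scale, then translate (hypothetical helper)."""
    c, s = math.cos(angle_rad) * scale, math.sin(angle_rad) * scale
    return torch.tensor([[c, -s, tx],
                         [s, c, ty],
                         [0.0, 0.0, 1.0]])

# The matrix acts on coordinates (x, y, 1), never on intensities.
A = affine_matrix(math.pi / 2, 1.0, 0.0, 0.0)  # 90 degree rotation
p = torch.tensor([1.0, 0.0, 1.0])              # the point (1, 0)
rotated = A @ p                                # approximately (0, 1, 1)

# To warp an image, transform the sampling grid and interpolate the intensities.
image = torch.rand(1, 1, 1000, 1000)           # (N, C, H, W), intensities only
theta = A[:2].unsqueeze(0)                     # top two rows [M | b]: (N, 2, 3)
grid = F.affine_grid(theta, image.shape, align_corners=False)  # (N, H, W, 2)
warped = F.grid_sample(image, grid, align_corners=False)       # (N, C, H, W)
```

Two caveats: `affine_grid` works in normalized coordinates in [-1, 1] rather than pixel indices, and `theta` maps each output location to the input location it samples from, so the image content appears to move by the inverse transform.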

As for the use case, I think this is a fairly common problem in medical image registration. You have multiple snapshots of the patient at different times, across different modalities. The patient could have moved a bit and the camera setup could be different, but essentially it’s an image of the same part of the patient.
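For this pairwise setting, here is a toy sketch of registration by gradient descent on the MSE, with no network involved: the six affine parameters are optimized directly. It assumes a synthetic Gaussian blob as the image and a made-up ground-truth transform. Real multi-modal data would likely need a different similarity measure than MSE, plus extras such as multi-resolution schemes.

```python
import torch
import torch.nn.functional as F

def warp(image, theta):
    """Resample image (N, C, H, W) with a 2x3 affine theta in normalized coordinates."""
    grid = F.affine_grid(theta, image.shape, align_corners=False)
    return F.grid_sample(image, grid, align_corners=False)

# Synthetic, smooth "fixed" image: a Gaussian blob (stand-in for real data).
H = W = 64
ys, xs = torch.meshgrid(
    torch.linspace(-1, 1, H), torch.linspace(-1, 1, W), indexing="ij"
)
fixed = torch.exp(-((xs - 0.1) ** 2 + (ys + 0.2) ** 2) / 0.1)[None, None]

# "Moving" image: the fixed image under a known, made-up affine transform.
true_theta = torch.tensor([[[1.0, 0.0, 0.15],
                            [0.0, 1.0, -0.10]]])
moving = warp(fixed, true_theta)

# Learnable transform, initialised to the identity.
theta = torch.tensor([[[1.0, 0.0, 0.0],
                       [0.0, 1.0, 0.0]]], requires_grad=True)
optimizer = torch.optim.Adam([theta], lr=0.01)

losses = []
for step in range(300):
    optimizer.zero_grad()
    loss = F.mse_loss(warp(moving, theta), fixed)  # compare warped moving to fixed
    loss.backward()                                # gradients flow through grid_sample
    optimizer.step()
    losses.append(loss.item())

print(f"initial MSE {losses[0]:.5f}, final MSE {losses[-1]:.5f}")
```

The smooth blob matters here: MSE gradients are only informative where the two images overlap, so sharp or far-apart structures tend to need blurring or a coarse-to-fine schedule first.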

I think you would simply work on the original problem (cancer classification or something else), and treat the variability (rotation, scaling, translation) as data augmentation. If you want to normalize the images with respect to that variability, so that every image has the same orientation, scale and position, then that is another problem! Is that what you want to achieve?