Absolute error stagnates in image-to-image regression task

Hi all,

This is my first post on what has been an immensely helpful forum. Any suggestions are welcome.

I am currently working on a side project to help out a PhD student at my university. The task is image-to-image regression on simple, small (28x72) spectrograms. The goal is to train a network that can accurately predict the corresponding output spectrogram for a given input. The task is supervised; I have access to roughly 90k input/output pairs, of which my current dataset uses around 18k.

I have designed several custom networks for this task: fully convolutional autoencoders, vanilla fully-connected networks, and so on. GANs/transformers are overkill given the simplicity of the images.
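For concreteness, here is a minimal sketch along the lines of the autoencoder variant (the channel counts and depth are illustrative placeholders, not my exact architecture):

```python
import torch.nn as nn

class SpectrogramAE(nn.Module):
    """Minimal fully convolutional autoencoder for 1x28x72 spectrograms."""
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, stride=2, padding=1),   # 28x72 -> 14x36
            nn.ReLU(inplace=True),
            nn.Conv2d(16, 32, kernel_size=3, stride=2, padding=1),  # 14x36 -> 7x18
            nn.ReLU(inplace=True),
        )
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(32, 16, kernel_size=4, stride=2, padding=1),  # 7x18 -> 14x36
            nn.ReLU(inplace=True),
            nn.ConvTranspose2d(16, 1, kernel_size=4, stride=2, padding=1),   # 14x36 -> 28x72
            nn.ReLU(inplace=True),  # outputs are non-negative, like the targets
        )

    def forward(self, x):
        return self.decoder(self.encoder(x))
```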

I have observed the best results so far using a combination of cosine loss, SSIM loss, Euclidean distance, and absolute error (implemented as L1 loss with reduction='sum'). All of my networks achieve “good” results, in that the loss descends smoothly and the predicted spectra are visually almost identical to the ground truths. The issue is with the magnitude of the predicted values: all of the parameters defined by the shape of the spectra are excellent (thanks to the cosine/SSIM similarity), but the parameters related to intensity/magnitude are way off (up to 10x the target value).
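For reference, my combined loss looks roughly like this (the term weights are placeholders, and pytorch_msssim stands in for whatever differentiable SSIM implementation you prefer):

```python
import torch
import torch.nn.functional as F
from pytorch_msssim import ssim  # any differentiable SSIM implementation works here

def combined_loss(pred, target, w_cos=1.0, w_ssim=1.0, w_l2=1.0, w_l1=1.0):
    """pred/target: (N, 1, 28, 72) spectrograms. Weights are placeholders."""
    p = pred.flatten(1)
    t = target.flatten(1)
    cos_loss = (1 - F.cosine_similarity(p, t, dim=1)).mean()              # shape
    ssim_loss = 1 - ssim(pred, target, data_range=1.5, size_average=True)  # structure; data_range matches the target scale
    l2_loss = torch.linalg.vector_norm(p - t, dim=1).mean()               # Euclidean distance
    l1_loss = F.l1_loss(pred, target, reduction='sum')                    # absolute error
    return w_cos * cos_loss + w_ssim * ssim_loss + w_l2 * l2_loss + w_l1 * l1_loss
```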

There are no errors in my code: I have learnt what I need on that front by trawling through the PyTorch discussions. Both the cosine and SSIM metrics reach an acceptable 96%+ within a few epochs, and the rest of training is a slow process of reducing the absolute error. My question is about strategies for reducing this error in the predicted values. Would weighting the Euclidean distance/L1 terms dynamically throughout training potentially help? I have experimented with waiting until similarity reaches a threshold value before adding these extra terms to the loss function (as sketched below), and with varying degrees of weight decay.
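Concretely, the threshold-gating I have tried looks something like this (the 0.96 threshold is a placeholder; sim_metric is the running cosine/SSIM similarity from the previous epoch):

```python
import torch
import torch.nn.functional as F

def magnitude_terms(pred, target, sim_metric, sim_threshold=0.96):
    """Return the Euclidean/L1 terms only once shape similarity is acceptable,
    so early training is driven purely by the cosine/SSIM terms."""
    if sim_metric < sim_threshold:
        return pred.new_zeros(())  # magnitude terms disabled early in training
    p, t = pred.flatten(1), target.flatten(1)
    l2 = torch.linalg.vector_norm(p - t, dim=1).mean()
    l1 = F.l1_loss(pred, target, reduction='sum')
    return l2 + l1
```

Since both similarity metrics pass the threshold within a few epochs, the gate opens early; part of my question is whether ramping these terms in gradually would behave better than switching them on at full weight.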

Across all experiments/networks, the best per-sample absolute error on my test set is around 0.25, ranging to >4 for very difficult outliers. This is unacceptable given that the target values are exceedingly small: the majority of each image is 0, with a small number of intermediate values (0.001 to 0.01) and only one or two defined peaks (reaching <1.5 in extreme cases).

TL;DR: Image regression generates predictions that are visually identical to the ground truths; what are some strategies/techniques for further reducing the absolute error in the magnitudes of the predicted values?