Spearman correlation very low for one output channel and high for the other


I am trying to use ViT as a regressor for my time series data.
My input data has shape N * 189 * 150 * 3 [N = batch size, 189 = different signal channels, 150 = frame size, 3 = xyz coordinates]. I created an image representation of shape N * 224 * 224 * 3 by repeating the channels and frames horizontally and vertically, respectively.
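
For concreteness, the tiling step might look like the sketch below. The exact repetition scheme is not stated, so wrap-around indexing along each axis is an assumption, and `tile_to_image` is a hypothetical helper name.

```python
import numpy as np

def tile_to_image(x, size=224):
    """Repeat a (N, C, T, 3) array along C and T until both axes reach `size`.

    Assumption: repetition wraps around, so index i maps to i % C (or i % T).
    """
    n, c, t, d = x.shape
    ch_idx = np.arange(size) % c   # source signal channel for each output row
    fr_idx = np.arange(size) % t   # source frame for each output column
    return x[:, ch_idx][:, :, fr_idx]

# Synthetic input with the shapes from the question.
x = np.random.default_rng(0).normal(size=(4, 189, 150, 3)).astype(np.float32)
img = tile_to_image(x)
print(img.shape)  # (4, 224, 224, 3)
```

With this scheme, rows 189 to 223 are copies of signal channels 0 to 34, and columns 150 to 223 are copies of frames 0 to 73.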

I put a fully connected (FC) layer on top of the ViT to output N * 900 features, which I reshaped to match the output time-series shape N * 2 * 150 * 3 (2 * 150 * 3 = 900).
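
A small sketch of how the 900 values map onto the output shape, assuming a plain row-major reshape (NumPy is used here as a stand-in; PyTorch's `reshape`/`view` use the same ordering). The variable names are illustrative.

```python
import numpy as np

N, OUT_CH, FRAMES, XYZ = 8, 2, 150, 3          # 2 * 150 * 3 = 900

# Stand-in for the FC output: one row of 900 values per sample.
flat = np.arange(N * OUT_CH * FRAMES * XYZ, dtype=np.float32).reshape(N, -1)

# Row-major reshape: the last axis (xyz) varies fastest.
pred = flat.reshape(N, OUT_CH, FRAMES, XYZ)

# Element (channel c, frame t, coordinate d) comes from flat column
# c * FRAMES * XYZ + t * XYZ + d, so columns 0..449 belong to channel 0
# and columns 450..899 to channel 1.
c, t, d = 1, 2, 1
assert pred[0, c, t, d] == flat[0, c * FRAMES * XYZ + t * XYZ + d]
print(pred.shape)  # (8, 2, 150, 3)
```

The target matrix has to follow the same axis order; otherwise each FC output is trained against the wrong element.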

I am fine-tuning only the last encoder layer (layer 11) and the FC layer. I normalized the input as ViT requires, but did not normalize the target matrix (the output I am trying to regress).
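
The freezing setup can be sketched as follows. `TinyEncoder` is a hypothetical stand-in for the ViT (12 blocks plus an FC head), and the parameter prefixes `blocks.11.` and `fc.` are assumptions; real ViT implementations name their modules differently, so the prefixes would need adjusting.

```python
import torch
from torch import nn

class TinyEncoder(nn.Module):
    """Hypothetical stand-in for a ViT: 12 encoder blocks and a regression head."""
    def __init__(self, dim=16, depth=12, out_dim=900):
        super().__init__()
        self.blocks = nn.ModuleList([nn.Linear(dim, dim) for _ in range(depth)])
        self.fc = nn.Linear(dim, out_dim)

    def forward(self, x):
        for blk in self.blocks:
            x = torch.relu(blk(x))
        return self.fc(x)

def freeze_all_but(model, trainable_prefixes):
    """Enable gradients only for parameters whose name starts with a given prefix."""
    for name, param in model.named_parameters():
        param.requires_grad = name.startswith(trainable_prefixes)

model = TinyEncoder()
# The trailing dot keeps "blocks.1." from also matching "blocks.11.".
freeze_all_but(model, ("blocks.11.", "fc."))

trainable = [n for n, p in model.named_parameters() if p.requires_grad]
print(trainable)
```

Printing the trainable parameter names is a quick way to confirm that only the intended layers receive gradients.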

I trained for 100 epochs, decreasing the learning rate every 10 epochs. Base learning rate 0.01 with the Adam optimizer. Loss: L2 + 0.001 * L1.
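
A minimal PyTorch sketch of this training configuration. Several details are assumptions: "L2" is read as mean squared error and "L1" as mean absolute error on the predictions, the decay factor `gamma=0.5` is made up because the post does not give one, and the linear `model` is only a placeholder for the ViT plus FC head.

```python
import torch
from torch import nn
from torch.nn import functional as F

def combined_loss(pred, target, l1_weight=1e-3):
    """MSE plus a small L1 term (assumed reading of 'L2 + 0.001 * L1')."""
    return F.mse_loss(pred, target) + l1_weight * F.l1_loss(pred, target)

torch.manual_seed(0)
model = nn.Linear(8, 900)                      # placeholder for ViT + FC head
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)
# Multiply the learning rate by gamma every 10 epochs (gamma is an assumption).
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.5)

x = torch.randn(4, 8)                          # synthetic inputs
y = torch.randn(4, 2, 150, 3)                  # synthetic targets

lrs = []
for epoch in range(30):                        # shortened from 100 for the demo
    optimizer.zero_grad()
    pred = model(x).view(-1, 2, 150, 3)
    loss = combined_loss(pred, y)
    loss.backward()
    optimizer.step()
    scheduler.step()                           # once per epoch, after optimizer.step()
    lrs.append(optimizer.param_groups[0]["lr"])

print(lrs[8], lrs[9], lrs[19])
```

Logging the learning rate per epoch, as `lrs` does here, makes it easy to confirm that the schedule behaves as intended.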

On evaluation, the second channel of the output matrix shows a reasonably good Spearman correlation (0.67), but the first channel is very poor (0.20). The two output channels represent different sensor signals.
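
One way such a per-channel evaluation could be computed is sketched below. How the correlation was actually aggregated is not stated, so flattening each output channel over samples, frames, and coordinates is an assumption. The rank function does not average ties, which is fine for continuous data; `scipy.stats.spearmanr` handles ties properly.

```python
import numpy as np

def ranks(a):
    """Ordinal ranks 0..n-1 (ties are not averaged)."""
    order = np.argsort(a, kind="stable")
    r = np.empty(a.size, dtype=np.float64)
    r[order] = np.arange(a.size)
    return r

def spearman(a, b):
    """Spearman's rho as the Pearson correlation of the ranks."""
    return float(np.corrcoef(ranks(a), ranks(b))[0, 1])

def per_channel_spearman(pred, target):
    """pred, target: (N, channels, frames, 3) -> one rho per output channel."""
    return [spearman(pred[:, c].ravel(), target[:, c].ravel())
            for c in range(pred.shape[1])]

# Synthetic example: channel 0 is predicted poorly, channel 1 accurately.
rng = np.random.default_rng(0)
target = rng.normal(size=(16, 2, 150, 3))
pred = target.copy()
pred[:, 0] += 5.0 * rng.normal(size=pred[:, 0].shape)
pred[:, 1] += 0.1 * rng.normal(size=pred[:, 1].shape)

rho = per_channel_spearman(pred, target)
print(rho)
```

Computing the correlation per sample or per coordinate instead of over the flattened channel can give quite different numbers, so it helps to state which variant is reported.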

Could you please explain, or point me towards, any intuitions, mistakes, or training considerations that I am missing? Thanks a lot.

Update: I also tested with a normalized target vector (sklearn StandardScaler). The correlations for both channels are now worse: 0.20 and 0.10.
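
A sketch of the target standardization, written in plain NumPy so the steps are visible. It assumes the targets were flattened to N * 900 and scaled per column, which is what `StandardScaler` does by default on a 2-D array; whether predictions were mapped back to the original units before computing the correlation is not stated, and `to_original_units` is a hypothetical helper.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic targets whose two output channels have very different scales.
channel_scale = np.array([1.0, 40.0]).reshape(1, 2, 1, 1)
y_train = rng.normal(size=(32, 2, 150, 3)) * channel_scale
y_flat = y_train.reshape(len(y_train), -1)      # (32, 900)

# Per-column statistics, fitted on the training targets only.
mean = y_flat.mean(axis=0)
std = y_flat.std(axis=0)                        # population std (ddof=0)
std[std == 0] = 1.0                             # guard against constant columns

y_scaled = (y_flat - mean) / std                # what the network would regress

def to_original_units(pred_scaled):
    """Undo the standardization and restore the (N, 2, 150, 3) layout."""
    return (pred_scaled * std + mean).reshape(-1, 2, 150, 3)

restored = to_original_units(y_scaled)
print(np.allclose(restored, y_train))
```

If the correlation is computed over values pooled across several columns, per-column scaling changes their relative order, so scores taken on scaled and unscaled outputs are not directly comparable.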