The distribution of the predictions is heavily distorted compared to the labels'

I am using some CNN layers to do a regression task with an MSE loss function. The distribution of the labels is roughly normal, but after a few iterations of training the prediction distribution becomes very strange.
The right side is the label distribution and the left side is the predictions'.
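For reference, this is roughly how I compare the two distributions (a minimal sketch; preds and labels are placeholder names for 1-D CPU tensors collected over the validation set, and the layout matches the figure: predictions on the left, labels on the right):

import matplotlib.pyplot as plt

def plot_distributions(preds, labels, bins=50):
    # histograms of model predictions (left) and ground-truth labels (right)
    fig, (ax_pred, ax_label) = plt.subplots(1, 2, figsize=(10, 4))
    ax_pred.hist(preds.detach().numpy(), bins=bins)
    ax_pred.set_title("predictions")
    ax_label.hist(labels.numpy(), bins=bins)
    ax_label.set_title("labels")
    plt.show()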

The model structure:

model4(
  (conv_channel1): Sequential(
    (0): Conv2d(3, 16, kernel_size=(2, 2), stride=(1, 1), padding=(2, 2))
    (1): ReLU()
    (2): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
    (3): Conv2d(16, 32, kernel_size=(2, 2), stride=(1, 1), padding=(2, 2))
    (4): ReLU()
    (5): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
  )
  (conv_channel2): Sequential(
    (0): Conv2d(3, 16, kernel_size=(2, 2), stride=(2, 2), padding=(3, 3))
    (1): ReLU()
    (2): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
    (3): Conv2d(16, 32, kernel_size=(2, 2), stride=(2, 2), padding=(3, 3))
    (4): ReLU()
    (5): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
  )
  (conv_channel3): Sequential(
    (0): Conv3d(1, 8, kernel_size=(3, 3, 3), stride=(1, 1, 1), padding=(2, 2, 2))
    (1): ReLU()
    (2): MaxPool3d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
    (3): Conv3d(8, 16, kernel_size=(3, 2, 2), stride=(1, 1, 1), padding=(1, 1, 1))
    (4): ReLU()
    (5): MaxPool3d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
  )
  # the outputs of the three conv branches above are flattened, concatenated, and used as the input to the dense layer
  (dense): Sequential(
    (0): Linear(in_features=864, out_features=128, bias=True)
    (1): ReLU()
    (2): Linear(in_features=128, out_features=64, bias=True)
    (3): ReLU()
    (4): Linear(in_features=64, out_features=1, bias=True)
    (5): Tanh()
  )
)

Any ideas are appreciated!

It is hard to debug a training session with only the network architecture. Could you share the training code?
How does the loss behave during training? Is it decreasing properly? Do you normalize your inputs? Just throwing out some ideas…
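For example, something along these lines for normalization (a minimal sketch; train_x and val_x are just placeholder names for your input tensors):

# standardize the inputs feature-wise using training-set statistics only
train_mean = train_x.mean(dim=0, keepdim=True)
train_std = train_x.std(dim=0, keepdim=True) + 1e-8  # avoid division by zero
train_x = (train_x - train_mean) / train_std
val_x = (val_x - train_mean) / train_std  # reuse the same statistics for validation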

Sure.
I found there was a mistake when preprocessing the input: I took the absolute value of the input, which made x asymmetric. But after fixing it, the prediction distribution is still uneven.
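This is roughly how I checked the input after the fix (a small sketch; x is a placeholder for the flattened input tensor):

# quick sanity check that the corrected input is roughly symmetric around zero
x_flat = x.view(-1).float()
mean, std = x_flat.mean(), x_flat.std()
skew = ((x_flat - mean) ** 3).mean() / std ** 3  # sample skewness, ~0 if symmetric
print(f"mean={mean.item():.4f}, std={std.item():.4f}, skew={skew.item():.4f}")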

import torch
import torch.nn as nn


class model4(nn.Module):

    def __init__(self, in_channels=3):
        super(model4, self).__init__()

        # branch 1: 2-D convolutions with stride 1
        self.conv_channel1 = nn.Sequential(
            nn.Conv2d(
                in_channels=in_channels,
                out_channels=16,
                kernel_size=2,
                stride=1,
                padding=2
            ),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2),

            nn.Conv2d(
                in_channels=16,
                out_channels=32,
                kernel_size=2,
                stride=1,
                padding=2
            ),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2)
        )

        # branch 2: 2-D convolutions with stride 2
        self.conv_channel2 = nn.Sequential(
            nn.Conv2d(
                in_channels=in_channels,
                out_channels=16,
                kernel_size=2,
                stride=2,
                padding=3
            ),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2),

            nn.Conv2d(
                in_channels=16,
                out_channels=32,
                kernel_size=2,
                stride=2,
                padding=3
            ),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2),
        )

        # branch 3: 3-D convolutions over the input with an added channel dimension
        self.conv_channel3 = nn.Sequential(
            nn.Conv3d(
                in_channels=1,
                out_channels=8,
                kernel_size=3,
                stride=1,
                padding=2
            ),
            nn.ReLU(),
            nn.MaxPool3d(kernel_size=2),

            nn.Conv3d(
                in_channels=8,
                out_channels=16,
                kernel_size=(3, 2, 2),
                stride=(1, 1, 1),
                padding=(1, 1, 1)
            ),
            nn.ReLU(),
            nn.MaxPool3d(kernel_size=2)
        )

        # fully connected head; the final tanh bounds the raw output to [-1, 1]
        self.dense = nn.Sequential(
            nn.Linear(864, 128),
            nn.ReLU(),

            nn.Linear(128, 64),
            nn.ReLU(),

            nn.Linear(64, 1),
            nn.Tanh()
        )

    def forward(self, x):
        # reshape the raw input into an image-like tensor of shape (N, 3, 4, 16)
        x = x.view(-1, 3, 4, 16)
        output_channel1 = self.conv_channel1(x)
        output_channel2 = self.conv_channel2(x)
        # add a singleton channel dimension for the 3-D convolution branch
        input_3d = x.unsqueeze(1)
        output_channel3 = self.conv_channel3(input_3d)
        # flatten and concatenate the three branch outputs into 864 features
        x = torch.cat((output_channel1.view(output_channel1.size(0), -1),
                       output_channel2.view(output_channel2.size(0), -1),
                       output_channel3.view(output_channel3.size(0), -1)), 1)

        # scale the tanh output so predictions lie in [-3, 3]
        output = self.dense(x) * 3
        return output
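To sanity-check the shapes, this is roughly how I run it (a minimal sketch with random dummy data; the batch size of 8 is arbitrary):

model = model4()
dummy = torch.randn(8, 3, 4, 16)  # fake batch matching the expected per-sample layout
out = model(dummy)
print(out.shape)  # torch.Size([8, 1])
print(out.min().item(), out.max().item())  # always within [-3, 3] because of tanh * 3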

And the distribution of the input is:

By the way, the negative labels are quite easy to predict in my experience, since the label is the stock return and the short alpha is hard to trade due to some restrictions, which may keep it relatively stable.