The last layer of your model determines the shape of your output. I am guessing you are using a CNN, since you mentioned color, so your last layer should have an output channel size of 1 if you want the output to be a single-channel tensor. Something like nn.Conv2d(in_channels, 1, kernel_size, stride, padding).
This would allow you to train; however, you would have extra weights that are not being trained against any loss function.
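To make the last-layer suggestion concrete, here is a minimal sketch of a CNN whose final convolution has a single output channel. The architecture is a hypothetical toy model, not your actual network: the 3 input channels (RGB), the 16 hidden channels, the kernel size, and the 32x32 input are all assumptions chosen for illustration.

```python
import torch
import torch.nn as nn

# Hypothetical toy CNN; all layer sizes here are illustrative assumptions.
model = nn.Sequential(
    # Assumes a 3-channel (RGB) input image.
    nn.Conv2d(3, 16, kernel_size=3, stride=1, padding=1),
    nn.ReLU(),
    # Final layer: out_channels=1 gives a single-channel output tensor.
    nn.Conv2d(16, 1, kernel_size=3, stride=1, padding=1),
)

# Dummy batch of 4 RGB images, 32x32 pixels each.
x = torch.randn(4, 3, 32, 32)
out = model(x)
print(out.shape)  # torch.Size([4, 1, 32, 32])
```

With kernel_size=3, stride=1 and padding=1 the spatial size is preserved, so only the channel dimension changes from 3 to 1. Your target tensor would then need the same shape as this output for most pixel-wise losses.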