Different results on single vs multiple output network

I’m trying to fit a network with 4 real-valued outputs. The network uses dropout on every layer except the last one so that I can capture the model uncertainty (dropout as a Bayesian approximation). The problem is that the network performs poorly when it predicts all 4 outputs but gives adequate results when there is only one output.
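
To make the uncertainty part concrete, here is a minimal sketch of Monte Carlo dropout at prediction time. The helper name `mc_dropout_predict`, the toy model, and the sample count are assumptions for illustration, not the actual training code: dropout is kept active, several stochastic forward passes are run, and their spread is used as a rough uncertainty estimate.

```python
import torch
import torch.nn as nn

def mc_dropout_predict(model, x, n_samples=50):
    """Mean and std over stochastic forward passes (hypothetical helper)."""
    model.train()  # keeps nn.Dropout active; fine here because the toy model has no batch norm
    with torch.no_grad():
        # Shape: (n_samples, batch, outputs)
        samples = torch.stack([model(x) for _ in range(n_samples)])
    return samples.mean(dim=0), samples.std(dim=0)

torch.manual_seed(0)
# Toy stand-in for the real network: 40 inputs, 4 outputs, dropout before the last layer.
toy = nn.Sequential(nn.Linear(40, 64), nn.ReLU(), nn.Dropout(p=0.1), nn.Linear(64, 4))
x = torch.randn(8, 40)
mean, std = mc_dropout_predict(toy, x)
print(mean.shape, std.shape)
```

The per-output `std` gives one uncertainty value per sample and per output, which makes it possible to see whether some of the 4 outputs are much noisier than others.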

I have done sanity checks, i.e. checking that the network has enough capacity to overfit a small set of observations. I am also normalizing both the inputs and the outputs. The standard deviations of the outputs differ a lot.
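
The two checks can be sketched roughly as follows. Everything here is made up for illustration (the data, the target scales, the small stand-in model, the learning rate and step count): each target column is standardized separately, and a model without dropout is then trained on a handful of observations to see whether the loss collapses.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
# Made-up data: 16 observations, 40 inputs, 4 targets on very different scales.
x = torch.randn(16, 40)
y_raw = torch.randn(16, 4) * torch.tensor([0.1, 1.0, 10.0, 100.0])

# Standardize each target column separately (statistics from the training set only).
y_mean, y_std = y_raw.mean(dim=0), y_raw.std(dim=0)
y = (y_raw - y_mean) / y_std

# Small stand-in model without dropout, so the overfit check is not blurred by noise.
model = nn.Sequential(nn.Linear(40, 256), nn.ReLU(), nn.Linear(256, 4))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

first_loss = None
for step in range(500):
    optimizer.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    optimizer.step()
    if first_loss is None:
        first_loss = loss.item()  # loss before any parameter update
final_loss = loss.item()
print(f"loss: {first_loss:.4f} -> {final_loss:.6f}")
```

If the targets were normalized with a single global mean and std instead of per column, the outputs with the largest scale would dominate the MSE, which is one possible cause of poor multi-output results.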

Has anyone faced a similar problem? I know this is not strictly a question about the torch framework, but I hope I can get some answers. The network is trained with MSE loss and Adam, and is defined as follows:

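import torch.nn as nn

class NETWORK(nn.Module):

	def __init__(self, layers = [40, 2048, 2048, 1024, 1024, 4], droprate = 0.1):
		super(NETWORK, self).__init__()
		self.p = droprate
		modules, n = [], len(layers) - 1
		for i in range(n):
			modules.append(nn.Linear(layers[i], layers[i + 1]))
			# Activation and dropout after every layer except the last one.
			# The body of this `if` was lost from the post; nn.ReLU is an
			# assumed placeholder for the original activation.
			if i + 1 != n:
				modules.append(nn.ReLU())
				modules.append(nn.Dropout(p = self.p))
		self.net = nn.Sequential(*modules)

	def forward(self, x):
		return self.net(x)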
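
Since a single MSE value averages over all 4 outputs, it can hide the fact that only one of them is badly fit. A small diagnostic along these lines may help. The helper `per_output_mse` and the data are hypothetical and only illustrate the idea.

```python
import torch

def per_output_mse(pred, target):
    """Mean squared error for each output column separately (hypothetical helper)."""
    return ((pred - target) ** 2).mean(dim=0)

# Made-up predictions: the last output is off by a constant 2.0, the others are exact.
target = torch.zeros(5, 4)
pred = target.clone()
pred[:, 3] += 2.0

errors = per_output_mse(pred, target)
print(errors)         # tensor([0., 0., 0., 4.])
print(errors.mean())  # tensor(1.), the same value nn.MSELoss() would report
```

Logging these four numbers during training shows whether the multi-output network fails on every output or only on some of them.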