Hello,
I have some data that I’d like to model in the following way:
I have n observations, each of which has a fully observed set of covariates x_i.
For n_obs of these observations, there is a corresponding continuous response observation, y_i. For n_mis of these observations, there is no associated y_i. n = n_obs + n_mis.
I believe the data to be missing not at random, so it would not make sense to simply train a model on the n_obs observations.
Instead, I would like to build one model that incorporates all available data. I would like to structure the network as follows:
Input Layer: 6 covariates, all fully observed.
Hidden Layer: 5 neurons, each with RELU activation.
Output Layer 1: 1 neuron with linear activation. The goal of this output is to predict the continuous y variable for the data.
Output Layer 2: 1 neuron with sigmoid activation. The input to this layer is the one neuron in output layer 1. The goal of this is to predict if the response is observed or not.
The model above is quite similar to one I have written in probabilistic programming languages. However, none of those languages have neural network implementations, and I believe my model may have somewhat complex, nonlinear interactions.
Essentially, I’m really interested in predicted y, but I’d like to use the n_mis observations to train the model, because I believe that the missing values would have had lower y values had they been observed. This network architecture seems like a way to enforce this sort of relationship.
There are two questions I have relevant to the implementation of the above model:

What loss function should I use for the outputs of output layer 1 and output layer 2? My usual intuition for layer 1 would be MSE, and for layer 2, log loss. However, these will surely not be on the same scale, which I feel could lead to serious issues. I'm entirely fine with simply adding the losses from output layer 1 and output layer 2 during training, but I'm curious whether anyone has suggestions for losses that would put layer 1 and layer 2 on the same scale.
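For what it's worth, here is a minimal sketch of the "just add them" approach, with a weighting hyperparameter to rebalance the two terms. The function name `combined_loss` and the weight `lam` are my own inventions, not anything canonical; `lam` is simply a tuning knob you would pick by validation:

```python
import torch
import torch.nn.functional as F

def combined_loss(y_pred, p_obs, y_true, obs_mask, lam=1.0):
    # MSE only over the rows where y was actually observed
    mse = F.mse_loss(y_pred[obs_mask], y_true[obs_mask])
    # Binary cross-entropy on the observed/missing indicator, over all rows
    bce = F.binary_cross_entropy(p_obs, obs_mask.float())
    # lam (hypothetical weight) trades off regression vs. classification loss
    return mse + lam * bce
```

One sanity check on the scale question: BCE is bounded below by 0 and grows slowly, while MSE depends on the units of y, so standardizing y before training (or tuning `lam`) is usually enough to keep one term from dominating.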

How would I go about implementing a neural network architecture like this in PyTorch? Here is my start on the code:
import torch
import torch.nn as nn
import torch.nn.functional as F

class Net(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(6, 5)  # input layer > hidden layer
        self.fc2 = nn.Linear(5, 1)  # hidden layer > y
        self.fc3 = nn.Linear(1, 1)  # y > 1 for observed, 0 otherwise

    def forward(self, x):
        x = F.relu(self.fc1(x))
        x = self.fc2(x)
        x = torch.sigmoid(self.fc3(x))  # F.sigmoid is deprecated
        return x
However, I understand that this will not work for backpropagating the loss as I would like to, because only the final output layer is returned in the forward pass. I don’t want to restructure the network into something that has 1 output layer with 2 neurons, because I specifically am trying to force the decision boundary for observation to be linear in y.
So, does anyone have ideas about the best way to format this code? I also know that I'm going to have an issue with defining the loss differently for the n_obs and n_mis points, and I'm not sure how to do that either. But I figure the answer will depend heavily on how the network itself is defined.
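In case it helps frame the question, here is one sketch of what I have in mind: `forward` returns both intermediate outputs (which keeps the observation boundary linear in the predicted y, since fc3 takes only that one neuron as input), and the loss masks the MSE term to the observed rows. The toy data and the plain sum of the two losses are placeholders, not a claim about the right weighting:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Net(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(6, 5)  # input layer > hidden layer
        self.fc2 = nn.Linear(5, 1)  # hidden layer > y
        self.fc3 = nn.Linear(1, 1)  # y > P(observed), linear in y

    def forward(self, x):
        h = F.relu(self.fc1(x))
        y_hat = self.fc2(h)                      # linear output for y
        p_obs = torch.sigmoid(self.fc3(y_hat))   # fed only the y prediction
        return y_hat.squeeze(-1), p_obs.squeeze(-1)

# Toy training step (made-up data, for illustration only)
net = Net()
opt = torch.optim.Adam(net.parameters(), lr=1e-2)
x = torch.randn(8, 6)
obs = torch.tensor([1, 1, 1, 1, 1, 0, 0, 0], dtype=torch.bool)
y = torch.randn(8)  # entries at missing rows are never used

y_hat, p_obs = net(x)
loss = (F.mse_loss(y_hat[obs], y[obs])            # regression, observed rows only
        + F.binary_cross_entropy(p_obs, obs.float()))  # observed/missing, all rows
opt.zero_grad()
loss.backward()
opt.step()
```

Because both heads share fc1 and fc2, the gradient of the BCE term flows back through the y prediction for the missing rows too, which is exactly the mechanism that lets the n_mis points inform the model.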
Any help is appreciated! I’m also open to the idea that to do something like this I should really be using a probabilistic package with a neural network implementation like Pyro, but I wanted to understand how I might implement this in PyTorch before moving on to that.
Eric.