I am trying to create a denoising autoencoder that imputes missing data. I only have a small number of data points (~100), each with ~200 features, of which up to 20% may be missing for a given data point.
My intention is to train an auto-encoder using SGD (batch size = 1) to impute the missing data for subsequent analysis.
Within each batch (i.e. a single data point) I would like to apply dropout in the first hidden layer, dropping the columns that correspond to the missing features. Is this something that is achievable within the PyTorch framework?
Could you explain your use case a bit more?
As far as I understand it, you would like to create a linear layer with the same number of input and output features (i.e. 200 in your case) and zero out the activations where the features of your input were “missing”.
Note that a linear layer creates a fully connected weight matrix, such that each input feature will be multiplied with a certain weight to create each output activation.
I’m not sure how using dropout at specific indices should help. Do you have a reference paper for this method?
I want to create an embedding for a problem in the physical sciences. The dataset I've curated has a lot of missing entries, and collecting more data is not viable as the relevant experiments are prohibitively expensive. I want to experiment with whether I can train an autoencoder with a custom loss function that excludes the entries corresponding to missing data in the output layer. If I can get near-zero reconstruction loss on this custom loss function, then I can reasonably take the results of the output layer as an approximation to the dense featurisation. There is also the question of whether the output layer or the latent representation would be the better embedding to use for subsequent modelling.
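To make the loss idea concrete, here is a minimal sketch of what I have in mind (the function name and the boolean `observed_mask` convention are my own, not from any established recipe): only entries that were actually measured contribute to the reconstruction error and hence to the gradients.

```python
import torch

def masked_mse_loss(output, target, observed_mask):
    # observed_mask: 1.0 where the feature was measured, 0.0 where missing.
    # Squared errors at missing positions are zeroed out entirely.
    sq_err = (output - target) ** 2 * observed_mask
    # Average only over the observed entries; clamp avoids division by
    # zero in the degenerate case of a fully missing sample.
    return sq_err.sum() / observed_mask.sum().clamp(min=1)
```

The missing entries then receive no gradient signal at all, so the network is free to fill them in however best reconstructs the observed features.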
After thinking a bit more, I realise I can achieve pretty much the same effect just by using zeros as the padding value for my missing entries. There may be an issue with the fact that some of the features are boolean with a binary encoding, but I guess I will just have to run the experiment and see what the result is.
I am not sure whether zeroing out the activations after the first layer is sensible, because the subsequent layers may develop a co-dependence on the bias for a missing entry; these are all things I can experiment with. If I use no bias in the first layer, choose an activation that is non-zero at zero such as softplus, and then reset the masked activations, this may be the way to go: the masking would then provide the desired regularisation effect without impacting the one-hot features.
I am not going off any particular reference paper, just the belief that denoising autoencoders with masking noise can be repurposed for data imputation. From a quick literature search I know this is not an original idea, but nor did I find anything that I thought improved on my basic approach.
I don’t think this question will be particularly useful for other users and so feel free to delete it.
That sounds like an interesting idea!
Not at all! I really like these new ideas and would like to know how your experiments worked out!
Here is a small code snippet zeroing out the activations of the linear layer based on the "missing" features of the original input. My definition of missing is simply a feature set to zero.

import torch
import torch.nn as nn

class MyDropout(nn.Module):
    def forward(self, x, pre_x):
        # Keep a feature column only if it is non-zero for every sample
        # in the batch, and zero the corresponding activations
        mask = (pre_x != 0).all(0).float()
        return x * mask

N, C = 10, 5
x = torch.randn(N, C)
# Set a few features to zero to simulate missing data
x[:, [0, 3]] = 0.
lin = nn.Linear(C, C)
drop = MyDropout()
output = lin(x)
output = drop(output, x)
Let me know, if that would work for you or if I misunderstood your use case.
I am having the same issue; however, my problem is how to implement this across an entire batch. Do you have any suggestions for doing this for an entire batch while avoiding a loop over each sample?
My code doesn’t use a for loop and should work for the batch.
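If instead you want each sample to keep its own pattern of missing features (rather than masking a column only when it is zero for the whole batch), the mask can be built elementwise, still without any loop. A sketch, again assuming zeros mark missing entries (the class name is mine):

```python
import torch
import torch.nn as nn

class PerSampleDropout(nn.Module):
    # Zero activations per sample at the positions where that sample's
    # own input features were zero ("missing").
    def forward(self, x, pre_x):
        mask = (pre_x != 0).float()   # shape (N, C), elementwise, no loop
        return x * mask

N, C = 10, 5
x = torch.randn(N, C)
x[0, 2] = 0.0                         # feature 2 missing only for sample 0
lin = nn.Linear(C, C)
drop = PerSampleDropout()
out = drop(lin(x), x)
```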
Could you post your approach so that we can have a look?
Thanks for your solution! I believe I've got a similar problem. I'm approximating a 2D function on a grid, such that my output is of size Nx2xHxW.
I have training data for each of the N samples in the form of 2xHxW matrix, however, it has some of the 1x2 tuples missing from the HxW grid, so I set those values to an unacceptable value, say 999, before computing the loss. I compute an MSE loss with a mask similar to yours and exclude those output nodes from adding to the loss.
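Something along these lines is what I mean by the masked loss (a sketch from my own code; the function name and the choice of 999 as the sentinel are mine):

```python
import torch

def sentinel_masked_mse(pred, target, sentinel=999.0):
    # pred, target: (N, 2, H, W). Entries equal to `sentinel` in the
    # target mark missing training data and must not contribute to
    # the loss or the gradients.
    mask = (target != sentinel).float()
    sq_err = (pred - target) ** 2 * mask
    # Average over observed entries only; clamp guards against an
    # all-missing target.
    return sq_err.sum() / mask.sum().clamp(min=1)
```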
Input to the network is of size Nx2 and I have a decoder architecture with a linear layer followed by M ConvTranspose2d layers.
My questions are:
- Do you agree that I should not backpropagate from all the output nodes for which I have the missing training data?
- Secondly, should I dropout all the nodes from all the layers that correspond to the output nodes with missing training data?
- Third, how do I achieve it for say a simple network with 1 linear layer followed by BatchNorm and Activation, 1 ConvTranspose2d layer followed by BatchNorm and Activation and 1 Output ConvTranspose2d layer?
Please help as my network is not learning with missing data.