Hi dear forum!
I’m working with intensive care data at the moment (see MIMIC-IV on physionet.org). It contains many missing values, and I’m trying out methods that handle those NaNs directly by embedding them properly, instead of imputing the missing values with e.g. a mean or median.
I created a `NanEmbed` layer, see below. It takes in a batch of 1-dimensional feature vectors that may contain NaNs. Each feature is projected to an `out_size`-dimensional vector by its own linear layer. All feature embedding vectors are then summed up, but the vectors of features with a NaN are set to 0 (i.e. ignored) during the summation. This allows the embedding to distinguish between a regular value, an input of 0, and a NaN:
```python
class NanEmbed(torch.nn.Module):
    def __init__(self, in_size, out_size, use_conv=True):  # use_conv is unused for now
        super().__init__()
        self.in_size = in_size
        self.out_size = out_size
        # create one embedding (linear) layer per input feature
        self.emb_layers = torch.nn.ModuleList(
            [torch.nn.Linear(1, out_size) for _ in range(in_size)])

    def forward(self, x):
        # embed each feature into a larger embedding vector of size out_size
        out = torch.stack([layer(x[:, i].unsqueeze(1))
                           for i, layer in enumerate(self.emb_layers)], dim=-1)

        # method 1 for setting NaNs to 0
        # with torch.no_grad():
        #     out[torch.isnan(out)] = 0

        # method 2 (current method)
        out = torch.nan_to_num(out)
        emb = out.mean(dim=-1)

        # method 3 (slow and also yields NaN grads)
        # mask = torch.isnan(x)
        # bs = x.shape[0]
        # emb = torch.stack([out[i][:, torch.where(mask[i])].sum(dim=-1) / self.in_size
        #                    for i in range(bs)])
        return emb
```
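For reference, here is a quick standalone sanity check of the behavior described above (it re-declares a trimmed-down copy of the layer so it runs on its own; the input values are arbitrary). It confirms that a feature value of 0 and a NaN produce different embeddings, since the 0 still contributes the layer's bias while the NaN contributes nothing:

```python
import torch


class NanEmbed(torch.nn.Module):
    # condensed copy of the layer above, for a self-contained check
    def __init__(self, in_size, out_size):
        super().__init__()
        self.emb_layers = torch.nn.ModuleList(
            [torch.nn.Linear(1, out_size) for _ in range(in_size)])

    def forward(self, x):
        out = torch.stack([layer(x[:, i].unsqueeze(1))
                           for i, layer in enumerate(self.emb_layers)], dim=-1)
        return torch.nan_to_num(out).mean(dim=-1)


torch.manual_seed(0)
emb = NanEmbed(3, 4)
# two rows that differ only in feature 1: a 0 vs. a NaN
x = torch.tensor([[1.0, 0.0, 2.0],
                  [1.0, float("nan"), 2.0]])
y = emb(x)
# the forward pass produces no NaNs, and the two rows embed differently
assert not torch.isnan(y).any()
assert not torch.allclose(y[0], y[1])
```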
Now, I have two big problems:
- Applying these individual linear layers in a for-loop is slow and ugly. Hence, I tried to adapt the solution from How to apply different kernels to each example in a batch when using convolution? - #3 by postBG to 1-D, and replace my list of linear layers by a grouped convolution:

  ```python
  conv = torch.nn.Conv1d(in_size, in_size * out_size, 1, stride=1, padding=0, groups=in_size, bias=True)
  ```

  This projects my input of shape `(batch_size, feature_num == in_size, 1)` to `(batch_size, in_size * out_size, 1)`, so it seems to work if I reshape the output. My question: does this do the right thing, i.e. apply individual linear weights to each single feature?
- The bigger problem: this approach does not work at all. The gradients of the weights of the linear layer belonging to a feature are NaN as soon as a single value of that feature is NaN anywhere in the batch, so the network does not learn. I am showing three methods I tried in the forward pass above, but I don’t understand why the gradients turn to NaN… I would be happy for any hints. As far as I can see I might need to write a custom backward function, but I’ve never done that before, so some help would be greatly appreciated.
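To make both points above concrete, here is a small self-contained sketch I put together (all variable names and sizes are my own, not from the code above). The first part checks numerically that a grouped 1×1 `Conv1d` with `groups=in_size` really does apply an independent linear map per feature. The second part reproduces the NaN-gradient effect: `torch.nan_to_num` zeroes the upstream gradient at the NaN positions, but the weight gradient of a linear layer is `grad_out^T @ x`, and `0 * NaN` is still NaN; zeroing the NaNs in the *input* (and masking the output) avoids this:

```python
import torch

torch.manual_seed(0)
batch_size, in_size, out_size = 5, 3, 4

# --- part 1: grouped 1x1 conv vs. per-feature linear layers ---
conv = torch.nn.Conv1d(in_size, in_size * out_size, kernel_size=1,
                       groups=in_size, bias=True)

# per-feature linear layers with weights copied from the conv's groups
linears = [torch.nn.Linear(1, out_size) for _ in range(in_size)]
with torch.no_grad():
    for i, lin in enumerate(linears):
        lin.weight.copy_(conv.weight[i * out_size:(i + 1) * out_size, 0])
        lin.bias.copy_(conv.bias[i * out_size:(i + 1) * out_size])

x = torch.randn(batch_size, in_size)
out_conv = conv(x.unsqueeze(-1)).view(batch_size, in_size, out_size)
out_lin = torch.stack([lin(x[:, i:i + 1]) for i, lin in enumerate(linears)], dim=1)
assert torch.allclose(out_conv, out_lin, atol=1e-6)  # identical results

# --- part 2: why the gradients turn to NaN ---
lin_nan = torch.nn.Linear(1, 4)
x_nan = torch.tensor([[1.0], [float("nan")]])
out = torch.nan_to_num(lin_nan(x_nan))  # forward pass looks clean
out.sum().backward()
# weight grad is grad_out^T @ x; the zeroed grad times NaN input is NaN
assert torch.isnan(lin_nan.weight.grad).all()

# zeroing the NaNs in the input (and masking the output) keeps grads finite
lin_ok = torch.nn.Linear(1, 4)
mask = torch.isnan(x_nan)
out2 = lin_ok(torch.where(mask, torch.zeros_like(x_nan), x_nan))
out2 = out2.masked_fill(mask, 0.0)
out2.sum().backward()
assert not torch.isnan(lin_ok.weight.grad).any()
```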
Hope you’re having a nice day and I’m looking forward to any responses!