Hi dear forum!
I’m working with intensive care data at the moment (see MIMIC-IV on physionet.org). It contains many missing values, and I’m trying out methods that handle those NaNs directly by embedding them properly, instead of imputing the missing values with e.g. a mean or median.
I created a `NanEmbed` layer, see below. It takes in a batch of 1-dimensional feature vectors that may contain NaNs. Each feature is projected to an `out_size`-dimensional vector by its own linear layer. All feature embedding vectors are then summed up, but the vectors of features with a NaN are set to 0 (i.e. ignored) during the summation. This allows the embedding to distinguish between a regular value, an input of 0, and a NaN:
```python
class NanEmbed(torch.nn.Module):
    def __init__(self, in_size, out_size, use_conv=True):  # use_conv is unused for now
        super().__init__()
        self.in_size = in_size
        self.out_size = out_size
        # create one embedding (linear) layer per input feature
        self.emb_layers = torch.nn.ModuleList(
            [torch.nn.Linear(1, out_size) for _ in range(in_size)])

    def forward(self, x):
        # embed each feature into a larger embedding vector of size out_size
        out = torch.stack([layer(x[:, i].unsqueeze(1))
                           for i, layer in enumerate(self.emb_layers)], dim=-1)

        # method 1 for setting NaNs to 0
        # with torch.no_grad():
        #     out[torch.isnan(out)] = 0

        # method 2 (current method)
        out = torch.nan_to_num(out)
        emb = out.mean(dim=-1)

        # method 3 (slow and also yields NaN grads)
        # mask = torch.isnan(x)
        # bs = x.shape[0]
        # emb = torch.stack([out[i][:, torch.where(mask[i])].sum(dim=-1) / self.in_size
        #                    for i in range(bs)])
        return emb
```
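For reference, here is a quick standalone sanity check of the behavior described above (it re-declares a trimmed-down copy of the layer so it runs on its own; the input values are arbitrary). It confirms that a feature value of 0 and a NaN produce different embeddings, since the 0 still contributes the layer's bias while the NaN contributes nothing:

```python
import torch


class NanEmbed(torch.nn.Module):
    # condensed copy of the layer above, for a self-contained check
    def __init__(self, in_size, out_size):
        super().__init__()
        self.emb_layers = torch.nn.ModuleList(
            [torch.nn.Linear(1, out_size) for _ in range(in_size)])

    def forward(self, x):
        out = torch.stack([layer(x[:, i].unsqueeze(1))
                           for i, layer in enumerate(self.emb_layers)], dim=-1)
        return torch.nan_to_num(out).mean(dim=-1)


torch.manual_seed(0)
emb = NanEmbed(3, 4)
# two rows that differ only in feature 1: a 0 vs. a NaN
x = torch.tensor([[1.0, 0.0, 2.0],
                  [1.0, float("nan"), 2.0]])
y = emb(x)
# the forward pass produces no NaNs, and the two rows embed differently
assert not torch.isnan(y).any()
assert not torch.allclose(y[0], y[1])
```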
Now, I have two big problems:
- Applying these individual linear layers in a for-loop is slow and ugly. Hence, I tried to adapt the solution from How to apply different kernels to each example in a batch when using convolution? - #3 by postBG to 1-D, and replace my list of linear layers by a grouped convolution:

  ```python
  conv = torch.nn.Conv1d(in_size, in_size * out_size, 1, stride=1, padding=0, groups=in_size, bias=True)
  ```

  This projects my input of shape `(batch_size, feature_num == in_size, 1)` to `(batch_size, in_size * out_size, 1)`, so it seems to work if I reshape the output. My question: does this do the right thing, i.e. apply individual linear weights to each single feature?
- The bigger problem: this approach does not work at all. The gradients of the weights of the linear layer belonging to a feature are NaN as soon as a single value of that feature is NaN anywhere in the batch, so the network does not learn. I am showing three methods I tried in the forward pass above, but I don’t understand why the gradients turn to NaN… I would be happy for any hints. As far as I can see I might need to write a custom backward function, but I’ve never done that before, so some help would be greatly appreciated.
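To make both points above concrete, here is a small self-contained sketch I put together (all variable names and sizes are my own, not from the code above). The first part checks numerically that a grouped 1×1 `Conv1d` with `groups=in_size` really does apply an independent linear map per feature. The second part reproduces the NaN-gradient effect: `torch.nan_to_num` zeroes the upstream gradient at the NaN positions, but the weight gradient of a linear layer is `grad_out^T @ x`, and `0 * NaN` is still NaN; zeroing the NaNs in the *input* (and masking the output) avoids this:

```python
import torch

torch.manual_seed(0)
batch_size, in_size, out_size = 5, 3, 4

# --- part 1: grouped 1x1 conv vs. per-feature linear layers ---
conv = torch.nn.Conv1d(in_size, in_size * out_size, kernel_size=1,
                       groups=in_size, bias=True)

# per-feature linear layers with weights copied from the conv's groups
linears = [torch.nn.Linear(1, out_size) for _ in range(in_size)]
with torch.no_grad():
    for i, lin in enumerate(linears):
        lin.weight.copy_(conv.weight[i * out_size:(i + 1) * out_size, 0])
        lin.bias.copy_(conv.bias[i * out_size:(i + 1) * out_size])

x = torch.randn(batch_size, in_size)
out_conv = conv(x.unsqueeze(-1)).view(batch_size, in_size, out_size)
out_lin = torch.stack([lin(x[:, i:i + 1]) for i, lin in enumerate(linears)], dim=1)
assert torch.allclose(out_conv, out_lin, atol=1e-6)  # identical results

# --- part 2: why the gradients turn to NaN ---
lin_nan = torch.nn.Linear(1, 4)
x_nan = torch.tensor([[1.0], [float("nan")]])
out = torch.nan_to_num(lin_nan(x_nan))  # forward pass looks clean
out.sum().backward()
# weight grad is grad_out^T @ x; the zeroed grad times NaN input is NaN
assert torch.isnan(lin_nan.weight.grad).all()

# zeroing the NaNs in the input (and masking the output) keeps grads finite
lin_ok = torch.nn.Linear(1, 4)
mask = torch.isnan(x_nan)
out2 = lin_ok(torch.where(mask, torch.zeros_like(x_nan), x_nan))
out2 = out2.masked_fill(mask, 0.0)
out2.sum().backward()
assert not torch.isnan(lin_ok.weight.grad).any()
```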
Hope you’re having a nice day and I’m looking forward to any responses!