Missing values (nan) in medical sequence data

Hi all, I am a medical informatician and new to data science, so I am a beginner here.
I read this paper about simulating data using an LSTM model with PyTorch: "A method for generating synthetic longitudinal health data" (BMC Medical Research Methodology).

I want to follow this paper (they have additional documentation, but it only scratches the surface). In our community (medical informatics, health care, epidemiology), missing values are a problem, and the authors seem to have solved this. If I understood them correctly, they use an LSTM model because it can handle missing values. I’d like to simulate data based on our in-house studies as well, including the missing values, all to promote data sharing in the medical domain.

  1. Does anybody have an idea or tip on what the tensors should look like when values are missing?
  2. Or should I use imputation methods to fill in the missing values?
  3. Would that distort the output data?
  4. Or is it generally not a good idea to simulate real data that contains missing values?
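
To make question 1 concrete, here is a minimal sketch of one common way to represent missing values: replace each NaN with a placeholder, append a 0/1 "observed" indicator per feature, and compute the loss only over observed targets. This is an assumption on my side and not necessarily what the paper does; `add_missing_mask` and `masked_mse` are hypothetical helper names, and the two NaNs in the example are inserted by hand for illustration.

```python
import torch

def add_missing_mask(x):
    """x: (batch, time, features) float tensor, NaN marks a missing value.

    Returns the model input of shape (batch, time, 2 * features) and the
    boolean mask of observed entries.
    """
    observed = ~torch.isnan(x)
    filled = torch.nan_to_num(x, nan=0.0)  # placeholder value for missing entries
    inputs = torch.cat([filled, observed.float()], dim=-1)
    return inputs, observed

def masked_mse(pred, target, observed):
    """Mean squared error over observed target entries only."""
    target = torch.nan_to_num(target, nan=0.0)  # avoid NaN in the subtraction
    sq_err = (pred - target) ** 2 * observed.float()
    return sq_err.sum() / observed.float().sum().clamp(min=1.0)

# One patient, three visits, two features; two values set to NaN by hand.
x = torch.tensor([[[0.1420, 0.1960],
                   [float("nan"), 0.3719],
                   [0.4693, float("nan")]]])
inputs, observed = add_missing_mask(x)
print(inputs.shape)  # (1, 3, 4): two filled features plus two indicators
```

With this layout the LSTM would take `input_size = 4` instead of 2, and `masked_mse` would stand in for `nn.MSELoss()` so that the placeholder zeros never count as targets.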

Currently I drop patients with missing values. I can simulate data this way, but I am still experimenting with the parameters, because the simulated data does not yet follow the same distribution as the original data, which would be the goal.
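
To put a number on "not distributed like the original", a rough sketch like the following could compare per-visit, per-feature summary statistics of the two data sets. The helper name `summarize` is hypothetical, and the two random tensors are only stand-ins for the real and simulated data, which would both have shape (patients, visits, features).

```python
import torch

def summarize(x):
    """Per-visit, per-feature mean, std and quartiles of x: (patients, visits, features)."""
    q = torch.quantile(x, torch.tensor([0.25, 0.5, 0.75]), dim=0)  # (3, visits, features)
    return {"mean": x.mean(dim=0), "std": x.std(dim=0), "quartiles": q}

torch.manual_seed(0)
real = torch.rand(200, 3, 2)       # stand-in for the real scaled data
simulated = torch.rand(200, 3, 2)  # stand-in for the simulated data

real_stats, sim_stats = summarize(real), summarize(simulated)
for name in real_stats:
    gap = (real_stats[name] - sim_stats[name]).abs().max().item()
    print(f"largest {name} gap: {gap:.4f}")
```

Matching means and quartiles is only a first check; it says nothing about the correlation between height and weight or between visits.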

In general the process would be:
We have several studies, and I simulate all of them. External scientists get access to the simulated data and build their models with our infrastructure. We then execute their models on the real data, and they get both outputs: simulated vs. real. My main goal is to build the infrastructure around this. The simulation is one of its major elements, but, just to be clear, I am not a data scientist.

My testing data looks like this:

  • 1618 patients (babies)
  • 3 visits / time steps
  • at each visit, height and weight were measured

In the future I want to add more features.

Here is the scaled data:

Shape of X: (1294, 3, 2)
tensor([[[0.1420, 0.1960],
         [0.3119, 0.3719],
         [0.4693, 0.5729]],

        [[0.2495, 0.2211],
         [0.3580, 0.3970],
         [0.4575, 0.5729]],

        [[0.2495, 0.3719],
         [0.3852, 0.4975],
         [0.6456, 0.7236]],

        ...,

        [[0.1410, 0.1709],
         [0.2857, 0.3719],
         [0.3888, 0.4975]],

        [[0.2188, 0.2462],
         [0.3703, 0.4724],
         [0.5285, 0.5729]],

        [[0.2134, 0.2965],
         [0.3038, 0.3970],
         [0.5118, 0.6734]]])
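
If patients with missing values are kept, the scaling step has to ignore NaNs as well. Below is a minimal sketch assuming min-max scaling per feature (a guess based on the values above lying between 0 and 1); `nan_minmax_scale` is a hypothetical helper and the raw numbers are made up for illustration.

```python
import numpy as np
import torch

def nan_minmax_scale(x):
    """Scale each feature to [0, 1], ignoring NaNs; x: (patients, visits, features)."""
    lo = np.nanmin(x, axis=(0, 1), keepdims=True)
    hi = np.nanmax(x, axis=(0, 1), keepdims=True)
    scaled = (x - lo) / (hi - lo)  # NaNs stay NaN; assumes hi > lo for every feature
    return scaled, lo, hi

# Made-up values: two patients, three visits, (height, weight).
raw = np.array([[[50.0, 3.2], [np.nan, 5.9], [68.0, 8.1]],
                [[52.0, np.nan], [60.0, 6.3], [70.0, 8.6]]])
scaled, lo, hi = nan_minmax_scale(raw)
X = torch.tensor(scaled, dtype=torch.float32)
print(X.shape)  # (2, 3, 2), with NaN still marking the missing entries
```

Keeping `lo` and `hi` makes it possible to map simulated values back to the original units with `simulated * (hi - lo) + lo`.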

Here is the model:

import torch  # needed for torch.zeros and torch.no_grad below
import torch.nn as nn
import torch.optim as optim
class LSTMModel(nn.Module):
    def __init__(self, input_size, hidden_size, output_size, num_layers):
        super(LSTMModel, self).__init__()
        self.lstm = nn.LSTM(input_size, hidden_size, num_layers, batch_first=True)
        self.fc = nn.Linear(hidden_size, output_size)
    
    def forward(self, x):
        h0 = torch.zeros(self.lstm.num_layers, x.size(0), self.lstm.hidden_size).to(x.device)
        c0 = torch.zeros(self.lstm.num_layers, x.size(0), self.lstm.hidden_size).to(x.device)
        out, _ = self.lstm(x, (h0, c0))
        out = self.fc(out)  # Output for all time steps
        return out

# my parameters
input_size = 2  # Number of features
hidden_size = 648
output_size = 2  # Number of features to predict
num_layers = 4

model = LSTMModel(input_size, hidden_size, output_size, num_layers)
criterion = nn.MSELoss()
optimizer = optim.Adam(model.parameters(), lr=0.0000998)

# Training loop
train_losses = []
val_losses = []
num_epochs = 50
for epoch in range(num_epochs):
    model.train()
    running_loss = 0.0
    for inputs, labels in train_dataloader:
        optimizer.zero_grad()
        outputs = model(inputs)
        loss = criterion(outputs, labels)  # Compare output sequences with target sequences
        loss.backward()
        optimizer.step()
        running_loss += loss.item() * inputs.size(0)
        train_losses.append(loss.item())
    epoch_loss = running_loss / len(train_dataloader.dataset)
    print(f"Epoch [{epoch+1}/{num_epochs}], Loss: {epoch_loss:.4f}")

    model.eval()
    val_loss = 0.0
    with torch.no_grad():
        for inputs, labels in val_dataloader:
            outputs = model(inputs)
            loss = criterion(outputs, labels)  # Ensure dimensions match
            val_loss += loss.item() * inputs.size(0)
    # Average over the validation set, then record one value per epoch
    val_loss /= len(val_dataloader.dataset)
    val_losses.append(val_loss)
    print(f"Validation Loss: {val_loss:.4f}")

print("Training complete")

I’d appreciate any advice.