Time series and LSTM model

Hi @ptrblck,
I'm confused. I have a CSV dataset with 47 columns, and I want to predict the glucose level 30 minutes ahead. The glucose is measured every 5 minutes by a CGM. Only the glucose changes with time; the other features are static. Now I don't know what I should do for my model. Can you help me, please?
Thanks and regards

This is the schema of my data:

I don’t know which model architecture would work the best for your use case, unfortunately.
Also, I would recommend avoiding tagging specific users, as it could demotivate others from posting a valid response, and you might tag someone who might not have a good answer, as in this case.

This data isn’t that dissimilar from the Titanic dataset, with the exception of the time series of glucose levels. Here is what I would try:

  1. Separate your inputs by category and normalize the data between 0 and 1. For glucose, you may just want to set the maximum to the highest recorded value and the minimum to zero, so the normalization is simply norm_value = value / maximum. Normalize the glucose targets with the same maximum (see the sketch after this list).
  2. Keep the glucose levels in a sequential format, separated from the other data, then pass them into a 1D conv net.
  3. You might find that adding a positional encoding to the glucose data before input helps if you have varying amounts of glucose readings per sample (i.e. one is 60 minutes of prior data while another is only 30 minutes, etc.). The encoding should be done such that the time you're predicting, that is the target, is fixed to zero. Positional Encoder example. This probably won't give much benefit, though, if every glucose input sequence is the same length.
  4. If the glucose data is variable in the number of data points, you can add an nn.AdaptiveAvgPool1d layer to the end of the conv net.
  5. Reshape the conv net outputs and concatenate them with the rest of the data using torch.cat().
  6. Put the data from step 5 into a fully connected neural network, with the final layer giving an output of size 1.
  7. Use nn.L1Loss() on the output and target.
  8. optim.Adam() would probably be sufficient for backprop.
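For step 1, here is a minimal sketch of the normalization (the example values and the maximum are placeholders; take the maximum from your own training data):

import torch

# Hypothetical example values; replace with your own CGM readings (mg/dL).
glucose_seq = torch.tensor([110., 125., 140., 152., 160., 158.])  # input sequence
glucose_target = torch.tensor([171.])                              # glucose 30 minutes later

glucose_max = 400.0  # placeholder; use the highest value observed in your training data

# Step 1: norm_value = value / maximum, with the minimum taken as 0
glucose_seq_norm = glucose_seq / glucose_max
glucose_target_norm = glucose_target / glucose_max
# At inference time, multiply the model output by glucose_max to get back to mg/dL.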

Thanks for your help.
But can you explain how I can take the output of the CNN and combine it with the other data? I'm a little confused.
Thanks and regards
@J_Johnson

Here is an overly simplified code example:

import torch
import torch.nn as nn

class CNN_FC_Model(nn.Module):
    def __init__(self, nonseq_channels, seq_channels=1, hidden_size=64, output_size=1):
        super(CNN_FC_Model, self).__init__()
        #CNN branch, add more layers before avgpool as needed
        self.cnn1 = nn.Sequential(
            nn.Conv1d(in_channels=seq_channels, out_channels=hidden_size, kernel_size=3),
            nn.MaxPool1d(kernel_size=2, stride=2),
            nn.ReLU())
        self.avgpool = nn.AdaptiveAvgPool1d(output_size=1)

        # fully connected branch
        self.relu = nn.ReLU()
        self.dropout=nn.Dropout(p=0.3)
        self.fc1 = nn.Linear(nonseq_channels, hidden_size)
        self.fc2 = nn.Linear(hidden_size*2, hidden_size)
        self.fc3 = nn.Linear(hidden_size, output_size)

    def forward(self, seq_data, nonseq_data):
        #run sequential data through CNN
        seq_data = self.cnn1(seq_data)
        seq_data = self.avgpool(seq_data).flatten(1)
        
        #run non-sequential data through first fc1 layer
        nonseq_data = self.relu(self.dropout(self.fc1(nonseq_data)))

        #combine the data and continue through the fully connected layers
        all_data = torch.cat([seq_data, nonseq_data], dim=1)
        all_data = self.relu(self.dropout(self.fc2(all_data)))
        all_data = self.fc3(all_data)
        return all_data

model=CNN_FC_Model(20)

nonseq_data = torch.rand((32, 20)) #batch_size, miscellaneous data points
seq_data = torch.rand((32, 1, 10)) #batch_size, number of channels, sequence length

out = model(seq_data, nonseq_data)

print(out.size())

You should probably include more CNN layers if the sequence is long; for a short sequence, one block may be enough. A sketch is below.
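As a rough sketch (not a tuned architecture), a deeper CNN branch could look like the following; the channel counts and number of blocks are placeholders to adjust for your sequence length:

import torch
import torch.nn as nn

hidden_size = 64
cnn = nn.Sequential(
    nn.Conv1d(in_channels=1, out_channels=hidden_size, kernel_size=3),
    nn.ReLU(),
    nn.MaxPool1d(kernel_size=2, stride=2),
    nn.Conv1d(in_channels=hidden_size, out_channels=hidden_size, kernel_size=3),
    nn.ReLU(),
    nn.MaxPool1d(kernel_size=2, stride=2),
)

out = cnn(torch.rand(32, 1, 40))  # (batch, channels, seq_len)
print(out.size())                 # the AdaptiveAvgPool1d layer would follow this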

I really appreciate your help, but I still have a problem.
Consider that my dataset has one glucose column, and just this column has 604078 rows, while the other features are 50 columns with 188 rows if I remove the duplicates.
If I want to train this model, how should I define my train function?
Can you give an example of a train function for the model you described?
Should I use a DataLoader for this?
Thanks a lot @J_Johnson

I see one of the columns specifies “PtID”. It seems that may mean “Patient ID”. And so each Patient should be treated as a separate data sample. How many Glucose readings are there for each “PtID”? Is it a fixed amount or does it vary between patients?

@somayyeh_hasanzadeh
Regarding the vanilla PyTorch Dataset class, you can organize the data in this class any way you like. You will want to define __len__ and __getitem__ in your Dataset class.

Here is one example:

import os
import numpy as np
import pandas as pd
import torch
from skimage import io
from torch.utils.data import Dataset

class FaceLandmarksDataset(Dataset):
    """Face Landmarks dataset."""

    def __init__(self, csv_file, root_dir, transform=None):
        """
        Arguments:
            csv_file (string): Path to the csv file with annotations.
            root_dir (string): Directory with all the images.
            transform (callable, optional): Optional transform to be applied
                on a sample.
        """
        self.landmarks_frame = pd.read_csv(csv_file)
        self.root_dir = root_dir
        self.transform = transform

    def __len__(self):
        return len(self.landmarks_frame)

    def __getitem__(self, idx):
        if torch.is_tensor(idx):
            idx = idx.tolist()

        img_name = os.path.join(self.root_dir,
                                self.landmarks_frame.iloc[idx, 0])
        image = io.imread(img_name)
        landmarks = self.landmarks_frame.iloc[idx, 1:]
        landmarks = np.array([landmarks])
        landmarks = landmarks.astype('float').reshape(-1, 2)
        sample = {'image': image, 'landmarks': landmarks}

        if self.transform:
            sample = self.transform(sample)

        return sample

Found here: Writing Custom Datasets, DataLoaders and Transforms — PyTorch Tutorials 2.0.1+cu117 documentation

You could set the __len__ to the max value in the PtID column. Then in __getitem__ use idx to filter patients. However, that does assume there are no patient IDs skipped. If an ID gets skipped, you should instead count unique values in that column.
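Here is a minimal sketch of such a Dataset. It assumes the CGM readings are in a DataFrame cgm_df with PtID and Glucose columns, and the static features are in a one-row-per-patient DataFrame static_df that is already numeric and normalized; those names are assumptions to adapt to your files:

import numpy as np
import torch
from torch.utils.data import Dataset

class GlucoseDataset(Dataset):
    def __init__(self, cgm_df, static_df):
        self.cgm_df = cgm_df
        self.static_df = static_df
        # count unique patient IDs rather than taking the max, in case IDs are skipped
        self.patient_ids = np.unique(cgm_df["PtID"].values)

    def __len__(self):
        return len(self.patient_ids)

    def __getitem__(self, idx):
        ptid = self.patient_ids[idx]
        # filter both tables down to this patient
        glucose = self.cgm_df.loc[self.cgm_df["PtID"] == ptid, "Glucose"].values
        static = self.static_df.loc[self.static_df["PtID"] == ptid].drop(columns="PtID").values[0]
        seq = torch.tensor(glucose, dtype=torch.float32).unsqueeze(0)  # (1, seq_len)
        nonseq = torch.tensor(static, dtype=torch.float32)
        return seq, nonseq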


This is the CGM dataset file, and these are the others:

The first one changes with time, the number of glucose readings isn't fixed, and PtID means patient ID.
I'm confused about how to write my Dataset.
I have two files: one has 604078 rows and the other only 200 rows.
Can you help me, please? @J_Johnson

For starters, DeviceTm has time gaps, so you will either need a positional encoder or you can fill the time gaps with padding. See one of my earlier posts in this thread for a link on how to set up a positional encoder.

Then you need to set __len__ in the Dataset. In this case, you can use np.unique() targeted at the PtID values, which will give you an array of all unique patient IDs. Take the length of that array for __len__, and store the unique values array because you can use it in the __getitem__ function.

The above assumes that each Patient ID has the same number of data points.

What is still not clear to me is whether each PtID has an equal number of glucose readings. Now let’s assume they do not. In that case, you can set an arbitrary length for the input sequence, say 40 (i.e. the last 2 hours) and then prepare your data samples such that each consecutive 2 hours and 30 minutes of data, where there is a glucose reading at the final time(i.e. at 2h and 30m), comprises a sample. Two hours for model input data and the final time for your target.

So the __len__ value would also need to be defined differently.

In the above method, you’d need to initially fill in all time gaps with a zero reading. This is how you can do that with Pandas, except you can put 0 instead of NaN.
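As a sketch of that gap filling for a single patient (the DeviceDtTm column name and the example timestamps are assumptions; adapt them to your CSV):

import pandas as pd

patient_df = pd.DataFrame({
    "DeviceDtTm": pd.to_datetime(["2023-01-01 08:00", "2023-01-01 08:05", "2023-01-01 08:20"]),
    "Glucose": [110.0, 118.0, 131.0],
})

patient_df = (
    patient_df.set_index("DeviceDtTm")
    .resample("5min")   # the 5-minute CGM grid
    .first()            # keep the reading that falls in each slot
    .fillna(0)          # gaps become zero readings instead of NaN
    .reset_index()
)
print(patient_df)  # the rows at 08:10 and 08:15 are filled with 0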

Second, you’d need to count the number of non-zero targets that have at least 45 data points prior. I.e.

# x is assumed to be the CGM table as a NumPy array of shape (num_columns, num_rows),
# i.e. one row per CSV column, with time gaps already filled with zero readings
total_samples = 0
PtID_idx = 2    # 2 if the patient ID is in column C, otherwise adjust
Glucose_idx = 4 # likewise, the row index holding the glucose values

for ptid in unique_patients:
    # filter all of the values that correspond with that patient id
    patient_glucose = x[:, x[PtID_idx, :] == ptid]
    # can fill patient_glucose gaps with zeros here, with the above linked method;
    # just get it back into NumPy before continuing
    if patient_glucose.shape[1] <= 45:
        continue  # not enough prior data for this patient
    patient_glucose = patient_glucose[:, 45:]  # remove the first 45 values for that patient
    # keep only non-zero glucose values; zeros are gaps and cannot be targets
    patient_glucose = patient_glucose[:, patient_glucose[Glucose_idx, :] != 0]
    total_samples += patient_glucose.shape[1]  # each remaining value is a valid target

You may need to add an additional condition for excluding gaps of 45 or larger (or possibly even 20 and larger).

From there, you can use similar conditioning arguments in __getitem__ to acquire your samples from both data tables.

The dataset is small enough that you could preprocess it all in advance and store it in memory.
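As a sketch of that preprocessing (patient_series is a hypothetical dict mapping each PtID to its gap-filled glucose array, and seq_len/horizon are placeholders; 30 minutes ahead is 6 readings at 5-minute intervals):

import numpy as np

# hypothetical gap-filled glucose arrays keyed by PtID; build these from your CSV
patient_series = {1: np.random.rand(300) * 200, 2: np.random.rand(250) * 200}
seq_len = 40   # past readings used as model input
horizon = 6    # prediction horizon in readings (30 minutes at 5-minute intervals)

samples = []
for ptid, glucose in patient_series.items():
    for end in range(seq_len + horizon, len(glucose) + 1):
        window = glucose[end - horizon - seq_len : end - horizon]
        target = glucose[end - 1]
        # skip samples whose target is a gap (zero) or whose input window contains gaps
        if target != 0 and np.count_nonzero(window) == len(window):
            samples.append((ptid, window.astype(np.float32), np.float32(target)))

print(len(samples))  # this count is the Dataset __len__; __getitem__ can index into samples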

Thanks for your answer. I will look into the steps you described and follow them, @J_Johnson.
May I have your Telegram ID, please?
Thanks and regards

Hello @somayyeh_hasanzadeh. I sent you a direct message. Please check.

Hello, sure
Thanks a lot :rose: