Data with different dimensions

chocolocked · September 18, 2018, 9:23pm

Hi,

I’m trying to define a Dataset class for our EHR data so that we can use the DataLoader, but the data for a single subject comes as a list of lists of lists, see below.

Basically the entire thing is a medical history for a single patient.
Each second-level list of lists, e.g. [ [0], [ 7, 364, 8, 30, … 11, 596]], is a single record in the patient’s history, where [0] is a visit time indicator and [ 7, 364, 8, 30, … 11, 596] holds the medical codes for that visit.

So the dimensions are inconsistent in two ways. The number of codes per visit varies: this patient has 16, 16, and 18 codes for the 1st, 2nd, and 3rd visit. The number of records per patient varies as well: this person has 3 records, but another might have 10, 20, or 34.

I’m just at a loss about how to process and prepare this data in a format that the DataLoader can handle for models later.

Any hints or suggestions would be appreciated!

ptrblck · September 18, 2018, 11:09pm

How would you like to get or process the data further?
Using a custom collate_fn for your DataLoader, you can just return the medical history as you’ve saved it:

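import random

import torch
from torch.utils.data import Dataset, DataLoader

# Create data: each record is [[visit_time], [medical codes of varying length]]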
data = [[[random.randint(0, 100)], torch.randint(0, 500, (random.randint(3, 10),)).tolist()]
        for _ in range(random.randint(20, 30))]


class MyDataset(Dataset):
    def __init__(self, data):
        self.data = data
        
    def __getitem__(self, index):
        data = self.data[index]
        return data
    
    def __len__(self):
        return len(self.data)


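def my_collate(batch):
    # Keep the samples as a plain list instead of stacking them into a tensor,
    # which would fail because the records have different lengths
    return list(batch)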

dataset = MyDataset(data)
loader = DataLoader(
    dataset,
    batch_size=10,
    shuffle=False,
    collate_fn=my_collate
)

x = next(iter(loader))
print(x)

I’m not sure the code is that useful for you, as now you are basically getting the data as a nested list.
Do you want to create one tensor for each patient and feed it to the model?
Or would an approach from NLP be more appropriate, where we would pad the shorter medical records?
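As a rough sketch of the first option, one patient’s history could be turned into a visit-time tensor plus a zero-padded code matrix. The function name `patient_to_tensors`, the toy history, and the choice of 0 as the padding value are assumptions for illustration; if 0 is a real medical code in your data, you would need a different padding index.

```python
import torch

PAD = 0  # assumption: 0 is not used as a real medical code


def patient_to_tensors(history):
    """history: list of [[time], [code, code, ...]] visits for one patient."""
    times = torch.tensor([visit[0][0] for visit in history], dtype=torch.long)
    lengths = torch.tensor([len(visit[1]) for visit in history], dtype=torch.long)
    # One row per visit, as wide as the longest visit, pre-filled with PAD
    codes = torch.full((len(history), int(lengths.max())), PAD, dtype=torch.long)
    for i, visit in enumerate(history):
        codes[i, :len(visit[1])] = torch.tensor(visit[1], dtype=torch.long)
    return times, codes, lengths


# Toy history with made-up codes: three visits of different lengths
history = [[[0], [7, 364, 8]], [[1], [30, 11]], [[2], [596, 5, 9, 12]]]
times, codes, lengths = patient_to_tensors(history)
print(codes.shape)
print(lengths)
```

Returning `lengths` alongside the padded matrix lets the model ignore the padded positions later. This still leaves the number of visits varying between patients, so batching several patients would need either a list-returning collate_fn as above or a second round of padding.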

chocolocked · September 19, 2018, 3:56pm

Thanks so much! It helps a lot, as currently our model just takes the nested list and goes from there.

I’m just curious: if I wanted to look at the NLP approach and pad in 2D (both the code length of a single visit and the number of visits per patient), would you mind pointing me to relevant materials (links)?
Thank you <3

ptrblck · September 19, 2018, 4:35pm

I think pad_sequence could be a good starting point, but I’m not that familiar with NLP.
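A minimal sketch of how the 2D padding might look with `torch.nn.utils.rnn.pad_sequence` inside a collate_fn, assuming each dataset item is one patient’s full history. The name `pad_collate`, the toy `patients` list, and the padding values (0 for codes, -1 for visit times) are assumptions for illustration, not something taken from your data.

```python
import torch
from torch.nn.utils.rnn import pad_sequence
from torch.utils.data import DataLoader

PAD_CODE = 0   # assumption: 0 is not a real medical code
PAD_TIME = -1  # assumption: real visit times are never negative


def pad_collate(batch):
    """batch: list of patients, each a list of [[time], [codes...]] visits."""
    counts = [len(patient) for patient in batch]
    # Pad the code dimension: flatten all visits in the batch and pad them together
    visits = [torch.tensor(v[1], dtype=torch.long) for patient in batch for v in patient]
    flat = pad_sequence(visits, batch_first=True, padding_value=PAD_CODE)
    # Regroup the visits by patient, then pad the visit dimension
    per_patient = list(torch.split(flat, counts))
    codes = pad_sequence(per_patient, batch_first=True, padding_value=PAD_CODE)
    times = pad_sequence(
        [torch.tensor([v[0][0] for v in patient], dtype=torch.long) for patient in batch],
        batch_first=True, padding_value=PAD_TIME)
    return codes, times, torch.tensor(counts)


# Toy data with made-up codes: two patients with 2 and 3 visits
patients = [
    [[[0], [7, 364, 8]], [[1], [30, 11]]],
    [[[0], [596]], [[2], [5, 9, 12, 3]], [[5], [4, 4]]],
]
loader = DataLoader(patients, batch_size=2, collate_fn=pad_collate)
codes, times, counts = next(iter(loader))
print(codes.shape)  # (batch, max_visits, max_codes)
```

Padding is done per batch here, so each batch is only as large as its longest patient and longest visit. The returned `counts` (and, if needed, per-visit code lengths) can be used to mask out the padded entries in the model.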

chocolocked · September 19, 2018, 5:36pm

Thank you! I’ll start from there!