Hello everyone, I’m currently working on a glucose prediction problem for type 1 diabetes patients. However, I’m encountering quite a few issues in the data processing step, so I’d like to ask for your opinions.
The dataset includes 17 people with type 1 diabetes, relatively balanced in gender (10 females, 7 males), with ages ranging from 23 to 70 years and BMI from 20.3 to 36.5 kg/m². Data was collected continuously for about 12 weeks (roughly 90 days). Patients are identified by IDs of the form UoM23xx and UoM24xx. Each patient has several data types, stored in separate CSV files:
- Glucose: continuous glucose monitoring (CGM), approximately every 5 minutes
- Insulin: basal (continuous) and bolus (injected with meals)
- Nutrition: meal information (carbs, calories…)
- Activity: physical activity, recorded approximately every 15 minutes
- Sleep: sleep status and sleep duration
Currently, my data is fragmented into many files (one file per patient and per data type), so I’m wondering whether I should merge everything into one file (or one large table) or keep the current structure and process each patient separately. The data is also not synchronized in time, either across data types within one patient or across patients. Although all files use the same timestamp format, the actual measurement times differ (for example, one patient has a reading at 7:30 and another at 7:31, or the recordings start at completely different times). I would appreciate your advice on how to handle this dataset.
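To make the "one large table" option concrete, here is a minimal sketch of what I have in mind, using pandas. The column names (`timestamp`, `glucose`), the patient labels, and the values are made up for illustration; in practice each frame would come from `pd.read_csv` on one patient's glucose file.

```python
import pandas as pd

# Hypothetical stand-ins for two per-patient glucose files.
# Column names, patient labels, and values are assumptions, not the real data.
glucose_by_patient = {
    "patient_A": pd.DataFrame({
        "timestamp": pd.to_datetime(["2017-10-23 19:30", "2017-10-23 19:35"]),
        "glucose": [6.1, 6.4],
    }),
    "patient_B": pd.DataFrame({
        "timestamp": pd.to_datetime(["2017-10-23 19:31", "2017-10-23 19:36"]),
        "glucose": [7.8, 7.5],
    }),
}

def stack_patients(frames):
    """Stack per-patient frames into one long table keyed by patient_id."""
    tagged = [df.assign(patient_id=pid) for pid, df in frames.items()]
    stacked = pd.concat(tagged, ignore_index=True)
    return stacked.sort_values(["patient_id", "timestamp"], ignore_index=True)

glucose_long = stack_patients(glucose_by_patient)

# Per-patient processing is still possible on the long table via groupby.
readings_per_patient = glucose_long.groupby("patient_id")["glucose"].count()
```

With this layout there would be one long table per data type, and the `patient_id` column keeps patients separable, so merging the files does not rule out per-patient processing later.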
I can summarize some of the issues with my dataset as follows:
- Differences in time-step (multi-frequency), which makes it very difficult to combine the streams into a common time series:
  - Glucose: every 5 minutes
  - Activity: every 15 minutes
  - Nutrition, bolus insulin: event-based (irregular)
  - Sleep: interval-based
- Timestamps do not match, so there is no common timestamp to join on directly. Example: glucose at 7:30, activity at 7:45.
- Different collection times among patients
  - Some patients have approximately 3 months of data.
  - Others have approximately 2 months of data.
  - As a result, the amount of data per patient is inconsistent and uneven.
- Different start times:
  - Person A: started 23/10/2017 19:30
  - Person B: started 23/10/2017 19:31
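One approach I am considering for the issues above is to align each patient's streams on a shared 5-minute grid and add a relative clock, so that different start times no longer matter. This is only a rough sketch under my own assumptions: the function name `align_patient`, the column names (`timestamp`, `glucose`, `steps`, `units`), and the sample values are hypothetical, and sleep intervals are left out.

```python
import pandas as pd

def align_patient(glucose, activity, bolus, freq="5min"):
    """Put one patient's streams on a shared regular grid (column names assumed)."""
    # Regular signal: average the CGM readings that fall into each grid bin.
    grid = glucose.set_index("timestamp")["glucose"].resample(freq).mean().to_frame()

    # Event-based signal: total bolus units per bin, 0 where no bolus was given.
    dose = bolus.set_index("timestamp")["units"].resample(freq).sum()
    grid["bolus_units"] = dose.reindex(grid.index, fill_value=0.0)

    # Slower signal: attach the most recent activity reading, at most 15 min old.
    grid = pd.merge_asof(
        grid.reset_index(),
        activity.sort_values("timestamp"),
        on="timestamp",
        direction="backward",
        tolerance=pd.Timedelta("15min"),
    )

    # Relative clock, so patients with different start times become comparable.
    elapsed = grid["timestamp"] - grid["timestamp"].iloc[0]
    grid["minutes_since_start"] = elapsed.dt.total_seconds() / 60
    return grid

# Made-up sample data for one patient who starts at 19:31.
glucose = pd.DataFrame({
    "timestamp": pd.to_datetime(["2017-10-23 19:31", "2017-10-23 19:36",
                                 "2017-10-23 19:41", "2017-10-23 19:46"]),
    "glucose": [7.8, 7.5, 7.1, 6.9],
})
activity = pd.DataFrame({
    "timestamp": pd.to_datetime(["2017-10-23 19:30", "2017-10-23 19:45"]),
    "steps": [120, 30],
})
bolus = pd.DataFrame({
    "timestamp": pd.to_datetime(["2017-10-23 19:37"]),
    "units": [4.0],
})

aligned = align_patient(glucose, activity, bolus)
```

Note that the activity column here is a "latest reading" feature: the same 15-minute value is repeated on up to three grid rows, so it should not be summed over time. I am not sure whether this kind of alignment is the right choice for glucose prediction, which is part of what I am asking.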