Hi all, first post here. I’m going to apologise in advance for the broadness of this post; if there are better places to ask, please let me know!
I am a Machine Learning for Health masters student. The ICU unit I work with has been archiving its high-frequency (500 Hz) bedside monitor data to a self-built database. My original goal was to predict a physiological event a few minutes in advance from ~3 of these waveforms. Unfortunately, I ran into significant performance problems. The amount of data is huge (~900 GB, billions of data points, months of wall time), and preprocessing couldn’t be done on the fly in the Dataset/DataLoader fast enough to keep the GPUs loaded (we only have 6 CPU cores). The alternative was to let the preprocessing run for days, only to find I wanted to change one thing and have to run it all again. Epochs were taking hours to days. My initial strategy was to build a DIY Python pipeline (NumPy, etc.) and make it as parallel as possible, caching at each step along the way. I should note that our DIY database is fairly slow and can currently only return a couple of hours of data as JSON via an API call, so no streaming, etc. I have since pivoted to tackling the larger issue of deploying these models into “production”.
My project is the largest (in terms of data, etc.) that our group has done so far. Previously, data was downsampled to 0.2 Hz and analyses targeted more specific situations, so far less data was relevant (<30 GB). That meant it could be kept in CSVs, naively processed, and run through PyTorch. However, in order to see how these same models behave in real time (e.g. displayed at the bedside), they need to consume streaming data from RabbitMQ, which also feeds the waveform database.
Now, instead of trying to parallelize the preprocessing, I have been trying to use PyTorch’s IterableDataset after reading this article: https://medium.com/speechmatics/how-to-build-a-streaming-dataloader-with-pytorch-a66dd891d9dd I’ve been building a framework that approaches the whole problem in a streaming fashion instead. Without getting too deep into the details, the framework requires researchers to provide a minimal amount of config (RabbitMQ login, relevant data in our database, …) along with a variety of Python callables/functions for preprocessing, assigning a label, etc., which all operate on a standardised Interval class. One can therefore toggle between running the offline pipeline to build up a gigantic processed cache, which is then artificially streamed to the PyTorch model, and actually consuming the real streaming data, with just the flip of a config variable.
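For concreteness, here is a rough sketch of the toggle idea. All the names here (StreamingWindows, cached_source) are hypothetical, not from any library; in the real framework the class would subclass torch.utils.data.IterableDataset, but it's plain Python below to stay self-contained:

```python
from typing import Callable, Iterator, List, Tuple

# Hypothetical sketch. In practice this would subclass
# torch.utils.data.IterableDataset and yield tensors; the key point is
# that the data source is just an iterator factory, so swapping the
# cached offline reader for a live RabbitMQ consumer is one config change.
class StreamingWindows:
    """Yields (features, label) pairs from any iterator of intervals."""

    def __init__(self,
                 source: Callable[[], Iterator[List[float]]],
                 preprocess: Callable[[List[float]], List[float]],
                 label_fn: Callable[[List[float]], int]):
        self.source = source          # cache reader OR live-stream consumer
        self.preprocess = preprocess  # researcher-supplied callable
        self.label_fn = label_fn      # researcher-supplied callable

    def __iter__(self):
        for interval in self.source():
            x = self.preprocess(interval)
            yield x, self.label_fn(x)

def cached_source():
    # Stand-in for reading preprocessed intervals from an on-disk cache;
    # a live source would instead consume messages from the queue.
    yield [1.0, 2.0, 3.0]
    yield [4.0, 5.0, 6.0]

ds = StreamingWindows(cached_source,
                      preprocess=lambda xs: [v * 2 for v in xs],
                      label_fn=lambda xs: int(max(xs) > 10))
samples = list(ds)
# samples == [([2.0, 4.0, 6.0], 0), ([8.0, 10.0, 12.0], 1)]
```

Because both modes go through the same `__iter__`, the model code never needs to know whether it is replaying a cache or reading the real stream.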
Then, of course, there is the whole open-source side of things: Hadoop, Spark, … tools I am not as familiar with. I know these work, at massive scale, at tech companies worldwide. If there’s something out there that would let me focus back on the healthcare ML questions instead of this data pipelining, I’d be extremely grateful to know about it.
So, I think my questions are as follows:
- How does one work with data that isn’t atomic?
- With images, you have a fixed number of photos, so indexing is easy, and augmentation is well understood. Here, we might want to alter window size, overlap, frequency, etc., and if we cut everything up into the exact samples that go into our model, changing any of those means rerunning everything.
- What are some standard data science design patterns for these types of problems?
- I’m certainly far from the first person who wants to strap a neural net onto some waveform data and do outlier detection/event prediction. Places like Google obviously take in far more data per second from, say, their clusters and want to predict failures; that is highly similar to predicting a heart attack from ECG and the like.
- Does cooking up a custom framework for my lab as I described above make sense?
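On the non-atomic-data question specifically, one pattern that might help is to keep only the raw (or lightly preprocessed) signal on disk and derive windows lazily as views, so window size and overlap become cheap runtime parameters rather than something baked into a cache. A minimal NumPy sketch (`window_view` is a hypothetical helper name, not a library function):

```python
import numpy as np

def window_view(signal: np.ndarray, window: int, stride: int) -> np.ndarray:
    """Return a (n_windows, window) view of `signal` without copying.

    Built on numpy's sliding_window_view, so changing `window` or
    `stride` costs nothing: no data is duplicated or re-cached.
    """
    return np.lib.stride_tricks.sliding_window_view(signal, window)[::stride]

sig = np.arange(10, dtype=np.float32)   # stand-in for a waveform segment
w = window_view(sig, window=4, stride=2)
# w.shape == (4, 4); w[1] is [2., 3., 4., 5.] (overlapping by 2 samples)
```

The same idea extends to memory-mapped files (`np.load(..., mmap_mode="r")`), so even a 900 GB archive can be windowed on demand without materialising every sample up front.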
If you made it through or even just did a quick skim, thank you very much! This has been a gigantic blocker on my research work for months now so any help is appreciated!