Unifying Preprocessing Pipelines for Machine Learning on Time Series Data (High Throughput Batch & Low Latency Streaming)

The Question/TL;DR

What frameworks, design patterns, systems, etc. exist for unifying the preprocessing of time series data in Python such that high throughput is achieved on the retrospective batch data for training machine learning models while also allowing for easy model deployment with low latency on real time streaming data for inference?

System Details

The data is raw, high frequency (50-500Hz) ICU bedside monitor signals which are recorded into a bespoke time-series database containing over 3 trillion data points which come from a RabbitMQ stream producing on the order of tens of thousands of new points per second. The database is accessible via a rather slow REST API backed in PHP or SDKs based in various languages. Researchers are really only familiar with Python based tooling (numpy, scipy, pandas, biosppy, PyTorch, Tensorflow, various GitHub repos of domain specific code from research papers, etc) while also not being interested or skilled in engineering custom deployment code for each of their models. We are needing high throughput while preparing large subsets (up to a couple terabytes raw) of the database for efficiently and easily developing models from retrospective data, while also having a very low latency (<500ms) on real time streaming data.


  1. Traditional ML Research Approach
  • Relevant retrospective data is pulled to disk, processed sequentially (usually inefficiently), and eventually a decent performing model is made.
  • This works, but is completely incompatible with the real time systems meaning deployment requires significant custom engineering. Or worse, this processing is simply inefficient and takes days to process the dataset and/or extra engineering work to parallelize. Therefore, this is not a great approach due to how the research phase is slowed down and the excessive technical work requirements for optimizing the processing as well as deploying to real time data.
  1. Custom Python Framework
  • I had started on developing a Python based framework which tried to minimize the amount of systems knowledge and code required from researchers while still allowing them freedom to perform their work however they saw fit via familiar tools (ie, Python). This ended up being quite the undertaking and has been very difficult. Worries of bugs, dealing with edge cases, performance, maintenance, adaptability, … have me looking for a better approach. Many of the problems I discovered along the way seem to be addressed by the streaming frameworks.
  1. Streaming Frameworks
  • It appears that streaming frameworks are the current leading way of feeding real time machine learning models. In particular, Spark Streaming, but also Storm, Flink, Samza, Trill, …
  • Apache Beam is currently the most interesting though I am uncertain this is the best option
    • Unified: batch and streaming can be processed in the same way
      • I recently learned Flink offers similar functionality
    • Extensible: allows for new SDKs, IO connectors, etc
      • This is important for interfacing with our bespoke database.
      • Issue: The Python SDK of Apache Beam does not have existing IO for RabbitMQ but Java SDK does
      • Allows for the execution of any Python code on the data, not just simplistic min/max/windowing and such of other streaming engines
    • Language: Allows for everything to be written in Python
      • Only other all Python offering I have found is Streamz

Thank you for your time. If there are better venues or ways to ask this question, please let me know! Especially if I have misused terminology, please let me know as that will allow me to research these issues better on my own.

1 Like

i’m curious what you ended up with.
Currently looking at Flink as a means of upgrading our proprietary framework built on python DAG graph processing. This was developed because Python is the data science standard in our company and re-engineering of the features built in research stages would be very inefficient as well as prone to errors due to mismatches in processing algorithms in the different languages.

Using the same processor for batch and streaming will go a long way, although the python implementation of Flink (v1.12) does require a bit of research to use because the documentation and examples on the internet can be a bit sparse (Also some bugs…).

We’re moving towards Azure because of the IOT edge capability maturity, but running Flink there would require a lot of setup and management of quite low level infra stuff.

Anyway, i’d be interested in hearing what you decided.