@harpone Yes, various blob storages can work; it's a common setup for Tensorflow training w/ Google ml-engine. Datasets can be stored in GCS or BigTable, and there are C++ ops wrapped in a Python API that tie those stores together with the TFRecord format and the tf.data API. Edit: And thanks for sharing the azure blob dataset, curious how large each file is for the scenario you're targeting?
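For reference, a minimal sketch of that wiring (the bucket name, file pattern, and feature spec are all hypothetical, and this assumes the newer tf.io-style API):

```python
import tensorflow as tf

# TFRecordDataset reads gs:// paths directly via TF's C++ GCS filesystem ops.
files = tf.data.Dataset.list_files("gs://my-bucket/train-*.tfrecord")
dataset = tf.data.TFRecordDataset(files, num_parallel_reads=8)

def parse(example_proto):
    # Hypothetical feature spec; replace with your actual schema.
    features = {
        "image": tf.io.FixedLenFeature([], tf.string),
        "label": tf.io.FixedLenFeature([], tf.int64),
    }
    return tf.io.parse_single_example(example_proto, features)

dataset = dataset.map(parse, num_parallel_calls=4).batch(32).prefetch(1)
```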
Cloud blob storage has its own rough equivalent of an IOPS limit for this task: requests/sec. Even though the throughput of cloud-to-cloud network transfers is really high, requesting data over a network has much higher latency (bounded below by a multiple of the RTT) than local storage. You have to design the system carefully to mitigate that latency, either by increasing the size of, and reducing the number of, sequential requests, or by dispatching many async parallel requests. The easiest solution for large blocks is a record format. For many parallel requests, I'd probably use an efficient/scalable distributed database that can store your data natively (i.e. binary as binary, text/JSON as text/JSON).
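To make the latency point concrete, here's a rough sketch of the async fan-out approach using aiohttp (the blob URLs are hypothetical): total wall time approaches one RTT plus transfer time, rather than N sequential round trips.

```python
import asyncio
import aiohttp

# Hypothetical blob URLs; in practice these would be signed GCS/Azure/S3 URLs.
URLS = [f"https://example.blob.core.windows.net/shard-{i:04d}" for i in range(256)]

async def fetch(session, url):
    async with session.get(url) as resp:
        resp.raise_for_status()
        return await resp.read()

async def fetch_all(urls, concurrency=64):
    # Bound in-flight requests so we stay under the store's requests/sec limit.
    sem = asyncio.Semaphore(concurrency)

    async def bounded(session, url):
        async with sem:
            return await fetch(session, url)

    async with aiohttp.ClientSession() as session:
        return await asyncio.gather(*(bounded(session, u) for u in urls))

blobs = asyncio.get_event_loop().run_until_complete(fetch_all(URLS))
```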
I wouldn't want to write the requesting and parsing code for any of the above in Python though. You don't actually need many CPUs if you aren't stuck in Python; you want those CPUs doing more useful things. With a small pool of worker threads and a decent async (or at least non-blocking) IO subsystem, you can move and parse a lot of data efficiently in a systems language. This is hard to do effectively in Python.
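For comparison, the closest you get in pure Python is something like the sketch below (URLs hypothetical). The blocking reads release the GIL, so the downloads do overlap, but the parsing in each worker runs under the GIL and won't parallelize, which is exactly why the heavy lifting tends to end up in C++ or another systems language.

```python
import json
from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlopen

URLS = [f"https://example.com/records/{i}.json" for i in range(256)]  # hypothetical

def fetch_and_parse(url):
    with urlopen(url) as resp:  # blocking IO releases the GIL, so reads overlap
        raw = resp.read()
    return json.loads(raw)      # parsing holds the GIL: CPU work serializes here

with ThreadPoolExecutor(max_workers=16) as pool:
    records = list(pool.map(fetch_and_parse, URLS))
```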
This 'Python for the interface and C/C++ for the dirty work' paradigm is a little saddening when you realize you need to do some of the dirty work yourself. I'm quite looking forward to seeing how the Swift for TensorFlow experiment works out. If Swift gains momentum as both a viable server language and an ML language, it could make the systems work of supporting ML much less of a chore. SwiftTorch FTW?