Data reading framework for PyTorch (Hive, MySQL, S3, etc.)

At Facebook we are building a data reading framework for PyTorch that can efficiently read from data stores like Hive, MySQL, our internal blob store, and other tabular data sources. The framework lets you specify complex input pipelines that read from different sources. For example, if you have a table that stores handles for images, you can write SQL-like code to read from the table, apply filters to select certain handles, and then retrieve those handles from another data source, all in a few lines of code.

In addition, the framework supports running user-defined transforms, which can be either pure Python (e.g. torchvision.transforms) or TorchScript code. The framework can also be used with the torch.distributed package to distribute the data across multiple nodes for training. The input pipeline the user specifies can be defined once, serialized as a plan, and run on multiple remote machines if required.
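
As an illustration, a transform like the transform_image used in the sample below could be either a plain Python callable or a TorchScript-compiled function. Here is a minimal sketch of the TorchScript case (the function name and the normalization logic are placeholders, not part of the framework's API):

import torch

# Hypothetical TorchScript transform: any tensor-only function can be
# compiled with torch.jit.script and passed wherever a transform is expected.
@torch.jit.script
def transform_image(img: torch.Tensor) -> torch.Tensor:
    # Scale pixel values from [0, 255] to [0, 1].
    return img.float() / 255.0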

The framework builds upon the open-source DataLoader and Dataset framework. In particular, it uses IterableDataset to provide a stream-based interface for data retrieved from input pipelines.
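
To sketch what that stream interface could look like (this is an assumption about the shape of the API, not the framework's actual code), a dataset like the MyDataset used in the sample below might simply wrap the pipeline object in an IterableDataset:

from torch.utils.data import IterableDataset

class MyDataset(IterableDataset):
    # Assumed interface: `df` is the pipeline object and yields rows when
    # iterated; `transforms` is an optional callable applied to each row.
    def __init__(self, df, transforms=None):
        self.df = df
        self.transforms = transforms

    def __iter__(self):
        for row in self.df:
            yield self.transforms(row) if self.transforms is not None else row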

Sample code to illustrate what reading and pre-processing images would look like:

# The Hive table has columns handle and partition_no; partition_no is the
# table's partition column. `data` here refers to the framework's reader module.
df = data.data_warehouse("mynamespace", "mytable")

# Filter to partition the data across multiple DataLoader workers.
# worker_info would be obtained via torch.utils.data.get_worker_info()
# when this runs inside a DataLoader worker.
partition_filter = "hash(partition_no) % {0} = {1}".format(
    worker_info.num_workers, worker_info.id
)
df = df.filter(partition_filter)

# Fetch the image bytes for each handle from the blob store.
df = df.map(["fetch_handle(handle) as img"])

# Rebatch the rows into batches of 16.
df = df.rebatch(batch_size=16)

# transform_image is a user-supplied function that runs the image transforms.
ds = MyDataset(df=df, transforms=transform_image)
# The pipeline already produces batches, so disable the DataLoader's
# automatic batching.
dl = torch.utils.data.DataLoader(ds, batch_size=None)

for batch in dl:
    pass
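
For multi-node training with torch.distributed, one could imagine applying the same partitioning pattern at the process level. This is only a sketch and assumes the process group has already been initialized:

import torch.distributed as dist

# Hypothetical: key the partition filter off the distributed rank instead of
# the DataLoader worker id, so each node reads a disjoint slice of the table.
rank_filter = "hash(partition_no) % {0} = {1}".format(
    dist.get_world_size(), dist.get_rank()
)
df = df.filter(rank_filter)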

We are evaluating whether it makes sense to open source this framework. For OSS users, it might be useful for training jobs that store large amounts of data (e.g. images) in Hive or S3. We would love to hear from the community whether this would be useful, and about use cases that might benefit from a framework like this.
