Read dataset from TFRecord format

Hi,

I need to read data from TensorFlow protocol buffer format “TFRecord” (aka Example+Features, see https://www.tensorflow.org/api_docs/python/tf/python_io/TFRecordWriter).

Is there a solution for that out there?

Thanks!

Could you read the data once and save it in another format?
I just had a look at the code for reading TFRecord files, and it looks like you need a session object and some transformations.
The easiest way would probably be to read the data from another source.
If that’s not possible, you could try to transform the data somehow from tf.float32 to a Tensor.

I’m not sure if you can fit this code into a Dataset and DataLoader.

No, a session is not needed. The TFRecord format is a straightforward protocol buffer container. It can be read into arrays of integers or floats quite easily (and efficiently). The reason I am asking is to avoid coding this myself - maybe someone has already solved it?

One technical difficulty may be that TFRecords are streaming (one does not know the number of data points upfront), while the Dataset interface requires len() and random-access indexing…

Thanks

Oh, ok, thanks for the information!
Could you post a link to a simple code sample loading from TFRecords without a session?

It might get a bit ugly, but if you really need to read from TFRecords, we could try out some approaches.
Would it be possible for you to host a small sample TFRecords file somewhere?

Answering my own question:

One needs TensorFlow installed to read TFRecords (yuck! I had hoped to avoid this). The reason is that each record in the file carries record guards and a checksum in addition to the protocol buffer payload. See there.

To demonstrate how to read/write TFRecords I put a tiny project here - check it out.

It should be easy now to import TFRecord data into PyTorch by just wrapping the arrays into torch.Tensors.
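
For illustration, the reading loop boils down to roughly this (a sketch using the plain TF 1.x API rather than the project's wrapper; the file name and the feature key 'values' are made up and depend on how your files were written):

import tensorflow as tf
import torch

# Sketch only: iterate over raw records and wrap a float feature into a torch.Tensor.
# 'data.tfrecords' and the feature key 'values' are placeholders.
for record in tf.python_io.tf_record_iterator('data.tfrecords'):
    example = tf.train.Example()
    example.ParseFromString(record)
    floats = example.features.feature['values'].float_list.value
    tensor = torch.FloatTensor(list(floats))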

@ptrblck I do not want to transform from TFRecord. I am planning to use the data as-is. The reason is that TFRecord I/O supports cloud storage out of the box. For example, these are valid filenames: gs://mybucket/training/data/blah.tfrecords or s3://mybucket/training/data/foo.tfrecords.

This is very convenient, as my training process uses standard disposable cloud workers that should not store anything of value on their local drives! So the plan is:

  • install both tf and torch
  • read data from TFRecord into torch.Tensor
  • hack torch.utils.data.DataLoader to be able to read streaming data (no len!!!), and throw all the existing Torch Dataset machinery under the bus - it is based on a random-access model, alas
  • prove that this is reasonably fast and is not a bottleneck for training
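
To illustrate the cloud-path point above: TF's file utilities accept those URLs directly (assuming the GCS/S3 filesystem support is compiled into your TensorFlow build), so listing and iterating records in a bucket looks roughly like this sketch (bucket and path names are made up):

import tensorflow as tf

# Sketch: tf.gfile.Glob and tf_record_iterator go through TF's filesystem layer,
# so gs:// (and s3://, if built in) paths work the same way as local ones.
files = tf.gfile.Glob('gs://mybucket/training/data/*.tfrecords')
for filename in files:
    for record in tf.python_io.tf_record_iterator(filename):
        pass  # parse the protocol buffer payload as usual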

Thanks for reading!

Hi @pgmmpk. Thanks for sharing this solution. I just have a couple of questions about your code:

  1. In this code, what package do you use for “from tfrecord import Writer”? I couldn’t find this package anywhere, nor is it part of the official TensorFlow package.

  2. Could you also elaborate, or provide some examples, on how to interface tfrecord with the PyTorch DataLoader? Thanks!

Hi @timchen

The module content is in the __init__.py: https://github.com/pgmmpk/tfrecord/blob/master/tfrecord/__init__.py

I do not have a DataLoader solution yet. That is going to be hard for two reasons:

  1. The PyTorch DataLoader uses multiple workers
  2. The PyTorch code is not directly usable, because a TFRecord dataset does not have __len__ (its size is indefinite)

But for a simple "read and convert to torch.Tensor" loop, the answer is easy - the unit test shows how to get arrays from TFRecord files. What is left is to wrap them into the appropriate torch.Tensor (FloatTensor or IntTensor, depending on what is in the files).

To summarize: it's easy to read it into torch.Tensor, but it is hard (for me) to do it very efficiently because of the items above.

When I have some results on TFRecordDatasetLoader (and benchmarks), I will share on GitHub.

Nice work, and I’m also trying to find an efficient way to read a large amount of data during training. I think the tfrecord format is a good way to save data in chunks, and it avoids reading lots of small files, which is especially slow on HDFS.

In fact, I want to find a so-called TFRecordDatasetLoader which can not only read tfrecord chunks but also has an internal buffer, like tf.data.Dataset, to provide an efficient data-reading pipeline.

Hi Mike @pgmmpk ,

Thanks for posting your solution!
Did you find a way to integrate the tfrecord iterator with the PyTorch DataLoader?

Also, @ptrblck - is there a tfrecord-like solution in PyTorch? Essentially, I would like to handle big datasets the same way a tfrecord-based dataset works - read big files and shuffle the serialised samples (like the tf Example protobuf) rather than reading batch_size images in every step, which results in I/O overhead.
Another issue is that the file system in which we store datasets has trouble handling so many files, so this is another advantage of tfrecord-based datasets.

Thank you!!
Adva

I’m not really familiar with tfrecord, but from your description it sounds like the whole dataset is loaded into memory first and then just the current sample is drawn from it. Is that right?
If so, you could load the data once and save it using torch.save. Afterwards, you could load it in your Dataset's __init__ and get a sample in __getitem__.
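Something like this rough sketch (the file path and attribute names are just placeholders):

import torch
from torch.utils.data import Dataset

class PreloadedDataset(Dataset):
    def __init__(self, path):
        # the file at `path` would be created once beforehand,
        # e.g. via torch.save((data, targets), path)
        self.data, self.targets = torch.load(path)

    def __getitem__(self, index):
        return self.data[index], self.targets[index]

    def __len__(self):
        return len(self.data)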

Please correct me, if I misunderstood the tfrecord approach.

Thanks for the fast reply!

The tfrecord mechanism is designed to handle large datasets. It is based on the protobuf serialisation protocol, which is used to create training “examples”. For instance, an “example” can be composed of a training image and an integer label. All the examples are serialised and written to a tfrecord file (or files). This way, you can compress a dataset like ImageNet into only ~1000 record files.
The benefits are: (1) file-system performance, since it doesn’t have to handle millions of files, (2) training performance, since instead of reading a lot of small files the training process reads a few big files, and (3) dataset mobility - transferring a few big files instead of many small files.
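
For example, writing one image/label pair looks roughly like this (a sketch; the file name is made up, and image_bytes/label are assumed to hold the encoded image bytes and an integer):

import tensorflow as tf

# Sketch: serialise one (image, label) pair into a TFRecord file.
# image_bytes: raw encoded image bytes, label: a Python int (both assumed to exist).
with tf.python_io.TFRecordWriter('train-00000.tfrecord') as writer:
    example = tf.train.Example(features=tf.train.Features(feature={
        'image': tf.train.Feature(bytes_list=tf.train.BytesList(value=[image_bytes])),
        'label': tf.train.Feature(int64_list=tf.train.Int64List(value=[label])),
    }))
    writer.write(example.SerializeToString())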

These datasets usually don’t fit in CPU memory, so I can’t load the complete dataset before training.

So tfrecord splits your dataset into several chunks and stores these files in a binary format.
E.g. for 10,000 images you could have 10 tfrecord files, each containing 1000 images?

I’m not sure about benefits 1 and 2.
The I/O might be limited if you load a lot of files, but on the other hand you would have to load a huge file before even the first iteration can start. The same goes for point 2: while the GPU is busy, your multiple workers can load a new batch of images instead of a large file.

Anyway, you could probably emulate such a behavior by loading some images, storing them in a tensor and saving it to your file system. Then you would have to create some logic to load a new chunk based on your current index, and finally get your sample.
Let me know if I misunderstood your use case or if you need some help figuring out the chunk loading.
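
A rough sketch of what I mean by the chunk loading (all names and the chunk layout are made up):

import torch
from torch.utils.data import Dataset

class ChunkedDataset(Dataset):
    def __init__(self, chunk_paths, chunk_size):
        # each chunk file would be created once via torch.save((data, targets), path)
        self.chunk_paths = chunk_paths
        self.chunk_size = chunk_size
        self.current_chunk = None
        self.current_chunk_idx = -1

    def __len__(self):
        return len(self.chunk_paths) * self.chunk_size

    def __getitem__(self, index):
        chunk_idx, offset = divmod(index, self.chunk_size)
        if chunk_idx != self.current_chunk_idx:
            # load a new chunk only when the index crosses a chunk boundary
            self.current_chunk = torch.load(self.chunk_paths[chunk_idx])
            self.current_chunk_idx = chunk_idx
        data, targets = self.current_chunk
        return data[offset], targets[offset]

With a sequential sampler each chunk would be loaded only once per epoch; with random shuffling you would reload chunks a lot, so some buffering would be needed.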

This is really intriguing.

Did you get the chance to benchmark your implementation, @pgmmpk?

Specifically, to read this huge dataset: https://research.google.com/youtube8m/download.html
(1.53 TB of tfrecords)
What would be the optimal approach?

  • running a tf sess dataloader
  • using the above implementation
  • saving as framework-agnostic npy etc. and then re-reading into a PyTorch dataloader?

@ptrblck

I did not benchmark this. It is just reading protocol buffers, which should be pretty fast. Running a tf sess dataloader should be similar in speed, but then you incur the transformation from TF tensors to numpy and then to PyTorch tensors.

Converting to np offline and then using the PyTorch dataloader could be the best option, as the PyTorch dataloader uses multiple workers.

PyTorch nightly has an IterableDataset abstraction in the works that should make it very easy to plug streaming readers into the DataLoader, see https://pytorch.org/docs/master/data.html#dataset-types

Hi @pgmmpk
Were you able to find an efficient way of handling the tfrecord/protobuf file(s) using the IterableDataset?
I am also trying to do something similar but no success yet.
Thanks

I made a working version of this with IterableDataset, but it was unusably slow. It was not the TfRecord reader’s fault but the IterableDataset’s; it seemed like it didn’t use queuing at all.

To provide more context on why I want to use something like TFRecords: I work with audio files that are all very small, so a large dataset ends up being 90 million files totaling 5 TB. That becomes very unwieldy to move around, so serialising the encoded FLACs into a protobuf format and chunking them makes them much easier to move around and read from a network.

Minimal example:

import glob
import random

import tensorflow as tf
import torch


class TfRecordDataset(torch.utils.data.IterableDataset):

    def __init__(self, tfrecord_path):
        """tfrecord_path is a glob pattern matching the tfrecord files"""
        self.tfrecord_list = sorted(glob.glob(tfrecord_path))
        super(TfRecordDataset, self).__init__()

    def __iter__(self):
        worker_info = torch.utils.data.get_worker_info()
        if worker_info is None:
            # single-process loading: iterate over all files
            tfrecord_list = self.tfrecord_list
        else:
            # split the files across the DataLoader workers
            worker_id = worker_info.id
            num_workers = worker_info.num_workers
            tfrecord_list = self.tfrecord_list[worker_id::num_workers]
            random.seed(worker_info.seed)

        def tf_record_iterator():
            for filename in tfrecord_list:
                record_iterator = tf.python_io.tf_record_iterator(
                    filename, tf.python_io.TFRecordCompressionType.NONE)
                for string_record in record_iterator:
                    example = tf.train.Example()
                    example.ParseFromString(string_record)
                    do_stuff_with_example(example)  # placeholder, not defined here
                    yield example

        return tf_record_iterator()
Does something look obviously wrong with this implementation to the PyTorch people?

Are there any plans to make a tfrecord-like object native to PyTorch? I can see there is a fair amount of interest in this thread. It is not great having to work with two libraries (TensorFlow and PyTorch) like this.

TFRecords serialise the data and convert it to tensors beforehand, I believe. Plus, it uses Protocol Buffers. It doesn’t load all records into memory but loads batched records, as far as I know. I had to work on tabular data, and I got about a 3x speed increase when using TFRecords as the data source instead of CSV in TensorFlow. I am interested in having such a faster data source in PyTorch.

The combination of TFRecords and the Dataset API in TensorFlow is good. It would be great if we could have the same feature here.
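
For comparison, the TensorFlow-side pipeline I mean looks roughly like this (a sketch; the feature keys, file name and buffer/batch sizes are made up):

import tensorflow as tf

def parse_fn(serialized):
    # feature keys are placeholders - they must match what was written to the files
    features = tf.parse_single_example(serialized, features={
        'image': tf.FixedLenFeature([], tf.string),
        'label': tf.FixedLenFeature([], tf.int64),
    })
    image = tf.image.decode_jpeg(features['image'])
    return image, features['label']

dataset = (tf.data.TFRecordDataset(['train-00000.tfrecord'])
           .map(parse_fn, num_parallel_calls=4)
           .shuffle(1000)
           .batch(32)
           .prefetch(1))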

Hi,
If anyone is still interested in reading TFRecords, I’ve started a project: https://github.com/podgorskiy/DareBlopy. It is capable of some basic TFRecords reading and has decent performance (weirdly, it seems even faster than TensorFlow, at least in one particular test case).
I used it to train my StyleGAN PyTorch implementation on the TFRecords of the FFHQ dataset, as well as in some other projects of mine.
Documentation is still an issue, but I’ll be fixing that soon.

@Stanislav_Pidhorskyi Great work! Will it support variable-length features (automatic padding)?