Parquet file reader cannot convert tuple to pyarrow.lib.NativeFile


I want to use TorchData and DataPipes with a large Parquet time-series file containing several different time series. I can open and work with the Parquet file in polars etc., but I cannot manage it with torchdata.

dp = FileLister(root=data_dir).filter(lambda fname: fname.endswith(".parquet"))
dp = FileOpener(dp)
parquet_dp = dp.load_parquet_as_df()
stream = parquet_dp.batch(batch_size=16)


TypeError: Cannot convert tuple to pyarrow.lib.NativeFile
This exception is thrown by __iter__ of ParquetDFLoaderIterDataPipe(columns=None, device='', dtype=None, source_dp=FileOpenerIterDataPipe, use_threads=False)

I am using the following:

Ubuntu 20.04
python 3.9.13
torchdata 0.5.0
torcharrow 0.2.0a0.dev20221129
pyarrow 0.8.0
polars 0.14.28

Is your Parquet file an Arrow table or similar? Would it be readable with pyarrow.parquet.ParquetFile?

If not, you may have to write your own custom DataPipe to parse it.

I created the parquet file from a polars dataframe.

The following both run without error:

table = pq.read_table(data_dir)
parquet_file = pq.ParquetFile(data_dir)

I believe I found the issue, but I don't know why it occurs. In the torchdata/datapipes/iter/util/ source file, line 144 reads:

for path in self.source_dp:
    parquet_file = parquet.ParquetFile(path)

Here, path is a tuple of (data_dir, StreamWrapper<<_io.TextIOWrapper name=data_dir mode='r' encoding='UTF-8'>>), which is why the error gets thrown inside pyarrow.

I found the error: one does not need to open the file before using the Parquet file reader, so the line dp = FileOpener(dp) has to be removed. load_parquet_as_df expects file paths, not (path, stream) tuples, so the working pipeline becomes:

dp = FileLister(root=data_dir).filter(lambda fname: fname.endswith(".parquet"))
parquet_dp = dp.load_parquet_as_df()
stream = parquet_dp.batch(batch_size=16)