I want to use TorchData and DataPipes to load a large parquet file that contains several different timeseries. I can open and work with the parquet file in polars etc., but with torchdata I get the following error:
TypeError: Cannot convert tuple to pyarrow.lib.NativeFile
This exception is thrown by __iter__ of ParquetDFLoaderIterDataPipe(columns=None, device='', dtype=None, source_dp=FileOpenerIterDataPipe, use_threads=False)
Edit:
I believe I found the cause, but I don’t know why it occurs. In the torchdata/datapipes/iter/util/dataframemaker.py source file, at line 144:
for path in self.source_dp:
    parquet_file = parquet.ParquetFile(path)
Here, path is a tuple of (data_dir, StreamWrapper<<_io.TextIOWrapper name=data_dir mode='r' encoding='UTF-8'>>) rather than a file path or an open file object, so the TypeError is raised inside pyarrow when parquet.ParquetFile tries to open it.
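
To illustrate the mismatch, here is a minimal pure-Python sketch, not torchdata code. It assumes the upstream step yields (path, stream) tuples as shown above and that the parquet loader only needs the path. The helper name paths_only and the file name are hypothetical. If this diagnosis is right, something similar could probably be applied in a real pipeline by mapping each item to its first element before the parquet loader, or by not opening the files upstream at all, but I have not verified that.

```python
import io
from typing import Iterable, Iterator, Tuple, Union

# An item is either a plain path or a (path, stream) tuple like the one above.
PathOrPair = Union[str, Tuple[str, io.IOBase]]


def paths_only(items: Iterable[PathOrPair]) -> Iterator[str]:
    """Yield only the file path from each item (hypothetical helper).

    For (path, stream) tuples the stream is closed, because a loader that
    opens the file itself from the path has no use for it.
    """
    for item in items:
        if isinstance(item, tuple):
            path, stream = item
            stream.close()
            yield path
        else:
            yield item


# Stand-in for what the file-opening step yields; the file name is made up.
opened = [("data/series.parquet", io.StringIO(""))]
print(list(paths_only(opened)))
```

Each yielded value is then a plain string, which is what the loop at line 144 appears to expect.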