OpenXExperienceReplay fails

Hi,
I’m trying to use the OpenX API to load a dataset:

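from torchrl.data.datasets import OpenXExperienceReplay

# ds_root is the local directory where the dataset should be stored
dataset = OpenXExperienceReplay(
    "berkeley_gnm_cory_hall",
    download='force',
    streaming=False,
    root=ds_root,
)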

But it fails in the same way on every dataset I tried from this collection, with this error:

  File "~/.cache/huggingface/modules/datasets_modules/datasets/jxu124--OpenX-Embodiment/317e9044a9bb97bb1db9ea5aebf1c15f5cc3e1e071e5da025e97892e96dae22b/OpenX-Embodiment.py", line 29, in decode_image
    data = data.decode()
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  ...
  File "...", line 130, in main
    dataset = OpenXExperienceReplay(
  File ".../python3.10/site-packages/torchrl/data/datasets/openx.py", line 358, in __init__
    storage = self._download_and_preproc()
  File ".../python3.10/site-packages/torchrl/data/datasets/openx.py", line 484, in _download_and_preproc
    dataset = datasets.load_dataset(
  File ".../python3.10/site-packages/datasets/load.py", line 2096, in load_dataset
    builder_instance.download_and_prepare(
  File ".../python3.10/site-packages/datasets/builder.py", line 924, in download_and_prepare
    self._download_and_prepare(
  File ".../python3.10/site-packages/datasets/builder.py", line 1647, in _download_and_prepare
    super()._download_and_prepare(
  File ".../python3.10/site-packages/datasets/builder.py", line 999, in _download_and_prepare
    self._prepare_split(split_generator, **prepare_split_kwargs)
  File ".../python3.10/site-packages/datasets/builder.py", line 1485, in _prepare_split
    for job_id, done, content in self._prepare_split_single(
  File ".../python3.10/site-packages/datasets/builder.py", line 1642, in _prepare_split_single
    raise DatasetGenerationError("An error occurred while generating the dataset") from e
datasets.exceptions.DatasetGenerationError: An error occurred while generating the dataset

Has anyone used this API successfully? It seems like it still needs some work.

I can reproduce this.
We’re calling datasets.load_dataset here, and the datasets library from HF fails while generating the dataset. The stack trace suggests the error comes from datasets (and the dataset's loading script), not from TorchRL.
You can reproduce this without TorchRL via:


import datasets

datasets.load_dataset(
    "jxu124/OpenX-Embodiment",
    "berkeley_gnm_cory_hall",
    streaming=False,
    split="train",
    cache_dir="./dump",
    trust_remote_code=True,
)

I started a discussion here
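
For context, the UnicodeDecodeError in the traceback is what Python raises whenever .decode() is called on bytes that begin with 0xff, since that byte is never valid in UTF-8. JPEG files start with the bytes FF D8 FF, so one plausible reading (an assumption, not something confirmed in this thread) is that the image field holds raw image bytes rather than a UTF-8 encoded string. The sketch below only illustrates that failure mode; the helper names and the sample payload are hypothetical and are not taken from the dataset's loading script.

```python
# Standard JPEG start-of-image marker (first three bytes of a JPEG file).
JPEG_MAGIC = b"\xff\xd8\xff"


def looks_like_jpeg(data: bytes) -> bool:
    """Return True if the payload starts with the JPEG magic bytes."""
    return data[:3] == JPEG_MAGIC


def try_decode_utf8(data: bytes):
    """Return the decoded string, or None if the bytes are not valid UTF-8."""
    try:
        return data.decode()
    except UnicodeDecodeError:
        return None


# Hypothetical stand-in for an image field that contains raw JPEG bytes.
sample = JPEG_MAGIC + b"\xe0\x00\x10JFIF"

print(looks_like_jpeg(sample))   # payload begins with the JPEG marker
print(try_decode_utf8(sample))   # decoding fails, so the helper returns None
```

A check like this can help confirm whether a field holds binary image data before any attempt to treat it as text.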