Questions about prototype builtin datasets

Ren_Pang · May 19, 2023, 7:28pm

Hi all, I’m currently exploring builtin datasets with new standards:

Let’s take Cifar10 as an example. I have several questions:

Why are all datasets constructed as iterable-style rather than map-style? Given an index (e.g., 2331), I can no longer use dataset[2331] as with the old CIFAR10.
In this case, how do I fetch a single item from the new-format dataset? Do I have to use IterToMapConverter? That seems strange, because the raw data is already map-style: I would turn it into an iterable and then traverse the whole thing to convert it back to a map.
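To make the cost I am worried about concrete, here is a minimal plain-Python sketch. It does not use the torchdata API; `make_stream` and `nth` are hypothetical names, and the `(index, label)` pairs are just a stand-in for real samples.

```python
from itertools import islice


def make_stream(n):
    """Stand-in for an iterable-style dataset: yields (index, label) pairs."""
    for i in range(n):
        yield i, i % 10


def nth(iterable, index):
    """Fetch one item by walking past the first `index` items (O(index) work)."""
    return next(islice(iterable, index, None))


# One-off lookup: the stream has to be walked from the start every time.
item = nth(make_stream(5000), 2331)

# Repeated lookups: traverse once, store everything in a dict, then index in O(1).
as_map = dict(enumerate(make_stream(5000)))
assert as_map[2331] == item
```

Either way, the whole prefix (or the whole dataset) has to be traversed before an index lookup is possible, which is what feels redundant when the files on disk are already indexable.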
What does hint_shuffling do?
```
def hint_shuffling(datapipe: IterDataPipe[D]) -> Shuffler[D]:
    return Shuffler(datapipe, buffer_size=INFINITE_BUFFER_SIZE).set_shuffle(False)
```
It’s used in all prototype datasets. It wraps the datapipe in a Shuffler but then calls set_shuffle(False), which seems to do nothing?
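My current guess, which I have not confirmed, is that the wrapper only marks the point in the pipeline where shuffling could be switched on later. The sketch below illustrates that reading with a hypothetical `ToggleShuffle` class in plain Python; it is not the real Shuffler, and the `seed` parameter is my own addition to keep the example deterministic.

```python
import random


class ToggleShuffle:
    """Hypothetical stand-in for a shuffle stage that can be enabled later."""

    def __init__(self, source, seed=0):
        self.source = source
        self.seed = seed
        self.enabled = True

    def set_shuffle(self, enabled):
        self.enabled = enabled
        return self  # returning self allows chaining, as in the snippet above

    def __iter__(self):
        if not self.enabled:
            # Disabled: behave as a pure pass-through.
            yield from self.source
            return
        # Enabled: buffer everything (an "infinite" buffer) and shuffle it.
        items = list(self.source)
        random.Random(self.seed).shuffle(items)
        yield from items


stage = ToggleShuffle(range(5)).set_shuffle(False)
disabled_order = list(stage)  # original order, the stage is inactive

stage.set_shuffle(True)       # something downstream flips the switch
enabled_order = list(stage)   # same items, order decided by the shuffle
```

If that reading is right, the wrapper is a no-op by default but fixes where in the pipeline shuffling happens once it is enabled.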
When should I use Decompressor, and when should I set resource.preprocess='decompress' or 'extract'?
What’s the difference among Decompressor, resource.preprocess='decompress', resource.preprocess='extract' and using nothing?
- The Cifar10 resource is cifar-10-python.tar.gz and sets nothing. By default, OnlineResource.load calls _guess_archive_loader, which produces a TarArchiveLoader.
- The MNIST resource is train-images-idx3-ubyte.gz and uses a Decompressor.
- The cub200 resource is CUB_200_2011.tgz and uses decompress=True.
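For reference, this is how I understand the underlying distinction between the file types, sketched with the Python standard library only. It is not how torchvision implements Decompressor or the preprocess option; the member names and payloads are made up.

```python
import gzip
import io
import tarfile

# A single gzip-compressed file (like train-images-idx3-ubyte.gz):
# decompressing it yields exactly one byte stream.
single = gzip.compress(b"raw idx bytes")
decompressed = gzip.decompress(single)

# A gzip-compressed tar archive (like cifar-10-python.tar.gz):
# it holds several named members that have to be listed or extracted.
buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode="w:gz") as tar:
    for name, payload in [("data_batch_1", b"a"), ("test_batch", b"b")]:
        info = tarfile.TarInfo(name)
        info.size = len(payload)
        tar.addfile(info, io.BytesIO(payload))

buf.seek(0)
with tarfile.open(fileobj=buf, mode="r:gz") as tar:
    members = {m.name: tar.extractfile(m).read() for m in tar.getmembers()}
```

So a .gz file only needs decompressing, while a .tar.gz or .tgz needs both decompressing and archive handling. What I cannot tell is how the four options above map onto these two steps.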
How do I use a Transform, such as AutoAugment or RandomCrop, in the new dataset API? Do I append a corresponding augmentation datapipe to the current dataset pipe?
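What I have in mind is something like the following plain-Python sketch, where a transform is mapped over the stream of samples. The names `samples` and `crop` are hypothetical, the "image" is just a list of numbers, and none of this is the actual prototype API.

```python
def samples():
    """Stand-in for a dataset pipe: yields dict samples."""
    for i in range(3):
        yield {"image": [i, i + 1, i + 2, i + 3], "label": i}


def crop(sample, size=2):
    """Hypothetical transform: keep the first `size` values of the image."""
    return {**sample, "image": sample["image"][:size]}


# Appending the transform as one more lazy stage at the end of the pipeline.
pipeline = map(crop, samples())
out = list(pipeline)
```

Is a map stage like this the intended way, or should the transform be applied somewhere else?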
For datasets where each image is stored in an encoded image format (the old ImageFolder type, e.g., ImageNet, GTSRB), the output image type is EncodedImage -> EncodedData -> Datapoint. For datasets stored in binary (e.g., MNIST and CIFAR), the output image type is Image -> Datapoint. Why are they different? Most transforms V2 APIs seem to operate on Image, so why is EncodedImage used here?