Hi all, I'm currently exploring the builtin datasets with the new standards. Let's take `Cifar10` as an example. I have several questions:
- Why are all datasets constructed as iter-style rather than map-style? When I have an index (e.g., 2331), I can no longer use `dataset[2331]` like the old API. In this case, how do I `__getitem__` on the new-format dataset? Do I have to use `IterToMapConverter`? That'll be quite strange, because the raw data format is map-style: I make it iter-style and then have to traverse it to change it back to map-style.
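To illustrate what that round trip costs, here is a plain-Python sketch (no torchdata APIs assumed, `iter_to_map` is my own placeholder): the whole iterable must be traversed once just to regain random access.

```python
# Plain-Python sketch of what an iter-to-map conversion costs:
# the whole iterable must be traversed once to rebuild random access.

def iter_to_map(samples):
    """Materialize an iterable of samples into an index -> sample dict."""
    return {i: sample for i, sample in enumerate(samples)}

# A stand-in for an iter-style dataset (really backed by map-style raw data).
dataset_iter = (f"sample-{i}" for i in range(5000))

dataset_map = iter_to_map(dataset_iter)  # O(n) traversal just to index once
print(dataset_map[2331])  # prints "sample-2331"
```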
- What does `hint_shuffling` do? It's used in all prototype datasets. It seems to wrap the datapipe with a `Shuffler` and then call `set_shuffle(False)`:

  ```python
  def hint_shuffling(datapipe: IterDataPipe[D]) -> Shuffler[D]:
      return Shuffler(datapipe, buffer_size=INFINITE_BUFFER_SIZE).set_shuffle(False)
  ```

  That seems to do nothing?
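My current guess is that it inserts a *disabled* shuffle node so that something downstream (e.g., a loader) can later toggle it on at the right place in the pipe. A toy sketch of that pattern (toy class, not the real torchdata `Shuffler`):

```python
import random

class ToyShuffler:
    """Toy stand-in for a Shuffler datapipe: a no-op until enabled."""

    def __init__(self, source):
        self.source = source
        self.enabled = False

    def set_shuffle(self, flag=True):
        self.enabled = flag
        return self

    def __iter__(self):
        if not self.enabled:
            yield from self.source      # hint only: nothing happens
        else:
            buffer = list(self.source)  # actual shuffling, once enabled
            random.shuffle(buffer)
            yield from buffer

pipe = ToyShuffler(range(10)).set_shuffle(False)  # like hint_shuffling
assert list(pipe) == list(range(10))              # no-op by default
pipe.set_shuffle(True)                            # a loader could flip this on
assert sorted(pipe) == list(range(10))            # same items, maybe reordered
```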
- When should I use `resource.preprocess='decompress'` or `'extract'`? What's the difference among `resource.preprocess='decompress'`, `resource.preprocess='extract'`, and setting nothing? The `Cifar10` resource is a `cifar-10-python.tar.gz` and sets nothing; it will by default call `OnlineResource.load` to generate a datapipe. The `MNIST` resource is a `train-images-idx3-ubyte.gz` and uses `preprocess='decompress'`. The `cub200` resource is an archive and uses `preprocess='extract'`.
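My current understanding of the naming, checked with the stdlib on files I create myself (nothing here uses torchvision): a `.gz` is a single compressed file, while a `.tar.gz` is a compressed *archive* of many files.

```python
import gzip
import io
import os
import tarfile
import tempfile

tmp = tempfile.mkdtemp()

# 'decompress' case: a .gz holds exactly one compressed file (like
# train-images-idx3-ubyte.gz) -> decompressing yields one file back.
gz_path = os.path.join(tmp, "data.bin.gz")
with gzip.open(gz_path, "wb") as f:
    f.write(b"raw binary records")
with gzip.open(gz_path, "rb") as f:
    assert f.read() == b"raw binary records"

# 'extract' case: a .tar.gz is an archive of many files (like
# cifar-10-python.tar.gz) -> extracting yields a directory tree.
tar_path = os.path.join(tmp, "archive.tar.gz")
with tarfile.open(tar_path, "w:gz") as tar:
    info = tarfile.TarInfo("folder/a.txt")
    payload = b"hello"
    info.size = len(payload)
    tar.addfile(info, io.BytesIO(payload))
with tarfile.open(tar_path, "r:gz") as tar:
    assert tar.getnames() == ["folder/a.txt"]
```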
- How do I use transforms in the new dataset API, such as `RandomCrop`? Do I append the corresponding transform datapipe to the current dataset pipe?
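That is, something like mapping the transform over the pipe as one more stage? A plain-Python sketch of what I mean (`random_crop` here is my own placeholder, not the real torchvision op):

```python
def random_crop(sample):
    """Placeholder transform: crop a 'row' of pixel values to length 4."""
    image, label = sample
    return image[:4], label

# A stand-in iter-style dataset of (image-row, label) pairs.
dataset = (([p for p in range(8)], i % 10) for i in range(3))

# "Appending a transform datapipe" would then just be another mapping stage.
augmented = map(random_crop, dataset)

for image, label in augmented:
    assert len(image) == 4
```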
- For datasets where each image is stored in an encoded image format (the old `ImageFolder` type, e.g., ImageNet, GTSRB), the output image format is `EncodedImage -> EncodedData -> Datapoint`. For datasets stored in binary (e.g., MNIST and CIFAR), the output image format is `Image -> Datapoint`. Why are they different? I see most transforms v2 APIs operate on `Image`. Why is `EncodedImage` returned instead of `Image` in the first case?
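To make sure I'm asking the right thing, here is how I picture the two output formats (toy classes of my own; the real torchvision types are different):

```python
from dataclasses import dataclass

@dataclass
class EncodedImage:
    """Still-compressed bytes, e.g. the raw contents of a .png file."""
    data: bytes

@dataclass
class Image:
    """Decoded pixel values, ready for tensor-style transforms."""
    pixels: list

def decode(enc: EncodedImage) -> Image:
    # Placeholder decode: real code would run a PNG/JPEG decoder here.
    return Image(pixels=list(enc.data))

# ImageFolder-style datasets yield EncodedImage; binary datasets (MNIST,
# CIFAR) already hold raw pixels, so they can yield Image directly.
sample = EncodedImage(data=bytes([0, 128, 255]))
img = decode(sample)
assert img.pixels == [0, 128, 255]
```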