Hi all, I'm currently exploring the builtin datasets with the new standards. Let's take `Cifar10` as an example. I have several questions:
- Why are all datasets constructed as iter-style rather than map-style? When I have an index (e.g., 2331), I can no longer use `dataset[2331]` like the old API. In this case, how do I `__getitem__` on the new-format dataset? Do I have to use `IterToMapConverter`? That'll be quite strange, because the raw data format is map-style: I make it iter-style and then have to traverse it to change it back to map-style.
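To illustrate what that round trip costs, here is a plain-Python sketch (no torchdata APIs assumed, `iter_to_map` is my own placeholder): the whole iterable must be traversed once just to regain random access.

```python
# Plain-Python sketch of what an iter-to-map conversion costs:
# the whole iterable must be traversed once to rebuild random access.

def iter_to_map(samples):
    """Materialize an iterable of samples into an index -> sample dict."""
    return {i: sample for i, sample in enumerate(samples)}

# A stand-in for an iter-style dataset (really backed by map-style raw data).
dataset_iter = (f"sample-{i}" for i in range(5000))

dataset_map = iter_to_map(dataset_iter)  # O(n) traversal just to index once
print(dataset_map[2331])  # prints "sample-2331"
```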
- What does `hint_shuffling` do? It's used in all prototype datasets. It seems to wrap the datapipe with a `Shuffler` and then call `set_shuffle(False)`:

  ```python
  def hint_shuffling(datapipe: IterDataPipe[D]) -> Shuffler[D]:
      return Shuffler(datapipe, buffer_size=INFINITE_BUFFER_SIZE).set_shuffle(False)
  ```

  That seems to do nothing?
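My current guess is that it inserts a *disabled* shuffle node so that something downstream (e.g., a loader) can later toggle it on at the right place in the pipe. A toy sketch of that pattern (toy class, not the real torchdata `Shuffler`):

```python
import random

class ToyShuffler:
    """Toy stand-in for a Shuffler datapipe: a no-op until enabled."""

    def __init__(self, source):
        self.source = source
        self.enabled = False

    def set_shuffle(self, flag=True):
        self.enabled = flag
        return self

    def __iter__(self):
        if not self.enabled:
            yield from self.source      # hint only: nothing happens
        else:
            buffer = list(self.source)  # actual shuffling, once enabled
            random.shuffle(buffer)
            yield from buffer

pipe = ToyShuffler(range(10)).set_shuffle(False)  # like hint_shuffling
assert list(pipe) == list(range(10))              # no-op by default
pipe.set_shuffle(True)                            # a loader could flip this on
assert sorted(pipe) == list(range(10))            # same items, maybe reordered
```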
- When should I use `resource.preprocess='decompress'` or `'extract'`? What's the difference among `resource.preprocess='decompress'`, `resource.preprocess='extract'`, and setting nothing? The `Cifar10` resource is a `cifar-10-python.tar.gz` and sets nothing; it will by default call `OnlineResource.load` to generate a datapipe. The `MNIST` resource is a `train-images-idx3-ubyte.gz` and uses `preprocess='decompress'`. The `cub200` resource is an archive and uses `preprocess='extract'`.
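My current understanding of the naming, checked with the stdlib on files I create myself (nothing here uses torchvision): a `.gz` is a single compressed file, while a `.tar.gz` is a compressed *archive* of many files.

```python
import gzip
import io
import os
import tarfile
import tempfile

tmp = tempfile.mkdtemp()

# 'decompress' case: a .gz holds exactly one compressed file (like
# train-images-idx3-ubyte.gz) -> decompressing yields one file back.
gz_path = os.path.join(tmp, "data.bin.gz")
with gzip.open(gz_path, "wb") as f:
    f.write(b"raw binary records")
with gzip.open(gz_path, "rb") as f:
    assert f.read() == b"raw binary records"

# 'extract' case: a .tar.gz is an archive of many files (like
# cifar-10-python.tar.gz) -> extracting yields a directory tree.
tar_path = os.path.join(tmp, "archive.tar.gz")
with tarfile.open(tar_path, "w:gz") as tar:
    info = tarfile.TarInfo("folder/a.txt")
    payload = b"hello"
    info.size = len(payload)
    tar.addfile(info, io.BytesIO(payload))
with tarfile.open(tar_path, "r:gz") as tar:
    assert tar.getnames() == ["folder/a.txt"]
```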
- How do I use transforms in the new dataset API, such as `RandomCrop`? Do I append the corresponding transform datapipe to the current dataset pipe?
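That is, something like mapping the transform over the pipe as one more stage? A plain-Python sketch of what I mean (`random_crop` here is my own placeholder, not the real torchvision op):

```python
def random_crop(sample):
    """Placeholder transform: crop a 'row' of pixel values to length 4."""
    image, label = sample
    return image[:4], label

# A stand-in iter-style dataset of (image-row, label) pairs.
dataset = (([p for p in range(8)], i % 10) for i in range(3))

# "Appending a transform datapipe" would then just be another mapping stage.
augmented = map(random_crop, dataset)

for image, label in augmented:
    assert len(image) == 4
```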
- For datasets where each image is stored in an encoded image format (the old `ImageFolder` type, e.g., ImageNet, GTSRB), the output image format is `EncodedImage -> EncodedData -> Datapoint`. For datasets stored in binary (e.g., MNIST and CIFAR), the output image format is `Image -> Datapoint`. Why are they different? I see most transforms v2 APIs operate on `Image`. Why is `EncodedImage` returned instead of `Image` in the first case?
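To make sure I'm asking the right thing, here is how I picture the two output formats (toy classes of my own; the real torchvision types are different):

```python
from dataclasses import dataclass

@dataclass
class EncodedImage:
    """Still-compressed bytes, e.g. the raw contents of a .png file."""
    data: bytes

@dataclass
class Image:
    """Decoded pixel values, ready for tensor-style transforms."""
    pixels: list

def decode(enc: EncodedImage) -> Image:
    # Placeholder decode: real code would run a PNG/JPEG decoder here.
    return Image(pixels=list(enc.data))

# ImageFolder-style datasets yield EncodedImage; binary datasets (MNIST,
# CIFAR) already hold raw pixels, so they can yield Image directly.
sample = EncodedImage(data=bytes([0, 128, 255]))
img = decode(sample)
assert img.pixels == [0, 128, 255]
```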