Dataset and DataLoader

Hi, I have been going through multiple tutorials, but I still have two questions about Dataset and DataLoader. Any clarification would help.

  1. Let’s say I am trying to process some text, and the first step is to turn my input sentences into tensors. Can I directly create a CUDA tensor when initializing the Dataset? Or should I create the tensors on the CPU and then move them to the GPU while the DataLoader iterates through them?
  2. How do I pick the number of workers for the DataLoader? Is it tied to the number of GPUs available (i.e., if I have 4 GPUs, should I also use 4 workers to parallelize things)? What exactly are the workers doing? The docs just describe them as “subprocesses to use for data loading.” Are they only reading from memory, or are they also applying the transform specified in the Dataset?
  1. I would recommend creating the tensors on the CPU in your Dataset and pushing them (asynchronously) to the GPU in your training loop. If you are using multiple workers, each worker might otherwise try to initialize its own CUDA context, which will yield an error.

  2. Each worker will load a batch of your Dataset by calling __getitem__ once per index (batch_size calls in total), which also means that all preprocessing is performed in that worker process. You should experiment a bit with the number of workers, as the optimal value depends on the number of CPU cores your system has.
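To make answer 1 concrete, here is a minimal sketch: the Dataset returns plain CPU tensors, and the training loop moves each batch to the GPU with `non_blocking=True` (combined with `pin_memory=True` in the DataLoader so the host-to-device copy can be asynchronous). The class name `TextDataset` and the toy token ids are made up for illustration.

```python
import torch
from torch.utils.data import Dataset, DataLoader

class TextDataset(Dataset):
    """Toy dataset: stores pre-tokenized sentences as CPU tensors."""
    def __init__(self, encoded_sentences):
        # encoded_sentences: list of equal-length lists of token ids
        self.data = [torch.tensor(s, dtype=torch.long) for s in encoded_sentences]

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        # Return a CPU tensor; the training loop handles the device transfer
        return self.data[idx]

dataset = TextDataset([[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]])

# pin_memory=True allocates batches in page-locked memory, which allows
# the later host->GPU copy to overlap with computation
loader = DataLoader(dataset, batch_size=2, num_workers=0, pin_memory=True)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
for batch in loader:
    # non_blocking=True makes the copy asynchronous when the source is pinned
    batch = batch.to(device, non_blocking=True)
    # ... forward / backward pass on `batch` ...
```

Creating the tensors on the GPU inside `__init__` instead would also hold your whole dataset in GPU memory, so the CPU-then-transfer pattern usually scales better in addition to avoiding the multi-worker CUDA-context issue.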

@ptrblck thanks for the answers! When you say the preprocessing is done by the workers, what do you mean? I am guessing they take care of creating the batch (so if you define your own collate_fn, they will also execute that), but what else do they do?

Each worker will execute the whole __getitem__ method, which usually includes loading, preprocessing, etc.
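To illustrate what ends up running in a worker, here is a small sketch with variable-length sentences: `__getitem__` does the per-sample work (loading plus any transform), and a custom collate_fn pads the samples into one batch tensor. With `num_workers > 0`, both steps run in the worker subprocess. The names `SentenceDataset` and `collate` are made up for this example.

```python
import torch
from torch.nn.utils.rnn import pad_sequence
from torch.utils.data import Dataset, DataLoader

class SentenceDataset(Dataset):
    def __init__(self, token_ids):
        self.token_ids = token_ids

    def __len__(self):
        return len(self.token_ids)

    def __getitem__(self, idx):
        # Loading + preprocessing of ONE sample happens here;
        # with num_workers > 0 this runs inside a worker subprocess
        return torch.tensor(self.token_ids[idx], dtype=torch.long)

def collate(batch):
    # Called with a list of batch_size samples; pads them to equal length.
    # This also executes in the worker process when num_workers > 0.
    return pad_sequence(batch, batch_first=True, padding_value=0)

ds = SentenceDataset([[1, 2], [3, 4, 5], [6]])
loader = DataLoader(ds, batch_size=3, collate_fn=collate, num_workers=0)
batch = next(iter(loader))
print(batch.tolist())  # [[1, 2, 0], [3, 4, 5], [6, 0, 0]]
```

So the main process only receives the finished, collated batch; everything from `__getitem__` through `collate_fn` is the workers' job.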