__getitem__ , am I understanding it totally wrong?! Prints make this claim stronger

Hi, I'm having a problem with long loading times from my dataset, so I inserted prints into my __getitem__ function inside my dataset class (inheriting from data.Dataset, of course). This is not my first dataset class, and this is not my first project, but the output I got was pretty weird.

I did everything “by the book”, I’m iterating my dataset with:

    for i, (inputs, targets) in enumerate(data_loader):

during test and validation.

My __getitem__ looks something like this:

    def __getitem__(self, index):
        start_time = time.time()
        # ... actions: load and process the sample ...
        end_time = time.time() - start_time
        print("loading:", json_path, ", took:", end_time)

What I expected to see: let's say I have batch size 1 and one thread. I expected to see one line of

loading: $json_path,took: $end_time

and one line of iteration results:

Epoch: [1][2844/26517] Time 0.103 (0.186) Data 0.049 (0.131) Loss 6.0703 (6.1430) Acc 0.000 (0.002)

To my understanding, I'm reading one file (batch size 1) from my dataset and processing it, but instead I got a much larger number of lines for each.

What am I missing? It’s important for me in order to understand my main problem.

Thanks.

Hi,

Because IO is slow and can be done in the background, the DataLoader will ask for more elements than just the current batch.
This is particularly effective when multiple workers are used, so that the workers can start loading the next batches while the GPU runs the forward/backward pass on your network, without the user needing to do anything fancy.
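
For reference, here is a minimal sketch of how that prefetching is controlled on the DataLoader side (the dataset, batch size and worker count below are illustrative, not taken from your setup):

    import torch
    from torch.utils.data import DataLoader, Dataset

    class ToyDataset(Dataset):
        # Stand-in for your dataset; values are illustrative.
        def __len__(self):
            return 1000

        def __getitem__(self, index):
            # Pretend this is the slow json loading step.
            return torch.randn(3, 224, 224), 0

    # Each worker preloads `prefetch_factor` batches ahead of time,
    # which is why you see several __getitem__ prints per iteration.
    loader = DataLoader(
        ToyDataset(),
        batch_size=1,
        shuffle=True,
        num_workers=4,      # illustrative; prefetching needs num_workers > 0
        prefetch_factor=2,  # default: batches preloaded per worker
    )

    for i, (inputs, targets) in enumerate(loader):
        pass  # forward/backward would go here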

Thanks, that makes sense, so one more thing that relates to that.
I have an issue with my network: it mostly takes around 2 or 3 seconds to load each iteration,
but every so often I'm getting a huge data loading time, like this:

Epoch: [10][1/130] Time 23355.934 (23355.934) Data 23355.545 (23355.545) Loss 5.2440 (5.2440) Acc 0.008 (0.008)

Now this is the 10th epoch, and until now it worked fine, and it repeats with no pattern.
I am using data shuffling.
How would you approach this problem?

How many workers do you use for your dataloader?
This could be an artefact of the preloading if you have a single worker, which thus works sequentially to preload all future batches.

This specific run uses 32 workers on a machine with 8 GPUs with 12 GB of memory each.

Maybe you want to play a bit with the number of workers to try and find the sweet spot.
Maybe all these workers overload the disk at some point and make one iteration very slow?
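
As a rough way to look for that sweet spot, you could time a fixed number of batches for a few worker counts (just a sketch; `my_dataset` and the values are placeholders):

    import time
    from torch.utils.data import DataLoader

    def time_loader(dataset, num_workers, n_batches=100, batch_size=32):
        # Time how long it takes to pull n_batches from a fresh loader.
        loader = DataLoader(dataset, batch_size=batch_size,
                            shuffle=True, num_workers=num_workers)
        start = time.time()
        for i, batch in enumerate(loader):
            if i + 1 >= n_batches:
                break
        return time.time() - start

    # for workers in (0, 4, 8, 16, 32):
    #     print(workers, "workers:", time_loader(my_dataset, workers))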

But 6 hours?
The validation data is on a remote disk; maybe it is trying to read everything from that disk and that's the result? Although it's still weird, because on other datasets it didn't happen…
I'm walking in the dark…

Oh, the data is on a remote disk? Could it be that the connection to the remote disk is not stable and is sometimes lost?
Do you see the same thing if you store the data on a local disk?
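
To check whether the remote disk itself is the problem, you could time raw reads outside the Dataset/DataLoader entirely and compare remote vs local storage (a sketch; the paths and file pattern are placeholders):

    import glob
    import random
    import time

    def slowest_read(root, n_files=50):
        # Read a handful of random files directly and report the slowest read.
        paths = glob.glob(f"{root}/**/*.json", recursive=True)
        sample = random.sample(paths, min(n_files, len(paths)))
        worst = 0.0
        for p in sample:
            start = time.time()
            with open(p, "rb") as f:
                f.read()
            worst = max(worst, time.time() - start)
        return worst

    # print("remote:", slowest_read("/mnt/remote/dataset"))
    # print("local :", slowest_read("/local/copy/of/dataset"))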