Time Overhead in torchvision.datasets.ImageFolder

Hi. I was experimenting with torchvision datasets and model architectures. I seem to be incurring a very large time overhead when running torchvision.models.resnet18 for one epoch on the ImageNet dataset (a simplified sketch of how the timings are collected follows the numbers):

Dataloader_time:  [6.656000018119811e-05]
TrainD_time:  [7.61386392974854]
train_stat_time:  [46.41145703125]
stat_time1:  [0.24797286987304687]
stat_time2:  [0.0009789440035820008]
validation_time:  [27.89709765625]
epoch_time:  [164.072703125]
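
Roughly, each phase is wrapped in a wall-clock timer along the following lines (a simplified sketch, not my exact code; the helper name `timed` is just for illustration):

```python
import time
import torch

def timed(fn, *args, **kwargs):
    """Run fn and return (result, elapsed seconds) for that phase."""
    # Synchronize so asynchronously launched GPU kernels are actually
    # included in the measured interval.
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    start = time.time()
    out = fn(*args, **kwargs)
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    return out, time.time() - start
```

epoch_time wraps the whole epoch, while the other entries are accumulated from wrappers like this around their respective loops.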

Here:

  1. train_stat_time is the time to run the model on the entire training dataset after each epoch to compute performance metrics (such as accuracy/loss) on the training set.
  2. stat_time1 is the time to compute class-wise metrics on the training dataset, and stat_time2 is the time to compute class-wise metrics on the validation dataset (a sketch of this evaluation pass follows this list).
  3. For debugging, I downsampled the training set to 12.8K samples and the validation set to 6.4K samples.
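
For reference, the evaluation pass behind train_stat_time/validation_time and the class-wise metrics looks roughly like this (simplified; the function name `evaluate` and the exact bookkeeping are illustrative):

```python
import torch

@torch.no_grad()
def evaluate(model, loader, criterion, num_classes, device):
    # Full pass over a dataset: overall loss/accuracy plus per-class counts.
    model.eval()
    total_loss, correct, total = 0.0, 0, 0
    class_correct = torch.zeros(num_classes)
    class_total = torch.zeros(num_classes)
    for images, labels in loader:
        images, labels = images.to(device), labels.to(device)
        outputs = model(images)
        total_loss += criterion(outputs, labels).item() * labels.size(0)
        preds = outputs.argmax(dim=1)
        correct += (preds == labels).sum().item()
        total += labels.size(0)
        for c in range(num_classes):
            mask = labels == c
            class_total[c] += mask.sum().item()
            class_correct[c] += (preds[mask] == c).sum().item()
    return (total_loss / total,
            correct / total,
            class_correct / class_total.clamp(min=1))
```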

There are two major issues here:

  1. The individual times do not add up to the epoch time.
  2. The validation/train_stat times are much larger than the training time.

For comparison, running the same code on the full torchvision.datasets.CIFAR10 dataset gives the following results:

Dataloader_time:  [0.00016486400365829467]
TrainD_time:  [17.981506515502918]
train_stat_time:  [7.0972744140625]
stat_time1:  [0.0033064959049224855]
stat_time2:  [0.000319487988948822]
validation_time:  [0.4533729248046875]
epoch_time:  [25.934220703125]

As is evident, here the training time is the major component and the numbers add up.

The only difference between the two runs is that the former uses datasets.ImageFolder to create the training/validation datasets, while the latter uses datasets.CIFAR10.

I have verified that the time issue does not stem from image size differences between the two datasets by using the same crop size (32 x 32) on both, as in the sketch below. (For reference, ImageNet images are 256 x 256 after standard transformations, compared to 32 x 32 for CIFAR10/100.)
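
Concretely, the two dataset setups are built roughly as follows (the paths and the exact transform are placeholders; the point is that both pipelines produce 32 x 32 tensors):

```python
import torchvision.transforms as T
from torchvision import datasets

# Same transform pipeline for both datasets, so image size is not a factor.
transform = T.Compose([
    T.RandomResizedCrop(32),   # 32 x 32 crop for both
    T.ToTensor(),
])

# ImageNet-style directory tree read with ImageFolder (path is a placeholder)
imagenet_train = datasets.ImageFolder("path/to/imagenet/train", transform=transform)

# CIFAR10 ships as a single in-memory archive
cifar_train = datasets.CIFAR10(root="path/to/cifar10", train=True,
                               download=True, transform=transform)
```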

Any help would be deeply appreciated!!