I noticed that with the TensorFlow version of the code, the first epoch takes noticeably longer than the following epochs. I guess this is because TensorFlow caches all batches of data during the first epoch, so the following epochs can run much faster?
Can I reimplement this behavior in PyTorch? It might reduce the total training time.
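If the TensorFlow code uses the tf.data API, the behavior you describe typically comes from `Dataset.cache()`: the first epoch runs the (possibly slow) input pipeline and fills the cache, and later epochs read the cached elements. A minimal illustration (a hypothetical pipeline, not the actual code in question):

```python
import tensorflow as tf

# The first pass over the dataset executes map() and fills the cache;
# subsequent passes are served directly from the cached elements.
ds = tf.data.Dataset.range(5)
ds = ds.map(lambda x: x * 2).cache()

for epoch in range(2):  # the second epoch reads from the cache
    values = [int(v) for v in ds]

print(values)  # [0, 2, 4, 8] is wrong; actual: [0, 2, 4, 6, 8]
```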
In PyTorch, if you have enough GPU memory, you can move the whole dataset tensor to the device once, before training:
device = ...
dataset_tensor = dataset_tensor.to(device)  # works for individual batches too; just loop over them
Then run your training loop; per-epoch time should drop because no host-to-device copies happen inside the loop. Either way, the dataset still has to be loaded into memory at some point, so the total execution time of the script will probably only be slightly lower.
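A minimal sketch of this approach, assuming a synthetic in-memory dataset and a toy linear model (the names `X`, `y`, and the model are placeholders, not from the original code):

```python
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Hypothetical dataset: 10,000 samples, 20 features, binary labels.
X = torch.randn(10_000, 20)
y = torch.randint(0, 2, (10_000,))

# Move everything to the device once, before the loop, so no
# host-to-device copies happen per batch during training.
X, y = X.to(device), y.to(device)

model = torch.nn.Linear(20, 2).to(device)
opt = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = torch.nn.CrossEntropyLoss()

batch_size = 256
for epoch in range(3):
    for i in range(0, X.size(0), batch_size):
        xb, yb = X[i:i + batch_size], y[i:i + batch_size]  # already on device
        opt.zero_grad()
        loss = loss_fn(model(xb), yb)
        loss.backward()
        opt.step()
```

If the dataset does not fit in GPU memory, the usual alternative is a `DataLoader` with `pin_memory=True` and non-blocking `.to(device)` calls inside the loop, which overlaps the copies with computation instead of eliminating them.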