See if
i[1] = i[1].cuda(non_blocking=True)
makes a difference.
I think that for this to make a difference you have to specify pin_memory=True as an argument when constructing the DataLoader object, but I am not entirely sure. As far as I know, the copy can only be asynchronous when the source tensor is in pinned (page-locked) host memory.
If merely setting non_blocking=True does not give any improvement, and if you do use a DataLoader object to construct trainset, then try passing pin_memory=True to its constructor as well.
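
Putting the two suggestions together, a minimal sketch could look like the following. The dataset here is a hypothetical stand-in made of random tensors, and the names dataset, device and moved_labels are my own assumptions, not taken from your code. It falls back to the CPU when no GPU is available, in which case neither setting has any effect.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Hypothetical stand-in for the real data: 64 random samples with integer labels.
dataset = TensorDataset(torch.randn(64, 3, 8, 8), torch.randint(0, 10, (64,)))

use_cuda = torch.cuda.is_available()
device = torch.device("cuda" if use_cuda else "cpu")

# pin_memory=True makes the loader return batches in page-locked host memory,
# which is what allows a host-to-GPU copy to run asynchronously.
# It is only enabled here when a GPU is actually present.
trainset = DataLoader(dataset, batch_size=16, pin_memory=use_cuda)

moved_labels = []
for inputs, labels in trainset:
    # With pinned source tensors, non_blocking=True lets these copies overlap
    # with other host-side work; on a CPU-only machine nothing is moved.
    inputs = inputs.to(device, non_blocking=True)
    labels = labels.to(device, non_blocking=True)
    moved_labels.append(labels)
```

Here .to(device, non_blocking=True) plays the same role as .cuda(non_blocking=True) in the line above; I used it only so that the sketch also runs without a GPU. Whether this gives a measurable speedup depends on how much other work there is to overlap with the transfer, so it is worth timing both variants.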