.to('cuda') taking super long time

See if

i[1] = i[1].cuda(non_blocking=True)

makes a difference.

For this to make a difference, the source tensor has to live in pinned (page-locked) host memory; otherwise non_blocking=True has no effect and the copy is still synchronous. So if merely setting non_blocking=True does not give any improvement, and trainset comes from a DataLoader object, also pass pin_memory=True to the DataLoader constructor.
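Here is a minimal sketch of the two settings used together. The dataset, tensor shapes, and batch size are made up for illustration (your trainset will look different), and it falls back to the CPU when no GPU is available so it runs anywhere.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Hypothetical stand-in data; replace with your own dataset.
features = torch.randn(64, 3)
labels = torch.randint(0, 2, (64,))
dataset = TensorDataset(features, labels)

use_cuda = torch.cuda.is_available()
device = torch.device("cuda" if use_cuda else "cpu")

# pin_memory=True makes the loader return batches in page-locked host memory,
# which is what allows non_blocking=True to overlap the copy with other work.
trainset = DataLoader(dataset, batch_size=16, pin_memory=use_cuda)

moved = []
for x, y in trainset:
    # Asynchronous host-to-device copy when the source batch is pinned.
    x = x.to(device, non_blocking=True)
    y = y.to(device, non_blocking=True)
    moved.append((x, y))
```

If you time this, call torch.cuda.synchronize() before reading the clock. The copies are asynchronous, so without it you only measure how long it took to queue them.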