[SOLVED] Make Sure That PyTorch Is Using the GPU to Compute

I found that the bottleneck is the DataLoader. I implemented my own DataLoader according to this code

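import time

for epoch in range(epochs):
    time1 = time.time()  # start of the epoch
    print(time1)
    for data in loader:
        time2 = time.time()  # a batch has just been delivered
        print(time2)
        ...  # rest of the training step (forward, backprop)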

time2 - time1 is extremely large (~15 s), while everything else inside the inner loop, including the forward pass and backprop, takes <1 s.
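
For anyone who wants to check the same thing, here is a rough sketch of how the wait for data can be separated from the compute time. The names profile_loader, step_fn and fake_loader are made up for illustration and are not from my actual code; step_fn stands for whatever runs the forward and backward pass on one batch. One caveat: CUDA kernels are launched asynchronously, so for accurate numbers on the GPU step_fn should probably end with torch.cuda.synchronize().

```python
import time


def profile_loader(loader, step_fn):
    """Return (seconds spent waiting for batches, seconds spent in step_fn)."""
    data_time = 0.0
    step_time = 0.0
    # Start the clock before iter() so that any loader start-up cost
    # (e.g. spawning worker processes) is counted as data time.
    t0 = time.perf_counter()
    it = iter(loader)
    while True:
        try:
            batch = next(it)
        except StopIteration:
            break
        t1 = time.perf_counter()  # batch is ready
        step_fn(batch)            # hypothetical training step
        t2 = time.perf_counter()  # step has returned
        data_time += t1 - t0
        step_time += t2 - t1
        t0 = t2
    return data_time, step_time


def fake_loader(n_batches=3, delay=0.02):
    """Stand-in for a slow DataLoader: each batch takes `delay` seconds."""
    for i in range(n_batches):
        time.sleep(delay)
        yield i


if __name__ == "__main__":
    data_s, step_s = profile_loader(fake_loader(), lambda batch: None)
    print(f"waiting for data: {data_s:.3f}s, compute: {step_s:.3f}s")
```

If the first number dominates, as it did for me, the GPU is mostly idle waiting for input. With the stock torch.utils.data.DataLoader the usual things to try are num_workers and pin_memory; whether they help with a custom loader depends on how it is written.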