I found that the bottleneck is the DataLoader. I implemented my own DataLoader according to this code, and timed it like this:
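for epoch in range(epochs):
    print(time1)          # timestamp at the start of the epoch
    for data in loader:
        print(time2)      # timestamp once a batch has been fetched
        ....              # forward pass, backprop, etc.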
time2 - time1 is extremely large (~15s), while everything else inside the inner loop, including the forward pass and backprop, takes <1s.
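
One way to check that the time really goes into waiting on the loader, and not into the training step, is to time each batch fetch separately. Below is a minimal sketch, assuming the loader is any Python iterable. The names timed_batches and slow_loader are hypothetical, and slow_loader only simulates a slow DataLoader with time.sleep; it is not the loader from the post.

```python
import time


def timed_batches(loader):
    """Yield (batch, wait_seconds) for each batch of any iterable loader.

    wait_seconds covers only the time spent fetching the batch; the time the
    caller spends between fetches (forward pass, backprop) is excluded.
    """
    start = time.perf_counter()
    it = iter(loader)  # iterator creation counts toward the first wait
    for batch in it:
        yield batch, time.perf_counter() - start
        # Execution resumes here after the caller has finished its step,
        # so the clock restarts just before the next fetch.
        start = time.perf_counter()


def slow_loader(n_batches=3, delay=0.05):
    """Hypothetical stand-in for a slow DataLoader (assumption, not real code)."""
    for i in range(n_batches):
        time.sleep(delay)  # simulate slow I/O or preprocessing
        yield i


load_time = 0.0
compute_time = 0.0
for batch, wait in timed_batches(slow_loader()):
    load_time += wait
    t0 = time.perf_counter()
    _ = batch * 2  # placeholder for the forward pass and backprop
    compute_time += time.perf_counter() - t0

print(f"loading: {load_time:.3f}s, compute: {compute_time:.3f}s")
```

If only the first wait of each epoch is large, the cost is probably in setting up the iterator rather than in loading each batch; if every wait is large, the per-batch loading itself is the slow part.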