Batch_norm causes RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu!

Thank you for pointing it out! It turns out that running_var was also on the CPU for some reason. After moving it to the GPU, the problem was resolved!
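For anyone who hits the same error, here is a rough sketch of how one might check which parameters and buffers of a module are on the wrong device. The helper name `find_misplaced_tensors` is made up for this example, and replacing `running_var` with a CPU tensor is only my way of simulating the problem; I don't know how the buffer actually ended up on the CPU in the original model.

```python
import torch
import torch.nn as nn


def find_misplaced_tensors(module, device):
    """Return names of parameters and buffers that are not on `device`.

    Hypothetical helper for illustration, not part of PyTorch.
    """
    target = torch.device(device)
    misplaced = []
    named = list(module.named_parameters()) + list(module.named_buffers())
    for name, tensor in named:
        same_type = tensor.device.type == target.type
        # A target such as "cuda" (no index) matches any device of that type.
        same_index = target.index is None or tensor.device.index == target.index
        if not (same_type and same_index):
            misplaced.append(name)
    return misplaced


# Fall back to the CPU so the sketch still runs without a GPU.
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

bn = nn.BatchNorm2d(4).to(device)

# Assumption: simulate the bug by swapping in a buffer that lives on the CPU.
bn.running_var = torch.ones(4)

# On a GPU machine this list should contain "running_var".
print(find_misplaced_tensors(bn, device))

# Module.to() moves every registered parameter and buffer, including
# running_mean and running_var.
bn.to(device)
print(find_misplaced_tensors(bn, device))
```

Calling `.to(device)` on the whole module after all buffers are in place is usually simpler than moving `running_var` by hand, since it covers `running_mean` and `num_batches_tracked` as well.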