Calling module.eval() after module.cuda() takes a very long time

The two should be unrelated. You must have a bug in your benchmark script.
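One common cause, though only a guess without seeing the script, is timing without synchronization. CUDA work is queued asynchronously, so an unsynchronized timer can charge earlier GPU work to whatever line runs next, and `eval()` itself only flips the training flag on each submodule. Below is a minimal sketch of timing the two calls separately. The `sync` and `timed` helpers and the small `nn.Sequential` model are hypothetical stand-ins, not taken from the original script, and the code falls back to CPU when no GPU is present.

```python
import time

import torch
import torch.nn as nn


def sync():
    # Wait for queued GPU work to finish; does nothing on CPU-only machines.
    if torch.cuda.is_available():
        torch.cuda.synchronize()


def timed(fn):
    """Run fn() and return (result, elapsed seconds), synchronizing before and after."""
    sync()
    start = time.perf_counter()
    result = fn()
    sync()
    return result, time.perf_counter() - start


device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Hypothetical stand-in for the model in the original benchmark.
model = nn.Sequential(nn.Linear(1024, 1024), nn.ReLU(), nn.Linear(1024, 10))

# Time the device transfer and the mode switch as separate measurements.
_, t_move = timed(lambda: model.to(device))
_, t_eval = timed(model.eval)

print(f"move to {device}: {t_move:.4f}s")
print(f"eval(): {t_eval:.6f}s")
```

If `eval()` still looks slow when measured this way, the next thing to check is probably whether something else sits inside the timed region, such as the first forward pass or the one-time CUDA context setup.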