PyTorch gets stuck at .to('cuda') or .cuda()

I use a Docker environment for PyTorch training, but the process frequently gets stuck at .to('cuda') or .cuda() calls, and I don't know why.
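One way to narrow down where such a call hangs is to arm Python's faulthandler before the transfer, so that the tracebacks of all threads are dumped if the call has not returned in time. This is only a minimal sketch and assumes nothing about the cause; the helper name run_with_hang_dump and the 60-second default are hypothetical choices for illustration.

```python
import faulthandler
import sys


def run_with_hang_dump(fn, timeout=60.0, file=None):
    """Call fn(); if it has not returned after `timeout` seconds,
    dump the tracebacks of all threads to `file` (default: sys.stderr).
    The call itself is not interrupted, it keeps running."""
    if file is None:
        file = sys.stderr
    # Arm a watchdog that writes the tracebacks once the timeout expires.
    faulthandler.dump_traceback_later(timeout, exit=False, file=file)
    try:
        return fn()
    finally:
        # Disarm the watchdog once fn() has returned or raised.
        faulthandler.cancel_dump_traceback_later()
```

For the case above it could be used as run_with_hang_dump(lambda: torch.zeros(1).to('cuda')). The dump only shows the Python-level frames, so it tells which Python call is blocked, not what the native CUDA code underneath is doing.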
Some system behavior I noticed: the training process uses 100% CPU, and about 97% of that is sys time, as shown by top. I debugged the training process with strace and the /proc file system and found that it keeps getting stuck in a poll syscall on a pipe, but I don't know where PyTorch uses that pipe.
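For reference, the /proc lookup can be scripted. The sketch below lists which file descriptors of a process are pipes, so the fd number seen in the strace output for poll can be matched to a pipe inode. It is Linux only, and the helper name list_pipe_fds is hypothetical.

```python
import os


def list_pipe_fds(pid):
    """Return {fd: target} for every open file descriptor of `pid`
    whose /proc/<pid>/fd symlink points at a pipe (Linux only)."""
    fd_dir = f"/proc/{pid}/fd"
    pipes = {}
    for name in os.listdir(fd_dir):
        try:
            target = os.readlink(os.path.join(fd_dir, name))
        except OSError:
            # The descriptor was closed between listdir() and readlink().
            continue
        # Pipe targets look like "pipe:[<inode>]".
        if target.startswith("pipe:"):
            pipes[int(name)] = target
    return pipes
```

Both ends of a pipe show the same pipe:[inode] target, so running the same lookup on other processes in the container and comparing inode numbers may show which process holds the other end.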