I am trying to speed up my PyTorch code by using pytorch_lightning, which supports TPU training. However, training runs very slowly: each iteration takes about 20 seconds on TPU, whereas it takes only half a second on GPU.
Profiling points to the issue below:

```
pt-xla-profiler: TransferFromServerTime too frequent: 449 counts during 6 steps
pt-xla-profiler: Op(s) not lowered: aten::im2col, aten::im2col_backward, Please open a GitHub issue with the above op lowering requests.
```
XLA does not support these ops. Upon investigating my PyTorch code, I found that it is `nn.Unfold` that triggers them.
I can't afford to rent multi-GPU instances, so I really need the TPU to speed things up.
Without XLA supporting `im2col`/`im2col_backward` any time soon, is there a way to replicate what `nn.Unfold` does without triggering these two ops?
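One direction I'm considering, as a minimal sketch: extract the patches myself with plain strided slicing and stacking, which should dispatch to ops that XLA does lower (slice, pad, stack, reshape) instead of `aten::im2col`, and whose backward avoids `im2col_backward`. The function name `unfold_no_im2col` is just mine for illustration, and I haven't benchmarked this on TPU:

```python
import torch
import torch.nn.functional as F

def unfold_no_im2col(x, kernel_size, stride=1, padding=0):
    """Mimic nn.Unfold for a 4-D input (N, C, H, W) using plain slicing.

    Returns (N, C * kh * kw, L), matching F.unfold's output layout.
    Hypothetical workaround -- avoids aten::im2col by building patches
    from kh * kw strided slices instead.
    """
    kh, kw = (kernel_size, kernel_size) if isinstance(kernel_size, int) else kernel_size
    sh, sw = (stride, stride) if isinstance(stride, int) else stride
    if padding:
        x = F.pad(x, (padding, padding, padding, padding))
    n, c, h, w = x.shape
    out_h = (h - kh) // sh + 1
    out_w = (w - kw) // sw + 1

    patches = []
    for i in range(kh):
        for j in range(kw):
            # Each slice gathers one kernel position across all output
            # locations; shape (N, C, out_h, out_w).
            patches.append(x[:, :, i:i + sh * out_h:sh, j:j + sw * out_w:sw])

    # Stack kernel positions, then flatten to nn.Unfold's layout:
    # channel varies slowest, then kernel row, then kernel column.
    out = torch.stack(patches, dim=2)           # (N, C, kh*kw, out_h, out_w)
    return out.reshape(n, c * kh * kw, out_h * out_w)
```

A quick CPU check against the reference, to confirm the layout matches before trying it under XLA:

```python
x = torch.randn(2, 3, 8, 8, requires_grad=True)
assert torch.allclose(F.unfold(x, 3, stride=2, padding=1),
                      unfold_no_im2col(x, 3, stride=2, padding=1))
```

The loop runs kh * kw times at trace time only, so for small kernels the compiled XLA graph should stay reasonable; whether it is actually fast on TPU is something I'd still need to verify.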