At least part of your performance loss comes from having to reimplement complex multiplication by hand as an unfused series of operations. Fusing those operations would be a big win. You should be able to use the PyTorch JIT’s fuser to test this.
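As a rough sketch of what that test could look like, assuming your complex tensors are stored as separate real and imaginary parts (the function and variable names here are hypothetical, not taken from your code), you can write the multiplication as plain elementwise ops and compile it with `torch.jit.script`. Whether the ops actually get fused depends on your PyTorch version, the device, and which fuser backend is active, so treat this as a starting point for benchmarking rather than a guaranteed speedup.

```python
from typing import Tuple

import torch


def complex_mul(
    a_re: torch.Tensor, a_im: torch.Tensor, b_re: torch.Tensor, b_im: torch.Tensor
) -> Tuple[torch.Tensor, torch.Tensor]:
    # (a_re + i*a_im) * (b_re + i*b_im), written as elementwise ops.
    # In eager mode each multiply/add/subtract runs as its own kernel.
    out_re = a_re * b_re - a_im * b_im
    out_im = a_re * b_im + a_im * b_re
    return out_re, out_im


# Compile the same function with TorchScript so the JIT has a chance
# to fuse the elementwise ops (assumption: fusion is not guaranteed).
complex_mul_scripted = torch.jit.script(complex_mul)

# Small sanity check: (1 + 2i) * (3 + 4i) = -5 + 10i
a_re, a_im = torch.tensor([1.0]), torch.tensor([2.0])
b_re, b_im = torch.tensor([3.0]), torch.tensor([4.0])
out_re, out_im = complex_mul_scripted(a_re, a_im, b_re, b_im)
```

To compare performance, time `complex_mul` against `complex_mul_scripted` on realistically sized tensors, and run the scripted version a few times first, since the JIT typically specializes and optimizes only after some warm-up calls.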