Different training speed between pip install torch==1.2 and building torch 1.2 from source

Which libraries are you using for the custom build and which are used in the 1.2 binaries?
Also, why are you comparing the speed of such an old PyTorch version?
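To compare the two installs, one option is to dump the build information in each environment and diff the output. This is only a sketch: `build_info` is a helper name I made up, and I am assuming `torch.__config__.show()` is available in your 1.2 build, so the call is guarded.

```python
import torch

def build_info():
    """Collect the library versions this PyTorch install reports."""
    return {
        "torch": torch.__version__,
        "cuda": torch.version.cuda,               # None for CPU-only builds
        "cudnn": torch.backends.cudnn.version(),  # None if cuDNN is unavailable
        "mkl": torch.backends.mkl.is_available(),
    }

info = build_info()
for name, value in info.items():
    print(name, value)

# Full compile-time configuration (compiler, BLAS, build flags), if exposed.
config = getattr(torch, "__config__", None)
if config is not None:
    print(config.show())
```

Run it once in the pip environment and once in the source build, then compare the two outputs line by line.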

The libraries are listed in, and the reason I am comparing speed is that I want to reproduce a user’s training speed. His PyTorch 1.2 environment was installed with pip install, but I can only build from source because of our internal platform’s limitations.

If you have made sure the binary and your local build are equal, you could use profiling tools such as Nsight or the built-in profiler in PyTorch.
Also, note that your profiling should synchronize the device before starting and stopping the timer, but I assume you are already familiar with profiling PyTorch ops.
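For reference, a minimal sketch of what synchronized timing could look like. The helper name `timed`, the warmup and iteration counts, and the matmul workload are placeholders I picked for illustration, not anything from your setup.

```python
import time
import torch

def timed(fn, warmup=3, iters=10):
    """Return the mean wall-clock seconds per call of fn."""
    use_cuda = torch.cuda.is_available()
    for _ in range(warmup):           # warm up kernels and the allocator
        fn()
    if use_cuda:
        torch.cuda.synchronize()      # wait for pending GPU work before starting
    start = time.perf_counter()
    for _ in range(iters):
        fn()
    if use_cuda:
        torch.cuda.synchronize()      # wait for the timed work to finish
    return (time.perf_counter() - start) / iters

device = "cuda" if torch.cuda.is_available() else "cpu"
x = torch.randn(512, 512, device=device)
mean_s = timed(lambda: x @ x)
print("mean seconds per matmul:", mean_s)
```

Without the two `torch.cuda.synchronize()` calls, the timer would mostly measure kernel launch overhead, because CUDA ops are queued asynchronously.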

I have already profiled the run and saved the result to timeline.json. The most expensive op in each training step is IndexPutBackward (0.8s vs. 0.2s).
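To isolate that op, a small standalone benchmark like the sketch below could be run in both environments. The tensor shapes, the index pattern, and the names `step` and `bench` are assumptions made up for illustration. The real model's shapes and indices are not shown in this thread, so the absolute numbers will differ from the profile.

```python
import time
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"

# Hypothetical sizes; replace them with the shapes from the real model.
base = torch.randn(10000, 64, device=device, requires_grad=True)
values = torch.randn(2000, 64, device=device, requires_grad=True)
index = torch.randint(0, 10000, (2000,), device=device)

def step():
    """One forward/backward pass through an indexed assignment."""
    out = base.clone()
    out[index] = values       # indexed assignment, recorded for autograd
    out.sum().backward()      # runs the backward of the indexed assignment

def bench(iters=20):
    """Return the mean seconds per step, synchronizing around the timer."""
    step()                    # warmup
    if device == "cuda":
        torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(iters):
        base.grad = None      # clear gradients so they do not accumulate
        values.grad = None
        step()
    if device == "cuda":
        torch.cuda.synchronize()
    return (time.perf_counter() - start) / iters

mean_s = bench()
print("mean seconds per step:", mean_s)
```

If the gap shows up here as well, the difference is likely in how this kernel was built or dispatched rather than in the rest of the training loop.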