Torch 1.10.2 is slower than torch 1.9.1

I recently upgraded my PyTorch version from 1.9.1 to 1.10.2 due to project needs, but found that training speed dropped by 20% to 30% for the same multi-task model.

After profiling, I found that the cause is the following logic in the torch 1.10 code:

When using DDP, the model assigns module buffers before each forward pass. Time is spent in the following sections:

I would like to know how I can avoid this buffer assignment.
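For context, a minimal sketch of what I mean by "module buffers" (this toy model is just an illustration, not my actual multi-task model): any registered buffers, such as BatchNorm running statistics, are what DDP broadcasts from rank 0 and reassigns on each rank before every forward when `broadcast_buffers=True` (the default).

```python
import torch.nn as nn

# Toy model with buffers: BatchNorm registers running_mean,
# running_var, and num_batches_tracked as module buffers.
model = nn.Sequential(nn.Conv2d(3, 8, kernel_size=3), nn.BatchNorm2d(8))

# These are the buffers DDP broadcasts and reassigns before each forward
# when broadcast_buffers=True (the default).
buffer_names = [name for name, _ in model.named_buffers()]
print(buffer_names)

# Assuming the slowdown really comes from this per-iteration broadcast,
# one possible workaround is to disable it when wrapping the model
# (requires that buffers stay consistent across ranks, or that you
# synchronize them manually):
#
# ddp_model = torch.nn.parallel.DistributedDataParallel(
#     model, broadcast_buffers=False)
```

I am not sure whether `broadcast_buffers=False` is safe for my model, since the buffers would no longer be kept in sync automatically.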