What is a proper way to troubleshoot DDP speed gain?

The loss is a custom C++ CPU loss provided in another work.
I am currently looking for a GPU version.
From an initial run, the GPU loss does not seem to be faster than the multi-threaded C++ CPU implementation.
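
For reference, here is a minimal sketch of how the two versions could be timed on equal terms. It assumes both losses can be called from Python. The names `benchmark`, `summarize`, and `cpu_loss_placeholder` are hypothetical and stand in for the real loss calls. GPU work is often launched asynchronously, so the optional `sync` hook is where a wait would go (if this is PyTorch, `torch.cuda.synchronize` could be passed there). Without warm-up and a wait, the GPU timing may not reflect the actual kernel time.

```python
import statistics
import time


def benchmark(fn, warmup=3, iters=10, sync=None):
    """Return a list of per-call wall-clock times in seconds for fn.

    warmup calls are run first and not timed. If sync is given, it is
    called after the warm-up and after every timed call, so that any
    asynchronous work has finished before the clock is read.
    """
    for _ in range(warmup):
        fn()
    if sync is not None:
        sync()

    times = []
    for _ in range(iters):
        start = time.perf_counter()
        fn()
        if sync is not None:
            sync()
        times.append(time.perf_counter() - start)
    return times


def summarize(times):
    """Median and minimum are less sensitive to outliers than the mean."""
    return {"median": statistics.median(times), "min": min(times)}


def cpu_loss_placeholder():
    # Hypothetical stand-in for the real loss call; replace with the
    # CPU or GPU loss being compared.
    return sum(i * i for i in range(10_000))


if __name__ == "__main__":
    print(summarize(benchmark(cpu_loss_placeholder)))
```

Timing the loss alone, then the loss plus any host-to-device copies, and then a full training step should help show whether the time goes to the kernel itself, to data transfer, or to something else in the loop.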

I am still trying to find out why, and how to further speed up the GPU version.
Thanks