Issue with using DataParallel (includes minimal code)

Can you profile how much of the 434 s is spent in the forward pass when DP is not used, and how much of that time is spent on the GPU? The GPU portion can be measured with CUDA events via `torch.cuda.Event.elapsed_time`. See this discussion.
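Here is a minimal sketch of that measurement, assuming the model and inputs already sit on the same device. `time_forward` is a hypothetical helper name, not something from the original post. It returns wall-clock time and, when the inputs are on a CUDA device, the GPU time reported by the events.

```python
import time

import torch


def time_forward(model, inputs):
    """Return (wall_seconds, gpu_seconds) for one forward pass.

    gpu_seconds is None when the inputs are not on a CUDA device.
    Hypothetical helper for illustration only.
    """
    use_cuda = inputs.is_cuda
    if use_cuda:
        start = torch.cuda.Event(enable_timing=True)
        end = torch.cuda.Event(enable_timing=True)
        # Drain pending kernels so earlier work is not counted.
        torch.cuda.synchronize()

    wall_start = time.perf_counter()
    if use_cuda:
        start.record()

    model(inputs)

    gpu_seconds = None
    if use_cuda:
        end.record()
        # Kernels launch asynchronously; wait until they have finished.
        torch.cuda.synchronize()
        # elapsed_time returns milliseconds.
        gpu_seconds = start.elapsed_time(end) / 1000.0

    wall_seconds = time.perf_counter() - wall_start
    return wall_seconds, gpu_seconds
```

If the wall-clock number is much larger than the GPU number, most of the forward pass is probably spent in Python or other CPU-side work rather than in GPU kernels.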

Note that multithreading cannot parallelize ordinary Python ops because of the Python GIL. Parallelism only kicks in when the execution does not need the GIL (e.g., CPU/GPU ops that explicitly drop the GIL).
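The difference can be seen with a small standard-library sketch. Here `time.sleep` stands in for an op that releases the GIL (an assumption for illustration, it is not a GPU kernel), and `python_loop` is a pure-Python workload that holds it. All function names are hypothetical.

```python
import threading
import time


def run_in_threads(fn, n_threads=4):
    """Run fn in n_threads threads and return the total elapsed seconds."""
    threads = [threading.Thread(target=fn) for _ in range(n_threads)]
    t0 = time.perf_counter()
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return time.perf_counter() - t0


def python_loop():
    # Pure-Python bytecode: the thread needs the GIL for every iteration.
    total = 0
    for i in range(200_000):
        total += i * i


def blocking_wait():
    # time.sleep releases the GIL while waiting, so threads overlap.
    time.sleep(0.2)


if __name__ == "__main__":
    single = run_in_threads(python_loop, n_threads=1)
    multi = run_in_threads(python_loop, n_threads=4)
    print(f"python_loop:   1 thread {single:.3f}s, 4 threads {multi:.3f}s")

    single = run_in_threads(blocking_wait, n_threads=1)
    multi = run_in_threads(blocking_wait, n_threads=4)
    print(f"blocking_wait: 1 thread {single:.3f}s, 4 threads {multi:.3f}s")
```

On a standard CPython build with the GIL, the four-thread `python_loop` run should take roughly four times as long as the single-thread run, while the four-thread `blocking_wait` run should stay close to the single-thread time.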
