I got this error solved.
The problem is that the batched data is not tensor data. It's a list of dicts (training samples plus their ground truth). If the input to the model is a tensor organized in NCHW layout, it works as expected.
I'm still wondering whether it's possible to pass a list of dict objects to a model that inherits from DataParallel. Can such a list batch somehow be scattered automatically and appropriately across multiple GPUs?
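One common workaround, rather than relying on DataParallel to scatter a list of dicts, is to collate the list into a dict of batched tensors yourself and feed the model plain NCHW tensors. This is a minimal sketch; the dataset, the `image`/`label` keys, and the tensor shapes are hypothetical, not from the original post:

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, Dataset

# Hypothetical dataset: each sample is a dict (training sample + ground truth),
# mirroring the list-of-dicts batch described above.
class DictDataset(Dataset):
    def __init__(self, n=8):
        self.n = n

    def __len__(self):
        return self.n

    def __getitem__(self, idx):
        return {"image": torch.randn(3, 16, 16), "label": torch.tensor(idx % 2)}

# Collate the list of dicts into a dict of batched tensors (images in NCHW),
# so the model receives plain tensors that can be split along dim 0.
def dict_collate(samples):
    return {
        "image": torch.stack([s["image"] for s in samples]),  # N x C x H x W
        "label": torch.stack([s["label"] for s in samples]),  # N
    }

model = nn.Conv2d(3, 4, kernel_size=3, padding=1)
if torch.cuda.is_available():
    # DataParallel chunks tensor inputs along the batch dimension (dim 0).
    model = nn.DataParallel(model).cuda()

loader = DataLoader(DictDataset(), batch_size=4, collate_fn=dict_collate)
batch = next(iter(loader))
images = batch["image"]
if torch.cuda.is_available():
    images = images.cuda()
out = model(images)  # tensor input in NCHW scatters cleanly
print(out.shape)  # batch of 4, 4 output channels, spatial size preserved
```

For what it's worth, DataParallel's scatter logic does traverse lists and dicts recursively and splits any tensors it finds along dim 0, so a dict of batched tensors as a keyword argument often works too; a list of per-sample dicts, however, is not chunked the way a batched tensor is, which is why collating first is the safer route.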