Data Parallel on single GPU


I want to use data parallelism to train my model on a single GPU. I tried running multiple processes on the same GPU by following the PyTorch DistributedDataParallel example and passing the same device_id to every process, but I got no speedup. Searching the forum, it seems the model replicas on the same GPU may have to wait for each other. Is there a better way to implement parallel training on a single GPU? Thanks.
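For context, here is roughly what I ran, a minimal sketch modeled on the DDP tutorial (the toy model, port number, and world size are placeholders, not my real training code). Every rank is pinned to the same GPU via `device_ids=[0]`; on a machine without CUDA the sketch falls back to gloo on CPU so it still runs:

```python
import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

def worker(rank, world_size, device_id):
    # One DDP rank. In my run, every rank got the SAME device_id (0).
    os.environ["MASTER_ADDR"] = "127.0.0.1"
    os.environ["MASTER_PORT"] = "29501"
    backend = "nccl" if torch.cuda.is_available() else "gloo"
    dist.init_process_group(backend, rank=rank, world_size=world_size)
    if torch.cuda.is_available():
        torch.cuda.set_device(device_id)
        device = torch.device(f"cuda:{device_id}")
        ddp = DDP(nn.Linear(10, 1).to(device), device_ids=[device_id])
    else:
        device = torch.device("cpu")
        ddp = DDP(nn.Linear(10, 1))  # CPU fallback so the sketch runs anywhere
    opt = torch.optim.SGD(ddp.parameters(), lr=0.01)
    for _ in range(3):
        opt.zero_grad()
        loss = ddp(torch.randn(8, 10, device=device)).sum()
        loss.backward()  # gradients are all-reduced across ranks here
        opt.step()
    dist.destroy_process_group()

def run(world_size=2, device_id=0):
    # Launch world_size processes that all share one device.
    mp.spawn(worker, args=(world_size, device_id), nprocs=world_size, join=True)
    return world_size

if __name__ == "__main__":
    run()
```

With this setup the ranks contend for the same device, so each step is at best as fast as one process doing a larger batch, which matches the lack of speedup I observed.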