DataParallel doesn't work

nn.DataParallel can create a memory imbalance across devices, as described in this blog post along with some workarounds. The imbalance comes from the fact that the input batch is scattered from, and the outputs (and usually the loss) are gathered back onto, the default device, so that GPU needs more memory than the others. E.g. if your single-GPU run worked with a batch size of 16, nn.DataParallel on two GPUs could still yield an OOM for a global batch size of 32.
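For reference, here is a minimal sketch of the usage pattern in question (the model and shapes are just placeholders):

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 10).cuda()   # model lives on cuda:0
model = nn.DataParallel(model)     # replicas are created on all visible GPUs

# the global batch is scattered across GPUs for the forward pass,
# but the outputs are gathered back onto cuda:0, so the default
# device can run out of memory before the others
x = torch.randn(32, 10, device="cuda:0")
out = model(x)
loss = out.sum()   # loss is computed on cuda:0
loss.backward()
```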

That being said, we generally recommend using DistributedDataParallel with a single process per device, as this avoids the memory imbalance and is usually the fastest approach.
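A minimal sketch of the single-process-per-GPU pattern, assuming a `torchrun` launch (the model, shapes, and script name are placeholders):

```python
import os

import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP


def main():
    # torchrun sets RANK, LOCAL_RANK, and WORLD_SIZE for each process
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # each process owns exactly one GPU and one model replica
    model = nn.Linear(10, 10).to(local_rank)
    model = DDP(model, device_ids=[local_rank])

    # each process feeds its *per-GPU* batch (e.g. 16), so the
    # effective global batch size is 16 * world_size
    x = torch.randn(16, 10, device=local_rank)
    out = model(x)
    out.sum().backward()   # gradients are all-reduced across processes

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```

Launched e.g. via `torchrun --nproc_per_node=2 train_ddp.py`, each process keeps its activations and loss on its own GPU, so the memory usage stays balanced.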