Data parallel tutorial

I’m guessing there may be some general suggestions or explanations for this. I’m seeing huge PCIe bandwidth usage even when I train on a single GPU - roughly 2x-4x what I measure with Keras+TF. I don’t understand why so much data is being transferred, since all that really seems necessary to send across the bus is the input batches and the instructions for what to compute. I don’t think it’s anything specific to my code; is this just how PyTorch DataParallel performance is in general?
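For reference, here is a minimal sketch of the kind of single-GPU loop I’m measuring against (the model, dataset, and hyperparameters below are placeholders, not my actual setup):

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

device = torch.device("cuda:0")

# Toy data standing in for my real dataset.
inputs = torch.randn(2000, 3, 64, 64)
targets = torch.randint(0, 10, (2000,))
loader = DataLoader(TensorDataset(inputs, targets),
                    batch_size=64, shuffle=True,
                    pin_memory=True, num_workers=2)

# Small placeholder network.
model = nn.Sequential(
    nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
    nn.Linear(16, 10),
).to(device)

optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
criterion = nn.CrossEntropyLoss()

for x, y in loader:
    # As far as I understand, these host-to-device copies of the batch
    # should be the only significant PCIe traffic per iteration.
    x = x.to(device, non_blocking=True)
    y = y.to(device, non_blocking=True)
    optimizer.zero_grad()
    loss = criterion(model(x), y)
    loss.backward()
    optimizer.step()
```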

It seems to me that some entirely unnecessary data is being exchanged between the CPU and the GPUs, and that this is simply how PyTorch currently behaves.
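For the multi-GPU case, I’m referring to the tutorial-style nn.DataParallel wrap; a minimal sketch (placeholder model, not my real one):

```python
import torch
import torch.nn as nn

# nn.DataParallel replicates the module from cuda:0 to every visible GPU on
# each forward pass and scatters the input batch across devices, so there is
# per-iteration replication traffic on top of the host-to-device batch copy.
model = nn.Linear(512, 10)
if torch.cuda.device_count() > 1:
    model = nn.DataParallel(model)
model = model.to("cuda:0")

x = torch.randn(256, 512).to("cuda:0")
out = model(x)  # inputs are scattered and outputs gathered back on cuda:0
```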