Using MKL-DNN with Distributed Data Parallel (DDP)

Pulasthi · April 3, 2021, 9:09pm

Hi Everyone.

I have an autoencoder that I am training with DDP. I wanted to improve performance by using MKL-DNN, so I tried converting the model with the following lines, but at runtime I get an AssertionError. Is MKL-DNN not supported with DDP, or am I doing something wrong? Any help would be highly appreciated.

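    # Imports assumed from the aliases used below
    from torch.nn.parallel import DistributedDataParallel as DDP
    from torch.utils import mkldnn as mkldnn_utils

    autoencoder = AutoEncoder(layers=layers)
    autoencoderMKL = mkldnn_utils.to_mkldnn(autoencoder)
    ddp_model = DDP(autoencoderMKL)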

Error:

      File "/N/u2/p/pulasthiiu/git/deepLearning_MDS/nnprojects/Mnist/AutoEncodertDDPDataGenMKL.py", line 156, in <module>
        main()
      File "/N/u2/p/pulasthiiu/git/deepLearning_MDS/nnprojects/Mnist/AutoEncodertDDPDataGenMKL.py", line 105, in main
        ddp_model = DDP(autoencoderMKL)
      File "/N/u2/p/pulasthiiu/python3.8/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 344, in __init__
        assert any((p.requires_grad for p in module.parameters())), (
    AssertionError: DistributedDataParallel is not needed when a module doesn't have any parameter that requires a gradient.
    Traceback (most recent call last):
      File "/N/u2/p/pulasthiiu/python3.8/lib/python3.8/runpy.py", line 194, in _run_module_as_main
        return _run_code(code, main_globals, None,
      File "/N/u2/p/pulasthiiu/python3.8/lib/python3.8/runpy.py", line 87, in _run_code
        exec(code, run_globals)
      File "/N/u2/p/pulasthiiu/python3.8/lib/python3.8/site-packages/torch/distributed/launch.py", line 260, in <module>
        main()
      File "/N/u2/p/pulasthiiu/python3.8/lib/python3.8/site-packages/torch/distributed/launch.py", line 255, in main
        raise subprocess.CalledProcessError(returncode=process.returncode,
    subprocess.CalledProcessError: Command '['/N/u2/p/pulasthiiu/python3.8/bin/python3', '-u', '/N/u2/p/pulasthiiu/git/deepLearning_MDS/nnprojects/Mnist/AutoEncodertDDPDataGenMKL.py', '-w', '40', '-ep', '10', '-bs', '8000', '-rc', '1024', '-ds', '640000', '-l', '768x576x432x324']' returned non-zero exit status 1.

Best Regards,
Pulasthi

H-Huang · April 5, 2021, 1:23pm

I don’t believe that to_mkldnn() modifies the underlying model, just the memory format of the tensors; please let me know if I am wrong. We need more information about the AutoEncoder model. Could you also include the model code? Does it have any parameters?

For reference, here is the line that raises the error: pytorch/distributed.py at master · pytorch/pytorch · GitHub
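One way to answer the parameter question is to run the same check that the assertion in the traceback performs, before and after conversion. The following is a minimal sketch, not the code from this thread: the small `nn.Sequential` model is a hypothetical stand-in for the AutoEncoder, and `has_trainable_params` and `summarize` are illustrative helper names. It assumes that `torch.utils.mkldnn` may store converted weights as buffers rather than parameters, which is worth confirming on your own PyTorch version.

```python
import copy

import torch
from torch import nn
from torch.utils import mkldnn as mkldnn_utils


def has_trainable_params(module: nn.Module) -> bool:
    """Mirror the check in the traceback: any parameter with requires_grad."""
    return any(p.requires_grad for p in module.parameters())


def summarize(module: nn.Module) -> dict:
    """Count parameters and buffers so two models can be compared."""
    return {
        "parameters": sum(1 for _ in module.parameters()),
        "buffers": sum(1 for _ in module.buffers()),
        "trainable": has_trainable_params(module),
    }


# Hypothetical stand-in for the AutoEncoder discussed in this thread.
dense_model = nn.Sequential(nn.Linear(8, 4), nn.ReLU(), nn.Linear(4, 8))
dense_summary = summarize(dense_model)
print("dense: ", dense_summary)

if torch.backends.mkldnn.is_available():
    # Convert a copy, in case the conversion replaces submodules in place.
    mkl_model = mkldnn_utils.to_mkldnn(copy.deepcopy(dense_model))
    print("mkldnn:", summarize(mkl_model))
```

If the converted model reports zero parameters, that would explain why DDP refuses to wrap it, since DDP only synchronizes gradients of parameters.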

Pulasthi · April 11, 2021, 8:40pm

Hi Huang,

Sorry about the late reply. It is a simple autoencoder; it just has some logic to add layers based on the number of layers I specify (the code is below). Am I using the to_mkldnn function incorrectly?

Links to the complete code:

Without MKL: https://github.com/pulasthi/deepLearning_MDS/blob/master/nnprojects/Mnist/AutoEncodertDDPDataGen.py

With MKL: https://github.com/pulasthi/deepLearning_MDS/blob/master/nnprojects/Mnist/AutoEncodertDDPDataGenMKL.py

```python
import torch.nn as nn


class AutoEncoder(nn.Module):
    def __init__(self, **kwargs):
        super().__init__()
        # Sequence of layer widths, from the input size down to the bottleneck
        inner_layers = kwargs["layers"]
        encoder_layers = []
        decoder_layers = []
        num_layers = len(inner_layers) - 1
        print(f"numlayers {num_layers}")
        for x in range(num_layers):
            # Encoder narrows step by step; decoder mirrors it in reverse
            encoder_layers.append(nn.Linear(in_features=inner_layers[x], out_features=inner_layers[x + 1]))
            encoder_layers.append(nn.ReLU(True))
            decoder_layers.append(
                nn.Linear(in_features=inner_layers[num_layers - x], out_features=inner_layers[num_layers - x - 1]))
            decoder_layers.append(nn.ReLU(True))

        self.encoder = nn.Sequential(*encoder_layers)
        self.decoder = nn.Sequential(*decoder_layers)

    def forward(self, x):
        x = self.encoder(x)
        x = self.decoder(x)
        return x
```

H-Huang · April 14, 2021, 3:33pm

Thanks for the model. I just verified that it fails and that mkldnn does change the model layers. I don’t have much context on MKL-DNN, but I created an issue on GitHub to track this and loop in the right people: Support for mkldnn + ddp · Issue #56024 · pytorch/pytorch · GitHub.

Yanli_Zhao · April 14, 2021, 6:00pm

@Pulasthi does it work if you convert the model to an MKL-DNN model and run local training without DDP?
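A local smoke test along these lines might look like the sketch below. It is an assumption-laden illustration rather than the code from this thread: the `nn.Sequential` model stands in for the AutoEncoder, `try_build_optimizer` is a hypothetical helper, and SGD with MSE loss is an arbitrary choice. It runs one ordinary training step on the dense model, then converts a copy with `to_mkldnn`, runs a forward pass, and checks whether an optimizer could even be built for the converted model.

```python
import copy

import torch
from torch import nn
from torch.utils import mkldnn as mkldnn_utils


def try_build_optimizer(module: nn.Module):
    """Return an SGD optimizer, or None if the module exposes no parameters."""
    params = list(module.parameters())
    if not params:
        return None
    return torch.optim.SGD(params, lr=0.01)


# Hypothetical stand-in for the AutoEncoder; no DDP involved anywhere.
dense_model = nn.Sequential(nn.Linear(8, 4), nn.ReLU(), nn.Linear(4, 8))
batch = torch.randn(2, 8)

# One plain local training step on the dense model as a baseline.
optimizer = try_build_optimizer(dense_model)
loss = nn.functional.mse_loss(dense_model(batch), batch)
loss.backward()
optimizer.step()
print("dense loss:", loss.item())

if torch.backends.mkldnn.is_available():
    # Convert a copy, in case the conversion replaces submodules in place.
    mkl_model = mkldnn_utils.to_mkldnn(copy.deepcopy(dense_model).eval())
    with torch.no_grad():
        # Feed an MKL-DNN tensor and convert the result back to a dense tensor.
        out = mkl_model(batch.to_mkldnn()).to_dense()
    print("mkldnn output shape:", tuple(out.shape))
    print("optimizer for mkldnn model:", try_build_optimizer(mkl_model))
```

If the last line prints `None`, local training of the converted model would stop at optimizer construction, which would point to the conversion being suited to inference rather than to training with or without DDP.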

Pulasthi · April 15, 2021, 4:06am

@Yanli_Zhao let me try that out and get back to you.