DataParallel, multi-GPU, dot product and identity matrix: not on device

I’m creating layers A and B, both of which store a tensor that is used during the forward pass.
A projects the input onto a new space by performing a dot product with a basis (mel_basis).
B uses the input to index into an identity matrix.

My solution for running this code on multiple GPUs was to instantiate these tensors on the CPU and then move each one to its respective GPU once it is used inside DataParallel. The problem with this approach is that the tensors keep getting copied from the CPU to the GPU on every forward pass.

Is there a way to circumvent this? If I instantiate these tensors on the device with .cuda() at initialization time, they will all end up on device[0].

Code is below:

import torch
# `mel` is assumed to be librosa.filters.mel (returns a numpy mel filterbank).
from librosa.filters import mel


class A(torch.nn.Module):
    def __init__(self, n_fft, n_mel_channels=80, sampling_rate=16000):
        super(A, self).__init__()
        # Stored as a plain tensor, so it stays on the CPU.
        self.mel_basis = torch.from_numpy(
            mel(sampling_rate, n_fft, n_mel_channels)).float()

    def linear_to_mel(self, x):
        if torch.cuda.is_available():
            # Copied from the CPU to the GPU on every call.
            return torch.matmul(self.mel_basis.cuda(), x)
        else:
            return torch.matmul(self.mel_basis, x)

class B(torch.nn.Module):
    def __init__(self, n_quantization_channels):
        super(B, self).__init__()
        self.n_quantization_channels = n_quantization_channels
        # Stored as a plain tensor, so it stays on the CPU.
        self.identity_matrix = torch.eye(n_quantization_channels).float()

    def encode(self, x):
        if torch.cuda.is_available():
            # Copied from the CPU to the GPU on every call; indexing the rows
            # gives a one-hot encoding of x.
            return self.identity_matrix.cuda()[x.view(-1)]
        else:
            return self.identity_matrix[x.view(-1)]

You can initialize that tensor as a Parameter of each module. DataParallel should automatically send it to the right GPUs after that.
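For example, class A could be rewritten along these lines (a rough sketch, assuming `mel` is the same filterbank function used in the snippet above):

class A(torch.nn.Module):
    def __init__(self, n_fft, n_mel_channels=80, sampling_rate=16000):
        super(A, self).__init__()
        # As a Parameter, the basis is part of the module's state, so
        # DataParallel replicates it to each GPU together with the module.
        # (requires_grad defaults to True; see further down the thread for
        # making it non-trainable.)
        self.mel_basis = torch.nn.Parameter(
            torch.from_numpy(mel(sampling_rate, n_fft, n_mel_channels)).float())

    def linear_to_mel(self, x):
        # No explicit .cuda(): the parameter already lives on the replica's device.
        return torch.matmul(self.mel_basis, x)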

@richard does the parameter stay on the GPU throughout training or does DataParallel move it at every iteration?

Judging from the code, it looks like DataParallel replicates the model (and its associated parameters) onto all of the GPUs whenever the forward pass is called.

Then that means there is no difference between using DataParallel and the code I wrote, i.e. the data will sit on the CPU and be copied to the GPUs every time the forward pass is called…

The data doesn’t sit on the CPU and get replicated on the GPUs. It sits on one GPU (device[0]) and gets broadcast to the other GPUs every time the forward pass is called.
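A minimal illustration of where that broadcast happens (a hypothetical toy module and shapes, just for the sake of the example):

import torch

class Toy(torch.nn.Module):          # placeholder module for illustration
    def __init__(self):
        super(Toy, self).__init__()
        self.basis = torch.nn.Parameter(torch.eye(4), requires_grad=False)

    def forward(self, x):
        return torch.matmul(x, self.basis)

model = torch.nn.DataParallel(Toy().cuda())  # the parameters now sit on device[0]
out = model(torch.randn(8, 4).cuda())        # each call broadcasts them from device[0] to the other GPUs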

Got it! Then I’ll just initialize the tensor as a Parameter and set requires_grad to False!
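For reference, class B with that change would look roughly like this (an untested sketch):

class B(torch.nn.Module):
    def __init__(self, n_quantization_channels):
        super(B, self).__init__()
        self.n_quantization_channels = n_quantization_channels
        # Non-trainable Parameter: replicated to each GPU by DataParallel,
        # but never updated by the optimizer.
        self.identity_matrix = torch.nn.Parameter(
            torch.eye(n_quantization_channels).float(), requires_grad=False)

    def encode(self, x):
        # One-hot encode x by indexing rows of the identity matrix,
        # which already lives on the replica's device.
        return self.identity_matrix[x.view(-1)]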