I’m training a network in which the first layer is a filter-bank-like projection and the second is a convolution. The example below is just to illustrate the problem.
When using this network inside DataParallel with more than one GPU, I run into “tensors are on different GPUs”.
I also tried not moving the tensors to the GPU explicitly, hoping that calling .cuda() on the model would resolve the issue. @SimonW Any thoughts on this?
Does anyone have a solution to this problem?
import numpy as np
import torch
from torch.autograd import Variable
from torch.nn import DataParallel

class Network(torch.nn.Module):
    def __init__(self, frame_size, stride):
        super(Network, self).__init__()
        self.projection = Projection(frame_size, stride)
        self.conv = torch.nn.Conv1d(frame_size // 2, 1, 2, 1)

    def forward(self, signal):
        output = self.projection(signal)
        output = self.conv(output)
        return output

class Projection(torch.autograd.Function):
    def __init__(self, frame_size, stride):
        super(Projection, self).__init__()
        # fixed sinusoidal filter bank used as the convolution weights
        x = np.arange(frame_size)
        f = np.array([np.sin(x * 2 * k * np.pi / frame_size)
                      for k in range(frame_size // 2)])
        f = torch.from_numpy(f.astype(float))
        f = torch.unsqueeze(f, 1).float()
        self.conv = torch.nn.Conv1d(1, frame_size // 2,
                                    kernel_size=frame_size,
                                    stride=stride,
                                    bias=False)
        self.conv.weight.data = f
        if torch.cuda.is_available():
            self.conv.weight.data = self.conv.weight.data.cuda()
        self.conv.weight.requires_grad = False

    def forward(self, signals):
        signals = Variable(signals)
        conv_signal = self.conv(signals).data
        return conv_signal

class Filterbank(torch.nn.Module):
    def __init__(self, frame_size, stride):
        super(Filterbank, self).__init__()
        self.proj = Projection(frame_size, stride)

    def forward(self, signals):
        signals = torch.unsqueeze(signals, 1)
        signals_proj = self.proj(signals) ** 2
        return signals_proj

model = DataParallel(Network(frame_size, stride).cuda())
You shouldn’t attach parameters to an autograd.Function. Instead, pass them as arguments to forward. If you need them to be trainable, make them Parameters of a Module; otherwise, register them as buffers of a Module.
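A minimal sketch of the buffer approach described above, assuming a fixed sinusoidal filter bank like the one in the original post (the class and buffer names here are illustrative, not from the thread):

```python
import numpy as np
import torch

class FixedFilterbank(torch.nn.Module):
    """Non-trainable weights live in a buffer, so .cuda()/.to()
    and DataParallel move them along with the module."""
    def __init__(self, frame_size, stride):
        super().__init__()
        x = np.arange(frame_size)
        # one sinusoid per output channel, shape (frame_size // 2, frame_size)
        f = np.array([np.sin(x * 2 * k * np.pi / frame_size)
                      for k in range(frame_size // 2)], dtype=np.float32)
        self.stride = stride
        # buffer shape (out_channels, in_channels=1, kernel_size)
        self.register_buffer('weight', torch.from_numpy(f).unsqueeze(1))

    def forward(self, signals):
        # functional conv uses the buffer directly; no Parameter is created
        return torch.nn.functional.conv1d(signals, self.weight,
                                          stride=self.stride)

fb = FixedFilterbank(frame_size=8, stride=4)
out = fb(torch.randn(2, 1, 32))
print(out.shape)  # torch.Size([2, 4, 7])
```

Because `weight` is a buffer rather than a Parameter, it never shows up in `model.parameters()`, so the optimizer never touches it, yet it still follows the module across devices.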
@SimonW thanks for the reply!
I had the same error even when the parameters are defined as in the loss function below.
In this case, where the projection is defined inside the loss, should the “weights” be defined as buffers?
What about the attribute self.mel_basis? How should it be defined?
Sorry for the confusion. Let me answer it more clearly.
The problem is that a conv is initialized inside a Function, so its weights are not registered as part of an nn.Module. Doing .cuda() on the module therefore can’t change their location, and DataParallel can’t assign them to the correct GPU.
Since the projection is just a conv1d, I suggest not making an extra autograd.Function. Instead, do this:
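(The code that followed this reply isn’t preserved in this excerpt; the sketch below is one plausible reading of the suggestion, keeping the Conv1d as a registered submodule with frozen weights, with names mirroring the original post.)

```python
import numpy as np
import torch

class Projection(torch.nn.Module):  # a Module, not an autograd.Function
    def __init__(self, frame_size, stride):
        super().__init__()
        x = np.arange(frame_size)
        f = np.array([np.sin(x * 2 * k * np.pi / frame_size)
                      for k in range(frame_size // 2)], dtype=np.float32)
        self.conv = torch.nn.Conv1d(1, frame_size // 2,
                                    kernel_size=frame_size,
                                    stride=stride,
                                    bias=False)
        # install the fixed filter bank and freeze it
        self.conv.weight.data = torch.from_numpy(f).unsqueeze(1)
        self.conv.weight.requires_grad = False

    def forward(self, signals):
        return self.conv(signals)
```

Since self.conv is now a registered submodule, calling .cuda() on the enclosing Network (or wrapping it in DataParallel) moves its weights along with everything else.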
Thanks a lot for your help, @SimonW !
With the registered buffer for the loss function, i.e. mel_basis, what moves the buffer to the GPU? Note that mel_basis is not part of the model but part of a loss function.
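For what it’s worth, buffers travel with whichever Module registers them, so calling .cuda() (or .to(device)) on the loss module itself is what moves mel_basis. A minimal sketch, where the loss shape and the filter-bank values are placeholders rather than the actual code from this thread:

```python
import torch

class MelLoss(torch.nn.Module):
    def __init__(self, n_mels, n_freq):
        super().__init__()
        # illustrative placeholder for a real mel filter bank matrix
        self.register_buffer('mel_basis', torch.randn(n_mels, n_freq))

    def forward(self, pred, target):
        # compare spectra after projecting both through the mel basis
        return torch.mean((self.mel_basis @ pred
                           - self.mel_basis @ target) ** 2)

loss_fn = MelLoss(n_mels=4, n_freq=8)
# Because the loss is a Module, moving it moves its buffers too:
# loss_fn = loss_fn.cuda()        # or loss_fn.to(device)
loss = loss_fn(torch.randn(8, 5), torch.randn(8, 5))
```

The key point is that nothing moves the buffer automatically: the loss module has to be moved explicitly, just like the model.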