RuntimeError: Given groups=1, weight of size 64 3 3 3, expected input[1, 1, 480, 640] to have 3 channels, but got 1 channels instead - error after first epoch

Hello,

I am getting this error after the first epoch: the model trains for the very first epoch and then raises the error. I am doing segmentation with my images as .jpg files and my masks as .png files. Please let me know the solution, if any. Thanks in advance.

RuntimeError: Given groups=1, weight of size 64 3 3 3, expected input[1, 1, 480, 640] to have 3 channels, but got 1 channels instead

Most likely some of your (validation) images are grayscale images, thus they only use a single channel.
You could check the shape of each loaded image and repeat the channel dimension to create an RGB image (which would of course still be gray, but would contain the necessary 3 input channels).
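For example, something along these lines could be used when loading the images (a sketch, not your actual Dataset code):

    from PIL import Image
    from torchvision import transforms

    to_tensor = transforms.ToTensor()

    def load_image(path):
        img = to_tensor(Image.open(path))  # tensor of shape [C, H, W]; C == 1 for a grayscale file
        if img.size(0) == 1:
            img = img.repeat(3, 1, 1)      # replicate the gray channel into 3 channels
        return img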

@ptrblck, Hi, I tried with this Dataset code:

from PIL import Image
from torch.utils.data import Dataset
from torchvision import transforms


class CustomDataset(Dataset):
    def __init__(self, image_paths, target_paths, transform):  # initial setup, e.g. the transforms
        self.image_paths = image_paths
        self.target_paths = target_paths
        # note: the passed-in transform argument is ignored and ToTensor is always applied
        self.transform = transforms.Compose([transforms.ToTensor()])

    def __getitem__(self, index):
        image = Image.open(self.image_paths[index])
        mask = Image.open(self.target_paths[index])
        t_image = self.transform(image)
        t_mask = self.transform(mask)
        return t_image, t_mask

    def __len__(self):  # return the number of samples
        return len(self.image_paths)

I have already tried .convert('RGB') on mask = Image.open(self.target_paths[index]), but that did not solve the problem. Could you let me know how I can solve it?

I don’t think the mask is the problem, but rather the input, since a conv layer is raising the issue.
Could you add the .convert call to t_image and retest it?
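For example, in __getitem__ (a sketch based on the dataset code above):

    def __getitem__(self, index):
        # convert the input image to RGB, so grayscale files also yield 3 channels
        image = Image.open(self.image_paths[index]).convert('RGB')
        mask = Image.open(self.target_paths[index])
        t_image = self.transform(image)
        t_mask = self.transform(mask)
        return t_image, t_mask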

@ptrblck, Yeah, I am testing it right now. One more question: if I convert my input to RGB, can I go ahead with any batch_size? Right now I am using batch_size = 1, but in some posts I found that the batch size can also create problems.

If you make sure that all images are using 3 channels and have the same spatial size, then you should be able to increase the batch size. Let me know if you encounter any issues with it.
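For example (a sketch; CustomDataset is taken from your post, while the path lists and batch size are placeholders):

    from torch.utils.data import DataLoader

    train_dataset = CustomDataset(train_image_paths, train_mask_paths, transform=None)
    train_loader = DataLoader(train_dataset, batch_size=16, shuffle=True)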

@ptrblck, the code is running for more epochs after I used .convert('RGB') for t_image, but the train and validation losses are giving 'nan' values. Could you help with this?

Could you check if all input samples contain valid and finite values via torch.isfinite(input)?
If you can't find an invalid input, please post your code so that we can have a look.
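For example, a single pass over the DataLoader before training would do (a sketch; train_loader is assumed to be your training loader):

    import torch

    for i, (images, masks) in enumerate(train_loader):
        if not torch.isfinite(images).all():
            print(f"non-finite values in image batch {i}")
        if not torch.isfinite(masks).all():
            print(f"non-finite values in mask batch {i}")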

@ptrblck, I am posting my code for training here…

epochs = 10
steps = 0
running_loss = 0
print_every = 125
train_losses, test_losses = [], []
criterion = nn.MSELoss(reduction='sum')
optimizer = t.optim.Adam(model.parameters(), 5e-4, (0.9, 0.999), eps=1e-08, weight_decay=1e-4)
for epoch in range(epochs):
    for images_train, masks_train in train_loader:
        steps += 1
        images_train = images_train.to(device)
        masks_train = masks_train.type(t.LongTensor)
        #masks_train.unsqueeze_(0)
        #masks_train = masks_train.repeat(3, 1, 1, 1)
        masks_train = masks_train.reshape(masks_train.shape[0], masks_train.shape[2], masks_train.shape[3])
        masks_train = masks_train.to(device)
        optimizer.zero_grad()
        _,logps = model(images_train)
        loss = criterion(logps.float(), masks_train.float())
        loss.backward()
        optimizer.step()
        running_loss += loss.item()
        
        if steps % print_every == 0:
            test_loss = 0
            accuracy = 0
            model.eval()
            with t.no_grad():
                for inputs_val, masks_val in test_loader:
                    inputs_val = inputs_val.to(device)
                    masks_val = masks_val.type(t.LongTensor)
                    #masks_val.unsqueeze_(0)
                    #masks_val = masks_val.repeat(3, 1, 1, 1)
                    masks_val = masks_val.reshape(masks_val.shape[0], masks_val.shape[2], masks_val.shape[3])
                    masks_val = masks_val.to(device)
                    _,logps1 = model(inputs_val)
                    batch_loss = criterion(logps1.float(), masks_val.float())
                    test_loss += batch_loss.item()
                    
            train_losses.append(running_loss/len(train_loader))
            test_losses.append(test_loss/len(test_loader))                    
            print(f"Epoch {epoch+1}/{epochs}.. "
                  f"Train loss: {running_loss/print_every:.3f}.. "
                  f"Test loss: {test_loss/len(test_loader):.3f}.. ")
            running_loss = 0
            model.train()

I am getting NaN values for the training and validation losses. Please let me know if you find any problem with this.

Could you post the shapes of images_train, masks_train, and logps as well as their types and the content of masks_train, please?

@ptrblck, The shapes are images_train: torch.Size([16, 3, 480, 640]), masks_train: torch.Size([16, 480, 640]), and logps: torch.Size([16, 480, 640]). When I check the values of masks_train, it shows all zero values.

@ptrblck My model is giving output tensors (logps) of shape torch.Size([16, 480, 640]). Could you tell me where it is going wrong?

Usually you would use nn.CrossEntropyLoss for a multi-class segmentation, which would therefore dictate the output shape as [batch_size, nb_classes, height, width] and the labels as [batch_size, height, width] containing the class indices.

Since you are using nn.MSELoss, could you explain the output shape of your model and how you achieved this shape?

Is this expected for this sample, i.e. does this sample only contain class0 pixels?
If not, then your mask creation might have some bugs.
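For reference, a minimal sketch of this shape convention (the class count and sizes are placeholders):

    import torch
    import torch.nn as nn

    criterion = nn.CrossEntropyLoss()
    # model output: [batch_size, nb_classes, height, width]
    output = torch.randn(16, 2, 480, 640)
    # target: [batch_size, height, width] containing class indices in [0, nb_classes - 1]
    target = torch.randint(0, 2, (16, 480, 640))
    loss = criterion(output, target)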

Thanks for your reply. My model is VGG16 with the last three FC layers removed. After the fifth conv block of VGG16, I upsample the output and resize it to (1, 480, 640).
Since I am computing the error between two images, I used MSELoss.

As you said, when I trained my network after adding .convert('RGB') for the input, the network trained and gave a finite loss, though the testing did not go well. But after I changed my batch_size, it is giving the following error:


RuntimeError Traceback (most recent call last)
in
15 masks_train = masks_train.to(device)
16 optimizer.zero_grad()
---> 17 _,logps = model(images_train)
18 loss = criterion(logps.float(), masks_train.float())
19 loss.backward()

~/yes/lib/python3.7/site-packages/torch/nn/modules/module.py in __call__(self, *input, **kwargs)
530 result = self._slow_forward(*input, **kwargs)
531 else:
--> 532 result = self.forward(*input, **kwargs)
533 for hook in self._forward_hooks.values():
534 hook_result = hook(self, input, result)

in forward(self, x)
11 self.decoder = nn.Conv2d(512,1,1,padding=0,bias=False)
12 def forward(self,x):
---> 13 e_x = self.encoder(x)
14 d_x = self.decoder(e_x)
15 #e_x = nn.functional.interpolate(e_x,size=(480,640),mode='bilinear',align_corners=False)

~/yes/lib/python3.7/site-packages/torch/nn/modules/module.py in __call__(self, *input, **kwargs)
530 result = self._slow_forward(*input, **kwargs)
531 else:
--> 532 result = self.forward(*input, **kwargs)
533 for hook in self._forward_hooks.values():
534 hook_result = hook(self, input, result)

~/yes/lib/python3.7/site-packages/torch/nn/modules/container.py in forward(self, input)
98 def forward(self, input):
99 for module in self:
--> 100 input = module(input)
101 return input
102

~/yes/lib/python3.7/site-packages/torch/nn/modules/module.py in __call__(self, *input, **kwargs)
530 result = self._slow_forward(*input, **kwargs)
531 else:
--> 532 result = self.forward(*input, **kwargs)
533 for hook in self._forward_hooks.values():
534 hook_result = hook(self, input, result)

~/yes/lib/python3.7/site-packages/torch/nn/modules/conv.py in forward(self, input)
343
344 def forward(self, input):
--> 345 return self.conv2d_forward(input, self.weight)
346
347 class Conv3d(_ConvNd):

~/yes/lib/python3.7/site-packages/torch/nn/modules/conv.py in conv2d_forward(self, input, weight)
340 _pair(0), self.dilation, self.groups)
341 return F.conv2d(input, weight, self.bias, self.stride,
--> 342 self.padding, self.dilation, self.groups)
343
344 def forward(self, input):

RuntimeError: CUDA out of memory. Tried to allocate 2.34 GiB (GPU 0; 10.73 GiB total capacity; 9.14 GiB already allocated; 299.25 MiB free; 9.33 GiB reserved in total by PyTorch).

Please let me know if there is a solution.

You are running out of memory, so you could either decrease the batch size again or trade compute for memory using torch.utils.checkpoint.
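A sketch of the checkpoint approach, reusing the self.encoder / self.decoder names from the posted forward (not your actual model code):

    from torch.utils.checkpoint import checkpoint

    def forward(self, x):
        # activations of the checkpointed segment are recomputed during backward
        # instead of being stored, trading compute for memory
        e_x = checkpoint(self.encoder, x)
        d_x = self.decoder(e_x)
        return d_x

Note that the checkpointed segment only receives gradients if at least one of its inputs requires grad, so you might need x.requires_grad_() when checkpointing the very first block.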

Thanks. But this network only trains with batch_size = 1, and even then it gives finite loss values for the first epoch and NaN values after that. I am not able to sort out the problem. Could you find any issue with the training part of the code I posted earlier?

Did you check the inputs for NaNs or Infs?
If so, are you expecting (some) masks to be all zero?

Sorry, I have not checked the inputs for NaNs or Infs. Can I do that during training or during the data loading? As far as I can see, there are no NaNs in the data. Also, the inputs are .jpg files and the masks are .png files.

@ptrblck Hello, I have tried anomaly detection and got this error:


RuntimeError Traceback (most recent call last)
in
18 _,logps = model(images_train)
19 loss = criterion(logps.float(), masks_train.float())
---> 20 loss.backward()
21 optimizer.step()
22 running_loss += loss.item()

~/yes/lib/python3.7/site-packages/torch/tensor.py in backward(self, gradient, retain_graph, create_graph)
193 products. Defaults to False.
194 """
--> 195 torch.autograd.backward(self, gradient, retain_graph, create_graph)
196
197 def register_hook(self, hook):

~/yes/lib/python3.7/site-packages/torch/autograd/__init__.py in backward(tensors, grad_tensors, retain_graph, create_graph, grad_variables)
97 Variable._execution_engine.run_backward(
98 tensors, grad_tensors, retain_graph, create_graph,
---> 99 allow_unreachable=True) # allow_unreachable flag
100
101

RuntimeError: Function 'MseLossBackward' returned nan values in its 0th output.

Please let me know if there is any solution for it.

Check the input to the criterion for invalid values via torch.isnan and torch.isinf.
The model output or target could contain these values.
Once we know which tensor it is, we can debug further.
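For example, right before the criterion call in your training loop (a sketch using the names from the posted code):

    out = logps.float()
    tgt = masks_train.float()
    if torch.isnan(out).any() or torch.isinf(out).any():
        print("invalid values in the model output")
    if torch.isnan(tgt).any() or torch.isinf(tgt).any():
        print("invalid values in the target")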