Train a model with images of any size and keep the resolution of layer outputs using dilation

I am trying to make my model work with images of any size, so I removed the flatten layer and use ConvTranspose2d and softmax instead.
I use dilation to keep the resolution of the image, and I don’t use an adaptive pooling layer because it fixes the output size to one.
When I run the code, exactly at the line:
loss = criterion(output, target) in the train function
I get the error:
“AttributeError: ‘tuple’ object has no attribute ‘log_softmax’”
These are the relevant parts of my code:

class O_FCasConv(nn.Module):
    def __init__(self, classes=4):
        super(O_FCasConv, self).__init__()
        self.layer1 = nn.Sequential(
            nn.Conv2d(in_channels=3, out_channels=64, kernel_size=3, stride=1, padding=2, dilation=8))
        self.layer2 = nn.Sequential(
            nn.Conv2d(64, 32, kernel_size=3, stride=1, padding=2, dilation=8),
            nn.MaxPool2d(kernel_size=2, stride=1))
        self.classifier = nn.ConvTranspose2d(32, 8, 3, stride=2, padding=2, output_padding=1, groups=classes, bias=False)
        self.softmax = nn.Softmax2d()
    def forward(self, x):
        out = self.layer1(x)
        out = self.layer2(out)
        y = self.classifier(out)
        return self.softmax(y), x

I also use

      criterion = nn.CrossEntropyLoss()
      optimizer = torch.optim.Adadelta(model.parameters(), lr=0.001, rho=0.9, eps=1e-06, weight_decay=0)

and this is my training function:

def train(train_loader):
        total_loss = 0
        total_size = 0
        for batch_idx, (data, target) in enumerate(train_loader):            
            data, target = Variable(data), Variable(target)
            output = model(data)
            loss = criterion(output, target)
            total_loss += loss.item()
            total_size += data.size(0)

There are a few issues in your code:

  • If you use nn.CrossEntropyLoss, you don’t need to call nn.Softmax on your model output, as the loss function will internally call nn.LogSoftmax and nn.NLLLoss. Just remove the self.softmax layer from your model.
  • Your model currently outputs two tensors, i.e. self.softmax(y) and x. I’m not sure why you need x in the output, but you should pass output[0] to criterion (after removing the softmax, of course).
  • Variables are deprecated since PyTorch 0.4.0. You can just use tensors in the latest stable release.
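
A minimal sketch of the first two points, using random stand-in tensors rather than your model (the batch size of 8, the names logits and model_output, and the dummy second tuple element are assumptions for illustration only). It shows that nn.CrossEntropyLoss on raw logits matches log_softmax followed by nll_loss, and that a tuple output has to be indexed before it is passed to the criterion:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

# Hypothetical stand-ins: 8 samples, 4 classes, raw (unnormalized) scores.
logits = torch.randn(8, 4)
target = torch.randint(0, 4, (8,))

criterion = nn.CrossEntropyLoss()

# CrossEntropyLoss expects raw logits ...
loss_ce = criterion(logits, target)
# ... because internally it applies log_softmax followed by nll_loss.
loss_manual = F.nll_loss(F.log_softmax(logits, dim=1), target)

# If forward() returns a tuple, index it before calling the criterion.
model_output = (logits, torch.zeros(1))  # mimics "return scores, x"
loss_from_tuple = criterion(model_output[0], target)

print(loss_ce.item(), loss_manual.item(), loss_from_tuple.item())
```

Passing the whole tuple to criterion is what raises the ‘tuple’ object has no attribute ‘log_softmax’ error.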

I modified the code, but now I get the error:
“ValueError: Expected input batch_size (8) to match target batch_size (32).”

I then changed the number of output channels of ConvTranspose2d to 32, but I get the error:
“ValueError: Expected target size (32, 76), got torch.Size([32])”

The shapes of target, data, and out are:
target shape torch.Size([32])
data shape torch.Size([32, 3, 64, 64])
out shape torch.Size([32, 32, 39, 39])

Assuming you are trying to classify your data into 4 classes, your model output should have the shape [batch_size, 4].
I’m not sure how nn.ConvTranspose2d should deal with your activation volumes, as it’ll increase the spatial size in your setup.
Could you explain your architecture a bit?

I’m familiar with using an adaptive pooling layer before the linear layer, or global pooling for fully convolutional models. However, I’m currently not sure how your model is supposed to work.
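
To make the shape problem concrete, here is a small sketch of the size arithmetic, using the output-size formulas from the PyTorch docs for Conv2d/MaxPool2d and ConvTranspose2d. The helper names conv_out and conv_transpose_out are my own, and the layer settings are copied from the posted model:

```python
def conv_out(size, kernel, stride=1, padding=0, dilation=1):
    """Output length of Conv2d / MaxPool2d along one spatial dimension."""
    return (size + 2 * padding - dilation * (kernel - 1) - 1) // stride + 1


def conv_transpose_out(size, kernel, stride=1, padding=0, output_padding=0, dilation=1):
    """Output length of ConvTranspose2d along one spatial dimension."""
    return (size - 1) * stride - 2 * padding + dilation * (kernel - 1) + output_padding + 1


size = 64                                                # input height/width
size = conv_out(size, kernel=3, padding=2, dilation=8)   # layer1 conv
size = conv_out(size, kernel=3, padding=2, dilation=8)   # layer2 conv
pooled = conv_out(size, kernel=2, stride=1)              # layer2 max pool
upsampled = conv_transpose_out(pooled, kernel=3, stride=2, padding=2, output_padding=1)

print(pooled, upsampled)

# For a 3x3 kernel with stride 1, padding == dilation keeps the resolution.
same_size = conv_out(64, kernel=3, padding=8, dilation=8)
print(same_size)
```

This gives 39 after the pooling layer, which matches the reported out shape, and 76 after the transposed convolution, which matches the "Expected target size (32, 76)" in the error. With padding=2 and dilation=8 each convolution shrinks the feature map by 12 pixels, so the resolution is not actually preserved; padding=8 would keep it at 64.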

My data has 4 classes, and each image is 64×64.
I am trying to use a conv layer instead of a fully connected layer.
I previously used an adaptive pooling layer, and it solved the problem of training the network with images of any size.
How can we make ConvTranspose2d do the job of

self.fc2 = nn.Linear(32, num_classes)

as in the paper “Dilated Residual Networks”?

If you look at Figure 2 in the paper, that is what I want to do.

Thanks for the paper!

Skimming through it, I think their method uses dilated convolutions to get a spatially bigger activation map in Group5 of a ResNet. However, for classification they still seem to use global average pooling.

From section 2:

The output of G5 in a DRN is 28 × 28. Global average pooling therefore takes in 2**4 times more values, which can help the classifier recognize objects that cover a smaller number of pixels in the input image and take such objects into account in its prediction.

Figure 2 is related to a localization use case, i.e. if you have some kind of localization/segmentation target.

Also, I couldn’t find any transposed convolutions in the paper, just dilated convolutions.
Is this a misunderstanding or how would you like to use transposed convolutions for this method?
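
If plain classification is the goal, one possible setup is sketched below: dilated convolutions with padding equal to the dilation (so the spatial size is kept), a 1x1 convolution that plays the role of nn.Linear(32, num_classes) at every location, and global average pooling to reduce the score map to [batch_size, num_classes]. The class name DilatedGAPNet, the dilation rates, and the ReLU layers are assumptions for illustration; this is not the architecture from the paper.

```python
import torch
import torch.nn as nn


class DilatedGAPNet(nn.Module):  # hypothetical name
    def __init__(self, num_classes=4):
        super().__init__()
        # padding == dilation keeps H and W unchanged for 3x3 kernels.
        self.features = nn.Sequential(
            nn.Conv2d(3, 64, kernel_size=3, padding=2, dilation=2),
            nn.ReLU(inplace=True),
            nn.Conv2d(64, 32, kernel_size=3, padding=4, dilation=4),
            nn.ReLU(inplace=True),
        )
        # 1x1 conv: a linear layer 32 -> num_classes applied per pixel.
        self.classifier = nn.Conv2d(32, num_classes, kernel_size=1)
        # Global average pooling works for any input resolution.
        self.pool = nn.AdaptiveAvgPool2d(1)

    def forward(self, x):
        x = self.features(x)        # [N, 32, H, W]
        x = self.classifier(x)      # [N, num_classes, H, W]
        x = self.pool(x)            # [N, num_classes, 1, 1]
        return torch.flatten(x, 1)  # [N, num_classes] raw logits


model = DilatedGAPNet(num_classes=4)
criterion = nn.CrossEntropyLoss()

# Two different input sizes go through the same model.
out_square = model(torch.randn(2, 3, 64, 64))
out_rect = model(torch.randn(2, 3, 50, 70))
loss = criterion(out_square, torch.tensor([0, 3]))
print(out_square.shape, out_rect.shape, loss.item())
```

The score map before the pooling layer has the same height and width as the input here, so it could also be used for a localization-style target, while the pooled logits match a target of shape [batch_size].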

Thank you very much for your cooperation.