Alternative way to implement a linear layer with variable input parameters

Hi guys,
I want to add some linear layers after each convolutional output layer in YOLOv5. The problem I'm facing is that the input passed to my linear layer changes with every image, because the YOLO localization grid forwards each image with a different width and height. Also, I want to train everything on my GPU, which means I need to initialize my linear layers in the __init__ function of my class:
https://forums.pytorchlightning.ai/t/runtimeerror-but-found-at-least-two-devices-cpu-and-cuda-0/1634

import torch.nn as nn


class STN_8_8(nn.Module):
    def __init__(self, c1):
        super(STN_8_8, self).__init__()
        self.Linearity = nn.Sequential(
            nn.Linear(c1 * 16 * 16, 256),
            nn.ReLU(),
            nn.Linear(256, 128),
            nn.ReLU(),
            nn.Linear(128, 64),
            nn.ReLU(),
            nn.Linear(64, 32),
            nn.ReLU(),
            nn.Linear(32, 6)
        )

    # Spatial transformer network forward function
    def forward(self, x):
        print(f'Shape just focus {x.shape}')

        xs = x.view(-1, x.shape[1] * x.shape[2] * x.shape[3])
        theta = self.Linearity(xs)
        x = theta.view(-1, 2, 3)

        return x

The problem I'm facing is that the preceding network forwards an input to my network with a fixed batch size and number of channels, but the width and height keep changing.
The first image works fine and I have no problems. The tensor size of the first image is [100, 512, 16, 16].
As soon as my model gets the next image, torch.Size([100, 512, 8, 8]), I get this error:
"Traceback (most recent call last):\n", " File \"C:\\Users\\zacco\\Desktop\\Master\\Project_2022\\Project_gamma\\Ultralytics\\yolov5\\train.py\", line 677, in <module>\n", " main(opt)\n", " File \"C:\\Users\\zacco\\Desktop\\Master\\Project_2022\\Project_gamma\\Ultralytics\\yolov5\\train.py\", line 571, in main\n", " train(opt.hyp, opt, device, callbacks)\n", " File \"C:\\Users\\zacco\\Desktop\\Master\\Project_2022\\Project_gamma\\Ultralytics\\yolov5\\train.py\", line 352, in train\n", " pred = model(imgs) # forward\n", " File \"c:\\Users\\zacco\\anaconda3\\envs\\pytorch\\lib\\site-packages\\torch\\nn\\modules\\module.py\", line 1110, in _call_impl\n", " return forward_call(*input, **kwargs)\n", " File \"C:\\Users\\zacco\\Desktop\\Master\\Project_2022\\Project_gamma\\Ultralytics\\yolov5\\models\\yolo.py\", line 155, in forward\n", " return self._forward_once(x, profile, visualize) # single-scale inference, train\n", " File \"C:\\Users\\zacco\\Desktop\\Master\\Project_2022\\Project_gamma\\Ultralytics\\yolov5\\models\\yolo.py\", line 179, in _forward_once\n", " x = m(x) # run\n", " File \"c:\\Users\\zacco\\anaconda3\\envs\\pytorch\\lib\\site-packages\\torch\\nn\\modules\\module.py\", line 1110, in _call_impl\n", " return forward_call(*input, **kwargs)\n", " File \"C:\\Users\\zacco\\Desktop\\Master\\Project_2022\\Project_gamma\\Ultralytics\\yolov5\\models\\common.py\", line 60, in forward\n", " theta = self.Linearity(xs)\n", " File \"c:\\Users\\zacco\\anaconda3\\envs\\pytorch\\lib\\site-packages\\torch\\nn\\modules\\module.py\", line 1110, in _call_impl\n", " return forward_call(*input, **kwargs)\n", " File \"c:\\Users\\zacco\\anaconda3\\envs\\pytorch\\lib\\site-packages\\torch\\nn\\modules\\container.py\", line 141, in forward\n", " input = module(input)\n", " File \"c:\\Users\\zacco\\anaconda3\\envs\\pytorch\\lib\\site-packages\\torch\\nn\\modules\\module.py\", line 1110, in _call_impl\n", " return forward_call(*input, **kwargs)\n", " File \"c:\\Users\\zacco\\anaconda3\\envs\\pytorch\\lib\\site-packages\\torch\\nn\\modules\\linear.py\", line 103, in forward\n", " return F.linear(input, self.weight, self.bias)\n", "RuntimeError: mat1 and mat2 shapes cannot be multiplied (60x86528 and 32768x256)\n"

I understand the error, but I haven't found a way to achieve my goal of adding linear layers and using the GPU at the same time. This means I need a variable input size for my first linear layer.

I'd be really grateful if someone could give me an idea of how to deal with this problem.

What if you just added a Resize transform at the beginning of the forward method of your class STN_8_8?

def forward(self, x):
    print(f'Shape just focus {x.shape}')

    x = torchvision.transforms.Resize(16)(x)  # <-- newly added
    xs = x.view(-1, x.shape[1] * x.shape[2] * x.shape[3])
    theta = self.Linearity(xs)
    x = theta.view(-1, 2, 3)

    return x

Thank you for your answer. I'm afraid that if I resize the image, the model won't be trained correctly. That's my expectation, but I'm not sure.

Why is that your expectation? You may be right, but I’d love to understand more about your reasoning, so that we can perhaps suggest a better solution.

Your first Linear layer has an input shape of c1 * 16 * 16, which is the exact same amount of information available in an image with c1 channels and dimension 16x16. Resizing just down- (or up-samples) the information from your input, in a spatially coherent way, to this exact size. There’s just no way to jam more than c1 * 16 * 16 of information into the first Linear layer, given its shape.

A far more complicated (and, I'm guessing, unnecessary) solution, sketched right after this list, would be to:

  • have multiple Linearity models, each with their own initial linear layer, e.g. Linearity_8_8 has Linear(c1 * 8 * 8, 256) as its first layer while Linearity_16_16 has Linear(c1 * 16 * 16, 256) as its first layer
  • have all of these models share all subsequent layers / parameters
  • have an if / then inside your forward() pass that picks the appropriate Linearity model
  • and then make sure you batch the data properly, since all data inside one batch must use the same Linearity model under this structure. As a simple but computationally poor fix, you can just run with batch size 1.
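Roughly, I'm imagining something like this (just an untested sketch; the STN_Multi name and the fixed set of supported sizes are assumptions on my part, not something from your model):

import torch.nn as nn


class STN_Multi(nn.Module):
    # Hypothetical sketch: one input "head" per spatial size, shared trunk afterwards
    def __init__(self, c1, sizes=(8, 16, 26)):
        super().__init__()
        # one Linear head per expected H x W, keyed by the spatial size
        self.heads = nn.ModuleDict({
            str(s): nn.Linear(c1 * s * s, 256) for s in sizes
        })
        # the remaining layers are shared across all heads
        self.trunk = nn.Sequential(
            nn.ReLU(),
            nn.Linear(256, 128),
            nn.ReLU(),
            nn.Linear(128, 64),
            nn.ReLU(),
            nn.Linear(64, 32),
            nn.ReLU(),
            nn.Linear(32, 6),
        )

    def forward(self, x):
        xs = x.view(x.shape[0], -1)          # flatten to (N, C*H*W)
        head = self.heads[str(x.shape[2])]   # pick the head matching the input height
        theta = self.trunk(head(xs))
        return theta.view(-1, 2, 3)

The point is that everything after the first Linear is shared, so those weights get trained on every batch regardless of the spatial size.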

I'm really grateful that you're taking my problem seriously.
OK, so my goal is to take the output of YOLOv5, which consists of images of different sizes, and forward it to a new network; this network will apply some kind of distortion to the image and give it back. My problem is that the output of YOLO is not fixed, which means every image comes with a different output size.
So as soon as I run this line to train my model:

!python train.py --img 416 --batch 20 --epochs 5 --data data.yaml --cfg models/yolov5s_stn.yaml --weights ''

I get this output (the output is the print statement included in the forward function of my network):

This is how my forward function looks so that the training works:

class STN(nn.Module):
    def __init__(self, c1):
        super(STN, self).__init__()
        self.Linearity = nn.Sequential(
            nn.Linear(c1 * 26 * 26, 256),
            nn.ReLU(),
            nn.Linear(256, 128),
            nn.ReLU(),
            nn.Linear(128, 64),
            nn.ReLU(),
            nn.Linear(64, 32),
            nn.ReLU(),
            nn.Linear(32, 6)
        )

    # Spatial transformer network forward function
    def forward(self, x):
        print(f'Shape just focus {x.shape}')

        xs = x.view(-1, x.shape[1] * x.shape[2] * x.shape[3])
        # skip the STN for shapes the Linear stack cannot handle
        if (x.shape[0] == 1) | (x.shape[0] == 40) | (x.shape[2] != 26):
            return x

        theta = self.Linearity(xs)
        theta = theta.view(-1, 2, 3)
        grid = F.affine_grid(theta, x.size())
        x = F.grid_sample(x, grid, align_corners=True)
        return x

I just want an idea of how to apply the distortions to all the images without resizing them.
I like your idea of initializing a corresponding linear layer for each image shape, so for example if I have an 8 x 8 image I use c1 * 8 * 8.

Hi Zac - understood, you’re trying to do a spatial transformer. Just so I understand, why do you think it would be problematic to have the following flow:
(1) you make a copy of the original image
(2) then you resize the image and run it forward through your network
(3) you compute the affine grid on the resized image
(4) you resize the affine grid itself back up to the size of the original image
(5) you run grid_sample on the resized grid and the original image

Seems to me like this could work, but if you don’t think so I’d love to hear why.
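In code, I'm imagining the flow roughly like this (an untested sketch of your forward method; the stn_forward name, the 16x16 target size, and resizing the grid via T.Resize on a permuted tensor are my assumptions, not something from your model):

import torch.nn.functional as F
import torchvision.transforms as T

def stn_forward(self, x):
    # (1) keep the original feature map
    original = x
    # (2) resize to the fixed size the Linear stack expects and predict theta
    x_small = T.Resize((16, 16))(x)                 # 16x16 is just an example
    xs = x_small.view(x_small.shape[0], -1)
    theta = self.Linearity(xs).view(-1, 2, 3)
    # (3) compute the affine grid at the resized resolution
    grid = F.affine_grid(theta, x_small.size(), align_corners=True)
    # (4) resize the grid back up to the original resolution
    #     (the grid is (N, H, W, 2), so move the coordinate dim first and back)
    grid = T.Resize(original.shape[-2:])(grid.permute(0, 3, 1, 2)).permute(0, 2, 3, 1)
    # (5) sample the original, un-resized feature map with that grid
    return F.grid_sample(original, grid, align_corners=True)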

Having a bunch of different linear layers as I suggested previously could work too, but it feels like there's a lot of redundant training happening (since the different first layers aren't learning from one another), so the data isn't being used to its full potential during training.


Thank you. Yes, that's right, that's what I'm trying to do to integrate a spatial transformation. The only thing I didn't understand is point 5. The grid, i.e. the output of F.affine_grid, will be a tensor of shape (N, H, W, 2), which means torchvision.transforms.Resize will not be able to resize the grid, since that method only resizes the last two dimensions. I hope that what I am saying is correct. If not, could you please give me more information or a method for resizing the grid?

Actually, I don't think you need to resize the affine grid at the end; in other words, you can skip (4). Since affine_grid and grid_sample work in normalized [-1, 1] coordinates, grid_sample seems to run fine even when the grid and the image have different spatial sizes:

import torch
import torch.nn.functional as F
import torchvision
import torchvision.transforms as transforms
from PIL import Image

theta = torch.randn(1, 2, 3)
affine_grid = F.affine_grid(theta, (1, 3, 64, 64))

tensors = []
for size in (50, 100, 224):
    image = Image.open("three.png").resize((size, size))
    tensor_original = transforms.ToTensor()(image)
    tensors.append(F.grid_sample(tensor_original.unsqueeze(0), affine_grid, align_corners=True)[0])

transforms.ToPILImage()(torchvision.utils.make_grid(tensors))


I haven’t implemented this myself before, but it looks like this will behave normally.


Hi Andrei, thanks again for the quick reply. Actually, I do want to resize the affine_grid, because I'm convinced it's needed, and I can imagine why we would resize it at the end. However, leaving the grid unresized seems unreasonable to me, because I can't understand how the grid would fit correctly onto the non-resized/original image. I will try to implement both methods and keep you updated.
Actually, my forward function below seems to work without resizing the grid, but I can't understand how that could work:

# Spatial transformer network forward function
def forward(self, x):
    img = x
    transform = T.Resize((28, 28))
    transform_grid = T.Resize((x.shape[2], x.shape[3]))  # defined but not used, since the grid is not resized here

    img = transform(img)
    print(f'first focus {img.shape}')

    xs = img.view(-1, img.shape[1] * img.shape[2] * img.shape[3])
    theta = self.Linearity(xs)
    theta = theta.view(-1, 2, 3)

    grid = F.affine_grid(theta, x.size())
    x = F.grid_sample(x, grid, align_corners=True)
    return x

A naive question: what do you think about introducing global average pooling before the linear layers, so that the input to the linear layers is always a C-dimensional vector regardless of spatial resolution?


I'm not a hundred percent sure about my answer, but average pooling would in this case need variable parameters to fit the image's width and height every time an image is forwarded.

Imagine:
image one has the tensor [20, 128, 16, 16]
image two [20, 128, 26, 26]
image three [20, 128, 8, 8]

Which parameters would you set for the average pooling method you want to use?

There is adaptive average pooling available, where you can specify the output size as 1.

https://pytorch.org/docs/stable/generated/torch.nn.AdaptiveAvgPool2d.html
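For example, a quick sketch using the three tensor shapes from your post:

import torch
import torch.nn as nn

pool = nn.AdaptiveAvgPool2d(1)  # output is always (N, C, 1, 1), whatever H and W are

for shape in [(20, 128, 16, 16), (20, 128, 26, 26), (20, 128, 8, 8)]:
    x = torch.randn(shape)
    out = pool(x).flatten(1)    # (20, 128) in every case
    print(out.shape)

After the pooling and flatten, the first Linear layer could then just take c1 input features, independent of H and W.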

I believe the R-CNN variants such as Fast/Faster R-CNN (RoIPool, RoIAlign, etc.) do some kind of fixed-resolution adaptive pooling to overcome this problem of different spatial resolutions.

Your idea seems really great. One extra question: could AdaptiveAvgPool reshape the images to a size that is bigger than the actual size? For example, if we have an image with WxH == 7x7 and we would like to have 28x28, is this possible?

I’m not sure about the usage of adaptive pooling to increase the dimension. However, you could take a look at F.interpolate. It might achieve what you want to do.

https://pytorch.org/docs/stable/generated/torch.nn.functional.interpolate.html

So this is the result of not applying points 4 and 5, which means I didn't resize the grid. I only resized the localized image, then passed it to affine_grid and then to grid_sample.

I applied the same idea to the STN example provided by PyTorch and it seems to be working, so thank you very much.

Sorry, but I wasn't able to integrate the interpolation function. Could you please provide me with a simple example?

The following code shows how F.interpolate() works.
Please check whether this is suitable for your use case.

import torch
import torch.nn.functional as F

x = torch.arange(16).view(1, 1, 4, 4).float()   # a 4x4 input with values 0..15
y = F.interpolate(x, size=(8, 8), mode='bilinear', align_corners=False)  # upsampled to 8x8
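Here y has shape torch.Size([1, 1, 8, 8]); the same call works for any input spatial size, so you could use it, for example, to scale a 7x7 map up to 28x28.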