Dynamic Structure of CNN

Wafaa_Wardah · May 22, 2019, 3:04am

I want to have a CNN where I can have flexible kernel_size and in and out_feature dimensions. How do I do this? This may have an easy answer, but I had trouble… For example, I have datasets with images of dimensions 5x38, 9x38, … , 35x38 etc. I’m not sure how to accommodate the different heights. Note that I run one dataset at a time, so the model doesn’t get mixed dimensions.

heights = [5, 9, 15, 25, 35]
kernel_sizes = [3, 5, 7]

for height in heights:
    for kernel_size in kernel_sizes:
        model = my_dynamic_model(height, kernel_size)
        trainset = Dataset.my_dynamic_dataset(height)

I want to have something like the above. That looks like grid search, but I’m doing Bayesian Optimization and need a similar way of setting the height and kernel_size.

Please help with suggestions.

Thank you so much.

LeviViana · May 22, 2019, 7:48am

The easiest way for you to get something running is just by resizing all the images. Note that you don’t necessarily need to warp them, you can simply pad the smaller images with zeros.

As for the kernel sizes, it is much easier and probably more efficient to use only small kernel sizes (say 3x3) and a deeper CNN. In fact, you can simulate a receptive field of arbitrary size with only 3x3 kernels by going deeper in your architecture. For instance, here is an illustration of how to get a 5x5 receptive field by stacking two 3x3 convolutions:

[image: example]

You save parameters by doing this: 2 x 3 x 3 = 18 for two stacked 3x3 convs versus 1 x 5 x 5 = 25 for one 5x5 conv.
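
The count above can be checked with a short PyTorch sketch. It assumes a single input and output channel and no bias, which is the setting the 18-versus-25 figure corresponds to; the variable and helper names are illustrative only.

```python
import torch
import torch.nn as nn

# Two stacked 3x3 convolutions (single channel, no bias: an assumption
# made so that the counts match the 18 vs 25 figure quoted above).
stacked = nn.Sequential(
    nn.Conv2d(1, 1, kernel_size=3, bias=False),
    nn.Conv2d(1, 1, kernel_size=3, bias=False),
)
single = nn.Conv2d(1, 1, kernel_size=5, bias=False)


def count_params(module):
    """Total number of learnable values in a module."""
    return sum(p.numel() for p in module.parameters())


stacked_params = count_params(stacked)
single_params = count_params(single)

# On a 5x5 input both variants reduce to a single output value, i.e. each
# output pixel depends on a 5x5 patch of the input.
x = torch.randn(1, 1, 5, 5)
stacked_shape = tuple(stacked(x).shape)
single_shape = tuple(single(x).shape)
print(stacked_params, single_params, stacked_shape, single_shape)
```

With C input and C output channels per layer, the weight counts become 18·C² and 25·C², so the ratio stays the same.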

Maybe I didn’t understand why you need this dynamic architecture. But if the only reason is the image sizes, I’d recommend trying this first; it should work just fine.

alex.veuthey · May 22, 2019, 7:57am

I don’t think there’s actually a way of doing it properly. A traditional CNN has fixed kernel sizes, so that you can train every weight at the same time. This ensures that the model is consistent. Assuming that you have a maximum number of kernels and a maximum kernel size, if you train only some of the kernels, and only parts of those kernels, it probably breaks the global behaviour of the model.

If you have variable input sizes, I would consider using RNNs instead of CNNs (or combinations of the two). If you have a maximum input size, I think padding with 0s and/or resizing images means that the activations won’t influence the result.

What kind of data are you working with?

Wafaa_Wardah · May 22, 2019, 8:06am

Thank you for your response. I’m wanting to get a few models out of this (simplified for loop code above) and then compare their results. As for my data, it is protein data which is basically sequence of amino acid data. I’m trying to use a sliding window to create image-like inputs. So the height of an image is the number of consecutive amino acid data.
Example: 5x38 image is 5 consecutive amino acid records of 38 features. Hope that makes sense.

alex.veuthey · May 22, 2019, 8:15am

Oh all right! In that case, creating and training different models for each input size is the expected approach, and you just want to create models more automatically than defining each layer’s size by hand… I understand now.

Your approach should work; it all depends on how your model is defined. If you don’t use padding to preserve the spatial size of each layer’s input (in ResNet, for example, the features at the end of the network are much smaller than the input images), I think you will probably need to figure out a formula for handling different heights depending on your model’s layers.
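
For reference, the per-layer formula in question is the one given in the PyTorch nn.Conv2d documentation. The symbols p (padding), d (dilation), k (kernel size) and s (stride) are notation introduced here, not taken from the thread; the same expression applies to the width.

```latex
H_{\text{out}} = \left\lfloor \frac{H_{\text{in}} + 2p - d\,(k - 1) - 1}{s} \right\rfloor + 1
```

For example, with H_in = 5, k = 3, p = 0, d = 1 and s = 1 this gives H_out = 3.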

The dataset could also handle inputs dynamically, depending on the code…

Wafaa_Wardah · May 22, 2019, 8:22am

I have implemented the dataset class and that works fine. I’ve also got the full 10-fold cross-validation experiment, which automatically tries different hyperparameters (learning rate, batch_size, num_epochs), set up and running fine. I now want to add window_size (image height) and kernel_size (I might eliminate this and stick to 3x3). But I don’t know how to calculate the layer sizes when defining the conv and fc layers of the CNN class if the height and perhaps kernel_size are dynamic.
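
One common way to avoid computing these numbers by hand is to push a dummy tensor through the convolutional part once and read off the flattened size. The sketch below is only an illustration under assumptions: the class name SketchCNN, its layer choices and its defaults are hypothetical and not the poster's model.

```python
import torch
import torch.nn as nn


class SketchCNN(nn.Module):
    """Hypothetical model: name, layers and defaults are illustrative only."""

    def __init__(self, height, width=38, kernel_size=3, num_kernels=8):
        super().__init__()
        self.features = nn.Sequential(
            # padding = kernel_size // 2 keeps H and W unchanged for odd kernels
            nn.Conv2d(1, num_kernels, kernel_size, padding=kernel_size // 2),
            nn.ReLU(),
            nn.MaxPool2d(2),
        )
        # Push a dummy batch through the conv part to discover the flattened size.
        with torch.no_grad():
            dummy = torch.zeros(1, 1, height, width)
            self.flat_features = self.features(dummy).flatten(1).size(1)
        self.fc = nn.Linear(self.flat_features, 2)

    def forward(self, x):
        # x: (batch, height, width) -> add a channel dimension
        x = self.features(x.unsqueeze(1))
        return self.fc(x.flatten(1))


# One model per (height, kernel_size) pair, as in the loop from the first post.
output_shapes = {}
for height in [5, 9, 15, 25, 35]:
    for kernel_size in [3, 5, 7]:
        model = SketchCNN(height, kernel_size=kernel_size)
        out = model(torch.randn(4, height, 38))
        output_shapes[(height, kernel_size)] = tuple(out.shape)
```

The dummy pass costs one tiny forward call at construction time and stays correct if layers are later added or removed, which a hand-written size formula would not.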

alex.veuthey · May 22, 2019, 8:33am

For the kernels, just change the kernel_size argument of nn.Conv2d.

Now, this and the changing image sizes will change the output sizes quite a lot. I would suggest experimenting with the different input sizes to see what breaks in your network, then changing the sizes accordingly, by hand.

Then when you have it figured out, change them automatically with the image size you give to the network.

If you have a ResNet-like network, the number of inputs of the FC will be different, and that should break when trying different sizes (kernel and image). It might be best for performance to tune the convolution layers to output a constant FC input size. Try different numbers of layers, numbers of filters and kernel sizes; you can also experiment with non-square kernels depending on the image size… There are infinitely many possibilities here!
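
One possible way to get a constant FC input size without tuning each layer is adaptive pooling. This is a swapped-in technique (nn.AdaptiveAvgPool2d), not something proposed in the thread, and the class name and the pooled size of 4 x 8 below are arbitrary assumptions.

```python
import torch
import torch.nn as nn


class FixedHeadCNN(nn.Module):
    """Hypothetical model; the pooled size (4, 8) is an arbitrary choice."""

    def __init__(self, num_kernels=16):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, num_kernels, kernel_size=3, padding=1),
            nn.ReLU(),
        )
        # Pools any (H, W) feature map down to exactly 4 x 8.
        self.pool = nn.AdaptiveAvgPool2d((4, 8))
        self.fc = nn.Linear(num_kernels * 4 * 8, 2)

    def forward(self, x):
        # x: (batch, height, width) -> add a channel dimension
        x = self.pool(self.conv(x.unsqueeze(1)))
        return self.fc(x.flatten(1))


# The same model instance accepts every window height from the first post.
model = FixedHeadCNN()
shapes = [tuple(model(torch.randn(3, h, 38)).shape) for h in [5, 9, 15, 25, 35]]
```

Whether averaging the features this aggressively helps or hurts on a given dataset is an empirical question; the sketch only shows that the FC size no longer depends on the input height.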

Wafaa_Wardah · May 22, 2019, 10:43am

Thank you for your help. I agree a kernel size of 3 is where I probably want to be. Also, I think I should be more interested in the number of kernels (I mean channels_out - is that the correct understanding?) rather than the kernel size.

Also, this question is what I was trying to ask it seems.

Thank you again.

alex.veuthey · May 22, 2019, 11:07am

Indeed the number of kernels is the out_channels parameter of the nn.Conv2d layers. Increasing this value will result in more filters/kernels and therefore larger models with larger capacity, which might be what works for your data.
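
As a rough illustration of how out_channels drives model size, the sketch below counts the parameters of a single nn.Conv2d layer for a few values. The helper name conv_param_count and the chosen values are assumptions for the example.

```python
import torch.nn as nn


def conv_param_count(out_channels, in_channels=1, kernel_size=3):
    """Count weights and biases of one nn.Conv2d layer (helper name is illustrative)."""
    conv = nn.Conv2d(in_channels, out_channels, kernel_size)
    return sum(p.numel() for p in conv.parameters())


# Each filter holds in_channels * 3 * 3 weights plus one bias.
counts = {n: conv_param_count(n) for n in (8, 16, 32)}
print(counts)
```

The count grows linearly with out_channels for a single layer; when two such layers are stacked, the second layer's in_channels grows too, so its size grows quadratically.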

LeviViana · May 22, 2019, 11:32am

You’ll need to make your CNN go deeper as well in order to be capable of learning patterns bigger than 3x3 patches. Don’t hesitate to ask if it isn’t clear.
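
To make the depth requirement concrete, here is a small sketch of how the receptive field grows with depth. It assumes stride-1, undilated convolutions with no pooling in between; the function name is illustrative.

```python
def receptive_field(num_layers, kernel_size=3):
    """Receptive field (in pixels, per axis) of stacked stride-1, undilated convs."""
    rf = 1
    for _ in range(num_layers):
        # Every extra layer widens the field by (kernel_size - 1) pixels.
        rf += kernel_size - 1
    return rf


fields = [receptive_field(n) for n in range(1, 5)]
print(fields)
```

Under these assumptions, covering a full 35-row window with 3x3 kernels alone would take 17 layers, which is one reason pooling or strides are often added to grow the receptive field faster.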

Wafaa_Wardah · May 22, 2019, 1:25pm

Thank you so much.

I managed to make it work. Here is what I have:

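import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import Dataset, DataLoader

class dynamic_dataset(Dataset):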
    def __init__(self, WS):
        self.len = 500
        self.WS = WS
        
    def __getitem__(self, index):
        image = torch.randn(self.WS, 38)   # random stand-in for one WS x 38 window
        label = torch.randint(2, (1,))     # upper bound is exclusive: 2 gives labels in {0, 1}
        return image, label
        
    def __len__(self):
        return self.len

class dynamic_model(nn.Module):
    def __init__(self, H_in, W_in, num_kernels):
        super(dynamic_model, self).__init__()
        
        C_in_1, C_out_1     = 1, num_kernels
        kernel_size_1       = 3
        H_out_1, W_out_1    = self.conv_output_shape((H_in, W_in), kernel_size=kernel_size_1) # W_in = 38
        
        C_in_2, C_out_2     = C_out_1, num_kernels
        kernel_size_2       = 1
        H_out_2, W_out_2    = self.conv_output_shape((H_out_1, W_out_1), kernel_size=kernel_size_2)
        
        self.layer1 = nn.Sequential(
            nn.Conv2d(C_in_1, C_out_1, kernel_size=kernel_size_1, stride=1, padding=0),
            nn.ReLU())
        
        self.layer2 = nn.Sequential(
            nn.Conv2d(C_in_2, C_out_2, kernel_size=kernel_size_2, stride=1, padding=0),
            nn.ReLU())        
        
        self.fc1 = nn.Linear(C_out_2 * H_out_2 * W_out_2, C_out_2 * H_out_2 * W_out_2)
        self.fc2 = nn.Linear(C_out_2 * H_out_2 * W_out_2, 2)
        
    def forward(self, x):
        x = x.unsqueeze(1)
        x = self.layer1(x)
        x = self.layer2(x)
        x = x.view(x.size(0), -1)
        # no activation between fc1 and fc2, so together they act as a single linear map
        x = self.fc1(x)
        x = self.fc2(x)
        return x

    def conv_output_shape(self, h_w, kernel_size=1, stride=1, pad=0, dilation=1):
        # Output (height, width) of a conv layer, per the formula in the nn.Conv2d docs.
        if not isinstance(kernel_size, tuple):
            kernel_size = (kernel_size, kernel_size)
        # Integer floor division replaces floor(x / stride) for integer inputs.
        h = (h_w[0] + 2 * pad - dilation * (kernel_size[0] - 1) - 1) // stride + 1
        w = (h_w[1] + 2 * pad - dilation * (kernel_size[1] - 1) - 1) // stride + 1
        return h, w

# simplified experiment setup
if __name__ == '__main__':

    lr          = 0.001
    WS          = 25
    num_epochs  = 3
    batch_size  = 128
    num_kernels = 32

    model = dynamic_model(WS, 38, num_kernels)
    criterion = nn.CrossEntropyLoss()
    optimizer = optim.Adam(model.parameters(), lr=lr)
    trainset = dynamic_dataset(WS=WS)
    trainloader = DataLoader(trainset, batch_size=batch_size)

    for i, (images, labels) in enumerate(trainloader):
        labels = labels.squeeze(1)
        optimizer.zero_grad()
        outputs = model(images)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()
        print(i, loss.item())

Sorry for the long code snippet. Could you please point out anything in my model structure that I could improve on? Layer 1 and layer 2 kernel_size?

Really appreciate all your help.

Thank you

LeviViana · May 22, 2019, 7:54pm

You don’t need the first self.fc1 statement. Otherwise, I think it is fine.

Wafaa_Wardah · May 22, 2019, 9:24pm

Oh, thank you. I’m sorry, I had an error in my code earlier. I have just fixed that. I meant to have fc1 and fc2. Do you think I should have just one fc layer, or are 2 fine? Thank you in advance.

alex.veuthey · May 23, 2019, 6:29am

The most recent successful architectures use only 1 or even no FC in general. The reason is that they are very costly (see this answer for a detailed explanation).

But of course it always depends on the application! So try different things (1, 2 FCs, maybe even without FC entirely, but then you need to figure out a way to output what you want from only convolutions) and see what works.
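
As one possible way to get class scores from convolutions only, the sketch below uses a 1x1 convolution with one output channel per class followed by a global average over the spatial dimensions. This is an assumption about what "without FC" could look like, not the method used in the thread; the class name and sizes are hypothetical.

```python
import torch
import torch.nn as nn


class ConvOnlyClassifier(nn.Module):
    """Hypothetical FC-free model: a 1x1 conv produces one map per class."""

    def __init__(self, num_kernels=16, num_classes=2):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(1, num_kernels, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv2d(num_kernels, num_classes, kernel_size=1),
        )

    def forward(self, x):
        x = self.body(x.unsqueeze(1))   # (batch, num_classes, H, W)
        return x.mean(dim=(2, 3))       # global average -> (batch, num_classes)


model = ConvOnlyClassifier()
logits_small = model(torch.randn(4, 5, 38))    # 5 x 38 windows
logits_large = model(torch.randn(4, 35, 38))   # 35 x 38 windows
```

Because nothing in this model depends on the spatial size, the same weights accept any window height, and the output can be passed to nn.CrossEntropyLoss as usual.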

LeviViana · May 23, 2019, 7:56am

Moreover, you probably want to learn some translation-invariant features, especially if you are dealing with classification. In this case, it would be useful to try stacking max-pooling layers after the convolutions. It will reduce the number of parameters as well, since the fully connected layers then receive smaller feature maps.
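
A minimal sketch of the size effect, assuming a 25x38 input, one unpadded 3x3 convolution with 32 filters, and a 2x2 max-pool (all illustrative choices): the pooling layer itself has no parameters, but it shrinks the tensor that a following fully connected layer has to consume.

```python
import torch
import torch.nn as nn


def flattened_size(features, height=25, width=38):
    """Flattened feature count for one (height x width) single-channel input."""
    with torch.no_grad():
        return features(torch.zeros(1, 1, height, width)).flatten(1).size(1)


conv_only = nn.Sequential(nn.Conv2d(1, 32, kernel_size=3), nn.ReLU())
conv_pool = nn.Sequential(nn.Conv2d(1, 32, kernel_size=3), nn.ReLU(), nn.MaxPool2d(2))

size_without_pool = flattened_size(conv_only)   # 32 * 23 * 36
size_with_pool = flattened_size(conv_pool)      # 32 * 11 * 18
print(size_without_pool, size_with_pool)
```

An nn.Linear layer mapping these features to 2 classes would need roughly four times fewer weights in the pooled case.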

alex.veuthey · May 23, 2019, 7:59am

Convolution is translation invariant, unless you have the same input and kernel size (and the input is cropped from the original content, meaning that you can actually translate the input which will change the end result, otherwise no translation is possible).

LeviViana · May 23, 2019, 8:01am

Nope, convolutions are translation equivariant.

alex.veuthey · May 23, 2019, 8:32am

Right, my bad, wrong terms. I’ve seen invariant used instead of equivariant too many times.

What I meant to say is that the processed image will keep the same translation as the input, so the pooling indeed helps to keep the end results similar to the non-translated input, you’re right. Especially with small input sizes.
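
The difference between equivariance and invariance can be checked numerically. The sketch below is an illustration under assumptions: it uses circular padding so that a shift wraps around and the equality holds exactly at the borders, and it uses a global max over the whole feature map rather than a local MaxPool2d, which would only give approximate invariance to small shifts.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
# Circular padding is an assumption made here so that shifts wrap around
# and the equivariance holds exactly, including at the borders.
conv = nn.Conv2d(1, 4, kernel_size=3, padding=1, padding_mode="circular")

x = torch.randn(1, 1, 9, 38)
x_shifted = torch.roll(x, shifts=2, dims=3)   # shift 2 pixels along the width

with torch.no_grad():
    y = conv(x)
    y_shifted = conv(x_shifted)

# Equivariance: shifting the input shifts the feature map by the same amount.
equivariant = torch.allclose(y_shifted, torch.roll(y, shifts=2, dims=3), atol=1e-5)

# Invariance after global max pooling: the pooled values do not change.
pooled = y.flatten(2).max(dim=2).values
pooled_shifted = y_shifted.flatten(2).max(dim=2).values
invariant = torch.allclose(pooled, pooled_shifted, atol=1e-5)

# The raw feature maps themselves differ, so the conv alone is not invariant.
maps_equal = torch.allclose(y, y_shifted, atol=1e-5)
```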

Wafaa_Wardah · May 23, 2019, 12:16pm

Thank you so much… I will add in some MaxPool2d and see how it goes. Really appreciate all your input.