Style Transfer - how to choose the CNN layers?

I read this tutorial:
Neural Transfer Using PyTorch — PyTorch Tutorials 1.12.0+cu102 documentation

In that tutorial they chose:

  • conv layer #4 for content_layers
  • conv layers: 1, 2, 3, 4, 5 for style_layers
  1. Why did they choose conv layer #4 for both content_layers and style_layers?
  2. I read that content_layers should be deep conv layers,
    so how can conv layer #5 be chosen for style_layers, given that it is deeper than conv layer #4, which was chosen for content_layers?
  3. I saw another medium post about Style Transfer:
    Style Transfer using Pytorch. I have recreated the style transfer… | by Alex Diaz | Analytics Vidhya | Medium
    In that post the author chose different conv layers for content_layers and style_layers.
    Are there any rules or ideas behind which conv layers we choose?
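For context, here is a minimal sketch of how I understand the tutorial's setup (the layer names follow the tutorial's conv_1 … conv_5 renaming of VGG-19's conv layers; the random tensor below is just a stand-in for real activations):

```python
import torch

# Layer choices as in the tutorial (a sketch, using the tutorial's
# conv_1 ... conv_5 naming for VGG-19's conv layers):
content_layers = ['conv_4']
style_layers = ['conv_1', 'conv_2', 'conv_3', 'conv_4', 'conv_5']

def gram_matrix(features):
    # Style target at a layer: channel-channel correlations of its
    # activations, which discard the spatial layout of the image.
    b, c, h, w = features.shape
    flat = features.view(b * c, h * w)
    return flat @ flat.t() / (b * c * h * w)

feats = torch.rand(1, 64, 32, 32)  # stand-in for one layer's activations
G = gram_matrix(feats)
print(G.shape)  # one correlation value per pair of channels
```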

Hi Amit -

I doubt you will find too much theory on this since it will often come down to the specific architecture and images the model was trained on.

However, I recommend you look into feature visualization (there are several approaches here which you’ll find after a quick search) and simply spend a little time visualizing each of your layers. After a little while it should become apparent which is more “style” and which is more “content”.
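A minimal sketch of that advice, assuming plain PyTorch forward hooks (the tiny two-conv model is just a stand-in — in practice you'd hook the conv layers of torchvision's pretrained VGG-19 `features` module):

```python
import torch
import torch.nn as nn

# Stand-in model; substitute a pretrained VGG-19's features in practice.
model = nn.Sequential(
    nn.Conv2d(3, 8, 3, padding=1), nn.ReLU(),
    nn.Conv2d(8, 16, 3, padding=1), nn.ReLU(),
)

activations = {}

def save_activation(name):
    def hook(module, inputs, output):
        activations[name] = output.detach()
    return hook

# Register a hook on every conv layer so a single forward pass
# captures all the feature maps we want to inspect.
for idx, layer in enumerate(model):
    if isinstance(layer, nn.Conv2d):
        layer.register_forward_hook(save_activation(f'conv_{idx}'))

img = torch.rand(1, 3, 64, 64)  # stand-in for a preprocessed photo
model(img)

for name, feat in activations.items():
    print(name, feat.shape)  # e.g. plot feat[0, :4] to see the first 4 feature maps
```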

Good luck!

I took your advice and looked into feature visualization:

The source image was:

According to the article in: Style Transfer using Pytorch. I have recreated the style transfer… | by Alex Diaz | Analytics Vidhya | Medium
the content layer is layer #21 (conv4_2); here are its first 4 features:

The other conv layers are style layers.
The first 4 features of the first 3 layers:

  • It seems that there is not much difference between those layers (style and content).
  • To me it seems that we can extract the outlines of the objects in an image from both the style and the content layers (not just from the content layer, as you can see above).

So what am I missing or misunderstanding?

Cool visualizations! I think what you’re seeing here is which area of the image activates the different layers the most, which doesn’t exactly help you understand what the layer is looking for / what feature it is matching on. This article is a helpful addition to yours with more info.

Broadly, there are at least a couple of approaches you can take to visualizing features the way you are likely looking to do:
(1) go through an actual image dataset, find example image patches that achieve peak activation of a particular neuron, and combine those in some way
(2) take a gradient-ascent approach of transforming a randomly initialized image in the direction that maximally activates a particular neuron. This can suffer from high-frequency artifacts and look funky, so you may have to regularize for that (e.g. by penalizing high-frequency patterns)
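A minimal sketch of approach (2), assuming plain PyTorch (the tiny untrained model is a placeholder for a pretrained network, and the total-variation term is one common way to penalize the high-frequency patterns mentioned above):

```python
import torch
import torch.nn as nn

# Placeholder network; substitute a pretrained VGG in practice.
model = nn.Sequential(
    nn.Conv2d(3, 8, 3, padding=1), nn.ReLU(),
    nn.Conv2d(8, 16, 3, padding=1), nn.ReLU(),
)
for p in model.parameters():
    p.requires_grad_(False)  # only the image is optimized, not the weights

img = torch.rand(1, 3, 64, 64, requires_grad=True)  # randomly initialized image
optimizer = torch.optim.Adam([img], lr=0.05)
target_channel = 3  # which feature map we want to "see"

def tv_penalty(x):
    # Total variation: penalize differences between neighboring pixels,
    # which damps high-frequency artifacts.
    return (x[..., 1:, :] - x[..., :-1, :]).abs().mean() + \
           (x[..., :, 1:] - x[..., :, :-1]).abs().mean()

for step in range(50):
    optimizer.zero_grad()
    acts = model(img)
    # Gradient ascent on the channel's mean activation (hence the minus sign).
    loss = -acts[0, target_channel].mean() + 0.1 * tv_penalty(img)
    loss.backward()
    optimizer.step()

print(img.shape)  # the optimized image now (weakly) drives that channel
```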

I don’t think that there’s a single way to do it that will always work, which is perhaps why there isn’t a good package that does this for you out of the box. Interpretability is extremely exciting, and if you build anything cool and want to share it, the community will appreciate it!