CNN for getting and representing superpixels

Hi,
Is there any CNN that can learn a feature representation for each superpixel of an image? Usually, we use the SLIC algorithm to obtain the superpixels, compute the HOG of the image, and then associate a HOG vector with each superpixel. However, HOG features are hand-crafted. I would like to associate a learned feature representation with each superpixel instead. So my question is: is there any CNN that can learn both the superpixels themselves and a feature representation for each of them?

You may say: just use Faster R-CNN to localize objects, and you get the bounding boxes and their feature representations. However, that doesn’t apply to my case. I don’t necessarily have objects, only patches.

Thank you

You may try to 0-pad each patch to a rectangle, but the patches will still have different spatial dimensions, so you won’t be able to stack them into a mini-batch for training.

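A workaround for this is feeding the images one by one and summing the gradients, for example:

for epoch in range(EPOCHS):
    optimizer.zero_grad()
    for i, (img, target) in enumerate(images, start=1):
        out = net(img)                  # one image (batch of size 1) at a time
        loss = criterion(out, target)   # compare the prediction with the target
        loss.backward()                 # adds this image's gradients to .grad
        if i % BATCH_SIZE == 0:         # after BATCH_SIZE images, take one step
            optimizer.step()
            optimizer.zero_grad()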

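Calling .backward() multiple times before .zero_grad() sums the gradients, much as if the images had been processed as one mini-batch. If the loss averages over the batch, divide each loss by BATCH_SIZE to match the mini-batch gradient exactly.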
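A quick way to convince yourself of this on a toy model is sketched below. Everything in it (the tiny network, the random tensors, the `reduction="sum"` loss) is made up for illustration and is not the setup from this thread; it only compares gradients accumulated image by image with those from one real mini-batch.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy stand-ins (assumptions): a tiny network and random "patches".
net = nn.Sequential(
    nn.Conv2d(3, 4, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.AdaptiveAvgPool2d(1),  # output size no longer depends on input size
    nn.Flatten(),
    nn.Linear(4, 2),
)
# Summed loss, so per-image gradients add up to the mini-batch gradient.
criterion = nn.CrossEntropyLoss(reduction="sum")

# Same-sized patches here, so a real mini-batch exists to compare against.
imgs = [torch.randn(1, 3, 16, 16) for _ in range(4)]
targets = [torch.tensor([i % 2]) for i in range(4)]

# (a) One image at a time: gradients accumulate in .grad.
net.zero_grad()
for img, target in zip(imgs, targets):
    criterion(net(img), target).backward()
accumulated = [p.grad.clone() for p in net.parameters()]

# (b) A single stacked mini-batch.
net.zero_grad()
criterion(net(torch.cat(imgs)), torch.cat(targets)).backward()
batched = [p.grad.clone() for p in net.parameters()]

grads_match = all(
    torch.allclose(a, b, atol=1e-5) for a, b in zip(accumulated, batched)
)
print(grads_match)
```

With same-sized inputs both routes are possible, which is what makes the comparison work; with differently sized patches only the one-by-one route is available.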
Hope this helps

Hi @kembo,

Thank you for your answer.

I’m not learning on my images. I just use ResNet-152 as a feature extractor.

So the idea is to treat each superpixel (a set of pixels) of a given image as a whole image and 0-pad it to shape (224, 224, 3).
So for an image M with K superpixels, I build K images of dimension (224, 224, 3); the superpixels have different spatial dimensions, but each one is 0-padded to the same size. These K images are fed to ResNet-152 to extract features, so for image M I get a K x 2048 feature matrix.

However, I’m wondering whether this padding hurts ResNet-152 (poor feature representation), given that a superpixel covers only about 8% of the padded image and the remaining 92% is 0.
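For reference, the construction described above can be sketched roughly as follows. This is only an illustration under assumptions: the function name `superpixels_to_padded_images` is hypothetical, `labels` stands for any integer label map (e.g. the output of SLIC), and centering the superpixel on the canvas is an arbitrary choice. The last lines measure what fraction of each padded image is actually non-zero.

```python
import numpy as np

def superpixels_to_padded_images(image, labels, size=224):
    """Return an array (K, size, size, 3): one zero-padded image per superpixel."""
    out = []
    for k in np.unique(labels):
        mask = labels == k
        rows, cols = np.nonzero(mask)
        r0, r1 = rows.min(), rows.max() + 1
        c0, c1 = cols.min(), cols.max() + 1
        # Keep only this superpixel's pixels; the rest of the crop becomes 0.
        crop = image[r0:r1, c0:c1] * mask[r0:r1, c0:c1, None]
        h, w = crop.shape[:2]
        if h > size or w > size:
            raise ValueError("superpixel is larger than the target size")
        canvas = np.zeros((size, size, 3), dtype=image.dtype)
        top, left = (size - h) // 2, (size - w) // 2  # centered (assumption)
        canvas[top:top + h, left:left + w] = crop
        out.append(canvas)
    return np.stack(out)

# Toy data: a random 64x64 image split into four 32x32 "superpixels".
rng = np.random.default_rng(0)
image = rng.integers(1, 256, size=(64, 64, 3), dtype=np.uint8)
labels = (np.arange(64)[:, None] // 32) * 2 + (np.arange(64)[None, :] // 32)

padded = superpixels_to_padded_images(image, labels)
# Fraction of pixels in each padded image that carry any information.
fill = padded.any(axis=-1).mean(axis=(1, 2))
print(padded.shape, fill)
```

Computing `fill` on real superpixels gives the exact counterpart of the rough 8% figure.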

Thank you for your answer

I believe there would be a significant drop in ResNet’s performance. The convolutional kernels would be passing mostly over zeros, so I would expect the feature vector to be mostly zeros too, but that has to be checked.
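One cheap way to start checking is to look at what a single conv + batch-norm block does with an all-zero input. The sketch below is a toy: the layer sizes and the made-up BatchNorm statistics are assumptions, not values from a trained ResNet. It only shows that a bias-free convolution maps zeros to zeros, while the normalization layer after it can shift them away from zero, so the final features need not be zero.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

conv = nn.Conv2d(3, 8, kernel_size=3, padding=1, bias=False)
bn = nn.BatchNorm2d(8)
# Pretend the block was trained: made-up, non-trivial BatchNorm statistics.
with torch.no_grad():
    bn.running_mean.uniform_(-1.0, 1.0)
    bn.bias.uniform_(-1.0, 1.0)
block = nn.Sequential(conv, bn).eval()  # eval mode uses the running statistics

zeros = torch.zeros(1, 3, 32, 32)
with torch.no_grad():
    conv_out = conv(zeros)    # convolution alone
    block_out = block(zeros)  # convolution followed by batch norm

conv_is_zero = bool((conv_out == 0).all())
block_is_zero = bool((block_out == 0).all())
print(conv_is_zero, block_is_zero)
```

The same comparison on the real extractor, with real padded superpixels, is what would actually settle the question.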

As there are a lot of small superpixels, I would probably approach this problem by 0-padding each superpixel to its smallest enclosing rectangle and training a small CNN for feature extraction. Siamese architectures might be reasonable for training in this case, but it depends on the data you have at hand and what you want your descriptors to do.
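A rough sketch of that idea, with the training part left out: each superpixel is cropped to its bounding rectangle (pixels outside the superpixel set to 0) and passed through a small CNN that ends in global pooling, so crops of any size give a descriptor of the same length. The names `tight_crop` and `SmallDescriptorNet`, the layer sizes and the toy label map are all hypothetical.

```python
import numpy as np
import torch
import torch.nn as nn

def tight_crop(image, labels, k):
    """Crop superpixel k to its bounding rectangle, zeroing pixels outside it."""
    mask = labels == k
    rows, cols = np.nonzero(mask)
    r0, r1 = rows.min(), rows.max() + 1
    c0, c1 = cols.min(), cols.max() + 1
    return image[r0:r1, c0:c1] * mask[r0:r1, c0:c1, None]

class SmallDescriptorNet(nn.Module):
    """Hypothetical small CNN: any crop size in, fixed-length descriptor out."""
    def __init__(self, dim=64):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),  # global pooling removes the size dependence
        )
        self.fc = nn.Linear(32, dim)

    def forward(self, x):
        return self.fc(self.features(x).flatten(1))

# Toy image and label map with three differently sized "superpixels".
rng = np.random.default_rng(0)
image = rng.random((40, 60, 3), dtype=np.float32)
labels = np.zeros((40, 60), dtype=np.int64)
labels[:, 20:] = 1
labels[25:, 45:] = 2

net = SmallDescriptorNet().eval()  # untrained; training is not shown here
crop_shapes, descriptors = [], []
with torch.no_grad():
    for k in np.unique(labels):
        crop = tight_crop(image, labels, k)
        crop_shapes.append(crop.shape[:2])
        x = torch.from_numpy(crop).permute(2, 0, 1).unsqueeze(0)  # (1, 3, h, w)
        descriptors.append(net(x))  # one crop at a time, sizes differ
descriptors = torch.cat(descriptors)  # (K, 64)
print(crop_shapes, descriptors.shape)
```

Because the crops differ in size they are fed one at a time here, which matches the one-by-one workaround discussed earlier in the thread.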

Thank you @kembo for your suggestion. If I train a small CNN such as a Siamese network, what would the labels of the superpixels be?