Transfer learning using VGG-16 (or 19) for regression

Hi,

I’m trying to solve a problem where I have a dataset of images of dimensions (224, 224, 2) and want to map them to a vector of 512 continuous values between 0 and 2 * pi. I’ve already created a dataset of 10,000 images and their corresponding vectors, and I’m about to start experimenting with VGG-16. However, I have some concerns:

  1. Images are sparse by nature, as they represent the presence (or absence) of a particle in space. Each particle is annotated by a 5x5-pixel area in the image. On channel 1, wherever there is a particle the area is white, otherwise it is black. On channel 2, wherever there is a particle the area ranges from white to black depending on how close or far the particle is from the observer (its position in 3D); everything else is black, as before (see the sketch after this list for roughly how a sample looks). My concern is how a CNN like VGG-16 will behave with such sparse data. Of course I won’t know until I start experimenting, but it would be great if you could give me some intuition, e.g. whether it’s totally pointless to approach the problem this way.

  2. To start with, I will be using torch.nn.MSELoss to minimize the error between the predicted and actual 512 values for each image. Does that make sense? Do you have anything else to suggest?
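
To make the data format concrete, here is roughly how one such sample could be constructed (make_sample is just a toy illustration, not my actual generation code; the depth-to-intensity convention and the normalization of depth to [0, 1] are simplifications):

import numpy as np

def make_sample(particles, size=224, patch=5):
    """Build a (2, size, size) image from (row, col, depth) particles, depth normalized to [0, 1]."""
    img = np.zeros((2, size, size), dtype=np.float32)
    half = patch // 2
    for r, c, depth in particles:
        r0, r1 = max(r - half, 0), min(r + half + 1, size)
        c0, c1 = max(c - half, 0), min(c + half + 1, size)
        img[0, r0:r1, c0:c1] = 1.0          # channel 1: presence (white patch)
        img[1, r0:r1, c0:c1] = 1.0 - depth  # channel 2: closer particle -> brighter patch (assumed convention)
    return img

sample = make_sample([(50, 60, 0.2), (120, 80, 0.7), (200, 30, 0.5)])  # shape (2, 224, 224), mostly zeros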

Thanks

Also, I already know that my 512 outputs are phases, meaning the true targets are continuous values between 0 and 2 * pi. Is there any way to add something like an activation function that does the mod 2 * pi calculation, so that my prediction is always within that range, while staying differentiable? I know tanh is also an option, but that will tend to push most of the values to the boundaries.


Check out

torch.fmod()

https://pytorch.org/docs/master/torch.html#torch.fmod

I am not sure how autograd handles it, but you can try.
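
Something along these lines might work (a rough sketch; torch.remainder maps negative raw outputs into [0, 2*pi) as well, whereas torch.fmod keeps the sign of the input):

import math
import torch

raw = torch.randn(4, 512, requires_grad=True)  # stand-in for the network's unconstrained output
phase = torch.remainder(raw, 2 * math.pi)      # every value wrapped into [0, 2*pi)

phase.sum().backward()                         # autograd runs; the gradient is 1 almost everywhere
print(raw.grad.unique())                       # expected: tensor([1.])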

If you have images with 2 channels, how are you going to use VGG-16, which requires RGB images (3 channels)?

There are several options I could try. One of them could be to just add a third channel with all values the same, or to add a layer at the beginning that maps from 2 to 3 channels. Thanks for your suggestion.
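
For the second option, a rough sketch of what I have in mind (the 1x1 kernel is just one possible choice for the adapter):

import torch
import torch.nn as nn
import torchvision

adapter = nn.Conv2d(in_channels=2, out_channels=3, kernel_size=1)  # learnable 2 -> 3 channel mapping
vgg = torchvision.models.vgg16_bn(pretrained=True)
model = nn.Sequential(adapter, vgg)

x = torch.randn(1, 2, 224, 224)  # dummy two-channel input
out = model(x)                   # (1, 1000) until the classifier head is replaced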

If you are going to use weights pretrained on ImageNet, you should add the third channel and normalize your input using the ImageNet mean and std:

–> https://pytorch.org/docs/stable/torchvision/models.html

I didn’t know that. Is this necessary even if my images are already normalized between 0 and 1?

It lets you better leverage the transfer learning from ImageNet, because the network has been trained with that range of inputs. Otherwise, I would advise fine-tuning all layers of VGG-16 if you use the [0, 1] range.
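
For example (here the third channel is just a copy of the first, which is one of the options you mentioned; the mean/std values are the standard ImageNet ones from the torchvision docs):

import torch
from torchvision import transforms

normalize = transforms.Normalize(mean=[0.485, 0.456, 0.406],
                                 std=[0.229, 0.224, 0.225])

img = torch.rand(2, 224, 224)            # two-channel image already scaled to [0, 1]
img3 = torch.cat([img, img[:1]], dim=0)  # replicate channel 1 as a third channel
img3 = normalize(img3)                   # now in the range the pretrained VGG was trained on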


Small update: I did try a couple of loss functions (MSE with mod 2pi, atan2) but neither gave convincing results. Then I had another idea: multi-output classification. The 512 phases in my dataset actually come in 128 discretized levels (because of hardware limitations, aliasing, etc.), and I could take advantage of that. Instead of having only one fork (fully connected layer) at the end, I could have 512 small networks, each with 128 outputs, and train with nn.CrossEntropyLoss.

I realized that the device I’m measuring the 512 phases from (these are actually phases produced by 512 transducers, so each phase is assigned to one transducer) is, due to hardware limitations, only capable of producing 128 discrete phases between 0 and 2pi. Thus, I believe it is overkill to go for a regression task. What I thought of instead was to add 512 separate nn.Linear(4096, 128) layers with a softmax on top (in practice supplied by nn.CrossEntropyLoss), like a multi-output classification approach. For each of the 512 layers I calculate a separate loss, with the output from the VGG as input to these layers. My network now looks like this:

import torch.nn as nn
import torchvision
from collections import namedtuple, OrderedDict


class MyVgg(nn.Module):
    def __init__(self, version='16', batch_norm=True, pretrained=True):
        super().__init__()

        # Map (version, batch_norm) to a constructor, so only the requested model is instantiated
        vgg = namedtuple('vgg', ['version', 'batch_norm'])
        constructors = {vgg('16', True): torchvision.models.vgg16_bn,
                        vgg('16', False): torchvision.models.vgg16,
                        vgg('19', True): torchvision.models.vgg19_bn,
                        vgg('19', False): torchvision.models.vgg19}
        self.model = constructors[vgg(version, batch_norm)](pretrained=pretrained)

        # Remove the last fc layer, so the classifier now ends with a 4096-dim feature vector
        self.model.classifier = nn.Sequential(*list(self.model.classifier.children())[:-1])

        # Include separate classifiers for each phase (pc: phase classifiers, 512 in total).
        # nn.ModuleDict registers their parameters; no nn.Softmax() needed, it is encapsulated in nn.CrossEntropyLoss()
        self.pc = nn.ModuleDict({'PC_{}'.format(i): nn.Linear(4096, 128, bias=True)
                                 for i in range(512)})

    # Set your own forward pass
    def forward(self, img, extra_info=None):
        # Before splitting into the different classifiers, take the 4096-dim output from the VGG
        pre_split = self.model(img)

        outputs = OrderedDict()
        for name, pc in self.pc.items():   # iterate through all 512 classifiers
            outputs[name] = pc(pre_split)  # each entry has shape (batch_size, 128)

        return outputs  # dictionary with the outputs of the 512 classifiers

The output is a dictionary with 512 keys, each holding a (batch_size, 128) tensor of logits.

And for each classifier at the end I’m calculating nn.CrossEntropyLoss() (which encapsulates the softmax activation, by the way, so there is no need to add it to my fully connected layers). The true label for each classifier is the index of the correct level among the 128 (conceptually a one-hot vector of 128 values, with a 1 at the true level and 0s for the rest). Then I sum up the 512 losses and backpropagate to train the network like this:

for batch_idx, (images, labels) in enumerate(Bar(loaders['train'])):
    # images: (batch, 2, 224, 224); labels: (batch, 512) integer phase indices in [0, 127]
    images = images.to(self.device, dtype=torch.float)
    labels = labels.to(self.device, dtype=torch.long)  # nn.CrossEntropyLoss expects class indices, not one-hot floats
    optimizer.zero_grad()
    preds = network(images)  # dictionary with 512 outputs, each of shape (batch, 128)
    loss = 0
    for i, output in enumerate(preds.values()):  # one classifier per transducer
        loss += self.criterion(output, labels[:, i])  # (batch, 128) logits vs. (batch,) target indices
    loss.backward()
    optimizer.step()
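
At inference time I can then map the predicted indices back to phases, something like this (reusing network and images from the loop above; the index-to-phase mapping idx * 2*pi / 128 is my assumption about how the discrete levels are spaced):

import math
import torch

with torch.no_grad():
    preds = network(images)  # dict of 512 tensors, each (batch, 128)
    indices = torch.stack([out.argmax(dim=1) for out in preds.values()], dim=1)  # (batch, 512)
    phases = indices.float() * (2 * math.pi / 128)  # discrete levels back to values in [0, 2*pi)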

Do you think the whole concept makes sense? I generated 12k images today, and I’m going to start experimenting again tomorrow.

For the rest of the participants in the forum, here’s what a pair of data points looks like for 6 particles:

[image: PositionImage_00477_jpg]

And the .csv file with the 512 target phases:

118, 111, 43, 116, 36, 12, 35, 73, 87, 80, 65, 88, 52, 2, 69, 127, ... (512 integer values in total, each between 0 and 127)

As you can see, the image is really sparse. Also, the phases come in discrete levels between 0 and 127 due to hardware limitations (an FPGA calculates the phase). To give you a better overview of the problem: we have already implemented a forward method that, given the positions of the particles in space (represented here as an image), calculates the phase of each of the 512 transducers (so 512 phases in total). These transducers emit sound waves with particular phases and amplitudes, and when the sound waves from all transducers are combined, the particles can be moved in space. It doesn’t really matter why and how this works. The point is that we’re experimenting with a deep learning approach, because the current algorithm is rather slow for real time, and there are also better, more accurate algorithms that we haven’t implemented because they are far too slow to compute for a real-time task.