Hello,
I have been trying to train a one-layer network on features extracted with another, pre-trained network.
The same features are discriminative when I train linear non-DL classifiers such as a linear SVM or a logistic regression. However, for some reason, when I try to train a neural network on these same features, the loss never gets close enough to zero and at some point stagnates. I have tried everything I could think of: normalizing the features before passing them to the network, increasing/decreasing the learning rate, decreasing the batch size…
Initially this neural network was meant to be trained on a much bigger dataset (2 million samples, also passed through the same pre-trained feature extractor) than the one I used for the non-DL classifiers (13,000 samples), but I am now trying it on the very same dataset used for the non-DL classifiers and it is still not working. I am really puzzled and would appreciate some help.
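For reference, the linear baseline is roughly like this (just a sketch, not my exact script; I am showing scikit-learn's LogisticRegression as an assumption, and backbone, train_loader and device are the same objects as in the snippets below):

# Rough sketch of the non-DL baseline on the same extracted features
# (scikit-learn here is an assumption; the exact preprocessing may differ)
import torch
import torch.nn.functional as F
from sklearn.linear_model import LogisticRegression

feats, labels = [], []
with torch.no_grad():
    for img, _, g in train_loader:
        feats.append(F.normalize(backbone(img.to(device))).cpu())
        labels.append(g)
feats = torch.cat(feats).numpy()
labels = torch.cat(labels).numpy()

clf = LogisticRegression(max_iter=1000).fit(feats, labels)
print(clf.score(feats, labels))  # this kind of linear model separates the features fine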
Here are relevant parts of the code:
The one-layer neural network:
import torch
import torch.nn as nn
import torch.nn.functional as F

class Net(nn.Module):
    def __init__(self, embedding_size=512):
        super(Net, self).__init__()
        self.embedding_size = embedding_size
        # single linear layer: embedding -> one logit for binary classification
        self.fc = nn.Linear(self.embedding_size, 1)

    def forward(self, x):
        x = self.fc(x)
        return x
Optimizer and loss:
advBackbone = Net(embedding_size=cfg.embedding_size).to(device)
opt_adv_backbone = torch.optim.Adam(
    params=[{'params': advBackbone.parameters()}],
    lr=0.01,
)
advcriterion = nn.BCEWithLogitsLoss()
Training loop:
for epoch in range(start_epoch, cfg.num_epoch):
    for i, (img, _, g) in enumerate(train_loader):
        # freeze the pre-trained feature extractor
        for p in backbone.parameters():
            p.requires_grad = False
        global_step += 1
        opt_adv_backbone.zero_grad()
        img = img.to(device)
        g = g.to(device)
        # L2-normalized embeddings from the frozen backbone
        features = F.normalize(backbone(img))
        # detach so gradients only flow through the one-layer head
        output = advBackbone(features.detach())
        loss_a = advcriterion(output, g.unsqueeze(1).float())
        loss_a.backward()
        opt_adv_backbone.step()
        loss.update(loss_a, 1)  # running-average loss meter
By the way, when I let it overfit on a single batch, the loss does get close to zero (~0.005). When I train on the whole dataset, it decreases and then, from a certain epoch on, stagnates around a certain value: ~0.2 for the small dataset and ~0.3 for the large one. What I don't understand is how it can fail when a logistic regression or a linear SVM does converge. The fact that they do means that the features, at least those extracted for the small dataset, are indeed linearly separable, so the neural network defined above should be able to converge as well…
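To be concrete, the single-batch check I mean is roughly this (same Net, optimizer and criterion as above; the number of steps is arbitrary):

# Single-batch overfitting sanity check (sketch)
img, _, g = next(iter(train_loader))
img, g = img.to(device), g.to(device)
with torch.no_grad():
    features = F.normalize(backbone(img))  # frozen extractor
target = g.unsqueeze(1).float()
for step in range(500):  # 500 steps chosen arbitrarily
    opt_adv_backbone.zero_grad()
    loss_a = advcriterion(advBackbone(features), target)
    loss_a.backward()
    opt_adv_backbone.step()
print(loss_a.item())  # ends up around 0.005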
Thanks in advance to anyone willing to help!