Autoencoder Loss Plateau

Hi Everyone,

I asked this question on social media

I am working on dimensionality reduction techniques and chose a denoising autoencoder (DAE) as one of them. The reason for choosing it is that it works well with both linear and non-linear data. I have 120 features and almost one million records. The DAE gives me a latent space of 80 features, i.e. I reduced the features from 120 to 80. The 80 is an arbitrary number and I checked other random numbers as well; is there any systematic approach to choose the size of the latent space?

Now the real problem is a binary classification prediction, 1 or 0. After getting this 80-dimensional latent space and feeding it into the downstream algorithm (LightGBM), I don't see any improvement in model accuracy. Without the latent space, i.e. plain LGBM with the 120 features, LGBM gives me 80% accuracy; with the 80-dimensional latent space, LGBM gives me 55% accuracy. The target class is balanced 50/50. The reason for going for feature reduction is to avoid over-fitting and to get a simpler model.

I ran feature selection techniques as well, and almost all features contribute fairly in combination with other features to predict the target (1/0); that is the reason I chose a DAE. I also checked that all features are completely overlapped between target classes 1 and 0, as shown in the attached plot.

Any idea how I can reduce these features from 120 to at least 80? PCA is crashing as it requires more memory than I have at this stage; the autoencoder seems promising to me in terms of speed and memory and has the capability to work on non-linear as well as linear data, while PCA requires linearity. Can anyone suggest why the autoencoder is not converging well? I tried different layers with different numbers of neurons but no luck. Please share any idea which covers autoencoders. Thanks

Someone replied:

Add a separate classification head to your decoder and train on a combined loss (say 50% reconstruction loss + 50% classification loss)

I did not get it, so I asked that same person what a classification head means and how I can add a separate classification head to the decoder and train on a combined loss (say 50% reconstruction loss + 50% classification loss).

If anybody has an idea, please share some dummy code and explain why we need it.

I guess the user was suggesting to add a new layer to your decoder, which would output the classification prediction and could then be trained together with the reconstruction.
I.e. I think she meant something like this:

import torch
import torch.nn as nn

# original
class MyModel(nn.Module):
    def __init__(self):
        super().__init__()
        # encoder compresses the 10 input features into a 2-dimensional latent space
        self.encoder = nn.Sequential(
            nn.Linear(10, 10),
            nn.ReLU(),
            nn.Linear(10, 2))
        # decoder maps the latent space back to the 10 input features
        self.decoder = nn.Sequential(
            nn.ReLU(),
            nn.Linear(2, 10),
            nn.ReLU(),
            nn.Linear(10, 10))

    def forward(self, x):
        x = self.encoder(x)
        x = self.decoder(x)
        return x

model = MyModel()
x = torch.randn(1, 10)
reconstruction = model(x)

# probably suggested
class MyModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(10, 10),
            nn.ReLU(),
            nn.Linear(10, 2))
        # shared decoder "base" acting as a feature extractor
        self.decoder_features = nn.Sequential(
            nn.ReLU(),
            nn.Linear(2, 10),
            nn.ReLU())
        # head 1: reconstructs the 10 input features
        self.decoder_recon = nn.Linear(10, 10)
        # head 2: outputs the classification logits (10 outputs in this toy example)
        self.decoder_classifier = nn.Linear(10, 10)

    def forward(self, x):
        x = self.encoder(x)
        x = self.decoder_features(x)
        reconstruction = self.decoder_recon(x)
        logits = self.decoder_classifier(x)
        return reconstruction, logits

model = MyModel()
x = torch.randn(1, 10)
reconstruction, logits = model(x)
# calculate losses for both outputs using the corresponding loss functions
loss = loss_recon + loss_classifier
loss.backward()

Of course your model would look different and more sophisticated than this example.
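
For completeness, a minimal sketch of how the two losses in the last lines could be computed and combined, assuming nn.MSELoss for the reconstruction and nn.CrossEntropyLoss for the logits, with `target` being made-up integer class labels (the 50/50 weighting just mirrors the original suggestion):

criterion_recon = nn.MSELoss()
criterion_cls = nn.CrossEntropyLoss()

model = MyModel()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

x = torch.randn(8, 10)                    # batch of 8 samples with 10 features
target = torch.randint(0, 10, (8,))       # hypothetical integer class labels

reconstruction, logits = model(x)
loss_recon = criterion_recon(reconstruction, x)   # compare reconstruction to the input
loss_classifier = criterion_cls(logits, target)   # compare logits to the class labels
loss = 0.5 * loss_recon + 0.5 * loss_classifier   # combined, equally weighted loss

optimizer.zero_grad()
loss.backward()
optimizer.step()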


@ptrblck thanks. What is the catch of doing this? We are replicating the same decoder layer into two different instance variables, “self.decoder_recon” and “self.decoder_classifier”, and treating one as the decoder and the other as the classifier. Could you please explain why we need to do this?

self.decoder_recon = nn.Linear(10, 10)
self.decoder_classifier = nn.Linear(10, 10)

These layers would reuse the “base” of the model as a potential feature extractor and would then act as two different heads, each giving a different set of outputs. While one head would create the reconstruction of the input, the other one would output the classification logits. Training both heads, and thus the base, together using the accumulated losses could yield better results, but of course you would have to run experiments to see if that’s indeed the case.


@ptrblck makes sense, but I still need to understand it more deeply, so I may ask for more detail; please bear with my silly questions. I implemented the same on my data and I see a little bit of improvement.

I am measuring the improvement by feeding the latent/bottleneck features into a boosted algorithm (LGBM) for classification: after training the autoencoder I use the same model to get the latent space of my training data and then use this latent space to train the LGBM classifier. Before using the two-head technique mentioned above, the LGBM model accuracy was 0.58, which increased to 0.59 after using it.

I noticed there are data issues as well. The data are too fuzzy: you can't discriminate between two records which are more or less the same but where one has target value 1 and the other 0. All features have the same problem, it is widespread across all rows, and the data are very overlapped.

I am thinking of creating several autoencoders with different architectures and averaging all their latent spaces to get a final latent space to feed into LGBM and see if there is any improvement. Is there any systematic way to achieve this in PyTorch?

Please let me know how I can approach this problem; if you have any idea, please share.

How was the accuracy of the classifier head in the autoencoder? It would perform a similar task as the LGBM model, i.e. using the latent tensors to classify the samples.

I don’t know how you’ve checked this, but in case both classes are drawn from the same distribution, I would assume that your model should not be able to learn a lot (besides overfitting).
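
In case it helps, a minimal sketch of how the accuracy of the classifier head could be checked, assuming the head outputs one logit per class and `y` contains the integer labels (x and y here are made-up placeholders):

# hypothetical batch and labels, just to show the accuracy computation
x = torch.randn(8, 10)
y = torch.randint(0, 2, (8,))

model.eval()
with torch.no_grad():
    _, logits = model(x)                 # assumes the head outputs one logit per class
    pred = torch.argmax(logits, dim=1)   # predicted class per sample
    acc = (pred == y).float().mean()     # fraction of correct predictions
print(f'classifier head accuracy: {acc.item():.4f}')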


How was the accuracy of the classifier head in the autoencoder? It would perform a similar task as the LGBM model, i.e. using the latent tensors to classify the samples.

@ptrblck I am lost here. I have a binary classification problem. The total number of features is 111, which I want to decrease to 100 using the autoencoder; these 100 latent features will then be fed into the LGBM model.

Autoencoder

class Autoencoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(111, 480),
            nn.ReLU(),
            nn.BatchNorm1d(480),

            nn.Linear(480, 100),
            nn.BatchNorm1d(100),
            nn.ReLU()
        )
        # shared decoder base
        self.decoder = nn.Sequential(
            nn.Linear(100, 480),
            nn.ReLU())
        # reconstruction head
        self.decoder_recon = nn.Linear(480, 111)
        # classification head (same output size as the reconstruction head here)
        self.decoder_classifier = nn.Linear(480, 111)

    def forward(self, x):
        x = self.encoder(x)
        x = self.decoder(x)
        reconstruction = self.decoder_recon(x)
        logits = self.decoder_classifier(x)
        return reconstruction, logits

    def get_encoder_state(self, x):
        encoded = self.encoder(x)
        return encoded

After training the autoencoder I grab the latent space of the training data:

X_train = torch.from_numpy(pd.DataFrame(X_nums).to_numpy(np.float32)).to(device)
X_train_encoded = autoencoder.get_encoder_state(X_train)
X = pd.DataFrame(X_train_encoded.cpu().detach().numpy())
X.shape
(912512, 100)
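
One thing I am not sure about (just my assumption): since the encoder contains BatchNorm1d layers, should I switch the model to eval mode and disable gradient tracking while extracting the latent features, e.g. like this?

autoencoder.eval()                 # BatchNorm1d layers then use their running statistics
with torch.no_grad():              # no autograd graph needed for feature extraction
    X_train_encoded = autoencoder.get_encoder_state(X_train)
X = pd.DataFrame(X_train_encoded.cpu().numpy())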

Then I use this latent space as the input to train LGBM:

import numpy as np
import lightgbm as lgb
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import roc_auc_score

best_params = {'n_estimators': 1000,
               'subsample': 0.4,
               'learning_rate': 0.01,
               'num_leaves': 70,
               'is_unbalance': False,
               'device': 'gpu'}

gbm = lgb.LGBMClassifier(**best_params)

scores = list()
kfold = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

for train_idx, test_idx in kfold.split(X, y):

    train_X, test_X = X.loc[train_idx, :], X.loc[test_idx, :]
    train_y, test_y = y.loc[train_idx], y.loc[test_idx]
    gbm.fit(train_X, train_y)

    pred = gbm.predict_proba(test_X)[:, 1]  # probability of the positive class
    cv_score = roc_auc_score(test_y, pred)

    scores.append(cv_score)
    print('> ', cv_score)

mean_s, std_s = np.mean(scores), np.std(scores)
print('Mean: %.5f, Standard Deviation: %.5f' % (mean_s, std_s))

Please let me know if you need anything more, and tell me what I am missing.

Since you were training the autoencoder with both heads, did you check the accuracy of the classifier of the autoencoder? The classifier head should perform a similar task as the LGBM model, so comparing their accuracy might be interesting.

@ptrblck my classification task is binary, 0 or 1:

  • I want to reduce the 118 features to some number x without losing the data representation
  • Rather than feeding all 118 features to an algorithm, I want to provide a smaller feature set that represents the original 118 features, hence I started reducing the 118 features to 100 using the autoencoder
  • These 100 features are then used to predict the classification (1 or 0) with any algorithm, e.g. LGBM in my case

The classifier head here has 111 outputs, as I followed your example; it is not 2 for the 0/1 classification.

self.decoder_classifier = nn.Linear(480,111)

Do you want me to change it to:

self.decoder_classifier = nn.Linear(480,2)

No, don’t change your model architecture.
Based on the suggestion you’ve received initially, you’ve added a classifier head to your autoencoder (while still keeping the LGBM classifier). This new classifier head in the autoencoder will also output the classification (in the form of logits or probabilities) for the samples, so both the classifier head and the LGBM model are now creating classification predictions for each sample. Did you compare the accuracy of these two classifiers and, if so, what accuracy did you see?


@ptrblck this is what I understood; apologies if I am still missing something. Please guide me, I really want to learn this stuff.

I went ahead and tracked the loss of the classification head in the network. Here is the autoencoder, with no change, as you suggested:

class Autoencoder(nn.Module):
    def __init__(self):
        super().__init__()

        self.encoder = nn.Sequential(
            nn.Linear(112, 480),
            nn.ReLU(),
            nn.BatchNorm1d(480),

            nn.Linear(480, 100),
            nn.BatchNorm1d(100),
            nn.ReLU()
        )
        # shared decoder base
        self.decoder = nn.Sequential(
            nn.Linear(100, 480),
            nn.ReLU())
        # reconstruction head
        self.decoder_recon = nn.Linear(480, 112)
        # classification head (same output size as the reconstruction head here)
        self.decoder_classifier = nn.Linear(480, 112)

    def forward(self, x):
        x = self.encoder(x)
        x = self.decoder(x)
        reconstruction = self.decoder_recon(x)
        logits = self.decoder_classifier(x)
        return reconstruction, logits

    def get_encoder_state(self, x):
        encoded = self.encoder(x)
        return encoded

Here I am updating the gradients using only the loss of the additionally defined classification head,

self.decoder_classifier = nn.Linear(480,112)

i.e. I removed the loss of self.decoder_recon = nn.Linear(480,112) from the weight update in the following code:

criterion = nn.MSELoss()
seed = 4
torch.manual_seed(seed)
num_epochs = 100
outputs = []
running_loss = 0.0

# data_loader, optimizer and scheduler are assumed to be defined beforehand;
# data['x'] holds the input features and data['y'] the reconstruction targets
for epoch in range(num_epochs):
    for data in data_loader:
        inputs, targets = data['x'].to(device), data['y'].to(device)
        recon, logits = autoencoder(inputs)
        loss_recon = criterion(recon, targets)       # reconstruction loss
        loss_classifier = criterion(logits, inputs)  # MSE between the classifier-head output and the inputs
        loss = loss_recon + loss_classifier          # combined loss (only logged here)

        optimizer.zero_grad()
        loss_classifier.backward()  # backward pass through the classifier-head loss only
        optimizer.step()            # update weights
        if not scheduler.__class__ == torch.optim.lr_scheduler.ReduceLROnPlateau:
            scheduler.step(loss)
        running_loss += loss.item()
    if (epoch + 1) % 2 == 0:
        print(f'Epoch: {epoch+1}, Loss:{loss_classifier}, Running Loss:{running_loss}')
Epoch: 2, Loss:0.19022159278392792, Running Loss:506.3911786079407
Epoch: 4, Loss:0.15673978626728058, Running Loss:1002.0990172624588
.
.
Epoch: 98, Loss:0.062328677624464035, Running Loss:20757.95539867878
Epoch: 100, Loss:0.06230540573596954, Running Loss:21154.803141355515

Updating with the combined loss of both heads, self.decoder_classifier = nn.Linear(480,112) and
self.decoder_recon = nn.Linear(480,112), in the following code:

criterion = nn.MSELoss()
seed = 4
torch.manual_seed(seed)

num_epochs = 100
outputs = []
running_loss = 0.0

for epoch in range(num_epochs):
    for data in data_loader:
        inputs, targets = data['x'].to(device), data['y'].to(device)
        recon, logits = autoencoder(inputs)
        loss_recon = criterion(recon, targets)       # reconstruction loss
        loss_classifier = criterion(logits, inputs)  # MSE between the classifier-head output and the inputs
        loss = loss_recon + loss_classifier          # combined loss

        optimizer.zero_grad()
        loss.backward()   # backward pass through the combined loss
        optimizer.step()  # update weights
        if not scheduler.__class__ == torch.optim.lr_scheduler.ReduceLROnPlateau:
            scheduler.step(loss)
        running_loss += loss.item()
    if (epoch + 1) % 2 == 0:
        print(f'Epoch: {epoch+1}, Loss:{loss.item():.4f}, Running Loss:{running_loss}')
Epoch: 2, Loss:0.7351, Running Loss:216.49947154521942
Epoch: 4, Loss:0.7167, Running Loss:424.1127555966377
Epoch: 6, Loss:0.7047, Running Loss:627.1306979060173
Epoch: 8, Loss:0.7052, Running Loss:828.52132833004
.
.
Epoch:100, Loss:0.4444, Running Loss:3412.2449165582657

In both cases the downstream LGBM model uses the 100-dimensional latent space as the input for model training, and the LGBM model gives more or less the same score in each fold.

LGBM performance on the latent space when updating with the classification-head loss only:

0.8072682510834308
0.8065939798651113
0.8057722965731224
0.8059680518666961
0.8057054368050915
Fold Mean: 0.80626, Standard Deviation: 0.00059

LGBM performance on the latent space when updating with the combined reconstruction + classification-head loss:

0.8065435663650142
0.8059007970122725
0.8055537970177743
0.8060057710128392
0.8054112326987021
Fold Mean: 0.80588, Standard Deviation: 0.00040

There is only a slight improvement, from a fold mean of 0.80588 to 0.80626.

If I am still doing something wrong, please correct me.

In case these values show the losses, I would consider the improvement to be noise rather than a real change.

I think your approach looks valid, but the model is clearly not training well, as the loss increases.
This would also mean that the accuracy using the logits output from self.decoder_classifier will most likely also be random, but you could check it nevertheless.


@ptrblck is there any detailed documentation about these extra heads?

self.decoder_recon = nn.Linear(480,112)
self.decoder_classifier = nn.Linear(480,112)

How does it work, and why does it act on the NN’s latent space? When I use it, the LGBM model gives a good accuracy score; without it, the LGBM accuracy score is much worse. I want to learn what mechanism is working behind its usage.

I don’t know if there are good resources about it (I would guess you could find some papers describing this approach). The main idea would be to not only train the model with the reconstruction loss, but also to force it to learn to classify the samples. Since you are training the model end2end, this would also influence the latent space in the bottleneck of the model, so that it has to encode the “classification information” as well, which might then benefit the downstream LGBM model.
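
As a rough illustration with the toy two-head model from earlier in this thread (a sketch only; the labels and the CrossEntropyLoss criterion are assumptions for the example), backpropagating just the classification loss already produces gradients in the encoder, which is why the classification objective also shapes the bottleneck:

model = MyModel()                        # the two-head toy model from above
x = torch.randn(4, 10)
target = torch.randint(0, 10, (4,))      # hypothetical integer class labels

_, logits = model(x)
loss_cls = nn.CrossEntropyLoss()(logits, target)
loss_cls.backward()

# non-zero gradients in the encoder show that the classification loss trains the latent space too
print(model.encoder[0].weight.grad.abs().sum())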
