Autoencoders for a sparse dataset

Hello,

The goal is to perform dimensionality reduction without loss of information. This post is about autoencoders and their ability to reconstruct a sparse dataset.

Input: a 2D tensor of shape (N, 512), where each row is a frequency vector of length 512 containing discrete values only.
Example: [[3, 0, 1, 0, 45, 0, …, 0, 0, 0, 2], […]]

Expected output: a 2D tensor of shape (N, 512) as close as possible to the input (again with discrete values only).
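
To make the setup concrete, a synthetic batch with this shape can be generated along these lines (the sparsity level and value range here are placeholders, not the real data):

import torch

N, D = 1000, 512
X = torch.zeros(N, D)
# roughly 1% of the entries are non-zero, with small positive integer counts
mask = torch.rand(N, D) < 0.01
X[mask] = torch.randint(1, 50, (int(mask.sum()),)).float()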

Loss functions used:

  • MSELoss
  • MSELoss + CosineEmbeddingLoss
  • L1 regularization
  • MSELoss + KL divergence (from the theory of sparse autoencoders; rough sketch below)
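
For reference, the MSE + KL combination is roughly the following (rho and beta are placeholder hyperparameters, not my actual settings; the clamp is only there because the ReLU latent activations are not bounded in (0, 1)):

import torch
import torch.nn.functional as F

def mse_kl_loss(decoded, target, encoded, rho=0.05, beta=1e-3):
    # reconstruction term
    mse = F.mse_loss(decoded, target)
    # mean activation of each latent unit over the batch, clamped into (0, 1)
    rho_hat = encoded.mean(dim=0).clamp(1e-6, 1 - 1e-6)
    # Bernoulli KL divergence between the target sparsity rho and rho_hat
    kl = (rho * torch.log(rho / rho_hat)
          + (1 - rho) * torch.log((1 - rho) / (1 - rho_hat))).sum()
    return mse + beta * kl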

Note:

  1. The dataset is highly sparse.
  2. The latent dimension is 32.
  3. With all of the above loss functions and different architectures, the autoencoder’s training and validation losses always decrease. Based on these metrics alone, the encoder and decoder appear to be working fine.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SparseAutoEncoder(nn.Module):
    """Fully connected autoencoder: 512 -> 256 -> 128 -> 32 -> 128 -> 256 -> 512."""

    def __init__(self, original_dim: int, latent_dim: int):
        super(SparseAutoEncoder, self).__init__()
        # encoder
        self.enc1 = nn.Linear(in_features=original_dim, out_features=original_dim // 2)
        self.enc2 = nn.Linear(in_features=original_dim // 2, out_features=original_dim // 4)
        self.enc3 = nn.Linear(in_features=original_dim // 4, out_features=latent_dim)
        # decoder
        self.dec1 = nn.Linear(in_features=latent_dim, out_features=original_dim // 4)
        self.dec2 = nn.Linear(in_features=original_dim // 4, out_features=original_dim // 2)
        self.dec3 = nn.Linear(in_features=original_dim // 2, out_features=original_dim)

    def forward(self, x):
        # encoder
        x = F.relu(self.enc1(x))
        x = F.relu(self.enc2(x))
        encoded = F.relu(self.enc3(x))
        # decoder (ReLU on the output keeps the reconstruction non-negative)
        x = F.relu(self.dec1(encoded))
        x = F.relu(self.dec2(x))
        decoded = F.relu(self.dec3(x))
        return encoded, decoded
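
A quick sanity check of the shapes (dummy batch, using the dimensions from the notes above):

model = SparseAutoEncoder(original_dim=512, latent_dim=32)
batch = torch.rand(8, 512)
encoded, decoded = model(batch)
print(encoded.shape, decoded.shape)  # torch.Size([8, 32]) torch.Size([8, 512])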

However, for qualitative analysis, plotting is not an option (unlike with images). Instead, I take the absolute difference of the two tensors (np.abs(X - X_prime)) and count the non-zero entries for each training sample. This is where the fault shows up: the autoencoder is only partially able to reconstruct the input frequency vectors.

Example:
The original vector is X = [3, 1, 1, 1, 2] at the non-zero indices [0, 22, 26, 256, 264], with all other indices 0, so that len(X) is always 512.

The reconstructed vector is X_prime = [3, 0, 0, 1, 0] at the same indices [0, 22, 26, 256, 264], with all other indices 0, so that len(X_prime) is also 512.

So there are 5 non-zero entries in the original vector, the goal is to capture them as precisely as possible, and the autoencoder fails to achieve this. A similar pattern is observed across all N samples: only around 50% of the non-zero entries are reconstructed correctly.
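
Concretely, the per-sample check is along these lines (assuming X and X_prime are numpy arrays of shape (N, 512), that the reconstruction is rounded back to integers first, and that every sample has at least one non-zero entry):

import numpy as np

X_prime_int = np.rint(X_prime).astype(int)
# positions where the (rounded) reconstruction differs from the original
diff = np.abs(X - X_prime_int)
errors_per_sample = np.count_nonzero(diff, axis=1)
# fraction of the originally non-zero entries that are reconstructed exactly
nonzero_mask = X != 0
hit_rate = ((diff == 0) & nonzero_mask).sum(axis=1) / nonzero_mask.sum(axis=1)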

My questions would be:

  1. Is an autoencoder the right architecture to capture this kind of discrete input data?
  2. Do I need to use some other loss function? (A mean squared error will always be low because most of the entries are 0 and the model correctly predicts the 0s; it only fails to predict the non-zero values, which are very few.)
  3. What other methods could be tried for this dimensionality-reduction task? Would PCA face the same issue?

Thanks for your help and suggestions :grinning:

Best,
Sharma