5 minutes per epoch on tabular data?

Hello all,

I am training an autoencoder on tabular data that has 272 features, mostly sparse one-hot encodings, and I am experiencing incredibly slow training. In fact, it appears to train faster on CPU than on GPU. The time to train one epoch with a batch size of 5 is 5 minutes.

I’m hoping someone can provide some insight into why this is happening. Thank you in advance.

My set-up is a very basic fully connected network:

from collections import OrderedDict

import torch
from torch import nn, optim
from tqdm import tqdm


class FCNetwork(nn.Module):
    """Fully Connected Network Class"""

    def __init__(self, n_input, layers, n_output, act=('relu', nn.ReLU())):
        """
        :param n_input: Integer. Size of the input vector.
        :param layers: Tuple containing the desired hidden layer sizes.
        :param n_output: Integer. Size of the output vector.
        :param act: Tuple ('name', act_func). The first element is a string
            naming the activation function; the second is the activation
            module itself. Default is ``('relu', nn.ReLU())``.
        """
        super().__init__()
        self.input = nn.Linear(n_input, layers[0])
        self.hidden = self.init_hidden(layers, activation=act)
        self.output = nn.Linear(layers[-1], n_output)
        
    def init_hidden(self, layers, activation, dropout=0.0):
        """Build the hidden stack: activation, then (Linear, activation, dropout) blocks, then activation."""
        n_layers = len(layers)
        modules = OrderedDict()
        a_name = activation[0]

        # Activation applied to the output of the input layer.
        modules[f'{a_name}_in'] = activation[1]

        for i in range(n_layers - 1):
            modules[f'fc{i}'] = nn.Linear(layers[i], layers[i + 1])
            modules[f'{a_name}{i}'] = activation[1]
            modules[f'drop{i}'] = nn.Dropout(p=dropout)

        # Activation applied before the output layer.
        modules[f'{a_name}_out'] = activation[1]

        return nn.Sequential(modules)
            
    def forward(self, x):
        x = x.float()
        x = self.input(x)
        x = self.hidden(x)
        return self.output(x)
    

And this is my training method:

def train_encoder(model, trainload, epochs, criterion=nn.MSELoss(), optimizer=optim.Adam, lr=1e-4, testload=None):
    """
    Train auto-encoder reconstruction for a given number of epochs.

    :param model: the network to train.
    :param trainload: a DataLoader containing the training variables and targets used for training.
    :param epochs: number of times the network will see the entire data set.
    :param criterion: loss function. Default nn.MSELoss().
    :param optimizer: optimizer class. Default optim.Adam.
    :param lr: learning rate. Default 1e-4.
    :param testload: Optional. A DataLoader containing the validation set. If included, both training and
        validation loss will be tracked and can be plotted using model.plot_loss().
    :return: (train_loss, valid_loss) lists of per-epoch average losses.
    """
    device = 'cuda' if torch.cuda.is_available() else 'cpu'
    print(f'Training model on {device}')
    model = model.to(device)
    opt = optimizer(model.parameters(), lr)

    train_loss = []
    valid_loss = []

    for e in tqdm(range(epochs)):
        running_tl = 0
        running_vl = 0
        for x in trainload:
            x = x.to(device).float()
            opt.zero_grad()
            loss = criterion(model(x), x)
            loss.backward()
            opt.step()
            running_tl += loss.item()

        if testload is not None:
            model.eval()
            with torch.no_grad():
                for x in testload:
                    x = x.to(device).float()
                    loss = criterion(model(x), x)
                    running_vl += loss.item()
                valid_loss.append(running_vl / len(testload))
            model.train()
            
        train_loss.append(running_tl / len(trainload))
    
    return train_loss, valid_loss

Number of features = 272,
and it’s one-hot encoded at that.
Your batches have length = 5,
so, assuming this tabular data has 5 columns,
you are feeding data of shape (5, 5, 272) per batch.
That is really, really large.

You should try a vectorizing method other than one-hot encoding, simply because the number of features is far too large and can lead to much higher computational time.

Hey, thanks for the response. The 272 columns are AFTER one-hot encoding. Do you still think that is too large?

Sorry, I don’t really get this.
Can you write out the shape of the data for me?

E.g.:
(N, C, F), something like that.

Okay, input looks like this:

(batch_size, n_columns) = (5, 272)

Since it’s an auto-encoder, the output is the same shape as the input.
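
To be concrete, this is what one batch from my DataLoader looks like (trainload here is the training DataLoader passed to train_encoder above):

x = next(iter(trainload))   # pull one batch from the training DataLoader
print(x.shape)              # torch.Size([5, 272]), i.e. (batch_size, n_columns)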

Ok, I see.
Still, I don’t really recommend one-hot vectorization for data this wide unless there’s really no other way.

What exactly is the context of your data?

The context of my data is that I have 3 columns of continuous data, plus one column that is categorical (with 269 unique categories). The categories are nominal, which is why I chose one-hot.

I am only aware of one-hot and label encoding, and I know label encoding is not good for this use case. Is there a third option?
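
For reference, this is roughly how I build the 272 columns (a minimal sketch, assuming a pandas DataFrame; the column names are made up):

import pandas as pd

# Hypothetical column names: 3 continuous columns plus one nominal
# categorical column with 269 unique values in the full data set.
df = pd.DataFrame({
    'cont_a': [0.1, 0.5],
    'cont_b': [1.2, 3.4],
    'cont_c': [7.0, 2.2],
    'category': ['cat_004', 'cat_137'],
})

# One-hot encode only the categorical column; on the full data set this
# gives 3 + 269 = 272 feature columns.
encoded = pd.get_dummies(df, columns=['category'], dtype=float)
print(encoded.shape)  # (n_rows, 3 + number of unique categories present)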

So the one column that is categorical is your target data?

I want to confirm before I go further and explain something.

Sure, no problem.

It’s an auto-encoder, so the target is the same as the input. The network reduces the input to a smaller-dimensional feature vector (embedding) and then tries to reconstruct it.

For example, if the input is a 272-element vector, the network reduces it down to, say, a 100-element vector. It then tries to reconstruct the original vector from this reduced version.

So: x in and x out.
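
With the FCNetwork class above, that could look something like this (just a sketch using the example sizes from this thread):

# Sketch: a 272 -> 100 -> 272 auto-encoder built from the FCNetwork class above.
model = FCNetwork(n_input=272, layers=(100,), n_output=272)

x = torch.rand(5, 272)        # one batch: 5 rows, 272 features
reconstruction = model(x)     # same shape as the input
print(reconstruction.shape)   # torch.Size([5, 272])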

Ok, ok, I understand now.
So basically there’s an embedder that takes in the input and resizes it into a smaller vector, then this smaller vector is passed to another layer (a dense layer) that tries to predict the actual vector that was initially inputted.

Hmmmm 🤔
One-hot encoding might actually be the way forward… but then a 272-length one-hot vector might be too sparse.

Well, try encoding your data with number labels instead, so that the shape goes from (5, 272) to (5, 1). Then the input will be a single label and the output will be a probability score vector of length 272, so that the index with the highest probability is the number of that category.

E.g.:
Output: [0.8, 0.1, 0.1] belongs to category 0, because the highest value is at index 0.
Something like that.
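
In code, that idea could look roughly like this (just a sketch; the embedding size is a placeholder, and I am only handling the categorical column here, so the output length is the number of categories):

import torch
from torch import nn

n_categories = 269        # unique values in the one categorical column
embedding_dim = 100       # placeholder size for the learned embedding

model = nn.Sequential(
    nn.Embedding(n_categories, embedding_dim),  # (batch, 1) labels -> (batch, 1, embedding_dim)
    nn.Flatten(),                               # -> (batch, embedding_dim)
    nn.Linear(embedding_dim, n_categories),     # -> (batch, n_categories) scores
)

labels = torch.randint(0, n_categories, (5, 1))          # 5 label-encoded rows
logits = model(labels)                                   # (5, 269)
loss = nn.CrossEntropyLoss()(logits, labels.squeeze(1))  # reconstruct the label
predicted = logits.argmax(dim=1)                         # index with the highest score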

How about that? :man_shrugging:
What do you think?

Well, there’s only one way to find out. I will just try it!

Thanks for your time and your ideas!

Cool.
I would like to get your feedback afterwards 🙃