Converting sklearn Classifier to PyTorch


Due to certain system requirements, our team is looking at converting our use of an SGD classifier from sklearn over to PyTorch. So far, I’ve been able to take the transformed data from a ColumnTransformer and convert it into PyTorch tensors, which it seems I can then pass to a simple PyTorch model:

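        import torch

        class Network(torch.nn.Module):
            def __init__(self, num_features, num_classes, hidden_units):
                # Required so that nn.Module can register the layers below
                super().__init__()
                # First layer
                self.fc1 = torch.nn.Linear(num_features, hidden_units)
                # Second layer
                self.fc2 = torch.nn.Linear(hidden_units, num_classes)
                # Final output of sigmoid function
                self.output = torch.nn.Sigmoid()

            def forward(self, x):
                # Note: there is no activation between fc1 and fc2, so together
                # they still compose to a single linear map
                fc1 = self.fc1(x)
                fc2 = self.fc2(fc1)
                output = self.output(fc2)
                # Keep only the last class column of the sigmoid output
                return output[:, -1]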


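        train_x = self.preprocessor.fit_transform(self.features_train)
        test_x = self.preprocessor.transform(self.features_test)
        # Assuming train_x / test_x are dense NumPy arrays: as_tensor converts
        # straight to float32 without first building a separate float64 tensor
        train_x_tensor = torch.as_tensor(train_x, dtype=torch.float32)
        test_x_tensor = torch.as_tensor(test_x, dtype=torch.float32)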


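        from torch.utils.data import DataLoader, TensorDataset

        train_data = TensorDataset(train_x_tensor, train_y_tensor)
        train_loader = DataLoader(train_data, batch_size=batch_size, shuffle=True)
        for i in range(num_epochs):
            for x_batch, y_batch in train_loader:
                # `optimizer` is assumed to be defined elsewhere,
                # such as torch.optim.SGD over model.parameters()
                optimizer.zero_grad()
                y_pred = model(x_batch)
                loss = loss_fun(y_pred, y_batch.float())
                loss.backward()
                optimizer.step()
            print('After {} epoch training loss is {}'.format(i, loss.item()))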

Now, I’m not attached to any of this code - so any suggestions are welcome, but the issue I keep running into is that large datasets seem to give this code trouble. I think there’s a memory issue because whenever I run this on large datasets, the program dies with

zsh: killed     python3

I am not using GPU objects on this machine as I don’t appear to have access to CUDA, and I’m not even sure if I’ll have access to PyTorch GPU support in the final workflow. Can anyone recommend how I can approach fixing this issue? I’m sure PyTorch can handle the size of the dataset: the train_x in question is a 2D array of shape (10000, 37000).

Any advice would be appreciated!

Your OS might kill your process if it’s running out of host RAM.
dmesg should also show more information if the OOM killer was invoked. If so, you would need to reduce the memory usage, e.g. by using a smaller input.
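
To narrow down which step is using the host RAM, one option is to log the process's peak resident memory around each stage. The following is only a minimal sketch, assuming a Unix-like system where the standard-library resource module is available; peak_rss_mb is a hypothetical helper name, and the list allocation is just a stand-in for a real preprocessing step.

```python
import resource
import sys


def peak_rss_mb() -> float:
    """Return this process's peak resident set size in MB (hypothetical helper)."""
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss is reported in bytes on macOS and in kilobytes on Linux.
    bytes_used = peak if sys.platform == "darwin" else peak * 1024
    return bytes_used / 1e6


before = peak_rss_mb()
# Stand-in for a memory-hungry step such as fit_transform or tensor creation.
data = [0.0] * 5_000_000
after = peak_rss_mb()
print(f"peak RSS before: {before:.0f} MB, after: {after:.0f} MB")
```

Calling the helper after the preprocessing step, after the tensor creation, and inside the training loop should show where the peak jumps.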

Ah yeah, that does appear to be the issue:
[6052865.969576]: low swap: killing largest compressed process with pid 93743 (python3.11) and size 24275 MB

Is the only way around this sort of thing to run on a different machine and/or use a GPU instead?

To be clear, I don’t think I can even get past the initial allocation. I don’t think the memory is being used by the training of the model; it seems to run out during the initial tensor creation.

I don’t think the initial tensor is problematic, since it would take only ~1.4GB in float32, while your process seems to use ~24GB based on the log output.
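
For reference, the estimate above can be reproduced with a quick back-of-the-envelope calculation. This is just a sketch of the arithmetic for a dense array of the shape mentioned in the question; dense_size_gb is a hypothetical helper, and it counts only the raw element storage (decimal GB), not any overhead or extra copies.

```python
def dense_size_gb(rows: int, cols: int, bytes_per_element: int) -> float:
    """Raw storage of a dense rows x cols array in decimal gigabytes."""
    return rows * cols * bytes_per_element / 1e9


rows, cols = 10_000, 37_000
float32_gb = dense_size_gb(rows, cols, 4)  # float32 uses 4 bytes per element
float64_gb = dense_size_gb(rows, cols, 8)  # float64 uses 8 bytes per element

print(f"float32: {float32_gb:.2f} GB")  # float32: 1.48 GB
print(f"float64: {float64_gb:.2f} GB")  # float64: 2.96 GB
```

Even a float64 copy plus a float32 copy of the same array would add up to well under 5 GB, which is far below the ~24GB reported in the log.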

There’s probably other stuff I can see about removing then… This at least gives me ideas. Thank you!