Hi,
Due to certain system requirements, our team is looking at converting our use of an SGDClassifier from sklearn over to PyTorch. So far, I've been able to take the transformed output of a ColumnTransformer and convert it into PyTorch tensors, which I can then feed into a simple PyTorch model:
import torch

class Network(torch.nn.Module):
    def __init__(self, num_features, num_classes, hidden_units):
        super().__init__()
        # First layer
        self.fc1 = torch.nn.Linear(num_features, hidden_units)
        # Second layer
        self.fc2 = torch.nn.Linear(hidden_units, num_classes)
        # Sigmoid on the final output
        self.output = torch.nn.Sigmoid()

    def forward(self, x):
        fc1 = self.fc1(x)
        fc2 = self.fc2(fc1)
        output = self.output(fc2)
        return output[:, -1]
Data:
train_x = self.preprocessor.fit_transform(self.features_train)
test_x = self.preprocessor.transform(self.features_test)
train_x_tensor = torch.tensor(train_x).float()
test_x_tensor = torch.tensor(test_x).float()
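The label tensors used below are built the same way (the label attribute names here are just placeholders for wherever the labels actually live):

train_y_tensor = torch.tensor(self.labels_train).float()
test_y_tensor = torch.tensor(self.labels_test).float()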
Training:
from torch.utils.data import DataLoader, TensorDataset

train_data = TensorDataset(train_x_tensor, train_y_tensor)
train_loader = DataLoader(train_data, batch_size=batch_size, shuffle=True)

for i in range(num_epochs):
    for x_batch, y_batch in train_loader:
        model.train()
        y_pred = model(x_batch)
        loss = loss_fun(y_pred, y_batch.float())
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
    print('After {} epoch training loss is {}'.format(i, loss.item()))
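For completeness, the surrounding setup looks roughly like this; the hidden_units / lr / batch_size values and the specific loss and optimizer choices below are stand-ins rather than exactly what I'm running:

num_features = train_x_tensor.shape[1]
model = Network(num_features=num_features, num_classes=1, hidden_units=128)
loss_fun = torch.nn.BCELoss()  # assuming a binary target, to pair with the sigmoid output
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
batch_size = 64
num_epochs = 10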
Now, I'm not attached to any of this code, so any suggestions are welcome. The issue I keep running into is that larger datasets give this code trouble. I think it's a memory issue, because whenever I run it on a large dataset the program dies with
zsh: killed python3 myprogram.py
I'm not using GPU objects on this machine, since I don't appear to have access to CUDA, and I'm not sure I'll have access to PyTorch's GPU support in the final workflow either. Can anyone recommend how to approach fixing this? I'm sure PyTorch can handle the size of the dataset - the train_x in question is a 2D array of shape (10000, 37000).
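For scale, a quick back-of-envelope on what that array costs if it ends up dense in memory (the float64 line assumes an intermediate NumPy float64 copy before the .float() cast, which I haven't verified):

rows, cols = 10000, 37000
print(rows * cols * 4 / 1e9)  # ~1.48 GB as a dense float32 tensor
print(rows * cols * 8 / 1e9)  # ~2.96 GB if there's an intermediate float64 copy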
Any advice would be appreciated!