I am pretty new to PyTorch, having only set up a few models so far.

I am attempting to implement a fully connected ReLU network as seen in this example - https://pytorch.org/tutorials/beginner/examples_nn/two_layer_net_nn.html

I have a fairly large dataset: ~750k rows and 2k columns. I am training the model on a cluster with 4 GPUs. When I try to train the model, I get RuntimeError: CUDA error: out of memory.
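To put the dataset size in perspective, here is a rough back-of-the-envelope estimate of the float32 memory involved when everything is pushed through the network in a single batch. It is only a sketch: it assumes exactly 750,000 rows and 2,000 columns (the figures above are approximate), treats every row as a training row, uses the hidden layer widths from the code further down, and ignores gradients, optimizer state, and allocator overhead.

```python
# Rough float32 memory estimate for a single full-dataset batch.
# Assumptions: exactly 750,000 rows x 2,000 columns (the post only says "~"),
# every row used for training, hidden widths 500/250/125 as in the posted model.
BYTES_PER_FLOAT32 = 4
rows, cols = 750_000, 2_000
hidden_widths = [500, 250, 125]


def gib(n_bytes):
    """Convert a byte count to GiB."""
    return n_bytes / 2**30


# The input tensor itself, once converted to float32.
input_bytes = rows * cols * BYTES_PER_FLOAT32

# At least one (rows x width) activation per hidden layer has to be kept
# around for the backward pass, so this is a lower bound.
activation_bytes = sum(rows * width * BYTES_PER_FLOAT32 for width in hidden_widths)

print(f"input tensor:       {gib(input_bytes):.2f} GiB")
print(f"hidden activations: {gib(activation_bytes):.2f} GiB (lower bound)")
```

With these assumptions the input alone is 6,000,000,000 bytes, before counting the test set or any activations, which is already a large share of a typical single GPU's memory.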

From what I have read so far, my options seem to be either freeing cached tensors during training or reducing the batch size.

I tried to detach the variables, but I don't think I am doing it correctly, since I still run into the same error. I am also not sure how to implement batching when creating tensors from a NumPy array. There is a Stack Overflow question on the topic, but it doesn't have a solution: https://stackoverflow.com/questions/46170814/how-to-train-pytorch-model-with-numpy-data-and-batch-size.
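For context, a minimal sketch of mini-batching from NumPy arrays with `TensorDataset` and `DataLoader` might look like the following. The random stand-in data, the tiny model, and the batch size of 256 are all assumptions chosen for illustration, not values from my actual setup. The idea is to keep the full dataset in CPU memory and move only one batch at a time to the device.

```python
import numpy as np
import torch
from torch.utils.data import DataLoader, TensorDataset

# Hypothetical stand-in data; in practice these would be the real
# training arrays (e.g. X_trainTransformed and y_train).
rng = np.random.default_rng(0)
X_np = rng.standard_normal((1000, 20)).astype(np.float32)
y_np = rng.standard_normal(1000).astype(np.float32)

# Fall back to the CPU so the sketch also runs without a GPU.
device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')

# The dataset stays on the CPU; only individual batches are sent to the device.
dataset = TensorDataset(torch.from_numpy(X_np),
                        torch.from_numpy(y_np).reshape(-1, 1))
loader = DataLoader(dataset, batch_size=256, shuffle=True)  # batch size is arbitrary

# Deliberately tiny model, just to make the loop runnable.
model = torch.nn.Sequential(
    torch.nn.Linear(X_np.shape[1], 8),
    torch.nn.ReLU(),
    torch.nn.Linear(8, 1),
).to(device)
lossFunction = torch.nn.MSELoss(reduction='sum')
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

batchSizes = []  # recorded only to show how the data was split
for epoch in range(2):
    for X_batch, y_batch in loader:
        X_batch, y_batch = X_batch.to(device), y_batch.to(device)
        loss = lossFunction(model(X_batch), y_batch)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        batchSizes.append(X_batch.shape[0])
```

With this layout, peak device memory should scale with the batch size rather than with the number of rows in the dataset.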

My code:

```
import numpy as np
import torch
from sklearn.metrics import r2_score

device = torch.device('cuda:0')

# Layer sizes
inputSize = X_trainTransformed.shape[1]
firstHiddenLayer = 500
secondHiddenLayer = 250
thirdHiddenLayer = 125
outputLayer = 1

# Fully connected ReLU network with three hidden layers
model = torch.nn.Sequential(
    torch.nn.Linear(inputSize, firstHiddenLayer),
    torch.nn.ReLU(),
    torch.nn.Linear(firstHiddenLayer, secondHiddenLayer),
    torch.nn.ReLU(),
    torch.nn.Linear(secondHiddenLayer, thirdHiddenLayer),
    torch.nn.ReLU(),
    torch.nn.Linear(thirdHiddenLayer, outputLayer)
)

# Replicate the model across all available GPUs
if torch.cuda.device_count() > 1:
    print('Train using', torch.cuda.device_count(), 'GPUs!')
    model = torch.nn.DataParallel(model)
model.to(device)

# The full training and test sets are moved to the GPU at once
X_trainTensor = torch.from_numpy(X_trainTransformed).float().to(device)
y_trainTensor = torch.from_numpy(np.array(y_train)).float().reshape(-1, 1).to(device)
X_testTensor = torch.from_numpy(X_testTransformed).float().to(device)
y_testTensor = torch.from_numpy(np.array(y_test)).float().reshape(-1, 1).to(device)

# reduction='sum' replaces the deprecated size_average=False
lossFunction = torch.nn.MSELoss(reduction='sum')
learningRate = 1e-4
optimizer = torch.optim.Adam(model.parameters(), lr=learningRate)

for step in range(2000):
    # Forward pass over the entire training set as a single batch
    yPredict = model(X_trainTensor)
    loss = lossFunction(yPredict, y_trainTensor)

    # My attempt at detaching; the results are not assigned, so this has no effect
    yPredict.detach()
    loss.detach()

    model.zero_grad()
    loss.backward()
    optimizer.step()

    if step % 200 == 0:
        testPrediction = model(X_testTensor).cpu()
        # r2_score expects (y_true, y_pred)
        print(step, r2_score(np.array(y_test).reshape(-1, 1), testPrediction.detach().numpy()))
```

Any advice on how to implement a solution to allow this model to train on the GPUs would be great.