ReLU Network CUDA Memory Error

I am pretty new to PyTorch, having only set up a few models so far.

I am attempting to implement a fully connected ReLU network as seen in this example -

I have a fairly large dataset: ~750k rows and 2k columns. I am training the model on a cluster with 4 GPUs. When I try to train the model, I get RuntimeError: CUDA error: out of memory.
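
For scale, here is a rough back-of-the-envelope estimate of the training input tensor alone. It assumes float32 values (which is what .float() produces) and takes the shape to be exactly 750,000 x 2,000, whereas the real dataset is only approximately that size:

```python
# Rough size of the training inputs once converted with .float() (float32).
# Assumption: the shape is exactly 750,000 x 2,000; the real data is only "~" that.
rows, cols = 750_000, 2_000
bytes_per_float32 = 4

input_bytes = rows * cols * bytes_per_float32
input_gib = input_bytes / 1024**3

print(f"{input_bytes:,} bytes = {input_gib:.2f} GiB")
# prints: 6,000,000,000 bytes = 5.59 GiB
```

Under those assumptions, the inputs take about 5.6 GiB on a single device before any activations, gradients, or optimizer state are stored.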

From the research I have done so far, it seems my options are to drop the cached variables during training or to reduce the batch size.

I attempted to detach the variables, but I don’t believe I am doing this correctly, as I still run into the same error. I am also not sure how to implement batching when creating tensors from a NumPy array. There is a Stack Overflow question on the topic, but it doesn’t have a solution -
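
One possible way to batch tensors created from NumPy arrays is to wrap them in a TensorDataset and iterate with a DataLoader. The following is only a minimal sketch: X_np and y_np are hypothetical stand-ins for the real arrays, and batch_size=256 is an arbitrary value chosen for illustration.

```python
import numpy as np
import torch
from torch.utils.data import DataLoader, TensorDataset

# Hypothetical stand-ins for X_trainTransformed and y_train (much smaller here).
X_np = np.random.rand(1000, 20)
y_np = np.random.rand(1000)

# Keep the full tensors on the CPU; only individual batches are moved to the GPU.
X_cpu = torch.from_numpy(X_np).float()
y_cpu = torch.from_numpy(y_np).float().reshape(-1, 1)

# batch_size=256 is an arbitrary choice for illustration.
loader = DataLoader(TensorDataset(X_cpu, y_cpu), batch_size=256, shuffle=True)

# Fall back to the CPU so the sketch also runs without a GPU.
device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')
for X_batch, y_batch in loader:
    X_batch, y_batch = X_batch.to(device), y_batch.to(device)
    print(X_batch.shape, y_batch.shape)
```

With this layout, GPU memory use scales with the batch size rather than with the number of rows in the dataset.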

My code:

import numpy as np
import torch
from sklearn.metrics import r2_score

device = torch.device('cuda:0')
inputSize = X_trainTransformed.shape[1]
firstHiddenLayer = 500
secondHiddenLayer = 250
thirdHiddenLayer = 125
outputLayer = 1

model = torch.nn.Sequential(
  torch.nn.Linear(inputSize, firstHiddenLayer),
  torch.nn.ReLU(),
  torch.nn.Linear(firstHiddenLayer, secondHiddenLayer),
  torch.nn.ReLU(),
  torch.nn.Linear(secondHiddenLayer, thirdHiddenLayer),
  torch.nn.ReLU(),
  torch.nn.Linear(thirdHiddenLayer, outputLayer)
).to(device)

if torch.cuda.device_count() > 1:
  print('Train using', torch.cuda.device_count(), 'GPUs!')
  model = torch.nn.DataParallel(model)

X_trainTensor = torch.from_numpy(X_trainTransformed).float().to(device)
y_trainTensor = torch.from_numpy(np.array(y_train)).float().reshape(-1, 1).to(device)
X_testTensor = torch.from_numpy(X_testTransformed).float().to(device)
y_testTensor = torch.from_numpy(np.array(y_test)).float().reshape(-1, 1).to(device)

# reduction='sum' replaces the deprecated size_average=False.
lossFunction = torch.nn.MSELoss(reduction='sum')
learningRate = 1e-4
optimizer = torch.optim.Adam(model.parameters(), lr = learningRate)

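for step in range(2000):
  yPredict = model(X_trainTensor)
  loss = lossFunction(yPredict, y_trainTensor)
  # Reset gradients, backpropagate, and update the weights.
  optimizer.zero_grad()
  loss.backward()
  optimizer.step()
  if step % 200 == 0:
    testPrediction = model(X_testTensor).cpu()
    # r2_score expects (y_true, y_pred).
    print(step, r2_score(np.array(y_test).reshape(-1, 1), testPrediction.detach().numpy()))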
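
For comparison, a mini-batch version of this loop might look roughly like the sketch below. It is not a confirmed fix: the synthetic tensors, the smaller network, the batch size of 256, and the three epochs are all assumptions made for illustration. The data stays on the CPU, each batch is moved to the device on its own, and the test pass runs in chunks under torch.no_grad() so no autograd graph is kept for it.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)
# Fall back to the CPU so the sketch also runs without a GPU.
device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')

# Hypothetical synthetic data standing in for the real tensors; it stays on the CPU.
X_train, y_train = torch.randn(2048, 20), torch.randn(2048, 1)
X_test = torch.randn(512, 20)

# A smaller network than the one above, purely for illustration.
model = torch.nn.Sequential(
    torch.nn.Linear(20, 16),
    torch.nn.ReLU(),
    torch.nn.Linear(16, 1),
).to(device)
lossFunction = torch.nn.MSELoss(reduction='sum')
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

trainLoader = DataLoader(TensorDataset(X_train, y_train), batch_size=256, shuffle=True)

for epoch in range(3):
    model.train()
    for X_batch, y_batch in trainLoader:
        # Only the current batch lives on the device.
        X_batch, y_batch = X_batch.to(device), y_batch.to(device)
        loss = lossFunction(model(X_batch), y_batch)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    # Evaluate in chunks without building an autograd graph.
    model.eval()
    with torch.no_grad():
        chunks = [model(chunk.to(device)).cpu() for chunk in X_test.split(256)]
    testPrediction = torch.cat(chunks)
    print(epoch, testPrediction.shape)
```

Because the predictions are produced under torch.no_grad(), they can be converted with .numpy() directly, without a separate .detach() call.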

Any advice on how to implement a solution to allow this model to train on the GPUs would be great.