I am training a fully connected network with 7 hidden linear layers and 48 neurons in each hidden layer (which gives 14353 learnable parameters). My data has 3 input features and 1 output, and the dataset size is around 51230 samples. I am using a DataLoader with 20 batches per epoch. However, the improvement from CPU to GPU is only a 30-40% reduction in training time. After experimenting, I have noticed that the GPU only gives a significant speed-up if the total number of learnable parameters is increased substantially, say to the order of millions (then training time is reduced by around 7x). Can we not achieve a significant benefit from the GPU for a model with 14353 parameters?
Overall, if I train it for 200 epochs, these are the time comparisons:
For 14k model parameters:
CPU: 5.7 min
GPU: 4.1 min
For 3.54 million model parameters:
CPU: 31.0 min
GPU: 4.27 min
Is there any other way I can reduce the training time for the ~14k-parameter model?
If you are working with a small dataset, you could preload the whole dataset and push it to the GPU before training to avoid loading times. Also, make sure not to create any unnecessary synchronization points in your training loop, e.g. by printing the loss too often.
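A rough sketch of what I mean (dataX / dataY, model, criterion, optimizer and num_epochs are placeholders for your own objects, and the batch size is just an example value):

import torch
from torch.utils.data import TensorDataset, DataLoader

device = torch.device('cuda')
# push the whole (small) dataset to the GPU once, before training starts
dataX, dataY = dataX.to(device), dataY.to(device)
train_loader = DataLoader(TensorDataset(dataX, dataY), batch_size=2048, shuffle=True)

for epoch in range(num_epochs):
    for features, target in train_loader:
        optimizer.zero_grad()
        loss = criterion(model(features), target)
        loss.backward()
        optimizer.step()
    # loss.item() (and printing it) forces a synchronization with the GPU,
    # so only read the loss occasionally instead of every iteration
    if epoch % 100 == 0:
        print(epoch, loss.item())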
Since the whole dataset is transferred to the GPU before creating the dataset and train loader, I think there are no loading-time issues. pin_memory wouldn't work for data already on the GPU, if that's what you're suggesting. I have also tried to minimize loss printing. Here are the major parts of the code for clarity.
import time

import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.optim import Adam
from torch.utils.data import TensorDataset, DataLoader


class Net(torch.nn.Module):
    def __init__(self, D_in, H, D_out):
        super(Net, self).__init__()
        self.linear1 = torch.nn.Linear(D_in, H)
        self.linear2 = torch.nn.Linear(H, H)
        self.linear3 = torch.nn.Linear(H, H)
        self.linear4 = torch.nn.Linear(H, H)
        self.linear5 = torch.nn.Linear(H, H)
        self.linear6 = torch.nn.Linear(H, H)
        self.linear7 = torch.nn.Linear(H, H)
        self.linear8 = torch.nn.Linear(H, D_out)

    def forward(self, x):
        out = F.relu(self.linear1(x))
        out = F.relu(self.linear2(out))
        out = F.relu(self.linear3(out))
        out = F.relu(self.linear4(out))
        out = F.relu(self.linear5(out))
        out = F.relu(self.linear6(out))
        out = F.relu(self.linear7(out))
        y_pred = self.linear8(out)
        return y_pred
D_in, H, D_out = 3, 768, 1  # H = 768 gives the ~3.54M-parameter model; H = 48 gives the ~14k one
model = Net(D_in, H, D_out)
criterion = nn.MSELoss(reduction='sum')
optimizer = Adam(model.parameters(), lr=5e-4)

device = torch.device('cuda')
model.to(device)

# dataX / dataY hold the full dataset and are moved to the GPU once, before training
dataX = dataX.to(device)
dataY = dataY.to(device)
dataset = TensorDataset(dataX, dataY)

training_batches = 20
batch_size_train = int(len(dataX) / training_batches) + 1
train_loader = DataLoader(dataset, batch_size=batch_size_train, shuffle=True)
start_time = time.time()
for epoch in range(201):
    running_loss = 0.0
    for i, data in enumerate(train_loader):
        features, target = data
        optimizer.zero_grad()
        forward = model(features)
        loss = criterion(forward, target)
        if epoch % 100 == 0:
            running_loss += loss.item()
        loss.backward()
        optimizer.step()
    if epoch % 100 == 0:
        print('Epoch: {}, Training Loss: {:.2e}'.format(epoch, running_loss / training_batches))

elapsed = time.time() - start_time
print('GPU Time: {:.2f} min'.format(elapsed / 60))
If I understand the issue correctly, you are currently seeing 0.117 s/epoch using random data on the GPU and 1.22 s/epoch if you use your real data?
We are aware of denormal values, which might slow down the execution on the CPU, but this shouldn’t be the case on the GPU.
Could you nevertheless set torch.set_flush_denormal(True) and time it again?
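For reference, something along these lines before the training loop should be enough; torch.set_flush_denormal returns a bool telling you whether the CPU supports it:

import torch

# flush denormal floats to zero on the CPU; returns False if the platform doesn't support it
if not torch.set_flush_denormal(True):
    print('flush_denormal is not supported on this system')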
0.117 s/epoch was the time for 1000 samples of random dummy data, while my real dataset has 51320 samples and 1.22 s/epoch corresponds to that.
My concern is with the comparison of CPU and GPU time for this real data of size 51320, where I am not getting a significant reduction in training time with the GPU. (Also, if I increase the size of the random dummy data to 51320, the times are almost the same.)
For 14k model parameters: CPU: 5.7 min, GPU: 4.1 min
I am not sure if I am doing everything I can to get the maximum benefit from the GPU.
Isn't the GPU benefit supposed to be significant even for a model with only 14k parameters?
Thanks a lot @ptrblck_de for your help throughout the discussion.
Here's an enigma: although the Tesla P100 GPU gives better times for a data size of 1000, the training time increases rapidly as soon as the data size grows, as the corresponding times below show:
Do you understand why the time increase is so different for the two GPUs?
I also tried code from a research paper written with TensorFlow. Having made no changes to it, the training time on the Tesla P100 was 6.3 min, while the paper mentions that an NVIDIA Titan X did it in 1 min. I can't figure out a reason, since I think the P100 is more powerful.
I can rerun these tests tomorrow on a few GPUs and report some numbers.
In the meantime, could you update PyTorch to the latest stable release (1.2.0) so that we get comparable numbers?
On an unrelated note, I think it's worth mentioning that this many linear layers in sequence is not advisable in most cases. You would probably be better off using two linear layers (in total) and playing with the number of neurons in the first linear layer, for example along the lines of the sketch below.
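A minimal sketch of such a variant (the hidden size of 256 is just a placeholder to experiment with):

import torch
import torch.nn.functional as F

class TwoLayerNet(torch.nn.Module):
    def __init__(self, D_in, H, D_out):
        super().__init__()
        self.linear1 = torch.nn.Linear(D_in, H)   # single hidden layer; widen H instead of stacking layers
        self.linear2 = torch.nn.Linear(H, D_out)

    def forward(self, x):
        return self.linear2(F.relu(self.linear1(x)))

model = TwoLayerNet(3, 256, 1)  # 3 input features, hypothetical hidden size, 1 output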
If you want more sensible comparisons, I'd fix the batch size at the same value regardless of the dataset size. I'm not sure why you'd want a fixed number of batches per epoch. Not doing this will result in changing GPU utilization with different dataset sizes.
I wouldn't be surprised if the performance, even on the GPU, is CPU-bound due to Dataset/DataLoader Python overhead. Calling __getitem__ once per batch element, for many thousands of elements, will chew up a lot of CPU time when done in Python. I'd try building a random index and constructing batches from the X and Y tensors manually, roughly as in the sketch below.
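Roughly something like this, assuming dataX / dataY are the full tensors already on the GPU and model / criterion / optimizer / num_epochs are defined as in your code (batch_size is just an example value):

import torch

num_samples = dataX.size(0)
batch_size = 2048

for epoch in range(num_epochs):
    # one shuffle per epoch on the GPU instead of thousands of Python-level __getitem__ calls
    perm = torch.randperm(num_samples, device=dataX.device)
    for start in range(0, num_samples, batch_size):
        idx = perm[start:start + batch_size]
        features, target = dataX[idx], dataY[idx]  # advanced indexing slices a whole batch at once
        optimizer.zero_grad()
        loss = criterion(model(features), target)
        loss.backward()
        optimizer.step()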
So I have worked with the administrators of the computing resources, and it turns out that the P100 GPU I was using wasn't performing optimally for some reason. The same GPU on another system gave performance comparable to what you mentioned.
It doesn't make much difference, except that increasing the number of batches for the same data size increases the computing time. Also, the whole dataset is already on the GPU before training starts.