Training is slowed down dramatically when evaluating validation set

zacheberhart · September 11, 2018, 10:21am

I’m fairly new to PyTorch and am playing around with training on a GPU for the first time.

I’m using a pre-trained Densenet121. Unfrozen layers:

Sequential(
  (fc1): Linear(in_features=1024, out_features=500, bias=True)
  (relu): ReLU()
  (fc2): Linear(in_features=500, out_features=81, bias=True)
  (output): Softmax()
)

Anyway, training is working fine (though still fairly slow, all things considered), but when I start calculating the validation loss and accuracy, training slows down dramatically, maybe 3-4x slower. Is there anything I am doing wrong to cause this? What might I do to improve training speed with validation?

Code below:

criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.classifier.parameters(), lr = 0.001)

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
torch.cuda.is_available()

# in-training validation
def validation(model, testloader, criterion):
    test_loss = 0
    accuracy = 0
    for inputs, labels in testloader:
        inputs, labels = inputs.to(device), labels.to(device)

        outputs = model.forward(inputs)
        test_loss += criterion(outputs, labels).item()

        ps = torch.exp(outputs)
        equality = (labels.data == ps.max(dim=1)[1])
        # .float() works on both CPU and GPU, unlike torch.cuda.FloatTensor
        accuracy += equality.float().mean().item()
    
    return test_loss, accuracy

model.to(device)

for epoch in range(epochs):
    running_loss = 0
    model.train()

    for ii, (inputs, labels) in enumerate(trainloader):
        steps += 1
        optimizer.zero_grad()
        inputs, labels = inputs.to(device), labels.to(device)
        outputs = model.forward(inputs)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()
        running_loss += loss.item()
        
        if steps % print_steps == 0:
            model.eval()
            with torch.no_grad():
                test_loss, accuracy = validation(model, testloader, criterion)
            
            print("Epoch: {}/{}.. ".format(epoch+1, epochs),
                  "Training Loss: {:.3f}.. ".format(running_loss / print_steps),
                  "Test Loss: {:.3f}.. ".format(test_loss / len(testloader)),
                  "Test Accuracy: {:.3f}".format(accuracy / len(testloader)))
            
            running_loss = 0
            model.train()

ptrblck · September 11, 2018, 10:56am

Some minor issues:
You should call the model directly with model(inputs) instead of model.forward(inputs), since calling the module also runs any registered hooks. However, that shouldn’t be the cause of the performance issue.
It seems your model returns log probabilities. You can get the prediction directly using torch.argmax(outputs, dim=1) instead of using torch.exp.
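
A minimal sketch of both suggestions, assuming a classifier that ends in LogSoftmax paired with NLLLoss (matching the "log probabilities" reading above, not necessarily the original poster's exact setup). The toy model, toy loader, and the extra `device` argument are stand-ins introduced only so the example runs on its own.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset


def validation(model, loader, criterion, device):
    """Return (mean loss per batch, accuracy over all samples) for one pass."""
    model.eval()
    total_loss, correct, seen = 0.0, 0, 0
    with torch.no_grad():  # no autograd bookkeeping during evaluation
        for inputs, labels in loader:
            inputs, labels = inputs.to(device), labels.to(device)
            outputs = model(inputs)  # call the module, not model.forward
            total_loss += criterion(outputs, labels).item()
            preds = torch.argmax(outputs, dim=1)  # no torch.exp needed
            correct += (preds == labels).sum().item()
            seen += labels.size(0)
    return total_loss / len(loader), correct / seen


# Stand-in data and model (assumptions, only so the sketch runs end to end).
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
torch.manual_seed(0)
toy_loader = DataLoader(
    TensorDataset(torch.randn(32, 1024), torch.randint(0, 81, (32,))),
    batch_size=8,
)
toy_model = nn.Sequential(
    nn.Linear(1024, 500), nn.ReLU(), nn.Linear(500, 81), nn.LogSoftmax(dim=1)
).to(device)

val_loss, val_acc = validation(toy_model, toy_loader, nn.NLLLoss(), device)
print(f"val loss {val_loss:.3f}, val acc {val_acc:.3f}")
```

The function leaves the model in eval mode, so the training loop should call model.train() again afterwards. Each call still makes a full pass over the loader, so running it every print_steps training steps adds that cost each time; validating less often (for example once per epoch) is one way to reduce the overhead.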

Shubham_Singh1 · June 13, 2022, 10:09am

I am training deberta-v3-large. Training runs at about 5 it/s on average, but validation runs at about 1 it/s, which is 5 times slower, and I am wondering why.
I am using dynamic padding (so the validation examples may end up with more tokens). Could this be the reason? I am creating 5 folds, though, so I don’t think dynamic padding can be the cause.
Thanks!
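
To make the dynamic-padding question concrete, here is a minimal pure-Python sketch of how padding each batch to its own longest sequence ties the per-batch work to sequence length. The token sequences are made up, and `PAD_ID`, `pad_batch`, and `padded_tokens` are hypothetical names; this does not reproduce the actual DeBERTa pipeline.

```python
from typing import List

PAD_ID = 0  # hypothetical padding token id


def pad_batch(batch: List[List[int]]) -> List[List[int]]:
    """Pad every sequence to the longest one in this batch only."""
    longest = max(len(seq) for seq in batch)
    return [seq + [PAD_ID] * (longest - len(seq)) for seq in batch]


def padded_tokens(batches: List[List[List[int]]]) -> int:
    """Total number of token slots the model processes after padding."""
    return sum(len(row) for batch in batches for row in pad_batch(batch))


# Made-up token-id sequences: short "training" ones, longer "validation" ones.
train_batches = [[[1, 2, 3], [4, 5]], [[6], [7, 8, 9, 10]]]
val_batches = [[[1] * 12, [2] * 3], [[3] * 10, [4] * 9]]

print(padded_tokens(train_batches))  # 14
print(padded_tokens(val_batches))    # 44
```

With the same number of batches and the same batch size, the second set processes roughly three times as many token slots, so comparing it/s between the two only makes sense if the sequence lengths are similar.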

ptrblck · June 13, 2022, 8:08pm

I’m not sure what “5 folds” would mean in this case or how the number of tokens relates to the model (are there more indices used in the inputs?), so could you add more details about your use case?
It might also help to profile the workflow with the PyTorch profiler or e.g. Nsight Systems.
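
As a rough illustration of the profiling suggestion, the sketch below wraps a few evaluation steps in `torch.profiler`. The small model and random batch are stand-ins (assumptions), and the `"validation_step"` label is an arbitrary name; in a real run the same pattern would be wrapped around a handful of training and validation iterations to compare them.

```python
import torch
import torch.nn as nn
from torch.profiler import ProfilerActivity, profile, record_function

# Stand-in model and batch (assumptions, only so the sketch runs anywhere).
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = nn.Sequential(nn.Linear(256, 128), nn.ReLU(), nn.Linear(128, 10)).to(device)
batch = torch.randn(16, 256, device=device)

# Always record CPU activity; add CUDA kernels when a GPU is in use.
activities = [ProfilerActivity.CPU]
if device.type == "cuda":
    activities.append(ProfilerActivity.CUDA)

model.eval()
with profile(activities=activities, record_shapes=True) as prof:
    with torch.no_grad():
        for _ in range(5):
            with record_function("validation_step"):  # label shown in the report
                model(batch)

# Aggregate by operator name and list the most expensive entries first.
report = prof.key_averages().table(sort_by="cpu_time_total", row_limit=10)
print(report)
```

Profiling only a few iterations keeps the trace small; the per-operator table is usually enough to see whether time goes to the model itself, to data loading, or to host-device transfers.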

Shubham_Singh1 · June 13, 2022, 11:44pm

Thanks for your response. By 5 folds I mean that I split my data into 5 equal parts, train a model on 4 parts, and evaluate it on the remaining part. This way I train 5 models, so that each of the 5 parts acts as the validation set once. As for token length, I confused it with model size: deberta-v3-large takes more time to train and evaluate than deberta-large or deberta-v3-base, and with a smaller model we can choose a slightly bigger batch_size, which speeds up training/validation a little. I misinterpreted that as “the shorter the token length, the faster the speed”.
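
A minimal pure-Python sketch of the 5-fold scheme described above, assuming contiguous, unshuffled folds for simplicity (real setups often shuffle or stratify first). `k_fold_indices` is a hypothetical helper, not taken from any library.

```python
from typing import Iterator, List, Tuple


def k_fold_indices(n_samples: int, k: int = 5) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train_indices, val_indices) so each fold is held out once."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, val
        start += size


splits = list(k_fold_indices(10, k=5))
for fold, (train_idx, val_idx) in enumerate(splits):
    # A fresh model would be trained on train_idx and evaluated on val_idx here.
    print(f"fold {fold}: train on {len(train_idx)} samples, validate on {val_idx}")
```

Each validation fold here holds a quarter as many samples as its training split, so a full validation pass covers fewer examples than a training epoch.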
If you’d like, I can share my code along with the data.
Thanks!