I am using the NVIDIA Apex package to speed up training of my CNN model. I compared the performance of the traditional Adam algorithm against Apex's O1 mixed-precision optimization with the following code:
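Roughly, the Apex O1 path looks like this (a minimal sketch, not my exact script; `model`, `criterion`, and `train_loader` are assumed to be defined elsewhere):

```python
import torch
from apex import amp  # NVIDIA Apex (https://github.com/NVIDIA/apex)

# model, criterion, and train_loader are assumed to be defined elsewhere
model = model.cuda()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# O1 patches selected ops to run in FP16 while keeping FP32 master weights
model, optimizer = amp.initialize(model, optimizer, opt_level="O1")

for inputs, targets in train_loader:
    inputs, targets = inputs.cuda(), targets.cuda()
    optimizer.zero_grad()
    outputs = model(inputs)
    loss = criterion(outputs, targets)
    # scale the loss before backward to avoid FP16 gradient underflow
    with amp.scale_loss(loss, optimizer) as scaled_loss:
        scaled_loss.backward()
    optimizer.step()
```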
The training process is visibly sped up, roughly 3-4x compared with the traditional Adam baseline. But when I evaluate the trained model, I find that the model trained with Apex performs worse on the test set than the one trained with plain Adam. Are there any solutions? I want to speed up training while still obtaining good performance on the test set.
Hi, @ptrblck. I have switched to the torch.amp utilities, but I have run into another problem. Training for the same 500 epochs, the torch.amp run is more prone to overfitting than the plain-Adam method. How can I solve this problem?
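For reference, my loop follows the usual native-AMP pattern; here is a minimal sketch (again assuming `model`, `optimizer`, `criterion`, and `train_loader` are defined elsewhere):

```python
import torch

# model, optimizer, criterion, and train_loader are assumed to be defined
scaler = torch.cuda.amp.GradScaler()

for inputs, targets in train_loader:
    inputs, targets = inputs.cuda(), targets.cuda()
    optimizer.zero_grad()
    # run the forward pass and loss computation in mixed precision
    with torch.cuda.amp.autocast():
        outputs = model(inputs)
        loss = criterion(outputs, targets)
    # scale the loss, backprop, step with unscaled grads, update the scale
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()
```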
We haven't seen any overfitting issues using amp, and a few examples are given e.g. in this blog post. It would be great if you could provide more information about the expected target accuracy and the mean +/- stddev of the achieved accuracy when comparing the different approaches.
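To collect those statistics, you could rerun each configuration with a few random seeds and report the mean and standard deviation of the final test accuracy. A minimal sketch, assuming a hypothetical `train_and_evaluate()` helper that trains from scratch and returns the test accuracy:

```python
import statistics
import torch

# train_and_evaluate is a hypothetical helper that trains the model from
# scratch and returns the final test-set accuracy as a float
accuracies = []
for seed in (0, 1, 2, 3, 4):
    torch.manual_seed(seed)
    accuracies.append(train_and_evaluate())

print(f"accuracy: {statistics.mean(accuracies):.4f} "
      f"+/- {statistics.stdev(accuracies):.4f}")
```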