How to calculate the loss on ResNet

Dear all!
I would like to write this code for training a ResNet34 model on image data. I don't understand some parts: in particular, for row 6, is it correct to calculate the loss this way? I use a batch size of 4, and the same question applies to row 23.
Is it correct to calculate the accuracy (ACC) after training in this way?
Thanks in advance for any help

If you are referring to line 17, then yes, this is the standard way to calculate the loss:

loss = criterion(outputs_tr, labels_tr)

The script neither imports nor defines the metric functions (mcor, acc, precision, recall), so we cannot say whether that part of the implementation is correct.
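
For reference, classification metrics like accuracy are usually computed on the predictions collected after training, e.g. with sklearn. A minimal sketch only; model, device, and loader_test are placeholders for your own objects, and I'm assuming the model outputs one logit vector per image:

import torch
from sklearn.metrics import accuracy_score as acc
from sklearn.metrics import matthews_corrcoef as mcor
from sklearn.metrics import precision_score as precision
from sklearn.metrics import recall_score as recall

model.eval()
all_preds, all_labels = [], []
with torch.no_grad():
    for batch in loader_test:
        images = batch["data"].to(device)
        labels = torch.LongTensor(batch["label"])
        outputs = model(images)
        preds = outputs.argmax(dim=1)  # predicted class index per sample
        all_preds.extend(preds.cpu().tolist())
        all_labels.extend(labels.tolist())

print("ACC:      ", acc(all_labels, all_preds))
print("MCC:      ", mcor(all_labels, all_preds))
print("Precision:", precision(all_labels, all_preds, average="macro"))
print("Recall:   ", recall(all_labels, all_preds, average="macro"))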

Thanks so much for your kind help! I'm adding that information:

from sklearn.metrics import accuracy_score as acc
from sklearn.metrics import confusion_matrix
from sklearn.metrics import matthews_corrcoef as mcor
from sklearn.metrics import precision_score as precision
from sklearn.metrics import recall_score as recall
from torch.utils.data import DataLoader
from torch.utils.tensorboard import SummaryWriter
from torchvision import transforms

I'm sorry, I didn't explain it well… my problem refers to these lines, where I start checking the test set:

if j % int(len(loader_train) / 2) == 0 and j != 0:
                model.eval()
                with torch.no_grad():

In some scripts I found, this check is performed this way after 10 accumulation steps… what is the right way to do it if I have a batch of 4 images at a time?

The training batch size doesn’t matter, since you are using loader_test.
The loss calculation looks wrong:

loss_test_avg = losses_sum / num_samples_test
mean_loss_train = losses_sum / (
    len(loader_train) * loader_train.batch_size
)

I’m not sure which reduction you are using in the criterion, but based on the variable names I would assume reduction='sum'?
On the other hand, since you are dividing by num_samples_test (which is in fact the number of batches), you might be using the default reduction='mean'?

The mean_loss_train calculation doesn't seem to be correct, since you are dividing the test loss by the number of training samples.
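
To make the difference concrete, here are both variants as a sketch. I'm assuming something like nn.CrossEntropyLoss as the criterion; model, device, and loader_test stand in for the objects from your script:

import torch
import torch.nn as nn

# Variant A: reduction='sum' -> the criterion returns the summed loss of the
# batch, so divide the accumulated sum by the total number of samples.
criterion_sum = nn.CrossEntropyLoss(reduction="sum")
losses_sum, num_samples = 0.0, 0
with torch.no_grad():
    for batch in loader_test:
        images = batch["data"].to(device)
        labels = torch.LongTensor(batch["label"]).to(device)
        outputs = model(images)
        losses_sum += criterion_sum(outputs, labels).item()
        num_samples += labels.size(0)
loss_test_avg = losses_sum / num_samples

# Variant B: default reduction='mean' -> each call already returns the
# per-sample mean of the batch, so weight it by the batch size before
# dividing by the total number of samples (this also handles a smaller
# last batch correctly).
criterion_mean = nn.CrossEntropyLoss()
losses_sum, num_samples = 0.0, 0
with torch.no_grad():
    for batch in loader_test:
        images = batch["data"].to(device)
        labels = torch.LongTensor(batch["label"]).to(device)
        outputs = model(images)
        losses_sum += criterion_mean(outputs, labels).item() * labels.size(0)
        num_samples += labels.size(0)
loss_test_avg = losses_sum / num_samples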

I couldn't find a reduction argument, so I'm using the default. Is the loss calculated correctly this way?

for epoch in range(EPOCHS):
      
        for j, data in enumerate(loader_train):
            global_i += 1

            if j % 10 == 0:
                print(time.time() - start_time)
                start_time = time.time()

            optimizer.zero_grad()

            images_tr = data["data"].to(device)
            labels_tr = torch.LongTensor(data["label"]).to(device)
            outputs_tr = model(images_tr).to(device)

            # backward
            loss = criterion(outputs_tr, labels_tr)
            loss.backward()

            optimizer.step()

            # check test set
            if j % int(len(loader_train) / 2) == 0 and j != 0:
                model.eval()
                with torch.no_grad():

                    losses_sum = 0
                    num_samples_test = 0

                    for data_test in loader_test:

                        images_ts = data_test["data"].to(device)
                        labels_ts = torch.LongTensor(data_test["label"]).to(device)

                        outputs_ts = model(images_ts)

                        loss_test_sum = criterion(outputs_ts, labels_ts).item()
                        losses_sum += loss_test_sum
                        num_samples_test += 1

                    loss_test_avg = losses_sum / num_samples_test
                    last_loss_test = loss_test_avg

                val_epoch_tr_loss = loss.item() / len(loader_train)
                losses_tr.append(val_epoch_tr_loss)
                losses_ts.append(loss_test_avg)

                del images_ts, labels_ts

            iteration += 1
            del images_tr, labels_tr
            gc.collect()
            model.train()

Is it correct? The training loss is calculated based on the number of data in the training set, and the test loss based on the number of elements in the test set.
Thanks for the kind help!

In the transfer learning tutorial I also found this way of doing the loss calculation:

running_loss += loss.item() * inputs.size(0)

So I don't understand which way is the correct one. Does inputs.size(0) correspond to the batch size that I use?

The running_loss calculation multiplies the averaged batch loss (loss) by the current batch size, and this sum is then divided by the total number of samples.

In your example you are summing the averaged batch losses and dividing by the number of batches.
This might create an offset if your last batch is smaller than the others.

Your code snippet looks alright, if you want to ignore the potential offset in the loss calculation.
However, val_epoch_tr_loss uses the loss of the last batch during training and divides by the number of batches in the training DataLoader, which still seems to be wrong.
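
If you want to avoid both issues, you could track a weighted running loss for training as in the tutorial. A sketch, reusing the names from your script (losses_tr, loader_train, etc.):

import torch

running_loss_tr = 0.0
seen_samples_tr = 0

for j, data in enumerate(loader_train):
    images_tr = data["data"].to(device)
    labels_tr = torch.LongTensor(data["label"]).to(device)

    optimizer.zero_grad()
    outputs_tr = model(images_tr)
    loss = criterion(outputs_tr, labels_tr)
    loss.backward()
    optimizer.step()

    # weight the averaged batch loss by the actual batch size
    running_loss_tr += loss.item() * images_tr.size(0)
    seen_samples_tr += images_tr.size(0)

# average over all samples seen in the epoch, robust to a smaller last batch
epoch_loss_tr = running_loss_tr / seen_samples_tr
losses_tr.append(epoch_loss_tr)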

Thanks for your help. So you suggest doing only this calculation:

losses_tr.append(loss.item())
print(
    "Train_loss:{:.4f} Test_loss {:.4f}".format(
        train_loss_t, last_loss_test
    )
)

In the usual use case you would compute the average of both the training and the test loss.
Your suggested approach of using:

    loss = criterion(outputs_ts, labels_ts)
    losses_sum += loss.item()
    num_samples += 1
...
loss_test_avg = losses_sum / num_samples

would work for both losses. Although you might see a slight bias (as explained before), it could be “good enough”.

Alternatively, you could use the AverageMeter from the ImageNet example and update the loss as seen here.
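
For reference, a simplified sketch of such an AverageMeter and how you could use it for the test loss (the names images_ts, labels_ts, etc. follow your script):

import torch


class AverageMeter:
    """Tracks the latest value, the running sum, the count, and the average."""

    def __init__(self):
        self.reset()

    def reset(self):
        self.val = 0.0
        self.sum = 0.0
        self.count = 0
        self.avg = 0.0

    def update(self, val, n=1):
        self.val = val
        self.sum += val * n
        self.count += n
        self.avg = self.sum / self.count


# usage: weight each averaged batch loss by the batch size
losses_meter = AverageMeter()
model.eval()
with torch.no_grad():
    for data_test in loader_test:
        images_ts = data_test["data"].to(device)
        labels_ts = torch.LongTensor(data_test["label"]).to(device)
        outputs_ts = model(images_ts)
        loss = criterion(outputs_ts, labels_ts)
        losses_meter.update(loss.item(), images_ts.size(0))

print("Test loss: {:.4f}".format(losses_meter.avg))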