Training stopped at the end of the training phase of the first iteration

Hello everybody,

Over the last few days I have encountered a very strange problem: my training stops at the end of the training phase of the first epoch (it never performs the validation step), without any errors.

The main part of my training code is shown below. I have tested it multiple times and the code works well on a subset of my dataset. However, whenever I run it on the full dataset (over 700,000 images with a 9:1 train/val split), it gets stuck at the end of the first training phase and never enters the validation phase.

I remember having a similar issue some time ago: training did not enter the validation phase and produced a segmentation fault. However, this time there were no errors (the last lines of the terminal output are shown after the code).

Could you please help me find out what happened?

Thank you so much in advance for any suggestions!

for phase in ['train', 'val']:
    print('Entering in phase:', phase)
    if phase == 'train':
        scheduler.step(best_acc)
        model.train()
    else:
        model.eval()

    running_loss = 0.0
    running_corrects = 0

    # Iterate over data.
    print('Iterating over data:')
    n_samples = 0
    for batch_idx, (inputs, labels) in enumerate(dataloaders[phase]):
        inputs = inputs.to(device)
        labels = labels.to(device)

        # zero the parameter gradients
        optimizer.zero_grad()

        # forward
        # track history if only in train
        with torch.set_grad_enabled(phase == 'train'):
            outputs = model(inputs)
            _, preds = torch.max(outputs, 1)
            loss = criterion(outputs, labels)

            # backward + optimize only if in training phase
            if phase == 'train':
                loss.backward()
                optimizer.step()

        # statistics
        running_loss += loss.item() * inputs.size(0)
        running_corrects += torch.sum(preds == labels.data)
        n_samples += len(labels)

        if phase == 'train':
            print('{}/{} Avg Loss: {:.4f} Avg Acc: {:.4f}'.format(n_samples, dataset_sizes[phase],
                                    running_loss/n_samples, running_corrects.double()/n_samples))
    print('Done iterating over data')
    epoch_loss = running_loss / dataset_sizes[phase]
    epoch_acc = running_corrects.double() / dataset_sizes[phase]
    print('\t{} Loss: {:.4f}\t{} Acc: {:.4f}'.format(phase, epoch_loss, phase, epoch_acc))

# Here we are at the end of the 'val' phase
# check for improvement over the last epochs
if epoch_acc > best_acc:
    print('Improved.')
    best_acc = epoch_acc
    try:
        state_dict = model.module.state_dict()
    except AttributeError:
        state_dict = model.state_dict()
    torch.save(state_dict, save_path)

632736/633656 Avg Loss: 0.0436 Avg Acc: 0.9868
632944/633656 Avg Loss: 0.0436 Avg Acc: 0.9868
633152/633656 Avg Loss: 0.0435 Avg Acc: 0.9868
633360/633656 Avg Loss: 0.0435 Avg Acc: 0.9868
633568/633656 Avg Loss: 0.0435 Avg Acc: 0.9868
633656/633656 Avg Loss: 0.0435 Avg Acc: 0.9868

I tried with half the dataset and it works, but with the full dataset it does not :frowning:

That’s really strange. How long did you wait?
Could it be that your memory is full and the swap is used for the validation step (which would take some time)?
What stack trace do you get if you kill the process with CTRL+C?
Are you using multiple workers? If so, does your code run using num_workers=0?
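
As a quick check, you could rebuild your DataLoaders with num_workers=0 for a debug run, roughly like this (train_dataset, val_dataset, and the batch size are just placeholders for your own setup):

from torch.utils.data import DataLoader

# With num_workers=0 all batches are loaded in the main process,
# which rules out worker-related deadlocks.
dataloaders = {
    'train': DataLoader(train_dataset, batch_size=64, shuffle=True, num_workers=0),
    'val': DataLoader(val_dataset, batch_size=64, shuffle=False, num_workers=0),
}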

@ptrblck Thank you for your reply. Yes, I'm using multiple workers; I set the number of workers equal to the number of CPU cores (8). I also forgot to mention: training was run on multiple GPUs (4).
It was stuck at the end of training for at least two hours before I shut it down (and this is reproducible; I ran it at least three times). Unfortunately I do not remember which stack trace I got, but I think it was related to the dataloaders.
Thanks again!
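
For context, this is roughly how the loaders and the model are set up on my side (the dataset objects and the batch size are placeholders):

import torch
from torch.utils.data import DataLoader

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# 8 worker processes per loader (dataset objects are placeholders)
dataloaders = {
    'train': DataLoader(train_dataset, batch_size=128, shuffle=True, num_workers=8),
    'val': DataLoader(val_dataset, batch_size=128, shuffle=False, num_workers=8),
}

# the model is wrapped for the 4 GPUs, hence the model.module.state_dict()
# fallback in the checkpointing code above
model = torch.nn.DataParallel(model).to(device)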

Thanks for the info! Is your code working with num_workers=0?
If so, this might narrow down the possible source of your issue.

I have tested today with num_workers=0 and it works! It finished the first epoch (training + validation) and now it’s running the second epoch.

Do you have any suggestions for fixing the problem with multiple workers? Otherwise training is much slower (6 hours per epoch compared to 3 hours with num_workers=8) :frowning:

Thank you for your help!

Which PyTorch version are you using? We had some deadlock issues in older versions, which should be fixed by now in the latest release.
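
You can check the installed version with:

import torch
# prints the version string of the currently installed PyTorch build
print(torch.__version__)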

I’m using PyTorch 0.4.1. Since PyTorch 1.0 is a big update, I was afraid of upgrading because of potential incompatibilities. However, after a quick search it seems that 0.4.1 code can run seamlessly on 1.0, so I think I will upgrade. Thanks a lot!

The change should be pretty straightforward.
However, if you encounter any issues, please let us know here and I’m sure we can fix it. :wink:

Thanks a lot! I will get back soon.

I found a similar issue.
My program hangs unexpectedly at some random point after several epochs of training. I also set num_workers>0 and haven’t tested with num_workers=0 yet.

@ptrblck I confirm that upgrading to PyTorch 1.0 solves this issue. Thank you very much for your help!
Surprisingly, I didn’t have to change anything in my code (written for 0.4.1); it just works!

@chrisliu54 Upgrade to PyTorch 1.0!

Hi, I am facing a similar issue where training stops at model.eval() after executing the training phase, with no errors. I have not specified num_workers, so the default value is 0.

I am using torch==1.1.0. Any suggestions would be appreciated. Thanks.

Have you tried with version 1.0?

Sorry for the late response. There was no issue with torch as such, but with my pandas version. I confirm that this works fine with torch==1.1.0 as well. Thanks.

I have experienced this issue intermittently in version 1.1.0. Setting num_workers=0 seems to have solved it for me, though it’s obviously not ideal.