My server crashed after running this code?

Hi everyone, I am trying to fine-tune BERT with PyTorch, and I use torch.nn.DataParallel to train the model on 8 GPUs. After the evaluation I delete the model and call torch.cuda.empty_cache().

The strangest part is that while the script is running, my server is fine. But as soon as I click the “Interrupt the kernel” button, the server crashes. Could you tell me why this happens?

The reason I want to clear the GPU cache is that I want my results to be reproducible. In my notebook, I evaluate ‘bert-base-uncased’, ‘bert-base-cased’, ‘bert-large-uncased’, and ‘bert-large-cased’ in a for loop. But something weird happened. If I restart the kernel and run the for loop in the order ABCD, the results are always the same. But if I fine-tune only model B, its result is different from the result of model B trained inside the ABCD loop. I’m wondering whether the model reuses some parameters from the previous model, so I wanted to clear the GPU cache to check. But then the crash happened!!!

environment:
CentOS 7, PyTorch 1.1, CUDA 9.0, JupyterLab (latest)

some code snippets:

for name, tokenizer in tokenizers.items():
        #imagine we create datasets here
        
        # initialize model
        num_labels = 2 if task=='A' else 4
        if model_type == 'BERT':
            model = BertForSequenceClassification.from_pretrained(name, num_labels=num_labels)
           
        # Tell pytorch to run this model on the GPU. 
        if torch.cuda.device_count() > 1:
            # data parallelism
            model = torch.nn.DataParallel(model)
            print("Let's use", torch.cuda.device_count(), "GPUs!")
        model.to(device)
        
        #initialize optimizer
        if model_type == 'BERT':
            epochs = CONFIG['bert_epochs']
            optimizer = AdamW(model.parameters(),
                  lr = 2e-5, # args.learning_rate - default is 5e-5, our notebook had 2e-5
                  eps = 1e-8 # args.adam_epsilon  - default is 1e-8.
                )
            # Total number of training steps is number of batches * number of epochs.
            total_steps = len(train_dataloader) * epochs


            # Create the learning rate scheduler.
            scheduler = get_linear_schedule_with_warmup(optimizer, 
                                                        num_warmup_steps = 0, # Default value in run_glue.py
                                                        num_training_steps = total_steps)
            
        
        # training
        
        # Store the average loss after each epoch so we can plot them.
        loss_values = []

        model.zero_grad()

        # For each epoch...
        for epoch_i in range(0, epochs):

            # ========================================
            #               Training
            # ========================================

            # Perform one full pass over the training set.
            if verbose:
                print("")
                print('======== Epoch {:} / {:} ========'.format(epoch_i + 1, epochs))
                print('Training...')

            # Measure how long the training epoch takes.
            t0 = time()

            # Reset the total loss for this epoch.
            total_loss = 0

            # Set our model to training mode (as opposed to evaluation mode)
            model.train()

            # This training code is based on the `run_glue.py` script here:
            # https://github.com/huggingface/transformers/blob/5bfcd0485ece086ebcbed2d008813037968a9e58/examples/run_glue.py#L128

            # For each batch of training data...
            for step, batch in enumerate(train_dataloader):

                # Progress update every 40 batches.
                if step % 40 == 0 and not step == 0:
                    # Calculate elapsed time in minutes.
                    elapsed = format_time(time() - t0)

                    # Report progress.
                    if verbose:
                        print('  Batch {:>5,}  of  {:>5,}.    Elapsed: {:}.'.format(step, len(train_dataloader), elapsed))


                # Unpack this training batch from our dataloader. 
                #
                # As we unpack the batch, we'll also copy each tensor to the GPU using the 
                # `to` method.
                #
                # `batch` contains three pytorch tensors:
                #   [0]: input ids 
                #   [1]: attention masks
                #   [2]: labels 
                b_input_ids = batch[0].to(device)
                b_input_mask = batch[1].to(device)
                b_labels = batch[2].to(device)

                # Forward pass (evaluate the model on this training batch).
                # `model` is a BertForSequenceClassification, possibly wrapped in nn.DataParallel.
                outputs = model(b_input_ids, 
                            token_type_ids=None, 
                            attention_mask=b_input_mask, 
                            labels=b_labels)

                loss = outputs[0]

                # With nn.DataParallel, `loss` contains one value per GPU, so reduce it
                # with `.mean()`. `.item()` then returns the Python value from the tensor.
                total_loss += loss.mean().item()
#                 total_loss += loss.item()  # single-GPU version

                # Perform a backward pass to calculate the gradients.
                loss.mean().backward()
#                 loss.backward()  # single-GPU version

                # Clip the norm of the gradients to 1.0.
                if model_type == 'BERT':
                    torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)

                # Update parameters and take a step using the computed gradient
                optimizer.step()

                # Update the learning rate.
                if model_type == 'BERT':
                    scheduler.step()

                # Clear out the gradients (by default they accumulate)
                model.zero_grad()

            # Calculate the average loss over the training data.
            avg_train_loss = total_loss / len(train_dataloader)            

            loss_values.append(avg_train_loss)
            
            if verbose:
                print("")
                print("  Average training loss: {0:.2f}".format(avg_train_loss))
                print("  Training epcoh took: {:}".format(time() - t0))

            # ========================================
            #               Validation
            # ========================================
            # After the completion of each training epoch, measure our performance on
            # our validation set.

            if verbose:
                print("")
                print("Running Validation...")

            t0 = time()

            # Put model in evaluation mode to evaluate loss on the validation set
            model.eval()

            # Tracking variables 
            preds, labels = [], []

            # Evaluate data for one epoch
            for batch in validation_dataloader:

                # Add batch to GPU
                batch = tuple(t.to(device) for t in batch)

                # Unpack the inputs from our dataloader
                b_input_ids, b_input_mask, b_labels = batch

                # Telling the model not to compute or store gradients, saving memory and speeding up validation
                with torch.no_grad():        
                    # Forward pass, calculate logit predictions
                    # token_type_ids is for the segment ids, but we only have a single sentence here.
                    # See https://github.com/huggingface/transformers/blob/5bfcd0485ece086ebcbed2d008813037968a9e58/examples/run_glue.py#L258 
                    outputs = model(b_input_ids, token_type_ids=None, attention_mask=b_input_mask)

                logits = outputs[0]

                # Move logits and labels to CPU
                logits = logits.detach().cpu().numpy()
                label_ids = b_labels.to('cpu').numpy()

                preds.append(np.argmax(logits, axis=1))
                labels.append(label_ids)


            # Report the final accuracy for this validation run.
            preds = [item for sublist in preds for item in sublist]
            labels = [item for sublist in labels for item in sublist]
            # print(preds, labels)
            
            if verbose:
                print_score(preds, labels, task)
                print("  Validation took: {:}".format(format_time(time() - t0)))
        
        # Prediction on test set
        print('Predicting labels for {:,} test sentences...'.format(len(prediction_inputs)))

        # Put model in evaluation mode
        model.eval()

        t0 = time()

        # Tracking variables 
        predictions , true_labels = [], []

        # Predict 
        for batch in prediction_dataloader:
            # Add batch to GPU
            batch = tuple(t.to(device) for t in batch)

            # Unpack the inputs from our dataloader
            b_input_ids, b_input_mask, b_labels = batch

            # Telling the model not to compute or store gradients, saving memory and 
            # speeding up prediction
            with torch.no_grad():
                # Forward pass, calculate logit predictions
                outputs = model(b_input_ids, token_type_ids=None,
                                attention_mask=b_input_mask)

            logits = outputs[0]

            # Move logits and labels to CPU
            logits = logits.detach().cpu().numpy()
            label_ids = b_labels.to('cpu').numpy()

            # Store predictions and true labels
            predictions.append(np.argmax(logits, axis=1))
            true_labels.append(label_ids)

        predictions = [item for sublist in predictions for item in sublist]
        true_labels = [item for sublist in true_labels for item in sublist]
        print("The {} model's result for task {} is:".format(name, task))
        print_score(predictions, true_labels, task)
        if verbose:
            print("  Prediction took: {:}".format(format_time(time() - t0)))
            print('    DONE.')
            
        del model
        torch.cuda.empty_cache()

Everything is fine until I add:

del model
torch.cuda.empty_cache()

Please help me!!!


Does your server restart or does your Python kernel just crash?
In the latter case, could you try to get the stack trace via:

 gdb --args python my_script.py
...
Reading symbols from python...done.
(gdb) run
...
(gdb) backtrace
...
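As an aside (just a suggestion, assuming the Python kernel itself is segfaulting rather than the whole machine hanging): the built-in faulthandler module can also dump a traceback on a fatal signal, which can be easier to use from a notebook than gdb.

import faulthandler

# Dump the Python tracebacks of all threads if the interpreter receives a fatal
# signal (SIGSEGV, SIGFPE, SIGABRT, SIGBUS, SIGILL).
# Run this once at the top of the notebook.
faulthandler.enable()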

I’m using Jupyter. After the crash, I cannot SSH into the server; I have to power-cycle it to restart. It seems like a serious problem.

In my opinion, the crash may be related to one of three things:

  1. interrupt during the training
  2. data parallel
  3. delete model and empty cache

But anyway, the crash is not my main concern. I just want to make sure my results are reproducible. Could you give me some advice? I will explain the problem briefly.

I want to test 4 BERT models. Let’s call them A, B, C, and D. (I restart the kernel after each training run.)

If I train A only, the result is always the same.
If I train A, B, C, and D in a for loop, the results are also always the same.
Here is the problem: if I train B only, its result is different from the result of B trained in the for loop.

In my opinion, the result of training B only is the correct one, since it is not affected by anything else. I am confused about how A can affect B in the training loop. As you can see from the code, I initialize the model and the optimizer in every iteration of the loop.

Also, I ran into the same problem as described in accuracy-difference-on-multi-gpu-with-nn-dataparallel.

For the same model, the result of training on one GPU is 0.582, while the result of training on 8 GPUs is 0.616. The difference is too big… Both results are reproducible; I don’t know which one is “correct”.

Just to make sure I understand the issue correctly:

  • if you train 4 models sequentially as A, B, C, D, you get exactly the same results for different runs
  • If you train model A only, you will get exactly the same results for different runs
  • if you train model B on its own, its result will differ from the result of B in the other runs (e.g. from the A, B, C, D run)

Is that correct?

If so, the difference might come from a different sequence of calls to the pseudo-random number generator.
If you seed the code, the PRNG will generate the same “random” values for the same sequence of calls with the same arguments.
However, if the definition of model A differs from that of model B (e.g. a different number of hidden units in a linear layer), you will not get the same results.

Thank you for the reply. Your understanding is correct. Do you mean it is because of the pseudo-randomness?

But I added the code below before fine-tuning to make sure randomness does not affect the result.

import random
import numpy as np
import torch

# Set the seed value all over the place to make this reproducible.
def setup_seed(seed):
    torch.manual_seed(seed)
    torch.cuda.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)  # if you are using multi-GPU
    np.random.seed(seed)  # NumPy module
    random.seed(seed)  # Python random module
    torch.backends.cudnn.benchmark = False
    torch.backends.cudnn.deterministic = True

setup_seed(42)

The order of calls still matters, as shown here:

torch.manual_seed(2809)
# modelA weight matrix
weightA = torch.randn(2, 2)
# modelB weight matrix
weightB = torch.randn(3, 3)
print(weightA)
print(weightB)

# will yield exactly the same results
torch.manual_seed(2809)
# modelA weight matrix
weightA = torch.randn(2, 2)
# modelB weight matrix
weightB = torch.randn(3, 3)
print(weightA)
print(weightB)

# this will not!
torch.manual_seed(2809)
# modelB weight matrix
weightB = torch.randn(3, 3)
# modelA weight matrix
weightA = torch.randn(2, 2)
print(weightA)
print(weightB)

Have a look at the Wikipedia article on PRNG for more information.
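As an extra illustration (not from the replies above): if you re-seed immediately before creating each weight, e.g. with a per-model seed, the values no longer depend on the order in which the models are built.

import torch

# Re-seed right before each tensor is drawn, so the result does not depend on
# the order of the calls (the per-model seeds 2809 and 2810 are arbitrary).
torch.manual_seed(2809)
weightA = torch.randn(2, 2)

torch.manual_seed(2810)
weightB = torch.randn(3, 3)

# Creating weightB first and weightA second now yields exactly the same tensors.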

Wow, that’s really surprising!

So are both results correct?

Is there any way to keep them the same even if the order of calls is different?

Yes, the sampled numbers are “correct”, but as you can see they are not fixed.

I would recommend storing the randomly initialized state_dicts once and just reloading them into the appropriate model during your experiments to get reproducible results.
This would at least reuse the same initial parameters.
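A minimal sketch of what this could look like for the snippet above (the helper name and the init_states/ directory are just placeholders, and it assumes the same BertForSequenceClassification class used earlier):

import os
import torch
from transformers import BertForSequenceClassification  # as used in the snippet above

def get_model(name, num_labels, ckpt_dir='init_states'):
    """Return the model with a fixed initialization, independent of run order."""
    os.makedirs(ckpt_dir, exist_ok=True)
    ckpt = os.path.join(ckpt_dir, name + '_init.pt')
    model = BertForSequenceClassification.from_pretrained(name, num_labels=num_labels)
    if os.path.exists(ckpt):
        # Reuse the stored initialization (pretrained encoder + the same random head).
        model.load_state_dict(torch.load(ckpt))
    else:
        # First run: store the freshly initialized state_dict for later reuse.
        torch.save(model.state_dict(), ckpt)
    return model

In the training loop this would replace the direct from_pretrained call, before the model is wrapped in nn.DataParallel and moved to the device, so the state_dict keys match.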

Note, however, that all other calls to the random number generators might still yield slightly different results, e.g. if you have more dropout layers in modelB than in modelA.


That would be a solution. Thank you very much!