Pre-trained model: running out of GPU memory during inference

Hello everyone!
I am new to PyTorch and I am trying to solve an NLP problem (single-sentence binary classification) using PyTorch with a 12 GB GPU. I am trying to do transfer learning (fine-tuning) with the BERT model. My plan is to extract the output of the BERT model for each sentence (a feature vector representing the meaning of the sentence) and then train a 3-layer DNN on these features and the targets (a class of 0 or 1).

The problem is that I quickly run out of GPU memory when I try to extract features from BERT (12 GB for doing inference on 71 batches of 32 sentences each!). I suspect that autograd is unnecessarily storing tensors from intermediate layers of the network during forward propagation, even though this is inference only. I have already used torch.no_grad(). Here is the code (here, "model" is the BERT model, which outputs a 768-dimensional vector for each sentence, and each sentence is padded to a length of 128):

Would using a PyTorch DataLoader make any difference?


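import torch

train_text_features = []
val_text_features = []

model.eval()  # disable dropout for inference

print('Extract text features for training data')
# Step by BATCH_SIZE so that consecutive batches do not overlap
for i in range(0, len(train_sequences), BATCH_SIZE):
    train_batch = train_sequences[i : i + BATCH_SIZE].cuda()

    with torch.no_grad():
        # "model" is assumed to return one tensor of shape (batch, 768)
        batch_text_features = model(train_batch)

    # Keep the result on the CPU so features do not pile up in GPU memory
    train_text_features.append(batch_text_features.cpu())
    del train_batch, batch_text_features

    print("Batch No. : " + str(i // BATCH_SIZE))

print('Extract text features for validation data')
for i in range(0, len(val_sequences), BATCH_SIZE):
    val_batch = val_sequences[i : i + BATCH_SIZE].cuda()

    with torch.no_grad():
        batch_text_features = model(val_batch)

    val_text_features.append(batch_text_features.cpu())
    del val_batch, batch_text_features

    print("Batch No. : " + str(i // BATCH_SIZE))

# The features are floating point, so concatenate them instead of casting to LongTensor
train_text_features = torch.cat(train_text_features)
val_text_features = torch.cat(val_text_features)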
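
On the DataLoader question: a DataLoader by itself is not expected to reduce GPU memory use, but it does make non-overlapping batches easy to get right. Below is a minimal sketch, not the poster's actual setup. `TinyEncoder` is a hypothetical stand-in for BERT that returns one vector per sentence, and `extract_features` is an illustrative helper name. The point it shows is moving each batch's output to the CPU before the next batch runs.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset


class TinyEncoder(nn.Module):
    """Hypothetical stand-in for BERT: token ids -> one vector per sentence."""

    def __init__(self, vocab_size=100, hidden_size=16):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden_size)

    def forward(self, input_ids):
        # Mean-pool the token embeddings into a single sentence vector.
        return self.embed(input_ids).mean(dim=1)


def extract_features(model, sequences, batch_size=32):
    """Run the model batch by batch and collect the outputs on the CPU."""
    device = next(model.parameters()).device
    loader = DataLoader(TensorDataset(sequences), batch_size=batch_size)
    model.eval()  # disable dropout for inference
    features = []
    with torch.no_grad():  # no autograd graph is built inside this block
        for (batch,) in loader:
            out = model(batch.to(device))
            features.append(out.cpu())  # free the GPU copy as soon as possible
    return torch.cat(features)


device = "cuda" if torch.cuda.is_available() else "cpu"
model = TinyEncoder().to(device)
sequences = torch.randint(0, 100, (70, 128))  # 70 sentences padded to length 128
features = extract_features(model, sequences)
print(features.shape)
```

With a real BERT model, the only change to the loop would be selecting the right element of the model's output before calling `.cpu()`.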

Why not directly fine-tune the full BERT model with a classification layer on top? That way you wouldn’t have to store the features.

If you go that way, I would suggest dropping the 3-layer DNN and using just a linear layer: BERT has a huge capacity, so you likely don't need 3 layers on top.

And if you go that way… well, you can directly use the BertForSequenceClassification model from PyTorch BERT, since that's exactly what this model is :wink:
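
To make the suggestion concrete, here is a minimal sketch of "an encoder with a single linear layer on top, trained end to end". It is not the implementation of BertForSequenceClassification. `ToyEncoder` is a hypothetical stand-in for BERT, and the sizes and learning rate are arbitrary. It only illustrates that gradients flow into the encoder, so no features need to be stored beforehand.

```python
import torch
from torch import nn


class ToyEncoder(nn.Module):
    """Hypothetical stand-in for BERT: token ids -> one vector per sentence."""

    def __init__(self, vocab_size=100, hidden_size=16):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden_size)

    def forward(self, input_ids):
        return self.embed(input_ids).mean(dim=1)


class SentenceClassifier(nn.Module):
    """Encoder plus one linear layer; both are updated during training."""

    def __init__(self, encoder, hidden_size, num_labels=2):
        super().__init__()
        self.encoder = encoder
        self.classifier = nn.Linear(hidden_size, num_labels)

    def forward(self, input_ids):
        return self.classifier(self.encoder(input_ids))


torch.manual_seed(0)
encoder = ToyEncoder()
model = SentenceClassifier(encoder, hidden_size=16)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

input_ids = torch.randint(0, 100, (8, 128))  # 8 sentences padded to length 128
labels = torch.randint(0, 2, (8,))           # binary targets

# One training step: forward pass, loss, backward pass, parameter update.
logits = model(input_ids)
loss = nn.functional.cross_entropy(logits, labels)
optimizer.zero_grad()
loss.backward()
optimizer.step()
print(logits.shape, loss.item())
```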

That's exactly what I'm using (pytorch-pretrained-bert is just fabulous)! I just realized that I need to use only the last hidden state of the BertModel for single-sequence classification. So, I think I can replace:




Also, I am computing the BERT features first because the model was performing badly and I read that this can boost performance (it's just a workaround for some autograd problem in PyTorch).

If you are using the BertModel class of pytorch-bert, be sure to read in detail the documentation on its initialization and outputs, which is here. I am not sure you are using it correctly: the output is a tuple, and its first element is either a list with all the encoded layers or only the last layer, depending on the initialization of the class. Read the doc!
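
The difference between "all encoded layers" and "only the last layer" matters for memory. Here is a rough back-of-the-envelope calculation using the numbers from the original post (71 batches of 32 sentences, length 128, 768 dimensions). The 12 layers (BERT-base) and 4 bytes per value (float32) are assumptions, and the figures only cover outputs that are kept on the GPU, not the model weights or temporary activations.

```python
# Numbers taken from the original post
BATCH_SIZE = 32
SEQ_LEN = 128
HIDDEN_SIZE = 768
NUM_BATCHES = 71

# Assumptions, not stated in the thread
NUM_LAYERS = 12       # BERT-base
BYTES_PER_VALUE = 4   # float32

MIB = 2 ** 20

# One layer's hidden states for one batch: (32, 128, 768) float32 values
last_layer_bytes = BATCH_SIZE * SEQ_LEN * HIDDEN_SIZE * BYTES_PER_VALUE
# Every encoder layer's hidden states for one batch
all_layers_bytes = last_layer_bytes * NUM_LAYERS
# Every layer for every batch, if all of it stayed on the GPU
total_all_layers_bytes = all_layers_bytes * NUM_BATCHES
# Only the last layer for every batch
total_last_layer_bytes = last_layer_bytes * NUM_BATCHES

print("last layer, one batch :", last_layer_bytes / MIB, "MiB")
print("all layers, one batch :", all_layers_bytes / MIB, "MiB")
print("all layers, 71 batches:", total_all_layers_bytes / MIB, "MiB")
print("last layer, 71 batches:", total_last_layer_bytes / MIB, "MiB")
```

Under these assumptions, keeping every layer for all 71 batches would take about 10 GiB, most of a 12 GB card, while keeping only the last layer would take under 1 GiB.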

Thanks, Thomas! I checked the docs and did the fine-tuning directly with BertForSequenceClassification. The results are great and I am no longer running out of GPU memory. Thanks for looking into my problem.
