Hello everyone!
I am new to PyTorch and I am trying to solve an NLP problem (single-sentence binary classification) using PyTorch with a 12 GB GPU. I am trying to do transfer learning (fine-tuning) with the BERT model. My plan is to extract the output of the BERT model for each sentence (a feature vector representing the meaning of the sentence) and then train a 3-layer DNN on these features and the targets (a class of 0 or 1).
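For context, the 3-layer classifier I have in mind would look roughly like this (the class name and hidden sizes are placeholders, not final choices; the input size of 768 matches the BERT sentence features):

```python
import torch
import torch.nn as nn

# Rough sketch of a 3-layer DNN trained on 768-dimensional BERT
# sentence features (hidden size 256 is just a placeholder).
class SentenceClassifier(nn.Module):
    def __init__(self, in_dim=768, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),  # single logit for binary classification
        )

    def forward(self, x):
        return self.net(x).squeeze(-1)  # (batch,) of logits

# Dummy batch of 4 sentence features, just to show the shapes
logits = SentenceClassifier()(torch.randn(4, 768))
print(logits.shape)  # torch.Size([4])
```

The single logit per sentence would go into `BCEWithLogitsLoss` for the 0/1 targets.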
The problem is: I quickly run out of GPU memory when I try to extract features from BERT (12 GB exhausted doing inference on 71 batches of 32 sentences each!). I suspect that autograd is unnecessarily storing tensors from intermediate layers during the forward pass, even though I am only doing inference. I have already used torch.no_grad(). Here is the code (here, `model` is the BERT model, which outputs a 768-dimensional vector for each sentence, and each sentence is padded to a length of 128):
Will using a PyTorch DataLoader make any difference?
```python
import gc
import os

import numpy as np
from torch import no_grad, LongTensor

BATCH_SIZE = 32
train_text_features = []
val_text_features = []

print('Extract text features for training data')
for i in range(len(train_sequences) - BATCH_SIZE):
    train_batch = train_sequences[i : i + BATCH_SIZE].cuda()
    with no_grad():
        batch_text_features = model(train_batch)
    train_text_features.extend(list(batch_text_features))
    del batch_text_features
    gc.collect()
    print("Batch No. : " + str(i))
    os.system('clear')

print('Extract text features for validation data')
for i in range(len(val_sequences) - BATCH_SIZE):
    val_batch = val_sequences[i : i + BATCH_SIZE].cuda()
    with no_grad():
        batch_text_features = model(val_batch)
    val_text_features.extend(list(batch_text_features))
    del batch_text_features
    gc.collect()
    print("Batch No. : " + str(i))
    os.system('clear')

train_text_features = LongTensor(np.array(train_text_features))
val_text_features = LongTensor(np.array(val_text_features))
```
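To make my DataLoader question concrete, this is a sketch of what I imagine the batched feature extraction would look like (assuming `train_sequences` is a tensor of padded token ids; here it is dummy data, and a random tensor stands in for the `model(batch)` call so the snippet runs on its own):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Dummy stand-in for my padded token-id sequences: 100 sentences, length 128
train_sequences = torch.randint(0, 30000, (100, 128))

# Let a DataLoader handle the batching instead of manual slicing
loader = DataLoader(TensorDataset(train_sequences), batch_size=32)

features = []
with torch.no_grad():
    for (batch,) in loader:
        # batch = batch.cuda()  # move the batch to the GPU when available
        out = torch.randn(batch.size(0), 768)  # stand-in for model(batch)
        features.append(out.cpu())  # keep the results off the GPU
features = torch.cat(features)
print(features.shape)  # torch.Size([100, 768])
```

I am not sure whether this is the idiomatic way to do it, or whether the DataLoader itself changes anything about GPU memory use.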