SentenceBERT CUDA out of memory problems

Hello, I’m running into CUDA out-of-memory problems while trying to fine-tune Siamese BERT on the Quora question dataset. I am using the SentenceTransformers library (https://github.com/UKPLab/sentence-transformers). I launched a VM on GCP with 4 GPUs (NVIDIA Tesla P100) and tried to fine-tune the model with the following code:

train_batch_size = 64
model_name = 'bert-base-nli-mean-tokens'
num_epochs = 4
model_save_path = 'output/training_stsbenchmark_continue_training-'+model_name+'-'+datetime.now().strftime("%Y-%m-%d_%H-%M-%S")

warmup_steps = math.ceil(len(train_data)*num_epochs/train_batch_size*0.1)  # 10% of the total training steps for warm-up
logging.debug("Warmup-steps: {}".format(warmup_steps))


# Training the model
model.fit(train_objectives=[(train_dataloader, train_loss)],
          evaluator=evaluator,
          epochs=num_epochs,
          evaluation_steps=1000,
          warmup_steps=warmup_steps,
          output_path=model_save_path)
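
For reference, the rest of the setup (imports, model, train_data, train_dataloader, train_loss) looks roughly like this; quora_pairs below is just a stand-in for my loaded question pairs, not the exact loading code:

import math
import logging
from datetime import datetime
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, SentencesDataset, losses
from sentence_transformers.readers import InputExample

model = SentenceTransformer(model_name)

# quora_pairs stands in for the loaded (question1, question2, is_duplicate) triples
train_examples = [InputExample(guid='', texts=[q1, q2], label=float(is_dup))
                  for q1, q2, is_dup in quora_pairs]
train_data = SentencesDataset(train_examples, model)
train_dataloader = DataLoader(train_data, shuffle=True, batch_size=train_batch_size)
train_loss = losses.CosineSimilarityLoss(model=model)
# evaluator is built from a held-out dev set (omitted here)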

After 342 iterations I got the following error:

/opt/anaconda3/lib/python3.7/site-packages/transformers/modeling_bert.py in forward(self, hidden_states)
    326     def forward(self, hidden_states):
    327         hidden_states = self.dense(hidden_states)
--> 328         hidden_states = self.intermediate_act_fn(hidden_states)
    329         return hidden_states
    330 

/opt/anaconda3/lib/python3.7/site-packages/transformers/modeling_bert.py in gelu(x)
    131         Also see https://arxiv.org/abs/1606.08415
    132     """
--> 133     return x * 0.5 * (1.0 + torch.erf(x / math.sqrt(2.0)))
    134 
    135 

RuntimeError: CUDA out of memory. Tried to allocate 98.00 MiB (GPU 0; 15.90 GiB total capacity; 13.44 GiB already allocated; 13.88 MiB free; 1.80 GiB cached)

I tried using more GPUs, but it always failed, and I started to wonder whether there might be a problem with suboptimal memory allocation in the SentenceTransformers library. On the other hand, the BERT model is pretty large, so given that this is a Siamese model it may simply be too big, and I might need even more GPUs or distributed training. I’d be really grateful for any suggestions :slight_smile:

I’m not familiar with the mentioned repository, but from just skimming through the code it seems that multiple GPUs won’t be used.
The fit() function points to this line of code, which will only use the default device.

You could try lowering the batch size and see if the model still converges as you wish.
Also, you could try torch.utils.checkpoint to trade compute for memory, but I’m not sure how easy it would be to implement in this particular repository, as some higher-level abstractions seem to be used.
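
As a rough illustration (a toy module, not the repository’s code), torch.utils.checkpoint drops the activations inside each checkpointed block during the forward pass and recomputes them during backward:

import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

class CheckpointedMLP(nn.Module):
    def __init__(self, dim=768):
        super().__init__()
        self.block1 = nn.Sequential(nn.Linear(dim, dim), nn.ReLU())
        self.block2 = nn.Sequential(nn.Linear(dim, dim), nn.ReLU())

    def forward(self, x):
        # Each block's intermediate activations are recomputed during backward
        x = checkpoint(self.block1, x)
        x = checkpoint(self.block2, x)
        return x

model = CheckpointedMLP()
x = torch.randn(8, 768, requires_grad=True)  # at least one input must require grad
model(x).sum().backward()

To make a real difference for BERT you would have to apply this inside the transformer itself, e.g. per encoder layer.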

Thanks, I’m not really familiar with using multiple GPUs; I thought that this syntax splits the work across every available GPU by default. So if I’d like to use all of them, should I move the model to every device explicitly, or use model = nn.DataParallel(model)? As for the repository, I think I will have to fork it and change just the functions/classes I need, and maybe work on it in more depth later on.
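
For context, the generic pattern I had in mind (just a sketch, not the repository’s code) looks like this:

import torch
import torch.nn as nn

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = nn.Linear(768, 2)           # stand-in for the real model
if torch.cuda.device_count() > 1:
    model = nn.DataParallel(model)  # replicas are created on every visible GPU at forward time
model.to(device)

x = torch.randn(64, 768, device=device)
out = model(x)                      # the batch of 64 is scattered across the GPUs, outputs gathered on GPU 0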

I tried using nn.DataParallel, but I ran into the same problem. The GPU diagnostics look like this:


Sun Jan 26 12:16:30 2020       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.87.01    Driver Version: 418.87.01    CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla P100-PCIE...  Off  | 00000000:00:04.0 Off |                    0 |
| N/A   49C    P0    36W / 250W |  16233MiB / 16280MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla P100-PCIE...  Off  | 00000000:00:05.0 Off |                    0 |
| N/A   40C    P0    28W / 250W |     10MiB / 16280MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  Tesla P100-PCIE...  Off  | 00000000:00:06.0 Off |                    0 |
| N/A   40C    P0    28W / 250W |     10MiB / 16280MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  Tesla P100-PCIE...  Off  | 00000000:00:07.0 Off |                    0 |
| N/A   42C    P0    25W / 250W |     10MiB / 16280MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0      7706      C   /opt/anaconda3/bin/python                  16223MiB |
+-----------------------------------------------------------------------------+

after I used the following code in the fit method of the SentenceTransformer class:

        if device is None:
            device = "cuda" if torch.cuda.is_available() else "cpu"
            logging.info("Use pytorch device: {}".format(device))
        self.device = torch.device(device)
        self = nn.DataParallel(self)
        self.to(device)

It looks like everything is still placed only on the first GPU.

EDIT: I added nn.DataParallel in the appropriate block of code (wrapping every loss_model in loss_models, which is what the training loop actually calls, with the corresponding changes so that they still work), and now the models are split across every GPU :slight_smile:

Hi, Ra-V. Can you show more detail about the loss_model changes?

Sure, I think everything is in this commit on the data_parallel branch of my fork of the original repository:

Essentially, I changed these lines of code:

92        self = nn.DataParallel(self)
93        self.to(device)
...
294      device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
295      for loss_model in loss_models:
296          loss_model.to(device)

to:

300  device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
301  loss_models = list(map(lambda x: nn.DataParallel(x), loss_models))
302  for loss_model in loss_models:
303      loss_model.to(device)

I also changed the CosineSimilarityLoss, but that was mainly because of a different training objective I wanted to use, so it’s not necessary for multiple GPUs. Let me know if you need help; it’s a bit messy, but it works :slight_smile:
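
One thing to watch out for after wrapping the loss models (this is general nn.DataParallel behaviour, shown here on a toy loss rather than the actual CosineSimilarityLoss): if each replica returns a scalar loss, DataParallel gathers one value per GPU, so the result may need a .mean() before backward():

import torch
import torch.nn as nn

class ToyLoss(nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(768, 1)

    def forward(self, features, labels):
        preds = self.linear(features).squeeze(-1)
        return ((preds - labels) ** 2).mean()   # scalar loss per replica

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
loss_model = ToyLoss()
if torch.cuda.device_count() > 1:
    loss_model = nn.DataParallel(loss_model)
loss_model.to(device)

features = torch.randn(64, 768, device=device)
labels = torch.randn(64, device=device)
loss_value = loss_model(features, labels)
if loss_value.dim() > 0:            # one scalar per GPU when DataParallel is used
    loss_value = loss_value.mean()
loss_value.backward()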


Hi, Ra-V. I have read your commit, but I only see the change in:

self.device = torch.device(device)
self = nn.DataParallel(self)
self.to(device)

I don’t see the change of device = torch.device("cuda" if torch.cuda.is_available() else "cpu").
Should I add this myself?

Thank you.

Hello, I’m not sure you’ve read the same commit; there is a line I pasted above. Here is a screenshot of it:

There is a line 300 with the code for setting up the device:

 device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

If this is what you’re asking about :slight_smile:

Hello, could you link to or paste the whole code of your 'training_nli_bert.py' script? There is a problem with the shapes of the tensors you pass (the outputs of your model have different dimensions than your targets).

Thanks, sir. The problem has now been solved.

Great, good luck on the project!

Excuse me, what if I have a single GPU? Is there any solution for that?