I have been trying to run my transformer codebase on a single GPU.
But I hit a wall when the code tries to run the matrix multiplication inside self-attention:
```python
def forward(self, hidden_states, attention_mask):
    mixed_query_layer = self.query(hidden_states)
    mixed_key_layer = self.key(hidden_states)
    mixed_value_layer = self.value(hidden_states)

    query_layer = self.transpose_for_scores(mixed_query_layer)
    key_layer = self.transpose_for_scores(mixed_key_layer)
    value_layer = self.transpose_for_scores(mixed_value_layer)

    print(query_layer.shape, key_layer.transpose(-1, -2).shape)
    # torch.Size([1, 8, 10381, 16]) torch.Size([1, 8, 16, 10381])

    attention_scores = torch.matmul(query_layer, key_layer.transpose(-1, -2))
```
It always bombed out on the last operation (`torch.matmul`) with the following error:
```
RuntimeError: CUDA out of memory. Tried to allocate 8.38 GiB (GPU 0; 15.78 GiB total capacity; 8.75 GiB already allocated; 5.47 GiB free; 8.76 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
```
I'm surprised that the matmul takes so much memory for what looks like such a small matrix multiplication.
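For what it's worth, a back-of-envelope estimate (using the shapes from my `print()` above, and assuming fp32 at 4 bytes per element) suggests the output of that matmul alone is not small at all, since the attention scores are `seq_len × seq_len` per head:

```python
# Estimate the size of the attention-score tensor produced by
# torch.matmul(query_layer, key_layer.transpose(-1, -2)).
# Shapes taken from the print() output: [1, 8, 10381, 16] x [1, 8, 16, 10381]
# -> scores of shape [1, 8, 10381, 10381].
batch, heads, seq_len = 1, 8, 10381

score_elements = batch * heads * seq_len * seq_len
score_bytes = score_elements * 4  # fp32, 4 bytes per element

print(f"attention scores: {score_bytes / 2**30:.2f} GiB")  # -> 3.21 GiB
```

That is already ~3.2 GiB for a single forward-pass tensor, before counting the softmax output, autograd's saved intermediates, or allocator overhead, so an 8+ GiB allocation attempt at that point doesn't seem implausible.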
Has anyone experienced the same thing? Any workarounds?
Appreciate any help.