Really cool to see MPS support in PyTorch! I have been testing it and it all works great until I try to fine-tune a BERT model from HF. I have a simple training loop that looks like:
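import torch
from tqdm import tqdm
from transformers import BertForSequenceClassification

device = torch.device('mps')  # assumed; the device setup was not shown in the post

# from_pretrained loads the checkpoint; the bare constructor expects a config object
model = BertForSequenceClassification.from_pretrained('bert-base-uncased')
model.train()
optim = torch.optim.Adam(model.parameters(), lr=5e-5)
model.to(device)

loop = tqdm(loader, leave=True)  # loader: the DataLoader, defined elsewhere
for batch in loop:
    # move each tensor in the batch to the MPS device
    batch_mps = {
        'input_ids': batch['input_ids'].to(device),
        'attention_mask': batch['attention_mask'].to(device),
        'labels': batch['labels'].to(device)
    }
    optim.zero_grad()
    outputs = model(**batch_mps)
    loss = outputs[0]  # the loss comes first when labels are passed
    loss.backward()
    optim.step()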
As soon as it hits the loss.backward() step, the kernel dies. I tried reducing the batch_size (although I figured that with unified memory it could handle the same batch size as the CPU?), but no luck.
It works as expected on CPU, am I missing something?
Thanks! The problem was simply that the batches, despite being incredibly small, are still too large for my first-gen M1. It starts to cope at a batch_size of 1. Thanks for your help!
But the problem with shared memory is that any CPU allocation might start failing as well. We don’t always control those allocations, and a failure there can lead to a hard crash.
It seems that you do use heap-backed memory, something I had considered myself to allow for zero-cost allocation: pytorch/MPSAllocator.h at 09be44de7b56495bcb5ad1d47376200cbb853097 · pytorch/pytorch · GitHub. Could you go into a little more detail about how you came up with that allocation heuristic? Is there any parallel in the CUDA backend? Did you try an exponentially increasing scheme such as the one described in Sharing ideas about our work. · Issue #1 · AnarchoSystems/DeepSwift · GitHub? It seems that you tried to make each heap the same size as the buffer. That would make sense if you repeatedly recreate tensors of the same size, so that heaps from previous allocations are left around to reuse. But how do you know when you have enough pre-existing heaps and should start deleting previously allocated ones before you run out of memory?
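To make the question about reuse and eviction concrete, here is a minimal Python sketch of a caching pool. It is not PyTorch's actual heuristic: the class name CachingPool, the budget_bytes parameter, and the "evict largest cached block first" policy are all assumptions made for illustration, and blocks are modeled as plain integers (their sizes) rather than real Metal heaps.

```python
from bisect import bisect_left, insort


class CachingPool:
    """Toy caching allocator: freed blocks are kept for reuse until a budget is hit."""

    def __init__(self, budget_bytes):
        self.budget_bytes = budget_bytes  # hypothetical cap on live + cached bytes
        self.live_bytes = 0               # bytes currently handed out to callers
        self.cached = []                  # sorted sizes of freed, reusable blocks

    def cached_bytes(self):
        return sum(self.cached)

    def _over_budget(self, size):
        return self.live_bytes + self.cached_bytes() + size > self.budget_bytes

    def malloc(self, size):
        # Best fit: reuse the smallest cached block that is large enough.
        i = bisect_left(self.cached, size)
        if i < len(self.cached):
            block = self.cached.pop(i)
            self.live_bytes += block
            return block
        # Nothing reusable: evict cached blocks (largest first) until the
        # new allocation fits within the budget.
        while self.cached and self._over_budget(size):
            self.cached.pop()
        if self._over_budget(size):
            raise MemoryError(f"cannot allocate {size} bytes within budget")
        self.live_bytes += size
        return size

    def free(self, block):
        # Freed blocks go back to the cache instead of being released.
        self.live_bytes -= block
        insort(self.cached, block)
```

In this sketch, "enough heaps" is decided lazily: cached blocks are only thrown away when a new request cannot be served from the cache and would otherwise push the pool past its budget.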
I haven’t tested my theory yet. If I could reuse your algorithm and the time you spent investigating this performance problem, that would be a big help for my personal ML project. I would make a documentation comment giving credit to PyTorch for coming up with the idea first.
Note that this is a little bit copy-pasted right now for the MPS side and it will be refactored for the two to be closer once the MPS version is stable.
I’ve fully translated the PyTorch MPS allocator from Objective-C to Swift, and it’s working quite well for me. I have many other unrelated optimizations, which remove intermediate tensors and reduce the need for zero-cost allocation. That makes the heap allocator quite overkill for my framework. (1) You’re barely even allocating memory and (2) when you are, it’s instantaneous.
I have intensely stress-tested this, and it works wonderfully. Great job! I would suggest a few cleanups, like removing the code for “splits” inside a heap. That code is unused and seems copied from the CUDA allocator, which must have manual heap placement. I removed it from my translation. However, it should be useful in my OpenCL backend for S4TF, which requires manual placement of sub-allocations within a larger “heap”. I’ll have to look at your CUDA allocator when I get around to my OpenCL backend.
Another optimization kicks in when you exceed system memory limits. Unlike PyTorch, I wait to encode/submit operations until the GPU is almost starved of work. That means I could have ~100 operations queued up. If each op allocates a sizable chunk of memory, the total could grow astronomically. To solve this problem, I flush the operation queue when memory allocation exceeds system RAM, waiting until more memory can be released. This goes one step further than your allocator’s automatic purging of cached BufferBlocks.
When you exceed system RAM size and flushing the operation queue doesn’t reduce total allocated memory, this switches to a mode labeled permitExceedingSystemRAM. The framework knows that you’re allocating absurd amounts of memory, and stops flushing the operation queue after every tensor materialization. In other words, the optimization permits a reasonable amount of CPU-side encoding performance in such a situation.
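The flush-then-give-up behavior described above can be sketched in a few lines of Python. This is a toy model, not the poster's Swift framework: the DeferredQueue class, its fields, and the idea that flushing releases each queued op's temporary bytes are assumptions made for illustration. Only the name of the fallback mode comes from the post.

```python
class DeferredQueue:
    """Toy model of deferred GPU encoding with a memory-pressure flush."""

    def __init__(self, system_ram):
        self.system_ram = system_ram
        self.allocated = 0       # bytes currently held by tensors
        self.pending = []        # queued ops as (name, releasable_bytes) pairs
        self.flush_count = 0
        self.permit_exceeding_system_ram = False

    def flush(self):
        # Stand-in for "submit everything and wait for the GPU": once the
        # queued ops complete, their temporary buffers can be released.
        self.allocated -= sum(nbytes for _, nbytes in self.pending)
        self.pending.clear()
        self.flush_count += 1

    def enqueue(self, name, nbytes, temporary=True):
        self.allocated += nbytes
        # Persistent allocations release nothing when the queue is flushed.
        self.pending.append((name, nbytes if temporary else 0))
        if self.allocated <= self.system_ram:
            return
        if self.permit_exceeding_system_ram:
            return  # already known to be over the limit; keep encoding
        self.flush()
        if self.allocated > self.system_ram:
            # Flushing did not help, so stop flushing on every allocation.
            self.permit_exceeding_system_ram = True
```

The design point is that a flush is only worth its cost while it can actually free memory. Once a flush fails to bring usage back under system RAM, repeating it on every allocation would only stall CPU-side encoding.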
This would be a good optimization for PyTorch when someone’s trying to test the limits of memory allocation. You would need a way to halt execution until all submitted GPU commands are completed. Furthermore, I learned something interesting while doing this. On unified memory architectures (M1, maybe Intel), GPU memory pages to disk when it runs out of space. I tested this by allocating and writing to 64 GB of Metal memory (my Mac has 32 GB of RAM), and there were no runtime errors*. If people could use this virtual memory paging on M1 GPUs with PyTorch, it would allow hundreds of gigabytes of memory. Many consumer CUDA cards are limited to around 10 GB these days, but my Apple GPU has as much memory as my SSD!
*When you set the buffer’s storage mode to .storageModePrivate, there are runtime errors. Metal prevents you from exceeding system RAM. This means discrete GPUs can’t page their memory to the disk.
Also, in MPSHeapAllocatorImpl.Malloc, why does it check whether a buffer is < the device’s max buffer size, but not <=? That would mean if the largest possible buffer size was 16 GB, PyTorch users could only allocate 15.9999999 GB. In my prototype backend, someone can allocate all 16 GB if they want.
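The boundary case can be shown with a small Python sketch. The function names and the 16 GiB limit are hypothetical and only mirror the example in the question; this is not the allocator's actual code.

```python
GIB = 1024 ** 3
MAX_BUFFER_LENGTH = 16 * GIB  # hypothetical device limit, as in the example


def fits_strict(size, max_buffer_length=MAX_BUFFER_LENGTH):
    # Strict comparison: a request of exactly the maximum size is rejected.
    return size < max_buffer_length


def fits_inclusive(size, max_buffer_length=MAX_BUFFER_LENGTH):
    # Inclusive comparison: a request of exactly the maximum size is allowed.
    return size <= max_buffer_length


for size in (16 * GIB - 1, 16 * GIB, 16 * GIB + 1):
    print(size, fits_strict(size), fits_inclusive(size))
```

The two checks disagree only at exactly the maximum size, which is the single allocation the strict version rules out.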