RuntimeError: Expected a 'mps:0' generator device but found 'cpu'

Hi all,

I am new to LLM programming in Python and I am trying to fine-tune the instructlab/merlinite-7b-lab model (see Hugging Face) on my Mac M1. My goal is to teach the model about Xenobi Amilen, a music composer I have invented.

Using the new ilab CLI from Red Hat, I created a training set for the model: a JSONL file with 100 question/answer pairs about the invented composer.

I wrote a Python script to train the model. I tested all the parts related to the tokenizer and datasets, and they seem to work. However, the final training call fails with this error:

Traceback (most recent call last):
  File "/Users/sasadangelo/github.com/sasadangelo/llm-train/main.py", line 99, in <module>
    trainer.train()
  File "/Users/sasadangelo/github.com/sasadangelo/llm-train/venv/lib/python3.12/site-packages/transformers/trainer.py", line 1932, in train
    return inner_training_loop(
           ^^^^^^^^^^^^^^^^^^^^
  File "/Users/sasadangelo/github.com/sasadangelo/llm-train/venv/lib/python3.12/site-packages/transformers/trainer.py", line 2230, in _inner_training_loop
    for step, inputs in enumerate(epoch_iterator):
  File "/Users/sasadangelo/github.com/sasadangelo/llm-train/venv/lib/python3.12/site-packages/accelerate/data_loader.py", line 454, in __iter__
    current_batch = next(dataloader_iter)
                    ^^^^^^^^^^^^^^^^^^^^^
  File "/Users/sasadangelo/github.com/sasadangelo/llm-train/venv/lib/python3.12/site-packages/torch/utils/data/dataloader.py", line 701, in __next__
    data = self._next_data()
           ^^^^^^^^^^^^^^^^^
  File "/Users/sasadangelo/github.com/sasadangelo/llm-train/venv/lib/python3.12/site-packages/torch/utils/data/dataloader.py", line 756, in _next_data
    index = self._next_index()  # may raise StopIteration
            ^^^^^^^^^^^^^^^^^^
  File "/Users/sasadangelo/github.com/sasadangelo/llm-train/venv/lib/python3.12/site-packages/torch/utils/data/dataloader.py", line 691, in _next_index
    return next(self._sampler_iter)  # may raise StopIteration
           ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/sasadangelo/github.com/sasadangelo/llm-train/venv/lib/python3.12/site-packages/torch/utils/data/sampler.py", line 347, in __iter__
    for idx in self.sampler:
  File "/Users/sasadangelo/github.com/sasadangelo/llm-train/venv/lib/python3.12/site-packages/accelerate/data_loader.py", line 92, in __iter__
    yield from super().__iter__()
  File "/Users/sasadangelo/github.com/sasadangelo/llm-train/venv/lib/python3.12/site-packages/torch/utils/data/sampler.py", line 197, in __iter__
    yield from torch.randperm(n, generator=generator).tolist()
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/sasadangelo/github.com/sasadangelo/llm-train/venv/lib/python3.12/site-packages/torch/utils/_device.py", line 79, in __torch_function__
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
RuntimeError: Expected a 'mps:0' generator device but found 'cpu'

However, even after forcing the PyTorch code in torch/utils/data/sampler.py to create the Generator on the mps device (I patched the PyTorch code locally), I got this error:
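For reference, my local patch was roughly the following (a minimal sketch, not the exact sampler.py diff; the device fallback to CPU is just so the snippet runs anywhere):

```python
import torch

# Pick the accelerator device if available; fall back to CPU otherwise.
device = "mps" if torch.backends.mps.is_available() else "cpu"

# Create the generator on that device instead of the CPU default,
# which is what RandomSampler.__iter__ ends up using.
generator = torch.Generator(device=device)
generator.manual_seed(42)

# randperm must also be told to produce its output on the same device
# as the generator, otherwise the device mismatch error reappears.
indices = torch.randperm(8, generator=generator, device=device).tolist()
```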

RuntimeError: Placeholder storage has not been allocated on MPS device!
  0%|          | 0/75 [00:00<?, ?it/s]                                                                                                                                        

I found many articles about this error on Google and Stack Overflow. This second problem seems related to the model and input data not being sent to the mps device, but I verified that both the model and the input batches are on mps.
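This is roughly how I check the devices before calling trainer.train() (a sketch with hypothetical stand-ins: `model` and `batch` here are a tiny dummy model and batch on CPU so the snippet is self-contained, not my real merlinite model):

```python
import torch

def report_devices(model, batch):
    # Collect the distinct device types holding the model parameters
    # and the tensors inside one input batch.
    param_devices = {p.device.type for p in model.parameters()}
    batch_devices = {v.device.type for v in batch.values() if torch.is_tensor(v)}
    return param_devices, batch_devices

# Stand-ins for illustration; in my script both report {'mps'}.
model = torch.nn.Linear(4, 2).to("cpu")
batch = {"input_ids": torch.zeros(1, 4)}
print(report_devices(model, batch))
```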

I don’t know how to fix these issues. I tried both the latest PyTorch stable release and today’s nightly build.

Can anyone help?

Hi all again,
Can anyone help with this problem?