Training doesn't converge when running on M1 pro GPU (MPS device)

Hi,

I’m trying to train a network model on a MacBook M1 Pro GPU using the MPS device, but for some reason the training doesn’t converge, and the final training loss is 10x higher on MPS than when training on the CPU.

Does anyone have any idea what could cause this?

def train():
    """Train the ES futures price model for a fixed number of epochs on the MPS device.

    Builds the model, dataset, loader, loss, and optimizer, then runs EPOCHS
    training passes, printing a running loss every 100 batches and the final
    per-epoch loss. No validation is performed yet.
    """
    # NOTE(review): MPS support has known numerical/op-coverage gaps in some
    # PyTorch versions — compare against CPU on the same seed to isolate the
    # divergence; confirm all ops used by the model are MPS-supported.
    device = torch.device('mps')

    EPOCHS = 5

    model = ESFuturesPricePredictorModel(maps_multiplier=1)
    model.to(device)

    dataset = ESFuturesDataset('data', ['5m'], [130], common_transform=[my_transform, torch_transform])

    # Shuffle for training; a validation loader is not set up yet.
    training_loader = torch.utils.data.DataLoader(
        dataset, batch_size=64 * 4, shuffle=True, num_workers=4)

    loss_fn = torch.nn.MSELoss()

    # Report split sizes
    print('Training set has {} instances'.format(len(dataset)))

    optimizer = torch.optim.AdamW(model.parameters(), lr=0.0001, betas=(0.9, 0.95))

    def train_one_epoch(epoch_index):
        """Run one full pass over the training data.

        Returns the mean loss of the most recent 100-batch window, or 0.0
        if fewer than 100 batches were processed (the window never filled).
        """
        running_loss = 0.
        last_loss = 0.

        for i, data in enumerate(training_loader):
            inputs, labels = data

            # Inputs/labels are dicts of tensors; move every tensor to the device.
            for k in inputs:
                inputs[k] = inputs[k].to(device)
            for k in labels:
                labels[k] = labels[k].to(device)

            optimizer.zero_grad()

            outputs = model(inputs)

            loss = loss_fn(outputs, labels['5m'])
            loss.backward()

            # Adjust learning weights
            optimizer.step()

            # Gather data and report every 100 batches.
            running_loss += loss.item()
            if i % 100 == 99:
                last_loss = running_loss / 100 # loss per batch
                print('  batch {} loss: {}'.format(i + 1, last_loss))
                running_loss = 0.

        return last_loss

    for epoch in range(EPOCHS):
        print('EPOCH {}:'.format(epoch + 1))

        # Ensure training mode (affects dropout/batch-norm layers, if any).
        model.train(True)
        avg_loss = train_one_epoch(epoch)

        print('LOSS train {}'.format(avg_loss))