[MPS Crash] Assertion failed in _getLSTMGradKernelDAGObject using LSTM on macOS with MPS backend

Running a bidirectional LSTM model on macOS with the MPS backend crashes with the following error during training/backpropagation:
Assertion failed: (shape4.size() >= 3), function _getLSTMGradKernelDAGObject, file GPURNNOps.mm, line 2417.
zsh: abort python3 translation_model.py
resource_tracker: There appear to be 1 leaked semaphore objects to clean up at shutdown


Hello Jack,

PyTorch version: 2.7.1

macOS version: 15.5 (24F74)

I am feeding the input with shape (sequence, batch, length), for example (N, 1, 27). I am not using batch_first.
Model definition:
self.input_size=input_size
self.hidden_size=hidden_size
self.output_size=output_size
self.encoder=nn.LSTM(self.input_size,self.hidden_size,2,bidirectional=True)
self.u=nn.Linear(self.hidden_size*2,self.hidden_size)
self.w1=nn.Linear(self.hidden_size,self.hidden_size)
self.w2=nn.Linear(self.hidden_size,self.hidden_size)
self.w3=nn.Linear(self.hidden_size,self.hidden_size)
self.w4=nn.Linear(self.hidden_size,self.hidden_size)
self.w=nn.Linear(self.hidden_size,self.hidden_size)
self.atten=nn.Linear(self.hidden_size,1)
self.output2catin=nn.Linear(self.output_size,self.hidden_size*2)
self.decoder=nn.LSTM(self.hidden_size*4,self.hidden_size,4)
self.hidden_output=nn.Linear(self.hidden_size,self.output_size)
self.softmax=nn.LogSoftmax(dim=2)
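
For anyone who wants to try this without the full model, here is a minimal sketch that exercises only the bidirectional encoder with the same input layout. The function name `run_training_step` and the value `hidden_size = 16` are assumptions made for illustration, and it is not confirmed that this reduced case triggers the same assertion on MPS; it only isolates the forward and backward pass through the LSTM.

```python
import torch
import torch.nn as nn


def run_training_step(device: str = "cpu", seq_len: int = 5) -> torch.Size:
    """Run one forward/backward pass through a 2-layer bidirectional LSTM."""
    torch.manual_seed(0)
    input_size, hidden_size = 27, 16  # hidden_size is an assumed value
    encoder = nn.LSTM(input_size, hidden_size, 2, bidirectional=True).to(device)
    # Input layout is (sequence, batch, features); batch_first stays False.
    x = torch.randn(seq_len, 1, input_size, device=device)
    output, _ = encoder(x)
    # The crash was reported during backpropagation, so a backward pass is needed.
    output.sum().backward()
    return output.shape


# Usage: run_training_step("cpu") as a baseline, then run_training_step("mps")
# on an Apple Silicon machine to check whether the abort reproduces.
```

The output's last dimension is twice `hidden_size` because the forward and backward directions are concatenated.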

Yes, the model works fine on the CPU. Inference also works on the GPU; only the training procedure is killed.
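
Given that observation, one possible stopgap (not a fix for the underlying bug) is to train on the CPU and use MPS only for inference. The helper name `pick_device` below is hypothetical; this is just a sketch of that idea.

```python
import torch


def pick_device(training: bool) -> torch.device:
    """Return the CPU for training and MPS, when available, for inference only."""
    if not training and torch.backends.mps.is_available():
        return torch.device("mps")
    return torch.device("cpu")


# Usage: model.to(pick_device(training=True)) before the training loop,
# then model.to(pick_device(training=False)) before running inference.
```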


Hello Jack,

I mean the model fails while training on the GPU. Do you have any idea why it is failing? If so, could you kindly help identify the root cause?

Thanks,
Harish.

Hi all,

Getting this exact same error when training a RecurrentPPO model on MPS, which also uses an LSTM PyTorch model underneath. Running on an M3 Ultra.

Just updated to macOS 15.6 and still get the same error. Running PyTorch 2.7.1.

Could you create an issue on GitHub describing this use case in case the most recent nightly binary also fails?
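
When filing the issue, it helps to include the environment details. A small sketch for collecting them is below; the function name `environment_report` is hypothetical. PyTorch also ships `python -m torch.utils.collect_env`, which prints a fuller report.

```python
import platform

import torch


def environment_report() -> dict:
    """Collect the version details that matter for an MPS bug report."""
    return {
        "torch": torch.__version__,
        # mac_ver() returns an empty version string on non-macOS systems.
        "macos": platform.mac_ver()[0] or "not macOS",
        "machine": platform.machine(),
        "mps_built": torch.backends.mps.is_built(),
        "mps_available": torch.backends.mps.is_available(),
    }


# Usage: print(environment_report()) and paste the result into the issue.
```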

@HarishAi, will you raise that issue?

Having the exact same problem with the latest nightly build (2.9.0.dev20250801).

Did you already raise that issue, @HarishAi?

@Vinnie, yes, I have raised it.

I am seeing no activity on the GitHub issue. How do issues normally get picked up?