I have an NLP model that trains fine in the following contexts:
- Windows 11 + CPU
- Windows 11 + CUDA
- Ubuntu 20.04 + CPU
- Ubuntu 20.04 + CUDA
- macOS 12.5 + CPU
However, running the same model with “mps” as the device produces unexpected behavior: the nn.Embedding layers in my model initialize normally, but their weights quickly train to NaN values. No error is raised; the loss simply never improves. Since the model trains fine if I change the device to “cpu”, I suspect the problem is in my virtual environment setup. Does anyone have insight into what I am doing wrong?
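To pin down when the weights go bad, a check like the one below could be called after each optimizer step. This is only a minimal sketch: `find_nonfinite_params` is a hypothetical helper name, and the tiny `nn.Sequential` model with its sizes is a stand-in for the real model, which is not shown here.

```python
import torch
from torch import nn


def find_nonfinite_params(model: nn.Module) -> list:
    """Return names of parameters whose values or gradients contain NaN/Inf."""
    bad = []
    for name, param in model.named_parameters():
        if not torch.isfinite(param).all():
            bad.append(name)
        elif param.grad is not None and not torch.isfinite(param.grad).all():
            bad.append(f"{name}.grad")
    return bad


# Toy stand-in for the real model (hypothetical layer sizes).
model = nn.Sequential(nn.Embedding(10, 4), nn.Linear(4, 2))
clean_report = find_nonfinite_params(model)

# Simulate the failure by writing a NaN into one embedding row.
with torch.no_grad():
    model[0].weight[3, 0] = float("nan")
poisoned_report = find_nonfinite_params(model)

print("before:", clean_report)
print("after: ", poisoned_report)
```

Calling the helper inside the training loop and stopping at the first non-empty result would show whether the NaNs first appear in the embedding weights or in their gradients.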
Environment Setup Information
To set up an M1-compatible environment, I used the following commands:
CONDA_SUBDIR=osx-arm64 conda create -n test_environment python=3.9 -c conda-forge
conda env config vars set CONDA_SUBDIR=osx-arm64
pip3 install torch torchvision torchaudio
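One thing worth confirming after these commands is that the interpreter inside the environment actually runs natively as arm64 rather than as x86_64 under Rosetta. The sketch below uses only the standard library; `describe_interpreter` is a hypothetical helper name, and the warning text is illustrative.

```python
import platform
import sys


def describe_interpreter() -> dict:
    """Collect basic facts about the running Python interpreter."""
    return {
        "python": sys.version.split()[0],
        "system": platform.system(),    # "Darwin" on macOS
        "machine": platform.machine(),  # "arm64" when running natively on Apple silicon
    }


info = describe_interpreter()
print(info)

if info["system"] == "Darwin" and info["machine"] != "arm64":
    print("Warning: this Python is not running natively as arm64.")
```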
Output from Simple Sanity Checks
These sanity checks seem to suggest the environment should work:
import torch
import platform
print(f'Platform: {platform.platform()}')
print(f'torch.has_mps: {torch.has_mps}')
print(f'MPS is available: {torch.backends.mps.is_available()}')
print(f'Pytorch was built with MPS: {torch.backends.mps.is_built()}')
Platform: macOS-12.5-arm64-arm-64bit
torch.has_mps: True
MPS is available: True
Pytorch was built with MPS: True
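The checks above confirm that MPS is available but do not exercise nn.Embedding itself. A slightly stronger sanity check, sketched below under assumptions, trains a lone embedding layer for a few steps and reports whether its weights stay finite. The helper names, layer sizes, learning rate, and toy loss are all hypothetical and are not taken from the real model; the script falls back to the CPU when MPS is not available.

```python
import torch
from torch import nn


def pick_device() -> torch.device:
    """Use MPS when this PyTorch build and machine support it, else CPU."""
    mps = getattr(torch.backends, "mps", None)
    if mps is not None and mps.is_available():
        return torch.device("mps")
    return torch.device("cpu")


def embedding_stays_finite(device: torch.device, steps: int = 5) -> bool:
    """Run a few SGD steps on a small embedding and check for NaN/Inf weights."""
    torch.manual_seed(0)
    emb = nn.Embedding(100, 16).to(device)
    opt = torch.optim.SGD(emb.parameters(), lr=0.1)
    # Create the indices on the CPU, then move them to the target device.
    ids = torch.randint(0, 100, (32,)).to(device)
    for _ in range(steps):
        opt.zero_grad()
        loss = emb(ids).pow(2).mean()  # toy loss, only for this check
        loss.backward()
        opt.step()
    return bool(torch.isfinite(emb.weight).all())


device = pick_device()
cpu_ok = embedding_stays_finite(torch.device("cpu"))
device_ok = embedding_stays_finite(device)
print(f"cpu finite: {cpu_ok}, {device.type} finite: {device_ok}")
```

If the CPU run stays finite while the MPS run does not, that would point at the MPS backend or the installed PyTorch build rather than at the model code.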