Distributed data parallel freezes without error message

I found the answer!

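Modify /etc/default/grub as shown below, then regenerate the GRUB config (e.g. `sudo update-grub` on Debian/Ubuntu) and reboot: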

#GRUB_CMDLINE_LINUX=""                    <-- original line, commented out
GRUB_CMDLINE_LINUX="iommu=soft"           <-- new line
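
To confirm after rebooting that the kernel actually picked up the parameter, you can look at /proc/cmdline. Here is a minimal sketch in Python; the helper names `kernel_param_active` and `read_cmdline` are my own (hypothetical, not from PyTorch or the linked issue), and it assumes a Linux system where /proc/cmdline is readable.

```python
from pathlib import Path


def kernel_param_active(param: str, cmdline: str) -> bool:
    """Return True if `param` appears as a whole token on the kernel command line.

    Hypothetical helper: splits on whitespace so that e.g. "amd_iommu=soft"
    is not mistaken for "iommu=soft".
    """
    return param in cmdline.split()


def read_cmdline(path: str = "/proc/cmdline") -> str:
    """Read the running kernel's command line (assumes Linux); '' if unavailable."""
    p = Path(path)
    return p.read_text() if p.exists() else ""


if __name__ == "__main__":
    active = kernel_param_active("iommu=soft", read_cmdline())
    print("iommu=soft active:", active)
```

If this prints False after a reboot, the GRUB config was probably not regenerated, or a different boot entry is in use.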

Ref: https://github.com/pytorch/pytorch/issues/1637#issuecomment-338268158
