StyleGAN2 training: Illegal instruction on cache clear (torch.cuda.empty_cache())

Hi all,

I am training a StyleGAN2 model and have recently been facing a multitude of crashes. Below are the Python stack traces (anonymised, in case the paths look slightly odd), obtained by running the training script with

python3 -X faulthandler -u StyleGAN2.py
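
For reference, the same traces can also be produced by enabling faulthandler inside the script instead of via the -X flag; a minimal sketch (not taken from StyleGAN2.py):

import faulthandler
import sys

# Equivalent to -X faulthandler: on a fatal signal (SIGSEGV, SIGILL, ...),
# dump the Python traceback of every thread to stderr.
faulthandler.enable(file=sys.stderr, all_threads=True)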

One error is due to an illegal instruction:

 38%|███▊ | 591068/1562500 [29:32:52<107:33:34, 2.51it/s]Fatal Python error: Illegal instruction

Thread 0x00000001 (most recent call first):

Thread 0x00000002 (most recent call first):
  File "/path/to/python/lib/python3.11/threading.py", line 331 in wait
  File "/path/to/python/lib/python3.11/threading.py", line 629 in wait
  File "/path/to/python/lib/python3.11/site-packages/tqdm/_monitor.py", line 60 in run
  File "/path/to/python/lib/python3.11/threading.py", line 1045 in _bootstrap_inner
  File "/path/to/python/lib/python3.11/threading.py", line 1002 in _bootstrap

Thread 0x00000003 (most recent call first):
  File "/path/to/python/lib/python3.11/threading.py", line 327 in wait
  File "/path/to/python/lib/python3.11/multiprocessing/queues.py", line 231 in _feed
  File "/path/to/python/lib/python3.11/threading.py", line 982 in run
  File "/path/to/python/lib/python3.11/threading.py", line 1045 in _bootstrap_inner
  File "/path/to/python/lib/python3.11/threading.py", line 1002 in _bootstrap

Thread 0x00000004 (most recent call first):
  File "/path/to/python/lib/python3.11/selectors.py", line 415 in select
  File "/path/to/python/lib/python3.11/multiprocessing/connection.py", line 947 in wait
  File "/path/to/python/lib/python3.11/multiprocessing/connection.py", line 440 in _poll
  File "/path/to/python/lib/python3.11/multiprocessing/connection.py", line 257 in poll
  File "/path/to/python/lib/python3.11/multiprocessing/queues.py", line 113 in get
  File "/path/to/python/lib/python3.11/site-packages/torch/utils/data/_utils/pin_memory.py", line 32 in do_one_step
  File "/path/to/python/lib/python3.11/site-packages/torch/utils/data/_utils/pin_memory.py", line 55 in _pin_memory_loop
  File "/path/to/python/lib/python3.11/threading.py", line 982 in run
  File "/path/to/python/lib/python3.11/threading.py", line 1045 in _bootstrap_inner
  File "/path/to/python/lib/python3.11/threading.py", line 1002 in _bootstrap

Current thread 0x00000005 (most recent call first):
  File "/path/to/python/lib/python3.11/site-packages/torch/cuda/memory.py", line 170 in empty_cache
  File "/path/to/project/StyleGAN2/StyleGAN2.py", line 1129 in <module>

Line 1129 is: torch.cuda.empty_cache()
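
For context, that call sits at the top level of the training loop. A paraphrased sketch of the surrounding code (everything except torch.cuda.empty_cache() is a hypothetical placeholder, since only line 1129 is quoted above):

import torch

def train_one_step():
    # Placeholder for the actual generator/discriminator updates.
    pass

for step in range(1_562_500):  # total taken from the tqdm counter above
    train_one_step()
    torch.cuda.empty_cache()  # line 1129: where the illegal instruction fires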

Another error is a segfault:

FID score for iteration 20000: 64.7917022705078175 [02:54<00:00, 25.46it/s]
  1%|▏         | 20668/1562500 [3:51:10<168:03:37,  2.55it/s]Fatal Python error: Segmentation fault

Current thread 0x00000001 (most recent call first):
  <no Python frame>

Thread 0x00000002 (most recent call first):
  File "/path/to/python/lib/python3.11/threading.py", line 331 in wait
  File "/path/to/python/lib/python3.11/threading.py", line 629 in wait
  File "/path/to/python/lib/python3.11/site-packages/tqdm/_monitor.py", line 60 in run
  File "/path/to/python/lib/python3.11/threading.py", line 1045 in bootstrapinner
  File "/path/to/python/lib/python3.11/threading.py", line 1002 in _bootstrap

Thread 0x00000003 (most recent call first):
  File "/path/to/python/lib/python3.11/threading.py", line 327 in wait
  File "/path/to/python/lib/python3.11/multiprocessing/queues.py", line 231 in _feed
  File "/path/to/python/lib/python3.11/threading.py", line 982 in run
  File "/path/to/python/lib/python3.11/threading.py", line 1045 in bootstrapinner
  File "/path/to/python/lib/python3.11/threading.py", line 1002 in _bootstrap

Thread 0x00000004 (most recent call first):
  File "/path/to/python/lib/python3.11/threading.py", line 327 in wait
  File "/path/to/python/lib/python3.11/multiprocessing/queues.py", line 231 in _feed
  File "/path/to/python/lib/python3.11/threading.py", line 982 in run
  File "/path/to/python/lib/python3.11/threading.py", line 1045 in bootstrapinner
  File "/path/to/python/lib/python3.11/threading.py", line 1002 in _bootstrap

Thread 0x00000005 (most recent call first):
  File "/path/to/python/lib/python3.11/selectors.py", line 415 in select
  File "/path/to/python/lib/python3.11/multiprocessing/connection.py", line 947 in wait
  File "/path/to/python/lib/python3.11/multiprocessing/connection.py", line 440 in _poll
  File "/path/to/python/lib/python3.11/multiprocessing/connection.py", line 257 in poll
  File "/path/to/python/lib/python3.11/multiprocessing/queues.py", line 113 in get
  File "/path/to/python/lib/python3.11/site-packages/torch/utils/data/_utils/pin_memory.py", line 32 in do_one_step
  File "/path/to/python/lib/python3.11/site-packages/torch/utils/data/_utils/pin_memory.py", line 55 in pinmemory_loop
  File "/path/to/python/lib/python3.11/threading.py", line 982 in run
  File "/path/to/python/lib/python3.11/threading.py", line 1045 in bootstrapinner
  File "/path/to/python/lib/python3.11/threading.py", line 1002 in _bootstrap

Thread 0x00000006 (most recent call first):
  File "/path/to/python/lib/python3.11/site-packages/torch/autograd/graph.py", line 768 in enginerun_backward
  File "/path/to/python/lib/python3.11/site-packages/torch/autograd/init.py", line 289 in backward
  File "/path/to/python/lib/python3.11/site-packages/torch/_tensor.py", line 521 in backward
  File "/path/to/project/StyleGAN2/StyleGAN2.py", line 1093 in module

Line 1093 is: d_optimizer.step()
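
For context, the trace points into Tensor.backward via the autograd engine, so the segfault fires during the discriminator update. A hypothetical sketch of what that step typically looks like (only d_optimizer.step() is from the actual script; all other names are assumptions):

import torch

def discriminator_step(d_optimizer: torch.optim.Optimizer, d_loss: torch.Tensor):
    # Hypothetical names; only d_optimizer.step() comes from StyleGAN2.py.
    d_optimizer.zero_grad()
    d_loss.backward()   # the frames above show the crash inside this call
    d_optimizer.step()  # line 1093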

These errors appear seemingly at random, cropping up at different points throughout training.

Thanks for reading this post; I'd be glad of any suggestions on how to fix this :slight_smile: