Hi all,
I am training a StyleGAN2 model and have recently been hitting a number of crashes. Below are the Python stack traces (the paths have been anonymised, so they may look slightly odd), obtained by running the training script with:
python3 -X faulthandler -u StyleGAN2.py
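The same handler can also be switched on from inside the script instead of on the command line. This is a minimal sketch using the standard-library `faulthandler` module; the helper name `enable_crash_traces` and the log file name are hypothetical and not part of StyleGAN2.py:

```python
import faulthandler
import sys


def enable_crash_traces(log_path=None):
    """Turn on faulthandler for all threads.

    If log_path is given, tracebacks are appended to that file and the open
    file object is returned; it must stay open for as long as faulthandler
    is enabled. Otherwise tracebacks go to stderr and None is returned.
    """
    if log_path is None:
        faulthandler.enable(file=sys.stderr, all_threads=True)
        return None
    log_file = open(log_path, "a")
    faulthandler.enable(file=log_file, all_threads=True)
    return log_file


# Hypothetical usage near the top of the training script:
# crash_log = enable_crash_traces("crash_traces.log")
```

Writing to a file rather than stderr keeps the trace separate from the tqdm progress output, which otherwise ends up on the same line as the "Fatal Python error" message.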
One error is an illegal instruction:
38%|███▊ | 591068/1562500 [29:32:52<107:33:34, 2.51it/s]Fatal Python error: Illegal instruction
Thread 0x00000001 (most recent call first):
Thread 0x00000002 (most recent call first):
File "/path/to/python/lib/python3.11/threading.py", line 331 in wait
File "/path/to/python/lib/python3.11/threading.py", line 629 in wait
File "/path/to/python/lib/python3.11/site-packages/tqdm/_monitor.py", line 60 in run
File "/path/to/python/lib/python3.11/threading.py", line 1045 in _bootstrap_inner
File "/path/to/python/lib/python3.11/threading.py", line 1002 in _bootstrap
Thread 0x00000003 (most recent call first):
File "/path/to/python/lib/python3.11/threading.py", line 327 in wait
File "/path/to/python/lib/python3.11/multiprocessing/queues.py", line 231 in _feed
File "/path/to/python/lib/python3.11/threading.py", line 982 in run
File "/path/to/python/lib/python3.11/threading.py", line 1045 in _bootstrap_inner
File "/path/to/python/lib/python3.11/threading.py", line 1002 in _bootstrap
Thread 0x00000004 (most recent call first):
File "/path/to/python/lib/python3.11/selectors.py", line 415 in select
File "/path/to/python/lib/python3.11/multiprocessing/connection.py", line 947 in wait
File "/path/to/python/lib/python3.11/multiprocessing/connection.py", line 440 in _poll
File "/path/to/python/lib/python3.11/multiprocessing/connection.py", line 257 in poll
File "/path/to/python/lib/python3.11/multiprocessing/queues.py", line 113 in get
File "/path/to/python/lib/python3.11/site-packages/torch/utils/data/_utils/pin_memory.py", line 32 in do_one_step
File "/path/to/python/lib/python3.11/site-packages/torch/utils/data/_utils/pin_memory.py", line 55 in _pin_memory_loop
File "/path/to/python/lib/python3.11/threading.py", line 982 in run
File "/path/to/python/lib/python3.11/threading.py", line 1045 in _bootstrap_inner
File "/path/to/python/lib/python3.11/threading.py", line 1002 in _bootstrap
Current thread 0x00000005 (most recent call first):
File "/path/to/python/lib/python3.11/site-packages/torch/cuda/memory.py", line 170 in empty_cache
File "/path/to/project/StyleGAN2/StyleGAN2.py", line 1129 in <module>
Line 1129 is: torch.cuda.empty_cache()
Another error is a segfault:
FID score for iteration 20000: 64.7917022705078175 [02:54<00:00, 25.46it/s]
1%|▏ | 20668/1562500 [3:51:10<168:03:37, 2.55it/s]Fatal Python error: Segmentation fault
Current thread 0x00000001 (most recent call first):
<no Python frame>
Thread 0x00000002 (most recent call first):
File "/path/to/python/lib/python3.11/threading.py", line 331 in wait
File "/path/to/python/lib/python3.11/threading.py", line 629 in wait
File "/path/to/python/lib/python3.11/site-packages/tqdm/_monitor.py", line 60 in run
File "/path/to/python/lib/python3.11/threading.py", line 1045 in _bootstrap_inner
File "/path/to/python/lib/python3.11/threading.py", line 1002 in _bootstrap
Thread 0x00000003 (most recent call first):
File "/path/to/python/lib/python3.11/threading.py", line 327 in wait
File "/path/to/python/lib/python3.11/multiprocessing/queues.py", line 231 in _feed
File "/path/to/python/lib/python3.11/threading.py", line 982 in run
File "/path/to/python/lib/python3.11/threading.py", line 1045 in _bootstrap_inner
File "/path/to/python/lib/python3.11/threading.py", line 1002 in _bootstrap
Thread 0x00000004 (most recent call first):
File "/path/to/python/lib/python3.11/threading.py", line 327 in wait
File "/path/to/python/lib/python3.11/multiprocessing/queues.py", line 231 in _feed
File "/path/to/python/lib/python3.11/threading.py", line 982 in run
File "/path/to/python/lib/python3.11/threading.py", line 1045 in _bootstrap_inner
File "/path/to/python/lib/python3.11/threading.py", line 1002 in _bootstrap
Thread 0x00000005 (most recent call first):
File "/path/to/python/lib/python3.11/selectors.py", line 415 in select
File "/path/to/python/lib/python3.11/multiprocessing/connection.py", line 947 in wait
File "/path/to/python/lib/python3.11/multiprocessing/connection.py", line 440 in _poll
File "/path/to/python/lib/python3.11/multiprocessing/connection.py", line 257 in poll
File "/path/to/python/lib/python3.11/multiprocessing/queues.py", line 113 in get
File "/path/to/python/lib/python3.11/site-packages/torch/utils/data/_utils/pin_memory.py", line 32 in do_one_step
File "/path/to/python/lib/python3.11/site-packages/torch/utils/data/_utils/pin_memory.py", line 55 in _pin_memory_loop
File "/path/to/python/lib/python3.11/threading.py", line 982 in run
File "/path/to/python/lib/python3.11/threading.py", line 1045 in _bootstrap_inner
File "/path/to/python/lib/python3.11/threading.py", line 1002 in _bootstrap
Thread 0x00000006 (most recent call first):
File "/path/to/python/lib/python3.11/site-packages/torch/autograd/graph.py", line 768 in _engine_run_backward
File "/path/to/python/lib/python3.11/site-packages/torch/autograd/__init__.py", line 289 in backward
File "/path/to/python/lib/python3.11/site-packages/torch/_tensor.py", line 521 in backward
File "/path/to/project/StyleGAN2/StyleGAN2.py", line 1093 in <module>
where line 1093 is: d_optimizer.step()
These errors are seemingly random, cropping up throughout training at different points.
Thanks for reading this post. I’d be glad to take any suggestions on how to fix this.