PyTorch multi-process error on SageMaker GPU instance

I'm trying to train a forecasting model using pytorch-forecasting on a GPU instance (ml.p3.16xlarge). It only works when I specify one GPU; when I configure more than one GPU, it returns this error:

ProcessExitedException: process 1 terminated with signal SIGSEGV

env:

Python: 3.10.6, PyTorch: 1.13.1+cu116, pytorch-lightning: 1.9.0, pytorch-forecasting: 0.10.3

code:
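```python
import pytorch_lightning as pl

# lr_logger, early_stop_callback, and logger are created elsewhere in the script (not shown)
trainer = pl.Trainer(
    max_epochs=5,
    accelerator='gpu',
    devices=2,  # works with devices=1, fails with more than one GPU
    enable_model_summary=True,
    gradient_clip_val=0.1,
    limit_train_batches=30,
    callbacks=[lr_logger, early_stop_callback],
    logger=logger,
)
```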

What’s missing here?

Could you post the backtrace via:

gdb --args python script.py args
...
run
...
bt

Thank you @ptrblck! Below is the backtrace:

sh-4.2$ gdb --args /home/ec2-user/anaconda3/envs/python3/bin/python3 gpu-test.py 
GNU gdb (GDB) Red Hat Enterprise Linux 8.0.1-36.amzn2.0.1
Copyright (C) 2017 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
<http://www.gnu.org/software/gdb/documentation/>.
For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from /home/ec2-user/anaconda3/envs/python3/bin/python3...done.
(gdb) run
Starting program: /home/ec2-user/anaconda3/envs/python3/bin/python3 gpu-test.py
Missing separate debuginfos, use: debuginfo-install glibc-2.26-62.amzn2.x86_64
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
[New Thread 0x7fffed823700 (LWP 94784)]
[New Thread 0x7fffece22700 (LWP 94785)]
[New Thread 0x7fffe4421700 (LWP 94786)]
[New Thread 0x7fffdba20700 (LWP 94787)]
[New Thread 0x7fffcb01f700 (LWP 94788)]
[New Thread 0x7fffca61e700 (LWP 94789)]
[New Thread 0x7fffb9c1d700 (LWP 94790)]
[New Thread 0x7fffb921c700 (LWP 94791)]
[New Thread 0x7fffb081b700 (LWP 94792)]
[New Thread 0x7fffa7e1a700 (LWP 94793)]
[New Thread 0x7fff9f419700 (LWP 94794)]
[New Thread 0x7fff8ea18700 (LWP 94795)]
[New Thread 0x7fff8e017700 (LWP 94796)]
[New Thread 0x7fff7d616700 (LWP 94797)]
[New Thread 0x7fff74c15700 (LWP 94798)]
[New Thread 0x7fff6c214700 (LWP 94799)]
[New Thread 0x7fff6b813700 (LWP 94800)]
[New Thread 0x7fff62e12700 (LWP 94801)]
[New Thread 0x7fff5a411700 (LWP 94802)]
[New Thread 0x7fff51a10700 (LWP 94803)]
[New Thread 0x7fff4100f700 (LWP 94804)]
[New Thread 0x7fff3860e700 (LWP 94805)]
[New Thread 0x7fff2fc0d700 (LWP 94806)]
[New Thread 0x7fff2f20c700 (LWP 94807)]
[New Thread 0x7fff2680b700 (LWP 94808)]
[New Thread 0x7fff15e0a700 (LWP 94809)]
[New Thread 0x7fff0d409700 (LWP 94810)]
[New Thread 0x7fff04a08700 (LWP 94811)]
[New Thread 0x7ffefc007700 (LWP 94812)]
[New Thread 0x7ffef3606700 (LWP 94813)]
[New Thread 0x7ffeeac05700 (LWP 94814)]
[New Thread 0x7ffee2204700 (LWP 94815)]
[New Thread 0x7ffed9803700 (LWP 94816)]
[New Thread 0x7ffed0e02700 (LWP 94817)]
[New Thread 0x7ffec8401700 (LWP 94818)]
[New Thread 0x7ffebfa00700 (LWP 94819)]
[New Thread 0x7ffebefff700 (LWP 94820)]
[New Thread 0x7ffeae5fe700 (LWP 94821)]
[New Thread 0x7ffea5bfd700 (LWP 94822)]
[New Thread 0x7ffe9d1fc700 (LWP 94823)]
[New Thread 0x7ffe9c7fb700 (LWP 94824)]
[New Thread 0x7ffe8bdfa700 (LWP 94825)]
[New Thread 0x7ffe8b3f9700 (LWP 94826)]
[New Thread 0x7ffe829f8700 (LWP 94827)]
[New Thread 0x7ffe79ff7700 (LWP 94828)]
[New Thread 0x7ffe695f6700 (LWP 94829)]
[New Thread 0x7ffe60bf5700 (LWP 94830)]
[New Thread 0x7ffe581f4700 (LWP 94831)]
[New Thread 0x7ffe4f7f3700 (LWP 94832)]
[New Thread 0x7ffe46df2700 (LWP 94833)]
[New Thread 0x7ffe463f1700 (LWP 94834)]
[New Thread 0x7ffe3d9f0700 (LWP 94835)]
[New Thread 0x7ffe2cfef700 (LWP 94836)]
[New Thread 0x7ffe2c5ee700 (LWP 94837)]
[New Thread 0x7ffe1bbed700 (LWP 94838)]
[New Thread 0x7ffe131ec700 (LWP 94839)]
[New Thread 0x7ffe0a7eb700 (LWP 94840)]
[New Thread 0x7ffe01dea700 (LWP 94841)]
[New Thread 0x7ffdf93e9700 (LWP 94842)]
[New Thread 0x7ffdf89e8700 (LWP 94843)]
[New Thread 0x7ffde7fe7700 (LWP 94844)]
[New Thread 0x7ffddf5e6700 (LWP 94845)]
[New Thread 0x7ffddebe5700 (LWP 94846)]
[New Thread 0x7ffdc89ff700 (LWP 94862)]
[New Thread 0x7ffdc726b700 (LWP 94872)]
[New Thread 0x7ffdc686a700 (LWP 94873)]
[New Thread 0x7ffdc5e69700 (LWP 94874)]
[New Thread 0x7ffdc5468700 (LWP 94875)]
[New Thread 0x7ffdc4a67700 (LWP 94876)]
[New Thread 0x7ffdbffff700 (LWP 94877)]
[New Thread 0x7ffdbf5fe700 (LWP 94878)]
[New Thread 0x7ffdbebfd700 (LWP 94879)]
Missing separate debuginfo for /home/ec2-user/anaconda3/envs/python3/lib/python3.10/site-packages/torch/lib/libgomp-a34b3233.so.1
Try: yum --enablerepo='*debug*' install /usr/lib/debug/.build-id/5f/4fb88af97be3ecacc71363136bb015b2a07119.debug
Missing separate debuginfo for /lib64/libcuda.so.1
Try: yum --enablerepo='*debug*' install /usr/lib/debug/.build-id/d7/a1b5fea0f4f1f6375e4ed8a59d4afbd04ef289.debug

[New Thread 0x7ffd35aa6700 (LWP 95027)]
[New Thread 0x7ffd350a5700 (LWP 95028)]
[New Thread 0x7ffd2ffff700 (LWP 95029)]
[New Thread 0x7ffd2edff700 (LWP 95030)]
[New Thread 0x7ffd2e3fe700 (LWP 95031)]
[New Thread 0x7ffd2d9fd700 (LWP 95032)]
[New Thread 0x7ffd2cffc700 (LWP 95033)]
[New Thread 0x7ffd13fff700 (LWP 95034)]
[New Thread 0x7ffd0b5fe700 (LWP 95035)]
[New Thread 0x7ffd135fe700 (LWP 95036)]
[New Thread 0x7ffd12bfd700 (LWP 95037)]
[New Thread 0x7ffd121fc700 (LWP 95038)]
[New Thread 0x7ffd117fb700 (LWP 95039)]
[New Thread 0x7ffd10dfa700 (LWP 95040)]
[New Thread 0x7ffd0bfff700 (LWP 95041)]
[New Thread 0x7ffd0abfd700 (LWP 95042)]
[New Thread 0x7ffd0a1fc700 (LWP 95043)]
[New Thread 0x7ffd097fb700 (LWP 95044)]
[New Thread 0x7ffd08dfa700 (LWP 95045)]
[New Thread 0x7ffcdbfff700 (LWP 95046)]
[New Thread 0x7ffcd35fe700 (LWP 95047)]
[New Thread 0x7ffcdb5fe700 (LWP 95048)]
[New Thread 0x7ffcdabfd700 (LWP 95049)]
[New Thread 0x7ffcda1fc700 (LWP 95050)]
[New Thread 0x7ffcd97fb700 (LWP 95051)]
[New Thread 0x7ffcd8dfa700 (LWP 95052)]
[New Thread 0x7ffcd3fff700 (LWP 95053)]
[New Thread 0x7ffcd2bfd700 (LWP 95054)]
[New Thread 0x7ffcd21fc700 (LWP 95055)]
[New Thread 0x7ffcd17fb700 (LWP 95056)]
[New Thread 0x7ffcd0dfa700 (LWP 95057)]
[New Thread 0x7ffca3dff700 (LWP 95058)]
[New Thread 0x7ffc9afff700 (LWP 95059)]
[New Thread 0x7ffc93fff700 (LWP 95069)]
[New Thread 0x7ffc935fe700 (LWP 95070)]
[New Thread 0x7ffc92bfd700 (LWP 95071)]
[New Thread 0x7ffc921fc700 (LWP 95072)]
[New Thread 0x7ffc917fb700 (LWP 95073)]
[New Thread 0x7ffc90dfa700 (LWP 95074)]
[New Thread 0x7ffc903f9700 (LWP 95075)]
[New Thread 0x7ffc8f9f8700 (LWP 95076)]
[New Thread 0x7ffc8eff7700 (LWP 95077)]
[New Thread 0x7ffc8e5f6700 (LWP 95078)]
[New Thread 0x7ffc8dbf5700 (LWP 95079)]
[New Thread 0x7ffc8d1f4700 (LWP 95080)]
[New Thread 0x7ffc8c7f3700 (LWP 95081)]
[New Thread 0x7ffc8bdf2700 (LWP 95082)]
[New Thread 0x7ffc8b3f1700 (LWP 95083)]
[New Thread 0x7ffc8a9f0700 (LWP 95084)]
[New Thread 0x7ffc89fef700 (LWP 95085)]
[New Thread 0x7ffc895ee700 (LWP 95086)]
[New Thread 0x7ffc88bed700 (LWP 95087)]
[New Thread 0x7ffc881ec700 (LWP 95088)]
[New Thread 0x7ffc877eb700 (LWP 95089)]
[New Thread 0x7ffc86dea700 (LWP 95090)]
[New Thread 0x7ffc863e9700 (LWP 95091)]
[New Thread 0x7ffc859e8700 (LWP 95092)]
[New Thread 0x7ffc84fe7700 (LWP 95093)]
[New Thread 0x7ffc845e6700 (LWP 95094)]
[New Thread 0x7ffc83be5700 (LWP 95095)]
[New Thread 0x7ffc831e4700 (LWP 95096)]
[New Thread 0x7ffc827e3700 (LWP 95097)]
[New Thread 0x7ffc81de2700 (LWP 95098)]
[New Thread 0x7ffc813e1700 (LWP 95099)]
[Thread 0x7ffc83be5700 (LWP 95095) exited]
[Thread 0x7ffc845e6700 (LWP 95094) exited]
[Thread 0x7ffc859e8700 (LWP 95092) exited]
[Thread 0x7ffc881ec700 (LWP 95088) exited]
[Thread 0x7ffc88bed700 (LWP 95087) exited]
[Thread 0x7ffc8a9f0700 (LWP 95084) exited]
[Thread 0x7ffc8e5f6700 (LWP 95078) exited]
[Thread 0x7ffc90dfa700 (LWP 95074) exited]
[Thread 0x7ffc813e1700 (LWP 95099) exited]
[Thread 0x7ffc917fb700 (LWP 95073) exited]
[Thread 0x7ffc8c7f3700 (LWP 95081) exited]
[Thread 0x7ffc8d1f4700 (LWP 95080) exited]
[Thread 0x7ffc831e4700 (LWP 95096) exited]
[Thread 0x7ffc8f9f8700 (LWP 95076) exited]
[Thread 0x7ffc89fef700 (LWP 95085) exited]
[Thread 0x7ffc8dbf5700 (LWP 95079) exited]
[Thread 0x7ffc81de2700 (LWP 95098) exited]
[Thread 0x7ffc895ee700 (LWP 95086) exited]
[Thread 0x7ffc903f9700 (LWP 95075) exited]
[Thread 0x7ffc863e9700 (LWP 95091) exited]
[Thread 0x7ffc8eff7700 (LWP 95077) exited]
[Thread 0x7ffc8bdf2700 (LWP 95082) exited]
[Thread 0x7ffc8b3f1700 (LWP 95083) exited]
[Thread 0x7ffc86dea700 (LWP 95090) exited]
[Thread 0x7ffc877eb700 (LWP 95089) exited]
[Thread 0x7ffc827e3700 (LWP 95097) exited]
[Thread 0x7ffc84fe7700 (LWP 95093) exited]
[New Thread 0x7ffc84fe7700 (LWP 95100)]
[New Thread 0x7ffc827e3700 (LWP 95101)]
[New Thread 0x7ffc877eb700 (LWP 95102)]
[New Thread 0x7ffc910c5700 (LWP 95103)]
[New Thread 0x7ffc906c4700 (LWP 95104)]
[New Thread 0x7ffc8fcc3700 (LWP 95105)]
[New Thread 0x7ffc8f2c2700 (LWP 95106)]
[New Thread 0x7ffc8e8c1700 (LWP 95107)]
[New Thread 0x7ffc8dec0700 (LWP 95108)]
[New Thread 0x7ffc8d4bf700 (LWP 95109)]
[New Thread 0x7ffc8cabe700 (LWP 95110)]
[New Thread 0x7ffc8c0bd700 (LWP 95111)]
[New Thread 0x7ffc8b6bc700 (LWP 95112)]
[New Thread 0x7ffc8acbb700 (LWP 95113)]
[New Thread 0x7ffc8a2ba700 (LWP 95114)]
[New Thread 0x7ffc898b9700 (LWP 95115)]
[New Thread 0x7ffc88eb8700 (LWP 95116)]
[New Thread 0x7ffc884b7700 (LWP 95117)]
[New Thread 0x7ffc86dea700 (LWP 95118)]
[New Thread 0x7ffc863e9700 (LWP 95119)]
[New Thread 0x7ffc859e8700 (LWP 95120)]
[New Thread 0x7ffc845e6700 (LWP 95121)]
[New Thread 0x7ffc83be5700 (LWP 95122)]
[New Thread 0x7ffc831e4700 (LWP 95123)]
[New Thread 0x7ffc81de2700 (LWP 95124)]
[New Thread 0x7ffc813e1700 (LWP 95125)]
[New Thread 0x7ffc6bfff700 (LWP 95126)]
[Thread 0x7ffc6bfff700 (LWP 95126) exited]
[Thread 0x7ffc81de2700 (LWP 95124) exited]
[Thread 0x7ffc884b7700 (LWP 95117) exited]
[Thread 0x7ffc8a2ba700 (LWP 95114) exited]
[Thread 0x7ffc8c0bd700 (LWP 95111) exited]
[Thread 0x7ffc8d4bf700 (LWP 95109) exited]
[Thread 0x7ffc8e8c1700 (LWP 95107) exited]
[Thread 0x7ffc910c5700 (LWP 95103) exited]
[Thread 0x7ffc84fe7700 (LWP 95100) exited]
[Thread 0x7ffc86dea700 (LWP 95118) exited]
[Thread 0x7ffc8f2c2700 (LWP 95106) exited]
[Thread 0x7ffc8cabe700 (LWP 95110) exited]
[Thread 0x7ffc877eb700 (LWP 95102) exited]
[Thread 0x7ffc827e3700 (LWP 95101) exited]
[Thread 0x7ffc8b6bc700 (LWP 95112) exited]
[Thread 0x7ffc898b9700 (LWP 95115) exited]
[Thread 0x7ffc813e1700 (LWP 95125) exited]
[Thread 0x7ffc88eb8700 (LWP 95116) exited]
[Thread 0x7ffc859e8700 (LWP 95120) exited]
[Thread 0x7ffc8acbb700 (LWP 95113) exited]
[Thread 0x7ffc83be5700 (LWP 95122) exited]
[Thread 0x7ffc831e4700 (LWP 95123) exited]
[Thread 0x7ffc863e9700 (LWP 95119) exited]
[Thread 0x7ffc8fcc3700 (LWP 95105) exited]
[Thread 0x7ffc845e6700 (LWP 95121) exited]
[Thread 0x7ffc8dec0700 (LWP 95108) exited]
[Thread 0x7ffc906c4700 (LWP 95104) exited]
[New Thread 0x7ffc906c4700 (LWP 95140)]
[New Thread 0x7ffc8dec0700 (LWP 95141)]
[New Thread 0x7ffc845e6700 (LWP 95142)]
[New Thread 0x7ffc913fb700 (LWP 95143)]
[New Thread 0x7ffc8fcc3700 (LWP 95144)]
[New Thread 0x7ffc8f2c2700 (LWP 95145)]
[New Thread 0x7ffc8e8c1700 (LWP 95146)]
[New Thread 0x7ffc8d4bf700 (LWP 95147)]
[New Thread 0x7ffc8cabe700 (LWP 95148)]
[New Thread 0x7ffc8c0bd700 (LWP 95149)]
[New Thread 0x7ffc8b6bc700 (LWP 95150)]
[New Thread 0x7ffc8acbb700 (LWP 95151)]
[New Thread 0x7ffc8a2ba700 (LWP 95152)]
[New Thread 0x7ffc898b9700 (LWP 95153)]
[New Thread 0x7ffc88eb8700 (LWP 95154)]
[New Thread 0x7ffc884b7700 (LWP 95155)]
[New Thread 0x7ffc87ab6700 (LWP 95156)]
[New Thread 0x7ffc870b5700 (LWP 95157)]
[New Thread 0x7ffc866b4700 (LWP 95158)]
[New Thread 0x7ffc85cb3700 (LWP 95159)]
[New Thread 0x7ffc852b2700 (LWP 95160)]
[New Thread 0x7ffc83be5700 (LWP 95161)]
[New Thread 0x7ffc831e4700 (LWP 95162)]
[New Thread 0x7ffc827e3700 (LWP 95163)]
[New Thread 0x7ffc81de2700 (LWP 95164)]
[New Thread 0x7ffc813e1700 (LWP 95165)]
[New Thread 0x7ffc6bfff700 (LWP 95166)]
[Thread 0x7ffc6bfff700 (LWP 95166) exited]
[Thread 0x7ffc813e1700 (LWP 95165) exited]
[Thread 0x7ffc852b2700 (LWP 95160) exited]
[Thread 0x7ffc85cb3700 (LWP 95159) exited]
[Thread 0x7ffc866b4700 (LWP 95158) exited]
[Thread 0x7ffc898b9700 (LWP 95153) exited]
[Thread 0x7ffc8c0bd700 (LWP 95149) exited]
[Thread 0x7ffc8e8c1700 (LWP 95146) exited]
[Thread 0x7ffc913fb700 (LWP 95143) exited]
[Thread 0x7ffc87ab6700 (LWP 95156) exited]
[Thread 0x7ffc8f2c2700 (LWP 95145) exited]
[Thread 0x7ffc870b5700 (LWP 95157) exited]
[Thread 0x7ffc8acbb700 (LWP 95151) exited]
[Thread 0x7ffc8cabe700 (LWP 95148) exited]
[Thread 0x7ffc8b6bc700 (LWP 95150) exited]
[Thread 0x7ffc845e6700 (LWP 95142) exited]
[Thread 0x7ffc8dec0700 (LWP 95141) exited]
[Thread 0x7ffc831e4700 (LWP 95162) exited]
[Thread 0x7ffc83be5700 (LWP 95161) exited]
[Thread 0x7ffc8d4bf700 (LWP 95147) exited]
[Thread 0x7ffc81de2700 (LWP 95164) exited]
[Thread 0x7ffc884b7700 (LWP 95155) exited]
[Thread 0x7ffc8a2ba700 (LWP 95152) exited]
[Thread 0x7ffc8fcc3700 (LWP 95144) exited]
[Thread 0x7ffc827e3700 (LWP 95163) exited]
[Thread 0x7ffc88eb8700 (LWP 95154) exited]
[New Thread 0x7ffc88eb8700 (LWP 95170)]
[New Thread 0x7ffc827e3700 (LWP 95171)]
[New Thread 0x7ffc8fcc3700 (LWP 95172)]
[New Thread 0x7ffc913fb700 (LWP 95173)]
[New Thread 0x7ffc8f2c2700 (LWP 95174)]
[New Thread 0x7ffc8e8c1700 (LWP 95175)]
[New Thread 0x7ffc8dec0700 (LWP 95176)]
[New Thread 0x7ffc8d4bf700 (LWP 95177)]
[New Thread 0x7ffc8cabe700 (LWP 95178)]
[New Thread 0x7ffc8c0bd700 (LWP 95179)]
[New Thread 0x7ffc8b6bc700 (LWP 95180)]
[New Thread 0x7ffc8acbb700 (LWP 95181)]
[New Thread 0x7ffc8a2ba700 (LWP 95182)]
[New Thread 0x7ffc898b9700 (LWP 95183)]
[New Thread 0x7ffc884b7700 (LWP 95184)]
[New Thread 0x7ffc87ab6700 (LWP 95185)]
[New Thread 0x7ffc870b5700 (LWP 95186)]
[New Thread 0x7ffc866b4700 (LWP 95187)]
[New Thread 0x7ffc85cb3700 (LWP 95188)]
[New Thread 0x7ffc852b2700 (LWP 95189)]
[New Thread 0x7ffc848b1700 (LWP 95190)]
[New Thread 0x7ffc83eb0700 (LWP 95191)]
[New Thread 0x7ffc834af700 (LWP 95192)]
[New Thread 0x7ffc81de2700 (LWP 95193)]
[New Thread 0x7ffc813e1700 (LWP 95194)]
[New Thread 0x7ffc6bfff700 (LWP 95195)]
Global seed set to 42
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
Number of parameters in network: 29.7k
Detaching after vfork from child process 95196.
[New Thread 0x7ffc6a64d700 (LWP 95206)]
Detaching after vfork from child process 95207.
Global seed set to 42
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
Number of parameters in network: 29.7k
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/home/ec2-user/anaconda3/envs/python3/lib/python3.10/multiprocessing/spawn.py", line 116, in spawn_main
    exitcode = _main(fd, parent_sentinel)
  File "/home/ec2-user/anaconda3/envs/python3/lib/python3.10/multiprocessing/spawn.py", line 125, in _main
    prepare(preparation_data)
  File "/home/ec2-user/anaconda3/envs/python3/lib/python3.10/multiprocessing/spawn.py", line 236, in prepare
    _fixup_main_from_path(data['init_main_from_path'])
  File "/home/ec2-user/anaconda3/envs/python3/lib/python3.10/multiprocessing/spawn.py", line 287, in _fixup_main_from_path
    main_content = runpy.run_path(main_path,
  File "/home/ec2-user/anaconda3/envs/python3/lib/python3.10/runpy.py", line 289, in run_path
    return _run_module_code(code, init_globals, run_name,
  File "/home/ec2-user/anaconda3/envs/python3/lib/python3.10/runpy.py", line 96, in _run_module_code
    _run_code(code, mod_globals, init_globals,
  File "/home/ec2-user/anaconda3/envs/python3/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/home/ec2-user/SageMaker/gpu-test.py", line 139, in <module>
    trainer.fit(
  File "/home/ec2-user/anaconda3/envs/python3/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 608, in fit
    call._call_and_handle_interrupt(
  File "/home/ec2-user/anaconda3/envs/python3/lib/python3.10/site-packages/pytorch_lightning/trainer/call.py", line 36, in _call_and_handle_interrupt
    return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer, **kwargs)
  File "/home/ec2-user/anaconda3/envs/python3/lib/python3.10/site-packages/pytorch_lightning/strategies/launchers/multiprocessing.py", line 113, in launch
    mp.start_processes(
  File "/home/ec2-user/anaconda3/envs/python3/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 189, in start_processes
    process.start()
  File "/home/ec2-user/anaconda3/envs/python3/lib/python3.10/multiprocessing/process.py", line 121, in start
    self._popen = self._Popen(self)
  File "/home/ec2-user/anaconda3/envs/python3/lib/python3.10/multiprocessing/context.py", line 288, in _Popen
    return Popen(process_obj)
  File "/home/ec2-user/anaconda3/envs/python3/lib/python3.10/multiprocessing/popen_spawn_posix.py", line 32, in __init__
    super().__init__(process_obj)
  File "/home/ec2-user/anaconda3/envs/python3/lib/python3.10/multiprocessing/popen_fork.py", line 19, in __init__
    self._launch(process_obj)
  File "/home/ec2-user/anaconda3/envs/python3/lib/python3.10/multiprocessing/popen_spawn_posix.py", line 42, in _launch
    prep_data = spawn.get_preparation_data(process_obj._name)
  File "/home/ec2-user/anaconda3/envs/python3/lib/python3.10/multiprocessing/spawn.py", line 154, in get_preparation_data
    _check_not_importing_main()
  File "/home/ec2-user/anaconda3/envs/python3/lib/python3.10/multiprocessing/spawn.py", line 134, in _check_not_importing_main
    raise RuntimeError('''
RuntimeError: 
        An attempt has been made to start a new process before the
        current process has finished its bootstrapping phase.

        This probably means that you are not using fork to start your
        child processes and you have forgotten to use the proper idiom
        in the main module:

            if __name__ == '__main__':
                freeze_support()
                ...

        The "freeze_support()" line can be omitted if the program
        is not going to be frozen to produce an executable.

Unfortunately no stacktrace was created and I also don’t see any crashes in the run:

Global seed set to 42
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
Done.

I updated the backtrace (please refer to the same code block above); I hope it helps this time!

Your code now fails with a RuntimeError and does not show the segfault.
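
The RuntimeError in the traceback above is raised by Python's `multiprocessing` spawn start method. Each spawned child re-imports the main script, and since `trainer.fit(` is called at module level (`gpu-test.py`, line 139, `in <module>`), the child tries to launch processes again. Below is a minimal sketch of the guard the error message asks for, using only the standard library. The names `describe_rank`, `worker`, and `launch` are hypothetical stand-ins for the training code and are not Lightning APIs:

```python
import multiprocessing as mp


def describe_rank(rank: int, world_size: int) -> str:
    """Return a short label for a worker (hypothetical helper)."""
    return f"rank {rank} of {world_size}"


def worker(rank: int, world_size: int) -> None:
    # Stand-in for the per-GPU training function that a launcher would run.
    print(describe_rank(rank, world_size), "started")


def launch(world_size: int = 2) -> list:
    # "spawn" starts a fresh interpreter per child, which re-imports this file.
    ctx = mp.get_context("spawn")
    procs = [
        ctx.Process(target=worker, args=(rank, world_size))
        for rank in range(world_size)
    ]
    for proc in procs:
        proc.start()
    for proc in procs:
        proc.join()
    # Exit code 0 means the child finished normally.
    return [proc.exitcode for proc in procs]


# Without this guard, every child would call launch() again while it is
# still importing the module, which triggers the bootstrapping RuntimeError.
if __name__ == "__main__":
    print("exit codes:", launch(world_size=2))
```

In the Lightning script, the analogous change would presumably be to move the dataset, model, and `trainer.fit(...)` calls into a function that is only invoked under `if __name__ == "__main__":`. Whether the original SIGSEGV still appears after that change is a separate question that this sketch does not answer.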

The RuntimeError only happens when I try to use more than one GPU. With one GPU there is no error; see the results with one GPU:
`sh-4.2$ gdb --args /home/ec2-user/anaconda3/envs/python3/bin/python3 gpu-test.py
GNU gdb (GDB) Red Hat Enterprise Linux 8.0.1-36.amzn2.0.1
Copyright (C) 2017 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
<http://www.gnu.org/software/gdb/documentation/>.
For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from /home/ec2-user/anaconda3/envs/python3/bin/python3...done.
(gdb) run
Starting program: /home/ec2-user/anaconda3/envs/python3/bin/python3 gpu-test.py
Missing separate debuginfos, use: debuginfo-install glibc-2.26-62.amzn2.x86_64
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
[New Thread 0x7fffed823700 (LWP 21383)]
[New Thread 0x7fffece22700 (LWP 21384)]
[New Thread 0x7fffe4421700 (LWP 21385)]
[New Thread 0x7fffdba20700 (LWP 21386)]
[New Thread 0x7fffd301f700 (LWP 21387)]
[New Thread 0x7fffca61e700 (LWP 21388)]
[New Thread 0x7fffc1c1d700 (LWP 21389)]
[New Thread 0x7fffb921c700 (LWP 21390)]
[New Thread 0x7fffb081b700 (LWP 21391)]
[New Thread 0x7fff9fe1a700 (LWP 21392)]
[New Thread 0x7fff97419700 (LWP 21393)]
[New Thread 0x7fff8ea18700 (LWP 21394)]
[New Thread 0x7fff86017700 (LWP 21395)]
[New Thread 0x7fff85616700 (LWP 21396)]
[New Thread 0x7fff74c15700 (LWP 21397)]
[New Thread 0x7fff6c214700 (LWP 21398)]
[New Thread 0x7fff6b813700 (LWP 21399)]
[New Thread 0x7fff62e12700 (LWP 21400)]
[New Thread 0x7fff5a411700 (LWP 21401)]
[New Thread 0x7fff51a10700 (LWP 21402)]
[New Thread 0x7fff4900f700 (LWP 21403)]
[New Thread 0x7fff4060e700 (LWP 21404)]
[New Thread 0x7fff2fc0d700 (LWP 21405)]
[New Thread 0x7fff2720c700 (LWP 21406)]
[New Thread 0x7fff1e80b700 (LWP 21407)]
[New Thread 0x7fff1de0a700 (LWP 21408)]
[New Thread 0x7fff0d409700 (LWP 21409)]
[New Thread 0x7fff0ca08700 (LWP 21410)]
[New Thread 0x7ffefc007700 (LWP 21411)]
[New Thread 0x7ffef3606700 (LWP 21412)]
[New Thread 0x7ffeeac05700 (LWP 21413)]
[New Thread 0x7ffeea204700 (LWP 21414)]
[New Thread 0x7ffee1803700 (LWP 21415)]
[New Thread 0x7ffed8e02700 (LWP 21416)]
[New Thread 0x7ffec8401700 (LWP 21417)]
[New Thread 0x7ffebfa00700 (LWP 21418)]
[New Thread 0x7ffeb6fff700 (LWP 21419)]
[New Thread 0x7ffeae5fe700 (LWP 21420)]
[New Thread 0x7ffea5bfd700 (LWP 21421)]
[New Thread 0x7ffe9d1fc700 (LWP 21422)]
[New Thread 0x7ffe947fb700 (LWP 21424)]
[New Thread 0x7ffe8bdfa700 (LWP 21425)]
[New Thread 0x7ffe8b3f9700 (LWP 21427)]
[New Thread 0x7ffe7a9f8700 (LWP 21428)]
[New Thread 0x7ffe79ff7700 (LWP 21429)]
[New Thread 0x7ffe695f6700 (LWP 21430)]
[New Thread 0x7ffe60bf5700 (LWP 21431)]
[New Thread 0x7ffe601f4700 (LWP 21432)]
[New Thread 0x7ffe577f3700 (LWP 21433)]
[New Thread 0x7ffe46df2700 (LWP 21434)]
[New Thread 0x7ffe3e3f1700 (LWP 21436)]
[New Thread 0x7ffe359f0700 (LWP 21437)]
[New Thread 0x7ffe2cfef700 (LWP 21438)]
[New Thread 0x7ffe245ee700 (LWP 21439)]
[New Thread 0x7ffe23bed700 (LWP 21440)]
[New Thread 0x7ffe1b1ec700 (LWP 21441)]
[New Thread 0x7ffe127eb700 (LWP 21442)]
[New Thread 0x7ffe09dea700 (LWP 21443)]
[New Thread 0x7ffdf93e9700 (LWP 21444)]
[New Thread 0x7ffdf89e8700 (LWP 21445)]
[New Thread 0x7ffdeffe7700 (LWP 21446)]
[New Thread 0x7ffddf5e6700 (LWP 21447)]
[New Thread 0x7ffdd6be5700 (LWP 21448)]
[New Thread 0x7ffdc89ff700 (LWP 21470)]
[New Thread 0x7ffdc726b700 (LWP 21474)]
[New Thread 0x7ffdc686a700 (LWP 21475)]
[New Thread 0x7ffdc5e69700 (LWP 21476)]
[New Thread 0x7ffdc5468700 (LWP 21477)]
[New Thread 0x7ffdc4a67700 (LWP 21478)]
[New Thread 0x7ffdbffff700 (LWP 21479)]
[New Thread 0x7ffdbf5fe700 (LWP 21480)]
[New Thread 0x7ffdbebfd700 (LWP 21481)]
Missing separate debuginfo for /home/ec2-user/anaconda3/envs/python3/lib/python3.10/site-packages/torch/lib/libgomp-a34b3233.so.1
Try: yum --enablerepo='*debug*' install /usr/lib/debug/.build-id/5f/4fb88af97be3ecacc71363136bb015b2a07119.debug
Missing separate debuginfo for /lib64/libcuda.so.1
Try: yum --enablerepo='*debug*' install /usr/lib/debug/.build-id/d7/a1b5fea0f4f1f6375e4ed8a59d4afbd04ef289.debug
[New Thread 0x7ffd35aa6700 (LWP 21639)]
[New Thread 0x7ffd350a5700 (LWP 21640)]
[New Thread 0x7ffd2ffff700 (LWP 21641)]
[New Thread 0x7ffd2edff700 (LWP 21642)]
[New Thread 0x7ffd2e3fe700 (LWP 21643)]
[New Thread 0x7ffd2d9fd700 (LWP 21644)]
[New Thread 0x7ffd2cffc700 (LWP 21645)]
[New Thread 0x7ffd13fff700 (LWP 21646)]
[New Thread 0x7ffd135fe700 (LWP 21647)]
[New Thread 0x7ffd0bfff700 (LWP 21649)]
[New Thread 0x7ffd123ff700 (LWP 21648)]
[New Thread 0x7ffd119fe700 (LWP 21650)]
[New Thread 0x7ffd10ffd700 (LWP 21651)]
[New Thread 0x7ffd0b5fe700 (LWP 21652)]
[New Thread 0x7ffd0abfd700 (LWP 21653)]
[New Thread 0x7ffd0a1fc700 (LWP 21654)]
[New Thread 0x7ffd097fb700 (LWP 21655)]
[New Thread 0x7ffd08dfa700 (LWP 21656)]
[New Thread 0x7ffd03fff700 (LWP 21657)]
[New Thread 0x7ffd035fe700 (LWP 21658)]
[New Thread 0x7ffd02bfd700 (LWP 21659)]
[New Thread 0x7ffd021fc700 (LWP 21660)]
[New Thread 0x7ffd017fb700 (LWP 21661)]
[New Thread 0x7ffd00dfa700 (LWP 21662)]
[New Thread 0x7ffcc3fff700 (LWP 21663)]
[New Thread 0x7ffcbbfff700 (LWP 21664)]
[New Thread 0x7ffcc35fe700 (LWP 21665)]
[New Thread 0x7ffcc2bfd700 (LWP 21666)]
[New Thread 0x7ffcc21fc700 (LWP 21667)]
[New Thread 0x7ffcc17fb700 (LWP 21668)]
[New Thread 0x7ffcc0dfa700 (LWP 21669)]
[New Thread 0x7ffcbb5fe700 (LWP 21670)]
[New Thread 0x7ffcbabfd700 (LWP 21671)]
[New Thread 0x7ffc98dff700 (LWP 21672)]
[New Thread 0x7ffc93fff700 (LWP 21673)]
[New Thread 0x7ffc935fe700 (LWP 21674)]
[New Thread 0x7ffc92bfd700 (LWP 21675)]
[New Thread 0x7ffc921fc700 (LWP 21676)]
[New Thread 0x7ffc917fb700 (LWP 21677)]
[New Thread 0x7ffc90dfa700 (LWP 21678)]
[New Thread 0x7ffc903f9700 (LWP 21679)]
[New Thread 0x7ffc8f9f8700 (LWP 21680)]
[New Thread 0x7ffc8eff7700 (LWP 21681)]
[New Thread 0x7ffc8e5f6700 (LWP 21682)]
[New Thread 0x7ffc8dbf5700 (LWP 21683)]
[New Thread 0x7ffc8d1f4700 (LWP 21684)]
[New Thread 0x7ffc8c7f3700 (LWP 21685)]
[New Thread 0x7ffc8bdf2700 (LWP 21686)]
[New Thread 0x7ffc8b3f1700 (LWP 21687)]
[New Thread 0x7ffc8a9f0700 (LWP 21688)]
[New Thread 0x7ffc89fef700 (LWP 21689)]
[New Thread 0x7ffc895ee700 (LWP 21690)]
[New Thread 0x7ffc88bed700 (LWP 21691)]
[New Thread 0x7ffc881ec700 (LWP 21692)]
[New Thread 0x7ffc877eb700 (LWP 21693)]
[New Thread 0x7ffc86dea700 (LWP 21694)]
[New Thread 0x7ffc863e9700 (LWP 21695)]
[New Thread 0x7ffc859e8700 (LWP 21696)]
[New Thread 0x7ffc84fe7700 (LWP 21697)]
[New Thread 0x7ffc845e6700 (LWP 21698)]
[New Thread 0x7ffc83be5700 (LWP 21699)]
[New Thread 0x7ffc831e4700 (LWP 21700)]
[New Thread 0x7ffc827e3700 (LWP 21701)]
[New Thread 0x7ffc81de2700 (LWP 21702)]
[Thread 0x7ffc845e6700 (LWP 21698) exited]
[Thread 0x7ffc8bdf2700 (LWP 21686) exited]
[Thread 0x7ffc8dbf5700 (LWP 21683) exited]
[Thread 0x7ffc827e3700 (LWP 21701) exited]
[Thread 0x7ffc877eb700 (LWP 21693) exited]
[Thread 0x7ffc8e5f6700 (LWP 21682) exited]
[Thread 0x7ffc90dfa700 (LWP 21678) exited]
[Thread 0x7ffc921fc700 (LWP 21676) exited]
[Thread 0x7ffc895ee700 (LWP 21690) exited]
[Thread 0x7ffc831e4700 (LWP 21700) exited]
[Thread 0x7ffc81de2700 (LWP 21702) exited]
[Thread 0x7ffc8c7f3700 (LWP 21685) exited]
[Thread 0x7ffc917fb700 (LWP 21677) exited]
[Thread 0x7ffc8b3f1700 (LWP 21687) exited]
[Thread 0x7ffc88bed700 (LWP 21691) exited]
[Thread 0x7ffc903f9700 (LWP 21679) exited]
[Thread 0x7ffc84fe7700 (LWP 21697) exited]
[Thread 0x7ffc89fef700 (LWP 21689) exited]
[Thread 0x7ffc83be5700 (LWP 21699) exited]
[Thread 0x7ffc881ec700 (LWP 21692) exited]
[Thread 0x7ffc8eff7700 (LWP 21681) exited]
[Thread 0x7ffc8d1f4700 (LWP 21684) exited]
[Thread 0x7ffc863e9700 (LWP 21695) exited]
[Thread 0x7ffc86dea700 (LWP 21694) exited]
[Thread 0x7ffc859e8700 (LWP 21696) exited]
[Thread 0x7ffc8a9f0700 (LWP 21688) exited]
[Thread 0x7ffc8f9f8700 (LWP 21680) exited]
[New Thread 0x7ffc8f9f8700 (LWP 21703)]
[New Thread 0x7ffc8a9f0700 (LWP 21704)]
[New Thread 0x7ffc859e8700 (LWP 21705)]
[New Thread 0x7ffc91e61700 (LWP 21706)]
[New Thread 0x7ffc91460700 (LWP 21707)]
[New Thread 0x7ffc90a5f700 (LWP 21708)]
[New Thread 0x7ffc8eff7700 (LWP 21709)]
[New Thread 0x7ffc8e5f6700 (LWP 21710)]
[New Thread 0x7ffc8dbf5700 (LWP 21711)]
[New Thread 0x7ffc8d1f4700 (LWP 21712)]
[New Thread 0x7ffc8c7f3700 (LWP 21713)]
[New Thread 0x7ffc8bdf2700 (LWP 21714)]
[New Thread 0x7ffc8b3f1700 (LWP 21715)]
[New Thread 0x7ffc89fef700 (LWP 21716)]
[New Thread 0x7ffc895ee700 (LWP 21717)]
[New Thread 0x7ffc88bed700 (LWP 21718)]
[New Thread 0x7ffc881ec700 (LWP 21719)]
[New Thread 0x7ffc877eb700 (LWP 21720)]
[New Thread 0x7ffc86dea700 (LWP 21721)]
[New Thread 0x7ffc863e9700 (LWP 21722)]
[New Thread 0x7ffc84fe7700 (LWP 21723)]
[New Thread 0x7ffc845e6700 (LWP 21724)]
[New Thread 0x7ffc83be5700 (LWP 21725)]
[New Thread 0x7ffc831e4700 (LWP 21726)]
[New Thread 0x7ffc827e3700 (LWP 21727)]
[New Thread 0x7ffc81de2700 (LWP 21728)]
[New Thread 0x7ffc813e1700 (LWP 21729)]
[Thread 0x7ffc813e1700 (LWP 21729) exited]
[Thread 0x7ffc81de2700 (LWP 21728) exited]
[Thread 0x7ffc827e3700 (LWP 21727) exited]
[Thread 0x7ffc863e9700 (LWP 21722) exited]
[Thread 0x7ffc8bdf2700 (LWP 21714) exited]
[Thread 0x7ffc8d1f4700 (LWP 21712) exited]
[Thread 0x7ffc8e5f6700 (LWP 21710) exited]
[Thread 0x7ffc8dbf5700 (LWP 21711) exited]
[Thread 0x7ffc90a5f700 (LWP 21708) exited]
[Thread 0x7ffc8c7f3700 (LWP 21713) exited]
[Thread 0x7ffc8f9f8700 (LWP 21703) exited]
[Thread 0x7ffc8b3f1700 (LWP 21715) exited]
[Thread 0x7ffc91460700 (LWP 21707) exited]
[Thread 0x7ffc895ee700 (LWP 21717) exited]
[Thread 0x7ffc859e8700 (LWP 21705) exited]
[Thread 0x7ffc84fe7700 (LWP 21723) exited]
[Thread 0x7ffc91e61700 (LWP 21706) exited]
[Thread 0x7ffc89fef700 (LWP 21716) exited]
[Thread 0x7ffc8eff7700 (LWP 21709) exited]
[Thread 0x7ffc88bed700 (LWP 21718) exited]
[Thread 0x7ffc83be5700 (LWP 21725) exited]
[Thread 0x7ffc845e6700 (LWP 21724) exited]
[Thread 0x7ffc86dea700 (LWP 21721) exited]
[Thread 0x7ffc881ec700 (LWP 21719) exited]
[Thread 0x7ffc8a9f0700 (LWP 21704) exited]
[Thread 0x7ffc831e4700 (LWP 21726) exited]
[Thread 0x7ffc877eb700 (LWP 21720) exited]

[New Thread 0x7ffc877eb700 (LWP 21742)]
[New Thread 0x7ffc831e4700 (LWP 21743)]
[New Thread 0x7ffc8a9f0700 (LWP 21744)]
[New Thread 0x7ffc91efc700 (LWP 21745)]
[New Thread 0x7ffc914fb700 (LWP 21746)]
[New Thread 0x7ffc90afa700 (LWP 21747)]
[New Thread 0x7ffc900f9700 (LWP 21748)]
[New Thread 0x7ffc8f6f8700 (LWP 21749)]
[New Thread 0x7ffc8ecf7700 (LWP 21750)]
[New Thread 0x7ffc8e2f6700 (LWP 21751)]
[New Thread 0x7ffc8d8f5700 (LWP 21752)]
[New Thread 0x7ffc8cef4700 (LWP 21753)]
[New Thread 0x7ffc8c4f3700 (LWP 21754)]
[New Thread 0x7ffc8baf2700 (LWP 21755)]
[New Thread 0x7ffc89fef700 (LWP 21756)]
[New Thread 0x7ffc895ee700 (LWP 21757)]
[New Thread 0x7ffc88bed700 (LWP 21758)]
[New Thread 0x7ffc881ec700 (LWP 21759)]
[New Thread 0x7ffc86dea700 (LWP 21760)]
[New Thread 0x7ffc863e9700 (LWP 21761)]
[New Thread 0x7ffc859e8700 (LWP 21762)]
[New Thread 0x7ffc84fe7700 (LWP 21763)]
[New Thread 0x7ffc845e6700 (LWP 21764)]
[New Thread 0x7ffc83be5700 (LWP 21765)]
[New Thread 0x7ffc827e3700 (LWP 21766)]
[New Thread 0x7ffc81de2700 (LWP 21767)]
[New Thread 0x7ffc813e1700 (LWP 21768)]
[Thread 0x7ffc81de2700 (LWP 21767) exited]
[Thread 0x7ffc845e6700 (LWP 21764) exited]
[Thread 0x7ffc881ec700 (LWP 21759) exited]
[Thread 0x7ffc895ee700 (LWP 21757) exited]
[Thread 0x7ffc8baf2700 (LWP 21755) exited]
[Thread 0x7ffc88bed700 (LWP 21758) exited]
[Thread 0x7ffc8ecf7700 (LWP 21750) exited]
[Thread 0x7ffc900f9700 (LWP 21748) exited]
[Thread 0x7ffc863e9700 (LWP 21761) exited]
[Thread 0x7ffc831e4700 (LWP 21743) exited]
[Thread 0x7ffc827e3700 (LWP 21766) exited]
[Thread 0x7ffc84fe7700 (LWP 21763) exited]
[Thread 0x7ffc91efc700 (LWP 21745) exited]
[Thread 0x7ffc8c4f3700 (LWP 21754) exited]
[Thread 0x7ffc8d8f5700 (LWP 21752) exited]
[Thread 0x7ffc813e1700 (LWP 21768) exited]
[Thread 0x7ffc859e8700 (LWP 21762) exited]
[Thread 0x7ffc914fb700 (LWP 21746) exited]
[Thread 0x7ffc8e2f6700 (LWP 21751) exited]
[Thread 0x7ffc83be5700 (LWP 21765) exited]
[Thread 0x7ffc86dea700 (LWP 21760) exited]
[Thread 0x7ffc8f6f8700 (LWP 21749) exited]
[Thread 0x7ffc8a9f0700 (LWP 21744) exited]
[Thread 0x7ffc8cef4700 (LWP 21753) exited]
[Thread 0x7ffc89fef700 (LWP 21756) exited]
[Thread 0x7ffc90afa700 (LWP 21747) exited]
[New Thread 0x7ffc90afa700 (LWP 21769)]
[New Thread 0x7ffc89fef700 (LWP 21770)]
[New Thread 0x7ffc8cef4700 (LWP 21771)]
[New Thread 0x7ffc91efc700 (LWP 21772)]
[New Thread 0x7ffc914fb700 (LWP 21773)]
[New Thread 0x7ffc900f9700 (LWP 21774)]
[New Thread 0x7ffc8f6f8700 (LWP 21775)]
[New Thread 0x7ffc8ecf7700 (LWP 21776)]
[New Thread 0x7ffc8e2f6700 (LWP 21777)]
[New Thread 0x7ffc8d8f5700 (LWP 21778)]
[New Thread 0x7ffc8c4f3700 (LWP 21779)]
[New Thread 0x7ffc8baf2700 (LWP 21780)]
[New Thread 0x7ffc8aff1700 (LWP 21781)]
[New Thread 0x7ffc895ee700 (LWP 21782)]
[New Thread 0x7ffc88bed700 (LWP 21783)]
[New Thread 0x7ffc881ec700 (LWP 21784)]
[New Thread 0x7ffc86dea700 (LWP 21785)]
[New Thread 0x7ffc863e9700 (LWP 21786)]
[New Thread 0x7ffc859e8700 (LWP 21787)]
[New Thread 0x7ffc84fe7700 (LWP 21788)]
[New Thread 0x7ffc845e6700 (LWP 21789)]
[New Thread 0x7ffc83be5700 (LWP 21790)]
[New Thread 0x7ffc831e4700 (LWP 21791)]
[New Thread 0x7ffc827e3700 (LWP 21792)]
[New Thread 0x7ffc81de2700 (LWP 21793)]
[New Thread 0x7ffc813e1700 (LWP 21794)]
Global seed set to 42
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
Number of parameters in network: 29.7k
[New Thread 0x7ffbd304e700 (LWP 21807)]
[New Thread 0x7ffbd264d700 (LWP 21808)]
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7]

| Name | Type | Params

0 | loss | QuantileLoss | 0
1 | logging_metrics | ModuleList | 0
2 | input_embeddings | MultiEmbedding | 1.3 K
3 | prescalers | ModuleDict | 256
4 | static_variable_selection | VariableSelectionNetwork | 3.4 K
5 | encoder_variable_selection | VariableSelectionNetwork | 8.0 K
6 | decoder_variable_selection | VariableSelectionNetwork | 2.7 K
7 | static_context_variable_selection | GatedResidualNetwork | 1.1 K
8 | static_context_initial_hidden_lstm | GatedResidualNetwork | 1.1 K
9 | static_context_initial_cell_lstm | GatedResidualNetwork | 1.1 K
10 | static_context_enrichment | GatedResidualNetwork | 1.1 K
11 | lstm_encoder | LSTM | 2.2 K
12 | lstm_decoder | LSTM | 2.2 K
13 | post_lstm_gate_encoder | GatedLinearUnit | 544
14 | post_lstm_add_norm_encoder | AddNorm | 32
15 | static_enrichment | GatedResidualNetwork | 1.4 K
16 | multihead_attn | InterpretableMultiHeadAttention | 1.1 K
17 | post_attn_gate_norm | GateAddNorm | 576
18 | pos_wise_ff | GatedResidualNetwork | 1.1 K
19 | pre_output_gate_norm | GateAddNorm | 576
20 | output_layer | Linear | 119

29.7 K Trainable params
0 Non-trainable params
29.7 K Total params
0.119 Total estimated model params size (MB)
[New Thread 0x7ffbd1b4c700 (LWP 21845)]
[New Thread 0x7ffbd114b700 (LWP 21846)]
Epoch 0: 0%| | 0/31 [00:00<?, ?it/s][New Thread 0x7ffbb1fff700 (LWP 21871)]
[New Thread 0x7ffbb15fe700 (LWP 21872)]
[New Thread 0x7ffbb0bfd700 (LWP 21873)]
[New Thread 0x7ffba9fff700 (LWP 21874)]
[New Thread 0x7ffba95fe700 (LWP 21875)]
[New Thread 0x7ffba8bfd700 (LWP 21876)]
[New Thread 0x7ffba1fff700 (LWP 21877)]
[New Thread 0x7ffba15fe700 (LWP 21878)]
Epoch 4: 100%|██████████████████████████████████████████████████████████████████████████████████████████| 31/31 [00:09<00:00, 3.44it/s, loss=132, v_num=1, train_loss_step=154.0, val_loss=192.0, train_loss_epoch=136.0]Trainer.fit stopped: max_epochs=5 reached.
Epoch 4: 100%|██████████████████████████████████████████████████████████████████████████████████████████| 31/31 [00:09<00:00, 3.38it/s, loss=132, v_num=1, train_loss_step=154.0, val_loss=192.0, train_loss_epoch=136.0]
[Thread 0x7ffbd1b4c700 (LWP 21845) exited]
[New Thread 0x7ffbd1b4c700 (LWP 22435)]
[Thread 0x7ffc84fe7700 (LWP 21788) exited]
[Thread 0x7ffc859e8700 (LWP 21787) exited]
[Thread 0x7ffc881ec700 (LWP 21784) exited]
[Thread 0x7ffc8baf2700 (LWP 21780) exited]
[Thread 0x7ffc8f6f8700 (LWP 21775) exited]
[Thread 0x7ffc8ecf7700 (LWP 21776) exited]
[Thread 0x7ffc8c4f3700 (LWP 21779) exited]
[Thread 0x7ffc88bed700 (LWP 21783) exited]
[Thread 0x7ffc863e9700 (LWP 21786) exited]
[Thread 0x7ffc900f9700 (LWP 21774) exited]
[Thread 0x7ffc83be5700 (LWP 21790) exited]
[Thread 0x7ffc813e1700 (LWP 21794) exited]
[Thread 0x7ffc81de2700 (LWP 21793) exited]
[Thread 0x7ffc831e4700 (LWP 21791) exited]
[Thread 0x7ffc827e3700 (LWP 21792) exited]
[Thread 0x7ffc8e2f6700 (LWP 21777) exited]
[Thread 0x7ffc86dea700 (LWP 21785) exited]
[Thread 0x7ffc845e6700 (LWP 21789) exited]
[Thread 0x7ffc895ee700 (LWP 21782) exited]
[Thread 0x7ffc914fb700 (LWP 21773) exited]
[Thread 0x7ffc8d8f5700 (LWP 21778) exited]
[Thread 0x7ffc8aff1700 (LWP 21781) exited]
[New Thread 0x7ffc8aff1700 (LWP 22436)]
[New Thread 0x7ffc8d8f5700 (LWP 22437)]
[New Thread 0x7ffc914fb700 (LWP 22438)]
[New Thread 0x7ffc900f9700 (LWP 22439)]
[New Thread 0x7ffc8f6f8700 (LWP 22440)]
[New Thread 0x7ffc8ecf7700 (LWP 22441)]
[New Thread 0x7ffc8e2f6700 (LWP 22442)]
[New Thread 0x7ffc8c4f3700 (LWP 22443)]
[New Thread 0x7ffc8baf2700 (LWP 22444)]
[New Thread 0x7ffc895ee700 (LWP 22445)]
[New Thread 0x7ffc88bed700 (LWP 22446)]
[New Thread 0x7ffc881ec700 (LWP 22447)]
[New Thread 0x7ffc86dea700 (LWP 22448)]
[New Thread 0x7ffc863e9700 (LWP 22449)]
[New Thread 0x7ffc859e8700 (LWP 22450)]
[New Thread 0x7ffc84fe7700 (LWP 22451)]
[New Thread 0x7ffc845e6700 (LWP 22452)]
[New Thread 0x7ffc83be5700 (LWP 22453)]
[New Thread 0x7ffc831e4700 (LWP 22454)]
[New Thread 0x7ffc827e3700 (LWP 22455)]
[New Thread 0x7ffc81de2700 (LWP 22456)]
[New Thread 0x7ffc813e1700 (LWP 22457)]
[Thread 0x7ffc827e3700 (LWP 22455) exited]
[Thread 0x7ffc83be5700 (LWP 22453) exited]
[Thread 0x7ffc8baf2700 (LWP 22444) exited]
[Thread 0x7ffc8c4f3700 (LWP 22443) exited]
[Thread 0x7ffc831e4700 (LWP 22454) exited]
[Thread 0x7ffc859e8700 (LWP 22450) exited]
[Thread 0x7ffc845e6700 (LWP 22452) exited]
[Thread 0x7ffc84fe7700 (LWP 22451) exited]
[Thread 0x7ffc895ee700 (LWP 22445) exited]
[Thread 0x7ffc813e1700 (LWP 22457) exited]
[Thread 0x7ffc8aff1700 (LWP 22436) exited]
[Thread 0x7ffc863e9700 (LWP 22449) exited]
[Thread 0x7ffc86dea700 (LWP 22448) exited]
[Thread 0x7ffc900f9700 (LWP 22439) exited]
[Thread 0x7ffc881ec700 (LWP 22447) exited]
[Thread 0x7ffc8d8f5700 (LWP 22437) exited]
[Thread 0x7ffc914fb700 (LWP 22438) exited]
[Thread 0x7ffc8f6f8700 (LWP 22440) exited]
[Thread 0x7ffc81de2700 (LWP 22456) exited]
[Thread 0x7ffc88bed700 (LWP 22446) exited]
[Thread 0x7ffc8ecf7700 (LWP 22441) exited]
[New Thread 0x7ffb98d39700 (LWP 22458)]
[Thread 0x7ffc8e2f6700 (LWP 22442) exited]
[New Thread 0x7ffc8e2f6700 (LWP 22459)]
[New Thread 0x7ffc8ecf7700 (LWP 22460)]
[New Thread 0x7ffc88bed700 (LWP 22461)]
[New Thread 0x7ffc914fb700 (LWP 22462)]
[New Thread 0x7ffc900f9700 (LWP 22463)]
[New Thread 0x7ffc8f6f8700 (LWP 22464)]
[New Thread 0x7ffc8d8f5700 (LWP 22465)]
[New Thread 0x7ffc8c4f3700 (LWP 22466)]
[New Thread 0x7ffc8baf2700 (LWP 22467)]
[New Thread 0x7ffc8aff1700 (LWP 22468)]
[New Thread 0x7ffc895ee700 (LWP 22469)]
[New Thread 0x7ffc881ec700 (LWP 22470)]
[New Thread 0x7ffc86dea700 (LWP 22471)]
[New Thread 0x7ffc863e9700 (LWP 22472)]
[New Thread 0x7ffc859e8700 (LWP 22473)]
[New Thread 0x7ffc84fe7700 (LWP 22474)]
[New Thread 0x7ffc845e6700 (LWP 22475)]
[New Thread 0x7ffc83be5700 (LWP 22476)]
[New Thread 0x7ffc831e4700 (LWP 22477)]
[New Thread 0x7ffc827e3700 (LWP 22478)]
[New Thread 0x7ffc81de2700 (LWP 22479)]
[Thread 0x7ffc827e3700 (LWP 22478) exited]
[Thread 0x7ffc845e6700 (LWP 22475) exited]
[Thread 0x7ffc84fe7700 (LWP 22474) exited]
[Thread 0x7ffc8c4f3700 (LWP 22466) exited]
[Thread 0x7ffc8f6f8700 (LWP 22464) exited]
[Thread 0x7ffc81de2700 (LWP 22479) exited]
[Thread 0x7ffc8d8f5700 (LWP 22465) exited]
[Thread 0x7ffc900f9700 (LWP 22463) exited]
[Thread 0x7ffc8baf2700 (LWP 22467) exited]
[Thread 0x7ffc881ec700 (LWP 22470) exited]
[Thread 0x7ffc831e4700 (LWP 22477) exited]
[Thread 0x7ffc83be5700 (LWP 22476) exited]
[Thread 0x7ffc8aff1700 (LWP 22468) exited]
[Thread 0x7ffc863e9700 (LWP 22472) exited]
[Thread 0x7ffc86dea700 (LWP 22471) exited]
[Thread 0x7ffc8ecf7700 (LWP 22460) exited]
[Thread 0x7ffc895ee700 (LWP 22469) exited]
[Thread 0x7ffc8e2f6700 (LWP 22459) exited]
[Thread 0x7ffc914fb700 (LWP 22462) exited]
[Thread 0x7ffb98d39700 (LWP 22458) exited]
[New Thread 0x7ffc859e8700 (LWP 22480)]
[Thread 0x7ffc859e8700 (LWP 22473) exited]
[Thread 0x7ffc88bed700 (LWP 22461) exited]
[New Thread 0x7ffc88bed700 (LWP 22481)]
[New Thread 0x7ffb98d39700 (LWP 22482)]
[New Thread 0x7ffc914fb700 (LWP 22483)]
[New Thread 0x7ffc900f9700 (LWP 22484)]
[New Thread 0x7ffc8f6f8700 (LWP 22485)]
[New Thread 0x7ffc8ecf7700 (LWP 22486)]
[New Thread 0x7ffc8e2f6700 (LWP 22487)]
[New Thread 0x7ffc8d8f5700 (LWP 22488)]
[New Thread 0x7ffc8c4f3700 (LWP 22489)]
[New Thread 0x7ffc8baf2700 (LWP 22490)]
[New Thread 0x7ffc8aff1700 (LWP 22491)]
[New Thread 0x7ffc895ee700 (LWP 22492)]
[New Thread 0x7ffc881ec700 (LWP 22493)]
[New Thread 0x7ffc86dea700 (LWP 22494)]
[New Thread 0x7ffc863e9700 (LWP 22495)]
[New Thread 0x7ffc84fe7700 (LWP 22496)]
[New Thread 0x7ffc845e6700 (LWP 22497)]
[New Thread 0x7ffc83be5700 (LWP 22498)]
[New Thread 0x7ffc831e4700 (LWP 22499)]
[New Thread 0x7ffc827e3700 (LWP 22500)]
[New Thread 0x7ffc81de2700 (LWP 22501)]
[Thread 0x7ffc81de2700 (LWP 22501) exited]
[Thread 0x7ffc8c4f3700 (LWP 22489) exited]
[Thread 0x7ffc8d8f5700 (LWP 22488) exited]
[Thread 0x7ffc831e4700 (LWP 22499) exited]
[Thread 0x7ffc86dea700 (LWP 22494) exited]
[Thread 0x7ffb98d39700 (LWP 22482) exited]
[Thread 0x7ffc8baf2700 (LWP 22490) exited]
[Thread 0x7ffc863e9700 (LWP 22495) exited]
[Thread 0x7ffc88bed700 (LWP 22481) exited]
[Thread 0x7ffc859e8700 (LWP 22480) exited]
[Thread 0x7ffc900f9700 (LWP 22484) exited]
[Thread 0x7ffc83be5700 (LWP 22498) exited]
[Thread 0x7ffc895ee700 (LWP 22492) exited]
[Thread 0x7ffc881ec700 (LWP 22493) exited]
[Thread 0x7ffc8aff1700 (LWP 22491) exited]
[Thread 0x7ffc914fb700 (LWP 22483) exited]
[Thread 0x7ffc84fe7700 (LWP 22496) exited]
[Thread 0x7ffc8ecf7700 (LWP 22486) exited]
[Thread 0x7ffc8e2f6700 (LWP 22487) exited]
[Thread 0x7ffc827e3700 (LWP 22500) exited]
[Thread 0x7ffc845e6700 (LWP 22497) exited]
[Thread 0x7ffc8f6f8700 (LWP 22485) exited]
[New Thread 0x7ffc8c4f3700 (LWP 22502)]
[New Thread 0x7ffc8f6f8700 (LWP 22503)]
[New Thread 0x7ffc845e6700 (LWP 22504)]
[New Thread 0x7ffc827e3700 (LWP 22505)]
[New Thread 0x7ffc914fb700 (LWP 22506)]
[New Thread 0x7ffc900f9700 (LWP 22507)]
[New Thread 0x7ffc8ecf7700 (LWP 22508)]
[New Thread 0x7ffc8e2f6700 (LWP 22509)]
[New Thread 0x7ffc8d8f5700 (LWP 22510)]
[New Thread 0x7ffc8baf2700 (LWP 22511)]
[New Thread 0x7ffc8aff1700 (LWP 22512)]
[New Thread 0x7ffc895ee700 (LWP 22513)]
[New Thread 0x7ffc88bed700 (LWP 22514)]
[New Thread 0x7ffc881ec700 (LWP 22515)]
[New Thread 0x7ffc86dea700 (LWP 22516)]
[New Thread 0x7ffc863e9700 (LWP 22517)]
[New Thread 0x7ffc859e8700 (LWP 22518)]
[New Thread 0x7ffc84fe7700 (LWP 22519)]
[New Thread 0x7ffc83be5700 (LWP 22520)]
[New Thread 0x7ffc831e4700 (LWP 22521)]
[New Thread 0x7ffc81de2700 (LWP 22522)]
[New Thread 0x7ffc813e1700 (LWP 22523)]
[Thread 0x7ffc813e1700 (LWP 22523) exited]
[Thread 0x7ffc83be5700 (LWP 22520) exited]
[Thread 0x7ffc8d8f5700 (LWP 22510) exited]
[Thread 0x7ffc900f9700 (LWP 22507) exited]
[Thread 0x7ffc827e3700 (LWP 22505) exited]
[Thread 0x7ffc845e6700 (LWP 22504) exited]
[Thread 0x7ffc859e8700 (LWP 22518) exited]
[Thread 0x7ffc86dea700 (LWP 22516) exited]
[Thread 0x7ffc8c4f3700 (LWP 22502) exited]
[Thread 0x7ffc881ec700 (LWP 22515) exited]
[Thread 0x7ffc8baf2700 (LWP 22511) exited]
[Thread 0x7ffc8ecf7700 (LWP 22508) exited]
[Thread 0x7ffc81de2700 (LWP 22522) exited]`

Did you find a solution to this error? I have exactly the same problem using PyTorch Lightning and pytorch-forecasting: I get the same segmentation fault, and training also works with a single GPU.

Hello! The error I asked about 2 days ago has been solved, but I've run into a new error that is similar to the one on this page, and I have no idea how to solve it. Here is the code:

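# Imports presumably used by this snippet (not shown in the original post)
import random

import torch
import torch.multiprocessing as mp
from torch.distributed import destroy_process_group


def pre_train_and_fine_tune(rank: int, world_size: int, hyperparams: dict, times: mp.Queue):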
    ddp_setup(rank, world_size)
    training_mode = "pre_train"
    configs = modify_the_configuration(training_mode, hyperparams)
    train_or_fine_tune_or_test(rank, training_mode, configs)

    training_mode = "fine_tune"
    configs = modify_the_configuration(training_mode, hyperparams)
    ts = train_or_fine_tune_or_test(rank, training_mode, configs)

    times.put(ts)  # This line of code is where the error is reported

    destroy_process_group()


if __name__ == '__main__':
    # adjust_parameters_in_a_given_range()
    best_acc = float(77.0)

    while True:
        # MODIFIED
        remove_previous_model_parameters()

        hyperparams = dict(
            lr = rand_float(0.001, 0.003),
            batch_size = random.randrange(256, 512+1, step=64),
            target_batch_size = random.randrange(32, 64, step=8),
            temperature = rand_float(0.01, 0.02),
            weight_decay = rand_float(0.00020, 0.0003),
            num_epoch = random.randrange(60, 70, 2),
            lam = rand_float(0.740, 0.764),
            kernel_size = random.choice([3, 4, 5, 6, 7, 8]),
        )

        world_size = torch.cuda.device_count()

        times = mp.Queue(maxsize=4)

        mp.spawn(pre_train_and_fine_tune, args=(world_size, hyperparams, times), nprocs=world_size)

        training_mode = "test"

And here is my traceback:

Traceback (most recent call last):
  File "/home/…/.conda/envs/lrw/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/…/.conda/envs/lrw/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/home/…/bundled/libs/debugpy/adapter/…/…/debugpy/launcher/…/…/debugpy/__main__.py", line 39, in <module>
    cli.main()
  File "/home/…/bundled/libs/debugpy/adapter/…/…/debugpy/launcher/…/…/debugpy/…/debugpy/server/cli.py", line 430, in main
    run()
  File "/home/…/bundled/libs/debugpy/adapter/…/…/debugpy/launcher/…/…/debugpy/…/debugpy/server/cli.py", line 284, in run_file
    runpy.run_path(target, run_name="__main__")
  File "/home/…/bundled/libs/debugpy/_vendored/pydevd/_pydevd_bundle/pydevd_runpy.py", line 321, in run_path
    return _run_module_code(code, init_globals, run_name,
  File "/home/…/bundled/libs/debugpy/_vendored/pydevd/_pydevd_bundle/pydevd_runpy.py", line 135, in _run_module_code
    _run_code(code, mod_globals, init_globals,
  File "/home/…/bundled/libs/debugpy/_vendored/pydevd/_pydevd_bundle/pydevd_runpy.py", line 124, in _run_code
    exec(code, run_globals)
  File "/home/…/TFC-pretraining-main/code/TFC/main_modified.py", line 573, in <module>
    mp.spawn(pre_train_and_fine_tune, args=(world_size, hyperparams, times), nprocs=world_size)
  File "/home/…/.conda/envs/lrw/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 246, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method="spawn")
  File "/home/…/.conda/envs/lrw/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 202, in start_processes
    while not context.join():
  File "/home/…/.conda/envs/lrw/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 145, in join
    raise ProcessExitedException(
torch.multiprocessing.spawn.ProcessExitedException: process 0 terminated with signal SIGSEGV

Any help would be appreciated. Thanks a lot!

Try to grab a stacktrace of the segfault using my instructions to narrow down the issue further.
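
If attaching gdb to the spawned workers is awkward, a complementary option is Python's standard `faulthandler` module, which dumps the Python-level traceback of a process when it receives SIGSEGV. The sketch below is only an illustration: `enable_crash_traceback`, the log file name, and the `log_dir` argument are hypothetical names, not part of PyTorch or of the code posted above. It shows where the Python code was when the crash happened, not the native frames, so it does not replace the gdb backtrace.

```python
import faulthandler
import os

# faulthandler keeps only the file descriptor, so the file objects are
# stored here to stop them from being garbage-collected and closed.
_open_logs = []


def enable_crash_traceback(rank: int, log_dir: str) -> str:
    """Hypothetical helper: write a Python traceback to a per-rank file on SIGSEGV."""
    path = os.path.join(log_dir, f"segfault_rank{rank}.log")
    log_file = open(path, "w")
    _open_logs.append(log_file)
    # all_threads=True dumps every Python thread of the process, not only the crashing one.
    faulthandler.enable(file=log_file, all_threads=True)
    return path
```

Assuming a worker function like the one posted above, the helper would be called as the first statement of the function passed to `mp.spawn`, e.g. `enable_crash_traceback(rank, "/tmp")`. Without touching the code, running the script as `python -X faulthandler script.py` or with the environment variable `PYTHONFAULTHANDLER=1` should have a similar effect, writing the traceback to stderr instead.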