My laptop automatically shuts without warning (sometimes) during training.
Machine config - Lenovo y540, RTX 2060, Ubuntu 18.04, PyTorch 1.4. I tried training a simple binary image classification model (4 conv layers with batchnorm and Dropout). The model trained for 20 epochs (batch size = 8) and then my laptop shut down.
Output of nvidia-smi
:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 430.50 Driver Version: 430.50 CUDA Version: 10.1 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce RTX 2060 Off | 00000000:01:00.0 Off | N/A |
| N/A 47C P8 3W / N/A | 10MiB / 5934MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
Following is the log file before the system crashed I think. I found it in - cat /var/log/kern.log
.
Mar 10 17:05:01 maverick kernel: [ 9.279289] audit: type=1400 audit(1583840101.525:10): apparmor="STATUS" operation="profile_load" profile="unconfined" name="/usr/lib/snapd/snap-confine//mount-namespace-capture-helper" pid=837 comm="apparmor_parser"
Mar 10 17:05:01 maverick kernel: [ 9.280042] audit: type=1400 audit(1583840101.529:11): apparmor="STATUS" operation="profile_load" profile="unconfined" name="/sbin/dhclient" pid=828 comm="apparmor_parser"
Mar 10 17:05:01 maverick kernel: [ 9.325087] intel_rapl_common: Found RAPL domain package
Mar 10 17:05:01 maverick kernel: [ 9.325092] intel_rapl_common: Found RAPL domain core
Mar 10 17:05:01 maverick kernel: [ 9.325096] intel_rapl_common: Found RAPL domain uncore
Mar 10 17:05:01 maverick kernel: [ 9.325100] intel_rapl_common: Found RAPL domain dram
Mar 10 17:05:01 maverick kernel: [ 9.355748] input: HDA Intel PCH Mic as /devices/pci0000:00/0000:00:1f.3/sound/card0/input13
Mar 10 17:05:01 maverick kernel: [ 9.355987] input: HDA Intel PCH Headphone as /devices/pci0000:00/0000:00:1f.3/sound/card0/input14
Mar 10 17:05:01 maverick kernel: [ 9.356199] input: HDA Intel PCH HDMI/DP,pcm=3 as /devices/pci0000:00/0000:00:1f.3/sound/card0/input15
Mar 10 17:05:01 maverick kernel: [ 9.356895] input: HDA Intel PCH HDMI/DP,pcm=7 as /devices/pci0000:00/0000:00:1f.3/sound/card0/input16
Mar 10 17:05:01 maverick kernel: [ 9.357074] input: HDA Intel PCH HDMI/DP,pcm=8 as /devices/pci0000:00/0000:00:1f.3/sound/card0/input17
Mar 10 17:05:01 maverick kernel: [ 9.357296] input: HDA Intel PCH HDMI/DP,pcm=9 as /devices/pci0000:00/0000:00:1f.3/sound/card0/input18
Mar 10 17:05:01 maverick kernel: [ 9.357497] input: HDA Intel PCH HDMI/DP,pcm=10 as /devices/pci0000:00/0000:00:1f.3/sound/card0/input19
Mar 10 17:05:01 maverick kernel: [ 9.432866] dw-apb-uart.2: ttyS4 at MMIO 0x8f802000 (irq = 20, base_baud = 115200) is a 16550A
Mar 10 17:05:01 maverick kernel: [ 9.434397] iwlwifi 0000:00:14.3 wlp0s20f3: renamed from wlan0
Mar 10 17:05:01 maverick kernel: [ 9.445610] nvidia-modeset: Loading NVIDIA Kernel Mode Setting Driver for UNIX platforms 430.50 Thu Sep 5 22:39:50 CDT 2019
Mar 10 17:05:01 maverick kernel: [ 9.575171] nvidia-uvm: Loaded the UVM driver in 8 mode, major device number 234
Mar 10 17:05:01 maverick kernel: [ 9.623512] Bluetooth: BNEP (Ethernet Emulation) ver 1.3
Mar 10 17:05:01 maverick kernel: [ 9.623516] Bluetooth: BNEP filters: protocol multicast
Mar 10 17:05:01 maverick kernel: [ 9.623525] Bluetooth: BNEP socket layer initialized
Mar 10 17:05:01 maverick kernel: [ 9.664785] input: MSFT0001:01 06CB:CD5F Touchpad as /devices/pci0000:00/0000:00:15.1/i2c_designware.1/i2c-2/i2c-MSFT0001:01/0018:06CB:CD5F.0003/input/input24
Mar 10 17:05:01 maverick kernel: [ 9.665154] hid-multitouch 0018:06CB:CD5F.0003: input,hidraw2: I2C HID v1.00 Mouse [MSFT0001:01 06CB:CD5F] on i2c-MSFT0001:01
Mar 10 17:05:01 maverick kernel: [ 9.669632] input: HDA NVidia HDMI/DP,pcm=3 as /devices/pci0000:00/0000:00:01.0/0000:01:00.1/sound/card1/input20
Mar 10 17:05:01 maverick kernel: [ 9.669880] input: HDA NVidia HDMI/DP,pcm=7 as /devices/pci0000:00/0000:00:01.0/0000:01:00.1/sound/card1/input21
Mar 10 17:05:01 maverick kernel: [ 9.669932] input: HDA NVidia HDMI/DP,pcm=8 as /devices/pci0000:00/0000:00:01.0/0000:01:00.1/sound/card1/input22
Mar 10 17:05:02 maverick kernel: [ 9.767641] ACPI Warning: \_SB.PCI0.PEG0.PEGP._DSM: Argument #4 type mismatch - Found [Buffer], ACPI requires [Package] (20190703/nsarguments-66)
Mar 10 17:05:02 maverick kernel: [ 10.035982] Generic Realtek PHY r8169-700:00: attached PHY driver [Generic Realtek PHY] (mii_bus:phy_addr=r8169-700:00, irq=IGNORE)
Mar 10 17:05:02 maverick kernel: [ 10.149333] r8169 0000:07:00.0 enp7s0: Link is Down
Mar 10 17:05:02 maverick kernel: [ 10.179246] iwlwifi 0000:00:14.3: Applying debug destination EXTERNAL_DRAM
Mar 10 17:05:02 maverick kernel: [ 10.296096] iwlwifi 0000:00:14.3: Applying debug destination EXTERNAL_DRAM
Mar 10 17:05:02 maverick kernel: [ 10.361833] iwlwifi 0000:00:14.3: FW already configured (0) - re-configuring
Mar 10 17:05:02 maverick kernel: [ 10.374304] iwlwifi 0000:00:14.3: BIOS contains WGDS but no WRDS
Mar 10 17:05:02 maverick kernel: [ 10.378535] Bluetooth: hci0: Waiting for firmware download to complete
Mar 10 17:05:02 maverick kernel: [ 10.379322] Bluetooth: hci0: Firmware loaded in 1598306 usecs
Mar 10 17:05:02 maverick kernel: [ 10.379451] Bluetooth: hci0: Waiting for device to boot
Mar 10 17:05:02 maverick kernel: [ 10.392359] Bluetooth: hci0: Device booted in 12671 usecs
Mar 10 17:05:02 maverick kernel: [ 10.395240] Bluetooth: hci0: Found Intel DDC parameters: intel/ibt-17-16-1.ddc
Mar 10 17:05:02 maverick kernel: [ 10.398388] Bluetooth: hci0: Applying Intel DDC parameters completed
Mar 10 17:05:03 maverick kernel: [ 11.148057] nvidia-uvm: Unloaded the UVM driver in 8 mode
Mar 10 17:05:03 maverick kernel: [ 11.171826] nvidia-modeset: Unloading
Mar 10 17:05:03 maverick kernel: [ 11.219065] nvidia-nvlink: Unregistered the Nvlink Core, major device number 237
Mar 10 17:05:04 maverick kernel: [ 12.125832] nvidia-nvlink: Nvlink Core is being initialized, major device number 237
Mar 10 17:05:04 maverick kernel: [ 12.127484] nvidia 0000:01:00.0: vgaarb: changed VGA decodes: olddecodes=none,decodes=none:owns=none
Mar 10 17:05:04 maverick kernel: [ 12.175644] NVRM: loading NVIDIA UNIX x86_64 Kernel Module 430.50 Thu Sep 5 22:36:31 CDT 2019
Mar 10 17:05:05 maverick kernel: [ 13.205291] nvidia-modeset: Loading NVIDIA Kernel Mode Setting Driver for UNIX platforms 430.50 Thu Sep 5 22:39:50 CDT 2019
Mar 10 17:05:05 maverick kernel: [ 13.250663] nvidia-uvm: Loaded the UVM driver in 8 mode, major device number 234
Mar 10 17:05:06 maverick kernel: [ 13.986003] wlp0s20f3: authenticate with 58:c1:7a:1b:bd:d0
Mar 10 17:05:06 maverick kernel: [ 13.994385] wlp0s20f3: send auth to 58:c1:7a:1b:bd:d0 (try 1/3)
Mar 10 17:05:06 maverick kernel: [ 14.047103] iwlwifi 0000:00:14.3: Unhandled alg: 0x707
Mar 10 17:05:06 maverick kernel: [ 14.063692] wlp0s20f3: authenticated
Mar 10 17:05:06 maverick kernel: [ 14.068040] wlp0s20f3: associate with 58:c1:7a:1b:bd:d0 (try 1/3)
Mar 10 17:05:06 maverick kernel: [ 14.097924] wlp0s20f3: RX AssocResp from 58:c1:7a:1b:bd:d0 (capab=0x431 status=0 aid=4)
Mar 10 17:05:06 maverick kernel: [ 14.143288] iwlwifi 0000:00:14.3: Unhandled alg: 0x707
Mar 10 17:05:06 maverick kernel: [ 14.177499] wlp0s20f3: associated
Mar 10 17:05:06 maverick kernel: [ 14.296025] IPv6: ADDRCONF(NETDEV_CHANGE): wlp0s20f3: link becomes ready
Mar 10 17:05:08 maverick kernel: [ 16.376337] bpfilter: Loaded bpfilter_umh pid 1511
Mar 10 17:05:18 maverick kernel: [ 26.325876] Bluetooth: RFCOMM TTY layer initialized
Mar 10 17:05:18 maverick kernel: [ 26.325884] Bluetooth: RFCOMM socket layer initialized
Mar 10 17:05:18 maverick kernel: [ 26.325892] Bluetooth: RFCOMM ver 1.11
Mar 10 17:05:19 maverick kernel: [ 27.169380] rfkill: input handler disabled
Mar 10 17:08:10 maverick kernel: [ 198.039283] ucsi_ccg 0-0008: failed to reset PPM!
Mar 10 17:08:10 maverick kernel: [ 198.039292] ucsi_ccg 0-0008: PPM init failed (-110)
Mar 10 17:10:11 maverick kernel: [ 319.690728] mce: CPU11: Core temperature above threshold, cpu clock throttled (total events = 75)
Mar 10 17:10:11 maverick kernel: [ 319.690729] mce: CPU5: Core temperature above threshold, cpu clock throttled (total events = 75)
Mar 10 17:10:11 maverick kernel: [ 319.690730] mce: CPU11: Package temperature above threshold, cpu clock throttled (total events = 290)
Mar 10 17:10:11 maverick kernel: [ 319.690730] mce: CPU5: Package temperature above threshold, cpu clock throttled (total events = 290)
Mar 10 17:10:11 maverick kernel: [ 319.690772] mce: CPU6: Package temperature above threshold, cpu clock throttled (total events = 290)
Mar 10 17:10:11 maverick kernel: [ 319.690773] mce: CPU1: Package temperature above threshold, cpu clock throttled (total events = 290)
Mar 10 17:10:11 maverick kernel: [ 319.690774] mce: CPU0: Package temperature above threshold, cpu clock throttled (total events = 290)
Mar 10 17:10:11 maverick kernel: [ 319.690775] mce: CPU3: Package temperature above threshold, cpu clock throttled (total events = 290)
Mar 10 17:10:11 maverick kernel: [ 319.690776] mce: CPU7: Package temperature above threshold, cpu clock throttled (total events = 290)
Mar 10 17:10:11 maverick kernel: [ 319.690777] mce: CPU9: Package temperature above threshold, cpu clock throttled (total events = 290)
Mar 10 17:10:11 maverick kernel: [ 319.690778] mce: CPU4: Package temperature above threshold, cpu clock throttled (total events = 290)
Mar 10 17:10:11 maverick kernel: [ 319.690779] mce: CPU2: Package temperature above threshold, cpu clock throttled (total events = 290)
Mar 10 17:10:11 maverick kernel: [ 319.690780] mce: CPU10: Package temperature above threshold, cpu clock throttled (total events = 290)
Mar 10 17:10:11 maverick kernel: [ 319.690781] mce: CPU8: Package temperature above threshold, cpu clock throttled (total events = 290)
Mar 10 17:10:11 maverick kernel: [ 319.691710] mce: CPU5: Core temperature/speed normal
Mar 10 17:10:11 maverick kernel: [ 319.691713] mce: CPU11: Core temperature/speed normal
Mar 10 17:10:11 maverick kernel: [ 319.691716] mce: CPU11: Package temperature/speed normal
Mar 10 17:10:11 maverick kernel: [ 319.691717] mce: CPU5: Package temperature/speed normal
Mar 10 17:10:11 maverick kernel: [ 319.691777] mce: CPU0: Package temperature/speed normal
Mar 10 17:10:11 maverick kernel: [ 319.691781] mce: CPU7: Package temperature/speed normal
Mar 10 17:10:11 maverick kernel: [ 319.691783] mce: CPU6: Package temperature/speed normal
Mar 10 17:10:11 maverick kernel: [ 319.691787] mce: CPU2: Package temperature/speed normal
Mar 10 17:10:11 maverick kernel: [ 319.691790] mce: CPU1: Package temperature/speed normal
Mar 10 17:10:11 maverick kernel: [ 319.691793] mce: CPU8: Package temperature/speed normal
Mar 10 17:10:11 maverick kernel: [ 319.691798] mce: CPU10: Package temperature/speed normal
Mar 10 17:10:11 maverick kernel: [ 319.691800] mce: CPU4: Package temperature/speed normal
Mar 10 17:10:11 maverick kernel: [ 319.691804] mce: CPU3: Package temperature/speed normal
Mar 10 17:10:11 maverick kernel: [ 319.691807] mce: CPU9: Package temperature/speed normal
Mar 10 17:13:35 maverick kernel: [ 523.048575] wlp0s20f3: authenticate with 58:c1:7a:1b:bd:d0
Mar 10 17:13:35 maverick kernel: [ 523.055288] wlp0s20f3: send auth to 58:c1:7a:1b:bd:d0 (try 1/3)
Mar 10 17:13:35 maverick kernel: [ 523.097819] wlp0s20f3: authenticated
Mar 10 17:13:35 maverick kernel: [ 523.099819] wlp0s20f3: associate with 58:c1:7a:1b:bd:d0 (try 1/3)
Mar 10 17:13:35 maverick kernel: [ 523.107873] wlp0s20f3: RX AssocResp from 58:c1:7a:1b:bd:d0 (capab=0x431 status=0 aid=1)
Mar 10 17:13:35 maverick kernel: [ 523.109523] iwlwifi 0000:00:14.3: Unhandled alg: 0x707
Mar 10 17:13:35 maverick kernel: [ 523.110798] wlp0s20f3: associated
Mar 10 17:13:35 maverick kernel: [ 523.119975] IPv6: ADDRCONF(NETDEV_CHANGE): wlp0s20f3: link becomes ready
How can I stop this from happening again ie. stop pytorch training and not crash my system?