Bug in Training a Classifier tutorial

eslam_fouda · November 3, 2020, 7:20pm

I installed PyTorch 1.7 and CUDA 11.

Then I started learning PyTorch, but when I ran the code from the following tutorial:
https://pytorch.org/tutorials/beginner/blitz/cifar10_tutorial.html

in a Jupyter notebook, I got this error:

The kernel appears to have died. It will restart automatically.

I tried the following:

First, I downgraded to CUDA 10.2 and tried again in the Jupyter notebook, but got the same error. When I reran the cell in the fourth step that caused the error, I got the following error:
TypeError: Caught TypeError in DataLoader worker process 0
I then tried setting num_workers to 0, which required re-running some previously defined cells, but it did not help.

Second, I ran the code as a plain .py file and got the following error:

Illegal instruction (core dumped)
Using print statements, I traced the problem to the following lines:
loss.backward()
optimizer.step()

How can I fix this problem?

Thanks in advance.
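
As an alternative to print statements, Python's standard faulthandler module can report which Python line was running when the process received a fatal signal such as SIGILL ("Illegal instruction"). The following is only a minimal sketch under stated assumptions: it assumes a POSIX system, `run_with_faulthandler` and `training_step` are hypothetical names, and the child program sends SIGILL to itself as a stand-in for the crashing training step rather than calling PyTorch.

```python
import subprocess
import sys

# Child program (illustrative only): enable faulthandler, then send SIGILL
# to itself to simulate the "Illegal instruction" crash. In a real script,
# the body of training_step would be loss.backward() / optimizer.step().
CHILD = """
import faulthandler, os, signal
faulthandler.enable()
def training_step():
    os.kill(os.getpid(), signal.SIGILL)
training_step()
"""


def run_with_faulthandler(source):
    """Run `source` in a fresh interpreter; return (returncode, stderr)."""
    proc = subprocess.run(
        [sys.executable, "-c", source],
        capture_output=True,
        text=True,
    )
    return proc.returncode, proc.stderr


if __name__ == "__main__":
    code, err = run_with_faulthandler(CHILD)
    print("exit code:", code)
    # stderr holds the traceback faulthandler wrote before the process died
    print(err)
```

In an actual training script, calling `faulthandler.enable()` once at the top (or running `python -X faulthandler script.py`) should be enough to get the traceback on stderr.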

eslam_fouda · November 3, 2020, 8:19pm

The computer also starts to hang and shows these screens:

eslam_fouda · November 3, 2020, 8:22pm

This screen was already appearing before the problem started.

eslam_fouda · November 6, 2020, 7:54am

My system info is:
GPU: GTX 1070
(base) eslam@scholar:~$ cat /proc/cpuinfo
processor : 0
vendor_id : AuthenticAMD
cpu family : 16
model : 6
model name : AMD Athlon™ II X2 250 Processor
stepping : 3
microcode : 0x10000c8
cpu MHz : 1800.000
cache size : 1024 KB
physical id : 0
siblings : 2
core id : 0
cpu cores : 2
apicid : 0
initial apicid : 0
fpu : yes
fpu_exception : yes
cpuid level : 5
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm 3dnowext 3dnow constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid pni monitor cx16 popcnt lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt hw_pstate vmmcall npt lbrv svm_lock nrip_save
bugs : tlb_mmatch fxsave_leak sysret_ss_attrs null_seg spectre_v1 spectre_v2
bogomips : 6026.74
TLB size : 1024 4K pages
clflush size : 64
cache_alignment : 64
address sizes : 48 bits physical, 48 bits virtual
power management: ts ttp tm stc 100mhzsteps hwpstate

processor : 1
vendor_id : AuthenticAMD
cpu family : 16
model : 6
model name : AMD Athlon™ II X2 250 Processor
stepping : 3
microcode : 0x10000c8
cpu MHz : 2300.000
cache size : 1024 KB
physical id : 0
siblings : 2
core id : 1
cpu cores : 2
apicid : 1
initial apicid : 1
fpu : yes
fpu_exception : yes
cpuid level : 5
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm 3dnowext 3dnow constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid pni monitor cx16 popcnt lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt hw_pstate vmmcall npt lbrv svm_lock nrip_save
bugs : tlb_mmatch fxsave_leak sysret_ss_attrs null_seg spectre_v1 spectre_v2
bogomips : 6026.74
TLB size : 1024 4K pages
clflush size : 64
cache_alignment : 64
address sizes : 48 bits physical, 48 bits virtual
power management: ts ttp tm stc 100mhzsteps hwpstate

(base) eslam@scholar:~$ sudo dmidecode -t 2
[sudo] password for eslam:

dmidecode 3.2

Getting SMBIOS data from sysfs.
SMBIOS 2.4 present.

Handle 0x0002, DMI type 2, 8 bytes
Base Board Information
Manufacturer: Gigabyte Technology Co., Ltd.
Product Name: GA-MA74GMT-S2
Version: x.x
Serial Number:
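
One thing that could be checked from the output above is which instruction-set extensions the CPU reports, since "Illegal instruction" generally means the process executed an instruction the processor does not support. Whether that is the cause here is an assumption and is not confirmed anywhere in this thread. The sketch below parses the `flags` line of `/proc/cpuinfo`; the helper names and the list of extensions to look for are illustrative choices, and `SAMPLE` is a shortened copy of the flags line posted above.

```python
def parse_cpu_flags(cpuinfo_text):
    """Return the set of flags from the first 'flags' line of cpuinfo text."""
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "flags":
            return set(value.split())
    return set()


def missing_extensions(flags, wanted=("ssse3", "sse4_1", "sse4_2", "avx", "avx2")):
    """List the extensions in `wanted` (an illustrative set) absent from flags."""
    return [name for name in wanted if name not in flags]


# Shortened from the flags line in the post above (not the full list).
SAMPLE = "flags : fpu sse sse2 pni cx16 popcnt sse4a"

if __name__ == "__main__":
    try:
        with open("/proc/cpuinfo") as handle:
            text = handle.read()
    except OSError:
        # Fall back to the sample when /proc/cpuinfo is unavailable
        text = SAMPLE
    print(missing_extensions(parse_cpu_flags(text)))
```

None of the five names in the illustrative `wanted` tuple appear in the flags list posted above.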

ptrblck · November 6, 2020, 10:56am

I would recommend checking the system logs to see what’s causing the machine to restart.
A failing Python script should just raise an error and shouldn’t take down the whole workstation, so I guess your system might be hitting some critical issue.

eslam_fouda · November 7, 2020, 10:02am

Thanks for your response.
The machine has stopped restarting and hanging, but the second screen still appears.
The error still occurs at the following two lines:
loss.backward()
optimizer.step()

eslam_fouda · November 7, 2020, 12:41pm

I also tried building from source after I got rid of the two screens, but it did not help.

ptrblck · November 7, 2020, 9:28pm

I still think you are facing a potential hardware error, as shown in the screenshot, so I would recommend looking into the system error logs for any clues about what might be wrong.
If you cannot find anything, try running some RAM tests, etc.

eslam_fouda · November 11, 2020, 4:19pm

I’m a beginner. What are system error logs? How can I run RAM tests?
While installing Ubuntu 16.04 I saw the second screen above; it disappeared after installation.
While setting up the GPU driver I got the message "no original module exists within this kernel".

ptrblck · November 12, 2020, 5:38am

You could start by checking dmesg. However, based on your last comments, it seems that your initial installation might have some issues.

Is this the error message you were seeing?
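
For a beginner, the dmesg check suggested above could be sketched roughly as follows: run `dmesg` and keep only the lines that contain a few suspicious words. This is a rough sketch, not a diagnostic tool. The keyword list and the helper names `filter_log` and `read_dmesg` are assumptions chosen for illustration, and on some systems `dmesg` needs root privileges, in which case this prints nothing.

```python
import subprocess

# Illustrative keywords only; adjust to whatever the kernel log contains.
KEYWORDS = ("error", "fail", "mce", "segfault", "trap")


def filter_log(text, keywords=KEYWORDS):
    """Return the lines of `text` containing any keyword (case-insensitive)."""
    hits = []
    for line in text.splitlines():
        lowered = line.lower()
        if any(word in lowered for word in keywords):
            hits.append(line)
    return hits


def read_dmesg():
    """Return the kernel log as text, or '' if dmesg is not installed."""
    try:
        proc = subprocess.run(["dmesg"], capture_output=True, text=True)
    except FileNotFoundError:
        return ""
    return proc.stdout


if __name__ == "__main__":
    for line in filter_log(read_dmesg()):
        print(line)
```

Running `dmesg` directly in a terminal (with `sudo` if needed) and reading the most recent lines right after a crash gives the same information without any script.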