Kernel death confusion

Hey everyone,

Apologies in advance since I’m really new to PyTorch and machine learning in general.

I’m trying to run a notebook where I convert data in the form of a list of 4-element lists into a tensor, then shuffle that data. When I start to shuffle the tensor based on random indices, the kernel dies, and I’m lost as to why. Can anyone help explain what’s going on here?

Everything above the screenshot is class/method definitions and import statements; the method adds 4-element lists to the parent list it takes as an argument.
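Roughly, the data-building part boils down to something like this (a simplified sketch; the names, sizes, and values here are stand-ins, not my real code):

import torch

# Simplified stand-in: a method appends 4-element lists to the parent
# list it's given, and the resulting list of lists is converted to a tensor.
parent = []
for i in range(8591):
    parent.append([float(i), 0.0, 1.0, 2.0])  # placeholder values

tensorData = torch.tensor(parent)  # shape (8591, 4)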

Thanks!

Try running your code directly in a terminal, as it should print an error message.

There isn’t really an error message here; it kind of just prints a ton of statements ending in (no debug info).

I’ve sampled a bit of it below; there’s way more, but it looks more or less the same, just with different names:

[/opt/anaconda3/envs/rootML/lib/libomp.dylib] void __kmp_suspend_64<false, true>(int, kmp_flag_64<false, true>*) (no debug info)
[/opt/anaconda3/envs/rootML/lib/python3.11/site-packages/torch/lib/libomp.dylib] kmp_flag_64<false, true>::wait(kmp_info*, int, void*) (no debug info)
[/opt/anaconda3/envs/rootML/lib/python3.11/site-packages/torch/lib/libomp.dylib] __kmp_hyper_barrier_release(barrier_type, kmp_info*, int, int, int, void*) (no debug info)
[/opt/anaconda3/envs/rootML/lib/python3.11/site-packages/torch/lib/libomp.dylib] __kmp_fork_barrier(int, int) (no debug info)
[/opt/anaconda3/envs/rootML/lib/python3.11/site-packages/torch/lib/libomp.dylib] __kmp_launch_thread (no debug info)
[/opt/anaconda3/envs/rootML/lib/python3.11/site-packages/torch/lib/libomp.dylib] __kmp_launch_worker(void*) (no debug info)
[/usr/lib/system/libsystem_pthread.dylib] _pthread_start (no debug info)
[/usr/lib/system/libsystem_pthread.dylib] thread_start (no debug info)

libomp might be segfaulting, and you could use a debugger, e.g. gdb, to create the stack trace.

I use an M1 Mac, so gdb doesn’t seem to be an option. Are there any alternatives you can recommend?

The issue is still ongoing. I retried running it in a terminal after decoupling a different package, and got the following error:

(rootML) (base) [npolishetty@arm64-apple-darwin20 MachineLearningStatistics (master ✗)]$ ipython modelNetwork.ipynb
No event loop hook running.
[1]    37447 segmentation fault  ipython modelNetwork.ipynb

Tried doing some digging; it looks like segmentation faults are related to Cython and memory, so I’ve attached the following line to highlight the issue:

data_train = tensorData.view(-1)[idx].view(tensorData.size())

idx is just torch.randperm(8591). Would this simply be the result of an extremely memory-inefficient approach? I used the method listed here. If it is a memory issue, are there any alternative approaches to shuffling that I should look into?
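For reference, the idiom from that method boils down to the following (a toy sketch with made-up sizes, not my real data): flatten the tensor, index it with a random permutation of all of its elements, then reshape it back to the original size.

import torch

# Toy illustration of the shuffle idiom (made-up size, not my real data):
# flatten, index with a permutation covering every element, reshape back.
t = torch.randn(6, 4)
perm = torch.randperm(t.numel())            # one index per flattened element
shuffled = t.view(-1)[perm].view(t.size())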

It does feel like this might be a memory issue. One quick thing you could do is to try running this on another machine with more memory and see if it still fails.
Another thing you could try is some of the top suggestions from this SO thread. If you’re on Python 3.4 or newer, then tracemalloc might be the best thing to try; otherwise there appear to be pretty good third-party libraries that you could use to see whether this is indeed a memory allocation issue.
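As a quick sketch of what that check could look like with tracemalloc (illustrative shapes, and note that tracemalloc only tracks allocations made through Python’s allocator, so memory PyTorch grabs through its own native allocator may be under-reported):

import torch
import tracemalloc

tracemalloc.start()

# Run the suspect step with illustrative shapes; swap in your own tensors.
tensorData = torch.randn(8591, 4)
idx = torch.randperm(tensorData.numel())
data_train = tensorData.view(-1)[idx].view(tensorData.size())

current, peak = tracemalloc.get_traced_memory()
print(f"current: {current / 1e6:.1f} MB, peak: {peak / 1e6:.1f} MB")
tracemalloc.stop()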
I hope these suggestions help you out.

Thanks for responding; I forgot to check back here to give an update. It was in fact just a memory issue, and after cutting down the amount of data I was feeding in, everything worked fine. Thanks!