Hi Soumith, I just saw your reply and went to try my earlier script.
I ran two versions of the actual training code, one with a thread lock on the model and the other without the lock. The one with the lock is still running now (>1 hour), while the one without the lock hit a segmentation fault sooner or later on every attempt (I tried more than once).
I also tried the script I put in the gist and got the same error.
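In case it helps, here is roughly what I mean by "a thread lock on the model". This is only a minimal sketch and not the actual training code: `DummyModel`, `worker` and `run` are made-up names, and the counter stands in for the real network so the snippet runs without any framework installed.

```python
import threading


class DummyModel(object):
    """Hypothetical stand-in for the real network; not thread-safe by itself."""

    def __init__(self):
        self.steps = 0

    def train_step(self):
        # Read-modify-write on shared state, similar in spirit to an
        # in-place parameter update on a shared model.
        current = self.steps
        self.steps = current + 1


def worker(model, lock, n_steps):
    for _ in range(n_steps):
        # Only one thread at a time is allowed to touch the model.
        with lock:
            model.train_step()


def run(n_threads=4, n_steps=1000):
    model = DummyModel()
    lock = threading.Lock()
    threads = [
        threading.Thread(target=worker, args=(model, lock, n_steps))
        for _ in range(n_threads)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return model.steps
```

With the lock held around every step, the threads take turns on the model. That also serializes the work, so the locked version is expected to run more slowly than the unlocked one.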
To reproduce the error, I usually run the script about 20 times in a loop:
for i in `seq 0 20`; do python multithread.py; done
(It's better to put the loop in a .sh file in case you want to kill it.)
BTW, it happens on both GPU and CPU (make the network smaller so that the 20 runs finish faster).
For your information, my environment is Python 2.7 with CUDA 8.0.
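If you'd rather not use a shell loop, the same idea can be sketched in Python so that it stops at the first failing run. This is an assumption on my side, not what I actually ran: `run_repeatedly` is a made-up helper, built on the standard `subprocess.call`.

```python
import subprocess
import sys


def run_repeatedly(cmd, runs=20):
    """Run cmd up to `runs` times.

    Returns (run_number, return_code) for the first run that exits with a
    non-zero code, or None if every run succeeds.
    """
    for i in range(1, runs + 1):
        code = subprocess.call(cmd)
        if code != 0:
            # On POSIX a negative code -N means the child was killed by
            # signal N, so a segmentation fault shows up as -11 (SIGSEGV).
            return i, code
    return None
```

Calling `run_repeatedly([sys.executable, "multithread.py"])` would then run the script up to 20 times and report which run crashed first, if any.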