A potential error of torch.zeros function

my_torch · October 13, 2018, 5:09am

Hi, I ran into an error when using the torch.zeros function and eventually found a workaround for it.
However, I think there may be something wrong with the implementation itself. Can anyone help me find the root cause? Thanks.

Bug

The program gets a segmentation fault when the shape passed to torch.zeros is very large and its second dimension is a tensor instead of an integer.

To Reproduce

Steps to reproduce the behavior:

import torch
seq_length = torch.LongTensor(range(895))
torch.zeros((69137, seq_length.max(), 13))
Segmentation Fault

Expected behavior

If I do the following

import torch
torch.zeros((69137, torch.LongTensor([895]).max(), 13))

A TypeError: an integer is required is raised, indicating that the tensor returned by torch.LongTensor([895]).max() should be converted to a Python integer with .item(), i.e. torch.LongTensor([895]).max().item().
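A minimal sketch of the .item() conversion mentioned above, written as a small helper that normalizes a shape before it reaches torch.zeros. The name as_int_shape is hypothetical and not part of PyTorch; the helper only assumes that a one-element tensor exposes an .item() method.

```python
import numbers


def as_int_shape(shape):
    """Return `shape` as a tuple of plain Python integers.

    Hypothetical helper, not a PyTorch API. Any element that exposes an
    `.item()` method (for example the one-element tensor returned by
    `seq_length.max()`) is unwrapped into a Python number first.
    """
    dims = []
    for d in shape:
        if hasattr(d, "item"):
            d = d.item()  # one-element tensor -> Python number
        # Reject floats, bools and anything else that is not a real integer.
        if isinstance(d, bool) or not isinstance(d, numbers.Integral):
            raise TypeError("dimension %r is not an integer" % (d,))
        dims.append(int(d))
    return tuple(dims)
```

With it, the call from this post could be written as torch.zeros(as_int_shape((69137, seq_length.max(), 13))), so torch.zeros only ever receives plain integers.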
If I do the following

torch.zeros((69137, torch.LongTensor([1]).max(), 13))

No error will be produced.
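One visible difference between the two calls above is how much memory they request. The sketch below only does the arithmetic, assuming the default float32 dtype (4 bytes per element); the helper name zeros_nbytes is hypothetical, and the numbers do not by themselves explain the segmentation fault.

```python
def zeros_nbytes(shape, bytes_per_element=4):
    """Bytes needed for a dense tensor of `shape`.

    Assumes 4 bytes per element (float32) unless told otherwise.
    """
    n = 1
    for d in shape:
        n *= d
    return n * bytes_per_element


big = zeros_nbytes((69137, 895, 13))   # shape from the failing call
small = zeros_nbytes((69137, 1, 13))   # shape from the call that works

print("big:   %d bytes (%.2f GiB)" % (big, big / 2.0 ** 30))
print("small: %d bytes (%.2f MiB)" % (small, small / 2.0 ** 20))
```

Under that assumption the first shape asks for roughly 3 GiB, while the second needs only a few MiB.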

Environment


PyTorch version: 0.4.1.post2
Is debug build: No
CUDA used to build PyTorch: 9.0.176
OS: Debian GNU/Linux 9.4 (stretch)
GCC version: (Debian 4.9.2-10+deb8u1) 4.9.2
CMake version: version 3.9.4

Python version: 2.7

Is CUDA available: Yes
CUDA runtime version: Could not collect
GPU models and configuration:
GPU 0: GeForce GTX 1080 Ti
GPU 1: GeForce GTX 1080 Ti
Nvidia driver version: 387.26
cuDNN version: Probably one of the following:
/usr/local/cuda-8.0/lib64/libcudnn.so.6
/usr/local/cuda-9.0/lib64/libcudnn.so
/usr/local/cuda-9.0/lib64/libcudnn.so.7
/usr/local/cuda-9.0/lib64/libcudnn.so.7.0.5
/usr/local/cuda-9.0/lib64/libcudnn.so.7.1.2
/usr/local/cuda-9.0/lib64/libcudnn_static.a
/usr/local/cuda-9.1/lib64/libcudnn.so
/usr/local/cuda-9.1/lib64/libcudnn.so.7
/usr/local/cuda-9.1/lib64/libcudnn.so.7.1.2
/usr/local/cuda-9.1/lib64/libcudnn_static.a

Versions of relevant libraries:

[pip] Could not collect
[conda] magma-cuda90 2.3.0 1 pytorch
[conda] pytorch 0.4.1 py27__9.0.176_7.1.2_2 pytorch
[conda] torch 0.4.0a0+964707e
[conda] torch 0.4.0a0+92a0f78
[conda] torchfile 0.1.0
[conda] torchnet 0.0.2
[conda] torchvision 0.2.0
[conda] torchvision 0.2.1 py27_1 pytorch

InnovArul · October 13, 2018, 6:22am

I am unable to reproduce this in pytorch 0.4.0, 0.4.1, 0.5.0a0+ab6afc2, 1.0.0.dev20181008

ptrblck · October 13, 2018, 12:02pm

I couldn't reproduce the error either, neither in 0.4.1 nor in the current master build.

@my_torch could you try to run your script with gdb as explained here?

InnovArul · October 13, 2018, 12:59pm

How did you install PyTorch? If you built it from source, some libraries might be missing or wrongly linked. In that case, try installing from a binary source. Here is one such source:

my_torch · October 14, 2018, 2:38am

Hi, I followed the instructions in that post and found something really weird.
I created a file pytorch.py:

import torch
seq_length = torch.LongTensor([895])
torch.zeros((69137, seq_length.max(), 13))

When I run python2 pytorch.py in bash, I get a segmentation fault.
But when I follow the gdb instructions in that post,
I get:

(gdb) run
Starting program: /home/dsk/anaconda2/bin/python2 pytorch.py
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Traceback (most recent call last):
  File "pytorch.py", line 5, in <module>
    torch.zeros((69137, seq_length.max(), 13))

I tried 5 times and got the same result every time.
Could you give me more guidance?
Thanks
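A possible complement to gdb, sketched here as a suggestion rather than something taken from this thread: Python's faulthandler module can print the Python-level traceback when the interpreter receives a fatal signal such as SIGSEGV. It is in the standard library from Python 3.3; on Python 2.7 this assumes the separately installed faulthandler backport. The function name run_repro is hypothetical.

```python
import faulthandler

# Install handlers that dump the Python traceback to stderr if the process
# receives SIGSEGV, SIGFPE, SIGABRT, SIGBUS or SIGILL.
faulthandler.enable()


def run_repro():
    # Import inside the function so the handler is already installed
    # even if the crash happens while torch is being imported.
    import torch

    seq_length = torch.LongTensor([895])
    return torch.zeros((69137, seq_length.max(), 13))
```

Calling run_repro() from a plain python invocation (outside gdb) should then show which Python line was executing if the process crashes.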

my_torch · October 14, 2018, 2:38am

Hi, I just installed PyTorch using conda install pytorch torchvision -c pytorch, taken from the official website.