A potential bug in the torch.zeros function

Hi, I ran into an error when using the torch.zeros function and eventually found a workaround.
However, I think there may be something wrong with the implementation itself. Can anyone help me find the root cause? Thanks.

🐛 Bug

The program crashes with a segmentation fault when the size passed to torch.zeros is very large and its second dimension is a tensor instead of an integer.

To Reproduce

Steps to reproduce the behavior:

  1. seq_length = torch.LongTensor(range(895))
  2. torch.zeros((69137, seq_length.max(), 13))
  3. The process crashes with a segmentation fault.
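The three steps above can also be run as a standalone snippet in a child interpreter, so that a crash is reported as a signal instead of taking down the current session. This is a minimal sketch, assuming Python 3.7+ on Linux or another POSIX system (the poster uses Python 2.7, where subprocess.run does not exist). run_isolated and REPRO are hypothetical names, and the -X faulthandler option asks the child to print its Python-level traceback if it receives a fatal signal.

```python
import signal
import subprocess
import sys

# Hypothetical constant holding the reproduction from the steps above (needs torch).
# If it does not crash, the tensor holds about 803 million float32 values (~3.2 GB).
REPRO = (
    "import torch\n"
    "seq_length = torch.LongTensor(range(895))\n"
    "torch.zeros((69137, seq_length.max(), 13))\n"
)


def run_isolated(code, timeout=300):
    """Hypothetical helper: run `code` in a fresh interpreter and report how it ended.

    The child is started with -X faulthandler, so if it receives a fatal
    signal such as SIGSEGV it prints its Python-level traceback to stderr
    before dying.
    """
    proc = subprocess.run(
        [sys.executable, "-X", "faulthandler", "-c", code],
        capture_output=True,
        text=True,
        timeout=timeout,
    )
    if proc.returncode < 0:
        # POSIX convention: a negative return code means "killed by signal -code".
        try:
            name = signal.Signals(-proc.returncode).name
        except ValueError:
            name = "signal %d" % -proc.returncode
        return "killed by " + name, proc.stderr
    return "exited with code %d" % proc.returncode, proc.stderr
```

On an affected installation, run_isolated(REPRO) would be expected to return "killed by SIGSEGV" together with the traceback printed by faulthandler. On other installations it returns the exit code and whatever error message the child wrote to stderr.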

Expected behavior

If I run the following:

import torch
torch.zeros((69137, torch.LongTensor([895]).max(), 13))

A TypeError ("an integer is required") is raised, indicating that torch.LongTensor([895]) should be changed to torch.LongTensor([895]).item().

If I run the following instead:

torch.zeros((69137, torch.LongTensor([1]).max(), 13))

No error is raised.
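The workaround described above (passing a Python int rather than a tensor) can be applied to a whole shape at once. This is a minimal sketch: normalize_size is a hypothetical helper, not part of PyTorch, and it relies only on the .item() method that one-element tensors provide.

```python
import operator


def normalize_size(size):
    """Hypothetical helper: convert every entry of a shape to a plain Python int.

    Entries that expose .item() (such as the one-element tensor returned by
    seq_length.max()) are unwrapped first. operator.index then accepts only
    integer-like values, so a float dimension still raises TypeError instead
    of being silently truncated.
    """
    dims = []
    for dim in size:
        if hasattr(dim, "item"):
            dim = dim.item()  # one-element tensor -> Python number
        dims.append(operator.index(dim))
    return tuple(dims)
```

For example, torch.zeros(normalize_size((69137, seq_length.max(), 13))) passes only plain ints to torch.zeros. Note that with seq_length = torch.LongTensor(range(895)) this shape is 69137 x 894 x 13, about 803 million float32 values (roughly 3.2 GB), so the allocation itself is large.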

Environment


  • PyTorch version: 0.4.1.post2
  • Is debug build: No
  • CUDA used to build PyTorch: 9.0.176
  • OS: Debian GNU/Linux 9.4 (stretch)
  • GCC version: (Debian 4.9.2-10+deb8u1) 4.9.2
  • CMake version: version 3.9.4
  • Python version: 2.7
  • Is CUDA available: Yes
  • CUDA runtime version: Could not collect
  • GPU models and configuration:
      • GPU 0: GeForce GTX 1080 Ti
      • GPU 1: GeForce GTX 1080 Ti
  • Nvidia driver version: 387.26
  • cuDNN version: Probably one of the following:
      • /usr/local/cuda-8.0/lib64/libcudnn.so.6
      • /usr/local/cuda-9.0/lib64/libcudnn.so
      • /usr/local/cuda-9.0/lib64/libcudnn.so.7
      • /usr/local/cuda-9.0/lib64/libcudnn.so.7.0.5
      • /usr/local/cuda-9.0/lib64/libcudnn.so.7.1.2
      • /usr/local/cuda-9.0/lib64/libcudnn_static.a
      • /usr/local/cuda-9.1/lib64/libcudnn.so
      • /usr/local/cuda-9.1/lib64/libcudnn.so.7
      • /usr/local/cuda-9.1/lib64/libcudnn.so.7.1.2
      • /usr/local/cuda-9.1/lib64/libcudnn_static.a

Versions of relevant libraries:

  • [pip] Could not collect
  • [conda] magma-cuda90 2.3.0 1 pytorch
  • [conda] pytorch 0.4.1 py27__9.0.176_7.1.2_2 pytorch
  • [conda] torch 0.4.0a0+964707e
  • [conda] torch 0.4.0a0+92a0f78
  • [conda] torchfile 0.1.0
  • [conda] torchnet 0.0.2
  • [conda] torchvision 0.2.0
  • [conda] torchvision 0.2.1 py27_1 pytorch
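The package list above shows several torch builds side by side (pytorch 0.4.1 from the pytorch channel plus two 0.4.0a0+<commit> builds), so it may be worth confirming which copy the interpreter actually imports. This is a minimal sketch: describe_module is a hypothetical helper built on importlib, and it runs on Python 2.7 as well as Python 3.

```python
import importlib


def describe_module(name):
    """Hypothetical helper: report which copy of a module Python imports.

    Returns a dict with the module's __version__ and __file__ (None for
    either attribute if the module does not define it), or None if the
    module cannot be imported at all.
    """
    try:
        module = importlib.import_module(name)
    except ImportError:
        return None
    return {
        "version": getattr(module, "__version__", None),
        "file": getattr(module, "__file__", None),
    }
```

Running describe_module("torch") with the same python2 that crashes shows the version string and file path of the imported package; a path that does not belong to the conda environment in use would suggest that a different build is being picked up.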

I am unable to reproduce this in PyTorch 0.4.0, 0.4.1, 0.5.0a0+ab6afc2, or 1.0.0.dev20181008.


I couldn't reproduce the error either, neither in 0.4.1 nor in the current master build.

@my_torch, could you try running your script with gdb as explained here?

How did you install PyTorch? If you built it from source, some libraries might be missing or incorrectly linked; in that case, try installing a prebuilt binary instead. Here is one such source:

Hi, I followed the instructions in that post and found something really weird.
I created a file named pytorch.py:

import torch
seq_length = torch.LongTensor([895])
torch.zeros((69137, seq_length.max(), 13))

When I run python2 pytorch.py in bash, I get a segmentation fault.
However, when I follow the gdb instructions in that post, I get:

(gdb) run
Starting program: /home/dsk/anaconda2/bin/python2 pytorch.py
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Traceback (most recent call last):
  File "pytorch.py", line 5, in <module>
    torch.zeros((69137, seq_length.max(), 13))

I tried five times and got the same result every time.
Could you give me more guidance?
Thanks

Hi, I installed PyTorch with conda install pytorch torchvision -c pytorch, the command given on the official website.