Torch.tensor() throws "RuntimeError: Overflow when unpacking long" exception?!

I see this exception mentioned in some older and more exotic contexts, but this case is about as straightforward as it gets:

    import torch

    for package in [torch]: 
        print(package.__name__, package.__version__)
    
    tstList = [10,20,30,40,50]
    tstTensor = torch.tensor(tstList)
    print('tst',type(tstTensor),tstTensor)
    
    tstLongTensor = torch.tensor(tstList,dtype=torch.int64)
    print('long',type(tstLongTensor),tstLongTensor)
    
    tstLng2Tensor = torch.tensor(tstList).type(torch.int64)
    print('long2',type(tstLng2Tensor),tstLng2Tensor) 

    indexedTst = [10239237003504588839, 9921686513378912864, 11901859001352538922, 4316640507316845735, 7162342725099726394, 17494803046312582752]
    idxTensor = torch.tensor(indexedTst)
    print('idx',type(idxTensor),idxTensor)

It works fine until it's given these very large ints:

    torch 2.3.0
    tst <class 'torch.Tensor'> tensor([10, 20, 30, 40, 50])
    long <class 'torch.Tensor'> tensor([10, 20, 30, 40, 50])
    long2 <class 'torch.Tensor'> tensor([10, 20, 30, 40, 50])
    Traceback (most recent call last):
      File ".../tstTensor.py", line 33, in <module>
        main()
      File ".../tstTensor.py", line 29, in main
        idxTensor = torch.tensor(indexedTst)
                    ^^^^^^^^^^^^^^^^^^^^^^^^
    RuntimeError: Overflow when unpacking long

version info:

  • Python 3.11.9 (main, Apr 19 2024, 11:43:47) [Clang 14.0.6 ] on darwin
  • same exception using device=cpu or mps

This error is expected since torch.long==torch.int64 corresponds to a max. value of 9223372036854775807 while your values are larger than this:

    torch.iinfo(torch.long).max > 10239237003504588839
    # False

Use torch.uint64 if you really need such large values:

    indexedTst = [10239237003504588839, 9921686513378912864, 11901859001352538922, 4316640507316845735, 7162342725099726394, 17494803046312582752]
    idxTensor = torch.tensor(indexedTst, dtype=torch.uint64)
    print('idx',type(idxTensor),idxTensor)
    # idx <class 'torch.Tensor'> tensor([10239237003504588839,  9921686513378912864, 11901859001352538922,
    #          4316640507316845735,  7162342725099726394, 17494803046312582752],
    #        dtype=torch.uint64)

I would also be interested to learn more about your use case and when these indices are useful.


Excellent, thanks so much @ptrblck! I had been looking for a way to find the max value; now I know about torch.iinfo()!

> I would also be interested to learn more about your use case and when these indices are useful.

I am getting these large ints from spaCy's token hashing function. spaCy uses hashing on texts to get unique ids (cf. SO):

    >>> import spacy
    >>> nlp = spacy.load('en')
    >>> text = "here is some test text"
    >>> doc = nlp(text)
    >>> [token.norm for token in doc]
    [411390626470654571, 3411606890003347522, 7000492816108906599, 1618900948208871284, 15099781594404091470]
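
For reference, these hashes can also be mapped back to their strings through spaCy's StringStore. A minimal sketch (it assumes a model such as en_core_web_sm is installed; token.norm is the 64-bit hash of the token's norm form):

    import spacy

    # sketch only: en_core_web_sm is an assumed model name
    nlp = spacy.load("en_core_web_sm")
    doc = nlp("here is some test text")

    for token in doc:
        h = token.norm                  # 64-bit hash id
        print(h, nlp.vocab.strings[h])  # map the hash back to its string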

Ah, this is indeed an interesting use case! Thanks for sharing as I was wondering how large your tensors are to need such large indexing values :wink:

Hmm, but now embedding() doesn't know how to deal with these big uint64 values :frowning:

    RuntimeError: Expected tensor for argument #1 'indices' to have one of the following scalar types: Long, Int; but got MPSUInt64Type instead (while checking arguments for embedding)

OK, in this case I might not fully understand your use case yet.
An nn.Embedding layer is creating an internal weight matrix (in float32 by default) in the shape [num_embeddings, embedding_dim], which is indexed with the input tensor.
Here is a small example:

    import torch
    import torch.nn as nn

    num_embeddings = 10
    embedding_dim = 100
    emb = nn.Embedding(num_embeddings, embedding_dim)

    print(emb.weight.shape)
    # torch.Size([10, 100])

    x = torch.tensor([0, 4, 9])
    out = emb(x)
    print(out.shape)
    # torch.Size([3, 100])

    print((out[0] == emb.weight[0]).all())
    # tensor(True)
    print((out[1] == emb.weight[4]).all())
    # tensor(True)
    print((out[2] == emb.weight[9]).all())
    # tensor(True)

Note that the input indices should be in the range [0, num_embeddings-1]; out-of-range values will trigger an error:

    x = torch.tensor([10])
    out = emb(x)
    # IndexError: index out of range in self

Your inputs seem to be hash values with a max value of at least 10239237003504588839.
If you thus tried to create a weight matrix with num_embeddings=10239237003504588839, even with embedding_dim=1 the weight matrix would use 10239237003504588839 * 4 / 1024**3 = 38144130272.81 GB of memory in float32. Even if you could store each embedding weight in a single byte, it would still use 10239237003504588839 / 1024**3 = 9536032568.20 GB.
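
As a quick sanity check on those numbers:

    num_embeddings = 10239237003504588839

    # float32 weights: 4 bytes per entry, converted to GiB
    print(num_embeddings * 4 / 1024**3)  # ~38144130272.81

    # even a single byte per entry is still enormous
    print(num_embeddings / 1024**3)      # ~9536032568.20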

Could you double check if you really want to use these hash values as indices, or whether you would rather remap them to [0, num_embeddings-1]?
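
If you go with the remapping, a minimal sketch could look like this (reusing the hash values from your earlier post; the embedding_dim of 8 is arbitrary):

    import torch
    import torch.nn as nn

    # map each distinct hash to a small, contiguous index
    hashes = [10239237003504588839, 9921686513378912864, 11901859001352538922,
              4316640507316845735, 7162342725099726394, 17494803046312582752]
    hash_to_idx = {h: i for i, h in enumerate(dict.fromkeys(hashes))}

    indices = torch.tensor([hash_to_idx[h] for h in hashes])  # small int64 indices
    emb = nn.Embedding(num_embeddings=len(hash_to_idx), embedding_dim=8)

    out = emb(indices)
    print(out.shape)
    # torch.Size([6, 8])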

You make very good sense. The indices were intended for use with a Captum tutorial that depended on a pre-trained CNN model, but you helped me see that trying to build my own spaCy token->index vocabulary isn't the way. I created a different post: IMDB_TorchText_Interpret tutorial uses deprecated torchtext.