How to get custom dataset to return tuple of embedding sizes instead of tensors

Hello,

I hope everyone in the community is well.

I’m trying to get my custom dataset to return the following:

  1. Image tensor
  2. Policy (unique ID)
  3. Numerical columns tensor
  4. Categorical columns tensor
  5. Categorical embedding sizes tuple

I have 1 through 4 coming back correctly. However, when I try to return the embedding sizes, I get tensors instead of tuples and I’m not sure why. This throws an error when I try to instantiate my model. Below is the code for my dataset:

class image_Dataset(Dataset):
    '''
    image class data set   
    
    '''
    def __init__(self, data, transform = None):
        '''
        Args:
        ------------------------------------------------------------
            data = dataframe
            image = column in dataframe with absolute path to the image
            label = column in dataframe that is the target classification variable
            numerical_columns =  numerical columns from data
            categorical_columns = categorical columns from data
            policy = ID variable
            
        '''
        self.image_frame = data
        self.transform = transform
        
    def __len__(self):
        return len(self.image_frame)
    
    def __getitem__(self, idx):
        if torch.is_tensor(idx):
            idx = idx.tolist()
         
        label = self.image_frame.loc[idx, 'target']
        pic = Path(self.image_frame.loc[idx,'location'])
        img = Image.open(pic)
        policy = self.image_frame.loc[idx, 'policy']
        sample = {'image': img, 'policy': policy, 'label':label}
        numerical_data = self.image_frame.loc[idx, numerical_columns]

        if self.transform:
            image = self.transform(img)
        else:
            image = img
            
        # categorical_columns (like numerical_columns above) is defined outside the class
        for category in categorical_columns:
            self.image_frame[category] = self.image_frame[category].astype('category')

        categorical_column_sizes = [len(self.image_frame[column].cat.categories) for column in categorical_columns]
        categorical_embedding_sizes = [(col_size, min(50, (col_size + 1) // 2)) for col_size in categorical_column_sizes]

        # replace each categorical column with its integer codes
        for category in categorical_columns:
            self.image_frame[category] = self.image_frame[category].cat.codes.values
            
        categorical_data = self.image_frame.loc[idx, categorical_columns]
        categorical_data = torch.tensor(categorical_data, dtype = torch.int64)
        
        numerical_data = torch.tensor(numerical_data, dtype = torch.float)
            
        return image, label, policy, numerical_data, categorical_data, categorical_embedding_sizes

I’m not sure if this is needed, but this is the model object:

class Image_Embedd(nn.Module):

    def __init__(self, embedding_size):
        '''
        Args
        ---------------------------
        embedding_size: Contains the embedding size for the categorical columns
        num_numerical_cols: Stores the total number of numerical columns
        output_size: The size of the output layer or the number of possible outputs.
        layers: List which contains number of neurons for all the layers.
        p: Dropout with the default value of 0.5
        
        '''
        super(Image_Embedd, self).__init__()
        
        self.all_embeddings = nn.ModuleList([nn.Embedding(ni, nf) for ni, nf in embedding_size])
        self.embedding_dropout = nn.Dropout(p)
        
        
        self.cnn = models.resnet50(pretrained=False)
        
        self.cnn.fc = nn.Linear(self.cnn.fc.in_features, 1000)
        self.fc1 = nn.Linear(1000, 1017)
        self.fc2 = nn.Linear(1017, 128)
        self.fc3 = nn.Linear(128, 2)
        
        
    #define the forward method
    def forward(self, image, x_numerical, x_categorical):
        
        embeddings = []
        for i, e in enumerate(self.all_embeddings):
            embeddings.append(e(x_categorical[:,i]))
        
        x1 = self.cnn(image)
        x2 = x_numerical
        
        x = torch.cat((x1, x2), dim = 1)
        # concatenate the embeddings with the image and numerical features
        x = torch.cat(embeddings + [x], dim = 1)
        #x = F.relu(self.fc1(x))
        x = self.fc2(x)
        #x = F.relu(self.fc2(x))
        x = self.fc3(x)
        x = F.log_softmax(x, dim = 1)
        return x

And this is the error thrown when I try to instantiate:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-671-2a2bc631b997> in <module>
----> 1 combined_model = Image_Embedd(categorical_embedding_sizes)
      2 
      3 combined_model = combined_model.cuda()

<ipython-input-669-6ab41a05624c> in __init__(self, embedding_size)
     14         super(Image_Embedd, self).__init__()
     15 
---> 16         self.all_embeddings = nn.ModuleList([nn.Embedding(ni, nf) for ni, nf in embedding_size])
     17         self.embedding_dropout = nn.Dropout(p)
     18 

<ipython-input-669-6ab41a05624c> in <listcomp>(.0)
     14         super(Image_Embedd, self).__init__()
     15 
---> 16         self.all_embeddings = nn.ModuleList([nn.Embedding(ni, nf) for ni, nf in embedding_size])
     17         self.embedding_dropout = nn.Dropout(p)
     18 

C:\ProgramData\Anaconda3\lib\site-packages\torch\nn\modules\sparse.py in __init__(self, num_embeddings, embedding_dim, padding_idx, max_norm, norm_type, scale_grad_by_freq, sparse, _weight)
     95         self.scale_grad_by_freq = scale_grad_by_freq
     96         if _weight is None:
---> 97             self.weight = Parameter(torch.Tensor(num_embeddings, embedding_dim))
     98             self.reset_parameters()
     99         else:

TypeError: new() received an invalid combination of arguments - got (Tensor, Tensor), but expected one of:
 * (torch.device device)
 * (torch.Storage storage)
 * (Tensor other)
 * (tuple of ints size, torch.device device)
      didn't match because some of the arguments have invalid types: (!Tensor!, !Tensor!)
 * (object data, torch.device device)
      didn't match because some of the arguments have invalid types: (!Tensor!, !Tensor!)

Thank you for the help.

You could try to get the Python scalar values for ni and nf via:

... nn.Embedding(ni.item(), nf.item()) ...
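
For example, a minimal sketch inside __init__ (assuming each ni and nf comes back as a single-element tensor, which is what .item() expects):

# convert the collated sizes back to plain Python ints before building the embedding layers
embedding_size = [(ni.item(), nf.item()) for ni, nf in embedding_size]
self.all_embeddings = nn.ModuleList([nn.Embedding(ni, nf) for ni, nf in embedding_size])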

Let me know if that works for you.

Thank you. I am about to go out of town but will try when I get back. Thanks again. Happy new year.

Jordan

I’ve tried a few different things (one post withdrawn). It finally worked when I passed the embedding sizes in at instantiation instead of getting them from the data loader. However, if I want to test the model, I want to use the test/validation dataset’s embedding sizes, so passing the embedding sizes in as an argument doesn’t work for new data. I tried the method you suggested and I get the following error:

---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-179-e5525e645ef2> in <module>
      1 torch.manual_seed(101)
----> 2 combined_model = Image_Embedd()
      3 criterion = torch.nn.NLLLoss().cuda()
      4 optimizer = torch.optim.Adam(combined_model.parameters(), lr=0.001)
      5 scheduler = ReduceLROnPlateau(optimizer, 'min', patience = 4, verbose = True, min_lr = .00000001)

<ipython-input-178-54593bbab9b1> in __init__(self)
     14         super().__init__()
     15 
---> 16         self.all_embeddings = nn.ModuleList([nn.Embedding(ni.item(), nf.item()) for ni, nf in embedding_size])
     17         self.embedding_dropout = nn.Dropout(p = .04)
     18 

NameError: name 'embedding_size' is not defined

Here is the model object for reference:

class Image_Embedd(nn.Module):

    def __init__(self):
        '''
        Args
        ---------------------------
        embedding_size: Contains the embedding size for the categorical columns
        num_numerical_cols: Stores the total number of numerical columns
        output_size: The size of the output layer or the number of possible outputs.
        layers: List which contains number of neurons for all the layers.
        p: Dropout with the default value of 0.5
        
        '''
        super().__init__()
              
        self.all_embeddings = nn.ModuleList([nn.Embedding(ni.item(), nf.item()) for ni, nf in embedding_size])
        self.embedding_dropout = nn.Dropout(p = .04)
        
        
        self.cnn = models.resnet50(pretrained=False).cuda()
        
        self.cnn.fc = nn.Linear(self.cnn.fc.in_features, 1000)
        self.fc1 = nn.Linear(1000, 1077)
        self.fc2 = nn.Linear(1077, 128)
        self.fc3 = nn.Linear(128, 2)
        
        
    #define the forward method
    def forward(self, image, x_numerical, x_categorical):
        
        embeddings = []
        for i, e in enumerate(self.all_embeddings):
            embeddings.append(e(x_categorical[:,i]))
            
        x = torch.cat(embeddings, 1)
        x = self.embedding_dropout(x)
        x1 = self.cnn(image)
        x2 = x_numerical
        
        x3 = torch.cat((x1, x2), dim = 1)
        x4 = torch.cat((x, x3), dim = 1)
        x4 = F.relu(self.fc2(x4))
        x4 = F.relu(self.fc3(x4))
        x4 = F.log_softmax(x4, dim = 1)
        return x4

So the problem is still how to get the embedding sizes in.

In your first code snippet you were passing embedding_size as an argument to __init__, while you’ve removed it in your current code, which is why this NameError is raised.
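
For example, a minimal sketch of that pattern (the example sizes at the bottom are placeholders for the list of (num_categories, embedding_dim) tuples computed from your training dataframe):

import torch.nn as nn

class Image_Embedd(nn.Module):
    def __init__(self, embedding_size):
        super().__init__()
        # embedding_size: list of (num_categories, embedding_dim) tuples
        self.all_embeddings = nn.ModuleList([nn.Embedding(ni, nf) for ni, nf in embedding_size])
        # ... rest of the layers as before ...

# computed once from the training dataframe, e.g.
categorical_embedding_sizes = [(8, 4), (21, 11), (3, 2)]
combined_model = Image_Embedd(categorical_embedding_sizes)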

Yes, I’ve added it back in. I still haven’t figured out how to bring new data in if the embedding sizes are different. Do you know anyone who could recommend a solution to this problem?

An embedding is similar to a lookup table, where you provide indices as the input and get dense tensors (defined via embedding_dim) as the output.

Could you explain a bit how your embedding sizes differ?

The input shape is flexible as seen here:


nb_words = 10
emb_dim = 100

emb = nn.Embedding(nb_words, emb_dim)

x = torch.randint(0, nb_words, (10,))
output = emb(x)
print(output.shape)
> torch.Size([10, 100])

x = torch.randint(0, nb_words, (10, 2))
output = emb(x)
print(output.shape)
> torch.Size([10, 2, 100])

x = torch.randint(0, nb_words, (10, 2, 3))
output = emb(x)
print(output.shape)
> torch.Size([10, 2, 3, 100])
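
The number of embeddings (nb_words above), on the other hand, is fixed, so an index that is not smaller than nb_words will raise an out-of-range error. A small sketch continuing the example:

x = torch.randint(0, nb_words, (10,))
x[0] = nb_words   # simulate an index the embedding has never seen
try:
    output = emb(x)
except (IndexError, RuntimeError) as err:
    print(err)    # out-of-range index error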

Do you want to add new (unknown) word indices to the embedding matrix?

I have the training data, testing data and validation data separate. My training/test data embedding sizes are:

[(8, 4), (21, 11), (3, 2), (48, 24), (5, 3), (9, 5), (5, 3), (21, 11), (10, 5)]

My validation embedding sizes are:

[(8, 4), (24, 12), (3, 2), (48, 24), (4, 2), (8, 4), (4, 2), (21, 11), (10, 5)]

One of those sizes represents US states. If I were to add a state I don’t currently have, wouldn’t the model error out due to the difference in embedding sizes? How would I solve that type of problem?

I assume the new state would have a new index?
If so, it’s not easy to answer, as it would be comparable to an increased feature space for a linear layer.
The weight matrix does not contain valid parameters for this index yet, so you could e.g. extend the weight parameter and initialize the new parameters randomly.

We had similar discussions in this forum about adding new feature dimensions to a linear or conv layer and I’m not sure what the best approach was (initializing the new parameters randomly or copying from existing parameters).
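
A minimal sketch of the “extend and randomly initialize” idea (extend_embedding is a hypothetical helper, not a PyTorch function):

import torch
import torch.nn as nn

def extend_embedding(old_emb, extra_rows):
    num_old, emb_dim = old_emb.weight.shape
    # nn.Embedding initializes its weight randomly, so the new rows start out random
    new_emb = nn.Embedding(num_old + extra_rows, emb_dim)
    with torch.no_grad():
        new_emb.weight[:num_old] = old_emb.weight  # copy the already trained rows
    return new_emb

emb = nn.Embedding(8, 4)         # e.g. 8 known categories
emb = extend_embedding(emb, 2)   # now 10 rows; the last 2 are untrained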

Sounds like a “user design” question for the team then. Thank you for all the help.