Porting Theano function() with updates to PyTorch (negative sampling RuntimeError: Expected hidden size)

ez_ok · July 2, 2021, 1:39pm

Hi, I’m trying to port code from Theano to PyTorch. To be frank, I have a very limited understanding of how both frameworks actually work, so please bear with me! I would greatly appreciate any help in furthering my understanding.

https://github.com/hidasib/GRU4Rec/blob/master/gru4rec.py#L614

          out_idx = data_items[start+i+1]
          if self.n_sample and store_type == 'cpu':
              if sample_store:
                  if sample_pointer == generate_length:
                      neg_samples = self.generate_neg_samples(pop, generate_length)
                      sample_pointer = 0
                  sample = neg_samples[sample_pointer]
                  sample_pointer += 1
              else:
                  sample = self.generate_neg_samples(pop, 1)
              y = np.hstack([out_idx, sample])
          else:
              y = out_idx
              if self.n_sample:
                  if sample_pointer == generate_length:
                      generate_samples()
                      sample_pointer = 0
                  sample_pointer += 1
          reset = (start+i+1 == end-1)
          cost = train_function(in_idx, y, len(iters), reset.reshape(len(reset), 1))
          c.append(cost)

This is the code I’m trying to port. Part of it has already been ported to PyTorch and can be found here: https://github.com/hungthanhpham94/GRU4REC-pytorch/tree/master/lib

A number of features that exist in the original code are missing from the PyTorch implementation. I’ve already made a number of modifications but have hit a block with regard to negative sampling.

In the original code, a batch size is defined (default = 32) and additional negative samples (default n_sample = 2048 per batch afaik) are stored in GPU memory.

In Theano:

                P = theano.shared(pop.astype(theano.config.floatX), name='P')
                self.ST = theano.shared(np.zeros((generate_length, self.n_sample), dtype='int64'))
                self.STI = theano.shared(np.asarray(0, dtype='int64'))
                X = mrng.uniform((generate_length*self.n_sample,))
                updates_st = OrderedDict()
                updates_st[self.ST] = gpu_searchsorted(P, X, dtype_int64=True).reshape((generate_length, self.n_sample))
                updates_st[self.STI] = np.asarray(0, dtype='int64')
                generate_samples = theano.function([], updates=updates_st)
                generate_samples()
                sample_pointer = 0

The block above creates an array of idxs stored in GPU memory, which I’ve implemented in the DataLoader class as:

def generate_negatives(self):
    P = torch.FloatTensor(self.pop)
    ST = torch.LongTensor(np.zeros((self.generate_length, self.n_sample), dtype='int64'))
    STI = torch.LongTensor(np.asarray(0, dtype='int64'))
    X = torch.rand((self.generate_length * self.n_sample,))
    return torch.searchsorted(P, X).reshape((self.generate_length, self.n_sample))
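
For reference, here is a minimal self-contained sketch of the same sampling idea. It assumes that `pop` is a normalized cumulative popularity vector (sorted ascending, last entry equal to 1), which is what `searchsorted` needs to turn uniform draws into popularity-weighted item indices. The `support` counts and the `sample_alpha` exponent are made-up values for illustration, and the function takes its parameters explicitly rather than from `self`.

```python
import numpy as np
import torch

# Hypothetical item supports (click counts); not taken from the post.
support = np.array([50, 30, 10, 5, 5], dtype=np.float64)
sample_alpha = 0.75  # assumed popularity exponent

# Cumulative, normalized popularity: sorted ascending, last entry is 1.
pop = (support ** sample_alpha).cumsum()
pop = pop / pop[-1]


def generate_negatives(pop, generate_length, n_sample, device="cpu"):
    """Draw a (generate_length, n_sample) table of popularity-weighted item idxs."""
    P = torch.as_tensor(pop, dtype=torch.float32, device=device)
    X = torch.rand(generate_length * n_sample, device=device)
    # For each uniform draw, find the first position in the cumulative
    # distribution that is not smaller than it.
    idx = torch.searchsorted(P, X)
    # Guard against float rounding pushing an index one past the end.
    idx = idx.clamp(max=P.numel() - 1)
    return idx.reshape(generate_length, n_sample)


neg_samples = generate_negatives(pop, generate_length=4, n_sample=8)
print(neg_samples.shape)  # one row of n_sample negatives per training step
```

Passing `device="cuda"` would keep the whole table on the GPU, which is closer to what the Theano shared variable does.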

In Theano, the negative generator is used here:

        while not finished:
               ........
                else:
                    y = out_idx
                    if self.n_sample:
                        if sample_pointer == generate_length:
                            generate_samples()
                            sample_pointer = 0
                        sample_pointer += 1
                reset = (start+i+1 == end-1)
                cost = train_function(in_idx, y, len(iters), reset.reshape(len(reset), 1))

where the train_function is defined as:

train_function = function(inputs=[X, Y, M, R], outputs=cost, updates=updates, allow_input_downcast=True, on_unused_input='ignore')

and an example loss function is as follows:

def bpr(self, yhat, M):
    return T.cast(T.sum(-T.log(T.nnet.sigmoid(gpu_diag(yhat, keepdims=True)-yhat))), theano.config.floatX)

In PyTorch, I’ve attempted to implement the negative generator in the same way:

while not finished:
            minlen = (end - start).min()
            # Item indices(for embedding) for clicks where the first sessions start
            idx_target = df.item_idx.values[start]
            for i in range(minlen - 1):
                # Build inputs & targets
                idx_input = idx_target
                idx_target = df.item_idx.values[start + i + 1]
                if self.n_sample:
                    if sample_pointer == self.generate_length:
                        neg_samples = self.generate_negatives()
                        sample_pointer = 0
                    sample = neg_samples[sample_pointer]
                    sample_pointer += 1
                    # idx_target = np.hstack([idx_target, sample]) # like cpu version (doesn't work due to hidden size)
                input = torch.LongTensor(idx_input)
                target = torch.LongTensor(idx_target)
                yield input, target, mask

The above generator is used in train_epoch method in Trainer class:

if self.n_sample:
    dataloader = DataLoader(self.train_data, self.batch_size, self.n_sample, self.generate_length)
else:
    dataloader = DataLoader(self.train_data, self.batch_size)
for ii, (input, target, mask) in enumerate(dataloader):
    input = input.to(self.device)
    target = target.to(self.device)
    self.optim.zero_grad()
    hidden = reset_hidden(hidden, mask).detach()
    logit, hidden = self.model(input, hidden)
    # output sampling
    logit_sampled = logit[:, target.view(-1)]
    loss = self.loss_func(logit_sampled)
    losses.append(loss.item())
    loss.backward()
    self.optim.step()

The same loss function is defined as:

class BPRLoss(nn.Module):
    def __init__(self):
        super(BPRLoss, self).__init__()
    def forward(self, logit):
        diff = logit.diag().view(-1, 1).expand_as(logit) - logit
        loss = -torch.mean(F.logsigmoid(diff))
        return loss
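
A small usage sketch of this loss on a square score matrix may help make the layout concrete. The 3x3 scores are made up; the assumption is that row i holds session i's scores and column j corresponds to the target item of session j, so the diagonal holds each session's score for its own target.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class BPRLoss(nn.Module):
    def forward(self, logit):
        # Score of each row's own target, compared against every column.
        diff = logit.diag().view(-1, 1).expand_as(logit) - logit
        return -torch.mean(F.logsigmoid(diff))


# Hypothetical scores for a mini-batch of 3 sessions.
logit_sampled = torch.tensor([[2.0, 0.0, -1.0],
                              [0.5, 1.5, 0.0],
                              [-1.0, 0.0, 3.0]])

positives = logit_sampled.diag()   # target scores: 2.0, 1.5, 3.0
loss = BPRLoss()(logit_sampled)    # scalar; off-diagonal entries act as negatives
print(positives, loss.item())
```

Note that the mean also runs over the diagonal entries, where the difference is zero, so they contribute a constant term to the loss.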

From my understanding, in Theano in_idx and y (the input item idxs and target item idxs, respectively) have the same shape. They must for the loss function to work: the diagonal holds the scores of the target items, and the remaining elements are the scores of negative items sampled from the current mini-batch. The same holds in the PyTorch implementation. Given this, how is the loss then calculated on the additional negative samples?

In the Theano CPU implementation (which is deprecated):

y = np.hstack([out_idx, sample])

The GPU implementation:

    def model(self, X, H, M, R=None, Y=None, drop_p_hidden=0.0, drop_p_embed=0.0, predict=False):
        sparams, full_params, sidxs = [], [], []
        if (hasattr(self, 'ST')) and (Y is not None) and (not predict) and (self.n_sample > 0):
            A = self.ST[self.STI]
            Y = T.concatenate([Y, A], axis=0)
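
Read this way, the Theano snippet appears to extend only the list of output idxs that get scored, not the input batch. Below is a rough PyTorch sketch of that reading. `ToyGRU4Rec` is a hypothetical stand-in for the ported model (embedding, GRU, linear scoring layer), and the item count, layer count and hidden size are assumptions chosen to match the shapes mentioned in this post.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

n_items, hidden_size, num_layers = 500, 100, 3   # assumed sizes
batch_size, n_sample = 32, 2048


class ToyGRU4Rec(nn.Module):
    """Hypothetical stand-in: embedding -> GRU -> score for every item."""

    def __init__(self):
        super().__init__()
        self.emb = nn.Embedding(n_items, hidden_size)
        self.gru = nn.GRU(hidden_size, hidden_size, num_layers)
        self.out = nn.Linear(hidden_size, n_items)

    def forward(self, input, hidden):
        x = self.emb(input).unsqueeze(0)             # (1, B, H)
        output, hidden = self.gru(x, hidden)         # (1, B, H)
        return self.out(output.squeeze(0)), hidden   # (B, n_items)


model = ToyGRU4Rec()
input = torch.randint(n_items, (batch_size,))    # stays at batch size 32
target = torch.randint(n_items, (batch_size,))
sample = torch.randint(n_items, (n_sample,))     # one row of the sample table
hidden = torch.zeros(num_layers, batch_size, hidden_size)

logit, hidden = model(input, hidden)
out_idx = torch.cat([target, sample])            # 32 + 2048 = 2080 output idxs
logit_sampled = logit[:, out_idx]                # (32, 2080)

# diag() of a (B, B + N) matrix returns its first B diagonal entries,
# so the target scores are still found on the diagonal.
diff = logit_sampled.diag().view(-1, 1) - logit_sampled
loss = -torch.mean(F.logsigmoid(diff))
loss.backward()
```

In this sketch the GRU only ever sees 32 inputs; the 2048 extra idxs select additional columns of the score matrix and never enter the recurrent state.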

If our batch size were 32 and n_sample were 2048, the above logic (concatenating sample to target) would give an input of size 32 and a target of size 32 + 2048 = 2080, resulting in the following error:

RuntimeError: Expected hidden size (3, 2080, 100), got [3, 32, 100].
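
One way this message can be produced is by passing a 2080-wide batch to a GRU whose hidden state was built for 32 sessions. The sketch below assumes a 3-layer `nn.GRU` with hidden size 100 (inferred from the shapes in the error) and an input feature size of 100 (an arbitrary choice). This could happen if the stacked target array is reused as the next step's input, as `idx_input = idx_target` in the generator would do, though that is a guess about the cause.

```python
import torch
import torch.nn as nn

gru = nn.GRU(input_size=100, hidden_size=100, num_layers=3)

hidden = torch.zeros(3, 32, 100)     # state for 32 parallel sessions
x_ok = torch.randn(1, 32, 100)       # one time step, batch of 32
x_bad = torch.randn(1, 2080, 100)    # as if 32 targets + 2048 negatives were fed in

out, _ = gru(x_ok, hidden)           # batch dimensions agree

err = None
try:
    gru(x_bad, hidden)               # batch of 2080 against a state for 32
except RuntimeError as e:
    err = str(e)
print(err)
```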

How can this dimension mismatch be resolved?

Kind regards