Best way to train on one-hot vectors?

Hey,
I am working on NLP tasks, but this question is about data representation.
For my work I mostly use a one-hot vector representation for my features, such as pos tags. For the moment I naively use basic Tensors as my one-hot vectors.

Training my model is very slow, and I am sure there is a better way to represent one-hot vectors.

Which loss function are you using?
Could you post the output, target shapes and code snippet showing your training routine?

I am using BCELoss because my output should be binary.

An input batch is a (6,1001) shaped Tensor. The 1001 values of each line are mostly zeros. The output is then a (6,0) shaped Tensor.

Does nobody have a good idea for one-hot encoding in PyTorch? I heard about using an embedding layer at the beginning of the network when working with one-hot vectors.

If the one-hot encoded input tensors are representing some indexing, e.g. a word index, you could use an nn.Embedding layer. Have a look at this tutorial for more information.

The one-hot encoded input tensors represent a sequence of pos tags.
One input row is composed of (for my simplest model) three distance numbers and 6 pos tags, which are encoded as one-hot vectors. This gives me a tensor of ~195 values, most of which are zeros.

In that case, do you think it’s a good idea to use an nn.Embedding layer at the beginning of my network?

Any ideas about good practice? Should I encode my pos tags as integers and have an nn.Embedding layer at the beginning of my network? Or should I do the same but with one-hot encoded pos tags?

I’m no expert in NLP, but I would assume using an nn.Embedding layer to get a dense representation of your sparse input data makes sense.
How are your distance numbers encoded? Are they integers as well?
If not, you might want to feed them separately into a linear layer and concatenate them with the embedding tensor.
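
A minimal sketch of that suggestion, assuming one float distance and one pos tag index per row. The class name DistanceAndTag and all layer sizes are made up for illustration and are not taken from the thread:

```python
import torch
import torch.nn as nn


class DistanceAndTag(nn.Module):
    """Hypothetical module: dense features from a float distance and a tag index."""

    def __init__(self, num_tags=3, tag_dim=4, dist_dim=2):
        super().__init__()
        self.tag_embedding = nn.Embedding(num_tags, tag_dim)  # index -> dense vector
        self.dist_fc = nn.Linear(1, dist_dim)                 # float -> dense vector

    def forward(self, dist, tag_idx):
        # dist: (batch, 1) float tensor, tag_idx: (batch,) long tensor
        dist_feat = self.dist_fc(dist)
        tag_feat = self.tag_embedding(tag_idx)
        # Concatenate along the feature dimension, row by row
        return torch.cat((dist_feat, tag_feat), dim=1)


model = DistanceAndTag()
dist = torch.tensor([[0.2], [0.4], [0.7]])
tag_idx = torch.tensor([2, 1, 0])
features = model(dist, tag_idx)
print(features.shape)  # torch.Size([3, 6])
```

Each output row combines the features of one distance with the features of the tag from the same row.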

No, my distance numbers are floats.

The problem is that each distance number has two corresponding pos tags. A sample looks like:
0.2 [0, 0, 1] [0, 0, 1] 0.4 [0, 1, 0] [0, 0, 1] 0.7 [1, 0, 0] [0, 1, 0] 0.5 [0, 0, 1] [1, 0, 0], etc.

If I concatenate the distance numbers with the embedding tensor afterwards, I’ll lose the fact that, for example, 0.2 is related to [0, 0, 1] [0, 0, 1], no?

I’m not sure. Would passing both pos tags to the embedding layer and then concatenating it with the distance make sense? Your embedding layer should output dense representations of the concatenated pos tags, so the relationship between them and the distance should still be there, if I’m not mistaken.

How could I pass my pos tags two by two?
“0.2 [0, 0, 1] [0, 0, 1] 0.4 [0, 1, 0] [0, 0, 1] 0.7 [1, 0, 0] [0, 1, 0] 0.5 [0, 0, 1] [1, 0, 0]” is an example with batchSize = 1.

I’m not sure what “two by two” means.
If you would like to concatenate the pos tags, could you post dummy tensors containing the pos tags and distance for a single sample?

I need my sample to be composed of several [distance postag1 postag2] parts.

I’ve created a small example using an embedding for your pos tags.
I assume your pos tags take valid values from [0, 0, 0] to [1, 1, 1].
Also, I’m treating each combination as one “word”.
Could you check whether this would work for you?

import torch
import torch.nn as nn


def prepare_sequence(seq, word_to_ix):
    idxs = [word_to_ix[str(w.numpy())] for w in seq]
    return torch.tensor(idxs, dtype=torch.long)


class MyModel(nn.Module):
    def __init__(self):
        super(MyModel, self).__init__()
        self.embedding = nn.Embedding(8, 3)
        self.fc = nn.Linear(7, 8)
        
    def forward(self, x_dist, x_pos1, x_pos2):
        x_pos1 = self.embedding(x_pos1)
        x_pos2 = self.embedding(x_pos2)
        x = torch.cat((x_dist, x_pos1, x_pos2), 1)
        x = self.fc(x)
        return x


# Create word_to_ix lookup for [0, 0, 0], [0, 0, 1], ...
word_to_ix = {
    '[{}]'.format(' '.join('{:b}'.format(i).zfill(3))): i
    for i in range(8)
}    

# Create dummy data
nb_samples = 10
x_dist = torch.randn(nb_samples, 1)
x_pos1 = torch.randint(0, 2, (nb_samples, 3))
x_pos2 = torch.randint(0, 2, (nb_samples, 3))

# Prepare sequences ([0, 0, 0] -> 0; [0, 0, 1] -> 1; ...)
x_pos1_idx = prepare_sequence(x_pos1, word_to_ix)
x_pos2_idx = prepare_sequence(x_pos2, word_to_ix)

model = MyModel()
output = model(x_dist, x_pos1_idx, x_pos2_idx)

So let’s suppose my sample is "0.2 [0, 0, 1] [0, 0, 1] 0.4 [0, 1, 0] [0, 0, 1] 0.7 [1, 0, 0] [0, 1, 0]". I should create an x_pos1 tensor [[0, 0, 1], [0, 1, 0], [1, 0, 0]], an x_pos2 tensor [[0, 0, 1], [0, 0, 1], [0, 1, 0]] and an x_dist tensor [0.2, 0.4, 0.7], and then pass them to the network?

Yes, that would be my idea. However, I’m not familiar with your data and use case so your model might completely fail to learn something useful using this approach.
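
For illustration, here is a rough sketch of that sample as tensors. It assumes the tags are strictly one-hot, so torch.argmax is used to get one index per tag (3 classes) instead of the 8-combination lookup above. The embedding size of 4 is arbitrary:

```python
import torch
import torch.nn as nn

# The sample "0.2 [0,0,1] [0,0,1]  0.4 [0,1,0] [0,0,1]  0.7 [1,0,0] [0,1,0]"
x_dist = torch.tensor([0.2, 0.4, 0.7]).unsqueeze(1)        # shape (3, 1)
x_pos1 = torch.tensor([[0, 0, 1], [0, 1, 0], [1, 0, 0]])   # shape (3, 3)
x_pos2 = torch.tensor([[0, 0, 1], [0, 0, 1], [0, 1, 0]])   # shape (3, 3)

# One-hot rows -> class indices (assumes exactly one 1 per row)
pos1_idx = x_pos1.argmax(dim=1)   # tensor([2, 1, 0])
pos2_idx = x_pos2.argmax(dim=1)   # tensor([2, 2, 1])

embedding = nn.Embedding(3, 4)    # 3 tag classes, 4-dim dense vectors (arbitrary)

# Row i holds [distance_i, emb(pos1_i), emb(pos2_i)], so each distance
# stays next to its own two tags.
triples = torch.cat((x_dist, embedding(pos1_idx), embedding(pos2_idx)), dim=1)
print(triples.shape)              # torch.Size([3, 9])

# Flatten the three triples into one feature vector for a batch of size 1
sample = triples.reshape(1, -1)
print(sample.shape)               # torch.Size([1, 27])
```

Because the concatenation happens per row, the pairing between a distance and its two tags is kept even after flattening.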

Hi Ptrblck,

I am using a conditional GAN, where I should convert the extra information (labels) to a one-hot vector, concatenate it with the noise, and feed the result to the generator. I am confused because some code uses an “embedding” to pass the condition (labels). What is the difference between an embedding and a one-hot vector? The original paper says the extra information should be passed as a one-hot vector, but the code I saw uses an embedding.

Embedding layers expect an input with indices (your one-hot encoded tensor would have to be converted to the index representation using e.g. torch.argmax) and output a dense feature tensor.
It depends on your use case, if the sparse one-hot encoded tensor or the dense embedding output works better.
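
A small sketch of that conversion, with made-up sizes (5 classes, 3 embedding dimensions). It also shows that the embedding lookup gives the same result as multiplying the one-hot matrix by the embedding weight:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

num_classes, embed_dim = 5, 3            # illustrative sizes, not from the thread
indices = torch.tensor([0, 3, 4, 1])     # four labels as class indices

# Sparse representation: one row per label, a single 1 per row
one_hot = F.one_hot(indices, num_classes=num_classes).float()   # shape (4, 5)

# Back to the index representation that nn.Embedding expects
recovered = torch.argmax(one_hot, dim=1)

embedding = nn.Embedding(num_classes, embed_dim)
dense = embedding(recovered)                                    # shape (4, 3)

# The lookup selects rows of the weight matrix, which matches
# a matrix product of the one-hot rows with that weight matrix.
via_matmul = one_hot @ embedding.weight
same = torch.allclose(dense, via_matmul)
print(dense.shape, same)
```

So the embedding is a trainable dense lookup on top of the same label information, while the raw one-hot vector is a fixed sparse encoding.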

Hi Ptrblck,

I really appreciate your help. My aim is to run a conditional GAN in which the labels are embedded and concatenated with the real data for the discriminator and with the noise for the generator.
My labels are float numbers from 0 to 100. I convert each float label to a one-hot vector and then, as you said, take the “argmax” to get an index for each label.
The maximum value from “argmax” can be 400. My question: is “nn.Embedding(401, 10)” correct in the code below? Labels44 has size 64 and contains indices from 0 to 400.

class Generator113D_v1(nn.Module):
    def __init__(self,ngpu,nz,ngf):
        super(Generator113D_v1, self).__init__()
        
        ## ---- embedding 401 different numbers from argmax one-hot vector to dim of 10
        self.embedding=nn.Embedding(401, 10)

        self.ngpu=ngpu
        self.nz=nz
        self.ngf=ngf
        self.l1= nn.Sequential(
            nn.ConvTranspose3d(self.nz+10, self.ngf * 8, 3, 1, 0, bias=False),
            nn.BatchNorm3d(self.ngf * 8),
            nn.ReLU(True),)
        self.l2=nn.Sequential(nn.ConvTranspose3d(self.ngf * 8, self.ngf * 4, 3, 2, 0, bias=False),
            nn.BatchNorm3d(self.ngf * 4),
            nn.ReLU(True),)
        self.l3=nn.Sequential(nn.ConvTranspose3d( self.ngf * 4, self.ngf * 2, 3, 1, 0, bias=False),
            nn.BatchNorm3d(self.ngf * 2),
             nn.ReLU(True),)
        self.l4=nn.Sequential(nn.ConvTranspose3d( self.ngf*2, 1, 3, 1, 0, bias=False),
            nn.Sigmoid(),)

    def forward(self, input,Labels):

        Out1=self.embedding(Labels)
        print("1",Out1.shape)
        ## ---- concatenate labels and noise from channels
        Out1=Out1.unsqueeze(2).unsqueeze(3).unsqueeze(4)
        Out2=torch.cat((Out1,input),1)
        # print("2",Out2.shape)

        out=self.l1(Out2)
        # print("3",out.shape)

        out=self.l2(out)
        out=self.l3(out)
        out=self.l4(out)

        return out

class Discriminator4layer113D(nn.Module):
    def __init__(self, ngpu,ndf):
        super(Discriminator4layer113D, self).__init__()
        
        ## ---- embedding 401 different numbers from argmax one-hot vector to dim of 10
        
        self.embedding=nn.Embedding(401, 10)
        ## ---- linear layer is created once here so that its weights are registered and trained;
        ## ---- 11*11*11 matches the reshape in forward (assumes 11x11x11 input volumes)
        self.label_fc=nn.Linear(10, 11*11*11)
        
        self.ngpu = ngpu
        self.ndf=ndf
        self.l1= nn.Sequential(nn.Conv3d(2, self.ndf, 3, 1, 0, bias=False),nn.LeakyReLU(0.2, inplace=True))
        self.l2=nn.Sequential(nn.Conv3d(self.ndf, self.ndf * 2, 3, 1, 0, bias=False),nn.BatchNorm3d(ndf * 2),nn.LeakyReLU(0.2, inplace=True))
        self.drop_out2 = nn.Dropout(0.5)
        self.l3= nn.Sequential(nn.Conv3d(self.ndf * 2, self.ndf * 4, 3, 2, 0, bias=False), nn.BatchNorm3d(ndf * 4), nn.LeakyReLU(0.2, inplace=True))
        self.drop_out3 = nn.Dropout(0.5)

        self.l4= nn.Sequential(nn.Conv3d(self.ndf * 4, 1, 3, 1, 0, bias=False),nn.Sigmoid())


    def forward(self, x,Labels):
        
         Out1=self.embedding(Labels)
         # print("d1",Out1.shape)
         # print(Out1)
         ## apply linear layer to convert the size of embedded number to the input size
         Out2= self.label_fc(Out1)
         # print("d2",Out2.shape)

         ## ---- reshape the label size to the size of input for concatenation
         Out3=Out2.view(-1,11,11,11).unsqueeze(1)
         # print("d3",Out3.shape)

         ## ---- concatenate labels and inputs  
         Out4=torch.cat((x,Out3),1)
 
         out = self.l1(Out4)
         out=self.l2(out)
         out=self.drop_out2(out)
         out=self.l3(out)
         out=self.drop_out3(out)
         out=self.l4(out)

         return out

 

def make_one_hot(volumein):
    
        ret44= np.array([])
        ret44=torch.from_numpy(ret44)
        for ii in range(volumein.shape[0]):
        
            volume=volumein[ii]
            
            maxRange = 100
            discretisation = .25
            ## the steps based on the 0.25 for each integer 
            nUnit =int(1/discretisation) 
            #this thing power -1 should be an integer
            sizeHot = nUnit*(maxRange)+1
            index = int(np.round(volume*nUnit))
            # print('imma putting there in ',index*discretisation)
            ret = np.zeros(sizeHot)
            ret[index] = 1
            
            ret22=torch.from_numpy(ret)
            
            ret33=ret22.unsqueeze(0)
            ret44=torch.cat((ret44,ret33),0)
            
        return ret44
    

batchsize=64
## --labels are float number from 0 to 100
Labels11=( (100-1)*torch.rand( batchsize))+1
##-- the output size is 64x401 from (OneHoted)
OneHoted=make_one_hot(Labels11)
Labels33=OneHoted
## ---- size Labels44 is 64
Labels44=torch.argmax(Labels33,1)

for epoch in range(num_epochs):
    netD.zero_grad()
    netD=netD.float()
        
    b_size = real_cpu.size(0)
    label = torch.full((b_size,), real_label, dtype=torch.float, device=device)
    
    output = netD(real_cpu,Labels44).view(-1)

    errD_real = criterion(output, label)
    errD_real.backward()

    noise = torch.randn(b_size, nz,1, 1, 1, device=device)    
    netG=netG.float()
    
    label.fill_(fake_label)

    fake = netG(noise,Labels44).to(device)
    
    output = netD(fake.detach(),Labels44).view(-1)
    
    errD_fake = criterion(output, label)
    errD_fake.backward()
    # Update D
    optimizerD.step()

    # (2) Update G network
    ###########################
    netG.zero_grad()
    label.fill_(real_label) 
    output = netD(fake,Labels44).view(-1)
    errG = criterion(output, label)
    errG.backward()
    # Update G
    optimizerG.step()
   

If I understand your use case correctly, you are working with floating point “labels” in the range [0, 100] in steps of 0.25, so e.g. 0.25 would be a valid label?
If that’s the case and you are indeed working with only these discrete values, then your workflow should be correct. The embedding layer needs num_embeddings to be at least the largest index plus one, so if index 400 can occur, nn.Embedding(401, 10) is the right size.
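
As a quick check of the index arithmetic in make_one_hot, here is a plain Python sketch. The helper name label_to_index is made up for illustration. It uses the same discretisation (0.25) and maximum range (100) as the posted code:

```python
def label_to_index(value, discretisation=0.25, max_range=100):
    """Map a float label to its embedding index and report the table size needed."""
    n_unit = int(1 / discretisation)            # 4 steps per unit
    num_embeddings = n_unit * max_range + 1     # indices 0..400 -> 401 entries
    index = int(round(value * n_unit))
    if not 0 <= index < num_embeddings:
        raise ValueError("label {} maps to index {}, outside [0, {})".format(
            value, index, num_embeddings))
    return index, num_embeddings


print(label_to_index(0.25))    # (1, 401)
print(label_to_index(100.0))   # (400, 401)
```

With a step of 0.25 on [0, 100] the indices run from 0 to 400, which is why the table needs 401 rows. The index could also be computed directly from the float label like this, without building the one-hot vector first.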
