Using a word embedding model for inference - question

Hi everybody, a Python newbie question and a request for guidance:

I want to adapt a couple of functions from a GitHub Python script so I can run an inference job with an already saved .pth model that was trained with word embeddings and a vocabulary (https://github.com/caolingyu/Purr/blob/master/inference.py). The main method of the file, do_inference, takes a string as a parameter and decomposes it into characters (this is my main problem, since I need a word decomposition, not a character decomposition). It then looks up every item in a vocabulary file, converts it to an index that is accumulated in an array, passes that array to a torch model loaded from file, and writes back the result. I need to change the two methods, input2array and do_inference, to work with whole words instead of characters. I started on the first part of do_inference and can tokenize the input string into words (see the small example below), but further on I am stuck and can't seem to get through it.
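Just to illustrate what I mean by word decomposition versus character decomposition (the sentence below is only a made-up example):

text = "the cat sat on the mat"
list(text)     # character decomposition: ['t', 'h', 'e', ' ', 'c', ...]
text.split()   # word decomposition, which is what I need: ['the', 'cat', 'sat', 'on', 'the', 'mat']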

So, the two methods (do_inference calls input2array at some point) are:

def do_inference(data):
    batch_data = []
    result_total = []
    for i in data:  # here the method takes the input character by character; I can tokenize it into a word array instead
        text = i.strip()
        batch_data.append(text)
        if len(batch_data) == BATCH_SIZE:  # BATCH_SIZE: I don't understand how this batching works…
            result = [[] for i in range(len(batch_data))]
            text = input2array(batch_data)
            data = Variable(torch.LongTensor(text), volatile=True)  # set volatile=True for inference
            model.train(False)
            output, _, _ = model(data, target=None)
            output = output.data.cpu().numpy()
            output = np.round(output)
            non_zero_index = np.nonzero(output)

            for i, x in enumerate(non_zero_index[0]):
                result[x].append(ind2l[non_zero_index[1][i]])

            result_total.extend(result)
            batch_data = []

    if len(batch_data) != 0:
        result = [[] for i in range(len(batch_data))]
        text = input2array(batch_data)
        data = Variable(torch.LongTensor(text), volatile=True)
        model.train(False)
        output, _, _ = model(data, target=None)
        output = output.data.cpu().numpy()
        output = np.round(output)
        non_zero_index = np.nonzero(output)

        for i, x in enumerate(non_zero_index[0]):
            result[x].append(ind2l[non_zero_index[1][i]])

        result_total.extend(result)

    return result_total

#############################################

def input2array(batch_input):
    data = []
    max_len = 0
    for item in batch_input:
        text = [int(w2ind[w]) if w in w2ind else len(w2ind)+1 for w in item.split()]
        data.append(text)
        max_len = max(len(text), max_len)
    padded_data = pad_input(data, max_len)
    data = np.array(padded_data)
    return data

#####################################################
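For what it's worth, this is roughly how far I got with the word-level version of the first part of do_inference (only a rough sketch of my own attempt; do_inference_words is just my working name, not from the script):

def do_inference_words(data):
    batch_data = []
    result_total = []
    for sentence in data:
        # split each input into whole words instead of going character by character
        words = sentence.strip().split()
        batch_data.append(words)
        # ... and here I get stuck: I don't see how to accumulate BATCH_SIZE
        # word lists into a batch, or what shape input2array expects from them
    return result_total

Is this the right direction, and how would the BATCH_SIZE block and input2array have to change to work with word lists instead of characters?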

Thanks to everybody,

Radu