Hi everybody, a Python newbie question asking for guidance:

I want to adapt a couple of functions from a GitHub Python script for an inference job using an already saved .pth model that was trained with word embeddings and a vocabulary (https://github.com/caolingyu/Purr/blob/master/inference.py). The file's main method, do_inference, takes a string as a parameter and decomposes it into characters (this is my main problem, since I need word decomposition, not character decomposition), looks up every item in a vocabulary file, converts it to an index that is accumulated in an array, passes that array to a torch model (loaded from file), and then writes back the result. I need to change the two methods, input2array and do_inference, to work with whole words, not characters. I started on the first part of do_inference and can tokenize the input string into words, but beyond that I am stuck and can't seem to get through.
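To make the word-versus-character point concrete, the difference is only in how the input string is decomposed before the vocabulary lookup; `str.split()` is the simplest word tokenizer (real vocabularies may additionally need lowercasing or punctuation handling):

```python
# Character-level iteration (what the original script does) vs.
# word-level tokenization (what is needed here).
text = "the cat sat"

chars = [c for c in text]   # character decomposition: 't', 'h', 'e', ...
words = text.split()        # word decomposition on whitespace

print(chars[:3])  # ['t', 'h', 'e']
print(words)      # ['the', 'cat', 'sat']
```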

So, the two methods (do_inference calls input2array at some point) are:

```
def do_inference(data):
    batch_data = []
    result_total = []
    for i in data:  # here the method takes character by character; I can tokenize it into a word array
        text = i.strip()
        batch_data.append(text)
        if len(batch_data) == BATCH_SIZE:  # BATCH_SIZE I don't understand how it works...
            result = [[] for i in range(len(batch_data))]
            text = input2array(batch_data)
            data = Variable(torch.LongTensor(text), volatile=True)  # set volatile=True for inference
            model.train(False)
            output, _, _ = model(data, target=None)
            output = output.data.cpu().numpy()
            output = np.round(output)
            non_zero_index = np.nonzero(output)
            for i, x in enumerate(non_zero_index[0]):
                result[x].append(ind2l[non_zero_index[1][i]])
            result_total.extend(result)
            batch_data = []
    if len(batch_data) != 0:
        result = [[] for i in range(len(batch_data))]
        text = input2array(batch_data)
        data = Variable(torch.LongTensor(text), volatile=True)
        model.train(False)
        output, _, _ = model(data, target=None)
        output = output.data.cpu().numpy()
        output = np.round(output)
        non_zero_index = np.nonzero(output)
        for i, x in enumerate(non_zero_index[0]):
            result[x].append(ind2l[non_zero_index[1][i]])
        result_total.extend(result)
    return result_total
```
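On the BATCH_SIZE question: the loop just accumulates inputs into a list until it holds BATCH_SIZE items, runs one forward pass on that whole batch, and resets the list; the trailing `if len(batch_data) != 0` flushes the last, partially filled batch. (Side note: `volatile=True` is deprecated in PyTorch 0.4+, where `with torch.no_grad():` is the replacement for inference.) Here is a minimal runnable sketch of just that batching flow, with the model call replaced by a stand-in `run_model` function (my name, not from the script), and an arbitrary BATCH_SIZE:

```python
BATCH_SIZE = 3

def run_model(batch):
    # Stand-in for input2array + the torch forward pass; just echoes the batch.
    return [s.upper() for s in batch]

def do_inference_sketch(data):
    batch_data, result_total = [], []
    for line in data:
        batch_data.append(line.strip())
        if len(batch_data) == BATCH_SIZE:  # full batch: run it, then reset
            result_total.extend(run_model(batch_data))
            batch_data = []
    if batch_data:                         # flush the leftover partial batch
        result_total.extend(run_model(batch_data))
    return result_total

print(do_inference_sketch(["a", "b", "c", "d"]))  # ['A', 'B', 'C', 'D']
```

With four inputs and BATCH_SIZE = 3, the model runs twice: once on the full batch of three and once on the leftover single item.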

#############################################

```
def input2array(batch_input):
    data = []
    max_len = 0
    for item in batch_input:
        text = [int(w2ind[w]) if w in w2ind else len(w2ind) + 1 for w in item.split()]
        data.append(text)
        max_len = max(len(text), max_len)
    padded_data = pad_input(data, max_len)
    data = np.array(padded_data)
    return data
```
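For the word-level conversion itself, here is a self-contained sketch of what input2array does once the lookup is per word: a toy `w2ind` vocabulary (mine, for illustration), out-of-vocabulary words mapped to `len(w2ind) + 1` as in the snippet above, and right-padding with 0 in place of `pad_input`. The real OOV index and padding value must of course match whatever the saved model was trained with:

```python
import numpy as np

w2ind = {"the": 1, "cat": 2, "sat": 3}  # toy vocabulary for illustration
UNK = len(w2ind) + 1                    # OOV index, as in the original snippet

def input2array_words(batch_input):
    data, max_len = [], 0
    for item in batch_input:
        # word-level (not character-level) vocabulary lookup
        text = [w2ind.get(w, UNK) for w in item.split()]
        data.append(text)
        max_len = max(len(text), max_len)
    # right-pad every row with 0 up to the longest sentence in the batch
    padded = [row + [0] * (max_len - len(row)) for row in data]
    return np.array(padded)

print(input2array_words(["the cat sat", "the dog"]))
# [[1 2 3]
#  [1 4 0]]
```

"dog" is not in the toy vocabulary, so it becomes the UNK index 4, and the shorter sentence is padded to the batch's max length of 3.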

#####################################################

Thanks to everybody,

Radu