I am trying to sample from a character-level RNN (char-RNN) model in a small web app. The sampling works perfectly fine; the problem is performance. It is too slow to use beyond fairly short sample lengths: on the container I'm serving from, it can take tens of seconds to sample a thousand characters. This needs to happen in real time, so I'm wondering, am I missing a trick?
Unsurprisingly, the majority of the time is spent in the loop over the requested number of characters, which looks like this:
```python
for p in range(predict_len):
    output, hidden = model(inp, hidden)

    # Sample from the network as a multinomial distribution
    output_dist = output.data.view(-1).div(temperature).exp()
    top_i = torch.multinomial(output_dist, 1).item()

    # Add predicted character to string and use as next input
    predicted_char = index_to_char[top_i]
    predicted += predicted_char
    inp = index_to_tensor(char_to_index[predicted_char])
```
And specifically this line:
```python
output, hidden = model(inp, hidden)
```
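For reference, the lookup tables and the `index_to_tensor` helper that the loop relies on look roughly like this (simplified; in my actual code they are built at training time, and `input.txt` here is just a placeholder for the training corpus):

```python
import torch

# Simplified versions of the helpers used by the sampling loop.
# "input.txt" stands in for the corpus the vocabulary was built from.
corpus_text = open("input.txt", encoding="utf-8").read()
vocab = sorted(set(corpus_text))
char_to_index = {c: i for i, c in enumerate(vocab)}
index_to_char = {i: c for c, i in char_to_index.items()}

def index_to_tensor(index):
    # One character index as a (1, 1) LongTensor, fed to the model one step at a time.
    return torch.tensor([[index]], dtype=torch.long)
```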
Is there any way to speed this up?
- I see a significant (though still not ideal) speed-up when using a GPU as opposed to a CPU; however, no GPU is available to me for serving the web app.
- The model checkpoint is 230 MB on disk, which is obviously not small, so if there is a way to shrink it after training, that would be an option as well (I'm not sure how, though; the only idea I've seen mentioned is dynamic quantization, and a rough, unverified sketch of that is after this list).
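In case it helps frame answers, this is roughly what I understand dynamic quantization would look like. I haven't tested it on my model, and the `CharRNN` class below is just a simplified stand-in for my actual architecture (made-up sizes), not the real code:

```python
import torch
import torch.nn as nn

# Placeholder for my actual char-RNN (embedding -> recurrent layer -> linear);
# sizes and layer choices here are illustrative only.
class CharRNN(nn.Module):
    def __init__(self, vocab_size=100, hidden_size=512, n_layers=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden_size)
        self.rnn = nn.LSTM(hidden_size, hidden_size, n_layers)
        self.out = nn.Linear(hidden_size, vocab_size)

    def forward(self, inp, hidden):
        emb = self.embed(inp)
        output, hidden = self.rnn(emb, hidden)
        return self.out(output), hidden

model = CharRNN()
model.eval()

# Dynamic quantization stores Linear/LSTM weights as int8 and dequantizes on
# the fly, which shrinks the saved weights and can speed up CPU matmuls.
quantized = torch.quantization.quantize_dynamic(
    model, {nn.Linear, nn.LSTM}, dtype=torch.qint8
)
torch.save(quantized.state_dict(), "checkpoint_int8.pt")
```

Whether this actually helps CPU sampling speed in practice is part of what I'm asking.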
More broadly, how do people serve models in real time when sampling is this slow? Or is this just par for the course, and I'm not going to be able to optimise it much?
Thanks in advance for any advice!