Can I use deep learning to measure the similarity between two variable-length voice sequences?

We usually use DTW (Dynamic Time Warping) to measure the similarity between two variable-length voice sequences. However, DTW is time-consuming and not easy to run on the GPU because it involves too much control flow. I would like to find a deep learning algorithm to measure the similarity. Does anybody have ideas? Thanks :)

What kind of input are you using for DTW?
Some kind of mel-frequency cepstral coefficients (MFCCs)?
If the DTW works sufficiently well and you have the data, you could try to train a model to predict the DTW score for two voice sequences.
If your DTW results aren’t good enough, you would somehow have to obtain the “ground truth” similarity for your data.
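
For the first suggestion (training a model to predict the DTW score), the regression targets could be generated roughly as in the sketch below. It is a minimal pure-Python illustration, not a tuned implementation: the names `frame_distance`, `dtw_score` and `build_training_pairs` are made up for this example, the Euclidean frame distance is an assumption, and there is no windowing or length normalization. The nested loop over `i` and `j` is also the data-dependent control flow that makes DTW awkward to run on a GPU.

```python
from itertools import combinations
from math import inf, sqrt


def frame_distance(x, y):
    """Euclidean distance between two feature frames (assumed metric)."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def dtw_score(seq_a, seq_b):
    """Plain DTW alignment cost, O(len(seq_a) * len(seq_b)) time."""
    n, m = len(seq_a), len(seq_b)
    # cost[i][j] = cheapest alignment of seq_a[:i] with seq_b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = frame_distance(seq_a[i - 1], seq_b[j - 1])
            # Extend the cheapest of the three allowed predecessor cells.
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]


def build_training_pairs(sequences):
    """Label every unordered pair of sequences with its DTW score."""
    return [(a, b, dtw_score(a, b)) for a, b in combinations(sequences, 2)]


# Toy 1-D "features"; real inputs would be per-frame vectors such as MFCCs.
utterances = [
    [[0.0], [1.0], [2.0]],
    [[0.0], [1.0], [1.0], [2.0]],
    [[3.0], [3.0]],
]
pairs = build_training_pairs(utterances)
```

Each element of `pairs` is an `(input_a, input_b, target)` triple that a model taking two variable-length sequences could be trained to regress. Whether such a model approximates DTW well enough would have to be checked on held-out pairs.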