Detecting word boundaries in speech using LSTMs

I have a training dataset that contains speech recordings (.wav files) of several hundred speakers
speaking English sentences. Each .wav file is paired with a word file (.wrd file) that lists the words in the sentence along with their temporal boundaries, i.e. the start and end time of each word. The audio is sampled at 16000 Hz.
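
To make the data layout concrete, here is a minimal sketch of how the .wrd annotations could be turned into per-frame boundary targets for a sequence model. It rests on assumptions that are not confirmed above: each .wrd line has the form `<start> <end> <word>`, the boundaries are given as sample indices (multiply by 16000 first if they are in seconds), and features are computed with a 10 ms hop (160 samples). The function names `parse_wrd` and `boundary_targets` are made up for illustration.

```python
SAMPLE_RATE = 16000  # sampling rate stated in the question
FRAME_SHIFT = 160    # assumption: 10 ms hop between feature frames


def parse_wrd(lines):
    """Parse lines of the assumed form '<start_sample> <end_sample> <word>'."""
    entries = []
    for line in lines:
        parts = line.split()
        if len(parts) != 3:
            continue  # skip blank or malformed lines
        entries.append((int(parts[0]), int(parts[1]), parts[2]))
    return entries


def boundary_targets(entries, num_samples, frame_shift=FRAME_SHIFT):
    """Return one 0/1 label per frame; 1 marks a frame holding a word start or end."""
    num_frames = (num_samples + frame_shift - 1) // frame_shift  # ceiling division
    labels = [0] * num_frames
    for start, end, _word in entries:
        for sample in (start, end):
            # Clamp so a boundary at the very last sample maps to the last frame.
            frame = min(sample // frame_shift, num_frames - 1)
            labels[frame] = 1
    return labels


# Hypothetical one-second utterance with three words.
example = ["0 3200 she", "3200 8000 had", "8000 16000 your"]
entries = parse_wrd(example)
labels = boundary_targets(entries, num_samples=SAMPLE_RATE)
```

Targets like these would suit a frame-level classifier (for example a bidirectional LSTM with a per-frame sigmoid output). A CTC-style model would instead take only the word or character sequence as its target and leave the alignment to be inferred.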

I wish to determine the word boundaries in a new .wav file given the list of words that are spoken.
Please suggest some approaches and references on how I should proceed.

I have some idea that this problem can be solved with approaches like connectionist temporal classification (CTC), but I don’t know how to proceed. Any help would be appreciated.
Thanks a lot!!

Please comment with your suggestions