I am using an LSTM model with 3 hidden layers to produce speech embeddings. The inputs to the model are MFCC features computed over a 0.025-second window with a hop length of 0.01 seconds. I have two inputs for inference, with the following sizes:
- torch.Size([54, 24, 40])
- torch.Size([439, 24, 40])
The first 18 frames of both inputs are the same. I load the model as follows:
    embedder_net = SpeechEmbedder()
    embedder_net.load_state_dict(torch.load(hp.model.model_path))
    embedder_net.eval()
The state_dict looks like the following:
    LSTM_stack.weight_hh_l0 torch.Size([3072, 768])
    LSTM_stack.bias_ih_l0 torch.Size()
    LSTM_stack.bias_hh_l0 torch.Size()
    LSTM_stack.weight_ih_l1 torch.Size([3072, 768])
    LSTM_stack.weight_hh_l1 torch.Size([3072, 768])
    LSTM_stack.bias_ih_l1 torch.Size()
    LSTM_stack.bias_hh_l1 torch.Size()
    LSTM_stack.weight_ih_l2 torch.Size([3072, 768])
    LSTM_stack.weight_hh_l2 torch.Size([3072, 768])
    LSTM_stack.bias_ih_l2 torch.Size()
    LSTM_stack.bias_hh_l2 torch.Size()
    projection.weight torch.Size([256, 768])
    projection.bias torch.Size()
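For reference, a listing of this kind can be produced by iterating over `state_dict()` and printing each tensor's shape. The sketch below is only an illustration: the small `nn.LSTM` is a stand-in rather than the actual `SpeechEmbedder`, the hidden size of 8 is chosen only to keep it fast, and `describe_state_dict` is a hypothetical helper name.

```python
import torch
import torch.nn as nn

# Stand-in module (an assumption, not the real SpeechEmbedder):
# 40 input features, 3 stacked layers, tiny hidden size for speed.
lstm = nn.LSTM(input_size=40, hidden_size=8, num_layers=3, batch_first=True)

def describe_state_dict(module):
    """Return (name, shape) pairs for every tensor in the module's state_dict."""
    return [(name, tuple(t.shape)) for name, t in module.state_dict().items()]

shapes = describe_state_dict(lstm)
for name, shape in shapes:
    print(name, shape)
```

Each LSTM layer contributes four tensors (`weight_ih`, `weight_hh`, `bias_ih`, `bias_hh`), and the first dimension of each is four times the hidden size because the four gates are stacked together.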
When I use the model for inference on the two inputs, I get results of the following size:
- torch.Size([54, 256])
- torch.Size([439, 256])
Since the first 18 frames are the same in both inputs, I expect the first 18 embeddings in both results to be the same, but this is not the case.
In fact, the results in both cases are very different from the result I get when I pass just those 18 frames as the input. Any idea why that would happen?
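The comparison described above can be sketched as follows. This is a minimal, hedged reproduction rather than the real setup: `StandInEmbedder` is a hypothetical stand-in for `SpeechEmbedder` with random weights and smaller layer sizes, it assumes `batch_first=True` (so the first dimension is treated as the batch), it takes the last time step as the embedding, and `first_rows_match` is a made-up helper name.

```python
import torch
import torch.nn as nn

class StandInEmbedder(nn.Module):
    """Hypothetical stand-in for SpeechEmbedder (sizes reduced for speed)."""
    def __init__(self, n_mfcc=40, hidden=64, proj=32):
        super().__init__()
        # Assumption: the first input dimension is the batch dimension.
        self.LSTM_stack = nn.LSTM(n_mfcc, hidden, num_layers=3, batch_first=True)
        self.projection = nn.Linear(hidden, proj)

    def forward(self, x):
        out, _ = self.LSTM_stack(x)          # (batch, time, hidden)
        return self.projection(out[:, -1])   # last time step -> (batch, proj)

def first_rows_match(model, a, b, n, atol=1e-5):
    """Check whether the first n output rows agree for inputs a and b."""
    model.eval()
    with torch.no_grad():
        return torch.allclose(model(a)[:n], model(b)[:n], atol=atol)

torch.manual_seed(0)
model = StandInEmbedder()
shared = torch.randn(18, 24, 40)                        # the 18 common entries
short = torch.cat([shared, torch.randn(36, 24, 40)])    # shape (54, 24, 40)
long = torch.cat([shared, torch.randn(421, 24, 40)])    # shape (439, 24, 40)

matches = first_rows_match(model, short, long, n=18)
print(matches)
```

Running the same helper on the real model and inputs gives a direct yes/no answer on whether the first 18 embeddings agree within a small tolerance, which is more reliable than comparing printed values by eye.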