I am currently working on a network for speech sentiment analysis, and I want to use an LSTM-based architecture.
My data has shape (10039, 4, 68): 10039 samples, each with 4 timesteps and 68 features per timestep.
From what I have seen in the docs, an LSTMCell refers to this:
https://pytorch.org/docs/stable/generated/torch.nn.LSTMCell.html
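To make sure I read the docs right: a single LSTMCell takes one timestep for the whole batch at a time. A minimal sketch with made-up sizes:

import torch
from torch import nn

cell = nn.LSTMCell(input_size=68, hidden_size=50)
x_t = torch.randn(32, 68)        # one timestep for a batch of 32 samples
h = torch.zeros(32, 50)          # hidden state, one row per sample
c = torch.zeros(32, 50)          # cell state
h, c = cell(x_t, (h, c))         # h and c keep shape (32, 50)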
I believe my implementation below to be correct. At each epoch in training, I reinitialize my hidden states and retrieve from my whole dataset (10039 samples) a batch of, for example, 32 samples. These 32 samples go into the network, and for each sample I go timestep by timestep up to the last one, feeding in its 68 features and keeping the hidden state (h_t2) from the last layer in the network. I do this for each sample in my batch, then concatenate the outputs from all samples and return them.
However, I am confused: when I researched how to approach this problem, I didn't find anyone doing what I am doing. Everyone else feeds the LSTM inputs of shape (timesteps, num_features). Also, with my approach the hidden states have shape (1, hidden_size), which seems odd. Still, if I feed the LSTMCell the way everyone else does, won't it completely miss the point?
I mean, if you don't feed the cells timestep by timestep and just give them the whole thing of shape (timesteps, num_features), aren't you losing the information the LSTM cells gradually accumulate over time? Or, if you give them (timesteps, num_features), does the PyTorch implementation already handle the stepping internally?
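For reference, this is what I understand the (timesteps, num_features) approach to look like with nn.LSTM: a sketch assuming batch_first=True, where the loop over timesteps happens inside the module:

import torch
from torch import nn

lstm = nn.LSTM(input_size=68, hidden_size=50, batch_first=True)
x = torch.randn(32, 4, 68)       # (batch, timesteps, features)
out, (h_n, c_n) = lstm(x)        # out: (32, 4, 50), one output per timestep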
Basically, if I want to implement a 3-layer LSTM with hidden sizes (50, 25, 25), is this the correct way to do it in PyTorch?
import torch
from torch import nn

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
print(device)

class AudioLSTM(nn.Module):
    def __init__(self, batch_size, timesteps, feature_size, hidden_size, dropout):
        super(AudioLSTM, self).__init__()
        self.hidden_size = hidden_size
        self.timesteps = timesteps
        self.batch_size = batch_size
        self.feature_size = feature_size
        self.batch_norm = nn.BatchNorm1d(self.timesteps)
        self.lstm1 = nn.LSTMCell(feature_size, hidden_size[0])
        self.lstm2 = nn.LSTMCell(hidden_size[0], hidden_size[1])
        self.lstm3 = nn.LSTMCell(hidden_size[1], hidden_size[2])
        self.dropout = nn.Dropout(dropout)

    def init_hidden(self):
        # one (h, c) pair per layer; batch dimension is 1 because I feed one sample at a time
        h_t0 = torch.zeros(1, self.hidden_size[0], dtype=torch.float32)  # .to(device)
        c_t0 = torch.zeros(1, self.hidden_size[0], dtype=torch.float32)
        h_t1 = torch.zeros(1, self.hidden_size[1], dtype=torch.float32)
        c_t1 = torch.zeros(1, self.hidden_size[1], dtype=torch.float32)
        h_t2 = torch.zeros(1, self.hidden_size[2], dtype=torch.float32)
        c_t2 = torch.zeros(1, self.hidden_size[2], dtype=torch.float32)
        return [(h_t0, c_t0), (h_t1, c_t1), (h_t2, c_t2)]

    def forward(self, input, hidden):
        input = self.batch_norm(input)
        outputs = []
        for b in range(self.batch_size):          # sample by sample
            outputs_t = []
            for t in range(self.timesteps):       # timestep by timestep
                input_t = input[b][t]
                input_t = input_t.view(1, self.feature_size)
                # hidden[i] = (h_ti, c_ti) for layer i
                hidden[0] = self.lstm1(input_t, hidden[0])
                hidden[1] = self.lstm2(hidden[0][0], hidden[1])
                hidden[2] = self.lstm3(hidden[1][0], hidden[2])
                output = hidden[2][0]             # h of the last layer
                outputs_t.append(output)
            outputs_t = torch.cat(outputs_t, dim=1)
            outputs_t = outputs_t.reshape(1, -1).squeeze()  # "flatten", like Keras Flatten()
            outputs.append(outputs_t)
        outputs = torch.stack(outputs)
        outputs = self.dropout(outputs)
        return outputs
model = AudioLSTM(batch_size=32, timesteps=4, feature_size=68, hidden_size=(50, 25, 25), dropout=0.3)

# toy example training
a = torch.arange(10039 * 4 * 68).reshape(10039, 4, 68).type(torch.FloatTensor)
batch_size = 32
for epoch in range(10):
    hidden = model.init_hidden()  # reinitialize hidden states once per epoch
    # split a into batches of batch_size (dropping the last incomplete batch)
    for i in range(0, a.size(0) - batch_size + 1, batch_size):
        batch = a[i:i + batch_size]
        out = model(batch, hidden)
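For comparison, the variant I considered instead of the per-sample loop: stepping the cells over the timesteps for the whole batch at once, so the hidden states get shape (batch_size, hidden) instead of (1, hidden). A sketch (here init_hidden would need to take the batch size and allocate batch-sized states):

def forward_batched(self, x, hidden):       # x: (batch, timesteps, features)
    # hypothetical alternative forward: all samples at once, one timestep per iteration
    x = self.batch_norm(x)
    outputs = []
    for t in range(self.timesteps):
        hidden[0] = self.lstm1(x[:, t, :], hidden[0])
        hidden[1] = self.lstm2(hidden[0][0], hidden[1])
        hidden[2] = self.lstm3(hidden[1][0], hidden[2])
        outputs.append(hidden[2][0])         # (batch, hidden_size[2])
    out = torch.stack(outputs, dim=1)        # (batch, timesteps, hidden_size[2])
    return self.dropout(out.flatten(1))      # flatten like the Keras Flatten()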
Keras implementation I am trying to emulate:
# speech network
input_speech = Input(shape=(4, 68), name='speech_input')
net_speech = BatchNormalization()(input_speech)
net_speech = LSTM(50, return_sequences=True)(net_speech)
net_speech = LSTM(25, return_sequences=True)(net_speech)
net_speech = LSTM(25, return_sequences=True)(net_speech)
net_speech = Flatten()(net_speech)
model_speech = Dropout(0.3)(net_speech)
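And my best guess at a direct nn.LSTM translation of that Keras stack, as a sketch. I am not sure about the BatchNorm axis: Keras BatchNormalization normalizes the feature axis by default, so here I transpose before BatchNorm1d rather than normalizing over timesteps as in my code above.

import torch
from torch import nn

class SpeechLSTM(nn.Module):
    """Sketch of a direct translation of the Keras stack above."""
    def __init__(self, features=68, dropout=0.3):
        super().__init__()
        self.bn = nn.BatchNorm1d(features)   # normalize the feature axis, as Keras does
        self.lstm1 = nn.LSTM(features, 50, batch_first=True)
        self.lstm2 = nn.LSTM(50, 25, batch_first=True)
        self.lstm3 = nn.LSTM(25, 25, batch_first=True)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):                     # x: (batch, timesteps, features)
        x = self.bn(x.transpose(1, 2)).transpose(1, 2)
        x, _ = self.lstm1(x)                  # keeping all timesteps == return_sequences=True
        x, _ = self.lstm2(x)
        x, _ = self.lstm3(x)
        return self.dropout(x.flatten(1))     # Flatten + Dropout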