It seems my CNN model does not train

I have a CNN designed for a speech emotion recognition task, and I am trying to train it from scratch in an end-to-end manner. I use the concordance correlation coefficient (CCC) as the objective function (I am trying to maximize CCC, i.e. minimize 1 - CCC), but the network does not seem to learn. My input is audio extracted from video clips. There are 23 video clips in total, each 5 minutes long, and the audio frames of 14 people are used as training data. The audio is split into 40 ms frames, which gives 7500 audio frames per video clip. At a 16 kHz sampling rate, this corresponds to 640 samples per frame. Each frame is labeled with arousal and valence. My training loop is as follows:

for i_batch, sample_batched in enumerate(train_loader):
    data_time.update(time.time() - since)
    ground_truth, audio = sample_batched['landmark'], sample_batched['audio']
    ground_truth = Variable(ground_truth, requires_grad=False)
    audio = Variable(audio, requires_grad=True)
    # Forward pass.
    prediction = model(audio)
    mse_mean = 0
    ccc_mean = 0
    for i, name in enumerate(['arousal', 'valence']):
        gt_single = ground_truth[:, i]
        gt_single = gt_single.float()
        pred_single = prediction[:, i]
        ccc_loss = criterion1(pred_single, gt_single)
        # Log the per-dimension loss: index 0 is arousal, index 1 is valence.
        if i == 0:
            ccc_arousal = ccc_loss.item()
        else:
            ccc_valence = ccc_loss.item()
        ccc_mean += ccc_loss
    CCC_arousal.update(ccc_arousal, audio.size(0))
    CCC_valence.update(ccc_valence, audio.size(0))
    losses.update((ccc_mean / 2).item(), audio.size(0))

The loss value (1 - CCC) is around 1 from the beginning and does not change noticeably even after 50 epochs. Can anyone help me with this issue?
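
The post does not show `criterion1`, so for reference here is a minimal sketch of what a 1 - CCC loss usually looks like in PyTorch. The function name `ccc_loss` and the `eps` term are assumptions made for illustration, not the poster's code. It uses the standard definition CCC = 2 * cov(x, y) / (var(x) + var(y) + (mean(x) - mean(y))^2) with population (biased) statistics.

```python
import torch


def ccc_loss(pred: torch.Tensor, target: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Return 1 - CCC for two 1-D tensors of equal length (illustrative sketch)."""
    pred_mean = pred.mean()
    target_mean = target.mean()
    # Population variances and covariance over the batch.
    pred_var = ((pred - pred_mean) ** 2).mean()
    target_var = ((target - target_mean) ** 2).mean()
    covariance = ((pred - pred_mean) * (target - target_mean)).mean()
    # eps only guards against division by zero when both inputs are constant.
    ccc = 2.0 * covariance / (pred_var + target_var + (pred_mean - target_mean) ** 2 + eps)
    return 1.0 - ccc
```

With this definition, a loss that sits at 1 means the CCC is about 0: the predictions are either uncorrelated with the labels or nearly constant across the batch.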

One way to debug this is to try to overfit the network by training on a single mini-batch repeatedly. If the network is designed properly, it should converge very quickly to a loss of nearly 0. Otherwise, there may be a serious design issue in the network.
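
A minimal sketch of that single-batch check, under stated assumptions: the helper `overfit_one_batch`, the small stand-in model, the random data, and the use of `nn.MSELoss` are all placeholders, to be replaced with the real model, one real batch, and the real criterion. Note the explicit `zero_grad()`, `backward()`, and `step()` calls; they do not appear in the posted loop excerpt, although they may simply have been left out of it.

```python
import torch
import torch.nn as nn


def overfit_one_batch(model, inputs, targets, criterion, steps=200, lr=1e-3):
    """Train on one fixed mini-batch and return the loss recorded at every step."""
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    model.train()
    history = []
    for _ in range(steps):
        optimizer.zero_grad()                      # clear gradients from the previous step
        loss = criterion(model(inputs), targets)   # forward pass on the same batch
        loss.backward()                            # compute gradients
        optimizer.step()                           # update the weights
        history.append(loss.item())
    return history


torch.manual_seed(0)
# Synthetic stand-in data: 25 frames of 640 samples, 2 labels (arousal, valence).
inputs = torch.randn(25, 640)
targets = torch.rand(25, 2) * 2 - 1
# Deliberately tiny placeholder model, not the model from the post.
model = nn.Sequential(nn.Linear(640, 64), nn.ReLU(), nn.Linear(64, 2))
history = overfit_one_batch(model, inputs, targets, nn.MSELoss())
print(f"first loss: {history[0]:.4f}, last loss: {history[-1]:.4f}")
```

If the last loss is not clearly below the first one for the real model, the problem is more likely in the model or the training step than in the data.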


Thank you for the reply. Following your suggestion, I trained my network on 1 mini-batch of 500 samples. Unfortunately, the loss did not change noticeably. My prediction model is as follows:
class RecurrentModel(nn.Module):
    def __init__(self, input_size=args.input_size, hidden_units=256, number_of_outputs=2):
        super(RecurrentModel, self).__init__()
        self.hidden = hidden_units
        self.input_size = input_size
        self.n_outputs = number_of_outputs
        self.lstm = nn.LSTM(input_size=self.input_size, hidden_size=self.hidden, num_layers=2, batch_first=True)
        self.linear = nn.Linear(hidden_units, number_of_outputs)

    def forward(self, net):
        batch_size, seq_length, num_features = list(net.size())
        outputs, _ = self.lstm(net)
        prediction = self.linear(outputs[0])
        return torch.reshape(prediction, (batch_size * seq_length, self.n_outputs))

class AudioModel(nn.Module):
    def __init__(self):
        super(AudioModel, self).__init__()
        self.drop = nn.Dropout()
        self.conv1 = nn.Conv2d(1, 40, (1, 20), padding=(0, 9))
        self.relu1 = nn.ReLU()
        self.pool1 = nn.MaxPool2d((1, 2), (1, 2))
        self.conv2 = nn.Conv2d(20, 40, (1, 160), padding=(0, 80))  # padding = (0,38)
        self.relu2 = nn.ReLU()
        self.pool2 = nn.MaxPool2d((1, 10), (1, 10))

    def forward(self, audio_frames, conv_filters=40):
        batch_seq_length, num_features = list(audio_frames.size())
        seq_length = args.seq_length
        batch_size = args.batch_size
        rnn = RecurrentModel()
        audio_input = torch.reshape(audio_frames, [1, 1, batch_size * seq_length, num_features])
        net = self.drop(audio_input)
        net = self.conv1(net)
        net = self.relu1(net)
        # Subsampling of the signal to 8 kHz.
        net = self.pool1(net)
        net = self.conv2(net)
        net = self.relu2(net)
        net = torch.reshape(net, (1, batch_size * seq_length,
                                  num_features // 2, conv_filters))  # (num_features // 3)+37
        net = self.pool2(net)
        net = torch.reshape(net, (batch_size, seq_length, num_features // 2 * 4))
        net = rnn(net)
        return net

batch_size and seq_length are equal to 25 and 1, respectively.

This suggests your network is not designed properly. I would suggest simplifying your model to a couple of layers to start with and training it on 1 mini-batch. Once you have confirmed that the network converges reliably, you can start adding more layers.
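
A few things in the posted code are worth checking while simplifying. `RecurrentModel()` is created inside `AudioModel.forward`, so a fresh, randomly initialized LSTM is built on every call and its parameters are never part of `model.parameters()`. `conv1` produces 40 channels while `conv2` is declared with 20 input channels. `outputs[0]` selects only the first element of the batch. The sketch below is one possible reduced model, not a drop-in fix: the class name `SmallAudioModel`, the layer sizes, and the use of `Conv1d` with `AdaptiveMaxPool1d` are assumptions chosen for brevity. All submodules are created in `__init__`, and a quick check lists any parameter that receives no gradient.

```python
import torch
import torch.nn as nn


class SmallAudioModel(nn.Module):
    """Hypothetical reduced model: one conv layer, one LSTM layer, one linear layer."""

    def __init__(self, conv_filters=8, hidden_units=32, number_of_outputs=2):
        super().__init__()
        # Every submodule is created here so that it is registered and trained.
        self.conv = nn.Conv1d(1, conv_filters, kernel_size=20, padding=10)
        self.pool = nn.AdaptiveMaxPool1d(16)
        self.lstm = nn.LSTM(conv_filters * 16, hidden_units, batch_first=True)
        self.linear = nn.Linear(hidden_units, number_of_outputs)

    def forward(self, audio_frames):
        # audio_frames: (batch_size, seq_length, num_features)
        batch_size, seq_length, num_features = audio_frames.shape
        net = audio_frames.reshape(batch_size * seq_length, 1, num_features)
        net = self.pool(torch.relu(self.conv(net)))    # (B*T, filters, 16)
        net = net.reshape(batch_size, seq_length, -1)  # (B, T, filters*16)
        outputs, _ = self.lstm(net)                    # (B, T, hidden)
        # Apply the linear layer to every time step of every batch element.
        return self.linear(outputs).reshape(batch_size * seq_length, -1)


model = SmallAudioModel()
prediction = model(torch.randn(25, 1, 640))  # batch_size=25, seq_length=1, 640 samples
prediction.sum().backward()
# Any name listed here is a parameter the optimizer could never improve.
missing = [name for name, p in model.named_parameters() if p.grad is None]
print(prediction.shape, "parameters without gradient:", missing)
```

The same `named_parameters()` check can be run on the original model to see whether the LSTM weights show up at all.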