I have a CNN designed for speech emotion recognition, and I am trying to train it from scratch in an end-to-end manner. I use the concordance correlation coefficient (CCC) as the objective function (I am trying to maximize CCC, i.e. minimize 1-CCC), but the network does not seem to learn. My input is audio extracted from video clips. There are 23 video clips in total, each 5 minutes long, and the audio frames of 14 people are used as training data. One audio frame is extracted every 40 ms, which gives 7500 audio frames per video clip. At a 16 kHz sampling rate, this corresponds to 640 samples per frame. Each frame is labeled with arousal and valence. The structure of my training loop is as follows:

    for i_batch, sample_batched in enumerate(train_loader):
        data_time.update(time.time() - since)
        ground_truth, audio = sample_batched['landmark'], sample_batched['audio']
        ground_truth = Variable(ground_truth, requires_grad=False)
        audio = Variable(audio, requires_grad=True)
        # Define model.
        optimizer.zero_grad()
        prediction = model(audio)
        mse_mean = 0
        ccc_mean = 0
        for i, name in enumerate(['arousal', 'valence']):
            gt_single = ground_truth[:, i]
            gt_single = gt_single.float()
            pred_single = prediction[:, i]
            ccc_loss = criterion1(pred_single, gt_single)
            if i == 0:
                ccc_arousal = ccc_loss.item()
            else:
                ccc_valence = ccc_loss.item()
            ccc_mean += ccc_loss
        CCC_arousal.update(ccc_arousal, audio.size(0))
        CCC_valence.update(ccc_valence, audio.size(0))
        losses.update((ccc_mean / 2).item(), audio.size(0))
        (ccc_mean / 2).backward(retain_graph=True)
        optimizer.step()

The loss value (1-CCC) is around 1 from the beginning and does not change noticeably even after 50 epochs. Can anyone help me with this issue?
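
For reference, the CCC between predictions x and labels y is usually defined as 2*cov(x, y) / (var(x) + var(y) + (mean(x) - mean(y))^2). Below is a minimal plain-Python sketch of that formula and of the 1-CCC loss. It is only an illustration of the definition, not the actual `criterion1` used above: the names `ccc` and `ccc_loss`, the use of population (divide-by-n) statistics, and the small `eps` term in the denominator are all assumptions made for this sketch, and a real training loss would need to be written with differentiable tensor operations.

```python
def ccc(pred, target, eps=1e-8):
    """Concordance correlation coefficient of two equal-length sequences.

    Hypothetical helper for illustration; eps is an assumed guard
    against division by zero when both inputs are constant and equal.
    """
    n = len(pred)
    if n == 0 or n != len(target):
        raise ValueError("pred and target must be non-empty and equally long")

    mean_p = sum(pred) / n
    mean_t = sum(target) / n

    # Population (divide-by-n) variances and covariance.
    var_p = sum((p - mean_p) ** 2 for p in pred) / n
    var_t = sum((t - mean_t) ** 2 for t in target) / n
    cov = sum((p - mean_p) * (t - mean_t) for p, t in zip(pred, target)) / n

    return 2.0 * cov / (var_p + var_t + (mean_p - mean_t) ** 2 + eps)


def ccc_loss(pred, target):
    """Loss to minimize: 1 - CCC (near 0 for perfect agreement, up to 2)."""
    return 1.0 - ccc(pred, target)


if __name__ == "__main__":
    labels = [0.1, 0.4, -0.2, 0.3]
    print("identical:", ccc_loss(labels, labels))
    print("constant prediction:", ccc_loss([0.0] * len(labels), labels))
```

With this definition, a prediction that is constant across the batch has zero covariance with the labels, so CCC is 0 and the loss is exactly 1. That is one situation in which a 1-CCC loss could sit around 1, though it is not necessarily what is happening in the loop above.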