I would like to fine-tune the pre-trained VGG-Face network as described below:

min_{W,θ} ∑_{i=1}^{N} L(sigmoid(W η(a_i; θ)), y_i)

where η(a_i; θ) represents the output of the last fully connected layer of the VGG-Face network,
θ and W denote the network parameters of the VGG-Face network and the weights of the sigmoid layer, respectively,
and L is the cross-entropy loss.
Could someone please help me implement this?
Thank you.

Specifically, I want to minimize the loss function described below:
Given the i-th input video clip a_i (i = 1, 2, ..., N) and its corresponding Big-Five personality score y_i, we fine-tune the pre-trained VGG-Face network to obtain deep segment-level feature representations, as described below:
min_{W_VG, θ_VG} ∑_{i=1}^{N} L(sigmoid(W_VG η_VG(a_i; θ_VG)), y_i)
where η_VG(a_i; θ_VG) represents the output of the last fully connected layer in the VGG-Face network. θ_VG and W_VG denote the network parameters of the VGG-Face network and the weights of the sigmoid layer, respectively.