Hello everyone,

I am fine-tuning a model for sound event detection, taken from https://github.com/qiuqiangkong/audioset_tagging_cnn, on the URBAN-SED dataset. In this task, the model should predict a [batch, n_classes, time_steps] matrix, where a value of 1 indicates that an event of that class is present at that time step.
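To make the target format concrete, here is a minimal sketch of how such a matrix can be built from event annotations. The helper name `events_to_target`, the event tuples, and the sizes (10 classes, 100 frames, 10-second clip) are illustrative assumptions, not my actual data pipeline.

```python
import torch

def events_to_target(events, n_classes, time_steps, clip_seconds):
    """Build a [n_classes, time_steps] 0/1 matrix from
    (class_idx, onset_seconds, offset_seconds) tuples."""
    target = torch.zeros(n_classes, time_steps)
    frames_per_second = time_steps / clip_seconds
    for class_idx, onset_s, offset_s in events:
        start = int(onset_s * frames_per_second)
        end = min(time_steps, int(round(offset_s * frames_per_second)))
        target[class_idx, start:end] = 1.0  # mark frames where the event is active
    return target

# Hypothetical clip: class 2 active from 1.0 s to 3.0 s, class 5 from 2.5 s to 4.0 s.
target = events_to_target(
    [(2, 1.0, 3.0), (5, 2.5, 4.0)],
    n_classes=10, time_steps=100, clip_seconds=10.0,
)
batch = target.unsqueeze(0)  # add the batch dimension: [batch, n_classes, time_steps]
print(batch.shape)
```

Note that two classes overlap between 2.5 s and 3.0 s, which is why the problem is multi-label rather than multi-class.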

However, my network does not seem to train. Specifically, after roughly the first 10 epochs, my loss stops decreasing. If I check the model's predictions, every value in the output is 0.5.

I have tried:

- Changing the amount of L2 regularization, even turning it off completely
- Changing the learning rate
- Doing a mock training run with only 2 samples to see if the network could learn this simple problem. The result was the same: a matrix of 0.5 everywhere.
- Different optimizers (Adam and SGD so far)
- BCEWithLogitsLoss with reduction='mean' and reduction='sum' (I use this loss because, in theory, multiple classes can be active at the same time)
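For reference, the two-sample overfitting test from the list above can be sketched as follows. The small `Conv1d` network is a hypothetical stand-in for the real model, and the inputs and labels are random made-up data; only the shapes mirror my task.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical stand-in network: maps [batch, n_features, time_steps]
# to per-frame logits of shape [batch, n_classes, time_steps].
n_features, n_classes, time_steps = 8, 3, 20
model = nn.Sequential(
    nn.Conv1d(n_features, 32, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.Conv1d(32, n_classes, kernel_size=3, padding=1),
)

# Two fixed samples with random 0/1 frame labels (made-up data).
inputs = torch.randn(2, n_features, time_steps)
labels = torch.randint(0, 2, (2, n_classes, time_steps)).float()

criterion = nn.BCEWithLogitsLoss(reduction='mean')
optimizer = torch.optim.Adam(model.parameters(), lr=1e-2)

losses = []
for step in range(300):
    optimizer.zero_grad()
    loss = criterion(model(inputs), labels)  # model returns raw logits here
    loss.backward()
    optimizer.step()
    losses.append(loss.item())

# Convert logits to probabilities only for inspection, not for the loss.
probs = torch.sigmoid(model(inputs)).detach()
print(f"first loss {losses[0]:.4f}, last loss {losses[-1]:.4f}")
print(f"prediction range: {probs.min().item():.3f} to {probs.max().item():.3f}")
```

A network that can learn at all should drive the loss down on two samples and produce predictions that move away from 0.5. If a toy model does this but the real one does not, the problem is more likely in the model's output or the loss wiring than in the hyperparameters.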

My loss and optimizer:

```
import torch.nn as nn
import torch.optim as optim

criterion = nn.BCEWithLogitsLoss(reduction='sum')
optimizer = optim.SGD(model.parameters(), lr=0.001)
```
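As a sanity check on what this loss expects, the sketch below uses random tensors (not my data) to show three things: `BCEWithLogitsLoss` applies a sigmoid internally, a constant prediction of 0.5 corresponds to a loss of ln 2 per element, and values that are already probabilities get squashed into a narrow band just above 0.5 if a sigmoid is applied to them a second time. Whether the model's forward pass already ends in a sigmoid is an assumption to check against its code, not something shown here.

```python
import math
import torch
import torch.nn as nn

torch.manual_seed(0)
logits = torch.randn(2, 10, 100)                      # raw, unbounded scores
targets = torch.randint(0, 2, (2, 10, 100)).float()   # 0/1 frame labels

# 1) BCEWithLogitsLoss(logits) equals BCELoss(sigmoid(logits)).
with_logits = nn.BCEWithLogitsLoss(reduction='mean')(logits, targets)
manual = nn.BCELoss(reduction='mean')(torch.sigmoid(logits), targets)
print(with_logits.item(), manual.item())

# 2) All-zero logits mean a prediction of 0.5 everywhere, i.e. a loss of ln(2) per element.
chance_loss = nn.BCEWithLogitsLoss(reduction='mean')(torch.zeros_like(logits), targets)
print(chance_loss.item(), math.log(2))

# 3) If the inputs are already probabilities in [0, 1], a second sigmoid
#    confines them to roughly [0.5, 0.731].
already_probs = torch.rand(2, 10, 100)
double_sigmoid = torch.sigmoid(already_probs)
print(double_sigmoid.min().item(), double_sigmoid.max().item())
```

With `reduction='sum'`, a loss that plateaus near 0.693 times the number of elements in the target would be consistent with the model predicting 0.5 everywhere.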

My training loop:

```
running_loss = 0.0  # reset at the start of each epoch
for i, data in enumerate(dataloader_train):
    inputs, labels = data
    inputs = inputs.type(torch.FloatTensor)
    labels = labels.float()  # BCEWithLogitsLoss expects float targets
    optimizer.zero_grad()
    outputs = model(inputs).cpu()
    loss = criterion(outputs, labels)
    loss.backward()
    optimizer.step()
    running_loss += loss.item()
```
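One more diagnostic that fits into a loop like this is to inspect gradient norms right after `loss.backward()`, to see whether any parameters receive no gradient at all. The helper `grad_report` and the small linear model below are hypothetical stand-ins; the first layer is frozen on purpose to show what a missing gradient looks like.

```python
import torch
import torch.nn as nn

def grad_report(model):
    """Return {parameter_name: gradient norm, or None if no gradient was computed}."""
    report = {}
    for name, param in model.named_parameters():
        report[name] = None if param.grad is None else param.grad.norm().item()
    return report

torch.manual_seed(0)
# Hypothetical stand-in model, not the real network.
model = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 3))
model[0].weight.requires_grad_(False)  # simulate a frozen pretrained layer

criterion = nn.BCEWithLogitsLoss()
x = torch.randn(5, 4)
y = torch.randint(0, 2, (5, 3)).float()

loss = criterion(model(x), y)
loss.backward()

report = grad_report(model)
for name, norm in report.items():
    print(name, norm)
```

Parameters that show `None` are not being trained, and norms that are all extremely small would suggest the gradient is vanishing before it reaches those layers.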

I can’t figure out what’s wrong. Any thoughts?

Thanks,

Federico