Which loss and metric to chose

Hi, I’m trying to train a classifier on images that belong to 3 classes, but because one image can belong to more than one class, I’m not sure which loss or metric to use to evaluate my model.
The output of my classifier is a sigmoid function on 3 neurons. My labels are an array with three binary elements, e.g. [[1,1,0], [0,0,0], [1, 0, 0]].

I decided to use the binary cross entropy loss and use the mean average precision to evaluate my models predictions but while the training loss decreases (a little bit) the validation loss goes up during training and the mean average precision of the predictions oscillates between 0.01 and 0.03. So I guess that I am doing something wrong. Am I using the wrong loss?
This is my code for the training:

# load simple resnet classifier
net = ClfImg(flags, classes).to(flags.device)
criterion = nn.BCELoss()
optimizer = optim.SGD(net.parameters(), lr=0.001, momentum=0.9)


def train_loop(m, epoch):
    running_loss = 0.0
    m.train()

    # for i, (inputs, labels) in tqdm(enumerate(trainloader, 0), total=len(trainloader)):
    for i, (inputs, labels) in enumerate(trainloader, 0):
        inputs, labels = Variable(inputs['PA']).to(device), Variable(labels).to(device)

        # zero the parameter gradients
        optimizer.zero_grad()

        # forward + backward + optimize
        outputs = m(inputs)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()
        # print statistics
        running_loss += loss.item()
        if i % 100 == 99:  # print every 100 mini-batches
            print('[%d, %5d] loss: %.3f' %
                  (epoch + 1, i + 1, running_loss / 100))
            running_loss = 0.0
    return m


def eval_loop(m, epoch):
    running_loss = 0.0
    predictions = torch.Tensor()
    gts = torch.Tensor()
    m.eval()
    with torch.no_grad():
        # for i, (inputs, labels) in tqdm(enumerate(testloader, 0), total=len(testloader)):
        for i, (inputs, labels) in enumerate(testloader, 0):
            inputs, labels = Variable(inputs['PA']).to(device), Variable(labels).to(device)

            outputs = m(inputs)
            loss = criterion(outputs, labels)

            predictions = torch.cat((predictions, outputs.cpu()), 0)
            gts = torch.cat((gts, labels.cpu()), 0)

            running_loss += loss.item()
            if i % len(testloader) == len(testloader) - 1:  # print every 100 mini-batches
                print('[%d, %5d] eval loss: %.3f' %
                      (epoch + 1, i + 1, running_loss / len(testloader)))
                running_loss = 0.0

    for i in range(len(classes)):
        print(f'average precision score for label {classes[i]}:',
              average_precision_score((predictions[:, i].numpy().ravel() > 0.5) * 1,
                                      (gts[:, i].numpy().ravel() > 0.5) * 1))
    print('total average precision score: ',
          average_precision_score((predictions.numpy().ravel() > 0.5) * 1, (gts.numpy().ravel() > 0.5) * 1))


for epoch in range(5):
    net = train_loop(net, epoch)
    eval_loop(net, epoch)

which outputs this:

[1,   100] loss: 0.318
[1,   200] loss: 0.207
[1,   300] loss: 0.204
[1,   400] loss: 0.209
[1,     4] eval loss: 0.216
average precision score for label Lung Opacity: 0.02746288798920378
average precision score for label Pleural Effusion: 0.04696741854636591
/home/hendrik/miniconda3/envs/mimic/lib/python3.8/site-packages/sklearn/metrics/_ranking.py:681: RuntimeWarning: invalid value encountered in true_divide
  recall = tps / tps[-1]
average precision score for label Support Devices: nan
total average precision score:  0.02530319882546603
[2,   100] loss: 0.198
[2,   200] loss: 0.201
[2,   300] loss: 0.200
[2,   400] loss: 0.204
[2,     4] eval loss: 0.234
average precision score for label Lung Opacity: 0.012267206477732794
average precision score for label Pleural Effusion: 0.031197747455057396
average precision score for label Support Devices: nan
total average precision score:  0.014783662452835385
[3,   100] loss: 0.196
[3,   200] loss: 0.193
[3,   300] loss: 0.200
[3,   400] loss: 0.201
[3,     4] eval loss: 0.244
average precision score for label Lung Opacity: 0.019230769230769232
average precision score for label Pleural Effusion: 0.06400816856957207
average precision score for label Support Devices: nan
total average precision score:  0.024417337048915994
[4,   100] loss: 0.196
[4,   200] loss: 0.199
[4,   300] loss: 0.196
[4,   400] loss: 0.192
[4,     4] eval loss: 0.234
average precision score for label Lung Opacity: 0.012267206477732794
average precision score for label Pleural Effusion: 0.07982306192832508
average precision score for label Support Devices: nan
total average precision score:  0.030305291765393632
[5,   100] loss: 0.195
[5,   200] loss: 0.199
[5,   300] loss: 0.193
[5,   400] loss: 0.190
[5,     4] eval loss: 0.254
average precision score for label Lung Opacity: 0.01537883169462117
average precision score for label Pleural Effusion: 0.0471451355661882
average precision score for label Support Devices: nan
total average precision score:  0.01959257117151854

Since my dataset is very unbalanced (counts: [6990, 3668, 932]), I tried using the WeightedRandomSampler like this:

        label_counts = get_label_counts()
        weights = calculateWeights(label_counts, trainset)
        weights = torch.DoubleTensor(weights)
        sampler = torch.utils.data.sampler.WeightedRandomSampler(weights, flags.batch_size, replacement=True)

        trainloader = DataLoader(trainset, batch_size=batch_size, sampler=sampler,
                                 shuffle=False,
                                 num_workers=dataloader_workers, pin_memory=False)

While this resulted in a decreasing eval loss, it also resulted in a decrease of the average precision metric:

[1,     4] eval loss: 0.649                                                                                                            
average precision score for label Lung Opacity: 0.023283850652271707                           
average precision score for label Pleural Effusion: 0.8453838299452334                                                                 
average precision score for label Support Devices: 0.30195558827137775                                                                 
total average precision score:  0.3840782327170712                                                                                     
[2,     4] eval loss: 0.559                                                                                                            
average precision score for label Lung Opacity: 0.00631578947368421                             
average precision score for label Pleural Effusion: 0.5314078144938429                                                                 
average precision score for label Support Devices: 0.16108225108225105                                                                 
total average precision score:  0.22906761877806514                                                                                    
[3,     4] eval loss: 0.554                                                                                                                                                                                                                                                   
average precision score for label Lung Opacity: 0.019230769230769232                                                                   
average precision score for label Pleural Effusion: 0.38544439176018125                                                                
average precision score for label Support Devices: 0.24436025408348455                                                                 
total average precision score:  0.20874856524908808                                                                                    
[4,     4] eval loss: 0.597                                                                                                            
average precision score for label Lung Opacity: 0.02124156545209177                                                                    
average precision score for label Pleural Effusion: 0.47914230019493176                                                                
average precision score for label Support Devices: 0.3903263535182041                                                                  
total average precision score:  0.289516208895232                                                                                      
[5,     4] eval loss: 0.639                                                                                                            
average precision score for label Lung Opacity: 0.06151532677848468                                                                    
average precision score for label Pleural Effusion: 0.5080161445568739                         
average precision score for label Support Devices: 0.5386326711720936                                                                  
total average precision score:  0.3640341420969798                                                                                     
[6,     4] eval loss: 0.639                                                                                                            
average precision score for label Lung Opacity: 0.058812897628687105     
average precision score for label Pleural Effusion: 0.37564192863989315                         
average precision score for label Support Devices: 0.6009432116877559                                                                  
total average precision score:  0.342691608052389                                                                                                                                                                                                                             
[7,     4] eval loss: 0.619                                                                                                                                                                                                                                                   
average precision score for label Lung Opacity: 0.0368578497525866                                                                     
average precision score for label Pleural Effusion: 0.20133116295894374                   
average precision score for label Support Devices: 0.6092644842346415                                                                  
total average precision score:  0.28078888011996644                                                                                    
[8,     4] eval loss: 0.593                                                                                                            
average precision score for label Lung Opacity: 0.034815432245772805                                                                   
average precision score for label Pleural Effusion: 0.1060627603868058              
average precision score for label Support Devices: 0.5772248803827751
total average precision score:  0.23698103747819874
[9,     4] eval loss: 0.580
average precision score for label Lung Opacity: 0.034815432245772805
average precision score for label Pleural Effusion: 0.0837037037037037
average precision score for label Support Devices: 0.5552300562727573
total average precision score:  0.22253502844545667
[10,     4] eval loss: 0.573
average precision score for label Lung Opacity: 0.04508465218991535 
average precision score for label Pleural Effusion: 0.0816934073074424
average precision score for label Support Devices: 0.52619486964423 
total average precision score:  0.21558748943364328
[11,     4] eval loss: 0.567
average precision score for label Lung Opacity: 0.049222334682860996
average precision score for label Pleural Effusion: 0.08571863262492273
average precision score for label Support Devices: 0.4878277238159777
total average precision score:  0.2060174442300899
[12,     4] eval loss: 0.555
average precision score for label Lung Opacity: 0.049222334682860996
average precision score for label Pleural Effusion: 0.08976109524457139
average precision score for label Support Devices: 0.41929824561403506
total average precision score:  0.18447285023275561

Is my metric wrong?