BCE loss stuck at 0.693 at the beginning of training and then started to decrease, why?

Hi @dhruvbird , I don't think I have a data imbalance issue. The raw data only contains the positive class (clicks), and I sampled negatives based on item popularity at an exact 1:1 ratio.
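The popularity-based sampling is along these lines (a simplified sketch with made-up names, not the actual pipeline): for each positive click, one negative item is drawn with probability proportional to its global click count, so positives and negatives end up exactly 1:1.

    import numpy as np

    rng = np.random.default_rng(0)

    def sample_negatives(clicked_item_ids, item_click_counts):
        # item_click_counts[i] = global number of clicks on item i (popularity)
        probs = item_click_counts / item_click_counts.sum()
        negatives = []
        for pos in clicked_item_ids:
            neg = pos
            while neg == pos:  # don't use the clicked item as its own negative
                neg = rng.choice(len(item_click_counts), p=probs)
            negatives.append(neg)
        return np.array(negatives)  # one negative per positive -> exact 1:1 ratio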

After the negative sampling, I split the data into train and validation sets with TimeSeriesSplit, so the clicks in the validation set always come after the click behaviour in the training data.
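The split itself is essentially what TimeSeriesSplit gives you once the interactions are sorted by time (a minimal sketch, assuming a pandas DataFrame with a timestamp column; the column names are placeholders):

    import pandas as pd
    from sklearn.model_selection import TimeSeriesSplit

    # Toy example: one row per click, sorted by time.
    interactions = pd.DataFrame({
        "user_id":   [1, 2, 1, 3, 2, 1],
        "item_id":   [10, 11, 12, 10, 13, 14],
        "timestamp": [1, 2, 3, 4, 5, 6],
    }).sort_values("timestamp").reset_index(drop=True)

    tscv = TimeSeriesSplit(n_splits=2)
    for train_idx, valid_idx in tscv.split(interactions):
        train_df = interactions.iloc[train_idx]   # earlier clicks
        valid_df = interactions.iloc[valid_idx]   # strictly later clicks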

In training, shuffling is handled by the training dataloader's sampler (shuffle is True by default for DistributedSampler):

    from torch.utils.data import DataLoader
    from torch.utils.data.distributed import DistributedSampler

    # num_workers is set to 4 * number of GPUs.
    # shuffle=False on the DataLoader because shuffling is delegated to the
    # DistributedSampler (which shuffles by default for the training set).
    train_dataloader = DataLoader(
        dataset_dict["train"], batch_size=batch_size, collate_fn=custom_collate_function, pin_memory=True,
        num_workers=num_workers,
        shuffle=False,
        sampler=DistributedSampler(dataset_dict["train"])
    )
    valid_dataloader = DataLoader(
        dataset_dict["valid"], batch_size=batch_size, collate_fn=custom_collate_function, pin_memory=True,
        num_workers=num_workers,
        shuffle=False,
        sampler=DistributedSampler(dataset_dict["valid"], shuffle=False, drop_last=True)  # 504114 % (64 * 4) == 50 samples
    )
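One DistributedSampler detail worth keeping in mind: it only produces a different shuffle each epoch if set_epoch is called at the start of every epoch, so the training loop needs something like this (a minimal sketch):

    def train(train_dataloader, model, optimizer, num_epochs):
        for epoch in range(num_epochs):
            # DistributedSampler reshuffles only when set_epoch is called;
            # otherwise every epoch iterates the data in the same order.
            train_dataloader.sampler.set_epoch(epoch)
            for batch in train_dataloader:
                ...  # forward pass, BCE loss, backward, optimizer step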

I think the 0.693 loss issue probably has something to do with the last fully connected block (4 Linear layers with LeakyReLU activations), as mentioned by KFrank: when I replaced the whole fully connected block with an inner-product operation, the 0.693 plateau disappeared. But that introduced another problem, so I posted it separately: Does my loss curve show the model is overfitting?
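For reference, 0.693 ≈ ln 2, which is exactly the BCE loss when the model predicts ~0.5 for every sample of a balanced 1:1 dataset, i.e. chance level. The inner-product replacement looks roughly like this (a simplified sketch with placeholder names and sizes, assuming BCEWithLogitsLoss; not the exact model):

    import torch
    import torch.nn as nn

    class DotProductHead(nn.Module):
        """Score a (user, item) pair with an inner product instead of the
        stack of Linear + LeakyReLU layers."""
        def forward(self, user_emb, item_emb):
            # (batch, dim) * (batch, dim) -> (batch,) raw logits
            return (user_emb * item_emb).sum(dim=-1)

    head = DotProductHead()
    user_emb, item_emb = torch.randn(8, 32), torch.randn(8, 32)
    logits = head(user_emb, item_emb)
    loss = nn.BCEWithLogitsLoss()(logits, torch.randint(0, 2, (8,)).float())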