Questions about token classification with transformers

I have a PyTorch model composed of DistilBERT and a BiLSTM, with the structure shown below. It performs token classification over a large number of categories (num_labels=1182) by feeding the output of the transformer into the BiLSTM.

Here is an example of the input data stored in the Dataloaders, shortened to a max length of 32 instead of 256 for the sake of readability. Special tokens set by the tokenizer (including padding) have label -100 and tokens without any category have label 0:

Sample of the training set: 
{'labels': [-100, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1114, 0, 242, 242, 0, 425, 0, 0, 0, 0, 182, 182, 182, 0, 0, 0, -100, -100, -100, -100],
'input_ids': [2, 48, 30, 525, 67, 311, 18, 18, 780, 30, 18, 332, 5717, 389, 6802, 9987, 18, 910, 251, 708, 6311, 18, 6821, 22732, 22, 270, 3345, 246, 3, 0, 0, 0],
'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0]}
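
As a side note on the -100 convention, here is a minimal sketch (toy sizes and random logits are assumptions, not data from my project) showing that nn.CrossEntropyLoss, whose ignore_index defaults to -100, skips those positions on its own:

```python
import torch
import torch.nn as nn

# Toy sizes (assumptions for illustration): 1 sequence, 6 tokens, 5 classes.
torch.manual_seed(0)
logits = torch.randn(1, 6, 5)                       # [batch, sequence, classes]
labels = torch.tensor([[-100, 0, 3, 0, -100, -100]])

criterion = nn.CrossEntropyLoss()                   # ignore_index defaults to -100

# Loss over the full sequence; positions labelled -100 are skipped.
loss_all = criterion(logits.transpose(1, 2), labels)

# Same loss computed only over the positions whose label is not -100.
keep = labels != -100
loss_kept = criterion(logits[keep], labels[keep])

print(torch.allclose(loss_all, loss_kept))
```

Note that label 0 is still treated as a regular class here; only -100 is ignored by default.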

And here is the aforementioned PyTorch model:

import torch
import torch.nn as nn
from transformers import AutoModel

import utilities as utils

class CustomTorchModel(nn.Module):
    def __init__(self, args_model_name_or_path):
        super().__init__() # must run before any submodule is registered
        _, label_to_id = utils.unshelve_label_converters()
        label_qty = len(label_to_id) # equals to 1182
        # Base encoder (no classification head), so the output exposes last_hidden_state
        self.distilbert_layer = AutoModel.from_pretrained(args_model_name_or_path)
        hidden_dim = self.distilbert_layer.config.dim # equals to 768
        # input_size is the per-token feature size (768), not the sequence length (256)
        self.bilstm_layer = nn.LSTM(input_size=hidden_dim,
                                    hidden_size=hidden_dim,
                                    num_layers=1,
                                    batch_first=True,
                                    bidirectional=True)
        self.classification_layer = nn.Linear(2 * hidden_dim, label_qty)

    def forward(self, inputs):
        distilbert_output = self.distilbert_layer(input_ids=inputs[0], attention_mask=inputs[1])
        bilstm_output, (last_hidden, last_cell) = self.bilstm_layer(distilbert_output.last_hidden_state)
        output = self.classification_layer(bilstm_output)
        print("input_ids size: " + str(inputs[0].size())) # prints torch.Size([8, 256])
        print("attention_mask size: " + str(inputs[1].size())) # prints torch.Size([8, 256])
        print("distilbert_output.last_hidden_state size: " + str(distilbert_output.last_hidden_state.size())) # prints torch.Size([8, 256, 768])
        print("BiLSTM output size: " + str(bilstm_output.size())) # prints torch.Size([8, 256, 1536])
        print("output size: " + str(output.size())) # prints torch.Size([8, 256, 1182])
        # Return raw logits: nn.CrossEntropyLoss applies log-softmax internally
        return output
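
The shapes in the comments above can be checked without loading DistilBERT. The following is only a sketch under assumptions: a random tensor stands in for the transformer's last hidden state, and the sizes are reduced, hypothetical ones rather than 8/256/768/1182.

```python
import torch
import torch.nn as nn

batch, seq_len, hidden, num_labels = 2, 16, 32, 10   # reduced, hypothetical sizes

# Random tensor standing in for the transformer's last hidden state.
hidden_states = torch.randn(batch, seq_len, hidden)

# input_size is the per-token feature size, not the sequence length.
bilstm = nn.LSTM(input_size=hidden, hidden_size=hidden,
                 num_layers=1, batch_first=True, bidirectional=True)
classifier = nn.Linear(2 * hidden, num_labels)

bilstm_output, _ = bilstm(hidden_states)   # [batch, sequence, 2 * hidden]
logits = classifier(bilstm_output)         # [batch, sequence, num_labels]

print(bilstm_output.shape)
print(logits.shape)
```

The bidirectional LSTM doubles the last dimension, which is why the linear layer takes 2 * hidden input features.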

The CustomTorchModel class is used in the main script, which relies on Ignite. The core of the problem lies in the create_supervised_trainer1 function, which computes a CrossEntropyLoss. For sequence inputs this loss expects predictions of shape [batch, categories, sequence] and targets of shape [batch, sequence], so the predictions are transposed before the loss is calculated:

import torch
import torch.nn as nn
from ignite.engine import Engine
from ignite.utils import convert_tensor
from torch.optim import AdamW # assumption: the original import is not shown in this excerpt
from torch.optim.lr_scheduler import ExponentialLR

    criterion = nn.CrossEntropyLoss(reduction='mean')
    optimizer = AdamW(model.parameters(), lr=1e-5)
    lr_scheduler = ExponentialLR(optimizer, gamma=0.90)
    trainer = create_supervised_trainer1(model.to(device), optimizer, criterion, device=device)

def _prepare_batch(batch, device=None, non_blocking=False):
    # The model receives a list [input_ids, attention_mask]; the labels are the target
    x = [batch["input_ids"], batch["attention_mask"]]
    y = batch["labels"]
    return (convert_tensor(x, device=device, non_blocking=non_blocking),
            convert_tensor(y, device=device, non_blocking=non_blocking))

def create_supervised_trainer1(model, optimizer, loss_fn, metrics=None, device=None):
    metrics = metrics or {} # avoid a shared mutable default argument

    def _update(engine, batch):
        model.train()
        optimizer.zero_grad()
        x, y = _prepare_batch(batch, device=device)
        y_pred = model(x) # [batch, sequence, categories]
        transposed_y_pred = torch.transpose(y_pred, 1, 2) # [batch, categories, sequence]
        loss = loss_fn(transposed_y_pred, y.long())
        loss.backward()
        optimizer.step()

        # Detach the predictions so the metrics do not keep the autograd graph alive
        return loss.item(), transposed_y_pred.detach(), y.long()

    def _metrics_transform(output):
        return output[1], output[2]

    engine = Engine(_update)

    for name, metric in metrics.items():
        metric._output_transform = _metrics_transform
        metric.attach(engine, name)

    return engine
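
To illustrate the shape convention that the transposition relies on, here is a small sketch with random tensors and reduced, hypothetical sizes. It compares the transposed form with the common alternative of flattening all tokens into one dimension; both should give the same mean loss.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
batch, seq_len, num_labels = 2, 8, 5                 # reduced, hypothetical sizes
logits = torch.randn(batch, seq_len, num_labels)     # model output: [batch, sequence, classes]
labels = torch.randint(0, num_labels, (batch, seq_len))

criterion = nn.CrossEntropyLoss(reduction='mean')

# Option 1: move the class dimension to position 1 -> [batch, classes, sequence].
loss_transposed = criterion(logits.transpose(1, 2), labels)

# Option 2: flatten the tokens -> [batch * sequence, classes] vs. [batch * sequence].
loss_flat = criterion(logits.reshape(-1, num_labels), labels.reshape(-1))

print(torch.allclose(loss_transposed, loss_flat))
```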

I have two main questions:

A) Running the above code produces the following error right at the end of the first epoch. How should I troubleshoot this issue?

Current run is terminating due to exception: Expected target size [8, 1182], got [8, 256]
Engine run is terminating due to exception: Expected target size [8, 1182], got [8, 256]
Engine run is terminating due to exception: Expected target size [8, 1182], got [8, 256]
Traceback (most recent call last):
  File "/home/usuaris/user/august/src/main/ignite_script.py", line 479, in run
    trainer.run(train_dataloader, max_epochs=epochs)
  File "/home/usuaris/user/.local/lib/python3.9/site-packages/ignite/engine/engine.py", line 892, in run
    return self._internal_run()
  File "/home/usuaris/user/.local/lib/python3.9/site-packages/ignite/engine/engine.py", line 935, in _internal_run
    return next(self._internal_run_generator)
  File "/home/usuaris/user/.local/lib/python3.9/site-packages/ignite/engine/engine.py", line 993, in _internal_run_as_gen
    self._handle_exception(e)
  File "/home/usuaris/user/.local/lib/python3.9/site-packages/ignite/engine/engine.py", line 638, in _handle_exception
    raise e
  File "/home/usuaris/user/.local/lib/python3.9/site-packages/ignite/engine/engine.py", line 965, in _internal_run_as_gen
    self._fire_event(Events.EPOCH_COMPLETED)
  File "/home/usuaris/user/.local/lib/python3.9/site-packages/ignite/engine/engine.py", line 425, in _fire_event
    func(*first, *(event_args + others), **kwargs)
  File "/home/usuaris/user/august/src/main/ignite_script.py", line 460, in log_training_results
    evaluator.run(train_dataloader)
  File "/home/usuaris/user/.local/lib/python3.9/site-packages/ignite/engine/engine.py", line 892, in run
    return self._internal_run()
  File "/home/usuaris/user/.local/lib/python3.9/site-packages/ignite/engine/engine.py", line 935, in _internal_run
    return next(self._internal_run_generator)
  File "/home/usuaris/user/.local/lib/python3.9/site-packages/ignite/engine/engine.py", line 993, in _internal_run_as_gen
    self._handle_exception(e)
  File "/home/usuaris/user/.local/lib/python3.9/site-packages/ignite/engine/engine.py", line 638, in _handle_exception
    raise e
  File "/home/usuaris/user/.local/lib/python3.9/site-packages/ignite/engine/engine.py", line 959, in _internal_run_as_gen
    epoch_time_taken += yield from self._run_once_on_dataset_as_gen()
  File "/home/usuaris/user/.local/lib/python3.9/site-packages/ignite/engine/engine.py", line 1087, in _run_once_on_dataset_as_gen
    self._handle_exception(e)
  File "/home/usuaris/user/.local/lib/python3.9/site-packages/ignite/engine/engine.py", line 638, in _handle_exception
    raise e
  File "/home/usuaris/user/.local/lib/python3.9/site-packages/ignite/engine/engine.py", line 1069, in _run_once_on_dataset_as_gen
    self._fire_event(Events.ITERATION_COMPLETED)
  File "/home/usuaris/user/.local/lib/python3.9/site-packages/ignite/engine/engine.py", line 425, in _fire_event
    func(*first, *(event_args + others), **kwargs)
  File "/home/usuaris/user/.local/lib/python3.9/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/home/usuaris/user/.local/lib/python3.9/site-packages/ignite/metrics/metric.py", line 311, in iteration_completed
    self.update(output)
  File "/home/usuaris/user/.local/lib/python3.9/site-packages/ignite/metrics/metric.py", line 596, in wrapper
    func(self, *args, **kwargs)
  File "/home/usuaris/user/.local/lib/python3.9/site-packages/ignite/metrics/loss.py", line 92, in update
    average_loss = self._loss_fn(y_pred, y, **kwargs).detach()
  File "/home/usuaris/user/.local/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/usuaris/user/.local/lib/python3.9/site-packages/torch/nn/modules/loss.py", line 1163, in forward
    return F.cross_entropy(input, target, weight=self.weight,
  File "/home/usuaris/user/.local/lib/python3.9/site-packages/torch/nn/functional.py", line 2996, in cross_entropy
    return torch._C._nn.cross_entropy_loss(input, target, weight, _Reduction.get_enum(reduction), ignore_index, label_smoothing)
RuntimeError: Expected target size [8, 1182], got [8, 256]
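
For reference, the traceback passes through evaluator.run in log_training_results and ends in the update method of ignite/metrics/loss.py, not in the trainer's _update. The sketch below uses random tensors with the same sizes as in the message (an assumption standing in for the real model output) and shows that nn.CrossEntropyLoss raises an error of this kind when it receives [batch, sequence, categories] predictions without the transposition. The evaluator code is not shown here, so whether it actually skips the transposition is only a guess.

```python
import torch
import torch.nn as nn

# Random stand-ins with the sizes from the error message.
logits = torch.randn(8, 256, 1182)                  # [batch, sequence, categories]
labels = torch.randint(0, 1182, (8, 256))           # [batch, sequence]

criterion = nn.CrossEntropyLoss()

# Without the transposition, dimension 1 (256) is read as the class dimension.
try:
    criterion(logits, labels)
    message = None
except (RuntimeError, ValueError) as exc:
    message = str(exc)
print(message)

# With the transposition, the same tensors are accepted.
loss = criterion(logits.transpose(1, 2), labels)
print(loss.dim())
```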

B) My proposal deals solely with predicting the category of the entities, not with recognizing them (i.e. this is not a NER problem). Therefore, to mask the labels, do I have to calculate the loss with an approach similar to the last manual approach featured in this post? My labels take values between 0 and 1181, plus -100 for the special tokens (padding and the beginning/end of a sentence), so such an approach would filter out both the special tokens (-100) and the labels of tokens that are not of interest (0).
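
To make the idea in B) concrete, this is a rough sketch of what I have in mind (toy sizes and random logits are assumptions): label 0 is mapped onto the ignore index before the loss is computed, so both kinds of positions are skipped. The result is cross-checked against an explicit boolean mask.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
num_labels = 6                                       # toy value instead of 1182
logits = torch.randn(2, 5, num_labels)               # [batch, sequence, classes]
labels = torch.tensor([[-100, 0, 3, 0, -100],
                       [-100, 2, 2, 0, 5]])

# Map the "no category" label 0 onto the ignore index so both are skipped.
masked_labels = labels.masked_fill(labels == 0, -100)

criterion = nn.CrossEntropyLoss(ignore_index=-100)
loss = criterion(logits.transpose(1, 2), masked_labels)

# Cross-check against an explicit boolean mask over the tokens of interest.
active = (labels != -100) & (labels != 0)
loss_manual = criterion(logits[active], labels[active])

print(torch.allclose(loss, loss_manual))
```

One consequence of this design is that the model never receives a training signal for class 0, which only makes sense if the tokens of interest are identified by some other step.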

Thanks for your time!