Trying to understand targets in ASR with CTCLoss

Hi everyone,
I am trying to work with the LibriSpeech dataset to build an ASR model with LSTM layers.
It is still not very clear to me how I should preprocess the data correctly. I have a list of input arrays containing the MFCC features for each whole sentence file, and a list of labels.
I padded all the input MFCC arrays to the same length as the longest audio file.
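For reference, here is roughly what my padding step looks like (just a sketch; the shapes and n_mfcc = 13 are placeholders):

import numpy as np

# each element of mfcc_list is (n_frames_i, n_mfcc); shapes are made up
mfcc_list = [np.random.randn(500, 13), np.random.randn(640, 13)]

max_len = max(m.shape[0] for m in mfcc_list)
input_lengths = np.array([m.shape[0] for m in mfcc_list])  # true lengths

# zero-pad every utterance along the time axis up to the longest one
padded_inputs = np.stack([
    np.pad(m, ((0, max_len - m.shape[0]), (0, 0)), mode="constant")
    for m in mfcc_list
])
print(padded_inputs.shape)  # (2, 640, 13) -> (batch, time, features)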
To build the labels, I mapped each character in the sentence to an integer index, so that

label_i = "how are you?"
vectorlabel_i = np.array([4, 2, 5, 0, 11, 22, 9, 0, 25, 2, 14, 27])

I then padded the labels as well with a blank index up to the length of the longest sentence.
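Concretely, my encoding and padding looks something like this (the character set and indexes here are made up and won't match the numbers above exactly):

import numpy as np

CHARS = " abcdefghijklmnopqrstuvwxyz?"  # ' ' -> 0, 'a' -> 1, ... (28 chars)
char_to_idx = {c: i for i, c in enumerate(CHARS)}
BLANK = len(CHARS)  # 28, used both as the CTC blank and as label padding

def encode(transcript, max_len):
    idxs = [char_to_idx[c] for c in transcript.lower()]
    padded = idxs + [BLANK] * (max_len - len(idxs))
    return np.array(padded), len(idxs)  # also keep the true label length

vectorlabel, target_length = encode("how are you?", max_len=20)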

It is still not clear to me, however, how I should go from here to actually feeding the inputs and labels into the model.
Is each whole sentence a single input, or should I feed batches of frames (e.g. a sentence with a sequence length of 640 fed in chunks of 32 frames)?
Is the format of the labels correct? I don't think I should use an embedding layer, given what the output of the network is supposed to look like.
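For context, here is how I currently imagine everything fitting together with torch.nn.CTCLoss (a rough sketch with placeholder shapes; part of my question is whether this structure is even right):

import torch
import torch.nn as nn

batch, T, n_mfcc, n_classes = 32, 640, 13, 29  # 28 characters + blank

lstm = nn.LSTM(input_size=n_mfcc, hidden_size=128, batch_first=True)
fc = nn.Linear(128, n_classes)
ctc = nn.CTCLoss(blank=28)  # same blank index I pad the labels with

x = torch.randn(batch, T, n_mfcc)        # padded MFCC batch
out, _ = lstm(x)                         # (batch, T, 128)
log_probs = fc(out).log_softmax(-1)      # (batch, T, n_classes)
log_probs = log_probs.transpose(0, 1)    # CTCLoss expects (T, batch, C)

targets = torch.randint(0, 28, (batch, 20))               # padded labels
input_lengths = torch.full((batch,), T, dtype=torch.long)  # frames per item
target_lengths = torch.randint(5, 21, (batch,), dtype=torch.long)

loss = ctc(log_probs, targets, input_lengths, target_lengths)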

Thanks for reading.