What are the stages for training a neural net for speech recognition?

Is it:

  1. create a mel spectrogram
  2. feed it to the neural net
  3. backprop (and by the way, what loss function should I use?)
  4. repeat until the loss stops decreasing

Or should I pass in batches or something? I'm a complete newbie and it's my first project, so I don't know how to create an audio dataset and pass it in batches… the tutorials that PyTorch provides can't help me, unfortunately… Could someone explain the steps in the comments or point me to a good tutorial?

There are also models that work directly on waveforms, e.g. wav2letter.
For an even simpler entry, you might look at the Speech Commands tutorial where the waveform is translated into class predictions for keywords (a much simpler problem and relatively fast to train).

Best regards