How to select images from a Morse code spectrogram

A torchaudio advisor suggested I post my request for advice here. I want to try a deep learning approach to building a Morse code decoder. I am at the beginning stages of creating a training set of images and have lots of options for how to do it, so I am looking for suggestions from experts in the hope of getting a working, useful result. After playing around with audio spectrograms of received Morse code signals, it appears that I should select small images from the spectrograms that follow a narrow frequency band, maybe something like 50 Hz x 5 s. This would translate to about an 8 x 512 slice, assuming a 1024-point FFT and a 44100 Hz audio sampling rate. Is this reasonable? I suspect the slices need to overlap by about 50%. The spectrogram below shows 10 seconds captured on a busy Sunday afternoon. The 22 kHz audio capture translates to 512 frequency channels (vertical height), and most code signals are about 2-3 channels wide. The spectrogram shows only the 3 kHz audio passband from the ham radio.
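To sanity-check the slice size, here is a minimal sketch of the arithmetic and of cutting overlapping slices along one frequency band. It uses only NumPy and assumes a hop of half the FFT size (the hop is not something I have fixed yet); the function names are made up for illustration. With that hop, 50 Hz works out to about one bin and 5 s to about 431 frames, so an 8 x 512 slice would cover closer to 345 Hz x 5.9 s. A different hop size changes these numbers.

```python
import numpy as np

SAMPLE_RATE = 44100   # Hz, from the post
N_FFT = 1024          # FFT size, from the post
HOP = N_FFT // 2      # ASSUMPTION: 50% frame overlap, not stated in the post


def slice_geometry(band_hz, duration_s):
    """Return (rows, cols): frequency bins and time frames covered by one slice."""
    bin_hz = SAMPLE_RATE / N_FFT     # about 43.07 Hz per frequency bin
    frame_s = HOP / SAMPLE_RATE      # about 11.6 ms per time frame
    rows = max(1, round(band_hz / bin_hz))
    cols = round(duration_s / frame_s)
    return rows, cols


def band_slices(spec, row_start, rows, cols, overlap=0.5):
    """Cut overlapping (rows x cols) windows along time from a (freq x time) array."""
    step = max(1, int(cols * (1 - overlap)))
    band = spec[row_start:row_start + rows]
    last_start = band.shape[1] - cols
    return [band[:, t:t + cols] for t in range(0, last_start + 1, step)]


def to_unit_range(img):
    """Scale a grayscale slice to the range -1..1 (constant input maps to zeros)."""
    lo, hi = float(img.min()), float(img.max())
    if hi == lo:
        return np.zeros_like(img, dtype=np.float32)
    return (2.0 * (img - lo) / (hi - lo) - 1.0).astype(np.float32)


if __name__ == "__main__":
    print(slice_geometry(50, 5))                # bins and frames for 50 Hz x 5 s
    spec = np.random.rand(N_FFT // 2 + 1, 1000)  # stand-in for a real spectrogram
    slices = band_slices(spec, row_start=100, rows=8, cols=512)
    print(len(slices), slices[0].shape)
```

The resulting arrays can be handed to `torch.from_numpy` one slice at a time, which keeps the per-slice work small.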

Visual code patterns for strong signals are easily seen, and I can make images of individual letters and letter combinations. Should I stick to grayscale for learning? How should the training handle the multiple letter patterns in each time slice? I am beginning to wonder how to handle overlapping information in adjacent images, but hope that it can be handled with a post-filtering merge process. Individual letter patterns can differ widely in length; for example, “e” is a single dit, whereas “0” is five dahs. That is 19 times longer once the gaps between the dahs are counted. As seen in the image, the code is also sent at different speeds. Tones rarely overlap and most have clean patterns.

The torchaudio spectrogram worked about the same as the NumPy/SciPy examples and the Audacity program, and it appears that slices of a few seconds can be handled even on a Raspberry Pi. Converting the NumPy image arrays to tensors and scaling them to the range -1 to 1 was easy. I am wondering how fast the decoder can actually run. I need suggestions for designing the training sets, and will likely need more when I get to the actual training. Any suggestions for selecting the images and for numbering or labelling the resulting patterns would be appreciated.
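To make the length differences concrete, here is a small sketch of the standard Morse timing (dit = 1 unit, dah = 3 units, 1-unit gap inside a character, 3-unit gap between letters) and the common "PARIS" convention of 1.2 / WPM seconds per dit. The table holds only a few characters and the function names are hypothetical; word spaces are not handled. With this timing, “0” comes out at 19 units against 1 unit for “e”.

```python
# Partial Morse table, just enough for illustration.
MORSE = {"e": ".", "t": "-", "s": "...", "o": "---", "0": "-----"}


def char_units(ch):
    """Length of one character in dit units, including gaps between its elements."""
    pattern = MORSE[ch]
    marks = sum(1 if symbol == "." else 3 for symbol in pattern)
    gaps = len(pattern) - 1          # one 1-unit gap between adjacent elements
    return marks + gaps


def dit_seconds(wpm):
    """Dit duration in seconds under the PARIS convention (50 units per word)."""
    return 1.2 / wpm


def keying_units(text):
    """On/off keying (1 = tone, 0 = silence) per dit unit for a string of letters."""
    out = []
    for i, ch in enumerate(text):
        if i > 0:
            out += [0] * 3           # 3-unit gap between letters
        for j, symbol in enumerate(MORSE[ch]):
            if j > 0:
                out += [0]           # 1-unit gap inside a character
            out += [1] * (1 if symbol == "." else 3)
    return out


if __name__ == "__main__":
    print(char_units("e"), char_units("0"))
    print(dit_seconds(20))
    print(keying_units("et"))
```

An on/off sequence like this, stretched by the dit duration, could serve as a per-frame label for a slice or as the envelope for generating synthetic training tones at several speeds.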