I am feeding mel spectrograms to resnet34 for pre-training, in order to extract high-dimensional audio representations. If log_S has shape (mel, frames) = (128, 196), does each sample need to be padded or truncated to (128, 500) before it is sent to the resnet34 model for training?
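To make the padding/truncation step concrete, here is a minimal sketch of what I mean, using NumPy. The function name `pad_or_truncate`, the right-side padding, and the choice of the spectrogram's minimum as the pad value are my own assumptions, not something taken from a tutorial.

```python
import numpy as np

def pad_or_truncate(log_S, target_frames=500, pad_value=None):
    """Return log_S with exactly `target_frames` columns (hypothetical helper).

    log_S is a 2-D array of shape (n_mels, n_frames).
    """
    n_mels, n_frames = log_S.shape
    if n_frames >= target_frames:
        # Too long: keep only the first `target_frames` frames.
        return log_S[:, :target_frames]
    if pad_value is None:
        # Assumption: pad with the quietest value in the spectrogram,
        # so the padding looks like silence on a log scale.
        pad_value = float(log_S.min())
    pad_width = target_frames - n_frames
    # Pad only along the time axis, on the right.
    return np.pad(log_S, ((0, 0), (0, pad_width)),
                  mode="constant", constant_values=pad_value)

# Example with random data standing in for a real log-mel spectrogram.
rng = np.random.default_rng(0)
short = rng.normal(size=(128, 196))
long_ = rng.normal(size=(128, 700))
padded = pad_or_truncate(short)
truncated = pad_or_truncate(long_)
print(padded.shape, truncated.shape)
```

Padding on the right keeps the real frames aligned at the start of every sample; centering or repeating the clip are alternatives I have not compared.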
Question 1. Does the padded spectrogram need to be normalized before it can be sent to the model for training?
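For context, this is the kind of normalization I have in mind: per-sample standardization to zero mean and unit variance. It is only a sketch under my own assumptions (the helper name `standardize`, computing the statistics on the real frames only, and padding afterwards with the normalized minimum); I do not know whether this is the recommended order.

```python
import numpy as np

def standardize(log_S, eps=1e-8):
    """Scale one spectrogram to zero mean and unit variance (hypothetical helper)."""
    mean = log_S.mean()
    std = log_S.std()
    return (log_S - mean) / (std + eps)

# Random data standing in for a real (128, 196) log-mel spectrogram in dB.
rng = np.random.default_rng(0)
log_S = rng.normal(loc=-40.0, scale=10.0, size=(128, 196))

# Assumption: normalize first, using only the real frames,
# so the padding does not skew the mean and standard deviation.
normed = standardize(log_S)

# Then pad the time axis to 500 frames with the lowest normalized value.
padded = np.pad(normed, ((0, 0), (0, 500 - normed.shape[1])),
                mode="constant", constant_values=float(normed.min()))
print(padded.shape)
```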
Question 2. Does resnet34 need its last two dimensions (height, width), which here correspond to (mel, frames) = (128, 500), to be equal for better results? In the resnet34 examples I have seen, the last two dimensions are mostly 224, 224, or otherwise equal to each other.
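As far as I can tell, the torchvision resnet34 ends with adaptive average pooling, so a non-square input should at least run. The sketch below traces the spatial size through the downsampling layers using the standard convolution output formula. The (kernel, stride, padding) values are what I believe the torchvision implementation uses, and `resnet34_feature_map` is a name I made up; it says nothing about whether accuracy is better with square inputs.

```python
def conv_out(size, kernel, stride, padding):
    """Output length of a conv/pool layer along one spatial dimension."""
    return (size + 2 * padding - kernel) // stride + 1

def resnet34_feature_map(height, width):
    """Spatial size of the last feature map, before global pooling.

    Assumes the usual resnet34 layout: only these five layers change
    the spatial size; layer1 keeps it unchanged.
    """
    stages = [
        (7, 2, 3),  # conv1: 7x7, stride 2, padding 3
        (3, 2, 1),  # maxpool: 3x3, stride 2, padding 1
        (3, 2, 1),  # first 3x3 conv of layer2, stride 2
        (3, 2, 1),  # first 3x3 conv of layer3, stride 2
        (3, 2, 1),  # first 3x3 conv of layer4, stride 2
    ]
    dims = (height, width)
    for kernel, stride, padding in stages:
        dims = tuple(conv_out(d, kernel, stride, padding) for d in dims)
    return dims

print(resnet34_feature_map(224, 224))  # (7, 7)
print(resnet34_feature_map(128, 500))  # (4, 16)
```

Under these assumptions a (128, 500) input ends up as a 4 x 16 feature map, which the adaptive pooling then reduces to 1 x 1 just like the 7 x 7 map from a 224 x 224 image.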
Question 3. The tutorials I have looked at all apply transforms. What is the usual set of operations? transforms = transforms.Compose([