BERT-style pretraining on spectrograms

Hey guys,

(This ended up being a wall of text to describe my setup, actual question is at bottom)

I’m trying, as best I can, to apply the BERT pretraining setup to a model I’ve been working on. The model itself is a transformer where each frame of the spectrogram is treated as a token; it uses the Evolved Transformer encoder architecture with the modifications from the Primer-EZ architecture, plus relative positional encoding as seen in the Music Transformer. The input to each transformer encoder is a linear bottleneck that extracts a single-channel representation of the two channels; the output is a transformed frame that is concatenated with the input, forming a DenseNet out of the transformer encoders, and the final output is bottlenecked back down to 2 channels.

I have a custom data loader which takes in a collection of spectrograms (each double the size trained on, so it can cut out different slices of the spectrogram for more data). From here, the input spectrogram is split into groups of N frames. For the first task, the groups are treated as whole ‘words’ and are selected at a rate of 15%. As with BERT, 80% of the selected groups are whited/masked out by setting each frequency bin to 1.0, 10% of the time I take the max between random noise and the original values, and 10% of the time the original frames are used without modification. The network is then tasked with producing a multiplicative mask, passed through a sigmoid, which lets it either leave frames as they are or chisel away at them to recreate the expected audio.
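To make the corruption scheme concrete, here is a minimal sketch of the 15% group selection with the 80/10/10 split described above. The function name, group size, and the assumption that magnitudes live in [0, 1] are mine, not from the actual repo:

```python
import torch

def mask_spectrogram(spec, group_size=16, select_prob=0.15):
    """BERT-style masking over groups of spectrogram frames.

    spec: (num_frames, num_bins) magnitude spectrogram, assumed in [0, 1].
    Returns the corrupted input and a boolean flag per frame marking
    which frames were selected for the unmasking task.
    """
    num_frames, num_bins = spec.shape
    corrupted = spec.clone()
    selected = torch.zeros(num_frames, dtype=torch.bool)

    for start in range(0, num_frames, group_size):
        if torch.rand(1).item() >= select_prob:
            continue  # group not selected
        end = min(start + group_size, num_frames)
        selected[start:end] = True
        r = torch.rand(1).item()
        if r < 0.8:
            # 80%: white out the group by setting every bin to 1.0
            corrupted[start:end] = 1.0
        elif r < 0.9:
            # 10%: max between random noise and the original values
            noise = torch.rand(end - start, num_bins)
            corrupted[start:end] = torch.maximum(corrupted[start:end], noise)
        # remaining 10%: leave the original frames untouched
    return corrupted, selected
```

The `selected` flags are what you would later feed to a loss that only covers corrupted frames, HuBERT-style.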

For the second task, a separator ‘token’ (alternating 0s and 1s in the frequency bins, which seems highly unlikely to occur naturally, though maybe I’m not thinking about something obvious) is inserted between the first and second halves of the spectrogram. 50% of the time the second half is replaced with a random slice from a different spectrogram, and the network learns to predict whether the second half of the spectrogram is a continuation of the first half. To keep the transition area hidden from the network, I blank out half a token’s worth of frames (16/2 = 8 frames in my case) at the end of the first half and the beginning of the second half. The separator token is included in both the input tensor and the target tensor.
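A sketch of how such a next-spectrogram-prediction pair might be assembled (the function name and shapes are illustrative; the caller flips the 50/50 coin over whether `spec_b` is the true continuation or a random slice, and that coin flip is the binary label):

```python
import torch

def make_nsp_pair(spec_a, spec_b, blank=8):
    """Build a next-spectrogram-prediction input.

    spec_a: (frames, bins) first half; spec_b: (frames, bins) candidate
    second half. A separator frame of alternating 0s and 1s is inserted
    between them, and `blank` frames on each side of the boundary are
    whited out so the transition area is not visible to the network.
    """
    bins = spec_a.shape[1]
    sep = (torch.arange(bins) % 2).float().unsqueeze(0)  # alternating 0/1 row

    a = spec_a.clone()
    b = spec_b.clone()
    a[-blank:] = 1.0  # hide the end of the first half
    b[:blank] = 1.0   # hide the start of the second half
    return torch.cat([a, sep, b], dim=0)
```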

Now, needless to say, there are some serious differences here compared to NLP. For one, the embeddings in this architecture are the frames of the spectrogram itself; this works quite well in a strictly supervised fashion when converting mixes with vocals to instrumental mixes, but it is definitely a major difference. Another is that spectrograms are far more fluid in nature than language: adjacent frames typically transition smoothly into each other (hence masking out larger chunks rather than individual frames, since otherwise the model could probably learn an interpolation function and be reasonably effective, I’d imagine).

So, this brings me to my question: does anyone have any critiques or suggestions for this idea? I’d be more than happy to share pretrained checkpoints afterward in an open-source fashion and maybe run a little community microproject or something (I do have this on GitHub if anyone is interested, though it’s highly experimental and changes rapidly; it was forked from an MMDENSELSTM implementation and evolved over time). I currently have a model pretraining on thousands of albums. It’s a little hard to gauge how many songs at this point, but it’s well over 1 TB: the data rests on three SSDs, a 2 TB drive dedicated to just this, a 1 TB SSD that is 95% dedicated to it, and a 1 TB external USB-C SSD holding about 400 GB (loading more from it adds latency).

My main goal is to use this for track separation, so this isn’t a commercial or academic endeavor; I just love instrumental music lol. I’d also be open to being told I’m making some unsound judgments here and to being corrected on any of this. I’m a software engineer at a fintech company trying to learn as much as I can about machine learning, and immersing myself in a somewhat challenging problem seems like the best way to do that.

Edit: I did some more testing, in case any of this gives anyone ideas. I changed the architecture to a U-Net and am getting significantly higher quality with it. I use Nx1 kernel convolutions with a stride of 2x1 so that only features from the same frame of the spectrogram are convolved; this embeds them into a lower-dimensional space and adds locality along the pitch dimension, which I imagine is important due to the nature of sound, i.e. octaves. The model includes frame encoders using the Nx1 kernel convolutions, each followed by a sequence of transformer encoders - a smaller number than usual, in my case 2 per downsampling stage (with 5 stages, that means 10 encoders operating at different pitch scales). After this, the following frame encoder downsamples along the frequency dimension and embeds more pitch locality while retaining the resolution of the temporal dimension. What’s interesting is that increasing the channel count at each downsampling doesn’t have a huge effect, which seems to imply it’s the downsampling itself that is helpful; I haven’t tested this thoroughly yet, short of verifying that lowering the channel count did not in fact hurt validation loss, which is not what I expected. The slightly weird part is that before each frame decoder, I use the transformer decoder architecture instead (or at least my hybrid variant of the Evolved Transformer/Primer/Music Transformer). For the memory, I use the U-Net’s skip connection, which includes the output from the transformer encoders at that level, allowing the decoder to effectively query for global information from the original representation at that level.
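A minimal sketch of the frequency-only downsampling idea, assuming a (batch, channels, freq_bins, frames) layout; the class name, kernel size, and activation are my own choices, not taken from the actual repo:

```python
import torch
import torch.nn as nn

class FrameEncoder(nn.Module):
    """Downsample along frequency only, keeping temporal resolution.

    An Nx1 kernel with stride (2, 1) means each output value mixes N
    neighbouring frequency bins of a single frame - adding pitch
    locality without blending adjacent frames in time.
    """
    def __init__(self, in_ch, out_ch, kernel=3):
        super().__init__()
        self.conv = nn.Conv2d(
            in_ch, out_ch,
            kernel_size=(kernel, 1),
            stride=(2, 1),
            padding=(kernel // 2, 0))
        self.act = nn.GELU()

    def forward(self, x):  # x: (batch, ch, freq_bins, frames)
        return self.act(self.conv(x))
```

Stacking five of these halves the frequency axis at each stage while the frame count stays fixed, which matches the "downsamples only on one dimension" behaviour described above.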

I still need to actually test what makes the U-Net version so much higher quality; the explanation about octaves and pitch locality is mainly speculation. The pure transformer variant works fairly well, but the U-Net variant is significantly higher quality. It’s a little weird having a U-Net that downsamples on only one dimension, but a single U-Net with just the frequency embedding outperforms a DenseNet setup with three U-Nets using the frequency and temporal embedding setup, even when that setup includes the transformer modules. Kinda interesting. I’ll be pretraining it on my dataset over the next week and will likely update this post with a pretrained checkpoint if things go well, though it’s currently 205M parameters, so training requires 12 GB of VRAM. Pretty excited to see how it turns out, though I worry that the probability distributions that BERT predicts are where it really gets its power from…

Hmmm, are you aware that a “BERT for spectrograms” already exists? [2106.07447] HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units
There is also Audio-Visual HuBERT (a 2022 paper).

I may read the whole post tomorrow :smiley:


Interesting, I was not! Will definitely be giving that a read tonight, thanks for sharing it!

That paper is definitely interesting. I think they are doing things slightly differently, and given the different task (speech recognition vs music processing) that probably makes sense. One thing they mention is that they apply the unmasking loss only to the portion that was masked rather than to the entire sequence. My setup differs here: I have the model predict a multiplicative mask that unmasks the spectrogram as needed, and the loss is applied to the entire spectrogram. Another difference in my setup, however, is the second task of ‘next spectrogram prediction’, which is analogous to BERT’s next sentence prediction. The reason they give in the paper for applying loss only to the masked portions is to learn both the unmasked representations and long-term structure, which is ultimately analogous to what next sentence prediction in BERT seeks to accomplish (at least from my understanding). Still reading, but that’s one thing that popped out at me.

Hi @carperbr, as @JuanFMontesinos mentioned, HuBERT is the BERT-like model for audio pre-training. In torchaudio, there is a pre-training recipe for the HuBERT model which you may be interested in (check it here: audio/examples/hubert at main · pytorch/audio · GitHub). The model is built for 16 kHz audio, so you may need to tune the feature extraction part to suit your music data (most music audio has a higher sample rate).

If you want to try applying losses on both masked and unmasked portions, you can tune the parameters masked_weight and unmasked_weight in hubert_loss. I haven’t compared the performance difference between masked and unmasked losses, so I’d be happy to hear about your experience with this.


Hey! Thanks for the response. If I’m not mistaken (and I’m happy to be corrected here), HuBERT is meant for something slightly different (speech recognition) and has some priors built in for that.

My goal is to use BERT-style pretraining specifically for music. It sounds like HuBERT uses k-means clustering to arrive at a model that can effectively assign classes to frames/chunks of samples from the waveform, and then uses the transformer to try and predict those classes.

While I’m sure the hidden units in HuBERT wouldn’t be meaningless here (although I think they would be slightly more abstract and might require more classes for something more abstract like music? who knows, maybe they’d work better - it’s definitely on my list of things to try), I’m mainly doing this as a learning experience, so I want to try doing things in a slightly weird way. As well, the supervised model I’ve been working on uses a custom transformer architecture meant specifically for spectrograms, so I would ideally like to stick with spectrograms. To add a probabilistic term to the loss from my original post and more closely match BERT/HuBERT with their probability distributions, I ended up adding adversarial loss to my setup and am using two of the frame transformers in a generator/discriminator setup (basically a modification of pix2pix for spectrograms using transformers, though I’m not sure about the discriminator, as it outputs a probability for each frame, which might be excessive). I’m not too sure this will work as well in the end as the hidden units in HuBERT, especially given it’s just binary cross-entropy…

I guess to add a bit more context: my downstream task is going to be music processing, specifically vocal removal, with some further ideas beyond that. In an attempt to make the pretraining task more transferable, the model learns to unmask entire frames by generating a mask via sigmoid that is multiplied with the full spectrogram, where masked frames are set to all 1s (so the masked input comes in, and the transformer outputs a mask that either reconstructs the masked frames or leaves unmasked frames alone). The hope is that since it had to learn unmasking by multiplying all 1s by an output value for each frequency bin, it will be able to learn to remove vocals more effectively (which also works by multiplying an output value for each frequency bin).
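The mechanism is simple enough to show in a couple of lines; this is a sketch under the assumptions above (masked frames set to all 1s, raw network output as logits - the function name is mine):

```python
import torch

def unmask(net_logits, masked_spec):
    """Multiplicative unmasking.

    Masked frames are all 1s, so on those frames the sigmoid output
    alone reconstructs the target; on unmasked frames the network can
    push logits high so sigmoid(logits) ~ 1 and the frame passes
    through roughly unchanged. Vocal removal reuses the same mechanism:
    a per-bin multiplier applied to the mixture spectrogram.
    """
    return torch.sigmoid(net_logits) * masked_spec
```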

While I’m sure the hidden units in HuBERT wouldn’t be meaningless here (although I think they would be slightly more abstract and would potentially require more classes for something more abstract like music?

I agree. The k-means labels in HuBERT are generated either by clustering MFCC features (which are geared toward the ASR task) or by clustering the transformer features from the previous HuBERT model. For music data I’m sure it will still learn the data distribution - for example, low frequency for bass, high frequency for strings, etc. You can give it a try.

If your goal is to remove the vocals from music, there are several models that can handle it, for example open-unmix and demucs. You can use those models as baselines for your self-supervised learning method. Hope that helps :slight_smile:


Yeah, I actually have a project already using the transformer setup described above, which I coded for spectrograms inspired by MMDENSELSTM, to remove vocals, and I have trained it quite far already. Interestingly, it actually surpasses demucs’ quality in many areas, although demucs is doing more than just removing vocals, so that kinda makes sense. This is why I can’t really use HuBERT directly: my model treats the STFT phase of preprocessing as the embedding layer for the transformer. This means that when the model outputs new encoded ‘tokens’, those are really just frames in a spectrogram where each embedding dimension is a frequency bin. This allows me to pass in a source with vocals and output a mask that removes those vocals when multiplied with the source.

Each transformer module bottlenecks its input to a single channel and passes it through a transformer (Primer with relative positional encoding) module where each frequency bin is an embedding dimension of its respective frame. The linear layers are slid across the image, basically as a 1D convolution; the output is concatenated with the input in the style of DenseNet and then bottlenecked back to the 2-channel output expected from a spectrogram. In a sense, you could say my model uses the STFT to create the embeddings/tokens rather than using clustering as in HuBERT. It actually works quite well, which is why I want to stick with it (I wouldn’t be surprised if it already exists, but there is a certain element of fun to building it myself and then finding out it already exists - it makes me feel like I’m on the right track haha). At this point I’m just kinda addicted to improving the project and was hoping to get some more ideas haha. Pretraining seemed like a good way to improve the model if I could find a meaningful way to pretrain a vocal remover, and since I was already using transformers, mimicking BERT seemed like a reasonable next step.

Edit: I realize now, though, that I could just preprocess the spectrogram frames the same way HuBERT processes frames from an audio source and have the generator output a label prediction for each token it unmasks; it should be interesting to see whether that helps with L1 reconstruction as well.

My reason for this thread was just to get ideas for my project and hopefully some input/suggestions on the general idea, so you’ve definitely been helpful! I actually implemented the more localized loss as in HuBERT and am applying my L1 loss only to the masked regions of the spectrogram (the frames being unmasked), and it is working extremely well. The final unmasked image has a very low L1 loss over the full spectrogram, so it is clearly learning context as they mention in the HuBERT paper (it also kinda has to: the model only uses linear layers on individual frames of pixels, so cross-frame communication can only happen within the multi-head attention module).
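For anyone following along, the localized loss amounts to restricting the reconstruction term to the selected frames; a minimal sketch, assuming a per-frame boolean selection mask (names are illustrative):

```python
import torch
import torch.nn.functional as F

def masked_l1_loss(pred, target, selected):
    """L1 reconstruction loss applied only to masked frames,
    in the spirit of HuBERT's masked-prediction objective.

    pred, target: (frames, bins) spectrograms.
    selected: (frames,) bool - True for frames that were corrupted.
    """
    if not selected.any():
        return pred.new_zeros(())  # nothing was masked in this example
    return F.l1_loss(pred[selected], target[selected])
```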

Very happy to have had that paper linked, as it seems to have had a fairly large impact on my project so far! It also gave me the idea to use adversarial loss in a weird way: I’m now extracting the unmasked tokens from the spectrograms and using them as items in a second training batch specifically for the adversarial training. This allowed me to switch to a simple convnet for the discriminator, which is having no problem keeping up with the generator and is balancing out extremely well, while the L1 loss for the masked regions is actually lower than where it was with pure L1 loss so far.

Again, I appreciate the posts in this thread! I know I can be fairly verbose, so I appreciate you both taking the time.
