Hi everybody. I'm working on using PyTorch to improve my WAV2VGM project, which (deterministically) recreates an arbitrary sound using OPL3 synthesis to play a sum of sine waves. It works pretty well, but the OPL3 chip can do way better than sine waves, so I turned to PyTorch for help.
The synth has up to 18 two-oscillator channels, up to 6 of which can be paired to make more sophisticated four-oscillator channels. The configuration data is highly structured, and is converted back and forth between the binary register settings of the real synthesizer (maybe around 128 bytes of data) and a float32[222] configuration vector I'm using for training/inference.
The synth has a great deal of redundancy: 36 identical FM operators, each with a handful of settings (bits and short bit-strings specifying things such as waveform type, phase rate, amplitude, etc.), plus up to 18 channels (voices), each of which may use 2-4 operators and has its own configuration fields describing frequency, whether the key is on, operator configuration, and so on.
The point I'm trying to make is that there are many functionally equivalent configurations. I'm hard-coding all the volume-envelope stuff to 'always on', since I'm more interested in individual spectra than in spectrograms. So I decided to break out every relevant config field, whether it's a single bit or 10 bits (frequency), and assign each one to a single element of a float32[222] configuration vector. If a field is one bit, it becomes 1.0 or 0.0 depending on whether it is set. If it's two bits, the float can be any of four values from zero to full scale, and so on: f = intN_val / intN_max.
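For example, this is roughly how a single field maps to its float (a minimal sketch; encode_field/decode_field are just illustrative names, not the actual WAV2VGM code):

def encode_field(int_val, num_bits):
    # Normalize an N-bit register field to [0.0, 1.0]: f = int_val / int_max
    int_max = (1 << num_bits) - 1
    return int_val / int_max

def decode_field(f, num_bits):
    # Snap a predicted float back to the nearest legal integer register value
    int_max = (1 << num_bits) - 1
    return int(round(min(max(f, 0.0), 1.0) * int_max))

# e.g. a 2-bit field: 0 -> 0.0, 1 -> 0.333..., 2 -> 0.666..., 3 -> 1.0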
I can make gigabytes of training data, with each example consisting of a 2048-bin frequency spectrum (bins in dBFS), stored as a uint8[2048], plus a synth configuration vector as described above, a float32[222]. To make the training set, I'm randomly permuting the configuration vector, sometimes a little, sometimes a lot, and recording the resulting audio spectrum.
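Concretely, each pair is appended to two flat binary files (a sketch, assuming the on-disk layout matches what my Dataset class below expects; the function and file names are illustrative):

import numpy as np

def append_record(spect_bins, config_vec,
                  spect_path='train_spectra.bin', reg_path='train_configs.bin'):
    # spect_bins: 2048 spectrum values already quantized to 0..255 (scaled dBFS)
    # config_vec: the float32[222] synth configuration vector
    with open(spect_path, 'ab') as f:
        f.write(np.asarray(spect_bins, dtype=np.uint8).tobytes())    # 2048 bytes
    with open(reg_path, 'ab') as f:
        f.write(np.asarray(config_vec, dtype=np.float32).tobytes())  # 222 * 4 bytes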
My newbie hope is to train some kind of model with this data such that when presented with an arbitrary frequency spectrum, it infers a suitable OPL3 synth configuration vector to recreate the original sound.
My current attempt has two Conv1d input layers and one 4-head attention layer, followed by a handful of 2048-4096-neuron linear layers. The best I can get so far is that the predictions' spectra jiggle a bit and sometimes have peaks sort of near some peaks in the input.
My loss function is MSE, which I can see is a problem: like I said, there are many, many equivalent configurations that do the same job, so the training config is not the only right answer! (I may have a strategy to combat this, noted in the update below. Basically, I'm ignoring all but one synth channel for now.)
I've been going back and forth with ChatGPT to try to improve the system, but I'm not making a lot of progress. Should I convert my (2048,1) spectrum into something else? Is it too naive to hope to predict such a highly structured, multi-dimensional output? Any advice you can provide would be greatly appreciated!
The project code is on GitHub as WAV2VGM. I'll post my current data model below.
Update: Instead of using every voice/operator in the whole synth, I've decided to pare the training data down to just one channel, channel 0, in 2-operator mode. I'm hoping this will remove some ambiguity from the training set. Also, I'm now permuting frequency 99% of the time and everything else only 1% of the time. A 'permutation' is most often a small bump up or down, but other times the float value is completely re-randomized. In any case, only one float is changed from one set to the next.
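Roughly, one permutation step looks like this (a sketch; freq_indices, the bump size, and the small-bump probability are illustrative placeholders, not the real values from my generator):

import random

def permute_config(config, freq_indices, bump=0.02, p_small=0.8):
    # Mutate exactly one element of the float32[222] config vector per step.
    cfg = config.copy()
    # 99% of the time pick a frequency field, 1% of the time anything else
    if random.random() < 0.99:
        i = random.choice(freq_indices)
    else:
        i = random.randrange(len(cfg))
    if random.random() < p_small:
        # most often: a small bump up or down, clamped to [0, 1]
        cfg[i] = min(1.0, max(0.0, cfg[i] + random.uniform(-bump, bump)))
    else:
        # otherwise: completely re-randomize the value
        cfg[i] = random.random()
    return cfg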
Thanks!
Craig I.
Apologies for any eye-bleed induced by this complete newbie attempt!
import os
import numpy as np
import torch
import torch.nn as nn
from torch.utils.data import Dataset
# ------------------------------------------------------------------------
# AI Model Definition
# ------------------------------------------------------------------------
class OPL3Model(nn.Module):
    def __init__(self):
        super(OPL3Model, self).__init__()
        # Convolutional layers
        self.conv_layers = nn.Sequential(
            nn.Conv1d(in_channels=1, out_channels=32, kernel_size=5, stride=1, padding=2, padding_mode='reflect'),  # First conv layer
            nn.ReLU(),
            nn.Conv1d(in_channels=32, out_channels=64, kernel_size=5, stride=1, padding=2, padding_mode='reflect'),  # Second conv layer
            nn.ReLU()
        )
        self.attention = nn.MultiheadAttention(embed_dim=64, num_heads=4, batch_first=True)
        # Fully connected layers
        self.model = nn.Sequential(
            nn.Linear(64 * 2048, 2048),  # Adjusted input size for the flattened attention output
            nn.BatchNorm1d(2048),
            nn.ReLU(),
            nn.Linear(2048, 4096),
            nn.BatchNorm1d(4096),
            nn.ReLU(),
            nn.Linear(4096, 4096),
            nn.BatchNorm1d(4096),
            nn.ReLU(),
            nn.Linear(4096, 2048),
            nn.BatchNorm1d(2048),
            nn.ReLU(),
            nn.Linear(2048, 4096),
            nn.BatchNorm1d(4096),
            nn.ReLU(),
            nn.Linear(4096, 2048),
            nn.BatchNorm1d(2048),
            nn.ReLU(),
            nn.Linear(2048, 1024),
            nn.BatchNorm1d(1024),
            nn.ReLU(),
            nn.Linear(1024, 222)
        )
    def forward(self, x):
        # Reshape input to add a channel dimension for the Conv1d layers
        x = x.unsqueeze(1)  # Shape becomes [batch_size, 1, 2048]
        # Pass through the convolutional layers
        x = self.conv_layers(x)
        # Prepare for attention by permuting to (batch_size, sequence_length, features)
        x = x.permute(0, 2, 1)  # Shape becomes (batch_size, 2048, 64)
        # Apply multi-head self-attention where query, key, and value are all `x`
        attn_output, _ = self.attention(x, x, x)  # Shape: (batch_size, 2048, 64)
        # Flatten the attention output for the dense layers
        attn_output = attn_output.reshape(attn_output.size(0), -1)  # Shape: (batch_size, 64 * 2048)
        # Pass through the fully connected layers
        x = self.model(attn_output)
        return x
# ------------------------------------------------------------------------
# Dataset Class - gets individual records from the training files
class OPL3Dataset(Dataset):
    def __init__(self, spect_file, reg_file):
        # input spectra
        self.spect_file = spect_file
        # output synth configs
        self.reg_file = reg_file
        # size of one frequency spectrum record (uint8[2048])
        self.spect_size = 2048
        # size of one synthesizer configuration vector (float32[222])
        self.reg_size = 222 * 4  # 4 bytes per float
        # Get the number of available training samples
        self.num_samples = os.path.getsize(spect_file) // self.spect_size

    def __len__(self):
        return self.num_samples

    def __getitem__(self, idx):
        # Read only the required spectrum record from file
        with open(self.spect_file, 'rb') as f:
            f.seek(idx * self.spect_size)
            spect = np.frombuffer(f.read(self.spect_size), dtype=np.uint8).astype(np.float32) / 255.0
        # Read only the required config record from file
        with open(self.reg_file, 'rb') as f:
            f.seek(idx * self.reg_size)
            # copy() makes the buffer writable so torch doesn't warn about it
            reg = np.frombuffer(f.read(self.reg_size), dtype=np.float32).copy()
        spect = torch.tensor(spect, dtype=torch.float32)
        reg = torch.tensor(reg, dtype=torch.float32)
        return spect, reg
'''
# This dataset gets the WHOLE FILE into RAM!
class OPL3Dataset(Dataset):
    def __init__(self, spect_file, reg_file):
        self.spect_file = spect_file
        self.reg_file = reg_file
        self.spect_data = np.fromfile(spect_file, dtype=np.uint8).reshape(-1, 2048) / 255.0
        self.reg_data = np.fromfile(reg_file, dtype=np.float32).reshape(-1, 222)

    def __len__(self):
        return len(self.spect_data)

    def __getitem__(self, idx):
        spect = torch.tensor(self.spect_data[idx], dtype=torch.float32)
        reg = torch.tensor(self.reg_data[idx], dtype=torch.float32)
        return spect, reg
'''
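For completeness, here's a minimal shape check on random data (illustrative only, not part of the project code):

# Sanity check: feed a random batch through the model and confirm the
# output is one float32[222] config vector per input spectrum.
if __name__ == '__main__':
    model = OPL3Model()
    model.eval()                    # eval mode so BatchNorm uses running stats
    dummy = torch.randn(2, 2048)    # two fake 2048-bin spectra
    with torch.no_grad():
        out = model(dummy)
    print(out.shape)                # expected: torch.Size([2, 222])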