Hello PyTorch community! I have a number of questions that I couldn't find answers to, so here I am!
1. Perhaps I'll start with the goal. I want to build a neural network that predicts the positions of Maya face controls from an audio waveform.
2. What I have available:
2.1. Audio files of various lengths, from 1 to 7 seconds.
2.2. Data from 35 face controls of the 3D model, which mouths the phrase with its lips (transforms and rotations). The data is grouped like this: one file holds the data for all 35 controls, and each control has 6 values: x, y, z translate and x, y, z rotate. (A small sketch of this layout follows below.)
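To make that layout concrete, here is a minimal sketch of one animation key as a tensor (the variable names here are just illustrative, not from my actual export script):

import torch

num_controls = 35   # face controls on the rig
channels = 6        # translateX/Y/Z + rotateX/Y/Z per control

one_key = torch.zeros(num_controls, channels)   # one animation key: [35, 6]
one_key_flat = one_key.flatten()                # the same key as a flat vector of 210 values
print(one_key.shape, one_key_flat.shape)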
3. Tensor format. Here the first problems begin: I can export the control values as matrices of different shapes, 3D or 2D; switching between them only takes changing a couple of words in the code :D. Right now I have a 2D tensor of shape [x, 4], where x is the number of fragments, which correlates directly with the clip duration. Here is an example from my code:
dimension mfcc ep15_sc007.txt torch.Size([1431, 4])
dimension txt ep15_sc007.txt torch.Size([1431, 4])
dimension mfcc ep15_sc009.txt torch.Size([6148, 4])
dimension txt ep15_sc009.txt torch.Size([6148, 4])
However, I can also get a 3D tensor that is more readable for humans, though I don't know whether it's the right format for the model: [x, 53, 4], where x is the number of animation keys in the scene. Example (a small reshape sketch follows these examples):
dimension mfcc ep15_sc007.txt torch.Size([27, 53, 4])
dimension txt ep15_sc007.txt torch.Size([27, 53, 4])
dimension mfcc ep15_sc009.txt torch.Size([116, 53, 4])
dimension txt ep15_sc009.txt torch.Size([116, 53, 4])
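If I'm not mistaken, the two layouts hold exactly the same values, since 27 * 53 = 1431 and 116 * 53 = 6148: the 2D tensor is just the 3D one flattened along its first two axes, so a single reshape switches between them. A minimal sketch (t2d, t3d and num_keys are illustrative names, not from my code):

import torch

num_keys, num_controls, frame_size = 27, 53, 4
t3d = torch.zeros(num_keys, num_controls, frame_size)       # [27, 53, 4]
t2d = t3d.reshape(num_keys * num_controls, frame_size)      # [1431, 4]
t3d_again = t2d.reshape(num_keys, num_controls, frame_size) # back to [27, 53, 4]
print(t2d.shape, t3d_again.shape)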
4. The problem: I want the network to predict the same kind of data I have in the txt files, so I can send it back to Maya and get lip sync.
5. I just can't find the right model for what I need, partly because there is so little material on this topic in general (maybe I simply didn't find it :D).
6. I will also show the code that I wrote for my model, so that more experienced people can point me in the right direction.
import gc
import math
import os
import ast
import codecs

import torch
import torch.nn as nn
import librosa
import matplotlib.pyplot as plt
from IPython.display import Audio

torch.cuda.empty_cache()
gc.collect()
def split_mfcc(mfcc_tensor, frame_size):
    frames = []
    num_frames = math.ceil(mfcc_tensor.size(1) / frame_size)
    for i in range(num_frames):
        start_index = i * frame_size
        end_index = start_index + frame_size
        frame = mfcc_tensor[:, start_index:end_index]
        frames.append(frame)
    # Check the size of the last frame and pad it if necessary
    last_frame_size = mfcc_tensor.size(1) % frame_size
    if last_frame_size > 0:
        padding = torch.zeros(mfcc_tensor.size(0), frame_size - last_frame_size)
        padded_frame = torch.cat([frames[-1], padding], dim=1)
        frames[-1] = padded_frame
    return frames
class MyModel(nn.Module):
    def __init__(self):
        super(MyModel, self).__init__()
        # Model architecture definition
        self.conv1 = nn.Conv1d(in_channels=1, out_channels=53, kernel_size=3, stride=1, padding=0)
        self.conv2 = nn.Conv1d(53, 53, kernel_size=2, stride=1, padding=0)
        self.rnn = nn.GRU(53, 53, num_layers=5, batch_first=True)
        self.fc1 = nn.Linear(53, 25)
        self.fc2 = nn.Linear(25, 4)
        self.dropout = nn.Dropout(0.5)
        self.leaky_relu = nn.LeakyReLU()

    def forward(self, x):
        x = x.unsqueeze(1)
        x = self.conv1(x)
        x = self.leaky_relu(x)
        x = self.conv2(x)
        x = self.leaky_relu(x)
        x = x.permute(0, 2, 1)  # rearrange dimensions before passing to the recurrent layer
        x, _ = self.rnn(x)
        x = x[:, -1, :]  # use only the last output of the recurrent layer
        x = self.dropout(x)
        x = self.fc1(x)
        x = self.leaky_relu(x)
        x = self.fc2(x)
        x = self.leaky_relu(x)
        return x

# Creating an instance of the model
model = MyModel()
#device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
#model.to(device)
# Definition of the loss function
criterion = nn.MSELoss()
# Definition of the optimizer
optimizer = torch.optim.Adam(model.parameters(), lr=0.001) #SGD
percent_deviation = []
accuracy_epoch = []
l2_loss_epoch = []
for epoch in range(0, 500):
    sort = '/content/drive/My Drive/dataset/txt_sorting/'
    txt_file_names = os.listdir(sort)
    matrix_transforms = []
    wav_mapping = []
    scene_list = []
    for file_name in txt_file_names:
        line_mapping = []
        frame_list = []
        with codecs.open(sort + file_name, "r", encoding="utf-8") as file:
            for line in file:
                line_mapping.append(line.strip())
                # Convert the string to a list
                data = ast.literal_eval(line)
                contol_list = []
                for control in data:
                    contol_list.extend(ast.literal_eval(control.get('4')))
                    contol_list.extend(ast.literal_eval(control.get('5')))
                frame_list.append(contol_list)
        scene_list.append(frame_list)
        line_count = len(line_mapping)
        line_tensors = torch.tensor(frame_list)
        # Tensor dimensions
        num_frames, num_controls = line_tensors.size()
        # Dividing the tensor into 27 tensors of size 53x4
        frame_list = []
        frame_size = 4
        for i in range(num_frames):
            frame_tensor = torch.zeros(53, frame_size)  # Create an empty 53x4 tensor
            # Take one row from the original tensor
            control_line = line_tensors[i]
            # Split the row into separate frames of size 4
            for j in range(53):
                start_index = j * frame_size
                end_index = start_index + frame_size
                if end_index <= num_controls:
                    frame_tensor[j] = control_line[start_index:end_index]
                else:
                    # If there are not enough values, pad with zeros
                    num_missing_values = end_index - num_controls
                    frame_tensor[j, :frame_size - num_missing_values] = control_line[start_index:]
            #print(frame_tensor.size())
            frame_list.extend(frame_tensor)
        wav_name = file_name.replace(".txt", ".wav")
        audio_path = '/content/drive/My Drive/dataset/wav/' + wav_name
        audio_samples, my_sr = librosa.load(audio_path, sr=48000)
        audio_samples, _ = librosa.effects.trim(audio_samples)  # removing the silence
        duration = len(audio_samples) / my_sr
        frames_to_time = int(duration * 25)
        # MFCC extraction
        mfcc = librosa.feature.mfcc(y=audio_samples, sr=my_sr, n_mfcc=53, n_fft=2048, hop_length=485)
        hop_length = 2048 / 4
        # Converting the MFCC back to a mel spectrogram
        #mel = librosa.feature.inverse.mfcc_to_mel(mfcc)
        # Restoring audio from the mel spectrogram
        #audio_reconstructed = librosa.feature.inverse.mel_to_audio(mel, sr=my_sr)
        list_mfcc_frames = []
        mfcc_tensor_data = torch.tensor(mfcc)
        frame_size = 4
        frames = split_mfcc(mfcc_tensor_data, frame_size)
        for frame in frames:
            list_mfcc_frames.extend(frame)
        new_tensors_mfcc = torch.stack(list_mfcc_frames)  #.cuda()
        new_frame_list = torch.stack(frame_list)  #.cuda()
        # Trim new_frame_list to match the first dimension of new_tensors_mfcc,
        # because the animator sometimes keyed 10+ extra frames of animation
        if new_tensors_mfcc.size() != new_frame_list.size():
            min_frames = min(new_tensors_mfcc.size(0), new_frame_list.size(0))
            new_frame_list = new_frame_list[:min_frames]
        #print(new_tensors_mfcc.size(), new_frame_list.size())
        outputs = model(new_tensors_mfcc)
        #print(outputs.size())
        loss = criterion(outputs, new_frame_list)
        # L2 regularization
        l2_loss = sum(torch.norm(param) for param in model.parameters())
        # Total loss
        total_loss = loss + l2_loss
        # Backpropagation and weight update
        optimizer.zero_grad()
        total_loss.backward()
        optimizer.step()
        print('generated', outputs.tolist())
        print('target (txt)', new_frame_list.tolist())
        l2_loss_epoch.append(l2_loss.item())
        correct = torch.eq(outputs, new_frame_list)
        # Accuracy as the mean share of exact matches with the target
        accuracy = torch.mean(correct.float()).cpu()
        accuracy_epoch.append(accuracy)
        #percent_deviation.append(deviation)
        # Save the model every epoch
        torch.save(model.state_dict(), '/content/drive/My Drive/dataset/model/best_model_lip_sing.pt')
        print(f"Epoch: {epoch + 1}, loss: {loss.item()}, l2_loss: {total_loss.item()}, accuracy: {accuracy} %", file_name)  #, deviation: {min(percent_deviation, 100)}

# Audio playback
Audio(audio_reconstructed, rate=my_sr)
# Plotting the accuracy and l2_loss graphs
plt.figure(figsize=(15, 13))
plt.subplot(2, 1, 1)
plt.plot(accuracy_epoch, label='Accuracy')
plt.xlabel("one point per step (~60 steps per epoch); 0.0 means no match with the target, 1 means a full match")
plt.ylabel("Accuracy")
plt.title("Accuracy by epoch")
plt.legend()
plt.subplot(2, 1, 2)
plt.plot(l2_loss_epoch, label='L2 loss')
plt.xlabel("Epoch")
plt.ylabel('L2 loss')
plt.title('L2 loss')
plt.legend()
model.eval()
with torch.no_grad():
    sort = '/content/drive/My Drive/dataset/test_set/sorting/txt_sorting/'
    txt_file_names = os.listdir(sort)
    matrix_transforms = []
    wav_mapping = []
    scene_list = []
    for file_name in txt_file_names:
        line_mapping = []
        frame_list = []
        with codecs.open(sort + file_name, "r", encoding="utf-8") as file:
            for line in file:
                line_mapping.append(line.strip())
                # Convert the string to a list
                data = ast.literal_eval(line)
                contol_list = []
                for control in data:
                    contol_list.extend(ast.literal_eval(control.get('4')))
                    contol_list.extend(ast.literal_eval(control.get('5')))
                frame_list.append(contol_list)
        scene_list.append(frame_list)
        line_count = len(line_mapping)
        line_tensors = torch.tensor(frame_list)
        # Tensor dimensions
        num_frames, num_controls = line_tensors.size()
        # Dividing the tensor into 27 tensors of size 53x4
        frame_list = []
        frame_size = 4
        for i in range(num_frames):
            frame_tensor = torch.zeros(53, frame_size)  # Create an empty 53x4 tensor
            # Take one row from the original tensor
            control_line = line_tensors[i]
            # Split the row into separate frames of size 4