I can't figure out how to correctly feed data to the neural network, or how to design the model architecture

Hello PyTorch community! I have a number of questions that I could not find answers to, so here I am!
1. Perhaps I'll start with the goal of my neural network: I want a network that predicts the positions of the Maya face controls from the sound wave.
2. What I have available:
2.1. Audio files of various lengths, from 1 to 7 seconds.
2.2. Data from the 35 face controls of the 3D model, which mouth the phrase with the lips (translate and rotate values). The data is grouped like this: one sheet contains the data for all 35 controls, and each control has 6 values: x, y, z translate and x, y, z rotate.
3. Tensor format, and here the first problems begin: I can export the control values as very different matrices, 3D or 2D; it all depends on a couple of words in the code XD. Right now I have a 2D tensor of shape [x, 4], where x is the number of fragments, which directly correlates with the duration. Here is an example from my code:
dimension mfcc ep15_sc007.txt torch.Size([1431, 4])
dimension txt ep15_sc007.txt torch.Size([1431, 4])
dimension mfcc ep15_sc009.txt torch.Size([6148, 4])
dimension txt ep15_sc009.txt torch.Size([6148, 4])
However, I can also get a 3D tensor of shape [x, 53, 4], which is more human-readable, but I don't know whether it is the better format for the machine; here x is the number of animation keys in the scene. Example (a small reshape sketch follows these dumps):
dimension mfcc ep15_sc007.txt torch.Size([27, 53, 4])
dimension txt ep15_sc007.txt torch.Size([27, 53, 4])
dimension mfcc ep15_sc009.txt torch.Size([116, 53, 4])
dimension txt ep15_sc009.txt torch.Size([116, 53, 4])
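
By the way, if I read my own dumps correctly, the two layouts hold exactly the same numbers (27 × 53 = 1431 and 116 × 53 = 6148), so switching between them should just be a reshape. A minimal sketch with random data:

import torch

t3d = torch.randn(27, 53, 4)     # "human-readable" layout: keys x controls x values
t2d = t3d.reshape(-1, 4)         # flattened layout, shape [27 * 53, 4] = [1431, 4]
back = t2d.reshape(-1, 53, 4)    # and back again, shape [27, 53, 4]
print(t2d.shape, back.shape)
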
4. The problem: I want the network to output what I have in the txt files, so that I can send it back to Maya and get lip sync.
5. And I just can't find the right model that will do what I need, because there is so little material on this topic in principle (maybe I just didn't find it XD).
6. I will also show the code that I wrote for my model, so that more experienced people can point me in the right direction.

import ast
import codecs
import gc
import math
import os

import librosa
import matplotlib.pyplot as plt
import torch
import torch.nn as nn
from IPython.display import Audio  # assuming this runs in a Colab/Jupyter notebook

torch.cuda.empty_cache()
gc.collect()
def split_mfcc(mfcc_tensor, frame_size):
    frames = []
    num_frames = math.ceil(mfcc_tensor.size(1) / frame_size)

    for i in range(num_frames):
        start_index = i * frame_size
        end_index = start_index + frame_size
        frame = mfcc_tensor[:, start_index:end_index]
        frames.append(frame)

    # Check the size of the last frame and pad it with zeros if necessary
    last_frame_size = mfcc_tensor.size(1) % frame_size
    if last_frame_size > 0:
        padding = torch.zeros(mfcc_tensor.size(0), frame_size - last_frame_size)
        padded_frame = torch.cat([frames[-1], padding], dim=1)
        frames[-1] = padded_frame

    return frames
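
To show what split_mfcc returns, here is a tiny example on a dummy MFCC matrix (the real call is further down, inside the training loop); the last chunk gets zero-padded up to frame_size:

dummy_mfcc = torch.randn(53, 10)          # 53 coefficients, 10 time steps
chunks = split_mfcc(dummy_mfcc, 4)
print(len(chunks), chunks[0].shape, chunks[-1].shape)
# 3 torch.Size([53, 4]) torch.Size([53, 4]) -- the last chunk was padded from 2 to 4 columns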

class MyModel(nn.Module):
    def __init__(self):
        super(MyModel, self).__init__()

        # Model architecture definition
        self.conv1 = nn.Conv1d(in_channels=1, out_channels=53, kernel_size=3, stride=1, padding=0)
        self.conv2 = nn.Conv1d(53, 53, kernel_size=2, stride=1, padding=0)
        self.rnn = nn.GRU(53, 53, num_layers=5, batch_first=True)
        self.fc1 = nn.Linear(53, 25)
        self.fc2 = nn.Linear(25, 4)
        self.dropout = nn.Dropout(0.5)
        self.leaky_relu = nn.LeakyReLU()

    def forward(self, x):
        x = x.unsqueeze(1)
        x = self.conv1(x)
        x = self.leaky_relu(x)
        x = self.conv2(x)
        x = self.leaky_relu(x)
        x = x.permute(0, 2, 1)  # move the channel dimension last before the recurrent layer
        x, _ = self.rnn(x)
        x = x[:, -1, :]  # use only the last output of the recurrent layer
        x = self.dropout(x)
        x = self.fc1(x)
        x = self.leaky_relu(x)
        x = self.fc2(x)
        x = self.leaky_relu(x)
        return x
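
As a quick sanity check of the shapes, here is what I expect the forward pass to give when each sample is one row of 4 values (just a sketch with random data, not real MFCCs):

m = MyModel()
dummy = torch.randn(8, 4)     # batch of 8 rows, 4 values each
out = m(dummy)
print(out.shape)              # torch.Size([8, 4]) -- same shape as the target rows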

# Create an instance of the model

model = MyModel()
#device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
#model.to(device)

# Define the loss function

criterion = nn.MSELoss()

# Define the optimizer

optimizer = torch.optim.Adam(model.parameters(), lr=0.001)  # SGD
percent_deviation = []
accuracy_epoch = []
l2_loss_epoch = []
for epoch in range(0, 500):
    sort = '/content/drive/My Drive/dataset/txt_sorting/'
    txt_file_names = os.listdir(sort)
    matrix_transformations = []
    wav_mapping = []
    scene_list = []
    for file_name in txt_file_names:
        line_mapping = []
        frame_list = []
        with codecs.open(sort + file_name, "r", encoding="utf-8") as file:
            for line in file:

                line_mapping.append(line.strip())

                # Convert the string to a list
                data = ast.literal_eval(line)

                contol_list = []
                for control in data:
                    contol_list.extend(ast.literal_eval(control.get('4')))
                    contol_list.extend(ast.literal_eval(control.get('5')))

                frame_list.append(contol_list)
        scene_list.append(frame_list)
        line_count = len(line_mapping)
        line_tensors = torch.tensor(frame_list)

        # Tensor dimensions
        num_frames, num_controls = line_tensors.size()

        # Split the tensor into 27 tensors of size 53x4
        frame_list = []
        frame_size = 4

        for i in range(num_frames):
            frame_tensor = torch.zeros(53, frame_size)  # create an empty 53x4 tensor

            # Take one line from the original tensor
            control_line = line_tensors[i]

            # Split the line into separate time frames of size 4
            for j in range(53):
                start_index = j * frame_size
                end_index = start_index + frame_size

                if end_index <= num_controls:
                    frame_tensor[j] = control_line[start_index:end_index]
                else:
                    # If there are not enough values, pad with zeros
                    num_missing_values = end_index - num_controls
                    frame_tensor[j, :frame_size - num_missing_values] = control_line[start_index:]
            #print(frame_tensor.size())
            frame_list.extend(frame_tensor)

        wav_name = file_name.replace(".txt", ".wav")

        audio_path = '/content/drive/My Drive/dataset/wav/' + wav_name
        audio_samples, my_sr = librosa.load(audio_path, sr=48000)
        audio_samples, _ = librosa.effects.trim(audio_samples)  # remove the silence
        duration = len(audio_samples) / my_sr
        frames_to_time = int(duration * 25)

        # Compute the MFCCs
        mfcc = librosa.feature.mfcc(y=audio_samples, sr=my_sr, n_mfcc=53, n_fft=2048, hop_length=485)
        hop_length = 2048 / 4

        # Convert the MFCCs to a mel spectrogram
        #mel = librosa.feature.inverse.mfcc_to_mel(mfcc)

        # Reconstruct audio from the mel spectrogram
        #audio_reconstructed = librosa.feature.inverse.mel_to_audio(mel, sr=my_sr)

        list_mfcc_frames = []
        mfcc_tensor_data = torch.tensor(mfcc)
        frame_size = 4
        frames = split_mfcc(mfcc_tensor_data, frame_size)

        for frame in frames:
            list_mfcc_frames.extend(frame)

        new_tensors_mfcc = torch.stack(list_mfcc_frames)  #.cuda()
        new_frame_list = torch.stack(frame_list)  #.cuda()

        # Trim new_frame_list to the length of new_tensors_mfcc: the sizes can differ
        # because the animator sometimes added 10+ extra frames of animation.
        if new_tensors_mfcc.size() != new_frame_list.size():
            min_frames = min(new_tensors_mfcc.size(0), new_frame_list.size(0))
            new_frame_list = new_frame_list[:min_frames]
        #print(new_tensors_mfcc.size(), new_frame_list.size())
        outputs = model(new_tensors_mfcc)
        #print(outputs.size())
        loss = criterion(outputs, new_frame_list)

        # Add L2 regularization
        l2_loss = sum(torch.norm(param) for param in model.parameters())

        # Total loss
        total_loss = loss + l2_loss

        # Backpropagation and weight update
        optimizer.zero_grad()
        total_loss.backward()
        optimizer.step()

        print('generate', outputs.tolist())
        print('text link', new_frame_list.tolist())
        l2_loss_epoch.append(l2_loss.item())

        correct = torch.eq(outputs, new_frame_list)

        # Accuracy as the mean of the elementwise matches with the targets
        accuracy = torch.mean(correct.float()).cpu()
        accuracy_epoch.append(accuracy)
        #percent_deviation.append(percent_deviation)

        # Save the model every epoch
        torch.save(model.state_dict(), '/content/drive/My Drive/dataset/model/best_model_lip_sing.pt')
        print(f"Epoch: {epoch + 1}, loss: {loss.item()}, l2_loss: {total_loss.item()}, accuracy: {accuracy} %", file_name)  #, percent deviation: {min(percent_deviation, 100)}

# Audio playback

Audio(audio_reconstructed, rate=my_sr)

# Plot accuracy and l2_loss

plt.figure(figsize=(15, 13))
plt.subplot(2, 1, 1)
plt.plot(accuracy_epoch, label='Accuracy')
plt.xlabel("one cycle per ~60 passes; each point is the fraction of matching values: 0.0 = no match, 1 = complete match")
plt.ylabel("Accuracy")
plt.title("Accuracy by epoch")
plt.legend()

plt.subplot(2, 1, 2)
plt.plot(l2_loss_epoch, label='L2 loss')
plt.xlabel("Epoch")
plt.ylabel('L2 loss')
plt.title('L2 loss')
plt.legend()
model.eval()
with torch.no_grad():

    sort = '/content/drive/My Drive/dataset/txt_sorting/'
    txt_file_names = os.listdir(sort)
    matrix_transformations = []
    wav_mapping = []
    scene_list = []
    for file_name in txt_file_names:
        line_mapping = []
        frame_list = []
        with codecs.open(sort + file_name, "r", encoding="utf-8") as file:
            for line in file:

                line_mapping.append(line.strip())

                # Convert the string to a list
                data = ast.literal_eval(line)

                contol_list = []
                for control in data:
                    contol_list.extend(ast.literal_eval(control.get('4')))
                    contol_list.extend(ast.literal_eval(control.get('5')))

                frame_list.append(contol_list)
        scene_list.append(frame_list)
        line_count = len(line_mapping)
        line_tensors = torch.tensor(frame_list)

        # Tensor dimensions
        num_frames, num_controls = line_tensors.size()

        # Split the tensor into 27 tensors of size 53x4
        frame_list = []
        frame_size = 4

        for i in range(num_frames):
            frame_tensor = torch.zeros(53, frame_size)  # create an empty 53x4 tensor

            # Take one line from the original tensor
            control_line = line_tensors[i]

            # Split the line into separate time frames of size 4