I have the following PyTorch code that defines a new operation on two tensors, `points` and `primitives`:

```
import torch

def operations(points, primitives):
    """
    points shape:     (batch_size, number_of_points, 3)
    primitives shape: (batch_size, number_of_primitives, 7)
    """
    batch_size, number_of_points, _ = points.shape
    number_of_primitives = primitives.shape[1]
    gradient = torch.zeros(batch_size, number_of_points, number_of_primitives)
    for i in range(batch_size):
        temp_points = points[i, :, :]
        temp_primitives = primitives[i, :, :]
        temp = torch.zeros(number_of_points, number_of_primitives)
        for k in range(number_of_points):
            for j in range(number_of_primitives):
                # elementwise product of the point with the first 3 primitive
                # entries, plus the next 3 entries, then the L2 norm
                temp[k, j] = torch.norm(temp_points[k, :] * temp_primitives[j, :3]
                                        + temp_primitives[j, 3:6])
        gradient[i, :, :] = temp
    return gradient
```
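For concreteness, I think the triple loop could in principle be replaced by broadcasting, along these lines (a sketch assuming the shapes documented in the docstring; `operations_vectorized` is just a name I made up, and I have not verified this is what's best for my full pipeline):

```python
import torch

def operations_vectorized(points, primitives):
    """
    points:     (batch_size, number_of_points, 3)
    primitives: (batch_size, number_of_primitives, 7)
    returns:    (batch_size, number_of_points, number_of_primitives)
    """
    # (B, N, 1, 3) * (B, 1, M, 3) -> (B, N, M, 3) via broadcasting
    scaled = points.unsqueeze(2) * primitives[:, :, :3].unsqueeze(1)
    # add the offset entries of each primitive, broadcast over the point axis
    shifted = scaled + primitives[:, :, 3:6].unsqueeze(1)
    # L2 norm over the last (coordinate) axis
    return shifted.norm(dim=-1)
```

On small random inputs this matches the loop version for me, and it avoids Python-level loops entirely, but the intermediate `(B, N, M, 3)` tensor can get large for big batches.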

Is there any way to parallelize (vectorize) this code to speed it up? Thanks!

I implemented the serial code above myself; it is used in a deep-learning project. Every time I run it, the PyTorch data loader crashes with `RuntimeError: DataLoader worker (pid 255034) is killed by signal: Killed.`, even though I set the number of workers to zero. Thanks!