I’m investigating the efficiency of torch matrix multiplication and the autograd API.

**The first question is:**
"Is there a special relationship between the number of CPU threads and the autograd API?"
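Not a full answer, but the thread effect can be probed directly. A minimal sketch that times forward+backward of a small F.C. stack under different intra-op thread counts (`torch.set_num_threads` controls CPU op parallelism; the iteration counts and sizes below are my own choices, not from the post):

```python
import time
import torch

# Small F.C. stack with autograd enabled on the weights.
x = torch.randn(128, 3)
w1 = torch.randn(3, 64, requires_grad=True)
w2 = torch.randn(64, 64, requires_grad=True)
w3 = torch.randn(64, 1, requires_grad=True)

timings = {}
for n in (1, 2, 4):
    torch.set_num_threads(n)          # intra-op thread count for CPU ops
    start = time.time()
    for _ in range(200):
        y = (x @ w1 @ w2 @ w3).sum()
        y.backward()
        w1.grad = w2.grad = w3.grad = None   # reset grads between iterations
    timings[n] = time.time() - start

for n, t in timings.items():
    print(f"{n} thread(s): {t:.4f} s")
```

For matrices this small, the per-op work is tiny, so more threads may not help and can even hurt due to scheduling overhead.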

**The second question is:**
"What is the minimum fully-connected (F.C.) model matrix size at which the GPU shows better computational time than the CPU?"
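One way to find that crossover empirically is to sweep the hidden size. A sketch with a hypothetical `time_forward` helper (the sizes and iteration counts are arbitrary, and the GPU branch only runs when CUDA is available):

```python
import time
import torch

def time_forward(device, n, iters=100):
    # Hypothetical helper: time `iters` forward passes of a 3-layer
    # stack whose hidden size is n.
    x = torch.randn(128, 3, device=device)
    w1 = torch.randn(3, n, device=device)
    w2 = torch.randn(n, n, device=device)
    w3 = torch.randn(n, 1, device=device)
    if device.type == 'cuda':
        torch.cuda.synchronize()
    start = time.time()
    for _ in range(iters):
        y = x @ w1 @ w2 @ w3
    if device.type == 'cuda':
        torch.cuda.synchronize()   # wait for queued kernels before stopping the clock
    return time.time() - start

cpu_results = {}
for n in (64, 256, 1024):
    cpu_results[n] = time_forward(torch.device('cpu'), n)
    line = f"n={n}: CPU {cpu_results[n]:.4f} s"
    if torch.cuda.is_available():
        line += f", GPU {time_forward(torch.device('cuda'), n):.4f} s"
    print(line)
```

The crossover point is hardware-dependent, so a sweep like this on your own machine is more informative than any fixed number.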

• In the same model mentioned above, the GPU shows worse computation time than the CPU. If the inferior performance is due to the small size of the N.N. model, would it then be better to use the CPU rather than the GPU purely in terms of computation time?

Forward propagation time ratio (GPU/CPU): 1.412×
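One likely reason the GPU looks slow at this size is measurement: CUDA kernels launch asynchronously, so `time.time()` around a single launch mostly measures launch overhead rather than device work. A sketch using `torch.cuda.Event` to time the work on the device itself (skipped gracefully when no CUDA device is present):

```python
import torch

elapsed_ms = None
if torch.cuda.is_available():
    device = torch.device('cuda')
    x = torch.randn(128, 3, device=device)
    w1 = torch.randn(3, 64, device=device)
    w2 = torch.randn(64, 64, device=device)
    w3 = torch.randn(64, 1, device=device)

    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(1000):
        y = x @ w1 @ w2 @ w3
    end.record()
    torch.cuda.synchronize()          # make sure both events have completed
    elapsed_ms = start.elapsed_time(end)
    print(f"GPU forward (1000 iters): {elapsed_ms:.2f} ms")
else:
    print("CUDA not available; skipping GPU timing")
```

Even with correct timing, a 128×3 → 64 → 64 → 1 model is small enough that launch overhead and host-device transfer can dominate, so a CPU win at this size is plausible.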

Here is my code.

```python
import torch
import time

device1 = torch.device('cpu')
device2 = torch.device('cuda')

# --- CPU timing ---
x = torch.randn((128, 3), device=device1)
w1 = torch.randn(3, 64, device=device1)
w2 = torch.randn(64, 64, device=device1)
w3 = torch.randn(64, 1, device=device1)

cpu_forward_buffer = 0.0          # accumulates per-iteration forward times
for _ in range(1000):
    start = time.time()
    y = x @ w1 @ w2 @ w3
    cpu_forward_buffer += time.time() - start

# --- GPU timing ---
x = torch.randn((128, 3), device=device2)
w1 = torch.randn(3, 64, device=device2)
w2 = torch.randn(64, 64, device=device2)
w3 = torch.randn(64, 1, device=device2)

gpu_forward_buffer = 0.0
for _ in range(1000):
    start = time.time()
    y = x @ w1 @ w2 @ w3
    torch.cuda.synchronize()      # kernels launch asynchronously; wait before stopping the clock
    gpu_forward_buffer += time.time() - start

print(f"CPU forward total: {cpu_forward_buffer:.4f} s")
print(f"GPU forward total: {gpu_forward_buffer:.4f} s")
print(f"GPU/CPU ratio: {gpu_forward_buffer / cpu_forward_buffer:.3f}x")
```