Position of 'with torch.no_grad()'

Hello, I have a question about ‘with torch.no_grad()’.

torch.no_grad() disables gradient calculation, which is useful for inference.
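For reference, here is a minimal sketch (with made-up tensors, not the model below) of what the context changes: operations run inside it record no computation graph, so their outputs have requires_grad=False and cannot be backpropagated through.

import torch

w = torch.randn(3, requires_grad=True)
x = torch.randn(3)

y = w * x                  # outside no_grad: a graph is recorded
print(y.requires_grad)     # True

with torch.no_grad():
    z = w * x              # inside no_grad: no graph is recorded
print(z.requires_grad)     # False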

Then, are the following two snippets equivalent? Is it true that in both cases the model doesn’t learn from the test data? Does it matter where ‘with torch.no_grad()’ is placed in the following case?

(1)

def dcn(x):    # detach, cpu, numpy
    if isinstance(x, np.ndarray):
        return x
    return x.detach().cpu().numpy()
    
def predict(dataloader, network):
    Y_new, Y_hat, Y_hat_pb = np.array([]), np.array([]), np.array([[],[],[],[],[]]).reshape(0,5)
    for iteration, batch in enumerate(zip(dataloader)):
        x, y = batch[0]
        x, y = x.to(device), y.flatten().to(device)

        with torch.no_grad():
            x = network.FE(x)
            x_att, _ = network.sce(x)
            h = network.bilstm(x_att)
            x = x.flatten(start_dim=2)
            h = network.dropout(network.project_f(x) + h)
            l_2 = network.cls(h)
            l_2 = l_2.flatten(end_dim=1)

            y_hat = dcn(l_2.detach().argmax(-1))
            y_hat_pb = dcn(F.softmax(l_2, dim=-1))

        Y_new = np.concatenate([Y_new, dcn(y)])
        Y_hat = np.concatenate([Y_hat, y_hat])
        Y_hat_pb = np.concatenate([Y_hat_pb, y_hat_pb])
    return Y_hat_pb, Y_hat, Y_new

for epoch in range(10):
    network.train()
    loss = train(trainloader, network)
    
    network.eval()
    Yts_hat_pb, Yts_hat, Yts_new = predict(testloader, network)

(2)

def dcn(x):    # detach, cpu, numpy
    if isinstance(x, np.ndarray):
        return x
    return x.detach().cpu().numpy()

def predict2(dataloader, network):
    Y_new, Y_hat, Y_hat_pb = np.array([]), np.array([]), np.array([[],[],[],[],[]]).reshape(0,5)
    with torch.no_grad():
        for iteration, batch in enumerate(zip(dataloader)):
            x, y = batch[0]
            x, y = x.to(device), y.flatten().to(device)

            x = network.FE(x)
            x_att, _ = network.sce(x)
            h = network.bilstm(x_att)
            x = x.flatten(start_dim=2)
            h = network.dropout(network.project_f(x) + h)
            l_2 = network.cls(h)
            l_2 = l_2.flatten(end_dim=1)

            y_hat = dcn(l_2.detach().argmax(-1))
            y_hat_pb = dcn(F.softmax(l_2, dim=-1))

            Y_new = np.concatenate([Y_new, dcn(y)])
            Y_hat = np.concatenate([Y_hat, y_hat])
            Y_hat_pb = np.concatenate([Y_hat_pb, y_hat_pb])
    return Y_hat_pb, Y_hat, Y_new

for epoch in range(10):
    network.train()
    loss = train(trainloader, network)

    network.eval()
    Yts_hat_pb, Yts_hat, Yts_new = predict2(testloader, network)

I am using code (1) for the test data (for inference/evaluation). Is it right to use code (1)? Is using code (1) the same as using code (2)?

A simplified version of the above looks like this:

(a)

network.eval()
for iteration, batch in enumerate(zip(dataloader)):
    x, y = batch[0]
    with torch.no_grad():
        y_hat = network(x)

(b)

network.eval()
with torch.no_grad():
    for iteration, batch in enumerate(zip(dataloader)):
        x, y = batch[0]
        y_hat = network(x)

Both codes should be fine and there shouldn’t be a difference. Wrapping the DataLoader into the no_grad context would not be necessary (assuming no differentiable operations are used in the Dataset, which would be a rare edge case) and you can thus use the approach you prefer.
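For example (a minimal sketch with a made-up TensorDataset, not your model): the batches coming out of a typical DataLoader already have requires_grad=False, so iterating the loader outside the context builds no graph anyway, and only the forward pass has to be wrapped.

import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.randn(8, 4), torch.randint(0, 5, (8,)))
loader = DataLoader(dataset, batch_size=4)

for x, y in loader:
    print(x.requires_grad, y.requires_grad)  # False False -> no graph is built here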

@ptrblck Thanks,

There are no differentiable operations in the dataset. The dataset is just a tensor, and its requires_grad attribute is False:

>>> dataset.requires_grad
False

Then are you saying that code (1) and code (2) are the same?

Yes, they are the same, because the x and y coming from the dataloader carry no computation graph.
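In other words (a minimal sketch with a stand-in linear layer instead of your network): since the batches don’t require gradients, the few operations applied to them outside the context create no graph either, and the forward pass itself is covered by no_grad in both versions.

import torch
import torch.nn as nn

net = nn.Linear(4, 5)       # stand-in for `network`
x = torch.randn(2, 4)       # a batch from a DataLoader: requires_grad is False

x = x.flatten(start_dim=1)  # done outside no_grad, but still graph-free
print(x.grad_fn)            # None, because x does not require gradients

with torch.no_grad():
    out = net(x)            # the forward pass records no graph
print(out.requires_grad)    # False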