Many AsStridedBackward0 nodes during the backward pass

Our model needs to split the input data. Formatting the input data inside the model's forward function causes very poor performance when training with torch.compile.

torch: v2.1.0
GPU: V100

class CustomModel(torch.nn.Module):
    def __init__(self) -> None:
        ...

    def format_data(self, data):
        data = torch.reshape(data, [-1, 5, 31024])
        data_list = torch.permute(data, [1, 0, 2])
        data_split_shape = [1,1,1,1,1,950,14,5,1,1,1,1,1,1,5,5,5,5,1,5,5,1,1,1,1,1,14,25,5,1,1,1,1,1,1,1,1,1,5,25,42,42,42,42,3,61,3,61]
        each_data_length = sum(data_split_shape)
        sequence_data_length = each_data_length * 16
        sequence_data_split_shape = [sequence_data_length, 4096, 256, 4096, 256]
        sequence_data, data1, data2, data3, data4 = torch.split(data_list, sequence_data_split_shape, dim=-1)
        sequence_data = torch.reshape(sequence_data, [5, -1, each_data_length])
        each_data_list = [torch.split(sequence_data[i], data_split_shape, dim=-1) for i in range(5)]
        d1 = data1.reshape([-1, 4096])
        d3 = data3.reshape([-1, 4096])
        d2 = data2.reshape([-1, 256])
        d4 = data4.reshape([-1, 256])
        return each_data_list, (d2, d1), (d4, d3)

    def forward(self, data):
        data_list = self.format_data(data)
        return ...

model = CustomModel().to("cuda")
model.train()

compiled_model = torch.compile(model)

total_step = 10000
for step in range(total_step):
    res = compiled_model(data)
    loss = compiled_model.loss_def(res)
    loss.backward()

The per-step latency comparison during training is as follows:

batch_size | format_in_forward | not_in_forward
16         | 601.5 ms          | 781.4 ms

Using torch.profiler, we found a large number of AsStridedBackward0 nodes in the backward pass.
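For reference, this is roughly how such a profile can be collected (a minimal sketch; the activities, sort key, and row limit are arbitrary choices, not the exact settings we used):

from torch.profiler import profile, ProfilerActivity

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    res = compiled_model(data)
    loss = compiled_model.loss_def(res)
    loss.backward()

# Backward nodes such as AsStridedBackward0 show up as rows in this table.
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=30))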

I found this comment while reading the source code. Is it related to the problem I am facing? How should I solve it? Should I simply move format_data out of forward?
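For context, moving format_data out of forward would look roughly like this (a hypothetical sketch; it assumes forward is changed to accept the already-formatted tensors, which is not how the model is written above):

# Hypothetical: run the reshape/permute/split eagerly, outside the compiled graph,
# and only compile the rest of the model.
each_data_list, pair1, pair2 = model.format_data(data)
res = compiled_model(each_data_list, pair1, pair2)  # forward would need to take these directly
loss = compiled_model.loss_def(res)
loss.backward()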

Hey @pansn do you have a self-contained code repro I can use that replicates the issue? That would be helpful for tracking down the as_strided calls.

@bdhirsh Sorry, my previous code may have been too abstract. Here is a simple, reproducible sample. Could you please take a look and help me out?

import torch
import numpy as np
import time

class CustomModel(torch.nn.Module):
    def format_data(self, data):
        data = torch.reshape(data, [-1, 5, 31024])
        data_list = torch.permute(data, [1, 0, 2])
        data_split_shape = [1,1,1,1,1,950,14,5,1,1,1,1,1,1,5,5,5,5,1,5,5,1,1,1,1,1,14,25,5,1,1,1,1,1,1,1,1,1,5,25,42,42,42,42,3,61,3,61]
        each_data_length = sum(data_split_shape)
        sequence_data_length = each_data_length * 16
        sequence_data_split_shape = [sequence_data_length, 4096, 256, 4096, 256]
        sequence_data, data1, data2, data3, data4 = torch.split(data_list, sequence_data_split_shape, dim=-1)
        sequence_data = torch.reshape(sequence_data, [5, -1, each_data_length])
        each_data_list = [torch.split(sequence_data[i], data_split_shape, dim=-1) for i in range(5)]
        d1 = data1.reshape([-1, 4096])
        d3 = data3.reshape([-1, 4096])
        d2 = data2.reshape([-1, 256])
        d4 = data4.reshape([-1, 256])
        return each_data_list, (d2, d1), (d4, d3)

    def forward(self, data):
        return self.format_data(data)
    
    def loss_def(self, res1, res2, res3):
        loss = 0
        for r in res1:
            for x in r:
                loss = loss + torch.sum(x)
        for r in res2:
            loss = loss + torch.sum(r)
        for r in res3:
            loss = loss + torch.sum(r)
        return loss

device = "cuda"

model = CustomModel().to(device)
model.train()

dtype = torch.float32

compiled_model = torch.compile(model)
warmup_step = 10
total_step = 100

for step in range(total_step):
    if step == warmup_step:
        torch.cuda.synchronize(device=device)
        start_time = time.perf_counter()
    data = torch.tensor(np.random.random([16, 5*31024]) + 0.1, requires_grad=True).type(dtype).to(device)
    res1, res2, res3 = compiled_model(data)
    loss = compiled_model.loss_def(res1, res2, res3)
    loss.backward()

torch.cuda.synchronize(device=device)
end_time = time.perf_counter()

print(
    "torch.compile avg step time: {} ms".format(
        (end_time - start_time) * 1e3 / (total_step - warmup_step)
    )
)

for step in range(total_step):
    if step == warmup_step:
        torch.cuda.synchronize(device=device)
        start_time = time.perf_counter()
    data = torch.tensor(np.random.random([16, 5*31024]) + 0.1, requires_grad=True).type(dtype).to(device)
    res1, res2, res3 = model(data)
    loss = model.loss_def(res1, res2, res3)
    loss.backward()

torch.cuda.synchronize(device=device)
end_time = time.perf_counter()

print(
    "torch avg step time: {} ms".format(
        (end_time - start_time) * 1e3 / (total_step - warmup_step)
    )
)

V100 result

torch.compile avg step time: 49.37022762993971 ms
torch avg step time: 35.64678467810154 ms

Hey @pansn - I believe the AsStridedBackward slowness should be fixed by https://github.com/pytorch/pytorch/pull/111411. I tried running your repro on top of that fix on a nightly, and when I print out the grad_fn of every output of the forward, I no longer see any AsStridedBackward nodes.
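In case it helps others reproduce that check, here is a minimal sketch of printing the grad_fn of every output (iter_tensors is an ad-hoc helper, not part of the original repro):

def iter_tensors(obj):
    # Recursively yield tensors from nested lists/tuples of model outputs.
    if isinstance(obj, torch.Tensor):
        yield obj
    elif isinstance(obj, (list, tuple)):
        for item in obj:
            yield from iter_tensors(item)

res1, res2, res3 = compiled_model(data)
for t in iter_tensors((res1, res2, res3)):
    # On 2.1.0 many of these print AsStridedBackward0; on a nightly with the fix they should not.
    print(type(t.grad_fn).__name__ if t.grad_fn is not None else None)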