Model only converges when wrapped in `torch.compile`

Hi,

I have a frustrating bug that I don't know how to begin to solve. Essentially: if I torch.compile my model, it trains normally. I get the results I would expect, I can run inference, etc.

To speed up my development loop I tried removing the torch.compile, and suddenly the model cannot improve past random performance. I have tested two otherwise identical configurations that differ only in the use of torch.compile.

Does anyone have any idea how this is possible?

Could you post the model definition as well as the training loop, please?

It is quite a lot of code, but I can give you a few snippets. (Edit: I can send more if required, but maybe I will work towards a minimal repro first…)

The train loop is something like this:

        if self._loss_fn is None:
            if self.torch_compile:
                print("Compiling loss function...")
                self._loss_fn = torch.compile(self.compute_loss, fullgraph=True, backend="eager")
            else:
                self._loss_fn = self.compute_loss

        for step, features in enumerate(loader):  # features is a dict of batch tensors
            loss, *_ = self._loss_fn(**features)
            loss.backward()
            ...

compute_loss looks something like:

# forward pass
pad_idx = 1024
nb_codes = self.model.nb_codes
codes_shifted = F.pad(codes, (1, 0), "constant", pad_idx)  # (B, nb_codebooks, L+1)
logits = self.model(
    x=codes_shifted,
    ...
)  # (B, nb_codebooks * nb_codes, L)

loss = F.cross_entropy(logits, codes, ignore_index=pad_idx, reduction="mean")

return loss, {}

Within the model code itself, can you think of any specific ops that are likely to generate a difference in compiled vs un-compiled?

A few extra pieces of info:

  • The model is a deep autoregressive model; inference works without compilation (I get good samples from it).
  • Compiling with backend="eager" also converges, so I guess graph capture is changing something (in a way that happens to be beneficial/stabilizing)?

No, and I would assume that the default eager mode should already fail or raise warnings if something odd happens, since torch.compile has more limitations. However, in your use case torch.compile works while the default eager mode fails, which is something I haven't seen before.
Are you using CUDAGraphs in eager mode? It's not always trivial to make sure the same buffers are reused.
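
For context, manual CUDA Graph capture in eager mode follows roughly this pattern (a minimal sketch; the model and shapes are placeholders), and the static input buffer is what I mean by reusing the same buffers:

import torch

# Placeholder model and shapes, just to illustrate the static-buffer requirement.
model = torch.nn.Linear(128, 128).cuda()
static_input = torch.randn(8, 128, device="cuda")

# Warm-up on a side stream before capture.
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s):
    for _ in range(3):
        model(static_input)
torch.cuda.current_stream().wait_stream(s)

# Capture the forward pass into a graph.
g = torch.cuda.CUDAGraph()
with torch.cuda.graph(g):
    static_output = model(static_input)

# Replay: new data must be copied into the *same* input buffer.
static_input.copy_(torch.randn(8, 128, device="cuda"))
g.replay()  # static_output now holds the results for the new data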

I am not familiar with CUDAGraphs, and also fairly new to torch.compile.

So I think the answer is no? I have tested three settings (toggled roughly as in the sketch below):

  • no torch.compile - training fails
  • torch.compile() - training succeeds
  • torch.compile(…, backend="eager") - training succeeds
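
Concretely, the three runs only differ in how the loss function gets wrapped, along these lines (a sketch of the toggle; the mode names are placeholders, not my actual config keys):

import torch

def make_loss_fn(compute_loss, mode: str):
    if mode == "compile":                 # default backend: training succeeds
        return torch.compile(compute_loss, fullgraph=True)
    if mode == "compile_eager_backend":   # backend="eager": training also succeeds
        return torch.compile(compute_loss, fullgraph=True, backend="eager")
    return compute_loss                   # plain eager: training fails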

Possibly I am doing something silly elsewhere, but yeah, it's really weird behaviour. I'm just going to start stripping things out until the compiled and non-compiled runs give the same results.

Will open an issue if I can identify an actual error…


Sounds good! Let me know once you have an executable code snippet reproducing the issue.

I have made some progress in narrowing down the cause of the issue, but it's still quite mysterious.

In my code I have the following layer-norm, with (B, C, L) ordering:

@torch.jit.script
def layer_norm_no_bias(x, gamma):
    mean = x.mean(dim=1, keepdim=True)  # (B, C, L) -> (B, 1, L)
    var = x.var(dim=1, keepdim=True, unbiased=False)  # (B, C, L) -> (B, 1, L)
    x = (x - mean) / torch.sqrt(var + 1e-6)  # (B, C, L)
    return gamma * x
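
For reference, gamma is a learnable per-channel scale, and the function is used roughly like this (a sketch assuming a (1, C, 1) gamma; not my exact module):

import torch
import torch.nn as nn

class LayerNormNoBias(nn.Module):
    """Channel-wise layer norm for (B, C, L) tensors: learnable scale, no bias."""
    def __init__(self, num_channels: int):
        super().__init__()
        # Broadcasts over the batch and length dimensions.
        self.gamma = nn.Parameter(torch.ones(1, num_channels, 1))

    def forward(self, x):
        # Calls the (scripted) function defined above.
        return layer_norm_no_bias(x, self.gamma)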

If I remove the @torch.jit.script, my model converges (regardless of the outer torch.compile). However, with the jit.script the model only converges with an outer torch.compile (which I think takes precedence over / negates the jit.script somehow?).

This is pretty weird, because when I compare the scripted vs. non-scripted layer norm in a notebook, the forward results are basically identical (sometimes a tiny difference, which seems consistent with floating-point error).
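
The notebook check is essentially this (a sketch, not the exact code; sizes are placeholders):

import torch

def layer_norm_no_bias_eager(x, gamma):
    # Same math as above, but without @torch.jit.script
    mean = x.mean(dim=1, keepdim=True)
    var = x.var(dim=1, keepdim=True, unbiased=False)
    return gamma * ((x - mean) / torch.sqrt(var + 1e-6))

scripted = torch.jit.script(layer_norm_no_bias_eager)

x = torch.randn(2, 64, 128)   # (B, C, L)
gamma = torch.ones(1, 64, 1)

out_eager = layer_norm_no_bias_eager(x, gamma)
out_scripted = scripted(x, gamma)
print(torch.allclose(out_eager, out_scripted, atol=1e-6))  # matches for me, up to tiny fp noise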

This is really interesting and great debugging! Thanks a lot for sharing the update.
I don't know how torch.compile would interact with an internal TorchScript module, but note that TorchScript is in maintenance mode, so the current recommendation is to use torch.compile (only).
Does the model work fine without @torch.jit.script in both eager mode and with torch.compile?

CC @marksaroufim for visibility, as it's an interesting issue.


Yes, it works in both modes without the torch.jit.script. It actually feels like two bugs:

  1. a weird/silent interaction between torch.compile and torch.jit.script
  2. a numerical error in the backward pass of torch.jit.script

I opened an issue for the second one, with a minimal reproducible example that shows jit.script producing wrong gradients.
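
The repro boils down to comparing gradients between the scripted and unscripted versions, along these lines (a sketch, not the exact code from the issue; sizes are placeholders):

import torch

def ln_no_bias(x, gamma):
    mean = x.mean(dim=1, keepdim=True)
    var = x.var(dim=1, keepdim=True, unbiased=False)
    return gamma * ((x - mean) / torch.sqrt(var + 1e-6))

ln_scripted = torch.jit.script(ln_no_bias)

def grads(fn):
    torch.manual_seed(0)
    x = torch.randn(2, 64, 128, requires_grad=True)   # (B, C, L)
    gamma = torch.ones(1, 64, 1, requires_grad=True)
    # TorchScript's profiling executor may only optimize the graph after a few
    # calls, so run several iterations and compare the last set of gradients.
    for _ in range(5):
        x.grad, gamma.grad = None, None
        fn(x, gamma).sum().backward()
    return x.grad, gamma.grad

for eager_g, scripted_g in zip(grads(ln_no_bias), grads(ln_scripted)):
    print(torch.allclose(eager_g, scripted_g, atol=1e-5),
          (eager_g - scripted_g).abs().max().item())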

I will just avoid torch.jit.script for now.

Not gonna lie, this issue was pretty funny when I first saw it. I suspect Dynamo has recently started skipping jit, because a few weeks ago I had to skip jit export here: EMFORMER_RNNT not compilable · Issue #106101 · pytorch/pytorch · GitHub

Maybe @bdhirsh has some ideas as to what's going on.
