Mhmm. Well, when I am training my actual model I know the training time decreases because the wall-clock time of each epoch is less; that measurement certainly isn't being affected by CUDA's asynchronous execution.
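To be explicit about what I mean by wall-clock time per epoch, here is a sketch of the kind of timing loop I am talking about (not my exact script; the torch.cuda.synchronize() calls are just there so async kernel launches can't skew the number):

```python
import time
import torch

def timed_epoch(model, loader, optimizer, loss_fn, device="cuda"):
    # Sync before and after the loop so the wall-clock figure reflects
    # completed GPU work, not just queued kernel launches.
    torch.cuda.synchronize()
    start = time.perf_counter()
    for x, y in loader:
        x, y = x.to(device), y.to(device)
        optimizer.zero_grad()
        loss = loss_fn(model(x), y)
        loss.backward()
        optimizer.step()
    torch.cuda.synchronize()
    return time.perf_counter() - start
```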
And as I said, the batch size I am able to train with (without a CUDA out-of-memory error) is much higher.
At first I thought the speedup was some nuance of data I/O with the larger batch size (fewer transfers from main memory to the GPU, maybe?), but with the above script and BATCH_SIZE = 32 I see the decrease in memory use (at least as reported by nvidia-smi) and the decrease in training time even with a fixed batch size.
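One caveat I am aware of: nvidia-smi shows what PyTorch's caching allocator has reserved (plus the CUDA context), not what the tensors actually occupy, so a cleaner comparison might be the allocator's own peak stats, something like:

```python
import torch

torch.cuda.reset_peak_memory_stats()
# ... run one training epoch here ...
print(f"peak allocated: {torch.cuda.max_memory_allocated() / 2**20:.0f} MiB")
print(f"peak reserved:  {torch.cuda.max_memory_reserved() / 2**20:.0f} MiB")
```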
I was also able to reproduce the behavior with this script on another machine (with a Quadro RTX 6000). Note, too, that when I run it on my actual model I am loading data off the SSD, just in case you thought it might be something to do with the contrived nature of this example script…
Thanks for your thoughts, I am just looking for understanding. My first thought was "surely I am confused and somehow this is reducing my effective batch size?", but I am pretty sure it is in fact running the whole model and using the larger batches. It is also perhaps notable that both with and without the checkpoint I am getting high-80s to mid-90s percent GPU utilization (as reported by nvidia-smi); I am pretty sure it is reliably slightly higher with the checkpoint than without, which I guess is what you'd expect, since presumably it is re-computing the forward pass. It's just such a dramatic effect that I would like to understand how to leverage it every time I train a model, or understand what I am doing wrong…
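In case it helps frame the question, the kind of checkpointing I mean is the stock torch.utils.checkpoint flavor, roughly like this (a contrived sketch with made-up layer sizes and BATCH_SIZE = 32, not my actual model or script; the use_reentrant flag needs a reasonably recent PyTorch):

```python
import torch
from torch.utils.checkpoint import checkpoint_sequential

model = torch.nn.Sequential(
    torch.nn.Linear(1024, 4096), torch.nn.ReLU(),
    torch.nn.Linear(4096, 4096), torch.nn.ReLU(),
    torch.nn.Linear(4096, 1024),
).cuda()

x = torch.randn(32, 1024, device="cuda", requires_grad=True)

# Activations are recomputed segment-by-segment during backward
# instead of being stored for the whole forward pass.
out = checkpoint_sequential(model, 2, x, use_reentrant=False)
out.sum().backward()
```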