My training code does successfully train a known model (VGGnet) on CIFAR-10, so my reasoning is that the problem lies in my model rather than in the training loop. However, I have also compared my model's outputs and gradients exactly against the official model.
I thought about removing the passthrough connections from my DenseNet implementation to narrow this down further. I haven't tried this yet, because I've never seen reports of such an architecture (even correctly implemented) converging on CIFAR-10. And if my implementation without passthroughs did converge, that would suggest the problem lies in the layers that concatenate the input with the output. So, to check that kind of operation directly, I used numdifftools to numerically verify the gradients of a single PyTorch layer that concatenates its input with the output of a fully-connected operation.
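For illustration, this kind of gradient check can be sketched without any deep-learning framework at all. The snippet below is not my actual numdifftools/PyTorch test; it is a minimal stand-in that builds the same structure (a layer concatenating its input with a fully-connected transform of it), derives the analytic input gradient by hand, and compares it against a central-difference numerical gradient. All names here (`forward`, the toy squared-sum loss) are made up for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_out = 4, 3
W = rng.normal(size=(n_out, n_in))
b = rng.normal(size=n_out)
x = rng.normal(size=n_in)

def forward(x):
    # DenseNet-style block: concatenate the input with a
    # fully-connected transform of it, then reduce to a
    # scalar test loss so the gradient is a single vector.
    h = W @ x + b
    y = np.concatenate([x, h])
    return 0.5 * np.sum(y ** 2)

# Analytic gradient of the loss w.r.t. x:
#   the passthrough half contributes x,
#   the fully-connected half contributes W^T (W x + b).
h = W @ x + b
grad_analytic = x + W.T @ h

# Central-difference numerical gradient, one coordinate at a time.
eps = 1e-6
grad_numeric = np.zeros_like(x)
for i in range(n_in):
    e = np.zeros_like(x)
    e[i] = eps
    grad_numeric[i] = (forward(x + e) - forward(x - e)) / (2 * eps)

# The two gradients should agree to within finite-difference error.
print(np.max(np.abs(grad_analytic - grad_numeric)))
```

If the concatenation were mishandled in backprop (e.g. the passthrough term `x` dropped from the gradient), the two vectors would disagree by a large margin instead of agreeing to roughly `eps**2` precision.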
Another idea for reducing the DenseNet to a known architecture: I could start from a ResNet that is known to converge and add DenseNet features one at a time. However, these intermediate architectures are not known to converge, so if one fails, I won't know whether the cause is a code bug or something more fundamental.