Why is preprocessing absorbed into detection models?

I’ve noticed a trend in libraries like mmdetection, detectron2, and even torchvision of absorbing the preprocessing into the nn.Module.

So for a classification model we normally normalize the pixels before they reach the model, whereas a detection model has that and other transforms built into the nn.Module. Furthermore, the detection models seem to have a bunch of assertions and checks (e.g. whether the bbox targets are properly defined).
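
To make the torchvision case concrete, here’s a rough sketch (assuming a recent torchvision, >= 0.13 for the `weights` argument) of how the detection model carries its preprocessing around:

```python
import torch
import torchvision

# Detection models in torchvision ship their preprocessing as part of the nn.Module.
model = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights=None)

# The resizing and normalization live on the module itself:
print(model.transform)
# -> GeneralizedRCNNTransform with Normalize(mean=..., std=...) and Resize(min_size=800, max_size=1333)

# forward() accepts raw [0, 1] images of arbitrary sizes and handles the rest internally.
model.eval()
images = [torch.rand(3, 480, 640), torch.rand(3, 600, 800)]
with torch.no_grad():
    predictions = model(images)  # list of dicts with 'boxes', 'labels', 'scores'
```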

Personally, I’m not a fan of this approach: I like to keep the DNN in the nn.Module and anything else, like the preprocessing, separate. I’d go as far as to say that even the losses should be calculated separately (whereas in torchvision some of the losses are calculated within the submodules).
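
That’s the part that bothers me most. Continuing the sketch above, in training mode the same forward() validates the targets and hands back the losses directly:

```python
# In training mode, forward() checks the targets and computes the losses itself.
model.train()
images = [torch.rand(3, 480, 640)]
targets = [{"boxes": torch.tensor([[10.0, 20.0, 100.0, 200.0]]),
            "labels": torch.tensor([1])}]
loss_dict = model(images, targets)
# keys: 'loss_classifier', 'loss_box_reg', 'loss_objectness', 'loss_rpn_box_reg'
loss = sum(loss_dict.values())
```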

But why is that the trend? (This is my main question.) I’d love it if someone could convince me to drop my preference and go with the flow.

Well, I would say it’s for convenience.
Preprocessing is necessary for models to work. That’s a fact: wrong preprocessing -> wrong output.
From an API perspective it also makes more sense. If you want to process images, you just pass a natural image with values between 0 and 255, not a normalized image with some strange mean and std. That makes the task easier for beginners and non-CS people.

This way also ensures you can grab just the model out of a framework and it will work the way its designers intended.
I’d guess preprocessing used to live outside the model for computational efficiency; with newer GPU generations it’s often faster to do many of these operations on the GPU than on the CPU.
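
As a minimal sketch of what I mean (the class name and the stand-in backbone here are just placeholders, not anyone’s actual API): the model owns its normalization constants, so callers pass raw 0-255 images and the preprocessing runs on whatever device the model lives on:

```python
import torch
import torch.nn as nn

class ClassifierWithPreproc(nn.Module):
    """Hypothetical classifier that owns its own normalization."""

    def __init__(self, backbone: nn.Module):
        super().__init__()
        self.backbone = backbone
        # Buffers follow the model, so .cuda() moves the preprocessing to the GPU too.
        self.register_buffer("mean", torch.tensor([0.485, 0.456, 0.406]).view(1, 3, 1, 1))
        self.register_buffer("std", torch.tensor([0.229, 0.224, 0.225]).view(1, 3, 1, 1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: uint8 images of shape (N, 3, H, W) with values in [0, 255]
        x = x.float() / 255.0
        x = (x - self.mean) / self.std
        return self.backbone(x)

# Callers never need to know the "strange mean and std":
model = ClassifierWithPreproc(backbone=nn.Sequential(nn.Conv2d(3, 8, 3), nn.Flatten(1)))
out = model(torch.randint(0, 256, (2, 3, 32, 32), dtype=torch.uint8))
```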

Thanks for your input @JuanFMontesinos. I’m all for that sort of convenience, I just don’t think it should live under the “forward” function of a “neural_network”.Module. If we’re going to go down the path of doing pre/post-processing on the GPU, then there should be separate types of modules that handle it. It’s not just a cosmetic thing: it also means that when I want to do something custom with the pre/post-processing, I have to dive into the source code and isolate the neural network from the rest manually.

But all of those things are just coding practices.
I can code an nn.Module called Preprocessing and set it as a layer, which is straightforward to define (and to remove), or as a method within the model class. Even the forward pass of a “network” can be modularised, or it can be a messy script.
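
Something like this rough sketch (the names and the stand-in backbone are made up, not from any particular library):

```python
import torch
import torch.nn as nn

class Preprocessing(nn.Module):
    """Normalization as its own layer: trivial to define, trivial to remove."""

    def __init__(self, mean, std):
        super().__init__()
        self.register_buffer("mean", torch.tensor(mean).view(1, -1, 1, 1))
        self.register_buffer("std", torch.tensor(std).view(1, -1, 1, 1))

    def forward(self, x):
        return (x - self.mean) / self.std

backbone = nn.Sequential(nn.Conv2d(3, 16, 3), nn.ReLU())  # stand-in for the real network
model = nn.Sequential(
    Preprocessing([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]),
    backbone,
)

bare_network = model[1]   # want just the "neural network"? it's one index away
model[0] = nn.Identity()  # or neutralize the preprocessing in place
```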

Anyway, there’s no correct answer here. I used to have many separate modules for the preprocessing and so on, and it ended up being problematic when trying to make the code user-friendly.
Then it becomes something like: hey, you need this model plus these preprocessing funcs. And then one usually has a utils file, a custom package or something similar that people who want to reuse your model will have to copy and understand anyway.
And it’s very often not very well written. You may find reading and cropping mixed in the same function, and then it turns out you want to use another library for the reading.

I ended up writing wrappers around the modules which contain all of this using PyTorch functions. Do you want to reuse the model? Are you a basic user? Copy-paste the wrapper, pass audio, video, images… and don’t struggle any more.
Do you want just the “neural network”? Then just copy that.
Do you want the whole training pipeline with the dataloader and my setup? Then just clone the repo.
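
The shape of those wrappers is roughly this (a sketch; the class name, input sizes and post-processing are just illustrative):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class InferenceWrapper(nn.Module):
    """Hypothetical wrapper: the I/O-facing glue lives here, the bare network inside."""

    def __init__(self, network: nn.Module):
        super().__init__()
        self.network = network  # plain nn.Module, easy to grab on its own

    @torch.no_grad()
    def forward(self, image: torch.Tensor):
        # image: raw uint8 (3, H, W) from whatever reading library the user prefers
        x = image.float().unsqueeze(0) / 255.0
        x = F.interpolate(x, size=(224, 224), mode="bilinear", align_corners=False)
        return self.network(x).softmax(dim=-1)

# Basic user: copy-paste the wrapper and pass an image.
# Expert user: take wrapper.network and ignore the rest.
# Full pipeline: clone the repo.
```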

For example, I recently found an interesting audiovisual transformer from Facebook. The code is a mess. It’s built on a library that creates an object which carries all the cfg, and it’s really hard to understand how it works unless you inspect that object in a debug session. Everything is of course fused with a distributed dataloader, so you cannot just copy the model. To me it would have been much simpler if they had just created an isolated model I could run with my own inputs.

Yeah, well, it sounds like you understand my motivation then. I’m a consultant, so I usually end up wrapping everything up for my clients, and they don’t ask questions as long as I have a good way of presenting the results. So most of the time I don’t need to make my code user-friendly - either no one looks at it, or an expert looks at it, in which case they don’t want it to be an impenetrable fortress with a config as the key. And if I’m using someone else’s code, I almost always dive into how it works and fiddle around with it, so I don’t need/want them to package it all up in a way that’s hard to disentangle (but “user friendly”).

But yeah, I get the general user-friendliness argument.