Thank you for your quick reply, @J_Johnson. But I’m still confused about "combining input tensors to the output tensor" mentioned by @DoctorPolygon. Is this the part being referred to?
def forward(self, x: Tensor) -> Tensor:
    identity = x
You can run a process that crops the image using regular image processing techniques, especially if you always have that white background. The model won’t care if the image has a slightly oddball aspect ratio; just try to maximize the neurons spent on looking at the leaf.
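To make the cropping idea concrete, here is a rough sketch using NumPy. It assumes an RGB `uint8` image on a near-white background; the function name `crop_to_foreground`, the threshold of 240, and the `margin` parameter are placeholders I made up for illustration, so tune them for your own photos.

```python
import numpy as np


def crop_to_foreground(image: np.ndarray, white_threshold: int = 240, margin: int = 0) -> np.ndarray:
    """Crop an H x W x 3 uint8 image to the bounding box of its non-white pixels."""
    # A pixel counts as foreground if any channel is darker than the threshold
    # (240 is an assumed value, not something taken from your data).
    foreground = (image < white_threshold).any(axis=2)

    # Indices of the rows and columns that contain at least one foreground pixel.
    rows = np.flatnonzero(foreground.any(axis=1))
    cols = np.flatnonzero(foreground.any(axis=0))
    if rows.size == 0:
        return image  # only background found, so leave the image untouched

    # Expand the box by the optional margin, clamped to the image borders.
    top = max(rows[0] - margin, 0)
    bottom = min(rows[-1] + 1 + margin, image.shape[0])
    left = max(cols[0] - margin, 0)
    right = min(cols[-1] + 1 + margin, image.shape[1])
    return image[top:bottom, left:right]


# Synthetic stand-in for a leaf photo: a green rectangle on a white canvas.
canvas = np.full((100, 120, 3), 255, dtype=np.uint8)
canvas[30:70, 40:90] = (40, 120, 30)

cropped = crop_to_foreground(canvas)
print(cropped.shape)  # (40, 50, 3)
```

Real photos usually have shadows and off-white patches, so a plain threshold like this may need some blurring or a small margin before the box is tight enough.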
Also, it’s generally not helpful to create a custom architecture when you are starting out in ML. Use something that already exists, like ResNet, U-Net, and so on. There are many people much smarter than you or me who do nothing but R&D and create these blocks and modules.
A ResNet can use depthwise conv just fine. I’d recommend you read more papers, like the original ResNet and EfficientNet papers, which go into more detail than I care to go into here.
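For what it’s worth, here is a minimal sketch of a residual block built around a depthwise convolution, in PyTorch. The class name `DepthwiseResidualBlock` and the layer sizes are my own assumptions for illustration and are not taken from any paper or from torchvision. The point is only that the input is saved as `identity` and added back to the output at the end, which is the "combining input tensors to the output tensor" step.

```python
import torch
from torch import nn, Tensor


class DepthwiseResidualBlock(nn.Module):
    """Illustrative residual block: depthwise 3x3 conv, pointwise 1x1 conv, then the skip add."""

    def __init__(self, channels: int) -> None:
        super().__init__()
        # groups=channels makes this a depthwise conv: one 3x3 filter per input channel.
        self.depthwise = nn.Conv2d(
            channels, channels, kernel_size=3, padding=1, groups=channels, bias=False
        )
        # The 1x1 pointwise conv mixes information across channels.
        self.pointwise = nn.Conv2d(channels, channels, kernel_size=1, bias=False)
        self.bn = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU()

    def forward(self, x: Tensor) -> Tensor:
        identity = x  # keep the unmodified input for the skip connection
        out = self.depthwise(x)
        out = self.pointwise(out)
        out = self.bn(out)
        out = out + identity  # the input tensor is combined with the output here
        return self.relu(out)


block = DepthwiseResidualBlock(channels=16)
x = torch.randn(2, 16, 32, 32)
y = block(x)
print(y.shape)  # torch.Size([2, 16, 32, 32])
```

The add only works because padding and channel count keep the output the same shape as the input. If a block changes the resolution or the number of channels, the identity path needs its own projection so the shapes match.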
Finding the bounds is a basic computer vision task. I’d recommend you study the topic more broadly so you are more familiar with the domain and with what you are trying to do.