So my understanding, after working on GPU RAM fragmentation diagnostics, is that this won’t cause actual fragmentation: an average model is 100MB+, so a whole bunch of entire GPU memory pages will be freed and later reused through remapping of free pages.
The only issue, then, is when there isn’t enough memory left to load the new model without first unloading the old one. In that case the card won’t be able to do much work anyway, other than perhaps a very simple inference. So I suppose this is a very low priority for the devs to spend their time on.
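For anyone who wants to check whether they are in that situation, here is a rough sketch of the condition in Python. The helper names (`must_unload_first`, `free_gpu_bytes`) and the example sizes are made up for illustration; the only PyTorch call used is `torch.cuda.mem_get_info`, and the sketch falls back to `None` when PyTorch or CUDA is not available. Keep in mind that the free figure reported by the driver does not count memory PyTorch's caching allocator is holding for reuse, so treat it as an approximation.

```python
try:
    import torch  # optional: only needed to query a real GPU
except ImportError:
    torch = None

MB = 1024 ** 2


def must_unload_first(free_bytes, new_model_bytes, headroom_bytes=0):
    """True if the new model does not fit into the memory that is free
    right now, i.e. the old model has to be unloaded before loading."""
    return new_model_bytes + headroom_bytes > free_bytes


def free_gpu_bytes(device=0):
    """Free memory on the device as reported by the driver,
    or None if PyTorch/CUDA is not available."""
    if torch is None or not torch.cuda.is_available():
        return None
    free, _total = torch.cuda.mem_get_info(device)
    return free


# Hypothetical numbers: a 100MB model on a card with 300MB vs 50MB free.
print(must_unload_first(300 * MB, 100 * MB))  # False: both models fit
print(must_unload_first(50 * MB, 100 * MB))   # True: unload the old one first
```

If it returns `True`, the usual approach is to drop all references to the old model (and optionally call `torch.cuda.empty_cache()`) before loading the new one.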
All is good then.
Thanks again to @colesbury for his very insightful answer.