I have a collection of CT scan medical image datasets. Because I do not have the computational resources to use all of the data, I would like to sample from it to build training, validation, and test sets. A large number of the images show either both pathologies (GGO and CON) or neither, while the others show only GGO or only CON.
Since I need to sample and split the whole dataset into training, validation, and test sets, what approach(es) would you suggest so that the validation and test sets have approximately the same statistical distribution as the training set?
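To make the goal concrete, here is a minimal sketch of the kind of split I have in mind: group the scans by which pathologies they show (neither, GGO only, CON only, both) and split each group with the same proportions. This is only an illustration under my own assumptions. The function name `stratified_split`, the 70/15/15 fractions, and the toy `labels` dictionary are hypothetical and not taken from any library or from my real data.

```python
import random
from collections import defaultdict


def stratified_split(labels, fractions=(0.7, 0.15, 0.15), seed=0):
    """Split scan ids so each pathology combination keeps the same proportions.

    labels: dict mapping scan id -> (has_ggo, has_con), both booleans.
    fractions: (train, val, test) shares; assumed to sum to 1.
    """
    rng = random.Random(seed)

    # Group scan ids by their pathology combination (the stratum).
    strata = defaultdict(list)
    for scan_id, key in labels.items():
        strata[tuple(key)].append(scan_id)

    splits = {"train": [], "val": [], "test": []}
    for key in sorted(strata):
        ids = sorted(strata[key])  # sort first so the shuffle is reproducible
        rng.shuffle(ids)
        n_train = round(fractions[0] * len(ids))
        n_val = round(fractions[1] * len(ids))
        splits["train"] += ids[:n_train]
        splits["val"] += ids[n_train:n_train + n_val]
        splits["test"] += ids[n_train + n_val:]  # remainder goes to test
    return splits


# Toy data (hypothetical): 40 scans for each of the four combinations.
combos = [(False, False), (True, False), (False, True), (True, True)]
labels = {f"scan_{i:03d}": combos[i % 4] for i in range(160)}
splits = stratified_split(labels)
```

If several scans or slices come from the same patient, I assume the grouping would have to be done per patient rather than per image, so that one patient does not end up in both the training and test sets. Subsampling for the compute limit could presumably be done by keeping only a fixed share of each group before splitting.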