Generalizability issues with deep learning models in medicine

How well do deep learning models generalize on medical images such as CT scans?
According to this paper (Deep learning-based COVID-19 pneumonia classification using chest CT images: model generalizability, not yet peer reviewed), models trained with CNNs do not generalize well across different datasets. The datasets it refers to are not big; the largest one has only about 1,500 to 2,000 CT scans per class.

Another paper, A Real-World Demonstration of Machine Learning Generalizability: Intracranial Hemorrhage Detection on Head CT, says the ML model can generalize well to a different dataset, but the “other dataset” used for testing comes only from the authors' own emergency department in 2019.

I am new to the domain of medical imaging. Does anyone know the current state of progress on this question? Thanks.