Unfortunately, life still ain't like Kaggle.
When searching for publicly available datasets to train a machine learning model for rare disease diagnosis, a hard truth quickly becomes apparent. There is a lot of data out there. But it's unclear whether and how it's useful. In a Kaggle competition, the organizers make sure that the insight they are looking for is in the data. It might be hidden deep in complex relationships. But it's there.
But open-source datasets?
There are countless sources. On all the different aspects of rare diseases and disease diagnosis. But not necessarily in a single repository. The data is often scattered, incomplete, poorly documented, or only partially accessible. You need to piece together fragments from different sources, each with its own formats, labels, and assumptions. To manage this, you need not only technical skills but also a clear understanding of your goal, because your goal determines what data you need and how you define success.