@Christian_Reich.
Fantastic challenge! The data catalog is a very hot topic in the Pharma / Healthcare industry. I believe there are a few commercial and open source (CKAN/DKAN) products out there; some are actually pretty good, but all would require some (intense) level of customization and integration. And depending on who you ask, they can be very broad or very focused.
In my mind, a typical data catalog is essentially a registry of data sets. Users search the registry to discover data sets that are relevant for the task at hand, based on the metadata attributes and data set statistics that describe them.
Regardless, the main challenges that always need to be addressed to keep data catalogs alive and useful are:
- keep data in the data catalog fresh and up to date, in sync with source data
- use standard, consistent and relevant vocabularies in the metadata that describes each data set, to improve the “discoverability” of data
- integrate the catalog into the business process where the data is being used.
Data catalogs typically include the following metadata:
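To make the first challenge concrete, here is a minimal sketch of a freshness check. It assumes the catalog stores the time of its last profiling run and that the source system can report when its data last changed; the function name `is_stale` and the one-day tolerance are purely illustrative, not taken from any particular product.

```python
from datetime import datetime, timedelta


def is_stale(source_updated_at, catalog_profiled_at, tolerance=timedelta(days=1)):
    """Flag a catalog record whose source changed more than `tolerance`
    after the catalog last profiled it (hypothetical rule)."""
    return source_updated_at - catalog_profiled_at > tolerance


# Source changed nine days after the last profiling run: needs a refresh.
stale = is_stale(datetime(2024, 3, 10), datetime(2024, 3, 1))

# Source changed twelve hours after the last profiling run: within tolerance.
within_tolerance = is_stale(datetime(2024, 3, 1, 12), datetime(2024, 3, 1))
```

A scheduled job could run a check like this per data set and trigger re-profiling only for the records it flags.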
- general attributes - name, description, time span, geographical location, ownership, license type.
- business area/process relevant attributes - therapeutic areas, population, drugs and diseases covered, geographical distribution, use of standard vocabularies. This information is typically generated by automatic data profiling tools and integrated into the data catalog record. Achilles is a great example of one of these - just a fantastic tool!
- data lineage - versioning, links to other data sets (raw data or analysis outcome insights) or even to a process.
- data quality attributes - describe the level of data trustworthiness.
- compliance and security - data privacy and access information, etc.
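As a rough illustration of how these attribute groups could sit in one catalog record, and how a user might search on them, here is a small sketch. All field names, the `discover` helper and the two sample records are made up for the example; they are not the schema of Arachne, CKAN or any other product.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from datetime import date
from typing import Optional


@dataclass
class CatalogRecord:
    # General attributes
    name: str
    description: str
    start: date
    end: date
    location: str
    owner: str
    license_type: str
    # Business area attributes (typically filled in by a profiling tool)
    therapeutic_areas: set[str] = field(default_factory=set)
    vocabularies: set[str] = field(default_factory=set)
    # Data lineage
    version: str = "1"
    derived_from: list[str] = field(default_factory=list)
    # Data quality, compliance and security
    quality_score: Optional[float] = None
    access_level: str = "restricted"


def discover(records, therapeutic_area, covers):
    """Names of records tagged with the area whose time span includes `covers`."""
    return [
        r.name
        for r in records
        if therapeutic_area in r.therapeutic_areas and r.start <= covers <= r.end
    ]


# Two invented records, for illustration only.
catalog = [
    CatalogRecord("claims_a", "Example claims data", date(2010, 1, 1),
                  date(2018, 12, 31), "US", "Owner A", "commercial",
                  therapeutic_areas={"oncology"}),
    CatalogRecord("ehr_b", "Example EHR data", date(2012, 1, 1),
                  date(2020, 12, 31), "EU", "Owner B", "research-only",
                  therapeutic_areas={"cardiology"}),
]

matches = discover(catalog, "oncology", date(2015, 6, 1))
```

In practice the search would run over many more attributes, but the idea is the same: the richer and more standardized the metadata, the more precise the discovery.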
The Odysseus Arachne platform is built for conducting federated as well as local studies in a collaborative way. It includes a simple-to-use yet quite sophisticated data catalog, which plays a clear supporting role within the business process (the study): data node owners register and describe their data sets so that study teams can discover them. The study teams use the Arachne data catalog to find data sets and subsequently request and gain an appropriate level of controlled access to them. Since we have integrated it with OHDSI Achilles, researchers can also use the data catalog record to perform a high-level feasibility assessment of a data set for a certain type of analysis.
It is a very interesting topic - I would be happy to present what has been done so far and discuss it further in one of our OHDSI WG meetings.