Evaluating the Potential of MotherDuck for Enhancing OHDSI's Open-Source Ecosystem

BLUF: Would partnering with MotherDuck benefit the OHDSI open-source community?

Background

I recently had a meeting with MotherDuck to discuss their new pricing model and its support for open-source and open-science initiatives. I am seeking community feedback on two aspects:

  1. Potential use cases for MotherDuck

  2. Estimated usage rates of certain services (Athena, Atlas, and GitHub Actions Tests)

I’m eager to hear your ideas. The adoption of DuckDB in OHDSI tooling has been increasing, and MotherDuck could significantly enhance our capabilities. I’ve outlined one promising use case below, but I’m keen to gather the broader community’s input.

What is MotherDuck?

MotherDuck is a cloud-based managed service that enhances DuckDB, an in-process SQL database designed for efficient data analytics. It offers scalability and managed operations, facilitating seamless integration with various data sources and cloud services. By leveraging DuckDB, MotherDuck provides fast query performance and simplifies the deployment and management of database infrastructure, making it ideal for serverless analytical workflows.

Example Use Case: Serverless Distribution of Synthetic Data, CDM Meta-Data, and Vocabularies for Testing and Demo Purposes

Primary Actors: OHDSI tool-stack developers and users

Secondary Actors: ETL and vocabulary developers

Goals:

  • Provide an accessible catalog of non-PII/PHI datasets and tables relevant to OHDSI tools

  • Establish a standard target for automated testing (e.g., GitHub Actions)

  • Enable the offloading of compute for interactive tutorials/demos, regardless of local compute resources

Systems: MotherDuck, DuckDB, Eunomia, Athena, Atlas, Broadsea, HADES

Preconditions:

  • Estimated usage rates

  • Process for distributing service tokens

Flows:

  • Atlas Demo: The current data sources (ATLASPROD, SYNPUF 1%, SYNPUF 5%) are stored in MotherDuck and accessed through the API. Larger synthetic datasets could then be added with no additional resource requirements for the current host of the Atlas Demo.

  • Athena: Compute tasks for vocabulary searches and downloads can be offloaded to MotherDuck.

  • Broadsea: When building the vocabulary, the config can list the selected vocabularies and specify whether each one is downloaded or kept serverless, reducing download overhead and local query compute for demos or development.

  • Testing via GitHub Actions: OHDSI tools that require testing against the vocabulary or a synthetic CDM like Eunomia can target MotherDuck databases by storing the Service Token in GitHub Secrets.
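To make the GitHub Actions flow more concrete, here is a minimal Python sketch of how a test suite might choose between MotherDuck and a local DuckDB file. It assumes the workflow exposes the Service Token from GitHub Secrets as an environment variable named MOTHERDUCK_TOKEN, and that a MotherDuck database called "eunomia" exists. Both names, and the two helper functions, are hypothetical. The "md:" connection string with a motherduck_token parameter follows the form shown in MotherDuck's documentation, but please check the current docs before relying on it.

```python
import os


def motherduck_dsn(database, env=os.environ):
    """Build a MotherDuck connection string, or return None if no token is set.

    MOTHERDUCK_TOKEN is an assumed variable name: the workflow would have to
    map a GitHub secret onto it.
    """
    token = env.get("MOTHERDUCK_TOKEN")
    if not token:
        return None
    return f"md:{database}?motherduck_token={token}"


def connect_for_tests(database="eunomia", local_path=":memory:"):
    """Connect to MotherDuck when a token is available, else to local DuckDB.

    "eunomia" is a placeholder database name, not an existing MotherDuck share.
    """
    # Imported here so the helper above works without duckdb installed.
    import duckdb

    dsn = motherduck_dsn(database)
    return duckdb.connect(dsn if dsn else local_path)
```

With the token set, connect_for_tests() would run queries against the cloud database. Without it (for example on a fork, where secrets are not available), the same tests fall back to a local in-memory or file-based DuckDB, so contributors without a token can still run the suite.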

As you can see here, we test HADES packages against multiple database platforms, because each platform has its own idiosyncrasies. I see I didn’t add DuckDB to this table, but currently we only test DatabaseConnector against DuckDB, even though we should support it throughout HADES.

We could swap out the current file-based DuckDB with a cloud-based one for testing, but it would be just one of the many database platforms we support. Perhaps we could benefit from switching other resources to MotherDuck, but I would not be the right person to comment.
