
A serialization format for OHDSI CDM

Over the last few weeks I have been exploring the OHDSI CDM and associated tools, and comparing it with Computational Healthcare (a software stack for AHRQ HCUP data developed by me). From my initial exploration, it's apparent to me that the OHDSI CDM might benefit significantly from a serialization format defined in an Interface Definition Language (IDL) such as Protocol Buffers, Thrift, or Avro.

Such a format would allow information about Persons, Visits, Condition Occurrences, etc. to be processed in chunks, while maintaining logical integrity in any programming language (Python, C++, Java) outside a database. Further, it would enable the creation of servers and clients for remote procedure calls using frameworks such as gRPC, and efficient processing on Spark/Hadoop using columnar data formats such as Apache Parquet.
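
To make that concrete, here is a minimal sketch in Python, assuming a Person message has been defined in ohdsi.proto and compiled with protoc into an ohdsi_pb2 module (the message and field names here are illustrative, not final):

```python
# Assumes: protoc --python_out=. ohdsi.proto has generated ohdsi_pb2,
# and that it defines a Person message with these illustrative fields.
import ohdsi_pb2

person = ohdsi_pb2.Person()
person.person_id = 42
person.year_of_birth = 1956
person.gender_concept_id = 8507

# Serialize to a compact binary blob that any language with the same
# .proto file (Java, C++, JS, ...) can parse back into a typed object.
blob = person.SerializeToString()

restored = ohdsi_pb2.Person()
restored.ParseFromString(blob)
assert restored.person_id == 42
```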

I am currently trying to translate the CDM specifications into protobuf using the SynPUF data. It's very much a work in progress, but you can take a look here:

https://github.com/AKSHAYUBHAT/ComputationalHealthcare/blob/master/blog/ohdsi.proto

A good introduction to serialization frameworks:
http://ganges.usc.edu/pgroupW/images/a/a9/Serializarion_Framework.pdf

Thanks for your wonderful work. I want to participate in this work.

Can you explain more about what this means:

might benefit significantly from a serialization format

Are you saying that the NOTE table should be one large 8 GB file?

A serialization format is, in my view, a file on my hard drive. Correct?

The OMOP [historical] approach is to use a relational database, because the data is quite large. Are you arguing for some other approach?

The issue with keeping the entire data model as a relational database schema is that there is no way to maintain consistency outside the database. In addition to maintaining the data inside the database, the data can be serialized into Person, Note, Visit, etc. messages.

E.g., consider this work:

Each process generates CDM files in CSV format, as well as a flat-file, single-person representation
of the CDM in JSON format for further PaaT (Person-at-a-Time) processing.

In this case, in addition to having the CDM data in CSV format, they also stored the data with nested person/visit objects in JSON format. Rather than using JSON with a homegrown converter, a serialization framework ensures that the schema is maintained outside the database, and enables scalable processing, since the same framework can be used to build RPC clients, etc.
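
As a rough sketch of that nested person/visit idea with protobuf instead of ad-hoc JSON (this assumes ohdsi.proto gives Person a repeated VisitOccurrence field; all names are illustrative):

```python
import ohdsi_pb2  # hypothetical module generated from ohdsi.proto

person = ohdsi_pb2.Person(person_id=42)

# Nested messages keep each visit logically attached to its person,
# just like the single-person JSON representation, but schema-checked.
visit = person.visit_occurrences.add()
visit.visit_occurrence_id = 1001
visit.visit_concept_id = 9201  # e.g. an inpatient visit

blob = person.SerializeToString()  # one self-contained Person-at-a-Time record
```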

Regarding your specific questions:

Are you saying that the NOTE table should be one large 8 GB file?

By splitting the same table into nested patient-level messages across multiple files (~100,000 messages), processing can be made significantly faster. Typically, databases are limited by the computational power available to the database engine. If the same data is also stored on a distributed file system such as S3 or HDFS, it can be processed much faster by distributing the load across multiple machines.
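
A sketch of what such sharding could look like in Python, so that workers (or Spark/Hadoop tasks reading from HDFS/S3) can process shards in parallel; the length-prefix framing, shard layout, and the generated ohdsi_pb2 module are all assumptions here:

```python
import struct
import ohdsi_pb2  # hypothetical module generated from ohdsi.proto

def write_shard(persons, path):
    # Length-prefix each message so a reader can split the byte stream
    # back into individual Person records.
    with open(path, "wb") as f:
        for p in persons:
            blob = p.SerializeToString()
            f.write(struct.pack("<I", len(blob)))
            f.write(blob)

def read_shard(path):
    # Yield Person messages one at a time; each shard can be handled by
    # a separate worker/machine.
    with open(path, "rb") as f:
        while True:
            header = f.read(4)
            if not header:
                break
            (size,) = struct.unpack("<I", header)
            p = ohdsi_pb2.Person()
            p.ParseFromString(f.read(size))
            yield p
```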

A serialization format is, in my view, a file on my hard drive. Correct?

A serialization format such as protobuf or Thrift maintains consistency for files stored on disk, data sent over the network, and objects within a programming language (by performing type checking, enforcing required/optional fields, etc.).
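
For example, the generated Python class rejects values that violate the schema, wherever the object lives (again assuming a hypothetical ohdsi_pb2 module with an int64 person_id field):

```python
import ohdsi_pb2

person = ohdsi_pb2.Person()

try:
    person.person_id = "not-a-number"   # wrong type for an int64 field
except TypeError:
    print("rejected: wrong type for person_id")

try:
    person.person_idd = 42              # misspelled field name
except AttributeError:
    print("rejected: field not in the schema")
```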

Thanks for talking about this today, Akshay.

So, if I wanted to try this for retrieving CDM data from Postgres into JavaScript, what would I need to do?

To give a concrete example, consider how the OHDSI WebAPI currently implements PersonService [1]: it executes a SQL statement [2 and/or 3] and stores the results in a PersonProfile object [4]. This object then gets converted into JSON and sent as a response.

Now, similar to how a PersonProfile class exists in Java, one would need to create an equivalent class in Python, C++, or JS. With Protobuf and gRPC, you would write all the data structures (Person, Visit, Drug, Condition Occurrence, etc.) in a single file, which will automatically generate classes for each programming language (similar to PersonProfile.java). You can also write all the services, e.g. getPersonProfile(String sourceKey, String personId) -> PersonProfile, in a similar manner, and gRPC will create stubs for clients and servers. This ensures that the core data model is not strongly tied to a single programming language.
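
On the client side, calling such a service from Python could look roughly like this; it assumes ohdsi.proto declares a PersonService with a GetPersonProfile rpc, and that protoc/grpc codegen produced ohdsi_pb2 and ohdsi_pb2_grpc (all names illustrative):

```python
import grpc
import ohdsi_pb2       # hypothetical generated messages
import ohdsi_pb2_grpc  # hypothetical generated stubs

channel = grpc.insecure_channel("localhost:50051")
stub = ohdsi_pb2_grpc.PersonServiceStub(channel)

request = ohdsi_pb2.PersonProfileRequest(source_key="SYNPUF", person_id=42)
profile = stub.GetPersonProfile(request)  # returns a typed PersonProfile
```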

for retrieving CDM data from Postgres into JavaScript, what would I need to do?

You would need to implement a simple server (Python, Go, Java, etc.) which executes getRecords.sql and stores the results in a Person protobuf object (Protobuf will generate a class for your language). You can then send the object using code generated by gRPC. To use this in JavaScript, you would need to provide ohdsi.proto to ProtoBuf.js [5] and use the service object it generates to make a request. The response will be automatically parsed and converted into a Person object.
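
A sketch of such a server in Python; the generated modules, service and field names, and the SQL are illustrative stand-ins (not the actual WebAPI query):

```python
from concurrent import futures
import grpc
import psycopg2
import ohdsi_pb2       # hypothetical generated messages
import ohdsi_pb2_grpc  # hypothetical generated stubs

class PersonService(ohdsi_pb2_grpc.PersonServiceServicer):
    def GetPersonProfile(self, request, context):
        conn = psycopg2.connect("dbname=cdm")  # adjust to your CDM database
        with conn, conn.cursor() as cur:
            # Stand-in for getRecords.sql
            cur.execute(
                "SELECT person_id, year_of_birth FROM person WHERE person_id = %s",
                (request.person_id,),
            )
            row = cur.fetchone()
        profile = ohdsi_pb2.PersonProfile()
        if row:
            profile.person.person_id = row[0]
            profile.person.year_of_birth = row[1]
        return profile  # gRPC serializes this for any client, including JS

server = grpc.server(futures.ThreadPoolExecutor(max_workers=4))
ohdsi_pb2_grpc.add_PersonServiceServicer_to_server(PersonService(), server)
server.add_insecure_port("[::]:50051")
server.start()
server.wait_for_termination()
```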

Eventually, when CDMv5 evolves into CDMv6, you would no longer need to update the code in every programming language manually. All you would need to do is update ohdsi.proto and use the generated code.

[1] https://github.com/OHDSI/WebAPI/blob/master/src/main/java/org/ohdsi/webapi/service/PersonService.java
[2] https://github.com/OHDSI/WebAPI/blob/master/src/main/resources/resources/person/sql/personInfo.sql
[3] https://github.com/OHDSI/WebAPI/blob/master/src/main/resources/resources/person/sql/getRecords.sql
[4] https://github.com/OHDSI/WebAPI/blob/master/src/main/java/org/ohdsi/webapi/person/PersonProfile.java
[5] https://github.com/dcodeIO/ProtoBuf.js/
