
Database-wide descriptive statistics sharing

Continuing a conversation that had started via e-mail with @Vojtech_Huser, @Patrick_Ryan, and @jon_duke.

When we created Achilles, we envisioned every OHDSI partner in the world making their Achilles results publicly available to facilitate collaboration. Several years later, only one site (AUSOM) has actually done so. (Yes, my site has not posted its data online either.) Even though the Achilles results are explicitly designed not to reveal any patient-identifying information, people are reluctant to share them. I think one reason is that we’re opening Pandora’s box: Achilles’ information is so rich that we think there is probably no problem in posting it, but we can’t be sure. Some people have already mentioned that certain condition codes by themselves can be used to infer malpractice, and could hypothetically be used to harm the data site. And once the data is out there, it cannot be undone.

However, I still think there is a need to share general descriptive statistics. @Vojtech_Huser proposed several different levels of sharing, some public and some between trusted partners. I would like to approach this from specific use-cases: when do we need what type of information?

One use case is finding sites that I would like to ask to participate in a study. For example, take our angioedema safety study. For this study, I would imagine the minimum requirements are that the data captures drugs and conditions in the outpatient settings for all genders, preferably not restricted to any age group. Currently, I have no way of identifying such sites in OHDSI. I have created a mockup here of information that could be generated by Achilles that we could all share, for example by posting it on our Wiki.

Let me know what you think! Would people be willing to share the data shown in my mockup? Would this be useful? Would the Wiki be a good place to keep these results, or should we build an app for that?


I agree that no one can guarantee the anonymity of the data. Decision makers may not want to take on any latent or invisible risk, a risk that does not exist if the data remain closed.
A spectrum of summarization tools, from a very simple one like the mockup, through intermediate tools, to a very comprehensive tool like Achilles, might give decision makers more options to choose from.

I really appreciate @vojtech_huser pushing on this topic, and thanks
@schuemie for keeping the momentum.

I’m not wedded to this idea, but I just want to throw it out there:

Right now, it seems the perception about ACHILLES is that ‘it’s all or
nothing’. That is, if you run ACHILLES out of the box, it provides a
fairly comprehensive summary of all the content across all the tables in
the CDM. On the one end, we’ve heard from folks, including @rijnbeek and
@vojtech_huser, that there are additional aggregate summary statistics
which would be desirable to add into ACHILLES and ACHILLES HEEL to better
characterize the population and evaluate data quality issues. On the
other end, we’ve got data holders who still aren’t comfortable with sharing
these granular aggregate summary statistics, potentially for a variety of
scientific and non-scientific reasons.

In fact, ACHILLES offers the ability for users to choose which summary
statistics to generate (by setting the parameter to include the list of
analysis_ids) and a user can also opt to export only a subset of the JSON
files that are used to render the web UI. It could be very reasonable
(and is technically already possible) for some data holders to run the
entire ACHILLES build for local review, but then choose to expose only a
subset of the summary statistics in a version that is made publicly
available on the OHDSI website.
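For reference, that selective run is roughly expressible with the Achilles R package today. The sketch below assumes the `analysisIds` parameter of the `achilles()` function; the specific IDs shown are placeholders, and parameter names should be checked against the installed package version:

```r
library(Achilles)

# Run only a chosen subset of summary statistics.
# The analysis IDs here are placeholders, not a recommended 'small' set.
achilles(connectionDetails,
         cdmDatabaseSchema     = "cdm",
         resultsDatabaseSchema = "results",
         analysisIds           = c(1, 2, 101, 105))
```

A site could then export only the JSON files corresponding to those analyses for its public instance.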

I think @schuemie’s use case is an important one that we’ve now run into
several times already as a community. To satisfy this use case, we’d be
asking sites to expose the types of data they contain, but ideally you’d
also like to have the basic prevalence of each drug (like here:
http://www.ohdsi.org/web/achilles/#/OHDSI_Sample_Database/drugeras) and
condition (like here:
http://www.ohdsi.org/web/achilles/#/CMS_SYNPUF_synthetic_data/conditions).
(Thanks to @rwpark and @lee_evans for sharing content on the OHDSI
website).

Ultimately, this comes down to each data partner’s comfort with sharing.
Some may only be willing to share a little at the start, some may do more,
some may follow @rwpark’s excellent example and share it all. Ideally,
we’d have only one platform that would accommodate all sites at whatever
level of sharing they were willing to entertain. It seems like ACHILLES
already IS this platform. Rather than building a different reporting
structure in wiki, why don’t we organize the reports in ACHILLES around
these principles of sharing. So, there could be a report that provides the
top-line information that @schuemie and @vojtech_huser propose (which
essentially is in ACHILLES as the ‘Person’ report (
http://www.ohdsi.org/web/achilles/#/OHDSI_Sample_Database/person) plus the
‘Data density’ report (
http://www.ohdsi.org/web/achilles/#/OHDSI_Sample_Database/datadensity)).
If we find that there’s something not in ACHILLES that we’d like to add
into a report (like for example IRIS statistics), then instead of building
a different mechanism for it, why not either build a new report inside of
ACHILLES or add the table of stats to an existing report. And this way,
sites can opt to share as much or as little of the summary as they want,
and when an end user goes out to OHDSI.org/web/ACHILLES, they’ll see all
reports available but the content will only show for the sites who opted to
share it (and will pop up as blank for those who opt against sharing for
now).

Would sites be willing to expose some (if not all) of their ACHILLES
results under this approach?

I like very much such discussion.

To add, there was also a proposal by Lee Evans (that I seconded) in the architecture group meeting to potentially make the Achilles_results table a core derived table in CDM v6, on the same level as the condition_era table (possibly removing the cost pre-computations).

  1. I agree that Achilles is an important platform to build on. That is why I moved the Iris analyses there.

  2. Another observation I made is that it is not that easy to build new analyses and new Heel rules (data quality checks). One problem is the ID space for analyses (the DQI group suggested using a string for analysis_id, which would make naming derived measures easy, e.g., Visit:Visit_typeByMonth and the derived Visit:Visit_type (for counts with no time component)). Another problem is the data type: integer (current table) vs. float.

  3. I found that it is not that easy to get consensus on additions to Achilles - especially new Heel rules. People try to defend their claims databases and resist EHR-ish data quality measures. New rules make existing datasets look worse - a known resistance problem in data quality research. For Heel, adopting a “utopian data quality assumption” (setting a high bar for what counts as a high-quality data set) and being truly open to many data quality rule ideas are essential.

  4. I am not a big fan of the wiki syntax. It is only good for one platform. The trend is toward common formats. In the current Iris beta version, I am targeting a .CSV file output for the AchillesShare function (for all levels). Such a CSV, or set of CSVs, would be transformable to different outputs (e.g., HTML, .MD (markdown), or DOCX). CSV files would also be computable - so 17 datasets, each with a zipped set of CSV files, could be compared inside some Pick-a-CDM app.

AchillesShare beta (in the Iris package for now, to keep my development simpler) can be tried like this:

#Execute the early implementation of Achilles Share
shareRes <- Iris:::achillesShare(connectionDetails,
                                 cdmDatabaseSchema = cdmDatabaseSchema,
                                 resultsDatabaseSchema = resultsDatabaseSchema)

#Optionally include that in the export
write.csv(shareRes, paste0(cdmDatabaseSchema, '-iris_part-', 1, '.csv'),
          na = '', row.names = FALSE)

See https://github.com/OHDSI/Iris/blob/master/extras/notes.md#executing-iris-on-multiple-datasets-sample-code

Hi @Patrick_Ryan,

Happy to use Achilles for this purpose, but some points to consider:

  1. I recommend we create a ‘proposed minimum set’ of statistics to share (like in my mockup). Saying that sites are free to choose any of the statistics to share may lead to an inability to decide. Often a review board has to approve the data before it is made public, and it is much easier to communicate with a concrete set in mind instead of saying “we could do all this or some of it or none of it”.
  2. Some things would need to be added to Achilles, like the description, contact e-mail, CDM and vocab versions.
  3. In my mockup, we deliberately reduced the detail level. For example, age at first observation is in 5-year age bins, not per year, and in percent instead of number of people. Not sure how important that is.
  4. We’d need to set up a public Achilles, and put a mechanism in place for managing it (how to push updates from sites etc.)

Would it be useful to explicitly define the intended purpose(s) of sharing? If the primary purpose is to facilitate collaboration, there can be challenges to inferring a site’s data suitability for a project from aggregate metrics on a broad range of data domains. In my experience, it is common for data-holding sites that such aggregate data suggest are suitable to have less reliable data than is needed once quality is assessed carefully. Data that has been used a lot for analysis has a higher chance of being complete and accurate, because it has been improved by scrutiny, and carries less uncertainty about integrity issues for the same reason. It might be worth thinking about a standard way to capture and publish metadata about the domains that have been scrutinized heavily for a project. IMO, documented prior use with DQA details is a trustworthy signal that lots of additional data exploration and development probably won’t be needed, or won’t reveal unknown major issues.

Hi @Andrew,

Yes, I’d like to focus on the use case of facilitating collaboration by identifying sites that potentially could be asked to join a study.

Data-quality is indeed extremely important, and I’m a bit sad we don’t have more data quality work in OHDSI (at least not that I know of). I like the idea of using ‘prior use of the data’ as one indicator of data quality.

But what I’d really like to do is just get us started on sharing something. We currently are not sharing anything about our data (except what is in this table). Sharing high-level statistics as highlighted in my mockup would be a first step I think most people would agree to (although they might not, I don’t know yet). As @Vojtech_Huser can tell you, people are highly reluctant to share data quality information.

Just to follow-up on this point:

I would recommend we keep the achilles_results table in the results_schema rather than the main cdm_schema in order to keep the CDM limited to patient level data.

Otherwise, while I love @schuemie’s Westeros health mockup, I would favor @Patrick_Ryan’s suggestion to add the additional elements into Achilles rather than placing them in a separate spot. I think the level of detail that Martijn put in his example should be acceptable to any privacy officer.

One interim approach to seeding public Achilles would be adding an Upload option to the Achilles web app. That button would give you a set of checkboxes to select which reports you want to share. Check the ones you want, submit, and the JSON files get pushed up to the OHDSI server. Nothing fancy (and there are some security considerations), but it should make it dead simple for people to at least push their demographics etc. to the public site.

Over the longer term, I would favor a public and private WebAPI at each site so permissions could be turned on and off without having to upload files directly (and irretrievably). The public WebAPI would be unable to access anything but Achilles_results, and could be configured to serve the JSON files on demand based on permissions. More complicated, because each site would need to host a publicly visible API, but would offer a lot more flexibility and security for turning on and off the tap.

Jon

To @schuemie’s points to consider:

  1. I 100% agree we would need to define a ‘small’, ‘medium’, ‘large’
    version of summary statistic exports, and each site could choose what level
    they were comfortable with (and of course, increasingly step up as they get
    comfortable with the notion that sharing aggregate information is not a bad
    thing but actually can move the science forward).

My proposal as a straw man would be the following ACHILLES reports:

‘small’ = ‘proposed minimum set’ = ‘Dashboard’ (CDM summary, population
by gender, age at first observation, cumulative observation, persons with
continuous observation by month), and ‘Data density’ (‘Total rows per
domain’, ‘Records per person’, ‘concepts per person’). It’s not
implemented in the ACHILLES web portion yet, but I’d like to see
the table of IRIS statistics on the ‘Data density’ report as well.

‘medium’ = all reports without the concept-level drilldowns. So, for
example, ‘Conditions’ would allow you to see the treemap and table that
gives you prevalence and records per person, but you won’t be able to click
down to see prevalence by age/gender/year, by month, or breakdown by type
or age. This would expose much greater information than ‘small’, but
would still be extremely high level, and would represent very little
risk to an institution since all small cell counts would be scrubbed and it
wouldn’t be possible to get down to any summary statistics that would be
generated off a low prevalence concept.

‘large’ = all reports with all drilldowns. We designed ACHILLES to be
low-risk for all content, and we should STRONGLY encourage the community to
follow @rwpark’s tremendous example in sharing as broadly as possible.

I don’t know how the community feels about the ACHILLES HEEL data quality
report… clearly it’s EXTREMELY valuable in understanding whether a
database is ready for research, but some could misinterpret it as exposing
the warts of a source in a way that is unflattering. That’s why I’ve
relegated it to ‘medium’/‘large’ even though I’d personally love to see it
in ‘small’.

2). Yes, I would propose we put CDM_SOURCE content on the ACHILLES
Dashboard page as well. If sites are following the OHDSI conventions with
V5, this table should be populated with the relevant meta-data to tell
others what the data contain in a useful way. I’d actually think exposing
the CDM_SOURCE content on all OHDSI apps will generally be a very good
idea…

3). I’d recommend we go forward with the existing solution, rather than
modifying it, making people re-run, and then having to do new development
on ACHILLES. I’d only favor reducing the level of detail if a site says
they are unwilling to share unless this was done (and I’d be fascinated
to hear why that would be the case).

4). We have a public ACHILLES already: http://ohdsi.org/web/ACHILLES.

To amend @jon_duke’s recommendation, I would NOT suggest we export JSON
files and have them shared to the OHDSI central server. Rather, I suggest
we add an export to the ACHILLES R package that creates a .csv of the
underlying ACHILLES_RESULTS and ACHILLES_RESULTS_DIST tables, subset to
only the ‘small’, ‘medium’, or ‘large’ sets as requested by the user. In
this way, the .csv can be loaded into the OHDSI central database and the
JSON can be generated centrally. This will also help get around some of
the occasional performance issues some people run into on JSON export when
they don’t have their vocabularies adequately indexed…

So, assuming folks were legitimately willing to join the journey of
evidence sharing, the only three technical tasks left to do are: 1) add an
‘exportToOhdsi’ function in ACHILLES, 2) @lee_evans can stand up a secure
S3 bucket to host the results (just like we do for our OHDSI network
studies), and 3) we’d need to load the imported csv into a database and
have the ACHILLES ExportToJSON set up to kick out files for whatever
databases get sucked in…
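To make the first task concrete, here is a minimal sketch of what a hypothetical ‘exportToOhdsi’ function could look like. The function name, the analysis ID sets, and the level definitions are all assumptions for illustration, not existing Achilles code; only the DatabaseConnector calls reflect the real package API:

```r
# Hypothetical sketch: export a 'small'/'medium'/'large' subset of
# achilles_results to CSV for central loading. Analysis ID sets are placeholders.
exportToOhdsi <- function(connectionDetails, resultsDatabaseSchema,
                          level = "small", outputFile = "achilles_share.csv") {
  analysisIdSets <- list(small  = c(1, 2, 101, 105),  # placeholder IDs
                         medium = 1:199,              # placeholder range
                         large  = NULL)               # NULL = everything
  ids <- analysisIdSets[[level]]

  conn <- DatabaseConnector::connect(connectionDetails)
  on.exit(DatabaseConnector::disconnect(conn))

  sql <- sprintf("SELECT * FROM %s.achilles_results", resultsDatabaseSchema)
  results <- DatabaseConnector::querySql(conn, sql)

  # Keep only the analyses the site agreed to share
  if (!is.null(ids)) {
    results <- results[results$ANALYSIS_ID %in% ids, ]
  }
  write.csv(results, outputFile, na = "", row.names = FALSE)
}
```

The same subsetting would apply to ACHILLES_RESULTS_DIST; the central server would then load the CSV and generate the JSON.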

There are lots of reasons that you might limit access. We have to be careful not to over-judge (“assuming folks were legitimately willing to join the journey of evidence sharing”).

For example, once an institution publishes an error rate (e.g., one implied by an ICD9 code), there is no way to take it back in a legal case. You can’t then say that there was probably miscoding and it shouldn’t have been published. We publish plenty of error rates, but they are curated.

George

Thanks @hripcsa, so are you saying that even ‘medium’ sharing may be too
much for some? If so, how do we overcome this issue? Surely we don’t want
to limit ourselves to only sharing ‘small’ information…

Just thinking, but we could aggregate by disease class, or filter some diagnoses.

George

Does anyone in the community have specific SNOMED concepts that they know
they would be unable to share? If so, please post them here and we can
figure out how to filter them out as part of the ‘medium’ export process.
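One way such a filter could be implemented in the ‘medium’ export, sketched in R. The blocklist contents and the exact stratum columns to check are assumptions for illustration (stratum_1 through stratum_5 exist in the achilles_results convention, but which ones hold concept IDs varies by analysis):

```r
# Drop rows whose stratum references a blocklisted concept before sharing.
# Placeholder blocklist; sites would substitute their sensitive concept IDs.
blockedConceptIds <- c(0)

results <- read.csv("achilles_share.csv", stringsAsFactors = FALSE)
isBlocked <- results$STRATUM_1 %in% blockedConceptIds |
             results$STRATUM_2 %in% blockedConceptIds
shareable <- results[!isBlocked, ]

write.csv(shareable, "achilles_share_filtered.csv",
          na = "", row.names = FALSE)
```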

Sorry for jumping in late to this discussion, which is very relevant to our proposed use of Achilles in the European Medical Information Framework (EMIF) project I am heavily involved in.

We are in the process of mapping 6 databases to the OMOP CDM. The idea here is to use Achilles as an enhancement of a database catalogue that contains metadata about the database (questionnaires). However, I agree with @hripcsa that this is seen as very sensitive, for reasons you might not think of at first. In general, the feeling in EMIF is that the person dashboard is OK to make public, but drug use, outcome prevalences, etc. are sensitive for multiple reasons. We will perform interviews with the data custodians to get more insight into the details here, but for example, for some this jeopardises their current business model, or it is not allowed by governance boards, or they are afraid that the ‘simple’ approach of prevalence of codes is misinterpreted as prevalence of disease, which cannot be defined by a single concept in their database but needs an algorithm.

I think that limits on the Achilles report creation side are not the optimal solution. In EMIF it may turn out that, depending on the chosen business model, some reports are visible to paying customers and others are public. In EMIF we will need mechanisms to protect parts of Achilles using a user-based security layer - clearly something that, in my view, should be built by EMIF.

Interestingly, today and tomorrow we are having a workshop in Brussels where these things are being discussed. I am pushing for EMIF to contribute to the development of Achilles (and also other tools that we will be incorporating, like Calypso or Atlas), and I think there was a lot of support for this in today’s meeting. A very important question for EMIF is whether the current Achilles is useful in the European context, and furthermore whether the available dashboards are what all the potential users of the tool need. I think there are opportunities here, especially since EMIF has a user base on all levels, i.e., data custodians, companies, IMI, etc. Personally, I would like to see EMIF not only propose improvements to Achilles but also contribute to the tool’s development.

Peter

Pat,

I agree that it would be much preferable to export the data itself rather
than the derived JSON. I would however expect that people will want to
look at AchillesWeb to get a feel for their results before pushing out to
OHDSI. So from a workflow perspective, this will be 1) run Achilles in R;
2) go to AchillesWeb and check stuff out; 3) remember to be a good network
citizen and go back to R to export your small/med/big data to OHDSI. I
just think we’ll get more activity if it is directly performable from
AchillesWeb, where the upload option can be put in a strategically salient
location (and not go away until you’ve uploaded your data). Standard HCI
stuff - recognition over recall, make it easy for users to do the right
thing, etc.

So how about we put it in AchillesWeb but have the WebAPI pull together the
CSV data as you’ve described rather than the JSON files?

Jon

@jon_duke, although I very much like that idea from a user perspective, I think it is a bit hard to implement. Currently Achilles Web is completely uncoupled from any database, and would not be able to extract the CSV files.

I do think it is essential that folks review their Achilles before sharing results. In fact, I think this should be a general principle in OHDSI, that collaborators always have the opportunity to review anything in detail before it is shared.

A workflow that could work is:

  1. Run Achilles in R
  2. In R, export what you’d like to share (small, medium, or large set) to CSV
  3. In R, transform the CSV files to JSON
  4. Visualize the JSON in an Achilles Web
  5. When OK, submit the CSV files to the coordinating center (either through e-mail or S3)
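In terms of existing Achilles functions, steps 1 through 4 might look roughly like the sketch below. This assumes the `achilles()` and `exportToJson()` functions of the Achilles R package; argument names should be verified against the installed version:

```r
library(Achilles)

# 1. Run Achilles against the CDM
achilles(connectionDetails,
         cdmDatabaseSchema = "cdm", resultsDatabaseSchema = "results")

# 2-3. Export the results as the JSON files that AchillesWeb reads
exportToJson(connectionDetails,
             cdmDatabaseSchema = "cdm", resultsDatabaseSchema = "results",
             outputPath = "achillesOut")

# 4. Point a local AchillesWeb instance at 'achillesOut' to review
#    everything before any CSV is submitted to the coordinating center
```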

I agree with Martijn’s point about Achilles Web being decoupled.

It looks like Martijn, Patrick, and I all support creating a .CSV using R code. :smile:

Patrick wrote:

Rather, I suggest we add an export to the ACHILLES R package that creates a .csv of the underlying ACHILLES_RESULTS and ACHILLES_RESULTS_DIST tables, subset to only the ‘small’, ‘medium’, or ‘large’ sets as requested by the user. In this way, the .csv can be loaded into the OHDSI central database and the JSON can be generated centrally.

Guys, any response to my post above?

@Rijnbeek:

What came out of your Brussels session? Would be interesting to hear.

Look, I am in that boat at IMS. I personally believe there is more value in publishing those simple counts, but there are people who will feel differently. The only way to get over that is to have as many people as possible publish their results, at which point the counts of some silly code in the data are no longer an asset.

To reduce the size of the shared data, a subset of the Achilles data is created by R code called MIAD (minimum information about dataset).

Exact patient counts are changed to ratios.

See https://github.com/OHDSI/StudyProtocolSandbox/blob/master/themis/extras/MIAD.md

MIAD level 1 only shares data on the percentage of females and males.
MIAD level 3 has about 100 items, none of which are sensitive to share. I proposed it for the Raloxifene study, and it is used heavily in the DataQuality study.

I encourage study authors to consider adding MIAD.R code to any future study within OHDSI.
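The count-to-ratio step could be as simple as the sketch below. The column names follow the achilles_results convention, and it assumes analysis_id 1 is the total person count (as in standard Achilles); the file names are placeholders:

```r
# Replace exact patient counts with ratios relative to the total person count
results <- read.csv("achilles_share.csv", stringsAsFactors = FALSE)

# Analysis 1 is the number-of-persons analysis in the Achilles convention
totalPersons <- results$COUNT_VALUE[results$ANALYSIS_ID == 1]

results$RATIO <- round(results$COUNT_VALUE / totalPersons, 4)
results$COUNT_VALUE <- NULL  # drop exact counts before sharing

write.csv(results, "miad_ratios.csv", na = "", row.names = FALSE)
```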
