In part due to the efforts at the recent OHDSI document-a-thon in Cleveland, the first full draft of a chapter is now ready!
I invite everyone in the community to review the chapter on Population-Level Estimation. Please let us know if the chapter is clear, whether things are missing, or if you have any other comments.
Comments and edits are welcome in any form or shape. I recommend either using the TheBookOfOhdsi issue tracker or editing the R Markdown directly by hitting the edit button at the top of the book website. (Alternatively, you can send an e-mail to one of the chapter leads.)
Martijn, this is fantastic. Thank you. My colleague Kyrylo Simonov had the chance to read TheBookOfOhdsi just yesterday and found it very helpful for understanding what these tools are doing. In particular, beyond Atlas, it was great to see how the CohortMethod R library fits into the broader picture. - Clark
Thanks so much @schuemie, @msuchard, @David_Madigan, and @Patrick_Ryan!
This is extremely well done and thorough and just what I needed to help guide colleagues who want to start using these tools.
I'm using a few of the Atlas-generated PLE outputs right now and I have to be honest, 13.3.4 Running the study package needs more detail. The README file today is missing some key knowledge on how to actually execute a study package.
If not inside this exact section, we need a user guide that explains the code generated by Atlas (e.g. a brief explanation of each of the components, what they do, etc.) and a basic step-by-step of how to utilize the package. This is important because if the PLE package is being transmitted as a network study, it will undergo significant review by organizations receiving the package. We shouldn't assume people automatically understand the outputs, and we should be clear there's no funny business in here. I imagine some sites will be forced to explain to their IT group, "no, we're not getting hacked by OHDSI… this is clean and acceptable code. Here's what it does…"
Right now, the README has placeholders inside the code where a user would need to add their own instructions of how to load the package, e.g.:
In R, use the following code to install the [studyname] package:
To do: Need to provide some instructions for installing the study package itself.
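For what it's worth, here is a minimal sketch of what those instructions might look like, assuming the study package lives in a GitHub repository (the "ohdsi-studies/studyname" repository and the studyname package below are hypothetical placeholders):

```r
# Install the remotes package if it is not already available
install.packages("remotes")

# Install the study package from its (hypothetical) GitHub repository;
# replace "ohdsi-studies/studyname" with the actual repository
remotes::install_github("ohdsi-studies/studyname")

# Load the installed study package
library(studyname)
```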
Or maybe this was a note to yourself?
Maybe we can set up a call and quickly chat through the general steps here? @George_Argyriou and I are testing @mattspotnitz's package and admittedly we're not sure what the "best practice" is for loading. We both have different opinions.
I will cover some of the nuances of executing a study in the Network Studies chapter, too… but I think the PLE chapter needs to explain the PLE side of things.
Echoing @krfeeney - I do think that if the book is to enable a brand new shiny user to run a PLE package, then perhaps there does need to be a bit more detailed explanation of how the code is generated.
I know a certain level of understanding is required but I think step-by-step guides are super useful in this instance.
Very nice. I'd suggest, from a quick initial read, making explicit in what populations the effect is being estimated under the various approaches described. For example, in the new-user cohort method, the estimate generated from a matching approach is standardized to the population receiving the target drug. If a weighting approach is used, the estimate via IPTW is usually standardized to the "total" combined population, although SMR weighting could be used to standardize to the target population only, etc.
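For concreteness, the standard propensity-score weights behind that distinction (textbook definitions, not anything chapter-specific) are:

$$w_i^{\mathrm{IPTW}} = \frac{Z_i}{e(X_i)} + \frac{1 - Z_i}{1 - e(X_i)}, \qquad w_i^{\mathrm{SMR}} = Z_i + (1 - Z_i)\,\frac{e(X_i)}{1 - e(X_i)},$$

where $Z_i$ indicates receipt of the target drug and $e(X_i)$ is the propensity score. The first standardizes the estimate to the combined population; the second to the target population only.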
I have a few other more detailed, nit-picky comments that I'll attempt to place in the issue tracker.
I haven't had the chance to engage much with this project, even though it is incredibly useful. But a 14-hour flight to Hong Kong with no Wi-Fi distractions allows me to get things done.
But this is beautiful.
Still, for it to read consistently it is essential that we standardize the way we talk about things. We may need an editorial cheat sheet. In fact, we may want to create that online, in community fashion.
Naming of standard things:
Event vs. domain vs. clinical entity
Introduction of things (CDM for Common Data Model) and subsequent use. If we use hyperlinks heavily we can get away with the introductions only once. Otherwise, I'd say each section introduces all abbreviations.
Claims, EHR. Remember, the audience is international.
ID or Id or id or identifier
Columns or fields
Spelling of standard terms and other editing choices
Capitalization. All Domains should be capitalized (Drug, Condition, the word Concept, Standard Concept, Source Concept, etc.)
Reference to CDM tables and fields. In the CDM documentation, we used ALL_CAPITAL for tables, and all_lower_case for fields. We may turn the latter into italic_field_names. But I wouldn't also use capitalization, to help differentiate between the two.
Reference to records in a table. For example, records in the DRUG_EXPOSURE table should be called Drug Exposures.
Naming of vocabularies. ICD-9, ICD-9-CM, ICD9CM, Read, etc.
Quoting things or using full nouns (the "Results" schema) with single or double quotes.
Healthcare (instead of health care)
Use-cases or use cases
References. We should cross-reference heavily, heavily, heavily. The book should be one ocean of hyperlinks, so folks can read any chapter at any time and not depend on having read others before. Concept IDs should be referenced in Athena.
Numbering of things. If we intend to use this as an HTML or PDF document, rather than a printed text, we don't need to number tables and figures. They can be referenced through hyperlinks a lot better. It also saves us from constantly fixing the numbers if things get added or removed.
Way to address the reader and us. The reader should be either "you", "one", "we", or not be addressed at all by using passive voice. Right now, it is all over the place.
I am sending revisions of the CDM, Vocabulary and PLE sections to the authors, adding a bunch of comments for consideration and fixing typos.
Thanks @Christian_Reich! This is very helpful. I'll start making these changes.
I think the intention is to have something that works as a paper book, and add features that may help online / PDF versions. For example, concept IDs should just be numbers (not visible URLs), but in HTML / PDF they should be hyperlinks to ATHENA. Similarly, we need the figure and table numbering, which in HTML / PDF are also hyperlinks. The numbering is done fully automatically, so I wouldn't worry about that.
I've tried to re-introduce abbreviations in every chapter, but I'm sure we haven't been completely consistent with that.
Two more chapters now have a full first draft: Common Data Model and Patient-Level Prediction. Both have also already been reviewed, and are currently being revised based on those reviews, so it is perhaps better to wait a bit before taking a look.
Having trouble keeping track of the state of the various chapters? I've created this Google spreadsheet to help.
I am taking a summer course that was more intense than expected, so the chapter is not as polished as I would have liked. I also didn't have time to get it into R Markdown - I'll have time in 3 weeks to do so, or I can schedule a phone/Skype meeting around lunchtime EST with someone more familiar with R Markdown to get it formatted anytime this week or next.
Of note, sections 11.3.2 and 11.4 still need a lot of work. I found that I couldn't actually make my own feature when I went back to finish up 11.4 today, so I copied/pasted what was in the notes from our meetup. I also wasn't completely sure of the level of detail needed on preset features in 11.3.2, so that section does need extra scrutiny.
Finally, I do have a couple of citations in the chapter that I will get into the book's citation page by the end of this work week.
Here are my random thoughts about the Common Data Model chapter. I'd love it to have more detail with regard to date and datetime treatment. I think it's important to specify not just what the fields should contain, but to clarify how various algorithms treat these fields.
The observation_period table says that the end_date is the latest encounter on record, but it doesn't expressly tell me that the end_date is inclusive (is it?). How should condition or drug events limited to the observation period be pruned - must they have an end_date <= the observation period's end_date? These are essentially the same information, one specified in normative form, the other operationally.
For the visit_occurrence table, visit_end_date seems to also be inclusive, saying that it "should match". How are these date intervals treated by cohort definitions relative to observation periods and conditions, etc.?
For condition_start_date, this is defined to be the condition recording date, which seems to be quite different from HL7 FHIR's "onset" date. The cohort logic seems to treat the start_date as an onset, not the date of record. Further, with this definition one could easily have a start_date that is much later than the end_date. Is this intended or expected? Yet the condition_end_date is when the condition is considered to have ended, which seems compatible with FHIR's "abatement" date.
For drug_exposure, the start/end dates seem to be interpreted differently, exclusive rather than inclusive (or am I reading this wrong?). This seems to be touched on in a drug supply thread from 2015. Switching from inclusive treatment to exclusive interval endpoints is something that one could easily get wrong if one isn't careful.
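To make the distinction concrete, here is a minimal sketch (my own illustration, not code from the CDM or the generated packages) of how the two conventions disagree on even a one-day exposure:

```r
# A hypothetical one-day drug exposure recorded on a single date
start_date <- as.Date("2020-01-01")
end_date   <- as.Date("2020-01-01")

# Inclusive interpretation: the end date itself counts as exposed
exposed_days_inclusive <- as.integer(end_date - start_date) + 1  # 1 day

# Exclusive interpretation: the end date marks the first unexposed day
exposed_days_exclusive <- as.integer(end_date - start_date)      # 0 days
```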
Generally, since the purpose of this system is to produce cohorts and higher-level analyses, the meaning of these fields really is exactly how they are treated by the algorithms (which I could only partially infer by scanning the SQL). What would be most helpful is an exact description of how these columns affect the analysis performed. Or perhaps this belongs somewhere else?
There are also some missing details:
What is the relationship between date and datetime, and how is a transition expected to happen? It says that midnight is to be used for the start_datetime when the exact time is unknown, but there is no corresponding statement for end_datetime. In some ways, end_datetime can handle the inclusive/exclusive differentiation, but a recommendation should be here… given that most treatment is specified to be inclusive, perhaps it should be defaulted to 23:59:59 when the exact ending time is unknown?
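As an illustration of that suggestion (my own sketch of a possible convention, not anything prescribed by the CDM):

```r
# Impute missing times under inclusive interval semantics: unknown start
# times default to midnight, unknown end times to the last second of the
# day (a hypothetical convention, following the suggestion above)
impute_start_datetime <- function(start_date) {
  as.POSIXct(paste(start_date, "00:00:00"), tz = "UTC")
}
impute_end_datetime <- function(end_date) {
  as.POSIXct(paste(end_date, "23:59:59"), tz = "UTC")
}
```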
How should a missing end_date be treated? Is it treated as unknown data? The generated SQL code usually treats a NULL value as being equivalent to the start_date, or, in some cases for conditions and without explanation, the start_date + 1 (which seems to be a confusion of the inclusive/exclusive interpretation).
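For reference, the imputation I'm describing amounts to something like the following (a sketch of the behavior as I read the generated SQL, not the actual OHDSI code):

```r
# Fall back to start_date (plus an optional offset in days, e.g. 1 for the
# condition logic mentioned above) wherever end_date is missing
impute_end_date <- function(start_date, end_date, offset_days = 0) {
  missing_end <- is.na(end_date)
  end_date[missing_end] <- start_date[missing_end] + offset_days
  end_date
}
```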