
Book of OHDSI chapter review


(Martijn Schuemie) #1

In part due to the efforts at the recent OHDSI document-a-thon in Cleveland, the first full draft of a chapter is now ready!

I invite everyone in the community to review the chapter on Population-Level Estimation. Please let us know if the chapter is clear, whether things are missing, or if you have any other comments.

Comments and edits are welcome in any form or shape. I recommend either using TheBookOfOhdsi issue tracker or editing the R Markdown directly by hitting the edit button at the top of the book website. (Alternatively, you can send an e-mail to one of the chapter leads.)

(Clark C. Evans) #2

Martijn, This is fantastic. Thank you. My colleague Kyrylo Simonov had the chance to read TheBookOfOhdsi just yesterday and found it very helpful for understanding what these tools are doing. In particular, beyond Atlas, it was great to see how the CohortMethod “R” library fits into the broader picture. - Clark

(Seng Chan You) #3

@schuemie Awesome! Now I should start to translate it into Korean!

(Andrew Williams) #4

Thanks so much @schuemie, @msuchard, @David_Madigan, and @Patrick_Ryan!
This is extremely well done and thorough and just what I needed to help guide colleagues who want to start using these tools.

(Kristin Kostka, MPH) #5

@schuemie This chapter is a fabulous start.

I’m using a few of the Atlas-generated PLE outputs right now and I have to be honest, 13.3.4 Running the study package needs more detail. The README file today is missing some key knowledge on how to actually execute a study package.

If not inside this exact section, we need a user guide that explains the code generated by Atlas (e.g. a brief explanation of each of the components, what they do, etc) and basic step-by-step of how to utilize the package. This is important because if the PLE package is being transmitted as a network study, it will undergo significant review by organizations receiving the package. We shouldn’t assume people automatically understand the outputs and should be clear there’s no funny business in here. I imagine some sites will be forced to explain to their IT group, “no we’re not getting hacked by OHDSI… this is clean and acceptable code. Here’s what it does…”

Right now, the README has placeholders inside the code where a user would need to add their own instructions of how to load the package, e.g.:

  1. In ‘R’, use the following code to install the [studyname] package:

To do: Need to provide some instructions for installing the study package itself.

Or maybe this was a note to yourself? :wink:

Maybe we can set up a call and quickly chat through the general steps here? @George_Argyriou and I are testing @mattspotnitz’s package and admittedly we’re not sure what the ‘best practice’ is for loading. We both have different opinions :laughing:

I will cover some of the nuances of executing a study in the Network Studies chapter, too… but I think the PLE chapter needs to explain the PLE side of things.

(Sarah Seager) #6

Echoing @krfeeney - I do think that if the book is to enable a brand new shiny user to run a PLE package, then perhaps there does need to be a bit more detailed explanation of how the code is generated.

I know a certain level of understanding is required but I think step-by-step guides are super useful in this instance.


Very nice. From a quick initial read, I’d suggest making explicit in which populations the effect is being estimated under the various approaches described. For example, in the new user cohort method, the estimate generated from a matching approach is standardized to the population receiving the target drug. If a weighting approach is used, usually the estimate via IPTW is standardized to the “total” combined population, although SMR weighting could be used to standardize to the target only, etc.
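
To make the distinction concrete, here is a minimal Python sketch of the two weighting schemes mentioned above. It is not CohortMethod code; the function names are hypothetical, and it assumes `ps` is each subject's estimated probability of receiving the target drug.

```python
def _check_ps(ps: float) -> None:
    # Weights are undefined at the boundaries, so reject 0 and 1.
    if not 0.0 < ps < 1.0:
        raise ValueError("propensity score must be strictly between 0 and 1")


def iptw_weight(in_target: bool, ps: float) -> float:
    """Inverse probability of treatment weight.

    Reweights both groups to resemble the combined (target + comparator)
    population.
    """
    _check_ps(ps)
    return 1.0 / ps if in_target else 1.0 / (1.0 - ps)


def smr_weight(in_target: bool, ps: float) -> float:
    """SMR-style weight.

    Target subjects keep weight 1; comparator subjects are reweighted to
    resemble the target population only.
    """
    _check_ps(ps)
    return 1.0 if in_target else ps / (1.0 - ps)
```

With these definitions a comparator subject with `ps = 0.25` gets a weight of about 1.33 under IPTW but about 0.33 under SMR weighting, which is why the two estimates answer questions about different populations.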

I have a few other more detailed, nit picky comments that I’ll attempt to place in the issue tracker.

(Christian Reich) #8


I haven’t had the chance to engage much with this project, even though it is incredibly useful. But a 14-hour flight to Hong Kong with no Wifi distractions allows me to get things done.

But this is beautiful.

Still, for it to read consistently it is essential that we standardize the way we talk about things. We may need some editorial cheat sheet. In fact, we may want to create that online and in community fashion.

  • Naming of standard things:

    • Event vs. domain vs. clinical entity
    • Introduction of things (CDM for Common Data Model) and subsequent use. If we use hyperlinks heavily we can get away with the introductions only once. Otherwise, I’d say each section introduces all abbreviations.
    • Claims, EHR. Remember, the audience is international.
    • ID or Id or id of identifier
    • Columns or fields
  • Spelling of standard terms and other editing choices

    • Capitalization. All Domains should be capitalized (Drug, Condition, the word Concept, Standard Concept, Source Concept, etc.)
    • Reference to CDM tables and fields. In the CDM documentation, we used ALL_CAPITAL for tables, and all_lower_case for fields. We may turn the latter into italic_field_names. But I wouldn’t also use capitalization, to help differentiating between the two.
    • Reference to records in a table. For example, records in the DRUG_EXPOSURE table should be called Drug Exposures.
    • Naming of vocabularies. ICD-9, ICD-9-CM, ICD9CM, Read, etc.
    • Quoting things or using full nouns (the “Results” schema) with single or double quotes.
    • Healthcare (instead of health care)
    • Use-cases or use cases
  • References. We should cross-reference heavily heavily heavily. The book should be one ocean of hyperlinks, so folks can read any chapter at any time and not depend on having read others before. Concept IDs should be referenced in Athena.

  • Numbering of things. If we intend to use this as an HTML or PDF document, rather than a printed text, we don’t need to number tables and figures. They can be referenced through hyperlinks a lot better. It also saves us from constantly fixing the numbers if things get added or removed.

  • Way to address the reader and us. So, the reader should be either “you”, “one”, “we”, or not be called at all by using passive voice. Right now, it is all over the place.

I am sending revisions of the CDM, Vocabulary and PLE sections to the authors, adding a bunch of comments for consideration and fixing typos.

(Martijn Schuemie) #9

Thanks @Christian_Reich! This is very helpful. I’ll start making these changes.

I think the intention is to have something that works as a paper book, and add features that may help online / PDF versions. For example, concept IDs should just be numbers (not visible URLs), but in HTML / PDF they should be hyperlinks to ATHENA. Similarly, we need the figure and table numbering, which in HTML / PDF are also hyperlinks. The numbering is done fully automatically, so I wouldn’t worry about that.

I’ve tried to re-introduce abbreviations in every chapter, but I’m sure we haven’t been completely consistent with that.

(Martijn Schuemie) #10

Great news everyone! Another chapter has been drafted.

Please take a look at the Data Analytics Use Cases chapter by @David_Madigan. This is a shorter chapter, as it is primarily an introduction to the Data Analytics section of the book.

(To @Christian_Reich’s point: I think it’s “use case”, not “use-case”, based on a quick Google search)

(Martijn Schuemie) #11

Next chapter draft ready for review: SQL and R by @schuemie and @Rijnbeek!

(Martijn Schuemie) #12

Another chapter ready for review!: Method Validity

(Martijn Schuemie) #13

Two more chapters now have a full first draft: Common Data Model and Patient-Level Prediction. Both have also already been reviewed, and are currently being revised based on those reviews, so it is perhaps better to wait a bit before taking a look.

Having trouble keeping track of the state of the various chapters? I’ve created this Google spreadsheet to help.

(Martijn Schuemie) #14

Two more chapters ready for review: Where to begin and Software Validity.

(Ellen Palmer) #15

Sorry for the delay everyone - I have a rough draft of the characterization chapter here: https://drive.google.com/file/d/12Z5sQw7pN-GPfi1brLlNVfg1aRPvwoge/view?usp=sharing

I am taking a summer course that was more intense than expected, so the chapter is not as polished as I would have liked. I also didn’t have time to get it into R markdown - I’ll have time in 3 weeks to do so, or I can schedule a phone/skype meeting around lunchtime EST with someone more familiar with R markdown to get it formatted anytime this week or next.

Of note, sections 11.3.2 and 11.4 still need a lot of work. I found that I couldn’t actually make my own feature when I went back to finish up 11.4 today, so I copied/pasted what was in the notes from our meetup. I also wasn’t completely sure of the level of detail needed on preset features in 11.3.2, so that section does need extra scrutiny.

Finally, I do have a couple of citations in the chapter that I will get into the book’s citation page by the end of this work week.

(Clark C. Evans) #16


Here are my random thoughts about the Common Data Model chapter. I’d love it to have more detail with regard to date and datetime treatment. I think it’s important to specify not just what the fields should contain, but to clarify, how various algorithms treat these fields.

  • The observation_period says that the end_date is the latest encounter on record, but it doesn’t expressly tell me that the end_date is inclusive (is it?). How should condition or drug events limited to the observation period be pruned: must they have an end_date <= the observation period end_date? These are essentially the same information, one specified in normative form, the other specified operationally.

  • For the visit_occurrence table, visit_end_date seems to also be inclusive, saying that it “should match”. How are these date intervals treated by cohort definitions relative to periods, conditions, etc.?

  • For condition_start_date this is defined to be the condition recording date, which seems to be quite different from HL7 FHIR’s “onset” date. The cohort logic seems to treat the start_date as an onset, not the date of record. Further, under this definition, one could easily have a start_date that is much later than the end_date. Is this intended or expected? Yet, the condition_end_date is when the condition is considered to have ended, which seems compatible with FHIR’s “abatement” date.

  • For drug_exposure the start/end date seems to be interpreted differently, exclusive rather than inclusive (or am I reading this wrong?). This seems to be touched on in a drug supply thread from 2015. Switching from inclusive treatment to exclusive interval endpoint is something that one could easily get wrong if one isn’t careful.
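
To illustrate why the inclusive/exclusive question matters, here is a minimal Python sketch. It does not reproduce what Atlas or the generated SQL actually does; the function names are hypothetical, and the inclusive containment rule is only one possible reading of the documentation.

```python
from datetime import date


def days_covered(start: date, end: date, end_inclusive: bool) -> int:
    """Number of days in [start, end] (inclusive) or [start, end) (exclusive)."""
    span = (end - start).days
    return span + 1 if end_inclusive else span


def within_period(event_start: date, event_end: date,
                  period_start: date, period_end: date) -> bool:
    """True if the event lies entirely inside the period.

    Assumes every endpoint is inclusive, so an event ending on the last day
    of the observation period still counts as inside it.
    """
    return period_start <= event_start and event_end <= period_end


jan_1, jan_31 = date(2019, 1, 1), date(2019, 1, 31)
print(days_covered(jan_1, jan_31, end_inclusive=True))   # 31
print(days_covered(jan_1, jan_31, end_inclusive=False))  # 30
```

The same pair of dates differs by one day depending on the convention, which is exactly the kind of off-by-one that is easy to introduce when tables disagree.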

Generally, since the purpose of this system is to produce cohorts and higher-level analysis, the meaning of these fields is really just how they are treated by the algorithms (as seen upon scanning the SQL). What would be most helpful is an exact description of how these columns affect the analysis performed. Or perhaps this belongs somewhere else?

There are also some missing details:

  • What is the relationship between date and datetime, and how is a transition expected to work? It says that midnight is to be used for the start_datetime when the exact time is unknown, but there is no corresponding statement for end_datetime. In some ways, end_datetime can handle inclusive/exclusive differentiation, but a recommendation should be here… given that most treatment is specified to be inclusive, perhaps it should be defaulted to 23:59.99 when the exact ending time is unknown?

  • How should a missing end_date be treated? Is it treated as unknown data? The generated SQL code usually treats a NULL value as being equivalent to the start_date, or, in some cases for conditions, and without explanation, the start_date+1 (which seems to be a confusion of inclusive/exclusive interpretation).
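
The two NULL-handling conventions described in the last bullet can be sketched as follows. This is a hypothetical Python helper written only to show the difference; it is not taken from the generated SQL, and the parameter name `null_offset_days` is my own.

```python
from datetime import date, timedelta
from typing import Optional


def effective_end_date(start: date, end: Optional[date],
                       null_offset_days: int = 0) -> date:
    """Return the end date to use when the recorded end_date may be missing.

    null_offset_days=0 treats a missing end_date as equal to the start_date;
    null_offset_days=1 mimics the start_date+1 variant.
    """
    if end is not None:
        return end
    return start + timedelta(days=null_offset_days)


start = date(2019, 3, 10)
print(effective_end_date(start, None))                      # 2019-03-10
print(effective_end_date(start, None, null_offset_days=1))  # 2019-03-11
```

Whichever convention is chosen, documenting it next to the field definition would spare readers from inferring it from the SQL.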

Thanks for listening to random thoughts.
