
Proposal for new Smoking conventions - please weigh in


We have been discussing this for a while. Eventually, we started a version of a smoking hierarchy. After reviewing all the existing concepts containing “smoke”, “nicotine”, “cigarette” etc., it turned out that the variety of attributes and their combinations is not that large, really. For example, heavy smoker does not get applied to, say, pipe or hookah. See the result here.

We are proposing a new hierarchy making the following assumptions:

  1. We treat tobacco, smoking and nicotine dependence as synonyms. Even though there are electronic cigarettes, which really are not smoke, in reality these are used that way.
  2. All non-nicotine smoke (or electronic version of that) is out, such as smoking illicit or legal drugs and marijuana.
  3. Family history is out. For the patient’s history see below.
  4. Smoke from fires etc. is out.
  5. All sorts of toxic effects, sequelae of nicotine abuse and allergies are out. They are separate conditions, and we don’t know how the nicotine came into the body, and whether that was an acute event, rather than smoking.
  6. Lab tests for nicotine are out.
  7. Nicotine replacement therapies are due to nicotine abuse, but still not the same thing, and are therefore out.
  8. Water pipes, shishas and hookahs are treated as identical.
  9. Second hand smoking and passive smoking are treated the same.
  10. Only cigarette frequency is measured in the existing Concepts; cigars, pipes and hookahs almost never are. The cigarette frequency per day definition is:
    • trivial=0-1 cigarettes per day
    • light=1-9
    • moderate=10-19
    • heavy=20-39
    • very heavy or aggressive=>40
      We don’t need the exact number of cigarettes (it is probably false precision anyway). Source concepts that ask for the exact number per day or week need to be manually mapped to these five categories. If a source concept mentions one of the five frequencies without the type, we automatically assume cigarettes.
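To make the manual mapping concrete, here is a minimal sketch of how a reported number could be bucketed into the five categories. The function names are hypothetical, and two boundary choices are assumptions rather than part of the proposal: exactly 1 cigarette per day is counted as trivial (the listed ranges 0-1 and 1-9 overlap), and “=>40” is read as 40 or more.

```python
def cigarette_frequency_category(cigs_per_day: float) -> str:
    """Bucket a cigarettes-per-day value into the five proposed categories.

    ASSUMPTIONS (not stated in the proposal): exactly 1 per day is
    "trivial", and the top category starts at 40 inclusive.
    """
    if cigs_per_day < 0:
        raise ValueError("cigarettes per day cannot be negative")
    if cigs_per_day <= 1:
        return "trivial"      # 0-1 per day
    if cigs_per_day < 10:
        return "light"        # above 1, up to 9
    if cigs_per_day < 20:
        return "moderate"     # 10-19
    if cigs_per_day < 40:
        return "heavy"        # 20-39
    return "very heavy"       # 40 or more (assumed reading of "=>40")


def per_week_to_per_day(cigs_per_week: float) -> float:
    """Convert a per-week count to a per-day average before bucketing."""
    return cigs_per_week / 7


if __name__ == "__main__":
    for n in (0.5, 5, 15, 25, 60):
        print(n, cigarette_frequency_category(n))
```

A source value given per week would first go through per_week_to_per_day and then through the same bucketing.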

This means we have the following dimensions:

  • Type: hookah pipe, cigarette, moist tobacco, chewed tobacco, smokeless, pipe, passive, electronic cigarette, cigar and snuff.
  • Amount as above: trivial, light, moderate, heavy, very heavy
  • Timing: ex (=in remission), in utero, perinatal

The resulting hierarchy is much simpler than expected:


One thing we need to discuss is whether we should create additional concepts where we combine all of these with the various timings. The big one is “ex”, meaning the smoking behavior is in remission but happened in the past. Alternatively, we relegate those to “History of”. The “in utero” and “perinatal” are usually only combined with “Smoking” and “Cigarette”. Again, unless the Observation Period covers this period of time it might be “History of”.

Here is what we are not modeling, mostly because we didn’t actually find that many concepts:

  • Time since cessation
  • Episodic smoking
  • Age of start of smoking
  • Duration of unsuccessful cessation
  • Overall time of smoking (irrespective of strength)
  • Negative concepts

The last one might cause some stress for folks (as flavors of null usually do).

Please think and discuss, and let us know. In particular, let us know if the omissions make sense.

@aostropolets + @Christian_Reich


Very nice @aostropolets & @Christian_Reich!

From Colorado’s point of view, duration of use and/or start and stop dates are missing. Very heavy tobacco consumption for 1 year versus 50 years has very different physiological effects on the body.

But do you have codes you are using for that, @MPhilofsky ? We need to hang this on to something.

Yes, we use the standard concept_id = 3004518 for pack per day. And we have a custom concept_id for tobacco used in years. Both could be children of cigarette.

Don’t we want these types? It says they’re very distinctive. Is there any chance it would be recorded?

What about occasional/social smoking? You map it to “Smoker”, but I wouldn’t. I know, they say it’s harmful. Is such an assumption in the terminology the reason for that?
At least, a distinct category is needed to set the borderline between “1 cigarette per week/month/year” and “0-1 cigarette per day”.

It’s pretty much the same with “Passive” being a “Smoker”, where “Passive” means not just a fact, but also a risk of exposure (according to the mapping). Do we want all this in the “Smoker” cohort?

Another thing is the survey data/classification terms you map to the “Smoker” concepts. Would it be “Maps to”? You can’t just map the questions. How would ETL treat it? So unless we have a MAPPING table set, it’s also manual work.

40770347 Have you ever smoked regularly [PhenX]
45508052 [V]Tobacco use
4041306 Tobacco use and exposure
40766305 Have you ever smoked part or all of a cigarette [PhenX]
40766943 Do or did you inhale the cigar smoke [PhenX]

Some negative facts simply confirm that patient doesn’t smoke or mean nothing. So no need for any mapping. Just keep them alive, ok?

4196422 Not a passive smoker
45522772 Smoking review not indicated
40664614 Smoking status and exposure to second hand smoke in the home not assessed, reason not given
45508195 Parents do not smoke
45441534 Never smoked tobacco

Also, you map the contextual facts. Is there a real need? Is there any chance that “non-attractive appearance” would be registered, while the entire “smoking” not? I’m not sure about mapping to “Smoker”:

37021066 My smoking makes me less attractive to other people [PROMIS]
37021082 If I quit smoking I will be more attractive to others [PROMIS]
37020314 I crave cigarettes at certain times of day [PROMIS]
40766360 How soon after you wake up do, or did, you smoke your first cigarette [FTND]
36713256 Tobacco cessation education not done
37019831 The idea of not having any cigarettes causes me stress [PROMIS]
36208999 Evaluation and management of smoking cessation note | {Setting}
4263877 Smokers cough
40766306 Have you smoked at least 100 cigarettes in your entire life [PhenX]

If we use “History of…” here, you can’t even know about the remission. While we map all these “did or do you” / “did you” / “have you ever”, is it all going to end up together in the “History of…” group?

Don’t you want to add one more dimension here? @Christian_Reich

Can you not translate that to the categories trivial to very heavy?

Don’t think so. Never heard of the distinction, and doubt we have that level of detail in the data. Let alone the lack of use cases.

Why? We need some simple categorical level of smoke exposure. Anything below 1 cigarette per day is negligible compared to the other categories.

Well, if the data tell us there is exposure to smoke it’s smoking. But you are right, these are not very precise distinctions. It should be just good enough.

That’s a question for the survey discussion. I agree, you cannot map a question, only the question-answer combo.

Right now, negative facts are conspicuous by their absence in the OMOP CDM, with the exception of the Measurements and Observations.

We don’t need that. If somebody says we do - give me the use case.

Well, that’s the question.

Yes, I thought so too. But then I didn’t find much in the way of codes like that. Only how heavy the smoking was in cigarettes per day. So, I dropped it.


The idea of the new smoking convention is really great, but we have something to add.

As @MPhilofsky pointed out

To cover that, a way to measure the amount a person has smoked over a long period of time was introduced: pack-years. 1 pack-year is equal to smoking 1 pack per day for 1 year, or 2 packs per day for half a year, and so on, where a pack contains 20 cigarettes.
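As a small worked sketch of that definition (the function and constant names are only illustrative):

```python
CIGARETTES_PER_PACK = 20  # pack size used in the definition above


def pack_years(cigs_per_day: float, years: float) -> float:
    """Pack-years = (cigarettes per day / 20) * years smoked."""
    if cigs_per_day < 0 or years < 0:
        raise ValueError("inputs cannot be negative")
    return cigs_per_day / CIGARETTES_PER_PACK * years


if __name__ == "__main__":
    print(pack_years(20, 1))    # 1 pack per day for 1 year
    print(pack_years(40, 0.5))  # 2 packs per day for half a year
    print(pack_years(5, 30))    # 5 cigarettes per day for 30 years
```

The first two calls both give 1 pack-year, matching the definition; 5 cigarettes a day for 30 years comes out at 7.5 pack-years.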

Cumulative smoking exposure and intensity of smoking (daily exposure) are both frequently measured [e.g. 1, 2]. In studies based on cumulative smoking exposure, smokers are typically stratified into light, moderate and heavy based on pack-years, using the exact same terms we use to stratify smokers by their daily dose of tobacco. So we can run into a situation where a person meets the criteria of a light smoker based on the daily dose (e.g. 5 cigarettes daily) and of a moderate/heavy smoker based on total tobacco consumption (30 years of 5 cigarettes daily), and vice versa (a very heavy tobacco smoker for only a few weeks). And the effects on the body would differ significantly between them.

My proposal is to add a second definition of light, moderate, and heavy tobacco smokers based on the pack-years they smoked. I think these newly added concepts should live under the current classification: a light smoker defined as 0.1-20 pack-years as a child of light smoker, etc.

And integrate into this hierarchy a standard concept to keep the time since smoking cessation in the CDM.
We have a good SNOMED concept for that (however, it is an Observation): Time since stopped smoking

There are at least 2 reasons for that: 1) smoking affects health significantly, but its effects on the organism diminish with time, and 2) smoking questionnaires usually have that question in one form or another, so we can easily store this information without any ‘History of’ concepts.

Also, this project can be a good start for a wide mapping table.

Even with my proposals, I still have some concerns about calculating the observation period. Say we have a patient who started smoking at the age of 16, is now 81, and stopped smoking last week. What should we do?

What do you think?


As I said: pack-years is an established measure in smoking research, but I just couldn’t find any codes in any vocabulary. If it is not coded, we know it won’t be in the data. Unless you found it hiding somewhere.

Smoking has a bunch of problems for us in that it is somewhat incompatible with the typical longitudinal healthcare data. Firstly, smoking is a lifetime thing, while our data typically last months to years. So, all this “history of” becomes heavily overloaded. Secondly, there are several different dimensions of measuring it: cigs per day, pack-years, time since quitting, and that is just for cigarettes. At some point the number of concepts becomes overwhelming. We need a simple system we can rely on. And thirdly, the data tend to be very crude anyway. We all know how people talk about their smoking habits. Not exactly witness testimonials under oath.

So, not sure what kind of precision we really should aspire to. My hunch would be to first consistently implement the current system, and then see.


We found these codes hiding in UKB:
35810304 Light smokers, at least 100 smokes in lifetime
35811050 Pack years of smoking
35811051 Pack years adult smoking as proportion of life span exposed to smoking
35810327 Age stopped smoking
37021508 Number of cigarettes smoked in Lifetime
3003421 Cigarettes smoked total (pack per year) - Reported
37020654 Age when stopped smoking cigarettes completely
in READ:
45478368 Cigarette pack-years

And also in some custom vocabularies we encountered during various ETLs.

I completely agree that “history of” concepts become heavily overloaded and that we need a simple solution.

But aren’t we oversimplifying?
Any thoughts on a wide mapping implementation for smoking?

Sounds good. We could make it an alternative set of children to “Cigarette smoking”. Currently, we have trivial, light, moderate, heavy and very heavy. Would these packyears-derived ones fall into these categories? Would make it simple, but we would have two different definitions. Probably not so clean. How about 5 categories of smoking (=now) and 3 (or also 5) categories of cumulative smoking? Or we just put in the packyears.

“Time since” in observational data is ugly. Usually, data are time stamped, and “time since” is calculated. Alternatively we have “history of”. @Alexdavv proposed something to address the timing. Should we consider that?
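To illustrate the “calculated” option, here is a minimal sketch that derives time since cessation from a time-stamped record instead of storing it. The function name and the dates are made up for illustration; nothing here refers to a specific CDM field.

```python
from datetime import date


def days_since_cessation(stop_date: date, as_of: date) -> int:
    """Days between a recorded stop date and a reference (index) date."""
    if as_of < stop_date:
        raise ValueError("reference date is before the stop date")
    return (as_of - stop_date).days


if __name__ == "__main__":
    # Hypothetical patient who stopped smoking a week before the index date.
    print(days_since_cessation(date(2020, 1, 1), date(2020, 1, 8)))
```

The value depends on the reference date chosen at analysis time, which is why a stored “time since” goes stale while a time stamp does not.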

Dear all, let’s use this thread as the primary one for smoking data discussions.

There’s an update on this topic, and feedback is appreciated. Please have a look here