OHDSI Home | Forums | Wiki | Github


On today’s call I discussed that an updated CDM SynPUF is now available.

What is SynPUF?

How was SynPUF made by CMS?

  • “The CMS Linkable 2008–2010 Medicare DE-SynPUF originated from a disjoint (mutually exclusive from existing samples) 5% random sample of beneficiaries from the 100% Beneficiary Summary File for 2008. To exclude any overlap with the beneficiaries in the existing 5% CMS research sample, 3 the beneficiaries in that other sample were excluded, and a 5-in-95 random draw was made with the remaining 95% of beneficiaries. A variety of statistical disclosure limitation techniques were used to protect the confidentiality of Medicare beneficiaries in the CMS Linkable 2008–2010 Medicare DE-SynPUF. The DE-SynPUF was created by starting with an actual beneficiary as a “seed” for a synthetic beneficiary. Synthetic beneficiaries and their claims are based on actual seed beneficiaries. Disclosure is reduced through multiple deterministically or stochastically applied treatment mechanisms. First, hot decking based procedures are used to find donors for beneficiary-level variables and individual claims. Second, other synthetic processes are used to protect other elements of the data. Disclosure limitation methods used in the process include variable reduction, suppression, substitution, synthesis, date perturbation, and coarsening. Please refer to the CMS Linkable 2008–2010 Medicare Data Entrepreneurs’ Synthetic Public Use File (DE-SynPUF) User Manual for details regarding how DE-SynPUF was created” (link)

How do I get access to the SynPUF CDM?


Hi Erica,

My name Song, a researcher working for EvidNet in Korea.

I’ve downloaded the SynPUF CDM from the link mentioned above but noticed that all concept names for CPT4 vocabulary have a “” value in the concept.csv file. The following query returned 0 rows:

select *
from cdm_synpuf.concept
where vocabulary_id = 'CPT4'
  and concept_name <> '""';

Is this intentional? Is there any way I can replace in these “” values with proper concept names?

Thanks in advance.


Hello Song,

The vocabularies are downloaded from Athena. SynPUF only contains clinical data and does not come with its own vocabulary files.

To include CPT4 in the vocabulary, please follow the instructions that come with the vocabulary download. Depending on your operating system, it requires executing the cpt.bat or cpt.sh file.

Hi Maxim,

Thank you for feedback.

However, the SynPUF CDM I downloaded contains all the .csv files related to vocabularies, i.e. concept, concept_ancestor, concept_class, concept_relationship, etc. This is also mentioned in the README.md.


Are you saying that I have to download the vocabularies from ATHENA regardless of the already provided vocabulary-related .csv files? If so, I’m quite confused because I don’t see the point in providing the vocabulary tables in the first place.

Best regards,


Well, there’s a trick with CPT4. We can’t distribute it directly due to licence issues.
Therefore, we provide you with a utility that downloads the descriptions separately and merges them together.
And this utility is distributed via Athena only.
So, you need to download something from Athena anyway, to get this unitilty.

it’s “cpt4.jar” file, actually.

Ah, sorry for the added confusion. If you need CPT4, then download through Athena indeed.

And last time I checked, it comes with the bash/batch file to execute the cpt4.jar for you. :wink:

Yeah that is why you are seeing that.

On a side note, in recent months we have been preferring Synthea to SynPUF. Another synthetic data set to explore. :slight_smile:

@Dymshyts, @MaximMoinat, @ericaVoss

I checked ATHENA and saw that there is an EULA license required for CPT4. I totally get it now. Thank you all for clarifying this out!

We’ll definitely check out Synthea as well!

Best regards,


Recently tried and updated the ETL-CMS python_etl and I am getting the following error:

– all files present…

build_maps starting…
No existing location_dict_file found (looked for ->/Users/sudoshi/Downloads/Data/CMS_Control/location_dictionary.txt)
No existing provider_id_care_site_file found (looked for ->/Users/sudoshi/Downloads/Data/CMS_Control/provider_id_care_site.txt)
No existing npi_provider_id_file found (looked for ->/Users/sudoshi/Downloads/Data/CMS_Control/npi_provider_id.txt)
Reading omop_concept_relationship_file → /Users/sudoshi/vocabulary_download_v5_latest/CONCEPT_RELATIONSHIP.csv
Writing to log file → /Users/sudoshi/Downloads/Data/concept_relationship_debug_log.txt
Traceback (most recent call last):
File “/Users/sudoshi/Documents/GitHub/ETL-CMS/python_etl/CMS_SynPuf_ETL_CDM_v5.py”, line 2036, in
File “/Users/sudoshi/Documents/GitHub/ETL-CMS/python_etl/CMS_SynPuf_ETL_CDM_v5.py”, line 390, in build_maps
with open(omop_concept_relationship_file,‘r’) as fin,
FileNotFoundError: [Errno 2] No such file or directory: ‘/Users/sudoshi/vocabulary_download_v5_latest/CONCEPT_RELATIONSHIP.csv’

Although everything IS in place. Any assistance would be appreciated from anyone who has recently used this tool?

I think the issue is that when I used futurize to convert the script from python2 to python3, there are some remaining discrepancies. The solution might be to use a virtual environment, install python2 and dependencies and run the original scripts. I will try that and report back.

Using python2 on MacOS with the requirements installed is working so far. I should have the full 2.33M by tomorrow.