Who is working with UK Biobank?

Patrick_Ryan · May 8, 2019, 3:56pm

I have heard in various conversations that researchers have been working with UK Biobank and did/planned to/want to convert their data instance to the OMOP CDM. Since this is a valuable data resource that can be used by many organizations, it seems a natural opportunity for community collaboration around a common ETL for this data. Does anyone have anything they could share to get this ball rolling?

ellayoung · May 8, 2019, 4:04pm

This is very exciting to hear! We are mapping from Cerner Millennium to OMOP, so our ability to help depends on the format and nature of the source data coming from the biobank. Do they use standard vocabularies? Keep us posted!

gregk · May 8, 2019, 5:15pm

@Patrick_Ryan we do not currently work with UK Biobank data, but we are definitely aware of this resource, would be interested in joining this effort, and can contribute resources for both ETL and vocab mappings.

kyriakosschwarz · June 24, 2019, 10:46am

It would also be interesting for us.

nick · September 24, 2019, 12:10am

Hi @all – Just picking this back up to see if anyone has worked with getting the UKBB into the CDM? HMU!

mcantor2 · October 10, 2019, 7:59pm

We are also interested in getting UKB into the CDM; we are doing a lot of work with it at Regeneron.

Maybe a side meeting at AMIA (or the Columbia reception) to figure out how to get this moving?

spiros · January 8, 2020, 12:34pm

Hey @all, we are actively working on converting the UK Biobank to OMOP. We’ve done quite a lot of work on the EHR component (hospital and primary care), as it’s similar to another dataset we’ve converted to OMOP (called CALIBER), but there’s still loads to do. Happy to coordinate around this and maybe have a meeting/discussion over the next month or so?

SCYou · January 8, 2020, 1:42pm

@spiros I would love to join.

mcantor2 · January 8, 2020, 2:49pm

Hi Spiros and all-
We are working within the UKB Pharma consortium to map the UKB to OMOP. We are in the process of getting the group of interested companies together and will most likely start the work (with a vendor) within the next month or two. We have been talking w/the UKB leadership and they are supportive of the project. Would be great to hear how far along you all are since we may be able to limit the scope of our project and get it done faster.

gregk · January 8, 2020, 3:08pm

Hi Spiros / Michael,

Thanks for bringing this up. We are also converting UK Biobank to OMOP, I would say in early stages at this point. Would be great to connect.

SCYou · January 8, 2020, 3:38pm

So great!!
@gregk

Is there anything we can contribute to? This is one of my reasons for working on the genomic CDM.

spiros · January 13, 2020, 1:53pm

Dear all, this is great, thanks for the enthusiastic responses.

Can I suggest, as a first step, we get together on a call to discuss who is doing what and try to coordinate? We are doing this work as part of a larger IMI project called BigData@Heart and working with an SME in the Netherlands (The Hyve) plus some in-house developers at University College London.

Could interested parties please send me an email (s.denaxas at ucl.ac.uk) and I will organize a call to discuss next steps.

linikujp · January 13, 2020, 2:38pm

I am interested in joining. Already sent an email to @spiros
Thanks.

Vojtech_Huser · January 13, 2020, 5:04pm

We are also working with it (we have a pending project application; see below).

The list of their Data Elements is public here: Browse by Category

For those who are further along, I would be curious to know how the files are organized. In what language are you developing (or planning to develop) your ETL? (We may use just R, mostly tidyverse, and skip SQL if possible.)

nlw · January 13, 2020, 6:52pm

I would be interested in joining the working group as well. I have been working with UKB data for the last year and a half and have access to thousands of data elements, but haven’t yet transformed them to OMOP. Would like to coordinate and/or use this as a great test case for developing a new ETL. (We’re a Python house, so that’s probably the language I’d prefer.) @spiros

spiros · January 16, 2020, 6:15pm

Files are in CSV. In theory they are usable in both R and Python; in practice they are a bit of a pain because the baseline data are in “wide format”, i.e. the base table is roughly 500,000 x 9,000. That is challenging to load in pandas, even if you specify dtypes manually. We had more luck using Dask, but SQL is still much, much faster and more intuitive.
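To make the wide-format problem concrete, here is a minimal Python sketch (not the code anyone in this thread is actually using) that streams a wide CSV in chunks, keeps only a few fields, and reshapes them to long format. It assumes the participant ID column is named `eid` and that the other columns are named `<field>-<instance>.<array>` (for example `31-0.0`); the function name `wide_to_long`, the chunk size, and the sample values are all hypothetical.

```python
import io

import pandas as pd


def wide_to_long(source, field_ids, chunksize=10_000):
    """Stream a wide CSV and return the selected fields in long format.

    Assumes an `eid` column plus columns named `<field>-<instance>.<array>`.
    """
    wanted = {str(f) for f in field_ids}

    def keep(col):
        # Keep the ID column and every column whose field prefix was requested.
        return col == "eid" or col.split("-")[0] in wanted

    parts = []
    # dtype=str avoids type inference across thousands of mixed columns;
    # chunksize keeps only a slice of the rows in memory at a time.
    for chunk in pd.read_csv(source, usecols=keep, dtype=str, chunksize=chunksize):
        long_chunk = chunk.melt(id_vars="eid", var_name="column", value_name="value")
        parts.append(long_chunk.dropna(subset=["value"]))

    out = pd.concat(parts, ignore_index=True)
    # Split "31-0.0" into field, instance and array index.
    pieces = out["column"].str.extract(
        r"^(?P<field_id>\d+)-(?P<instance>\d+)\.(?P<array_idx>\d+)$"
    )
    return pd.concat([out[["eid"]], pieces, out[["value"]]], axis=1)


if __name__ == "__main__":
    # Tiny made-up sample; real extracts have thousands of columns.
    sample = "eid,31-0.0,21-0.0,21-1.0\n1000001,0,1,\n1000002,1,,2\n"
    print(wide_to_long(io.StringIO(sample), [31, 21], chunksize=1))
```

Selecting columns while reading and dropping empty cells per chunk is what keeps memory bounded here. The long output (one row per participant, field, instance and array index) is also closer to the shape a SQL-based ETL would work from.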

spiros · January 16, 2020, 6:16pm

Hey Nicole, thanks, please drop me an email so I can add your address to the mailing list.

spiros · February 3, 2020, 9:38am

Dear all, I’ve created a Doodle to help us find a suitable date/time to have an initial discussion. Could you please have a look and mark your availability accordingly?

https://doodle.com/poll/q8zwh45m73z37xzi

thanks
Spiros

p.s. apologies to EU friends, I’ve set timeslots late in the PM to enable US colleagues to join.

Vojtech_Huser · March 13, 2020, 5:35pm

Some notes from the first UKBB WG meeting:

Spiros, please announce when our next meeting is.

Vojtech_Huser · April 2, 2020, 10:03pm

We continue to explore the CDEs in UKBB.

@MaximMoinat - we would like to join forces with your team to work on them.

At the link, we created an overview using the R package referenced in the past.

When is the next meeting?
You had a Google Drive folder with nice outputs as well. Can you please get in touch with me so that we don’t duplicate effort?

github.com/lhncbc/CDE/blob/master/ukbiobank/ukbb_dd-de-expanded.csv

DE_id,DE_description,Data_type,Data_type_description,Group_id,Group_id_description,PV,PV_description,PV_count
3,Verbal interview duration,Integer,"whole numbers, for example the age of a participant on a particular date",152,Process durations,NA,NA,NA
4,Biometrics duration,Integer,"whole numbers, for example the age of a participant on a particular date",152,Process durations,NA,NA,NA
5,Sample collection duration,Integer,"whole numbers, for example the age of a participant on a particular date",152,Process durations,NA,NA,NA
6,Conclusion duration,Integer,"whole numbers, for example the age of a participant on a particular date",152,Process durations,NA,NA,NA
19,Heel ultrasound method,Categorical (single),"a single answer selected from a coded list or tree of mutually exclusive options, for example a yes/no choice",100018,Bone-densitometry of heel,1|2|3|6|7,Direct entry|Manual entry|Not performed|Not performed - equipment failure|Not performed - other reason,5
21,Weight method,Categorical (single),"a single answer selected from a coded list or tree of mutually exclusive options, for example a yes/no choice",100010,Body size measures,-1|1|2|3|4,Question not asked due to previous answers|Direct entry|Manual entry of full results|Manual measurement of weight only|Not performed,5
23,Spirometry method,Categorical (single),"a single answer selected from a coded list or tree of mutually exclusive options, for example a yes/no choice",100020,Spirometry,0|1|6|7|9,Direct entry|Manual|Not performed - equipment failure|Not performed - other reason|Cannot be measured,5
31,Sex,Categorical (single),"a single answer selected from a coded list or tree of mutually exclusive options, for example a yes/no choice",100094,Baseline characteristics,0|1,Female|Male,2
33,Date of birth,Date,"a calendar date, for example 14th October 2010",100094,Baseline characteristics,NA,NA,NA
34,Year of birth,Integer,"whole numbers, for example the age of a participant on a particular date",100094,Baseline characteristics,NA,NA,NA
35,Was blood sampling attempted,Categorical (single),"a single answer selected from a coded list or tree of mutually exclusive options, for example a yes/no choice",100002,Blood sample collection,0|1,No|Yes,2
36,Blood pressure device ID,Text,"data composed of alphanumeric characters, for example the first line of an address",100011,Blood pressure,NA,NA,NA
37,Blood pressure manual sphygmomanometer device ID,Text,"data composed of alphanumeric characters, for example the first line of an address",100011,Blood pressure,NA,NA,NA
38,Hand grip dynamometer device ID,Text,"data composed of alphanumeric characters, for example the first line of an address",100019,Hand grip strength,NA,NA,NA
39,Height measure device ID,Text,"data composed of alphanumeric characters, for example the first line of an address",100010,Body size measures,NA,NA,NA
40,Manual scales device ID,Text,"data composed of alphanumeric characters, for example the first line of an address",100010,Body size measures,NA,NA,NA
41,Seating box device ID,Text,"data composed of alphanumeric characters, for example the first line of an address",100010,Body size measures,NA,NA,NA
42,Spirometer device ID,Text,"data composed of alphanumeric characters, for example the first line of an address",100020,Spirometry,NA,NA,NA
43,Impedance device ID,Text,"data composed of alphanumeric characters, for example the first line of an address",100009,Impedance measures,NA,NA,NA

This file has been truncated.
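For anyone who wants to use the data dictionary excerpt above in an ETL, here is a small Python sketch of one way to turn the pipe-delimited `PV` and `PV_description` columns into per-element code-to-label lookups. The column names come from the header row of the excerpt; the function name `load_value_maps` is hypothetical, and the sketch assumes that no label itself contains a `|` character.

```python
import csv
import io


def load_value_maps(source):
    """Return {DE_id: {code: label}} for data elements that list permissible values."""
    maps = {}
    for row in csv.DictReader(source):
        # Non-categorical elements carry "NA" (or nothing) in the PV column.
        if row["PV"] in ("", "NA"):
            continue
        codes = row["PV"].split("|")
        labels = row["PV_description"].split("|")
        if len(codes) != len(labels):
            raise ValueError(f"DE_id {row['DE_id']}: code/label counts differ")
        maps[int(row["DE_id"])] = dict(zip(codes, labels))
    return maps


if __name__ == "__main__":
    # Three rows copied from the excerpt above.
    excerpt = (
        "DE_id,DE_description,Data_type,Data_type_description,Group_id,"
        "Group_id_description,PV,PV_description,PV_count\n"
        '21,Weight method,Categorical (single),"a single answer selected from a coded list '
        'or tree of mutually exclusive options, for example a yes/no choice",100010,'
        "Body size measures,-1|1|2|3|4,Question not asked due to previous answers|"
        "Direct entry|Manual entry of full results|Manual measurement of weight only|"
        "Not performed,5\n"
        '31,Sex,Categorical (single),"a single answer selected from a coded list or tree of '
        'mutually exclusive options, for example a yes/no choice",100094,'
        "Baseline characteristics,0|1,Female|Male,2\n"
        '33,Date of birth,Date,"a calendar date, for example 14th October 2010",100094,'
        "Baseline characteristics,NA,NA,NA\n"
    )
    print(load_value_maps(io.StringIO(excerpt)))
```

Codes are kept as strings on purpose, since they include values such as `-1` and would be compared against raw CSV cells rather than parsed numbers.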