
CDM v6 - BIGINT vs INT in SQL Server?

Dear Colleagues!

We are in the middle of an ETL process targeting OMOP CDM 6.0; our next step is creating the database. We are working on SQL Server.

There are 86 BIGINT columns in the SQL Server DDL scripts, but I cannot find a reason why they are necessary except in a few cases (e.g. https://github.com/OHDSI/CommonDataModel/issues/198).

As everyone knows, BIGINT is twice the size of INT (8 bytes vs. 4 bytes, per the docs).

Can I modify these data types in our case?
I am sure that BIGINT for person_id is unnecessary for us: we have around 1 million persons in our source system.

So my questions:
1.) Why is BIGINT necessary except for a few columns?
2.) Can I modify these columns' data types? Could it hurt me in the future if I do (e.g. compatibility with other systems)?

Thank you for your answers!


As always, testing is the only way to know the impact of changing data types, but in general the speed and size of a BIGINT as opposed to an INT does not make a big difference.

Narrowing the data type is, however, likely to eventually cause a compatibility issue with scripts and/or tools.

You can use INT in lieu of BIGINT throughout, but BIGINT was chosen because some of the community's CDM databases hold hundreds of millions of patients.
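If you go that route, the change is just a matter of editing the stock DDL script before running it. A sketch, with the person table abbreviated to a few columns (the full column list is in the OHDSI CommonDataModel DDL):

```sql
-- Sketch of a narrowed person table (CDM v6, column list abbreviated).
-- person_id is BIGINT in the stock script; INT holds values up to
-- 2,147,483,647, which is ample for ~1 million persons.
CREATE TABLE person (
    person_id            INT NOT NULL,  -- was BIGINT in the stock DDL
    gender_concept_id    INT NOT NULL,
    year_of_birth        INT NOT NULL,
    race_concept_id      INT NOT NULL,
    ethnicity_concept_id INT NOT NULL
);
```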

Perhaps it is a legacy issue. I always use BIGINT for key fields; memory is cheap nowadays anyway.

Very often I also use VARCHAR(200) instead of VARCHAR(50) for source fields, and sometimes even that is not enough.

Change it as you wish; it is not a legal requirement.

Thank you for your answers and advice!

My choice is INT, except for measurement_id: the measurement table is the only one where we could exceed 2 billion rows in the next 5 to 10 years.
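A sketch of that split, assuming the stock CDM v6 column names (abbreviated):

```sql
-- Sketch: keep BIGINT only for the one key that may pass 2^31 - 1 rows.
CREATE TABLE measurement (
    measurement_id         BIGINT NOT NULL,  -- may exceed 2 billion rows
    person_id              INT    NOT NULL,  -- narrowed from BIGINT
    measurement_concept_id INT    NOT NULL,
    measurement_date       DATE   NOT NULL
);
```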

@lychenus13, memory is not the only thing affected when you use BIGINT instead of INT; see this answer: https://dba.stackexchange.com/a/258006/194101

Even that answer says the practical effect depends on the work environment, and this is very true. Replication of a BIGINT is still much faster than a VARCHAR or an NVARCHAR, and how many of those do we have?
I have built some pretty massive datasets, and the only time I have seen a big impact on the database was when doing self joins. Recursive tables can make BIGINT joins noticeably slower than INT joins.
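As a hypothetical illustration (the table and column names here are invented, not part of the CDM), the kind of self join where key width shows up is a recursive hierarchy walk:

```sql
-- Hypothetical example: a recursive CTE over a self-referencing table.
-- Every iteration re-joins on the key column, so with millions of rows
-- an 8-byte BIGINT key costs measurably more than a 4-byte INT key.
WITH ancestry AS (
    SELECT concept_id, parent_concept_id, 0 AS depth
    FROM concept_hierarchy
    WHERE parent_concept_id IS NULL
    UNION ALL
    SELECT c.concept_id, c.parent_concept_id, a.depth + 1
    FROM concept_hierarchy c
    JOIN ancestry a ON c.parent_concept_id = a.concept_id
)
SELECT * FROM ancestry;
```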

Really, is it worth the work of keeping a separate schema for very little (or no) benefit? Perhaps first focus on tables with a recursive structure and work out from there whether you need the extra speed.