OHDSI Home | Forums | Wiki | Github

Convention for varchar cells: null vs empty string

Hello - what is the recommended convention for representing ‘no data’ in OMOP common data model cells.

Scenario for subseting concepts that are not invalid using invalid_reason field: The various options may be

where invalid_reason is null
or
where invalid_reason = ‘’
or
where length(trim(invalid_reason)) = 0

Please share your opinion or experience.

I prefer where invalid_reason IS NULL, but that is because I have the
ability to ensure that NULL rather than an empty string is used in the
database when there is no value. You might not always have that luxury, in
which case you might have to live with what the DBA has decided to use.
However, I think you will find that most of the code in OHDSI assumes that
‘WHERE invalid_reason IS NULL’ will work.

BTW, I do not think your last test, will work for all databases. I googled
the length of NULL and it seems Oracle length(NULL) returns NULL and NULL
<> 0. Might change to ‘WHERE length(trim(COALESCE(invalid_reason, ‘’
)))=0’, but have not tested this.

2 Likes

Agreed with Don. We’ve used DB NULLs in our platform, and that has been effective across SQL Server, Microsoft PDW, and Amazon Redshift.

Thank you. The problem with nulls is join on null is unknown. How do we handle that?

https://technet.microsoft.com/en-us/library/ms190409(v=sql.105).aspx

The invalid reason column isn’t a column that you would join anything to (it’s not a reference column in any other table).

If there are any columns in the CDM model that is intended to be joined to another table but is set nullable, then we should make that change to the spec.

-Chris

Also, the link you provided, @Gowtham_Rao, shows the solution to joining to a null column:

SELECT * FROM 
table1 t1 INNER JOIN table2 t2 
ON t1.a = t2.c OR (t1.a IS NULL AND t2.c IS NULL)

Not very efficient, but this does logically work.

-Chris

Is there?

Thank you. I started the thread using the example of invalid_reason. Agree with the group that making this field NULL does not have major impact. How about the other fields in CDM. Would it be a good convention to use empty-strings instead of NULL for these other fields like concept_id, revenue_code_source etc

I don’t have a specific reproducible analytic example yet, but there was an impact of the null on the concept fields during ETL. Specifically, i was looking at the results of my ETL, and there were less than expected rows of data. This was because of null values being ignored (null is not equal to null).

We were thinking that all other fields in the CDM with no values should be empty strings and not use NULLs - to avoid confusion to the analysts who may not know the difference between null and empty string. Empty string is more intuitive and also computationally efficient.

Thoughts?

I don’t think so, but I thought it safe to ask.

I’d disagree with this assessment: null is, by definition, the absence of value. An empty strings is a value of a string, in as much as it has a length, you can compare it to other values, etc. Having an invalid reason of “” doesn’t mean it doesn’t have an invalid reason, it means it does have an invalid reason, and the reason is {silence}. Conversely, a NULL value in invalid_reason means there is no invalid_reason.

On the note of joins, it makes sense that you wouldn’t want null columns to be joined together. The idea of a join is that you are matching a value from one table to a value of another. No value, no join. It also follows that LEFT JOINs that do not have a match return a NULL value and not an empty string.

Concept_Ids are ints, so can’t store a “” value. You might have noticed that we provide a CONCEPT_ID = 0 for ‘unknown concept’ but this value demonstrates the difference between ‘absence of value vs. the presence of a value’ beautifully: In the MEASUREMENT table, we might see a row with ‘value_as_concept’ which sometimes might be 0, other times might be null. 0 means that there was a value in the source system but couldn’t be mapped to a CDM concept. a null means that this measurement has no value for a value_as_concept. Measurements that have non-null value for value_as_concept might be a ‘pass’ or ‘fail’ value for a test, while a null valued row is the type of measurement where the existence of the record means something was identified (such as an influenza virus was identified).

I hope this helps!

-Chris

That does make sense - thank you.

t