We have to be very clear about what validation means here. To me, your validation is useful for assessing how complete the LAERTES evidence base is with respect to drug-HOI associations that have been established by some reasonable consensus process. What you seem to be finding is that it is fairly complete relative to the OMOP reference set (i.e., high recall) but contains a great deal of noisy associations (i.e., low precision). Since the evidence base is nothing more than a collection of evidence, I don’t think we can do much more than this right now. However, it is still very useful. With the baseline you are establishing, we can quantitatively show the effect of using different strategies to deal with issues during the ETL process.
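To make the recall/precision baseline concrete, here is a minimal sketch of how the two numbers could be computed. It assumes the evidence base and the reference set can each be reduced to sets of (drug, HOI) pairs, and that the reference set provides both positive and negative controls. The function name, the variable names, and the toy pairs are all hypothetical placeholders, not actual LAERTES or OMOP content.

```python
def baseline_metrics(evidence_pairs, positive_controls, negative_controls):
    """Compare an evidence base against a reference set of drug-HOI pairs.

    All three arguments are sets of (drug, hoi) tuples. Pairs in the
    evidence base that appear in neither control set are ignored, since
    the reference set says nothing about them.
    """
    true_pos = len(evidence_pairs & positive_controls)   # known associations with evidence
    false_neg = len(positive_controls - evidence_pairs)  # known associations with no evidence
    false_pos = len(evidence_pairs & negative_controls)  # evidence for a negative control

    # Guard against empty denominators; 0.0 is an arbitrary convention here.
    recall = true_pos / (true_pos + false_neg) if (true_pos + false_neg) else 0.0
    precision = true_pos / (true_pos + false_pos) if (true_pos + false_pos) else 0.0
    return precision, recall


# Toy data only (hypothetical identifiers, not real drugs or outcomes).
positives = {("drug_A", "hoi_1"), ("drug_B", "hoi_1"), ("drug_C", "hoi_2")}
negatives = {("drug_D", "hoi_1"), ("drug_E", "hoi_2"), ("drug_F", "hoi_2")}
evidence = {
    ("drug_A", "hoi_1"), ("drug_B", "hoi_1"), ("drug_C", "hoi_2"),  # positives covered
    ("drug_D", "hoi_1"), ("drug_E", "hoi_2"),                       # noise on negatives
    ("drug_G", "hoi_3"),                                            # outside the reference set
}

precision, recall = baseline_metrics(evidence, positives, negatives)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Re-running the same comparison after each change to the ETL process would give the before/after numbers needed to show the effect of a given strategy.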
On the other hand, Erica has worked on machine learning to predict drug-HOI associations using the evidence sources and statistics as features. This is like generating a knowledge base and therefore requires a different approach to validation, one that I think is closer to what you originally hoped to do.