Thanks for clarifying, Christian.
I think we will need the ability to store genomic sequence and functional data. Let me try to clarify this. There are many open-ended questions for which we may not have answers yet, but I hope this gives you a direction to think about and build on:
Purpose: (a) perform complex queries and analyses, (b) browse the data, and (c) possibly provide some visualization.
Data models in this area can be broadly classified into two categories - broad and deep.
A broad database model stores essentially a single kind of data, but stores such data from many organisms. Examples of broad databases include Swiss-Prot (for protein sequences), PRINTS (for protein fingerprints) and WIT (for pathway data).
A deep database model focuses on one or a small number of species, but stores many different kinds of data, generally including both sequence and functional data. Examples of deep databases are MIPS, SGD and YPD.
The requirement I foresee is for a deep database model, one that stores many different kinds of data from a single organism (say, human).
The work you are doing now to extend the CDM should not specifically seek to subsume the functionality of existing databases, many of which do a good job at storing, organising and disseminating biological data. Instead, the emphasis should be on the close association of analyses with the stored data, so that users interact with the database principally in terms of the analysis tasks they want to carry out, and not so much in terms of the stored data.
For example, when analysing gene expression data, it may be useful to have access to the sequences upstream of the genes, or to the cellular location of their protein products.
Some analyses that I would like to see supported in the CDM are:
- Rather than analysing transcriptome data in a spreadsheet and asking which genes are up- or down-regulated, it should be possible to ask which mRNAs encoding membrane-associated proteins are up-regulated.
- Relating Gene Expression to Gene Structure.
- Relating Gene Expression to Cellular Location.
- Relating Gene Expression to Chromosome Position.
- Relating Regulatory Sequences to Protein-Protein Interactions.
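
To make the first item concrete, here is a rough sketch of the kind of query I have in mind. It is only an illustration: the tables (gene, expression, protein_location), their columns, the gene symbols and the fold-change threshold are all hypothetical and are not part of the CDM or of any existing database.

```python
import sqlite3

# Hypothetical schema for illustration only; none of these tables exist in the CDM.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE gene (gene_id INTEGER PRIMARY KEY, symbol TEXT, chromosome TEXT, start_pos INTEGER);
CREATE TABLE expression (gene_id INTEGER, log2_fold_change REAL);
CREATE TABLE protein_location (gene_id INTEGER, location TEXT);
""")

# Made-up example rows.
conn.executemany("INSERT INTO gene VALUES (?, ?, ?, ?)", [
    (1, "GENE_A", "1", 1000),
    (2, "GENE_B", "2", 5000),
    (3, "GENE_C", "X", 9000),
])
conn.executemany("INSERT INTO expression VALUES (?, ?)",
                 [(1, 2.3), (2, 1.8), (3, -1.5)])
conn.executemany("INSERT INTO protein_location VALUES (?, ?)",
                 [(1, "membrane"), (2, "nucleus"), (3, "membrane")])


def upregulated_in_location(conn, location, min_log2_fc=1.0):
    """Return (symbol, log2 fold change) for genes whose protein product is
    annotated with the given location and whose expression is at or above
    the (assumed) up-regulation threshold."""
    return conn.execute("""
        SELECT g.symbol, e.log2_fold_change
        FROM gene g
        JOIN expression e ON e.gene_id = g.gene_id
        JOIN protein_location p ON p.gene_id = g.gene_id
        WHERE p.location = ? AND e.log2_fold_change >= ?
        ORDER BY e.log2_fold_change DESC
    """, (location, min_log2_fc)).fetchall()


result = upregulated_in_location(conn, "membrane")
print(result)
```

The point is that expression values and functional annotation sit side by side, so the question is answered by a single join rather than by a spreadsheet export followed by a manual lookup. The chromosome and start_pos columns are included only to suggest how the chromosome-position analysis could be joined in the same way.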
Hope this helps.
Also, I am looking forward to a solution for the “cohort_definition” and “attribute_definition” types of data.
P.S. Do you have a timeline in mind for incorporating sequence and functional data into the model? If so, would you be kind enough to let me know?
Many thanks,
Sekhar H.