We have several new additions and changes to the InterMine core data model coming in InterMine 2.0 (due early 2018).
We had a great discussion on Thursday about the proposed changes. Below are the decisions we made.
Multiple Genome Versions, Varieties / Subspecies / Strains
We were not able to come to an agreement, but everyone still felt there might be a core data model that can allow for single and multiple genomes that will be useful for all InterMines.
The fundamental question is whether an organism should be defined per genome version, or whether a single organism should be allowed to have several genome versions. In the latter case, we’d also need a helper class, e.g. “Strain”, that would contain information about the genome.
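To make the two options concrete, here is a rough Python sketch. It is not a proposal: the class names `OrganismPerGenome` and `OrganismWithStrains`, every field name, and the example values are hypothetical and only illustrate the shape of each option.

```python
from dataclasses import dataclass, field
from typing import List


# Option A (hypothetical shape): one Organism object per genome version.
@dataclass
class OrganismPerGenome:
    name: str
    genome_version: str


# Option B (hypothetical shape): one Organism with a helper class, e.g. Strain,
# that holds the information about each genome.
@dataclass
class Strain:
    name: str
    genome_version: str


@dataclass
class OrganismWithStrains:
    name: str
    strains: List[Strain] = field(default_factory=list)


# The same two genome versions expressed both ways (placeholder values).
option_a = [
    OrganismPerGenome(name="Example organism", genome_version="v1"),
    OrganismPerGenome(name="Example organism", genome_version="v2"),
]
option_b = OrganismWithStrains(
    name="Example organism",
    strains=[
        Strain(name="strain 1", genome_version="v1"),
        Strain(name="strain 2", genome_version="v2"),
    ],
)

print(len(option_a))          # two organism objects under option A
print(len(option_b.strains))  # one organism with two strains under option B
```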
This topic is sufficiently complex that we’ve agreed to try a more formal process here, listing our different options, their potential impact etc. More information on this process soon!
Proposed addition to the InterMine core data model
    <class name="SyntenicRegion" extends="SequenceFeature" is-interface="true">
        <reference name="syntenyBlock" referenced-type="SyntenyBlock" reverse-reference="syntenicRegions"/>
    </class>
    <class name="SyntenyBlock" is-interface="true">
        <collection name="syntenicRegions" referenced-type="SyntenicRegion" reverse-reference="syntenyBlock"/>
        <collection name="dataSets" referenced-type="DataSet"/>
        <collection name="publications" referenced-type="Publication"/>
    </class>
- We decided against making a SyntenyBlock a bio-entity, even though it would benefit from inheriting some references.
- We also decided against the SyntenicRegion1 / SyntenicRegion2 format; instead, the regions will be held in a collection.
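The two classes above point at each other through `reverse-reference`, so both ends need to agree. As a quick sanity check, here is a small Python sketch that parses the fragment and reports any reference whose reverse side does not match. The `<model>` wrapper element and the helper name `unmatched_reverse_references` are our own additions for this example, not part of InterMine.

```python
import xml.etree.ElementTree as ET

# The proposed classes, wrapped in a root element so the fragment parses.
MODEL_XML = """
<model>
  <class name="SyntenicRegion" extends="SequenceFeature" is-interface="true">
    <reference name="syntenyBlock" referenced-type="SyntenyBlock" reverse-reference="syntenicRegions"/>
  </class>
  <class name="SyntenyBlock" is-interface="true">
    <collection name="syntenicRegions" referenced-type="SyntenicRegion" reverse-reference="syntenyBlock"/>
    <collection name="dataSets" referenced-type="DataSet"/>
    <collection name="publications" referenced-type="Publication"/>
  </class>
</model>
"""


def unmatched_reverse_references(model_xml):
    """Return 'Class.field' names whose reverse-reference has no matching partner."""
    root = ET.fromstring(model_xml)
    # Index every reference and collection by (class name, field name).
    fields = {}
    for cls in root.iter("class"):
        for child in cls:
            if child.tag in ("reference", "collection"):
                fields[(cls.get("name"), child.get("name"))] = child
    problems = []
    for (cls_name, field_name), element in fields.items():
        reverse = element.get("reverse-reference")
        if reverse is None:
            continue  # one-way link, nothing to check
        partner = fields.get((element.get("referenced-type"), reverse))
        if (partner is None
                or partner.get("referenced-type") != cls_name
                or partner.get("reverse-reference") != field_name):
            problems.append(f"{cls_name}.{field_name}")
    return problems


problems = unmatched_reverse_references(MODEL_XML)
print(problems)  # []
```

An empty list means every `reverse-reference` in the fragment has a partner pointing back at it.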
GO Evidence Codes
Currently the GO evidence codes are only a controlled vocabulary and are limited to the code abbreviation, e.g. `IEA`. However, UniProt and other data sources have started to use ECO ontology terms to represent the GO evidence codes instead.
We decided against changing the GO Evidence Code to be an ECO ontology term.
- The ECO ontology is not comprehensive
- Some mines have a specific data model for evidence terms
Instead we are going to make the following changes to the GO Evidence Code:
- Add a link to more information on the GO evidence codes.
- Add the full name of the evidence code.
- Rename GOEvidenceCode to OntologyAnnotationEvidenceCode.
We decided against loading a full description of the evidence code. The description on the GO site is a full page. We tried shortening but then it didn’t really add much information. Also there is no text file with the description available.
We are also going to move evidence to Ontology Annotation.
GOEvidenceCode will be renamed OntologyAnnotationEvidenceCode:
    <class name="OntologyAnnotationEvidenceCode" is-interface="true">
        <attribute name="code" type="java.lang.String"/>
        <attribute name="name" type="java.lang.String"/>
        <attribute name="URL" type="java.lang.String"/>
    </class>
GOEvidence will be renamed OntologyEvidence:
    <class name="OntologyEvidence" is-interface="true">
        <reference name="code" referenced-type="OntologyAnnotationEvidenceCode"/>
        <collection name="publications" referenced-type="Publication"/>
    </class>
Evidence will move to OntologyAnnotation from GOAnnotation:
    <class name="OntologyAnnotation" is-interface="true">
        <collection name="evidence" referenced-type="OntologyEvidence"/>
    </class>
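To show how the three renamed classes hang together, here is a small Python sketch that mirrors them as dataclasses. It is only an illustration of the shape of the data, not InterMine code: publications are simplified to plain strings, the model's `URL` attribute is written as `url`, and the `example.org` link is a placeholder rather than the real link that will be loaded.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class OntologyAnnotationEvidenceCode:
    code: str  # the abbreviation, e.g. "IEA"
    name: str  # the full name of the evidence code
    url: str   # link to more information ("URL" in the model)


@dataclass
class OntologyEvidence:
    code: OntologyAnnotationEvidenceCode
    publications: List[str] = field(default_factory=list)  # simplified to strings


@dataclass
class OntologyAnnotation:
    evidence: List[OntologyEvidence] = field(default_factory=list)


iea = OntologyAnnotationEvidenceCode(
    code="IEA",
    name="Inferred from Electronic Annotation",
    url="https://example.org/evidence/IEA",  # placeholder URL, an assumption
)
annotation = OntologyAnnotation(evidence=[OntologyEvidence(code=iea)])

# Walk from the annotation to the code and its full name.
labels = [f"{e.code.code} ({e.code.name})" for e in annotation.evidence]
print(labels)  # ['IEA (Inferred from Electronic Annotation)']
```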
Ontology Annotations – Subject
Currently you can only reference BioEntities, e.g. Proteins and Genes, in an annotation. This is unsuitable as any object in InterMine can be annotated, e.g. Protein Domains. To solve this problem, we will add a new data type, `Annotatable`:

    <class name="Annotatable" is-interface="true">
        <attribute name="primaryIdentifier" type="java.lang.String"/>
        <collection name="ontologyAnnotations" referenced-type="OntologyAnnotation" reverse-reference="subject"/>
        <collection name="publications" referenced-type="Publication" reverse-reference="bioEntities"/>
    </class>
    <class name="OntologyAnnotation" is-interface="true">
        <reference name="subject" referenced-type="Annotatable" reverse-reference="ontologyAnnotations"/>
    </class>
    <class name="BioEntity" is-interface="true" extends="Annotatable"/>
This will add complexity to the data model, but templates will hide it from casual users.
Also, publications will be moved from `BioEntity` to `Annotatable`. This change will allow us to have both publications and annotations on things that are not BioEntities.
`primaryIdentifier` will also be moved from `BioEntity` to `Annotatable`.
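Here is a rough Python sketch of what this hierarchy allows. It is illustrative only: the ontology term is simplified to a string, the `annotate` helper and the identifiers are hypothetical, and placing `ProteinDomain` directly under `Annotatable` is our assumption based on the example above.

```python
class Annotatable:
    """Anything that can carry ontology annotations and publications."""

    def __init__(self, primary_identifier):
        self.primary_identifier = primary_identifier
        self.ontology_annotations = []
        self.publications = []


class BioEntity(Annotatable):
    """BioEntity now inherits primaryIdentifier and publications."""


class Gene(BioEntity):
    pass


class ProteinDomain(Annotatable):
    """Assumed here to be annotatable without being a BioEntity."""


class OntologyAnnotation:
    def __init__(self, term, subject):
        self.term = term        # simplified: a term identifier as a string
        self.subject = subject  # any Annotatable, not only a BioEntity


def annotate(subject, term):
    """Hypothetical helper that keeps both ends of the reference in sync."""
    annotation = OntologyAnnotation(term, subject)
    subject.ontology_annotations.append(annotation)
    return annotation


gene = Gene("GENE_1")               # placeholder identifier
domain = ProteinDomain("DOMAIN_1")  # placeholder identifier
annotate(gene, "GO:0005515")
domain_annotation = annotate(domain, "GO:0005515")

print(isinstance(domain, BioEntity))        # False
print(domain_annotation.subject is domain)  # True
print(len(domain.ontology_annotations))     # 1
```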
Protein molecular weight
Protein.molecularWeight is going to be changed from an integer to a float.
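As a small illustration of what an integer field loses, here is the same value stored both ways in Python. The weight is a made-up number, not taken from any real protein record.

```python
# Hypothetical molecular weight as it might appear in a source file.
raw_weight = "33456.7"

as_float = float(raw_weight)  # what a float attribute can hold
as_integer = int(as_float)    # what an integer attribute would keep

print(as_float)    # 33456.7
print(as_integer)  # 33456
```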
Organism taxon ID
Organism.taxonId is going to be changed from an integer to a string.
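One practical consequence is that code comparing taxon IDs against numbers will need to compare against strings instead. A minimal Python sketch, where `normalise_taxon_id` is a hypothetical helper and not part of any InterMine client:

```python
def normalise_taxon_id(value):
    """Return a taxon ID as a string, whether it arrives as an int or a str."""
    return str(value).strip()


old_style = 9606  # taxon ID held as an integer
new_style = normalise_taxon_id(old_style)

print(new_style == "9606")  # True
print(new_style == 9606)    # False: a string never equals an integer
```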
Sequence Ontology update
The sequence ontology underpins the core InterMine data model. We will update to the latest version available.
If you would like to be involved in these discussions, please do join our community calls or add your comments to the GitHub tickets. We want to hear from you!