InterMine 2.0: Proposed Model Changes (III)

We have several new additions and changes to the InterMine core data model coming in InterMine 2.0 (due early 2018).

We had a great discussion on Thursday about the proposed changes. Below are the decisions we made.

Multiple Genome Versions, Varieties / Subspecies / Strains

We were not able to reach an agreement, but everyone still felt that there could be a core data model, useful for all InterMines, that allows for both single and multiple genomes.

The fundamental question is whether an Organism should represent a single genome version, or whether one Organism may have several genome versions. In the latter case, we’d also need a helper class, e.g. “Strain”, that would hold information about the genome.
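
To make the second option concrete, here is a rough sketch of what such a helper class might look like. Nothing here was agreed: the Strain class, its attribute names, and the strains / organism references are all hypothetical, with annotationVersion and assemblyVersion borrowed from the attributes previously proposed for Organism.

```xml
<!-- Hypothetical sketch only; no class or attribute below has been agreed -->
<class name="Organism" is-interface="true">
  <collection name="strains" referenced-type="Strain" reverse-reference="organism"/>
</class>

<class name="Strain" is-interface="true">
  <attribute name="name" type="java.lang.String"/>
  <attribute name="annotationVersion" type="java.lang.String"/>
  <attribute name="assemblyVersion" type="java.lang.String"/>
  <reference name="organism" referenced-type="Organism" reverse-reference="strains"/>
</class>
```

Under the first option, these attributes would instead sit directly on Organism, with one Organism object per genome version.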

This topic is sufficiently complex that we’ve agreed to try a more formal process here, listing our different options, their potential impact etc. More information on this process soon!

Syntenic Regions

Proposed addition to the InterMine core data model

<class name="SyntenicRegion" extends="SequenceFeature" is-interface="true">
  <reference name="syntenyBlock" referenced-type="SyntenyBlock" reverse-reference="syntenicRegions"/>
</class>

<class name="SyntenyBlock" is-interface="true">
  <collection name="syntenicRegions" referenced-type="SyntenicRegion" reverse-reference="syntenyBlock"/>
  <collection name="dataSets" referenced-type="DataSet"/>
  <collection name="publications" referenced-type="Publication"/>
</class>

  • We decided against making SyntenyBlock a BioEntity, even though it would benefit from inheriting some references.
  • We also decided against the SyntenicRegion1 / SyntenicRegion2 format; instead, the regions will be held in a collection.

GO Evidence Codes

Currently the GO evidence codes are only a controlled vocabulary and are limited to the code abbreviation, e.g. IEA. However, UniProt and other data sources have started to use ECO ontology terms to represent the GO evidence codes instead.

We decided against changing the GO Evidence Code to be an ECO ontology term.

  • The ECO ontology is not comprehensive
  • Some mines have a specific data model for evidence terms

Instead, we are going to make the following changes to the GO Evidence Code:

  • Add a link to more information on the GO evidence codes
  • Add the full name of the evidence code
  • Rename GOEvidenceCode to OntologyAnnotationEvidenceCode

We decided against loading a full description of the evidence code. The description on the GO site is a full page; we tried shortening it, but the shortened text didn’t add much information. There is also no text file of the descriptions available to load.

We are also going to move evidence to Ontology Annotation.

GOEvidenceCode will be renamed OntologyAnnotationEvidenceCode:

<class name="OntologyAnnotationEvidenceCode" is-interface="true">
 <attribute name="code" type="java.lang.String" />
 <attribute name="name" type="java.lang.String" />
 <attribute name="URL" type="java.lang.String" />
</class>

GOEvidence will be renamed OntologyEvidence:

<class name="OntologyEvidence" is-interface="true">
 <reference name="code" referenced-type="OntologyAnnotationEvidenceCode"/>
 <collection name="publications" referenced-type="Publication"/>
</class>

Evidence will move to OntologyAnnotation from GOAnnotation:

<class name="OntologyAnnotation" is-interface="true">
 <collection name="evidence" referenced-type="OntologyEvidence"/>
</class>
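
For GO annotations, the practical effect should be that evidence is inherited rather than declared. A minimal sketch, assuming GOAnnotation continues to extend OntologyAnnotation as it does in the current model:

```xml
<!-- Assumption: GOAnnotation still extends OntologyAnnotation in 2.0 -->
<class name="GOAnnotation" extends="OntologyAnnotation" is-interface="true">
  <!-- no "evidence" collection is declared here; it is inherited from OntologyAnnotation -->
</class>
```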

Ontology Annotations – Subject

Currently you can only reference BioEntities, e.g. Proteins and Genes, in an annotation. This is unsuitable as any object in InterMine can be annotated, e.g. Protein Domains. To solve this problem, we will add a new data type, Annotatable.

<class name="Annotatable" is-interface="true">
 <attribute name="primaryIdentifier" type="java.lang.String"/>
 <collection name="ontologyAnnotations" referenced-type="OntologyAnnotation" reverse-reference="subject"/>
 <collection name="publications" referenced-type="Publication" reverse-reference="bioEntities"/>
</class>

<class name="OntologyAnnotation" is-interface="true">
 <reference name="subject" referenced-type="Annotatable" reverse-reference="ontologyAnnotations"/>
</class>

<class name="BioEntity" is-interface="true" extends="Annotatable"/>

This will add complexity to the data model, but template queries will hide it from casual users.

Publications will also be moved from BioEntity to Annotatable. This change will allow us to have both publications and annotations on things that are not BioEntities. primaryIdentifier will also be moved from BioEntity to Annotatable.
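
As an illustration, a class that is not a BioEntity could opt in to annotations and publications by extending Annotatable. The sketch below assumes a mine that models ProteinDomain outside the BioEntity hierarchy; the class declaration is illustrative and is not part of the proposal.

```xml
<!-- Illustrative only: assumes ProteinDomain is not already a BioEntity in this mine -->
<class name="ProteinDomain" extends="Annotatable" is-interface="true">
  <attribute name="name" type="java.lang.String"/>
  <!-- primaryIdentifier, ontologyAnnotations and publications are inherited from Annotatable -->
</class>
```

A query could then follow ProteinDomain.ontologyAnnotations in the same way as Gene.ontologyAnnotations.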

Protein molecular weight

Protein.molecularWeight is going to be changed from an integer to a float.
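
In model terms this is a one-line change to the attribute type. The sketch below assumes java.lang.Float as the new type, since the decision only says "float"; the final model may use a different floating-point type such as java.lang.Double.

```xml
<class name="Protein" extends="BioEntity" is-interface="true">
  <!-- was: type="java.lang.Integer" -->
  <attribute name="molecularWeight" type="java.lang.Float"/>
</class>
```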

Organism taxon ID

Organism.taxonId is going to be changed from an integer to a string.
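
The corresponding model change might look like this (a sketch; only the type changes, the attribute name stays the same):

```xml
<class name="Organism" is-interface="true">
  <!-- was: type="java.lang.Integer" -->
  <attribute name="taxonId" type="java.lang.String"/>
</class>
```

Existing queries that constrain taxonId numerically would likely need to compare against a string value instead, e.g. "7227" rather than 7227.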

Sequence Ontology update

The sequence ontology underpins the core InterMine data model. We will update to the latest version available.


If you would like to be involved in these discussions, please do join our community calls or add your comments to the GitHub tickets. We want to hear from you!

InterMine 2.0: Proposed Model Changes (II)

We have several new additions and changes to the InterMine core data model coming in InterMine 2.0 (due Fall 2017).

We had a great discussion on Thursday about the proposed changes. Below are the decisions we made.

Multiple Genome Versions

Many InterMine instances have several different genome versions.

Proposed addition to the InterMine core data model

  <class name="Organism" is-interface="true">
    <attribute name="annotationVersion" type="java.lang.String"/>
    <attribute name="assemblyVersion" type="java.lang.String"/>
  </class>

Multiple Varieties / Subspecies / Strains

We were going to add variety to the Organism data type to indicate subtypes that have the same taxon ID. However, some people expressed a concern that this term wasn’t generic enough.

Proposed addition to the InterMine core data model

  <class name="Organism" is-interface="true">
    <attribute name="variety" type="java.lang.String"/>
  </class>

Other suggestions:

  1. Strain
  2. Subspecies
  3. Stock
  4. Line
  5. Accession
  6. Subtype
  7. Ecotype
  8. Isolate
  9. Others? …

It was suggested that we take a vote to choose the name. Please note that you can overwrite attribute names locally. But it would be better if we could all (mostly) agree!

User Interface

Both of the above changes will require updates to the core InterMine code wherever it assumes that Organism.taxonId is the unique field. This assumption will be replaced so that the new fields in Organism, where present, are used for the primary key.

For user friendliness, it will be necessary to assign unique organism names. Users will then be able to easily identify distinct versions in template queries and widgets.

Syntenic Regions

Proposed addition to the InterMine core data model

<class name="SyntenicRegion" extends="SequenceFeature" is-interface="true">
  <reference name="syntenyBlock" referenced-type="SyntenyBlock" reverse-reference="syntenicRegions"/>
</class>

<class name="SyntenyBlock" is-interface="true">
  <collection name="syntenicRegions" referenced-type="SyntenicRegion" reverse-reference="syntenyBlock"/>
  <reference name="dataSet" referenced-type="DataSet"/>
  <reference name="publication" referenced-type="Publication"/>
</class>

  • We decided against making SyntenyBlock a BioEntity, even though it would benefit from inheriting some references.
  • We also decided against the SyntenicRegion1 / SyntenicRegion2 format; instead, the regions will be held in a collection.

GO Evidence Codes

Currently the GO evidence codes are only a controlled vocabulary and are limited to the code abbreviation, e.g. IEA. However, UniProt and other data sources have started to use ECO ontology terms to represent the GO evidence codes instead.

We decided against changing the GO Evidence Code to be an ECO ontology term.

  • The ECO ontology is not comprehensive
  • Some mines have a specific data model for evidence terms

Instead we are going to add attributes to the GO Evidence Code:

  • Add full description of the GO Evidence Code
  • Add a link to more information on the GO evidence codes
  • (Optional) add a link to the ECO term, if available.

<class name="GOEvidenceCode" is-interface="true">
 <attribute name="code" type="java.lang.String" />
 <attribute name="description" type="java.lang.String" />
 <attribute name="URL" type="java.lang.String" />
</class>

IEA evidence code example

Ontology Annotations – Subject

Currently you can only reference BioEntities, e.g. Proteins and Genes, in an annotation. This is unsuitable as any object in InterMine can be annotated, e.g. Protein Domains. To solve this problem, we will add a new data type, Annotatable.

<class name="Annotatable" is-interface="true">
 <collection name="ontologyAnnotations" referenced-type="OntologyAnnotation" reverse-reference="subject"/>
</class>

<class name="OntologyAnnotation" is-interface="true">
 <reference name="subject" referenced-type="Annotatable" reverse-reference="ontologyAnnotations"/>
</class>

<class name="BioEntity" is-interface="true" extends="Annotatable"/>

This will add complexity to the data model, but template queries will hide it from casual users.


If you would like to be involved in these discussions, please do join our community calls or add your comments to the GitHub tickets. We want to hear from you!

InterMine 2.0: PROPOSED Model Changes

We have several new additions and changes to the InterMine core data model coming in InterMine 2.0 (due Fall 2017).

You can follow the detailed conversation for each change on GitHub. Please note, these are only the proposals and will be discussed further on community calls. Join the conversation!

Multiple Genome Versions

Many InterMine instances have several different genome versions.

Proposed addition to the InterMine core data model

  <class name="Organism" is-interface="true">
    <attribute name="annotationVersion" type="java.lang.String"/>
    <attribute name="assemblyVersion" type="java.lang.String"/>
  </class>

Multiple Varieties / Subspecies / Strains

We’re going to add variety to the Organism data type to distinguish strains that share the same taxon ID.

Proposed addition to the InterMine core data model

  <class name="Organism" is-interface="true">
    <attribute name="variety" type="java.lang.String"/>
  </class>

User Interface

Both of the above changes will require updates to the core InterMine code wherever it assumes that Organism.taxonId is the unique field. This assumption will be replaced so that the new fields in Organism, where present, are used for the primary key.

For user friendliness, it will be necessary to assign unique organism names. Users will then be able to easily identify distinct versions in template queries and widgets.

Syntenic Regions

Proposed addition to the InterMine core data model

  <class name="SyntenicRegion" extends="SequenceFeature" is-interface="true">
    <reference name="partner" referenced-type="SyntenicRegion" reverse-reference="partner" />    
    <reference name="syntenyBlock" referenced-type="SyntenyBlock"/>
  </class>
  
  <class name="SyntenyBlock" is-interface="true">
    <attribute name="medianKs" type="java.lang.Double"/>    
    <collection name="syntenicRegions" referenced-type="SyntenicRegion"/>
  </class>

GO Evidence Codes

Currently the GO evidence codes are only a controlled vocabulary and are limited to the code abbreviation, e.g. IEA. However, UniProt and other data sources have started to use ECO ontology terms to represent the GO evidence codes instead.

Current model

<class name="GOEvidence" is-interface="true">
 <reference name="code" referenced-type="GOEvidenceCode"/>
</class>

Proposed change to the InterMine core data model

<class name="GOEvidence" is-interface="true">
 <reference name="code" referenced-type="ECOTerm"/>
</class>

The ECO term would have the GO evidence code abbreviation along with the full description.

IEA evidence code example

Not many GO annotation data sets use ECO terms yet, but InterMine will implement a lookup service that replaces the traditional GO evidence codes with the corresponding ECO terms during data loading.


If you would like to be involved in these discussions, please do join our community calls or add your comments to the GitHub tickets. We want to hear from you!