InterMine 2.0: Proposed Model Changes (II)

We have several new additions and changes to the InterMine core data model coming in InterMine 2.0 (due Fall 2017).

We had a great discussion on Thursday about the proposed changes. Below are the decisions we made.

Multiple Genome Versions

Many InterMine instances have several different genome versions.

Proposed addition to the InterMine core data model

  <class name="Organism" is-interface="true">
    <attribute name="annotationVersion" type="java.lang.String"/>
    <attribute name="assemblyVersion" type="java.lang.String"/>
  </class>

Multiple Varieties / Subspecies / Strains

We were going to add variety to the Organism data type to indicate subtypes that have the same taxon ID, however some people expressed a concern that this term wasn’t generic enough.

Proposed addition to the InterMine core data model

  <class name="Organism" is-interface="true">
    <attribute name="variety" type="java.lang.String"/>
  </class>

Other suggestions:

  1. Strain
  2. Subspecies
  3. Stock
  4. Line
  5. Accession
  6. Subtype
  7. Ecotype
  8. Isolate
  9. Others? …

It was suggested that we take a vote to choose the name. Please note that you can overwrite attribute names locally. But it would be better if we could all (mostly) agree!

User Interface

Both the above changes will require updates to the core InterMine code where it is assumed that Organism.taxonID is the unique field. This assumption will be replaced so that the new fields in Organism, where present, are used for the primary key.

For user friendliness, it will be necessary to assign unique organism names. Users will then be able to easily identify distinct versions in template queries and widgets.

Syntenic Regions

Proposed addition to the InterMine core data model

<class name="SyntenicRegion" extends="SequenceFeature" is-interface="true">
    <reference name="syntenyBlock" referenced-type="SyntenyBlock" reverse-reference="syntenicRegions"/>
  </class>
  
  <class name="SyntenyBlock" is-interface="true">    
   <collection name="syntenicRegions" referenced-type="SyntenicRegion" reverse-reference="syntenyBlock" />
   <reference name="dataSet" referenced-type="DataSet" />
   <reference name="publication" referenced-type="Publication" />
  </class>
  • We decided against making a SyntenyBlock a bio-entity, even though it would benefit from inheriting some references.
  • We also decided against the SyntenicRegion1 / SyntenicRegion1 format and instead they will be in a collection of regions.

GO Evidence Codes

Currently the GO evidence codes are only a controlled vocabulary and are limited to the code abreviation, e.g IEA. However UniProt and other data sources have started to use ECO ontology terms to represent the GO evidence codes instead.

We decided against changing the GO Evidence Code to be an ECO ontology term.

  • The ECO ontology is not comprehensive
  • Some mines have a specific data model for evidence terms

Instead we are going to add attributes to the GO Evidence Code:

  • Add full description of the GO Evidence Code
  • Add a link to more information on the GO evidence codes
  • (Optional) add a link to the ECO term, if available.
<class name="GOEvidenceCode" is-interface="true">
 <attribute name="code" type="java.lang.String" />
 <attribute name="description" type="java.lang.String" />
 <attribute name="URL" type="java.lang.String" />
</class>

IEA evidence code example

Ontology Annotations – Subject

Currently you can only reference BioEntities, e.g. Proteins and Genes, in an annotation. This is unsuitable as any object in InterMine can be annotated, e.g. Protein Domains. To solve this problem, we will add a new data type, Annotatable.

<class name="Annotatable" is-interface="true"> <collection name="ontologyAnnotations" referenced-type="OntologyAnnotation" reverse-reference="subject"/> </class> <class name="OntologyAnnotation" is-interface="true"> <reference name="subject" referenced-type="BioObject" reverse-reference="ontologyAnnotations"/> </class> <class name="BioEntity" is-interface="true" extends="Annotatable"/>

This will add complexity to the data model but this would be hidden from casual users with templates.


If you would like to be involved in these discussions, please do join our community calls or add your comments to the GitHub tickets. We want to hear from you!

InterMine 2.0: PROPOSED Model Changes

We have several new additions and changes to the InterMine core data model coming in InterMine 2.0 (due Fall 2017).

You can follow the detailed conversation for each change on GitHub. Please note, these are only the proposals and will be discussed further on community calls. Join the conversation!

Multiple Genome Versions

Many InterMine instances have several different genome versions.

Proposed addition to the InterMine core data model

  <class name="Organism" is-interface="true">
    <attribute name="annotationVersion" type="java.lang.String"/>
    <attribute name="assemblyVersion" type="java.lang.String"/>
  </class>

Multiple Varieties / Subspecies / Strains

We’re going to add variety to the Organism data type to indicate two strains that have the same taxon ID.

Proposed addition to the InterMine core data model

  <class name="Organism" is-interface="true">
    <attribute name="variety" type="java.lang.String"/>
  </class>

User Interface

Both the above changes will require updates to the core InterMine code where it is assumed that Organism.taxonID is the unique field. This assumption will be replaced so that the new fields in Organism, where present, are used for the primary key.

For user friendliness, it will be necessary to assign unique organism names. Users will then be able to easily identify distinct versions in template queries and widgets.

Syntenic Regions

Proposed addition to the InterMine core data model

  <class name="SyntenicRegion" extends="SequenceFeature" is-interface="true">
    <reference name="partner" referenced-type="SyntenicRegion" reverse-reference="partner" />    
    <reference name="syntenyBlock" referenced-type="SyntenyBlock"/>
  </class>
  
  <class name="SyntenyBlock" is-interface="true">
    <attribute name="medianKs" type="java.lang.Double"/>    
    <collection name="syntenicRegions" referenced-type="SyntenicRegion"/>
  </class>

GO Evidence Codes

Currently the GO evidence codes are only a controlled vocabulary and are limited to the code abreviation, e.g IEA. However UniProt and other data sources have started to use ECO ontology terms to represent the GO evidence codes instead.

Current model

<class name="GOEvidence" is-interface="true">
 <reference name="code" referenced-type="GOEvidenceCode"/>
</class>

Proposed change to the InterMine core data model

<class name="GOEvidence" is-interface="true">
 <reference name="code" referenced-type="ECOTerm"/>
</class>

The ECO term would have the GO evidence code abbreviation along with the full description.

IEA evidence code example

Not many GO annotation data sets use ECO terms (yet) but InterMine will implement a lookup-service to replace the traditional GO evidence codes with the corresponding ECO term during data loading.


If you would like to be involved in these discussions, please do join our community calls or add your comments to the GitHub tickets. We want to hear from you!

California Dreaming: InterMine Dev Conf 2017 Report – Day 1

2017’s developer conference has been and gone; time to pay my dues in a blog post or two.

Day 0: Welcome dinner, 29 March 2017

The Cambridge InterMine arrived at Walnut Creek without a hitch, and after a jetlagged attempt at a night’s sleep we sat down to a mega-grant-writing session in the hotel lobby, fuelled by several pots of coffee and plates of nachos.

By 7PM, people had begun to gather in the lobby to head to the inaugural conference dinner at the delicious Walnut Creek Yacht Club. We had to change the venue quite late on in the game, meaning we decided to wander down the street to collect some of the InterMiners who had ended up at the original venue (sorry!!). By the end of the meal, most of the UK contingent was dead on their feet – 10pm California time worked out to be 6am according to our body clocks, so when Joe offered to give several of us a lift back to the hotel, it was impossible to decline.

20170329_221945

Day 1: Workshop Intro

The day started with intros from our PI, Gos, and our host, David Goodstein. 

Josh and I followed up by introducing BlueGenes, the UI we’ve been working on to replace InterMine’s older JSP-based UI. You can view Josh’s slide deck , try out a live demoor browse / check out the source on GitHub.

Next came one of my favourite parts: short talks from InterMiners.

Short community talks

Doppelgangers – Joel Richardson, MGI

Joel gave a great presentation about Doppelgangers in InterMine – that is, occasionally, depending on your data sets and config, you can end up with duplicate or strange / incomplete InterMine objects in your mine. He follows up with explanations of the root causes and mitigation methods – a great resource for any InterMiner who is working in data source integration! 

Genetic data in Mines – Sam Hokin, NCGR/LegFed

Next up was Sam’s talk about his various beany mines, including CowpeaMine, which has only genetics data, rather than the more typical InterMine genomic data. He’s also implemented several custom data visualisations on gene report pages – check out the slides or mines for more details.

JBrowse and Inter-mine communication – Vivek Krishnakumar, JCVI

Vivek focused on some great cross-InterMine collaborations (slides here), including the technical challenges integrating JBrowse into InterMine, as well as a method to link to other InterMines using synteny rather than InterMine’s typical homology approach.

InterMine at JGI – Joe Carlson, Phytozome, JGI

Joe has the privilege to run the biggest InterMine, covering (currently) 72 data sets on 69 organisms. Compared to most InterMines, this is massive! Unsurprisingly, this scale comes with a few hitches many of the other mines don’t encounter. Joe’s slides give a great overview of the problems you might encounter in a large-scale InterMine and their solutions.

Afternoon sessions

FAIR and the semantic web – Daniela & Justin

After a yummy lunch at a nearby cafe, Justin introduced the concept of FAIR, and discussed InterMine’s plans for a FAIRer future (slides). Discussion topics included:

  • How to make stable URIs (InterMine object IDs are transient and will change between builds)
  • Enhanced embedded metadata in webpages and query results (data provenance, licencing)
  • Better Findablility (the F in FAIR) by registering InterMine resources with external registries
  • RDF generation / SPARQL querying

This was followed up by Daniela’s introduction to RDF and SPARQL, which provided a great basic intro to the two concepts in an easily-understood manner. I really loved these slides, and I reckon they’d be a good introduction for anyone interested in learning more about what RDF and SPARQL are, whether or not you’re interested in InterMine .

Extending the InterMine Core Data Model – Sergio

Sergio ran the final session, “Extending the InterMine Core Data Model“. Shared models allow for easier cross-InterMine queries, as demoed in the GO tool prototype:

This discussion raised several interesting talking points:

  • Should model extensions be created via community RFC?
  • If so, who is involved? Developers, community members, curators, other?
  • Homologue or homolog? Who knew a simple “ue” could cause incompatibility problems? Most InterMine use the “ue” variation, with the exception of PhytoMine. An answer to this problem was presented in the “friendly mine” section of Vivek’s talk earlier in the day.

Another great output was Siddartha Basu’s gist on setting up InterMine – outlining some pain points and noting the good bits.

Most of us met up for dinner afterwards at Kevin’s Noodle House – highly recommended for meat eaters, less so for veggies.

A flurry of deadlines: Grants, GSoC, workshops, and more…

We blogged in February commenting that we had a lot of events over the March / April period. Here’s a re-cap:

  • Attending conferences: Amongst the team we attended Bioschemas, the Elixir all-hands, and the Cambridge Scientific Computation Day.
  • InterMine training: We delivered a training workshop about using InterMine at the EBI, part of their Introduction to Omics data integration week-long course.
    • This went well despite a server-room meltdown which conveniently timed itself for the morning of the same day (the training session was in the afternoon, so we thankfully had time to get the servers back up!).
    • In contrast to previous years, every single hand went up when we asked if the participants wrote code as part of their job. Next time, we will try to allow for a longer session on using InterMine web services, rather than the 15 minute slot we allocated this time!
  • Developer Workshop and Hackathon: 5 days in sunny California, spending time with InterMiners from around the world. Longer blog posts to follow, but in the meantime you can browse the agenda for links to slides from each session, or the storify summary of tweets.
  • Google Summer of Code: We’re participating in Google Summer of Code (GSoC) this year (previously) as a mentoring organisation. We had over 50 interested students and 30 distinct applications, many of which were simply brilliant. The deadline for students applying, naturally, was the day after the hackathon, making finding time to provide student feedback a challenge. Maybe there’s a reason to be grateful for jet-lag induced wakefulness at odd hours!
  • Grants: A tale of two grants… :
    • New application: We had a grant application deadline that was, once again, the day after the hackathon. Uh-oh! Feverish figure fixes, tentative typo tweaks and word-count winnowing was squeezed in at every opportunity.
    • Good news about an old application: Meanwhile, we got the news that we’d been fortunate enough to have our hard work pay off: a grant we’d applied for last year as part of the BBSRC BBR 2016 call was agreed to! Hint: the future of InterMine is looking very FAIR, possibly even SPARQLing. More details in a later post.

Events coming up soon:

Goal in sight: better uptime

TL;DR: We’re planning to host our CDN and an InterMine registry on AWS in the future. For more specifics, read on…

Over the course of 2015, we had several note-worthy service outages.

First of all there was a downpour in Cambridge that caused massive flooding, which took out the university’s network connection. The server room actually had to be pumped out. Yikes.

Next, the machine room had a broken chiller – this was so severe that the servers began to heat up almost immediately and we had to go into shutdown straight away whilst repair people worked on it. Services came back up the next morning, but only just in time for our training workshop.

Then came the big nasty DDOS attack against Janet, a higher education network which spans the UK. From the university, we could access some websites, but many more were down. Crucially, no one could reach our servers from the outside, even though this time they were running.

Apart from taking down HumanMine, FlyMine, and our websites, this also took down our CDN, meaning that anyone who hosted their own InterMine but relied upon us for their CDN was also affected by the outages.

Amazon Web Services to the rescue

We’re always working our hardest to make sure things run as smoothly as possible, but sometimes things are beyond our control. This number of outages is obviously less than ideal, and we’d like to do better in 2016.

One solution is to set up your own InterMine CDN, but could we do something to help you on our end, something that doesn’t require extra work for others? After a bit of thinking, we realised we probably could.

So as part of a plan to offer a more reliable service, our CDN will be migrated to live on AWS in the next few months. AWS has a pretty darned good uptime track record. What’s more, Amazon serves a massive chunk of the web – so if AWS ever is down, the likelihood is that much of the web will be down as well, and the outage won’t be surprising to any potential users.

An InterMine registry

Another feature that’s been in planning works for a while is an InterMine registry. Like the CDN, it’s pretty important that a dynamic discoverable registry have near-perfect uptime. So hopefully you can expect to see the hardcoded mine lists disappearing as we start to take advantage of a hosting service with high availability.

Watch this space for the official announcement and release dates for the CDN upgrade and registry release!

A final note about outages

While email was okay when Janet was playing up, the machine room breakdowns took out our mail server, and hence the mailing list – so the “Hey guys, is your server down?” queries didn’t reach us until after everything had recovered.

If you’re aware of a problem, the best place to find out more will be our Twitter account, @intermineorg. We’ll always make an effort to update our twitter feed – by mobile network if necessary – with details of outages, whether planned or emergency.

Featured image courtesy of Torkild Retvedt via Flickr. Licence CC-BY-SA 2.0

BiVi 2015

It’s the first year someone from InterMine has attended BiVi (Biology Visualisation), which is in its second year of a (currently) three-year plan to create a community around biology visualisation.

The atmosphere was pleasantly thriving, with probably 20 or 30 attendees from Cambridge, Edinburgh, London, Oxford, and even Institut Curie in France. I think it was safe to say that most of the attendees were people with a computer science background who worked in biology-related fields, although that wasn’t a strict rule.

Two themes were particularly popular this year:

  1. 3d molecule visualisation. Multiple different talks / groups of BiVians had exciting developments to show us, from Hapitimol, a visualisation with haptic feedback, to EZMol, with an easy-to-use wizard to produce visualisations, Foldsynth, a physics based engine from Goldsmiths, MARender, a Javascript-driven biomedical imaging library, and BioBlox, which gamifies protein docking to allow crowd-sourced research.
  2. Usability. Biology is a rich, complex field. Computer people making tools for biologists need to keeps things easy to use. EZMol, mentioned above, was created due to a notable lack of simple usable 3d visualisation tools. Our upcoming InterMine 2.0 release is another push towards creating a better user experience.

One of the most impressive tools demonstrated was Zegami, a tool to annotate, view, and filter images. It may have started as a biological tool, but it’s equally functional for your holiday snaps or for sorting and filtering Netflix movies. It’s a shame we don’t really have any image data in InterMine (well, in Fly or HumanMine at least) given how cool it looked.

zegami.png
Example from zegami.com: South Australian Museum invertebrates.

A few other tools / demos of note included:

  • Reactome pathway browser, which is fully embeddable into your own web app.
  • Jalview, ‘a free program for multiple sequence alignment editing, visualisation and analysis’. The main Jalview dev is also one of the organisers of the VizBi biology vis conference, next taking place in Germany.
  • Aequatus-vis uses Ensembl web services to visualise homologous gene families.

On day 2, I gave a short talk on the future of InterMine, focusing on why we want to revamp our UI, and what we think we’re doing better in InterMine 2.0. The slides are available on Google Drive (better format) or Slideshare.

 

 

New Blog!

We’ve decided to streamline our blogging experience a little bit. Rather than maintaining several separate but mostly similar blogs for HumanMine, FlyMine, and InterMine, this blog will act as a combined stream.

Don’t worry – this doesn’t mean you’ll be forced to view irrelevant updates if you’re only interested in one of the sub categories. WordPress is great about filtering via tag or category. Here are a few quick links:

InterMine-specific updates: https://intermineorg.wordpress.com/category/intermine/

FlyMine-specific updates: https://intermineorg.wordpress.com/category/flymine/

HumanMine-specific updates: https://intermineorg.wordpress.com/category/humanmine/ (It’s empty now, but coming soon!)