InterMine 2.0: PROPOSED Model Changes

We have several new additions and changes to the InterMine core data model coming in InterMine 2.0 (due Fall 2017).

You can follow the detailed conversation for each change on GitHub. Please note, these are only the proposals and will be discussed further on community calls. Join the conversation!

Multiple Genome Versions

Many InterMine instances have several different genome versions.

Proposed addition to the InterMine core data model

  <class name="Organism" is-interface="true">
    <attribute name="annotationVersion" type="java.lang.String"/>
    <attribute name="assemblyVersion" type="java.lang.String"/>
  </class>

Multiple Varieties / Subspecies / Strains

We’re going to add variety to the Organism data type to indicate two strains that have the same taxon ID.

Proposed addition to the InterMine core data model

  <class name="Organism" is-interface="true">
    <attribute name="variety" type="java.lang.String"/>
  </class>

User Interface

Both the above changes will require updates to the core InterMine code where it is assumed that Organism.taxonID is the unique field. This assumption will be replaced so that the new fields in Organism, where present, are used for the primary key.

For user friendliness, it will be necessary to assign unique organism names. Users will then be able to easily identify distinct versions in template queries and widgets.

Syntenic Regions

Proposed addition to the InterMine core data model

  <class name="SyntenicRegion" extends="SequenceFeature" is-interface="true">
    <reference name="partner" referenced-type="SyntenicRegion" reverse-reference="partner" />    
    <reference name="syntenyBlock" referenced-type="SyntenyBlock"/>
  </class>
  
  <class name="SyntenyBlock" is-interface="true">
    <attribute name="medianKs" type="java.lang.Double"/>    
    <collection name="syntenicRegions" referenced-type="SyntenicRegion"/>
  </class>

GO Evidence Codes

Currently the GO evidence codes are only a controlled vocabulary and are limited to the code abreviation, e.g IEA. However UniProt and other data sources have started to use ECO ontology terms to represent the GO evidence codes instead.

Current model

<class name="GOEvidence" is-interface="true">
 <reference name="code" referenced-type="GOEvidenceCode"/>
</class>

Proposed change to the InterMine core data model

<class name="GOEvidence" is-interface="true">
 <reference name="code" referenced-type="ECOTerm"/>
</class>

The ECO term would have the GO evidence code abbreviation along with the full description.

IEA evidence code example

Not many GO annotation data sets use ECO terms (yet) but InterMine will implement a lookup-service to replace the traditional GO evidence codes with the corresponding ECO term during data loading.


If you would like to be involved in these discussions, please do join our community calls or add your comments to the GitHub tickets. We want to hear from you!

InterMine community roundup: June 2017

Here are some of the exciting things that have been happening in the InterMine community recently:

Thanks to everyone who has contributed including students and their mentors. You guys are awesome!

excited Kermit via GIPHY

Have you done anything exciting with InterMine lately? email info [at] intermine [dot] org, tweet us at @intermineorg, or pop into chat.intermine.org to tell us about it… we’d love to feature you in a future round-up!

InterMine’s Python Client: Now with tutorials!

We’re excited to announce that our Python client is getting a new suite of tutorials / cookbook “recipes” to ease you into coding with InterMine.

The tutorials are in Jupyter notebook format (.ipynb), and you can preview or check out the tutorials on GitHub: https://github.com/intermine/intermine-ws-python-docs.

Right now tutorial 1 and tutorial 2 are online, and we’ll be adding more over the summer, with a target of around twelve tutorials. If you run through any of the tutorials and have feedback, we’d love to hear from you – info at intermine dot org, tweet us your thoughts, or open a ticket.

These tutorials are brought to you by Samarth, a fantastic community volunteer. Thanks Samarth!!

InterMine-Python tutorials
Screenshot of the first tutorial

Google Summer of Code: Coding period starts!

As of the 30th of May, the community bonding period is over and official coding starts for GSoC. The first evaluation period is between June 26 to June 30 (full timeline).

Preparing for the evaluation

We don’t have full details of the evaluation questions yet, but the Student Manual and Mentor Manual provide a decent overview – it’s likely to be a few short questions ensuring work and communication are occurring and are on-track.

Students: What you need to do:

Follow your workplan and communicate regularly with your mentor!  Evidence of work can include emails regarding progress, demos if possible, and GitHub commits / PRs. Read the Student Manual entry on evaluations. Remember you’ll need to complete an evaluation on your mentor, too.

Mentors: What you’ll need to do:

Make sure you’re communicating with your student regularly and you’re confident about their progress. If you are on vacation during the evaluation period (or immediately before), make clear plans now, and make sure your student knows what will be happening and who their backup mentor/evaluator is for this time period.

Please also read the Mentor Manual on evaluations, and consider arranging a face-to-face feedback session, since your student can’t see your evaluation details beyond a pass/fail status.

 

 

GraphConnect – a Neo4j conference

neo4jconference

We were in London to attend GraphConnect, the annual conference organised by Neo4j.
It was fantastic to meet so many people around the world enthusiastic about graph databases, and a lot of people that, like us, are prototyping/exploring Neo4j as possible alternative to relational databases.

They have announced the release of Neo4j 3.2 which promises to bring a huge improvement in term of performance; the compiled Cypher runtime has improved the speed by ~300% for a subset of basic queries and the introduction of native label indexes has also improved the write speed.

They have also added the composite indexes (that InterMine uses a lot) and the use of indexes with the OR operator. We highlighted the problem months ago on stackoverflow and we were so surprised to see it fixed. We have to update our “What we didn’t like about Neo4j” list by removing 2 items. We’re really happy about that!

It was a pleasure to attend Jesus Barrasa’s talk on debunking some RDF versus property graph alternative facts. He demoed how a RDF resource does not necessarily have to live in a triple store but can also be stored in Neo4j. Here are part1 and part2 of “Neo4j is your RDF store”, a nice post where he describes his work in more detail.

Another nice tool they have implemented is the ETL tool to import data from a relational database into Neo4j by applying some simple rules.

The solution-based talks demonstrated how Neo4j is being used to solve complex, real world problems ranging from travel recommendation engines to measuring the impact of slot machine locations on casino floors. While the topics were diverse, a common theme across their respective architectures was the use of GraphAware’s plugins, some of which are free. One plugin that looks particularly interesting is the Neo4j2Elastic tool which transparently pushes data from Neo4j to ElasticSearch.

During the conference, we discovered that there is a Neo4j Startup Program that allows to have Neo4j enterprise edition for free. Not sure if we count as a start up though!

Overall, we’re super happy with the improvements Neo4j has made, and super impressed with Neo4j’s growing community. Looking forward to meeting with Neo4j team in London, at their meetup, and sharing our small experience with the community!

GSoCers Assemble! Announcing the InterMine GSOC 2017 students

Google Summer of Code is officially open as of 16:00 UTC today! This year InterMine will have five students coding over the summer, with five projects:

gsoc-icon-192

  • InterMineR will be getting better docs and hopefully submitted to R repos. Konstantinos Kyritsis will be working on this with the help of InterMine mentors Julie and Rachel.
  • Our Android App will get a younger sibling in the form of an iOS app, thanks to Nadia Yudina. I’ll be the primary mentor for this project.
  • We’ll finally have a proper registry of all the great InterMines out there, brought to you by Leonardo Kuffo with Daniela mentoring the project.
  • Samyadeep Basu will be looking at an ‘InterMine Similarity project’ – given a Gene (or other entity) from InterMine – are there any other interesting entities related to it in some way? Josh is the lead mentor on this project.
  • Yash Sharma will be working on creating Neo4j-InterMine API endpoints under Sam Hokin‘s mentorship.

We wish we could have accepted more of you. In total we had more than 40 students interested in GSoC 2017 with InterMine, resulting in around 30 finalised applications. Many of the applications were brilliant – far more than we could possibly have accepted. Deciding who to accept was really tough, and even if you didn’t get a place in GSoC with us you’re still entirely welcome to contribute to any of our projects if you had any ideas.

Suggestions for accepted students

Congratulations on being accepted. We’re really glad to have you on board. Please have a quick read through our GSoC guidelines to get started.

During the community bonding period, here are a few ideas for getting involved.

  • Find out more details that might pertain to your project (obviously) – investigate the API or work on bugs
  • Project management – in your project’s GitHub repo create milestones, tickets, project boards as appropriate.
  • Write an intro blog post about yourself & your planned work (to be posted here and/or a personal blog we could link to).
  • Come hang in the chat (below).

Non-GSoC InterMine community: you can play too!

We’ve created a couple of chat rooms at chat.intermine.org. We’ll be encouraging our GSoC students to hang out in the #general channel, and you’re welcome to, as well. The students are from all around the world – come make them feel at home!

The FAIR journey

backpack

It’s time to celebrate. After some hectic weeks preparing and organizing the InterMine conference we can now take a deep breath and and get ready for our new BBSRC funded project to make data more FAIR.

It will be a short rest, just the time to check that we have everything we need to start this long journey but also to savour the excitement before the departure towards FAIRness, a destination that will enhance InterMine compliance with FAIR principles; InterMine has been at the forefront of getting research data into the hands of scientists for over 10 years and we’re excited to support the formalisation of these principles.

As team, we recognized with no doubts, the need to implement the FAIR data principles, making biological data stored in InterMine instances more Findable, Accessible, Interoperable, and Reusable by both humans and machines,  as well as the huge impact that this achievement might have on the quality of biological data served by InterMine. Implementing some mechanisms that make data stored in InterMine FAIR, we provide a unique opportunity to make ALL data collection, served by nearly 30 public biological InterMine instances worldwide, FAIR.

This is a great chance and we didn’t want to miss it!

Here some important milestones we want to achieve along the journey:

  • Generate globally unique and stable URIs to identify InterMine data objects and register them in community bioinformatics repositories (for instance bio.tools and Identifiers.org) in order to provide more findable and accessible data.
  • Apply suitable ontologies to the core InterMine data model to make the semantic of InterMine data explicit and facilitate data exchange and interoperability
  • Embed metadata in InterMine web pages and add extra metadata to InterMine’s existing structured query results to make data more findable
  • Provide a RDF representation of data stored, lists and query results, and the bulk download of all InterMine in RDF form, in order to allow the users to import InterMine resources into their local triplestore
  • Provide an infrastructure for a SPARQL endpoint where the user can perform federated queries over multiple data sets
  • Improve accessibility of data licenses for integrated sources via web interface and REST web-service.

It will be an exciting challenge.  Follow us on this blog.