Call recording available: GSoC 2019 Final Presentations

Our Google Summer of Code students presented their work at a special edition of the community call yesterday. You can catch up on the entire recording on YouTube – or scroll down to see individual presentations. The agenda and notes accompanying the call (including code and slides links) are in Google Docs.

Prabodh Kotasthane – Spring Migration

Prabodh’s presentation starts at 3:54: https://youtu.be/ZzV6JmVRQmA?t=234

Slides

Ankur Kumar – InterMine Cloud

Ankur’s presentation starts at 13:12: https://youtu.be/ZzV6JmVRQmA?t=792

Laksh Singla – Upgrading imjs & im-tables

Laksh’s presentation starts at 21:08: https://youtu.be/ZzV6JmVRQmA?t=1268

Rahul Yadav – Single Sign-In

Rahul’s presentation starts at 27:39: https://youtu.be/ZzV6JmVRQmA?t=1659

Deepak Kumar – InterMine Schema Validator

Deepak’s presentation starts at 34:11: https://youtu.be/ZzV6JmVRQmA?t=2051

Akshat Bhargava – Data Visualisations

Akshat’s presentation starts at 41:30: https://youtu.be/ZzV6JmVRQmA?t=2490

InterMine 4.0 – InterMine as a FAIR framework

We are excited to publish the latest version of InterMine, version 4.0.

It’s a collection of our efforts to make InterMine more “FAIR”. As an open source data warehouse, InterMine’s raison d’être is to be a framework that enables people to quickly and easily provide public access to their data in a user-friendly manner. InterMine has therefore always strived to make data Findable, Accessible, Interoperable and Reusable, and this push is meant to formally apply the FAIR principles to InterMine.

What’s included in this release?

  1. Generate globally unique and stable URLs to identify InterMine data objects in order to provide more findable and accessible data.
  2. Apply suitable ontologies to the core InterMine data model to make the semantics of InterMine data explicit and facilitate data exchange and interoperability.
  3. Embed metadata in InterMine web pages to make data more findable.
  4. Improve accessibility of data licences for integrated sources via web interface and REST web-service.

More details below!

How to upgrade?

This is a non-disruptive release, but there are additions to the data model. Therefore, you’ll want to increment your version, then build a new database when upgrading. No other action is required.

However, keep reading for how to take advantage of the new FAIR features in this release.

Unique and stable URLs

We’ve added a beautiful new user-friendly URL.

Example: http://beta.flymine.org/beta/gene:FBgn0000606

Currently this is used only in the “share” button on the report pages and in the web page markup. In the future, this will be the only URL seen in the browser location bar.

For details on how to configure your mine’s URLs, see the docs here.

See our previous blog posts on unique identifiers.

Decorating the InterMine data model with ontology terms

InterMine 4.0 introduces the ability to annotate your InterMine data model with ontology terms.

While these data are not used (yet), it’s an important feature in that it’s going to facilitate cross-InterMine querying, and eventually cross-database analysis — allowing us to answer questions like “Is the ‘gene’ in MouseMine the same ‘gene’ at the EBI?”.

For details on how to add ontologies to your InterMine data model, see the docs here.

Embedding metadata in InterMine webpages

We’ve added structured data to web pages in JSON-LD format to make data more findable, and these data are indexed by Google data search. Bioschemas.org is extending Schema.org with life science-specific types, adding required properties and cardinality on the types. For more details see the docs here.

By default this feature is disabled. For details on how to enable embedding metadata in your webpages, see the docs here.
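
As a rough illustration of what embedding JSON-LD in a web page involves, here is a minimal Python sketch. It is not InterMine's implementation: the function name, the chosen properties and the example.org values are all assumptions made for this example. It builds a Schema.org-style description and wraps it in the script element conventionally used for JSON-LD.

```python
import json


def json_ld_script(entity_type: str, name: str, identifier: str, url: str) -> str:
    """Return an HTML snippet embedding a JSON-LD description (hypothetical helper)."""
    document = {
        "@context": "https://schema.org",
        "@type": entity_type,
        "name": name,
        "identifier": identifier,  # e.g. a persistent URI
        "url": url,                # e.g. the page a user can visit
    }
    # Escape "</" so the JSON can never close the script element early.
    body = json.dumps(document, indent=2).replace("</", "<\\/")
    return f'<script type="application/ld+json">\n{body}\n</script>'


# Placeholder values only; a real mine would supply its own.
print(json_ld_script(
    "Dataset",
    "Example data set",
    "http://example.org/mine/dataset:1",
    "http://example.org/mine/dataset:1",
))
```

Search engines that understand structured data read the contents of this script element rather than the visible page text.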

Data licences

In our ongoing effort to make the InterMine system more FAIR, we have started working on improving the accessibility of data licences, retaining licence information supplied by the data sources integrated in InterMine, and making it available to humans via our web application and machines via queries.

See our previous blog post on data licences.

For details on how to add data licences to your InterMine, see the docs.

Future FAIR plans

  1. Provide an RDF representation of stored data, lists and query results, and a bulk download of all InterMine data in RDF form, so that users can import InterMine resources into their local triplestore.
  2. Provide an infrastructure for a SPARQL endpoint where the user can perform federated queries over multiple data sets.

Upcoming Releases

The next InterMine version will likely be ready in the Fall/Winter and include some user interface updates.

Docs

To update your mine with these new changes, see upgrade instructions. This is a non-disruptive release.

See release notes for detailed information.

InterMine 3.1.2 – patch release

We’ve released a small batch of bug fixes and small features. Thank you so much to our contributors: Sam Hokin, Arunan Sugunakumar and Joe Carlson!

Features

  • Templates can be tagged by any user, not just the super user. (Via webservice only – for now)

Fixes

  • When searching our docs, sometimes the “.html” extension was dropped. This was fixed by our beautiful documentation hosters – readthedocs.org
  • Installing the “bio” project via Gradle does not fail if you do not have the test properties file.
  • Gradle logs error fixed
  • Removed old GAF 1.0 code
  • Fixed XML library issue:  java.lang.ClassCastException for org.apache.xerces
  • Set converter.class correctly
  • Updated the protein atlas expression graph
  • Handle NULL values returned by NCBI web services
  • Updated Solr to support new Solr versions
  • Removed unneeded Gretty plugin
  • Better error handling for CHEBI web services
  • Publication abstract is longer than postgres index
  • Removed phenotype key; it’s not in the core model and has a conflicting key
  • Updated ObjectStoreSummary to handle ignored fields consistently.

Upcoming Releases

InterMine 4.0 is scheduled for release the week of 7 May 2019.

Docs

To update your mine with these new changes, see upgrade instructions. This is a non-disruptive release.

See release notes for detailed information.

Data integration and Machine Learning for drug target validation

Hi!

In this blog post I would like to give a brief overview of what I’m currently working on.

Knowledge Transfer Partnership: what & why?

First, to give context to this post: last year InterMine at the University of Cambridge and STORM Therapeutics, a spin-out of the University of Cambridge working on small modulating RNA enzymes for the treatment of cancer, were awarded a Knowledge Transfer Partnership (KTP) from the UK Government (read this post for more information). With this award, the objective is to help STORM Therapeutics advance their efforts in cancer research, and contribute to their ultimate goal of drug target validation.

As part of the KTP Award, a KTP Associate needs to be appointed by both the knowledge base (University of Cambridge) and the company (STORM). The KTP Associate acts as the KTP Project Manager and is in charge of the successful delivery of the project. For this project, I was appointed as the KTP Associate, with a Research Software Engineer / Research Associate role at the University of Cambridge, for the total duration of the project: 3 years.

Machine learning and a new mine: StormMine

Now that you know what the KTP project is about, and who is delivering it, let’s move on to more interesting matters. To deliver this project successfully, the idea is to use the InterMine data warehouse to build a knowledge base for the company, STORM, that gives their scientists all the data relevant to their research in a single, integrated place. For this reason, several new data sources will be integrated into STORM’s deployment of the InterMine data warehouse (StormMine, from now on), and appropriate data visualizations will be added.

Then, once the data is integrated, we can turn to analysing it to gather insights that support the company’s goals, for example by applying statistical and Machine Learning methods and by building computational intelligence models. This leads to what I’ve been working on since my start in February, and will continue to work on until July 2019.

In general terms, I’m currently focused on building Machine Learning models that can learn to differentiate between known drug targets and non-targets from available biological data. This part of the work will form my Master’s Thesis, which I hope to deliver in July! Moreover, this analysis will let us answer three questions that are extremely relevant for STORM and that are guiding the current work on the project:

  1. Which are the most promising target genes for a cancer type?
  2. Which features are most informative in predicting novel targets?
  3. Given a gene, for which cancer types is it most relevant?

If you are interested in learning more about this work, stay tuned for the next posts, and don’t hesitate to contact me, either by email (ar989@cam.ac.uk) or on LinkedIn (click here)!

 

GSoC 2019 with InterMine is ON!

After the fabulous experience we’ve had with GSoC in 2017 and 2018, we’re delighted to announce that we’ll be mentoring again this year. It’s almost impossible to describe the breadth of experience, quality, and insight students bring us every year and we’re so excited to meet a whole new batch of students again in 2019.

Prospective student?

If you’re a student interested in working with us, your first port of call is our GSoC site. Most of our students hang out at chat.intermine.org too.

We have a Q&A webinar coming up on March 12, 2019 at 3PM UK time (when is it in your timezone?) where we’ll share tips for good applications, GSoC alumni from previous years will share their experiences, and we’ll briefly describe all of the project ideas and answer any questions. If you can’t make it, add your questions to the agenda before the call and we’ll answer them during the call anyway! Here’s the agenda and joining instructions.

Interested in mentoring?

Generally we expect mentors to come from our community – InterMine users, developers, or previous students. If you fit into one of those categories and want to help mentor, email yo@intermine.org. Not sure if you’d be a good fit? We’re still happy to discuss any ideas!

InterMine 3.1.1 – patch release

We’ve released a small batch of bug fixes and small features. Thank you so much to our contributors: Sam Hokin, Paulo Nuin and Joel Richardson!

Features

  • Added access to the GFF header in the GFF parser
  • GFF sequence handler has access to feature now
  • Added DOES NOT CONTAIN constraint
  • Added a few end points for BlueGenes

Fixes

  • InterPro source handles DTD correctly
  • Updated to new GitHub URL for Gretty Plugin
  • Fixed OMIM link outs
  • NCBI is going to update their GFF files (at our request! Thanks Wayne!)
    • fix spelling on feature “DNaseI_hypersensitive_site”
    • change “recombination_region” to “recombination_feature”
  • Updated external links on enrichment widget
  • Handle NULL search index correctly
  • Fix publication with NULL title
  • Fixed log library dependency conflict
  • Removed deprecated Yahoo login link
  • Fixed Panther source to handle proteins

Upcoming Releases

  • 3.1.2 – More small bug fixes
  • 4.0.0 – FAIR release

Docs

To update your mine with these new changes, see upgrade instructions. This is a non-disruptive release.

See release notes for detailed information.

We’re adopting a code of conduct!

TL;DR:

Please read our Code of Conduct draft and comment if you need to.

Longer:

A growing and important movement in open source communities is to adopt a code of conduct, which generally governs behaviour amongst community members, and provides backing to enforce necessary actions if anyone within the community behaves in an unacceptable or unwelcoming manner. We haven’t had any problems, and we’d like things to stay that way in the future.

If the past is anything to go by, we’ll set this code of conduct up and rarely or never need to enforce anything, but it’s better to have clear guidelines in place and not need them, than vice versa. We’d also like to get this in place before anything happens, rather than as an obvious too-late response to an incident – not that we’re anticipating anything!

The draft we’ve put together is adapted quite closely from the Django code of conduct. We’re particularly grateful to them for licensing it under a Creative Commons attribution licence so we could re-use it.

Read the InterMine Code of Conduct draft here.

Questions?

If you’d like more info about codes of conduct – why they’re important, what topics they cover, etc., please see:

Comments, or questions that weren’t answered by the links above?

Feel free to comment on this post, tweet us, email yo@intermine.org, or info@intermine.org. Please comment by the 19th of March 2019.

 

Header image from flickr, taken by Mike McSharry and licensed under CC-BY-2.0 https://www.flickr.com/photos/mikemcsharry/5360225083/

Persistent identifiers (URI) and navigable URLs in InterMine

Local unique identifiers (LUIs)

A LUI (Local Unique Identifier) is an identifier guaranteed to be unique in a given local context (e.g. a single data collection) [Ref. https://doi.org/10.1371/journal.pbio.2001414]. InterMine’s existing local identifiers are based on an internal database ID; they are unique but they are not preserved across database releases. For example, the ID 1007854, which currently identifies the gene zen in FlyMine, is not persistent; after the next build the link http://www.flymine.org/flymine/report.do?id=1007854 will no longer be valid.

In InterMine, we have implemented new persistent local unique identifiers which are preserved across releases; they are based on the class types defined in the InterMine core model and the external IDs from the main provider of the integrated data.

Some examples are:
protein:P31946 (protein identifier)
publication:8829651 (PubMed identifier)
gene:MGI:1924206 (gene identifier)
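
The class:identifier pattern in these examples can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Lui` type and both function names are invented here and are not part of InterMine. The one detail worth showing is that parsing has to split on the first colon only, because external identifiers such as MGI:1924206 contain colons themselves.

```python
from typing import NamedTuple


class Lui(NamedTuple):
    """Hypothetical container for a local unique identifier."""
    class_name: str   # class type from the core model, lower-cased
    external_id: str  # identifier assigned by the main data provider


def format_lui(class_name: str, external_id: str) -> str:
    # Join the lower-cased class name and the external ID with a colon.
    return f"{class_name.lower()}:{external_id}"


def parse_lui(text: str) -> Lui:
    # Split on the FIRST colon only, so "gene:MGI:1924206" keeps the
    # full external ID "MGI:1924206".
    class_name, sep, external_id = text.partition(":")
    if not sep or not class_name or not external_id:
        raise ValueError(f"not a class:identifier pair: {text!r}")
    return Lui(class_name, external_id)


print(format_lui("Gene", "MGI:1924206"))
print(parse_lui("gene:MGI:1924206"))
```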

Persistent URIs

A URI (Uniform Resource Identifier) is an identifier which is unique on the web, not only within the local context as a LUI is, and actionable: if you paste it into the web address bar you are redirected to the source. URIs need to be persistent in order to provide reliable sources that are always findable and accessible.

Some examples of persistent URIs are:
http://purl.uniprot.org/uniprot/P05455 where P05455 is the LUI for UniProt
http://identifiers.org/biosample/SAMEA104559033 where SAMEA104559033 is the LUI for biosample.

Where are Permanent URIs going to be used by InterMine?
1. To mark up the web pages for search engines with Bioschemas.org or Schema.org types: set the identifier attribute with the persistent URI in DataCatalog, Dataset and BioChemEntity types.
2. To generate RDF: we need persistent URI to set the subject in the triples generated.

We need to generate persistent URIs only if we create new entities. If a mine instance DOES NOT create new entities, it needs to re-use the existing URIs provided by the main source provider.
In FlyMine, for example, the RDF generated for the protein P05455 integrated from UniProt, which is the main resource provider for that data type, should be:

<http://purl.uniprot.org/uniprot/P05455> rdf:type <http://semanticscience.org/resource/SIO_010043> .
<http://purl.uniprot.org/uniprot/P05455> rdfs:label "Protein P05455" .
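
For readers who want to see how such triples might be produced, here is a minimal Python sketch that writes them in N-Triples form, with the rdf: and rdfs: prefixes expanded to their full IRIs. The function names are assumptions made for this example, and this is not how InterMine itself generates RDF. The point it shows is that the subject is the provider's existing URI, not a new one minted by the mine.

```python
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
RDFS_LABEL = "http://www.w3.org/2000/01/rdf-schema#label"


def iri(value: str) -> str:
    # N-Triples writes IRIs between angle brackets.
    return f"<{value}>"


def literal(value: str) -> str:
    # Escape backslashes and double quotes, then wrap in straight quotes.
    escaped = value.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def describe(subject: str, type_iri: str, label: str) -> list[str]:
    # One triple for the type, one for the human-readable label.
    return [
        f"{iri(subject)} {iri(RDF_TYPE)} {iri(type_iri)} .",
        f"{iri(subject)} {iri(RDFS_LABEL)} {literal(label)} .",
    ]


for line in describe(
    "http://purl.uniprot.org/uniprot/P05455",  # re-used UniProt URI
    "http://semanticscience.org/resource/SIO_010043",
    "Protein P05455",
):
    print(line)
```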

But how to generate persistent URIs?

There are different options for generating persistent URIs, and the mine administrator will choose the option most suitable for the mine instance.

Option 1: Generate Persistent URIs using third party resolvers

In order to provide permanent URIs, we can configure the mine instance to use Identifiers.org as PURL (permanent URI) provider. These are the steps to follow:

1. register the mine instance in Identifiers.org as data collection

Namespace/prefix: LegumeMine
URI (assigned by Identifiers.org): http://identifiers.org/legumemine
Primary Resource: https://mines.legumeinfo.org/legumemine

2. set, in the mine instance, the property identifier.uri.base with the URI assigned by Identifiers.org (e.g. http://identifiers.org/legumemine).

The URI, generated by LegumeMine, for the entity GeneticMarker with primary identifier 118M3 will be: http://identifiers.org/legumemine:geneticmarker:118M3. This is persistent, unique and actionable: if you paste it in the web browser address bar you will be redirected by identifiers.org to the navigable URL: https://mines.legumeinfo.org/legumemine/geneticmarker:118M3.
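
The LegumeMine example suggests a simple composition rule, sketched below in Python. The two function names are hypothetical and the patterns are inferred from the example URLs above, not taken from InterMine's code: the persistent URI is the value of identifier.uri.base followed by the lower-cased class name and the identifier, and the navigable URL is the mine's address followed by the same class:identifier pair.

```python
def persistent_uri(uri_base: str, class_name: str, identifier: str) -> str:
    # Pattern inferred from the example: <identifier.uri.base>:<class>:<id>
    return f"{uri_base.rstrip('/')}:{class_name.lower()}:{identifier}"


def navigable_url(mine_url: str, class_name: str, identifier: str) -> str:
    # Pattern inferred from the example: <mine URL>/<class>:<id>
    return f"{mine_url.rstrip('/')}/{class_name.lower()}:{identifier}"


print(persistent_uri("http://identifiers.org/legumemine", "GeneticMarker", "118M3"))
print(navigable_url("https://mines.legumeinfo.org/legumemine", "GeneticMarker", "118M3"))
```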

Option 2 – Generate Persistent URIs setting a redirection system

A mine administrator might prefer to implement an in-house redirection system (a couple of lines in Apache or nginx configuration files), setting up a PURL system similar to purl.uniprot.org.

In LegumeMine, for example, the permanent URI, for the entity GeneticMarker with identifier 118M3 might be: http://purl.legumemine.org/legumemine/geneticmarker:118M3.
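
In practice the redirection would live in the web server configuration, but the mapping it performs can be illustrated with a short Python sketch. The host name and target address below are assumptions taken from the LegumeMine examples in this post, and the `resolve` function is invented for illustration: it keeps the path of the permanent URI and swaps in the address of the running mine.

```python
from urllib.parse import urlsplit

# Both values are assumptions based on the LegumeMine examples.
PURL_HOST = "purl.legumemine.org"
MINE_BASE = "https://mines.legumeinfo.org"


def resolve(purl: str) -> str:
    """Map a permanent URI to the navigable URL it should redirect to."""
    parts = urlsplit(purl)
    if parts.netloc != PURL_HOST:
        raise ValueError(f"not a PURL for this mine: {purl!r}")
    # Keep the path (mine name + class:identifier) and swap the host.
    return f"{MINE_BASE}{parts.path}"


print(resolve("http://purl.legumemine.org/legumemine/geneticmarker:118M3"))
```

If the mine ever moves, only the target address changes; the permanent URIs that people have already cited stay the same.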

Option 3 – Use navigable URLs

A mine administrator might decide to use the navigable URLs (see the next section) as permanent URIs.
An example is given by ZFIN where the URI http://zfin.org/ZDB-GENE-040718-423 coincides with the navigable URL.

Navigable persistent URLs

The navigable or access URLs are the URLs of the web pages to which users are redirected.
Examples of the new navigable URLs in FlyMine:
http://flymine.org/flymine/protein:P31946
http://flymine.org/flymine/publication:781465
http://flymine.org/flymine/gene:MGI:1924206

The new navigable URLs will not change at every build! We can guarantee they will be persistent by setting up a redirection system which resolves old URLs.

Navigable URLs usage

1. Permanent link button in the current report page
2. Permanent link button in BlueGenes, InterMine’s new user interface.
3. To markup the web pages with Bioschemas.org type: url attribute will be set with the navigable URL
4. To generate RDF: the field schema:mainEntityOfPage will be set with the persistent URL. For example:

<http://identifiers.org/MGI:97490> schema:mainEntityOfPage  <https://mousemine.org/gene:MGI:97490>

The new permanent URIs and URLs are scheduled for release in InterMine 4.0.

Being FAIR – data licences in InterMine

Image Licence: CC BY-ND 2.0, via flickr https://www.flickr.com/photos/juditk/4499834152/. No changes made to image.

In our ongoing effort to make the InterMine system more FAIR, we have started working on improving the accessibility of data licences, retaining licence information supplied by the data sources integrated in InterMine, and making it available to humans via our web application and machines via queries.

Open data licences

If you want to make your software open you need to:

  1. publish your software in a public space
  2. apply a suitable free and open source licence

The absence of a licence means that nobody can legally use, copy, reproduce or distribute your software.

The same goes for data. If you want to make your data open you need to:

  1. publish your data
  2. apply a suitable open data licence

Without a licence the users don’t know how to use and re-use your data! Choosing a licence is not always easy, but there are already some open licences designed exclusively for data, developed by Open Data Commons (https://www.opendatacommons.org) and Creative Commons (https://creativecommons.org).

Data licences in InterMine

InterMine provides a library of data parsers for 26 popular data sets, e.g. NCBI, UniProt etc. We went through each of these core InterMine data sources and recorded the data licence for each. During this process we identified 3 cases:

[Pie chart showing that 34.6% of data sources had licences, 53.8% had some licence information, and 11.5% had no licensing information at all.]

Case 1: Data source had a data licence (34.6%)

Example: http://creativecommons.org/licenses/by/4.0/

Perfect! Ideally all data sets would have licensed data.

Case 2: Data source had some information about how data can be reused (53.8%)

Example: https://www.ncbi.nlm.nih.gov/home/about/policies/

Good to have information on how to reuse the data, but these URLs might change. Also, in some cases the wording was vague or confusing, and the page itself was hard to find.

For example, one data provider has a statement “This work by our lab is licensed under …”. What does “this work” mean? Software? Data? Both? It wasn’t clear. Another data provider offers their data “free of all copyright restrictions”. How do we represent that?

Case 3: Data source had no information about how data can be reused (11.5%)

Example: Experimental data which has no data licence.

In cases where no data licence was listed and there was no information about how data can be reused, we have emailed the data providers and asked for clarification.
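
As a quick arithmetic check, the three percentages are consistent with a split of 9, 14 and 3 sources out of the 26. Note that these counts are not stated in this post; they are inferred here as the whole-number split that reproduces the reported figures, so treat them as an assumption.

```python
TOTAL_SOURCES = 26

# Assumed counts per case (inferred from the percentages, not stated in the post).
counts = {
    "data licence": 9,
    "some reuse information": 14,
    "no information": 3,
}


def share(n: int, total: int) -> str:
    # Percentage of the total, formatted to one decimal place.
    return f"{100 * n / total:.1f}"


# The three cases should account for every source exactly once.
assert sum(counts.values()) == TOTAL_SOURCES

for case, n in counts.items():
    print(f"{case}: {n}/{TOTAL_SOURCES} = {share(n, TOTAL_SOURCES)}%")
```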

Solutions

We have to find a way to provide data licensing information even though these data are inconsistent. And regardless of how popular data licences become in the future, due to the integrative nature of InterMine, we’ll always have to handle all three cases.

What’s the best way to present these data in InterMine so that data consumers can easily understand how they can re-use data?

Possible options:

  1. Only provide URL to official data licence as recommended by voiD, the “Vocabulary Of Interlinked Datasets”
    1. URL will not change, e.g. http://creativecommons.org/licenses/by/4.0/
    2. Easy to ascertain permissiveness
    3. Easy to compare across data sets
  2. Provide URL to data licence OR to more information
    1. URL might change
    2. Useful because people can get details on allowed usage, even if there is no data licence
  3. Provide licence text and URL. Would provide more information immediately to users where there isn’t a licence
    1. Danger of being inaccurate or out of date
    2. User would not have to leave InterMine to see what’s allowed
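
To make the trade-offs between these options more concrete, here is a small Python sketch of a record that could hold any of them. It is purely illustrative: the `LicenceInfo` class and its field names are assumptions made for this example and do not describe InterMine's data model. A source may carry an official licence URL (case 1), a link to reuse information (case 2), nothing at all (case 3), and optionally a copy of the licence text (option 3).

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class LicenceInfo:
    """Hypothetical record; field names are illustrative only."""
    source: str
    licence_url: Optional[str] = None  # case 1: official data licence
    info_url: Optional[str] = None     # case 2: reuse or policy page
    text: Optional[str] = None         # option 3: stored licence text

    def display(self) -> str:
        if self.licence_url:
            line = f"{self.source}: licence {self.licence_url}"
        elif self.info_url:
            line = f"{self.source}: reuse information {self.info_url}"
        else:
            line = f"{self.source}: no licence information"
        # Option 3: show stored text so users need not leave the mine.
        return f"{line} ({self.text})" if self.text else line


records = [
    LicenceInfo("Source A", licence_url="http://creativecommons.org/licenses/by/4.0/"),
    LicenceInfo("NCBI", info_url="https://www.ncbi.nlm.nih.gov/home/about/policies/"),
    LicenceInfo("Source C"),
]
for record in records:
    print(record.display())
```

Keeping the licence URL and the information URL in separate fields means permissiveness can still be compared across data sets wherever a real licence exists.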

What do you think? Please let us know your opinion – leave a comment on this post, pop by chat, or email our developer lists to discuss this further.

InterMine 3.1 – Extending the Core InterMine Data Model with Multiple Genome Versions, Strains

Advances in sequencing technologies mean that genome sequence and annotation data for multiple strains of a species are now often available. It was decided to update the InterMine core data model so that Strain data can be added where it is available, without affecting InterMines that do not have this data.

It was decided that the addition of a new class, Strain, which is referenced by Organism and SequenceFeature and vice versa, would give the flexibility required and leave room for further data and expansion if required.

The Strain class has the following features/advantages:

  • SequenceFeature entities, such as Genes, would continue to reference Organism, but would also reference the new Strain class, allowing for queries returning SequenceFeatures for a specific strain.
  • Providing strain information as a separate class allows individual InterMines to reference other information as required, such as Genotype and Stocks.
  • The Strain class extends BioEntity so will include strain-relevant attributes such as PrimaryIdentifier and Name and will reference other collections such as synonym.
  • Minimal changes to the user interface will be required as, to our knowledge, SequenceFeatures in individual strains always have a unique identifier. With the help of templates if necessary, users will be able to identify particular SequenceFeatures and which strain they originate from.
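
The relationships described above can be sketched with a few Python dataclasses. This is an illustration only: InterMine's real model is not defined this way, the attribute names are simplified, and the example organism, strains and genes are invented. It shows that a SequenceFeature keeps its Organism reference, gains an optional Strain reference, and can then be filtered by strain.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Organism:
    name: str


@dataclass(frozen=True)
class BioEntity:
    primary_identifier: str
    name: Optional[str] = None


@dataclass(frozen=True)
class Strain(BioEntity):
    # Strain extends BioEntity and points back to its organism.
    organism: Optional[Organism] = None


@dataclass(frozen=True)
class SequenceFeature(BioEntity):
    organism: Optional[Organism] = None
    # Stays None in mines that have no strain data.
    strain: Optional[Strain] = None


def features_for_strain(features, strain):
    """Return only the features that reference the given strain."""
    return [f for f in features if f.strain == strain]


# Invented example data.
organism = Organism("Example organism")
strain_a = Strain("strain-A", organism=organism)
strain_b = Strain("strain-B", organism=organism)
features = [
    SequenceFeature("gene-1-A", organism=organism, strain=strain_a),
    SequenceFeature("gene-1-B", organism=organism, strain=strain_b),
    SequenceFeature("gene-2", organism=organism),  # no strain recorded
]
print([f.primary_identifier for f in features_for_strain(features, strain_a)])
```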

To update your mine with these new changes, see upgrade instructions. This is a non-disruptive release.

See release notes and the notes from the community call for more details. Please join our community calls if you’d like to be part of future data model decisions! (Details of upcoming calls are available via our developer mailing list).