InterMine 3.1.1 – patch release

We’ve released a small batch of bug fixes and small features. Thank you so much to our contributors: Sam Hokin, Paulo Nuin and Joel Richardson!

Features

  • Added access to the GFF header in the GFF parser
  • The GFF sequence handler now has access to the feature
  • Added DOES NOT CONTAIN constraint
  • Added a few end points for BlueGenes

Fixes

  • InterPro source handles DTD correctly
  • Updated to new GitHub URL for Gretty Plugin
  • Fixed OMIM link outs
  • NCBI is going to update their GFF files (at our request! Thanks Wayne!)
    • fix spelling of feature “DNaseI_hypersensitive_site”
    • change “recombination_region” to “recombination_feature”
  • Updated external links on enrichment widget
  • Handle NULL search index correctly
  • Fix publication with NULL title
  • Fixed log library dependency conflict
  • Removed deprecated Yahoo login link
  • Fixed Panther source to handle proteins

Upcoming Releases

  • 3.1.2 – More small bug fixes
  • 4.0.0 – FAIR release

Docs

To update your mine with these new changes, see upgrade instructions. This is a non-disruptive release.

See release notes for detailed information.

We’re adopting a code of conduct!

TL;DR:

Please read our Code of Conduct draft and comment if you need to.

Longer:

A growing and important movement in open source communities is to adopt a code of conduct, which generally governs behaviour amongst community members, and provides backing to enforce necessary actions if anyone within the community behaves in an unacceptable or unwelcoming manner. We haven’t had any problems, and we’d like things to stay that way in the future.

If the past is anything to go by, we’ll set this code of conduct up and rarely or never need to enforce anything, but it’s better to have clear guidelines in place and not need them, than vice versa. We’d also like to get this in place before anything happens, rather than as an obvious too-late response to an incident – not that we’re anticipating anything!

The draft we’ve put together is adapted quite closely from the Django code of conduct. We’re particularly grateful to them for licensing it under a Creative Commons Attribution licence so we could re-use it.

Read the InterMine Code of Conduct draft here.

Questions?

If you’d like more info about codes of conduct – why they’re important, what topics they cover, etc., please see:

Comments, or questions that weren’t answered by the links above?

Feel free to comment on this post, tweet us, email yo@intermine.org, or info@intermine.org. Please comment by the 19th of March 2019.

 

Header image from flickr, taken by Mike McSharry and licenced under CC-BY-2.0 https://www.flickr.com/photos/mikemcsharry/5360225083/

Persistent identifiers (URI) and navigable URLs in InterMine

Local unique identifiers (LUIs)

A LUI (Local Unique Identifier) is an identifier guaranteed to be unique in a given local context (e.g. a single data collection) [Ref. https://doi.org/10.1371/journal.pbio.2001414]. InterMine’s existing local identifiers are based on an internal database ID; they are unique, but they are not preserved across database releases. For example, the ID 1007854, which currently identifies the gene zen in FlyMine, is not persistent: after the next build, the link http://www.flymine.org/flymine/report.do?id=1007854 will no longer be valid.

In InterMine, we have implemented new persistent local unique identifiers which are preserved across releases. They are based on the class types defined in the InterMine core model and on the external IDs assigned by the main data provider for each integrated data type.

Some examples are:
protein:P31946 (protein identifier)
publication:8829651 (PubMed identifier)
gene:MGI:1924206 (gene identifier)
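
As a rough illustration of the pattern in these examples, here is a minimal Python sketch that joins a class type and an external ID into a LUI, and splits one apart again. The function names make_lui and split_lui are made up for this post; they are not part of the InterMine code base.

```python
from typing import Tuple


def make_lui(class_name: str, external_id: str) -> str:
    """Build a LUI of the form '<class type>:<external ID>'."""
    if not class_name or not external_id:
        raise ValueError("both a class name and an external ID are required")
    # The examples in the text write the class type in lower case.
    return f"{class_name.lower()}:{external_id}"


def split_lui(lui: str) -> Tuple[str, str]:
    """Split a LUI at the first colon only, because external IDs such as
    'MGI:1924206' may contain colons themselves."""
    class_name, sep, external_id = lui.partition(":")
    if not sep or not class_name or not external_id:
        raise ValueError(f"not a valid LUI: {lui!r}")
    return class_name, external_id


print(make_lui("Protein", "P31946"))
print(make_lui("Gene", "MGI:1924206"))
print(split_lui("gene:MGI:1924206"))
```

Splitting at the first colon only is the important detail here: it keeps identifiers like MGI:1924206 intact.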

Persistent URIs

A URI (Uniform Resource Identifier) is an identifier which is unique on the web, not only within a local context as a LUI is. It is also actionable: if you paste it into the web address bar, you are redirected to the source. URIs need to be persistent in order to provide reliable sources that are always findable and accessible.

Some examples of persistent URIs are:
http://purl.uniprot.org/uniprot/P05455 where P05455 is the LUI for UniProt
http://identifiers.org/biosample/SAMEA104559033 where SAMEA104559033 is the LUI for biosample.

Where are persistent URIs going to be used by InterMine?
1. To mark up the web pages for search engines with Bioschemas.org or Schema.org types: the identifier attribute is set to the persistent URI in the DataCatalog, Dataset and BioChemEntity types.
2. To generate RDF: we need a persistent URI to set the subject of the generated triples.

We need to generate persistent URIs only if we create new entities. If a mine instance DOES NOT create new entities, it needs to re-use the existing URIs provided by the main source provider.
In FlyMine, for example, the RDF generated for the protein P05455 integrated from UniProt, which is the main resource provider for that data type, should be:

<http://purl.uniprot.org/uniprot/P05455> rdf:type <http://semanticscience.org/resource/SIO_010043> .
<http://purl.uniprot.org/uniprot/P05455> rdfs:label "Protein P05455" .
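
To make the re-use rule concrete, here is a small Python sketch that picks the subject URI (the provider's URI if there is one, otherwise a URI minted by the mine) and prints the two triples from the example. The helper names and the example.org fallback URI are assumptions for illustration only; this is not how InterMine's RDF generation is actually implemented.

```python
from typing import List, Optional

# Type URI taken from the example above.
PROTEIN_TYPE = "http://semanticscience.org/resource/SIO_010043"


def choose_subject(provider_uri: Optional[str], minted_uri: str) -> str:
    """Re-use the main provider's URI when one exists; otherwise fall back
    to a URI generated by the mine itself."""
    return provider_uri or minted_uri


def protein_triples(subject: str, accession: str) -> List[str]:
    """Return the two triples for a protein, written in the prefixed style
    used in the example above."""
    return [
        f"<{subject}> rdf:type <{PROTEIN_TYPE}> .",
        f'<{subject}> rdfs:label "Protein {accession}" .',
    ]


subject = choose_subject(
    provider_uri="http://purl.uniprot.org/uniprot/P05455",
    minted_uri="http://example.org/flymine/protein:P05455",  # hypothetical fallback
)
for line in protein_triples(subject, "P05455"):
    print(line)
```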

But how to generate persistent URIs?

There are different options for generating persistent URIs; the mine administrator will choose the option that is most suitable for the mine instance.

Option 1: Generate Persistent URIs using third party resolvers

In order to provide permanent URIs, we can configure the mine instance to use Identifiers.org as the PURL (permanent URI) provider. These are the steps to follow:

1. Register the mine instance in Identifiers.org as a data collection:

Namespace/prefix: LegumeMine
URI (assigned by Identifiers.org): http://identifiers.org/legumemine
Primary Resource: https://mines.legumeinfo.org/legumemine

2. In the mine instance, set the property identifier.uri.base to the URI assigned by Identifiers.org (e.g. http://identifiers.org/legumemine).

The URI generated by LegumeMine for the entity GeneticMarker with primary identifier 118M3 will be: http://identifiers.org/legumemine:geneticmarker:118M3. This is persistent, unique and actionable: if you paste it into the web browser address bar, Identifiers.org will redirect you to the navigable URL https://mines.legumeinfo.org/legumemine/geneticmarker:118M3.
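
The LegumeMine example above can be sketched in a few lines of Python. The two functions below are hypothetical helpers, not InterMine code; they only reproduce the string patterns visible in the example, with the base URI joined to the LUI by a colon and the primary resource joined to it by a slash.

```python
def persistent_uri(uri_base: str, class_name: str, primary_id: str) -> str:
    """Join the configured base URI (the value of identifier.uri.base)
    and the LUI with a colon, as in the LegumeMine example."""
    return f"{uri_base.rstrip('/')}:{class_name.lower()}:{primary_id}"


def navigable_url(primary_resource: str, class_name: str, primary_id: str) -> str:
    """Build the page URL that the resolver is expected to redirect to."""
    return f"{primary_resource.rstrip('/')}/{class_name.lower()}:{primary_id}"


print(persistent_uri("http://identifiers.org/legumemine", "GeneticMarker", "118M3"))
print(navigable_url("https://mines.legumeinfo.org/legumemine", "GeneticMarker", "118M3"))
```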

Option 2: Generate Persistent URIs by setting up a redirection system

A mine administrator might prefer to implement an in-house redirection system (a couple of lines in the Apache or nginx configuration files), setting up a PURL system similar to purl.uniprot.org.

In LegumeMine, for example, the permanent URI for the entity GeneticMarker with identifier 118M3 might be: http://purl.legumemine.org/legumemine/geneticmarker:118M3.
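
In practice the redirection would live in the web server configuration, but the logic of such a rule can be sketched in Python. This is only an illustration under assumptions: the pattern and the target host (borrowed from the Option 1 example) are guesses at what a LegumeMine administrator might configure, not an existing setup.

```python
import re
from typing import Optional

# Hypothetical rule: capture the LUI at the end of the permanent URI.
PURL_PATTERN = re.compile(
    r"^http://purl\.legumemine\.org/legumemine/(?P<lui>[a-z]+:.+)$"
)

# Assumed redirect target, taken from the Option 1 example.
NAVIGABLE_BASE = "https://mines.legumeinfo.org/legumemine"


def resolve(purl: str) -> Optional[str]:
    """Return the navigable URL for a permanent URI, or None when the
    URI does not match the redirection rule."""
    match = PURL_PATTERN.match(purl)
    if match is None:
        return None
    return f"{NAVIGABLE_BASE}/{match.group('lui')}"


print(resolve("http://purl.legumemine.org/legumemine/geneticmarker:118M3"))
```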

Option 3: Use navigable URLs

A mine administrator might decide to use the navigable URLs (see the next section) as permanent URIs.
An example is given by ZFIN where the URI http://zfin.org/ZDB-GENE-040718-423 coincides with the navigable URL.

Navigable persistent URLs

The navigable or access URLs are the URLs of the web pages to which users are redirected.
Examples of the new navigable URLs in FlyMine:
http://flymine.org/flymine/protein:P31946
http://flymine.org/flymine/publication:781465
http://flymine.org/flymine/gene:MGI:1924206

The new navigable URLs will not change at every build! We can guarantee they will be persistent by setting up a redirection system which resolves old URLs.

Navigable URLs usage

1. Permanent link button in the current report page
2. Permanent link button in BlueGenes, InterMine’s new user interface.
3. To mark up the web pages with Bioschemas.org types: the url attribute will be set to the navigable URL
4. To generate RDF: the field schema:mainEntityOfPage will be set with the persistent URL. For example:

<http://identifiers.org/MGI:97490> schema:mainEntityOfPage <https://mousemine.org/gene:MGI:97490> .

The new permanent URIs and URLs are scheduled for release in InterMine 4.0.

Being FAIR – data licences in InterMine

Image Licence: CC BY-ND 2.0, via flickr https://www.flickr.com/photos/juditk/4499834152/. No changes made to image.

In our ongoing effort to make the InterMine system more FAIR, we have started working on improving the accessibility of data licences, retaining licence information supplied by the data sources integrated in InterMine, and making it available to humans via our web application and machines via queries.

Open data licences

If you want to make your software open you need to:

  1. publish your software in a public space
  2. apply a suitable free and open source licence

The absence of a licence means that nobody can legally use, copy, reproduce or distribute your software.

The same applies to data. If you want to make your data open you need to:

  1. publish your data
  2. apply a suitable open data licence

Without a licence, users don’t know how they can use and re-use your data! Choosing a licence is not always easy, but there are already some open licences designed exclusively for data, developed by Open Data Commons (https://www.opendatacommons.org) and Creative Commons (https://creativecommons.org).

Data licences in InterMine

InterMine provides a library of data parsers for 26 popular data sets, e.g. NCBI, UniProt etc. We went through each of these core InterMine data sources and recorded the data licence for each. During this process we identified 3 cases:

Pie chart showing that 34.6% of data sources had licences, 53.8% had some licence information, and 11.5% had no licensing information at all.

Case 1: Data source had a data licence (34.6%)

Example: http://creativecommons.org/licenses/by/4.0/

Perfect! Ideally, all data sets would have licensed data.

Case 2: Data source had some information about how data can be reused (53.8%)

Example: https://www.ncbi.nlm.nih.gov/home/about/policies/

Good to have information on how to reuse the data, but these URLs might change. Also, in some cases the wording was vague or confusing, and the page itself was hard to find.

For example, one data provider has the statement “This work by our lab is licensed under …”. What does “this work” mean? Software? Data? Both? It wasn’t clear. Another data provider offers their data “free of all copyright restrictions”. How do we represent that?

Case 3: Data source had no information about how data can be reused (11.5%)

Example: Experimental data which has no data licence.

In cases where no data licence was listed and there was no information about how the data can be reused, we have emailed the data providers and asked for clarification.
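
As a quick sanity check on the three percentages, here is a short Python calculation. The per-case counts (9, 14 and 3) are not stated in this post; they are inferred here as the only whole-number split of the 26 data sources that matches the reported figures, so treat them as an assumption.

```python
TOTAL_SOURCES = 26  # number of core data sources mentioned above

# Inferred counts (an assumption, not figures given in the post).
case_counts = {
    "Case 1: data licence": 9,
    "Case 2: some reuse information": 14,
    "Case 3: no information": 3,
}


def percentage(count: int, total: int) -> str:
    """Format a share of the total as a percentage with one decimal place."""
    return f"{100 * count / total:.1f}%"


for case, count in case_counts.items():
    print(f"{case}: {count} of {TOTAL_SOURCES} = {percentage(count, TOTAL_SOURCES)}")
```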

Solutions

We have to find a way to provide data licensing information even though these data are inconsistent. And regardless of how popular data licenses become in the future, due to the integrative nature of InterMine, we’ll always have to handle all three cases.

What’s the best way to present these data in InterMine so that data consumers can easily understand how they can re-use data?

Possible options:

  1. Only provide URL to official data licence as recommended by voiD, the “Vocabulary Of Interlinked Datasets”
    1. URL will not change, e.g. http://creativecommons.org/licenses/by/4.0/
    2. Easy to ascertain permissiveness
    3. Easy to compare across data sets
  2. Provide URL to data licence OR to more information
    1. URL might change
    2. Useful because people can get details on allowed usage, even if there is no data licence
  3. Provide licence text and URL. Would provide more information immediately to users where there isn’t a licence
    1. Danger of being inaccurate or out of date
    2. User would not have to leave InterMine to see what’s allowed

What do you think? Please let us know your opinion – leave a comment on this post, pop by chat, or email our developer lists to discuss this further.

JavaScript everywhere – the BlueGenes Tool API version 1 is released!

If you attended the 2017 InterMine developer workshop, you may recall the discussion we had about embedding tools in InterMine’s new UI, BlueGenes. One of the biggest priorities was to make sure that it was easy and fun to create visualisations for your mine.

It’s taken a lot of sweat, toil, testing, and iteration, but I’m incredibly excited to announce that Version 1.0 of the Tool API is released today. This means that you’ll be able to view all of your favourite client-side tools in BlueGenes, hopefully with just a few quick tweaks. It’ll also be relatively straightforward to install tools that other people have created.

Preview of a protein report page for A0A0B4KEJ0_DROME, with the InterMine Cytoscape interaction viewer and ProtVista protein feature viewer both embedded.

 

You can try it out yourself, at bluegenes.apps.intermine.org – just search for a gene or protein report page.

How does it work?

We didn’t want to re-invent the wheel, and JavaScript definitely has package managers out there already. We use npm (node package manager) to package and install all BlueGenes tools. You can find all BlueGenes compatible tools by browsing the tag bluegenes-intermine-tool, and once you’ve configured your InterMine to work with the tool API, installing a new tool is often as simple as typing npm install @intermine/my-new-tool-name --save into a terminal. Equally, updating tools is as simple as running npm update from your tools folder. BlueGenes then looks inside a file in your tool folder called package.json, and outputs all installed tools listed there. package.json is an npm configuration file which contains a manifest of all the installed packages.
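
To give a feel for what reading the manifest involves, here is a minimal Python sketch that lists the packages recorded under "dependencies" in a package.json, which is where npm install --save writes them. It is an illustration only, not BlueGenes' actual implementation, and the folder name and second tool name in the sample manifest are made up.

```python
import json
from typing import List


def tools_from_manifest(manifest_text: str) -> List[str]:
    """Return the package names listed under 'dependencies', sorted by name."""
    manifest = json.loads(manifest_text)
    return sorted(manifest.get("dependencies", {}))


# Sample manifest with hypothetical names and versions.
example_manifest = """
{
  "name": "my-mine-tools",
  "dependencies": {
    "@intermine/my-new-tool-name": "^1.0.0",
    "@intermine/another-hypothetical-tool": "^0.2.1"
  }
}
"""

for tool in tools_from_manifest(example_manifest):
    print(tool)
```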

Getting started

Running tools in BlueGenes

If you have an InterMine that’s at least at version 2.1, you can start BlueGenes with the ./gradlew blueGenesStart task. See our docs for details.

It’s also possible to run tools on standalone BlueGenes.

Once tools are installed, you can see the entire list of them under the developer menu in BlueGenes (top right > click the cog > developer > tool “app store”).

Developing tools and converting your existing tools

So, you’ve gotten some tools installed and you’d like to add some more of your own? We have the full tool API specs, a tutorial to walk you through creating your first tool, and a nice tool-scaffolder yeoman generator that will create most of the boilerplate files you need automatically, so you can spend time on more important things like eating cake 🎂 and feeding the cat.

Some credits and thanks

A huge thanks to Vivek Krishnakumar for the very first draft of the Tool API specs, and to Josh Heimbach for further work on the spec. I’d also like to thank Julie, who patiently tested the Tool API installation process and helped me iron out a lot of the bugs.

Future plans

What’s next for BlueGenes and the Tool API? Well, we have some updates planned specifically for the Tool API, including: 

  • Extending the tool support to list result pages
  • Working with legacy tools that aren’t packaged as JavaScript modules
  • Better integration with BioJS

You can see a roadmap here for the Tool API. First, though, we’ll probably be thinking about some of the final bits of polish BlueGenes needs before it can be officially launched as a non-beta UI, including: 

  • Authentication. When the InterMine web services were initially implemented, it was with an eye to enabling data scientists and bioinformaticians to access data from InterMine. Some authentication-related services had been implemented, such as token-based authentication, but given that the web services weren’t designed with a full application layer in mind, we need to add some more, including user registration.
  • A MyMine section. We started with some prototypes late last year / early this year, but ended up rolling them back due to unexpected complications.
  • Speedier and more configurable report pages. Is there anything you’ve always wished for in a report page that’s not a tool? Feel free to ping us with ideas.

 

Questions, concerns, confusion, ideas?

Drop by the InterMine chat, email the group at info@intermine.org, or drop me a mail directly – yo@intermine.org.

InterMine 3.1 – Extending the Core InterMine Data Model with Multiple Genome Versions, Strains

Advances in sequencing technologies mean that genome sequence and annotation data for multiple strains of a species are now often available. We decided to update the InterMine core data model so that Strain data can be added where it is available, without affecting InterMines which do not have this data.

We decided to add a new class, Strain, which references Organism and SequenceFeature and is referenced by them in turn. This provides the flexibility required and leaves room for further data and expansion.

The Strain class has the following features/advantages:

  • SequenceFeature entities, such as Genes, would continue to reference Organism, but would also reference the new Strain class, allowing for queries returning SequenceFeatures for a specific strain.
  • Providing strain information as a separate class allows individual InterMines to reference other information as required, such as Genotype and Stocks.
  • The Strain class extends BioEntity so will include strain-relevant attributes such as PrimaryIdentifier and Name and will reference other collections such as synonym.
  • Minimal changes to the user interface will be required as, to our knowledge, SequenceFeatures in individual strains always have a unique identifier. With the help of templates if necessary, users will be able to identify particular SequenceFeatures and which strain they originate from.
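
The relationships described in this list can be sketched with a few Python dataclasses. This is a simplified illustration, not the actual InterMine model definition: the attribute names follow Python conventions, the reverse collections (for example Organism to Strain) are left out for brevity, and all identifiers in the example data are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class BioEntity:
    primary_identifier: str
    name: Optional[str] = None
    synonyms: List[str] = field(default_factory=list)


@dataclass
class Organism:
    name: str


@dataclass
class Strain(BioEntity):
    organism: Optional[Organism] = None


@dataclass
class SequenceFeature(BioEntity):
    organism: Optional[Organism] = None
    # Optional, so mines without strain data are unaffected.
    strain: Optional[Strain] = None


def features_for_strain(features: List[SequenceFeature], strain_id: str) -> List[SequenceFeature]:
    """Return the features that reference the strain with the given identifier."""
    return [
        f for f in features
        if f.strain is not None and f.strain.primary_identifier == strain_id
    ]


organism = Organism(name="Example organism")
strain_a = Strain("strain-A", organism=organism)
strain_b = Strain("strain-B", organism=organism)
features = [
    SequenceFeature("gene-1", organism=organism, strain=strain_a),
    SequenceFeature("gene-2", organism=organism, strain=strain_b),
    SequenceFeature("gene-3", organism=organism),  # no strain recorded
]

print([f.primary_identifier for f in features_for_strain(features, "strain-A")])
```

Making the strain reference optional is what keeps the change non-disruptive: a feature without a strain still behaves exactly as before.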

To update your mine with these new changes, see upgrade instructions. This is a non-disruptive release.

See release notes and the notes from the community call for more details. Please join our community calls if you’d like to be part of future data model decisions! (Details of upcoming calls are available via our developer mailing list).

Google Summer of Code 2019 – it’s time to get your thinking caps on!

TL;DR: Send us your awesome project ideas and/or volunteer to be a mentor!

Longer version:

We need your ideas!

GSoC 2019 has been announced, and as in 2017 and 2018, InterMine will be applying again to become a mentor organisation. This means we’re back at the “we want your project ideas!” phase, and we do! If you work with or use an InterMine and have ideas for its improvement, might one of them be big enough for a student to work on for three months? Any of these types of ideas would be great:

  • An interesting exploratory project that answers a question – “is x likely to be possible or practical with InterMine?”
  • Fixing something that’s always bothered you – in 2018, we managed this with the Solr Search update!
  • A well-scoped application like the InterMine iOS app
  • A set of JavaScript visualisations for your InterMine’s data.

Even if you don’t have time to mentor the project yourself, any ideas like this would be greatly appreciated! You can add them directly to the ideas doc, or feel free to contact us to chat about it more.

Could you mentor for an InterMine project?

In 2017 and 2018 we had mentors from the community, including mentors who were previously GSoC students. Ideally, interested mentors should be known to us, perhaps because you are an InterMine user, developer/maintainer/administrator, or a previous student. If you don’t have any project ideas of your own, you may be able to pick one from the project list that suits your skills and interests.

What is mentoring like, you might ask? We have the basics set out in our mentor terms and conditions (which aren’t as dry as the title suggests). Some things to note:

  • The busiest period is the application phase, when multiple students will be interacting with you to learn about and contribute to the community.
  • After this, things calm down a lot. You’re expected to meet (virtually) with your student at least once a week for the three months of the coding phase. Proactive students often only need an hour or two here or there, but other students may need more hands-on attention.
  • Students’ wages are paid by Google. Mentors also get a small stipend and thank-you gift after GSoC is complete.
  • You’ll be paired with a Cambridge-based mentor for support, guidance, and cover while on vacation.

A great opportunity all around

GSoC as a program ends up being an incredibly valuable two-way exchange. Students get three months of paid work experience at an open source organisation, and on the other side InterMine and InterMine mentors end up with the chance to guide projects and see some truly fantastic work implemented. Promising students might even end up applying for vacancies when they come up – it’s a great way to broaden your community!