Persistent identifiers (URI) and navigable URLs in InterMine

Local unique identifiers (LUIs)

A LUI (Local Unique Identifier) is an identifier guaranteed to be unique in a given local context (e.g. a single data collection) [Ref. https://doi.org/10.1371/journal.pbio.2001414]. InterMine’s existing local identifiers are based on an internal database ID; they are unique but they are not preserved across database releases. For example, the ID 1007854, which currently identifies the gene zen in FlyMine, is not persistent; after the next build the link http://www.flymine.org/flymine/report.do?id=1007854 will no longer be valid.

In InterMine, we have implemented new persistent local unique identifiers which are preserved across releases; they are based on the class types defined in the InterMine core model and on the external IDs assigned by the main integrated data source provider.

Some examples are:
protein:P31946 (protein identifier)
publication:8829651 (PubMed identifier)
gene:MGI:1924206 (gene identifier)
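As a minimal sketch of the scheme above, and assuming the LUI is simply the lowercased class name joined to the external ID with a colon (which is what the examples suggest), the construction could look like this. The function names make_lui and parse_lui are hypothetical and are not part of the InterMine API.

```python
def make_lui(class_name: str, external_id: str) -> str:
    """Build a LUI such as 'protein:P31946' from a class name and an external ID."""
    if not class_name or not external_id:
        raise ValueError("both a class name and an external ID are required")
    # Assumption: the class name is lowercased; the external ID is kept verbatim.
    return f"{class_name.lower()}:{external_id}"


def parse_lui(lui: str) -> tuple:
    """Split a LUI on its first colon only, since the external ID may contain colons."""
    class_name, separator, external_id = lui.partition(":")
    if not separator or not class_name or not external_id:
        raise ValueError(f"not a valid LUI: {lui!r}")
    return class_name, external_id


print(make_lui("Protein", "P31946"))
print(make_lui("Publication", "8829651"))
print(make_lui("Gene", "MGI:1924206"))
print(parse_lui("gene:MGI:1924206"))
```

Splitting on the first colon only matters for identifiers such as MGI:1924206, which carry their own prefix.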

Persistent URIs

A URI (Uniform Resource Identifier) is an identifier which is unique on the web, not only within the local context as a LUI is, and actionable: if you paste it into the web browser address bar, you are redirected to the source. URIs need to be persistent in order to provide reliable sources that are always findable and accessible.

Some examples of persistent URIs are:
http://purl.uniprot.org/uniprot/P05455 where P05455 is the LUI for UniProt
http://identifiers.org/biosample/SAMEA104559033 where SAMEA104559033 is the LUI for biosample.

Where are Permanent URIs going to be used by InterMine?
1. To mark up the web pages for search engines with Bioschemas.org or Schema.org types: set the identifier attribute to the persistent URI in the DataCatalog, Dataset and BioChemEntity types.
2. To generate RDF: we need persistent URIs to set the subject of the generated triples.

We need to generate persistent URIs only if we create new entities. If a mine instance DOES NOT create new entities, it needs to re-use the existing URIs provided by the main source provider.
In FlyMine, for example, the RDF generated for the protein P05455 integrated from UniProt, which is the main resource provider for that data type, should be:

<http://purl.uniprot.org/uniprot/P05455> rdf:type <http://semanticscience.org/resource/SIO_010043> .
<http://purl.uniprot.org/uniprot/P05455> rdfs:label "Protein P05455" .
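To illustrate the idea of re-using the provider's URI as the subject, here is a small sketch that writes the two statements above as N-Triples lines with the rdf: and rdfs: prefixes expanded to full IRIs. The function protein_triples is hypothetical and only mirrors the example; it is not how InterMine generates its RDF.

```python
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
RDFS_LABEL = "http://www.w3.org/2000/01/rdf-schema#label"
# Class IRI taken from the example above.
SIO_PROTEIN_CLASS = "http://semanticscience.org/resource/SIO_010043"


def escape_literal(text: str) -> str:
    """Escape backslashes and double quotes for use inside an N-Triples literal."""
    return text.replace("\\", "\\\\").replace('"', '\\"')


def protein_triples(subject_uri: str, accession: str) -> list:
    """Return N-Triples lines that use the provider's own URI as the subject."""
    label = escape_literal(f"Protein {accession}")
    return [
        f"<{subject_uri}> <{RDF_TYPE}> <{SIO_PROTEIN_CLASS}> .",
        f'<{subject_uri}> <{RDFS_LABEL}> "{label}" .',
    ]


for line in protein_triples("http://purl.uniprot.org/uniprot/P05455", "P05455"):
    print(line)
```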

But how to generate persistent URIs?

There are different options for generating persistent URIs, and the mine administrator will choose the option most suitable for the mine instance.

Option 1: Generate Persistent URIs using third party resolvers

In order to provide permanent URIs, we can configure the mine instance to use Identifiers.org as PURL (permanent URI) provider. These are the steps to follow:

1. register the mine instance in Identifiers.org as a data collection

Namespace/prefix: LegumeMine
URI (assigned by Identifiers.org): http://identifiers.org/legumemine
Primary Resource: https://mines.legumeinfo.org/legumemine

2. set, in the mine instance, the property identifier.uri.base with the URI assigned by Identifiers.org (e.g. http://identifiers.org/legumemine).

The URI generated by LegumeMine for the entity GeneticMarker with primary identifier 118M3 will be: http://identifiers.org/legumemine:geneticmarker:118M3. This is persistent, unique and actionable: if you paste it into the web browser address bar, identifiers.org will redirect you to the navigable URL https://mines.legumeinfo.org/legumemine/geneticmarker:118M3.
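The LegumeMine example can be sketched in a few lines. This assumes, based only on the two URLs shown above, that the persistent URI joins the identifier.uri.base value and the LUI with a colon, while the navigable URL joins the mine's base URL and the LUI with a slash. The function names are hypothetical.

```python
def lui(class_name: str, primary_id: str) -> str:
    """Local unique identifier: lowercased class name, a colon, then the primary ID."""
    return f"{class_name.lower()}:{primary_id}"


def persistent_uri(uri_base: str, class_name: str, primary_id: str) -> str:
    """Persistent URI under the base assigned by the resolver (colon separator assumed)."""
    return f"{uri_base.rstrip('/')}:{lui(class_name, primary_id)}"


def navigable_url(mine_base_url: str, class_name: str, primary_id: str) -> str:
    """Navigable URL of the page in the mine (slash separator assumed)."""
    return f"{mine_base_url.rstrip('/')}/{lui(class_name, primary_id)}"


# Value of the identifier.uri.base property in this example.
URI_BASE = "http://identifiers.org/legumemine"
MINE_BASE_URL = "https://mines.legumeinfo.org/legumemine"

print(persistent_uri(URI_BASE, "GeneticMarker", "118M3"))
print(navigable_url(MINE_BASE_URL, "GeneticMarker", "118M3"))
```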

Option 2: Generate Persistent URIs by setting up a redirection system

A mine administrator might prefer to implement an in-house redirection system (a couple of lines in the Apache or nginx configuration files), setting up a PURL system similar to purl.uniprot.org.

In LegumeMine, for example, the permanent URI, for the entity GeneticMarker with identifier 118M3 might be: http://purl.legumemine.org/legumemine/geneticmarker:118M3.
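In practice the redirect would live in the web server configuration, but the mapping it has to perform can be sketched in Python. The snippet assumes the PURL host from the example above and assumes that the path is carried over unchanged to the mine's own host; resolve_purl is a hypothetical helper, not an InterMine function.

```python
from urllib.parse import urlsplit

# Assumptions for this sketch: the PURL host from the example and the mine's real host.
PURL_HOST = "purl.legumemine.org"
MINE_ORIGIN = "https://mines.legumeinfo.org"


def resolve_purl(purl: str):
    """Return the navigable URL for a PURL, or None if the host is not ours."""
    parts = urlsplit(purl)
    if parts.netloc != PURL_HOST:
        return None
    # The path (e.g. /legumemine/geneticmarker:118M3) is kept as-is.
    return MINE_ORIGIN + parts.path


print(resolve_purl("http://purl.legumemine.org/legumemine/geneticmarker:118M3"))
```

A server would answer a request for the PURL with an HTTP redirect to the returned address.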

Option 3: Use navigable URLs

A mine administrator might decide to use the navigable URLs (see the next section) as permanent URIs.
An example is given by ZFIN where the URI http://zfin.org/ZDB-GENE-040718-423 coincides with the navigable URL.

Navigable persistent URLs

The navigable or access URLs are the URLs of the web pages where the users are redirected.
Examples of the new navigable URLs in FlyMine:
http://flymine.org/flymine/protein:P31946
http://flymine.org/flymine/publication:781465
http://flymine.org/flymine/gene:MGI:1924206

The new navigable URLs will not change at every build! We can guarantee they will be persistent by setting up a redirection system which resolves old URLs.

Navigable URLs usage

1. Permanent link button in the current report page
2. Permanent link button in BlueGenes, InterMine’s new user interface.
3. To mark up the web pages with Bioschemas.org types: the url attribute will be set to the navigable URL
4. To generate RDF: the field schema:mainEntityOfPage will be set with the persistent URL. For example:

<http://identifiers.org/MGI:97490> schema:mainEntityOfPage <https://mousemine.org/gene:MGI:97490> .
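The split between the two kinds of address can also be shown for the web page markup: the identifier attribute takes the persistent URI and the url attribute takes the navigable URL. The following is only a sketch; the function name, the choice of BioChemEntity as the type and the use of the MGI ID as the name are assumptions made for illustration.

```python
import json


def bioschemas_markup(entity_type: str, name: str, persistent_uri: str, navigable_url: str) -> str:
    """Return a JSON-LD script element with identifier and url set as described above."""
    doc = {
        "@context": "http://schema.org",
        "@type": entity_type,
        "name": name,
        "identifier": persistent_uri,  # persistent URI
        "url": navigable_url,          # navigable URL of the report page
    }
    # Escape "</" so that a value can never close the script element early.
    body = json.dumps(doc, indent=2).replace("</", "<\\/")
    return '<script type="application/ld+json">\n' + body + "\n</script>"


print(bioschemas_markup(
    "BioChemEntity",
    "MGI:97490",
    "http://identifiers.org/MGI:97490",
    "https://mousemine.org/gene:MGI:97490",
))
```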

The new permanent URIs and URLs are scheduled for release in InterMine 4.0.


Bioschemas Summer Progress and InterMine

A couple of weeks ago we took part in the May ELIXIR Bioschemas meeting, along with representatives from Google, the European Bioinformatics Institute (EBI) and other participating organizations from the UK and beyond.

To give some background, Bioschemas is based on schema.org, an initiative to produce schemas that can be directly embedded in websites to give more structure to data. Search engines can understand this more easily than simple text, and it’s the stuff that powers a proportion of Google snippets (those box-outs you see on Google search results when you search for something popular). For example, let’s suppose I wanted to tell search engines more about my Jazz event. This is what I would embed in the webpage for the event.

<script type="application/ld+json">
{
  "@context": "http://schema.org",
  "@type": "Event",
  "name": "Hot Digits Jazz Afternoons",
  "startDate": "2017-04-24T14:30",
  "endDate": "2017-04-24T17:00",
  "location": {
    "@type": "Place",
    "name": "Hot Digits",
    "address": {
      "@type": "PostalAddress",
      "streetAddress": "444 Trumpington St",
      "addressLocality": "Cambridge",
      "postalCode": "CB2 1QA",
      "addressCountry": "UK"
    }
  },
  "image": "http://www.example.com/event_image/12345",
  "description": "Join us for an afternoon of Jazz with Tom Colborn (aka 'Delta Tom').",
  "performer": {
    "@type": "PerformingGroup",
    "name": "Tom Colborn"
  }
}
</script>

Bioschemas wants to do the same but for biological information (like genes, proteins, samples, etc.). So in InterMine, for the CHEY_BACSU protein report page in SynBioMine we might have something like this:

<script type="application/ld+json">
{
  "@context":"http://schema.org",
  "@type":"BiologicalEntity",
  "biologicalType":"protein",
  "name":"CHEY_BACSU",
  "url":"http://beta.synbiomine.org/synbiomine/report.do?id=111921899",
  "about":"Integrated InterMine information for Protein CHEY_BACSU",
  "keywords":"protein, CHEY_BACSU",
  "inDataset": {
    "@type":"Dataset",
    "url":"http://beta.synbiomine.org/synbiomine/release-5"
  },
  "crossReference": {
    "@type":"Thing",
    "url":"http://beta.synbiomine.org/synbiomine/report.do?id=6010402"
  },
  "taxon": [
    "https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?mode=Info&id=224308&lvl=3&lin=f&keep=1&srchmode=1&unlock",
    "http://www.uniprot.org/taxonomy/224308"
  ],
  "sequence":"MAHRILIVDDAAFMRMMIKDILVKNGFEVVAEAENGAQAVEKYKEHSPDLVTMDITMPEMDGITALKEIKQIDAQARIIMCSAMGQQSMVIDAIQAGAKDFIVKPFQADRVLEAINKTLN",
  "datePublished":"2017-05-26",
  "citation": [
    {
      "@type":"CreativeWork",
      "name":"UniProt",
      "url":"http://www.uniprot.org"
    },
    {
      "@type":"CreativeWork",
      "name":"Ecocyc",
      "url":"http://ecocyc.org"
    }
  ]
}
</script>
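Hand-written JSON-LD like this is easy to get subtly wrong, for instance by repeating a key or leaving a trailing comma. As a rough sketch, the JSON inside the script element can be checked with Python's standard json module before it is embedded; check_jsonld and reject_duplicate_keys are hypothetical helper names.

```python
import json


def reject_duplicate_keys(pairs):
    """object_pairs_hook that raises instead of silently keeping the last duplicate."""
    result = {}
    for key, value in pairs:
        if key in result:
            raise ValueError(f"duplicate key: {key!r}")
        result[key] = value
    return result


def check_jsonld(text: str) -> dict:
    """Parse JSON-LD text; raises ValueError on syntax errors or repeated keys."""
    return json.loads(text, object_pairs_hook=reject_duplicate_keys)


good = '{"@type": "Dataset", "citation": [{"name": "UniProt"}, {"name": "Ecocyc"}]}'
print(check_jsonld(good)["citation"][1]["name"])

for bad in ('{"taxon": "a", "taxon": "b"}', '{"name": "x",}'):
    try:
        check_jsonld(bad)
    except ValueError as error:
        print("rejected:", error)
```

Values that occur more than once, such as several citations, are better expressed as a JSON array under a single key.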

A search engine (or a specialized life sciences search tool) can then crawl and aggregate the structures embedded in a wide range of life sciences websites (particularly where the data is spread across lots of small sites, such as biological samples in biobanks). The goal is to make it considerably easier for scientists to find information relevant to their research without having to visit lots of sites individually.
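To give a feel for what such a crawler does with a page, here is a minimal sketch that pulls the JSON-LD blocks out of some HTML using only the Python standard library. The class and function names are hypothetical, and a real crawler would also need to fetch pages and cope with malformed markup.

```python
import json
from html.parser import HTMLParser


class JsonLdExtractor(HTMLParser):
    """Collect the parsed contents of every application/ld+json script element."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._buffer = []
        self.items = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self.items.append(json.loads("".join(self._buffer)))
            self._in_jsonld = False


def extract_jsonld(html: str) -> list:
    """Return a list with one parsed object per embedded JSON-LD block."""
    parser = JsonLdExtractor()
    parser.feed(html)
    parser.close()
    return parser.items


page = (
    "<html><head>"
    '<script type="application/ld+json">'
    '{"@context": "http://schema.org", "@type": "Dataset", "name": "Example"}'
    "</script></head><body><p>Report page</p></body></html>"
)
for item in extract_jsonld(page):
    print(item["@type"], item["name"])
```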

The job of Bioschemas is to go through the existing schema.org schemas and decide which existing types we can use (such as Dataset) and what we need to propose as new schemas (such as BiologicalEntity). schema.org schemas are big bags of attributes with no cardinality constraints, as they need to satisfy a lot of different use cases, so another job of Bioschemas is to recommend which attributes to use and at what cardinality, both for data in general (Dataset, for example) and for specific life sciences entities, such as proteins and biological samples.

We made some great progress at this meeting and the results, such as draft schema specifications, are going up on the Bioschemas groups page. The next phase is for specific resources, such as UniProt and the Protein Data Bank in Europe, to try out these schemas on real data and catch the obvious problems so that we can refine the specifications further. At InterMine we’ve also done some very early prototype work on testing these ideas and we’ll continue to participate enthusiastically, particularly as this is an important component of our coming work to make InterMine-hosted data more Findable, Accessible, Interoperable and Reusable.

Bioschemas work is at an early and draft stage, but it’s an open community that welcomes anybody who wants to join in the effort. You can find more details on how to participate in our mailing list and issue tracker at Bioschemas.

The FAIR journey


It’s time to celebrate. After some hectic weeks preparing and organizing the InterMine conference we can now take a deep breath and get ready for our new BBSRC-funded project to make data more FAIR.

It will be a short rest, just the time to check that we have everything we need to start this long journey but also to savour the excitement before the departure towards FAIRness, a destination that will enhance InterMine compliance with FAIR principles; InterMine has been at the forefront of getting research data into the hands of scientists for over 10 years and we’re excited to support the formalisation of these principles.

As a team, we had no doubt about the need to implement the FAIR data principles, making biological data stored in InterMine instances more Findable, Accessible, Interoperable, and Reusable by both humans and machines, nor about the huge impact that this achievement might have on the quality of biological data served by InterMine. By implementing mechanisms that make data stored in InterMine FAIR, we have a unique opportunity to make ALL data collections, served by nearly 30 public biological InterMine instances worldwide, FAIR.

This is a great chance and we didn’t want to miss it!

Here are some important milestones we want to achieve along the journey:

  • Generate globally unique and stable URIs to identify InterMine data objects and register them in community bioinformatics repositories (for instance bio.tools and Identifiers.org) in order to provide more findable and accessible data.
  • Apply suitable ontologies to the core InterMine data model to make the semantics of InterMine data explicit and facilitate data exchange and interoperability.
  • Embed metadata in InterMine web pages and add extra metadata to InterMine’s existing structured query results to make data more findable.
  • Provide an RDF representation of stored data, lists and query results, and the bulk download of all InterMine data in RDF form, in order to allow users to import InterMine resources into their local triplestore.
  • Provide an infrastructure for a SPARQL endpoint where the user can perform federated queries over multiple data sets.
  • Improve accessibility of data licenses for integrated sources via web interface and REST web-service.

It will be an exciting challenge. Follow us on this blog.