Bioschemas Summer Progress and InterMine

A couple of weeks ago we took part in the May ELIXIR Bioschemas meeting, along with representatives from Google, the European Bioinformatics Institute (EBI) and other participating organizations from the UK and beyond.

To give some background, Bioschemas is based on schema.org, an initiative to produce schemas that can be embedded directly in web pages to give more structure to data. Search engines can understand this structured markup more easily than plain text, and it’s what powers a proportion of Google snippets (those box-outs you see in Google search results when you search for something popular). For example, let’s suppose I wanted to tell search engines more about my jazz event. This is what I would embed in the webpage for the event.

<script type="application/ld+json">
{
  "@context": "http://schema.org",
  "@type": "Event",
  "name": "Hot Digits Jazz Afternoons",
  "startDate": "2017-04-24T14:30",
  "endDate": "2017-04-24T17:00",
  "location": {
    "@type": "Place",
    "name": "Hot Digits",
    "address": {
      "@type": "PostalAddress",
      "streetAddress": "444 Trumpington St",
      "addressLocality": "Cambridge",
      "postalCode": "CB2 1QA",
      "addressCountry": "UK"
    }
  },
  "image": "http://www.example.com/event_image/12345",
  "description": "Join us for an afternoon of Jazz with Tom Colborn (aka 'Delta Tom').",
  "performer": {
    "@type": "PerformingGroup",
    "name": "Tom Colborn"
  }
}
</script>

Bioschemas wants to do the same but for biological information (genes, proteins, samples, etc.). So in InterMine, for the CHEY_BACSU protein report page in SynBioMine, we might have something like this:

<script type="application/ld+json">
{
  "@context":"http://schema.org",
  "@type":"BiologicalEntity",
  "biologicalType":"protein",
  "name":"CHEY_BACSU",
  "url":"http://beta.synbiomine.org/synbiomine/report.do?id=111921899",
  "about":"Integrated InterMine information for Protein CHEY_BACSU",
  "keywords":"protein, CHEY_BACSU",
  "inDataset": {
    "@type":"Dataset",
    "url":"http://beta.synbiomine.org/synbiomine/release-5"
  },
  "crossReference": {
    "@type":"Thing",
    "url":"http://beta.synbiomine.org/synbiomine/report.do?id=6010402"
  },
  "taxon": [
    "https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?mode=Info&id=224308&lvl=3&lin=f&keep=1&srchmode=1&unlock",
    "http://www.uniprot.org/taxonomy/224308"
  ],
  "sequence":"MAHRILIVDDAAFMRMMIKDILVKNGFEVVAEAENGAQAVEKYKEHSPDLVTMDITMPEMDGITALKEIKQIDAQARIIMCSAMGQQSMVIDAIQAGAKDFIVKPFQADRVLEAINKTLN",
  "datePublished":"2017-05-26",
  "citation": [
    {
      "@type":"CreativeWork",
      "name":"UniProt",
      "url":"http://www.uniprot.org"
    },
    {
      "@type":"CreativeWork",
      "name":"Ecocyc",
      "url":"http://ecocyc.org"
    }
  ]
}
</script>

A search engine (or a specialized life sciences search tool) can then crawl and aggregate the structures embedded in a wide range of life sciences websites (particularly those with lots of small sites, such as biological samples in biobanks). The goal is to make it considerably easier for scientists to find information relevant to their research without having to visit lots of sites individually.
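To make the crawling step a little more concrete, here is a minimal Python sketch of how a tool might pull JSON-LD blocks out of a page using only the standard library. The class name, function name and sample page are our own inventions for illustration; a real crawler would also need to fetch pages, follow links and cope with other embedding formats such as RDFa.

```python
import json
from html.parser import HTMLParser


class JsonLdExtractor(HTMLParser):
    """Collects the parsed contents of every JSON-LD script element on a page."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._buffer = []
        self.records = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self._in_jsonld = False
            try:
                self.records.append(json.loads("".join(self._buffer)))
            except json.JSONDecodeError:
                pass  # skip malformed markup rather than abort the crawl


def extract_jsonld(html):
    """Return a list of the JSON-LD structures embedded in an HTML string."""
    parser = JsonLdExtractor()
    parser.feed(html)
    parser.close()
    return parser.records


# Hypothetical page content, loosely based on the protein example above.
PAGE = """
<html><head>
<script type="application/ld+json">
{"@context": "http://schema.org", "@type": "BiologicalEntity",
 "biologicalType": "protein", "name": "CHEY_BACSU"}
</script>
</head><body><p>Report page</p></body></html>
"""

records = extract_jsonld(PAGE)
proteins = [
    r["name"] for r in records
    if isinstance(r, dict) and r.get("biologicalType") == "protein"
]
print(proteins)
```

An aggregator could run something like this over many sites and index the resulting records by type, which is roughly the role we imagine for a specialized life sciences search tool.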

The job of Bioschemas is to go through the existing schema.org schemas and decide which existing types we can use (such as Dataset) and which we need to propose as new schemas (such as BiologicalEntity). Schemas in schema.org are big bags of attributes with no cardinality constraints, as they need to satisfy a lot of different use cases. So another job of Bioschemas is to recommend which attributes to use and at what cardinality, both for data in general (Dataset, for example) and for specific life sciences entities, such as proteins and biological samples.
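As a rough illustration of what "which attributes, at what cardinality" could mean in practice, the Python sketch below checks a record against a small profile. The profile itself is invented for this example and is not taken from any Bioschemas specification; the function names are likewise our own.

```python
# Hypothetical profile: property -> (minimum, maximum) occurrences.
# A maximum of None means unbounded. These constraints are invented for
# illustration and are not taken from a Bioschemas specification.
DATASET_PROFILE = {
    "name": (1, 1),
    "description": (1, 1),
    "url": (1, 1),
    "keywords": (0, None),
}


def count_values(record, prop):
    """Number of values a record holds for a property (0 if absent)."""
    if prop not in record:
        return 0
    value = record[prop]
    return len(value) if isinstance(value, list) else 1


def check_profile(record, profile):
    """Return a list of human-readable cardinality violations."""
    problems = []
    for prop, (minimum, maximum) in profile.items():
        n = count_values(record, prop)
        if n < minimum:
            problems.append(f"{prop}: expected at least {minimum}, found {n}")
        elif maximum is not None and n > maximum:
            problems.append(f"{prop}: expected at most {maximum}, found {n}")
    return problems


record = {
    "@type": "Dataset",
    "name": ["SynBioMine", "SynBioMine release 5"],  # two names: breaks the 1..1 rule
    "url": "http://beta.synbiomine.org/synbiomine/release-5",
}
for problem in check_profile(record, DATASET_PROFILE):
    print(problem)
```

Plain schema.org would accept this record as it stands; it is only the extra profile that flags the duplicate name and the missing description.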

We made some great progress at this meeting and the results, such as draft schema specifications, are going up on the Bioschemas groups page. The next phase is for specific resources, such as UniProt and the Protein Data Bank in Europe, to try out these schemas on real data and catch the obvious problems so that we can refine the specifications further. At InterMine we’ve also done some very early prototype work on testing these ideas and we’ll continue to participate enthusiastically, particularly as this is an important component of our coming work to make InterMine-hosted data more Findable, Accessible, Interoperable and Reusable (FAIR).

Bioschemas work is at an early, draft stage, but it’s an open community that welcomes anybody who wants to join in the effort. You can find more details on how to participate, including the mailing list and issue tracker, at Bioschemas.

Bioschemas

Justin and Gos attended the Bioschemas kick-off meeting at Hinxton this week. As well as giving a short talk about InterMine (the slides are on FigShare), Justin managed to jot down a few thoughts about the event:


The aim of Bioschemas is to come up with simple metadata that can be embedded in webpages (as JSON-LD, RDFa, etc.) to make it easier to find data. For instance, suppose you wanted datasets that concerned the effect of a certain drug. At the moment, if you search Google for that drug name, you will find relevant datasets, but possibly also datasets that happen to use that drug as part of their protocol, and more besides.

But if you can embed a schema which names the subject of the dataset in a structured way (e.g. puts an ontology term URL in a dataset.subject field) then you can pull up more relevant data.
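The difference between the two kinds of search can be sketched in a few lines of Python. Everything here is hypothetical: the ontology URLs, the two dataset records and the use of a `subject` property are stand-ins for whatever Bioschemas eventually settles on.

```python
# Hypothetical ontology term URL standing in for "a certain drug".
DRUG_TERM = "http://example.org/ontology/DRUG_0001"

# Two invented dataset records: one studies the drug, one merely uses it.
DATASETS = [
    {
        "name": "Dose response to drug X in liver tissue",
        "description": "Measures the effect of drug X on liver enzymes.",
        "subject": [DRUG_TERM],
    },
    {
        "name": "Heart muscle imaging",
        "description": "Cells were treated with drug X during sample preparation.",
        "subject": ["http://example.org/ontology/TISSUE_0042"],
    },
]


def text_search(records, phrase):
    """Keyword search: matches any record that mentions the phrase anywhere."""
    phrase = phrase.lower()
    return [
        r["name"] for r in records
        if phrase in (r["name"] + " " + r["description"]).lower()
    ]


def subject_search(records, term_url):
    """Structured search: matches only records whose declared subject is the term."""
    return [r["name"] for r in records if term_url in r.get("subject", [])]


print(text_search(DATASETS, "drug X"))      # both datasets mention the drug
print(subject_search(DATASETS, DRUG_TERM))  # only the dataset that is about it
```

The keyword search returns both datasets, while the structured search returns only the one that declares the drug as its subject.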

However, there is a strong concern with keeping such markup as light as possible so that people aren’t put off annotating their datasets. Hence a notional rule that there should be only six properties per class (e.g. just name, description, url, keywords, variableMeasured and creator.name for a dataset).
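To give a feel for how light that is, here is a Python sketch that builds a dataset description using only the six properties listed above and wraps it for embedding in a page. All of the values are made up, and modelling creator.name as a nested Person is our own assumption.

```python
import json

MAX_PROPERTIES = 6  # the notional per-class limit mentioned above

# Hypothetical values; only the property names come from the list above.
dataset = {
    "@context": "http://schema.org",
    "@type": "Dataset",
    "name": "Example expression dataset",
    "description": "Gene expression measured after treatment with an example drug.",
    "url": "http://www.example.com/datasets/1",
    "keywords": "expression, example drug",
    "variableMeasured": "expression level",
    "creator": {"@type": "Person", "name": "A. Researcher"},
}


def property_names(record):
    """Property names, ignoring JSON-LD keywords such as @context and @type."""
    return [key for key in record if not key.startswith("@")]


# Guard against the markup growing beyond the agreed limit.
assert len(property_names(dataset)) <= MAX_PROPERTIES

markup = (
    '<script type="application/ld+json">\n'
    + json.dumps(dataset, indent=2)
    + "\n</script>"
)
print(markup)
```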

As such, Bioschemas won’t be replacing any of the existing ‘heavyweight’ schemas (DATS, DataCite’s model, the OmicsDI model, our own InterMine data model), as it’s not meant to be used as an internal data model.

  • Bioschemas is about getting properties into schema.org (as supported by Google, Bing, Yahoo, etc.). However, schema.org is a big bag of potential metadata with few constraints (no cardinality, for instance!). It’s up to Bioschemas to come up with constraints for our purposes, especially on generic metadata such as Dataset and DataCatalog.
  • Bioschemas is also largely about general areas at the moment (datasets, data sources, etc.), though there is some specific work on protein annotations and samples. There is nothing yet on, for instance, genes and proteins (though presumably protein annotations would need some metadata for proteins…).
  • Things are at an early stage – the next year will (hopefully) see some Bioschemas definitions.  There is still debate about exactly what it can be used for beyond search.  For instance, we (InterMine) may be able to use them to improve our data integration process if all sources start embedding common metadata in their download files as well as on webpages.
  • Bioschemas is more than schema work. The initiative covers topics such as identifiers, citation, metrics and tools too (which will be relevant to us in the future). We can get value from these other areas too – for instance there was discussion of a standard way of providing notification that a data source (UniProt, etc.) had updated, which would be very useful to us in building mines that automatically update. There was also talk of having metadata that specifies when a data source changes its format – another thing that would be tremendously useful for us.
  • The presentations were interesting and the group friendly. InterMine seemed to be well received. The group doesn’t have many actual data consumers and integrators, so I think that we can make a valuable contribution from that perspective.