Researchers connected in Berlin

[Image: researchersConnected.png]

I really enjoyed attending the Neo4j Life & Health Sciences Workshop, organized in Berlin this week by Michael and Petra: a day rich with great presentations about the application and utility of graph technology in several research areas. Here are just a few examples:

  • The Ontology Lookup Service, a repository for biomedical ontologies, is implemented with the support of graph databases, with Apache Solr for indexing: different technologies for different purposes.
  • In the Lamond lab (University of Dundee), they model proteomics data with graph databases in order to understand protein behaviour under different conditions and dimensions of analysis.
  • MetaProteomeAnalyzer (MPA), a tool for analyzing and visualizing metaproteomics data, uses Neo4j as the backend of its analysis software.
  • Tabloid Proteome is a database of associated protein pairs, derived from mass-spectrometry-based proteomics experiments and implemented using a graph database. It can also help to discover proteins that are connected indirectly, or surface information that you weren’t even looking for!
  • Reactome is a pathway database which has recently migrated from MySQL to Neo4j, with significant performance improvements. You can access the data via the GraphCore open source Java library, developed with Spring Data Neo4j, or via the Neo4j browser.

I’ve lost count of how many times I heard sentences like: “Biology systems are complex and growing, and graphs are the native data model” or “Graph database technology is an effective tool for modelling the highly connected data we have in biology systems”. We already knew it, but it was very encouraging and promising to hear it again from so many researchers and practitioners with more experience in graph technologies than us.

In the afternoon, I attended the workshop “Data modelling with Neo4j”: starting from the data sources we usually work with, we tried to model the entities and relationships in order to answer some relevant questions. Modelling can be very challenging and, in some cases, it might depend on the questions you have to answer!

Before the end, I had the chance to give a short presentation about our experience with Neo4j.

Thanks again Michael and Petra for organizing such a great event!

Out and about: where to find InterMiners over June and July 2017

We recently added a public Google calendar you can subscribe to if you’re interested in knowing what we’re up to, or when public holidays might mean we’re out of the office. Here’s a quick lowdown on upcoming events:

20 June 2017: InterMine community dev call.

21 June 2017: Neo4j Life and Health sciences day in Berlin. Keep your eyes peeled for Daniela!

28 June 2017: Daniela will be presenting on our experiences with Neo4j at the London Neo4j GraphDB meetup.

4 and 18 July 2017: InterMine community dev calls.

22-23 July 2017: I’ll be presenting a poster at BOSC/ISMB about BlueGenes, with the fantastically witty title “Forever in BlueGenes: a next-generation genomic data interface powered by InterMine”. 👖


If you’re a GSoC student or mentor, there will also be the evaluation periods at the end of each month, but you’re doubtless well aware of those!

Further in the future, you may find us at SWAT4LS, ISWC, and further Bioschemas events. We’ll keep you posted!

Are you attending any fun events? Let us know!

If you’re going to be at an event this year where you’ll be telling others about your work with InterMine and might like some InterMine stickers or handouts – or perhaps you’d like to guest-blog about it or share your slides – please ping us.

Bioschemas Summer Progress and InterMine

A couple of weeks ago we took part in the May ELIXIR Bioschemas meeting, along with representatives from Google, the European Bioinformatics Institute (EBI) and other participating organizations from the UK and beyond.

To give some background, Bioschemas is based on schema.org, an initiative to produce schemas that can be directly embedded in websites to give more structure to data. Search engines can understand this more easily than simple text, and it’s the stuff that powers a proportion of Google snippets (those box-outs you see on Google search results when you search for something popular). For example, let’s suppose I wanted to tell search engines more about my Jazz event. This is what I would embed in the webpage for the event.

<script type="application/ld+json">
{
  "@context": "http://schema.org",
  "@type": "Event",
  "name": "Hot Digits Jazz Afternoons",
  "startDate": "2017-04-24T14:30-17:00",
  "location": {
    "@type": "Place",
    "name": "Hot Digits",
    "address": {
      "@type": "PostalAddress",
      "streetAddress": "444 Trumpington St",
      "addressLocality": "Cambridge",
      "postalCode": "CB2 1QA",
      "addressCountry": "UK"
    }
  },
  "image": "http://www.example.com/event_image/12345",
  "description": "Join us for an afternoon of Jazz with Tom Colborn (aka 'Delta Tom').",
  "performer": {
    "@type": "PerformingGroup",
    "name": "Tom Colborn"
  }
}
</script>

Bioschemas wants to do the same but for biological information (like genes, proteins, samples, etc.). So in InterMine, for the CHEY_BACSU protein report page in SynBioMine we might have something like this:

<script type="application/ld+json">
{
  "@context":"http://schema.org",
  "@type":"BiologicalEntity",
  "biologicalType":"protein",
  "name":"CHEY_BACSU",
  "url":"http://beta.synbiomine.org/synbiomine/report.do?id=111921899",
  "about":"Integrated InterMine information for Protein CHEY_BACSU",
  "keywords":"protein, CHEY_BACSU",
    "inDataset": {
      "@type":"Dataset",
      "url":"http://beta.synbiomine.org/synbiomine/release-5"
    },
  "crossReference": {
    "@type":"Thing",
    "url":"http://beta.synbiomine.org/synbiomine/report.do?id=6010402"
  },
  "taxon":"https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?mode=Info&id=224308&lvl=3&lin=f&keep=1&srchmode=1&unlock",
  "taxon":"http://www.uniprot.org/taxonomy/224308"
  "sequence":"MAHRILIVDDAAFMRMMIKDILVKNGFEVVAEAENGAQAVEKYKEHSPDLVTMDITMPEM
 DGITALKEIKQIDAQARIIMCSAMGQQSMVIDAIQAGAKDFIVKPFQADRVLEAINKTLN",
  "datePublished":"2017-05-26",
  "citation": {
    "@type":"CreativeWork",
    "name":"UniProt",
    "url":"http://www.uniprot.org"
  },
  "citation": {
    "@type":"CreativeWork",
    "name":"Ecocyc",
    "url":"http://ecocyc.org"
  },
}

A search engine (or a specialized life sciences search tool) can then crawl and aggregate the structures embedded in a wide range of life sciences websites (particularly areas with lots of small sites, such as biological samples in biobanks). The goal is to make it considerably easier for scientists to find information relevant to their research without having to visit lots of sites individually.
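As a concrete illustration, here's a minimal Python sketch of the extraction step such a crawler might perform: pulling the JSON-LD blocks out of a page using only the standard library. The target URL is just a placeholder; any page embedding JSON-LD would do.

import json
from html.parser import HTMLParser
from urllib.request import urlopen

class JsonLdExtractor(HTMLParser):
    """Collects the contents of <script type="application/ld+json"> blocks."""
    def __init__(self):
        super().__init__()
        self.in_jsonld = False
        self.buffer = ""
        self.blocks = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and ("type", "application/ld+json") in attrs:
            self.in_jsonld = True

    def handle_data(self, data):
        if self.in_jsonld:
            self.buffer += data

    def handle_endtag(self, tag):
        if tag == "script" and self.in_jsonld:
            self.blocks.append(json.loads(self.buffer))
            self.buffer = ""
            self.in_jsonld = False

# Placeholder URL -- in practice a crawler would visit many such pages.
html = urlopen("http://beta.synbiomine.org/synbiomine/report.do?id=111921899").read().decode("utf-8")
extractor = JsonLdExtractor()
extractor.feed(html)
for block in extractor.blocks:
    print(block.get("@type"), block.get("name"))

An aggregator would then merge these records across sites, which is exactly where shared types and attribute recommendations become important.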

The job of Bioschemas is to go through the existing schema.org schemas and decide what existing types we can use (such as Dataset) and what we need to propose as new schemas (such as BiologicalEntity). schema.org schemas are big bags of attributes with no cardinality constraints, as they need to satisfy a lot of different use cases, so another job of Bioschemas is to recommend which attributes to use and at what cardinality, both for data in general (Dataset, for example) and for specific life sciences entities, such as proteins and biological samples.
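To make that concrete, here's a toy Python sketch of how such a recommendation might be checked against a record. The property lists below are invented for illustration only; they are not an actual Bioschemas specification.

# Hypothetical profile: the property names and tiers are illustrative only,
# not a real Bioschemas recommendation.
PROTEIN_PROFILE = {
    "minimum": ["name", "url", "inDataset"],
    "recommended": ["taxon", "sequence", "citation"],
}

def check_against_profile(record, profile):
    """Report which minimum/recommended properties a JSON-LD record lacks."""
    missing_minimum = [p for p in profile["minimum"] if p not in record]
    missing_recommended = [p for p in profile["recommended"] if p not in record]
    return missing_minimum, missing_recommended

record = {
    "@type": "BiologicalEntity",
    "name": "CHEY_BACSU",
    "url": "http://beta.synbiomine.org/synbiomine/report.do?id=111921899",
}
missing_minimum, missing_recommended = check_against_profile(record, PROTEIN_PROFILE)
print("Missing minimum properties:", missing_minimum)          # ['inDataset']
print("Missing recommended properties:", missing_recommended)  # ['taxon', 'sequence', 'citation']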

We made some great progress at this meeting, and the results, such as draft schema specifications, are going up on the Bioschemas groups page. The next phase is for specific resources, such as UniProt and the Protein Data Bank in Europe, to try out these schemas on real data and catch the obvious problems so that we can refine the specifications further. At InterMine we’ve also done some very early prototype work on testing these ideas, and we’ll continue to participate enthusiastically, particularly as this is an important component of our coming work to make InterMine-hosted data more Findable, Accessible, Interoperable and Reusable.

Bioschemas work is at an early and draft stage, but it’s an open community that welcomes anybody who wants to join in the effort. You can find more details on how to participate via the Bioschemas mailing list and issue tracker.

GraphConnect – a Neo4j conference

[Image: neo4jconference]

We were in London to attend GraphConnect, the annual conference organised by Neo4j.
It was fantastic to meet so many people from around the world who are enthusiastic about graph databases, and a lot of people who, like us, are prototyping and exploring Neo4j as a possible alternative to relational databases.

They announced the release of Neo4j 3.2, which promises a huge improvement in terms of performance: the new compiled Cypher runtime has improved the speed of a subset of basic queries by ~300%, and the introduction of native label indexes has also improved write speed.

They have also added composite indexes (which InterMine uses a lot) and support for using indexes with the OR operator. We highlighted this problem on Stack Overflow months ago and were pleasantly surprised to see it fixed. We now have to update our “What we didn’t like about Neo4j” list by removing two items. We’re really happy about that!
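For the curious, here's roughly what exercising those two features could look like from Python, using the official Neo4j Bolt driver with Neo4j 3.2-era index syntax. This is a sketch only: the labels, properties, credentials and values are made up for illustration.

from neo4j.v1 import GraphDatabase  # the 2017-era neo4j-driver package

# Placeholder connection details.
driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

with driver.session() as session:
    # A composite index over two properties -- new in Neo4j 3.2.
    session.run("CREATE INDEX ON :Gene(primaryIdentifier, organismName)")

    # Single-property indexes that the OR query below can draw on.
    session.run("CREATE INDEX ON :Gene(symbol)")
    session.run("CREATE INDEX ON :Gene(primaryIdentifier)")

    # As of 3.2, a disjunction like this can be answered from the indexes
    # rather than falling back to a full label scan.
    result = session.run(
        "MATCH (g:Gene) WHERE g.symbol = $symbol OR g.primaryIdentifier = $id "
        "RETURN g.primaryIdentifier, g.symbol",
        symbol="zen", id="FBgn0004053")
    for record in result:
        print(record["g.primaryIdentifier"], record["g.symbol"])

driver.close()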

It was a pleasure to attend Jesus Barrasa’s talk debunking some RDF-versus-property-graph alternative facts. He demoed how an RDF resource does not necessarily have to live in a triple store, but can also be stored in Neo4j. Here are part 1 and part 2 of “Neo4j is your RDF store”, a nice series of posts where he describes his work in more detail.

Another nice tool they have implemented is an ETL tool which imports data from a relational database into Neo4j by applying some simple mapping rules.

The solution-based talks demonstrated how Neo4j is being used to solve complex, real world problems ranging from travel recommendation engines to measuring the impact of slot machine locations on casino floors. While the topics were diverse, a common theme across their respective architectures was the use of GraphAware’s plugins, some of which are free. One plugin that looks particularly interesting is the Neo4j2Elastic tool which transparently pushes data from Neo4j to ElasticSearch.

During the conference, we discovered that there is a Neo4j Startup Program that allows startups to use Neo4j Enterprise Edition for free. Not sure if we count as a startup though!

Overall, we’re super happy with the improvements Neo4j has made, and super impressed with Neo4j’s growing community. We’re looking forward to meeting the Neo4j team in London at their meetup, and sharing our experience with the community!

California Dreaming: InterMine Dev Conf 2017 Report – Day 1

2017’s developer conference has been and gone; time to pay my dues in a blog post or two.

Day 0: Welcome dinner, 29 March 2017

The Cambridge InterMine team arrived at Walnut Creek without a hitch, and after a jetlagged attempt at a night’s sleep we sat down to a mega grant-writing session in the hotel lobby, fuelled by several pots of coffee and plates of nachos.

By 7pm, people had begun to gather in the lobby to head to the inaugural conference dinner at the delicious Walnut Creek Yacht Club. We had to change the venue quite late in the game, so we wandered down the street to collect some of the InterMiners who had ended up at the original venue (sorry!!). By the end of the meal, most of the UK contingent were dead on their feet – 10pm California time worked out to be 6am according to our body clocks – so when Joe offered to give several of us a lift back to the hotel, it was impossible to decline.

[Photo: 20170329_221945]

Day 1: Workshop Intro

The day started with intros from our PI, Gos, and our host, David Goodstein. 

Josh and I followed up by introducing BlueGenes, the UI we’ve been working on to replace InterMine’s older JSP-based UI. You can view Josh’s slide deck, try out a live demo, or check out the source on GitHub.

Next came one of my favourite parts: short talks from InterMiners.

Short community talks

Doppelgangers – Joel Richardson, MGI

Joel gave a great presentation about doppelgangers in InterMine – that is, depending on your data sets and config, you can occasionally end up with duplicate or strange / incomplete InterMine objects in your mine. He followed up with explanations of the root causes and mitigation methods – a great resource for any InterMiner working on data source integration!

Genetic data in Mines – Sam Hokin, NCGR/LegFed

Next up was Sam’s talk about his various beany mines, including CowpeaMine, which has only genetics data, rather than the more typical InterMine genomic data. He’s also implemented several custom data visualisations on gene report pages – check out the slides or mines for more details.

JBrowse and Inter-mine communication – Vivek Krishnakumar, JCVI

Vivek focused on some great cross-InterMine collaborations (slides here), including the technical challenges integrating JBrowse into InterMine, as well as a method to link to other InterMines using synteny rather than InterMine’s typical homology approach.

InterMine at JGI – Joe Carlson, Phytozome, JGI

Joe has the privilege of running the biggest InterMine, currently covering 72 data sets for 69 organisms. Compared to most InterMines, this is massive! Unsurprisingly, this scale comes with a few hitches many of the other mines don’t encounter. Joe’s slides give a great overview of the problems you might encounter in a large-scale InterMine, and their solutions.

Afternoon sessions

FAIR and the semantic web – Daniela & Justin

After a yummy lunch at a nearby cafe, Justin introduced the concept of FAIR, and discussed InterMine’s plans for a FAIRer future (slides). Discussion topics included:

  • How to make stable URIs (InterMine object IDs are transient and will change between builds) – see the sketch after this list
  • Enhanced embedded metadata in webpages and query results (data provenance, licencing)
  • Better Findablility (the F in FAIR) by registering InterMine resources with external registries
  • RDF generation / SPARQL querying
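
On the stable-URI point, one possible approach – purely a sketch of the idea under discussion, not a settled InterMine design – is to mint URIs from the class name plus a stable external accession, rather than from internal object IDs:

# Sketch only: the base URL and URI pattern are hypothetical, not an agreed
# InterMine convention. The idea is that class name + external accession stay
# stable across builds, while internal object IDs do not.
BASE = "https://mymine.example.org"

def stable_uri(class_name, accession):
    """Build a build-independent URI from a class name and a stable accession."""
    return "{}/{}:{}".format(BASE, class_name.lower(), accession)

print(stable_uri("Protein", "P24072"))    # https://mymine.example.org/protein:P24072
print(stable_uri("Gene", "FBgn0004053")) # https://mymine.example.org/gene:FBgn0004053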

This was followed up by Daniela’s introduction to RDF and SPARQL, which provided a great basic intro to the two concepts in an easily understood manner. I really loved these slides, and I reckon they’d be a good introduction for anyone interested in learning more about what RDF and SPARQL are, whether or not you’re interested in InterMine.

Extending the InterMine Core Data Model – Sergio

Sergio ran the final session, “Extending the InterMine Core Data Model”. Shared models allow for easier cross-InterMine queries, as demoed in the GO tool prototype.

This discussion raised several interesting talking points:

  • Should model extensions be created via community RFC?
  • If so, who is involved? Developers, community members, curators, other?
  • Homologue or homolog? Who knew a simple “ue” could cause incompatibility problems? Most InterMines use the “ue” variation, with the exception of PhytoMine. An answer to this problem was presented in the “friendly mine” section of Vivek’s talk earlier in the day.

Another great output was Siddartha Basu’s gist on setting up InterMine – outlining some pain points and noting the good bits.

Most of us met up for dinner afterwards at Kevin’s Noodle House – highly recommended for meat eaters, less so for veggies.

A flurry of deadlines: Grants, GSoC, workshops, and more…

We blogged in February commenting that we had a lot of events over the March / April period. Here’s a recap:

  • Attending conferences: Amongst the team we attended Bioschemas, the ELIXIR all-hands, and the Cambridge Scientific Computation Day.
  • InterMine training: We delivered a training workshop on using InterMine at the EBI, as part of their week-long Introduction to Omics Data Integration course.
    • This went well despite a server-room meltdown which conveniently timed itself for the morning of the same day (the training session was in the afternoon, so we thankfully had time to get the servers back up!).
    • In contrast to previous years, every single hand went up when we asked whether the participants wrote code as part of their job. Next time, we will try to allow a longer session on using InterMine web services, rather than the 15-minute slot we allocated this time (see the example after this list)!
  • Developer Workshop and Hackathon: 5 days in sunny California, spending time with InterMiners from around the world. Longer blog posts to follow, but in the meantime you can browse the agenda for links to slides from each session, or the storify summary of tweets.
  • Google Summer of Code: We’re participating in Google Summer of Code (GSoC) this year (previously) as a mentoring organisation. We had over 50 interested students and 30 distinct applications, many of which were simply brilliant. The deadline for students applying, naturally, was the day after the hackathon, making finding time to provide student feedback a challenge. Maybe there’s a reason to be grateful for jet-lag induced wakefulness at odd hours!
  • Grants: A tale of two grants… :
    • New application: We had a grant application deadline that was, once again, the day after the hackathon. Uh-oh! Feverish figure fixes, tentative typo tweaks and word-count winnowing were squeezed in at every opportunity.
    • Good news about an old application: Meanwhile, we got the news that our hard work had paid off: a grant we’d applied for last year as part of the BBSRC BBR 2016 call was funded! Hint: the future of InterMine is looking very FAIR, possibly even SPARQLing. More details in a later post.
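
Since web services came up above: for anyone curious, here's a small taste of the InterMine Python client (installable with pip install intermine), querying FlyMine for a gene. This is a minimal sketch; the mine URL and query values are just examples.

from intermine.webservice import Service

# Point the client at a public mine; FlyMine is used here as an example.
service = Service("http://www.flymine.org/query/service")

# Build a query over the Gene class: choose output columns, then constrain.
query = service.new_query("Gene")
query.add_view("primaryIdentifier", "symbol", "organism.name")
query.add_constraint("symbol", "=", "zen")

for row in query.rows():
    print(row["primaryIdentifier"], row["symbol"], row["organism.name"])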

Events coming up soon: