Researchers connected in Berlin


I really enjoyed attending the Neo4j Life & Health Sciences Workshop, organized in Berlin this week by Michael and Petra: a day rich with great presentations about the application and utility of graph technology in several research areas. Here are just a few examples:

  • The Ontology Lookup Service, a repository for biomedical ontologies, implemented with the support of graph databases and Apache Solr for indexing: different technologies for different purposes.
  • In the Lamond lab (University of Dundee), they model proteomics data with graph databases in order to understand protein behaviour under different conditions and dimensions of analysis.
  • MetaProteomeAnalyzer (MPA), a tool for analyzing and visualizing metaproteomics data, uses Neo4j as its backend.
  • Tabloid Proteome is a database of associated protein pairs, derived from mass-spectrometry based proteomics experiments and implemented using a graph database. It can also help you discover proteins that are connected indirectly, or surface information you were not looking for!
  • Reactome is a pathway database which has recently migrated from MySQL to Neo4j, with significant performance improvements. You can access the data via the GraphCore open source Java library, developed with Spring Data Neo4j, or via the Neo4j browser.

I’ve lost count of how many times I heard sentences like “Biology systems are complex and growing and graphs are the native data model” or “Graph database technology is an effective tool for modelling highly connected data as we have in biology systems”. We already knew it, but it was very encouraging and promising to hear it again from so many researchers and practitioners with more experience in graph technologies than us.

In the afternoon, I attended the workshop “Data modelling with Neo4j”; starting from the data sources we usually work with, we tried to model the entities and the relationships in order to answer some relevant questions. Modelling can be very challenging and, in some cases, it can depend heavily on the questions you need to answer!

Before the end, I had the chance to give a short presentation about our experience with Neo4j.

Thanks again Michael and Petra for organizing such a great event!

GraphConnect – a Neo4j conference


We were in London to attend GraphConnect, the annual conference organised by Neo4j.
It was fantastic to meet so many people from around the world enthusiastic about graph databases, and a lot of people that, like us, are prototyping and exploring Neo4j as a possible alternative to relational databases.

They announced the release of Neo4j 3.2, which promises to bring a huge improvement in terms of performance: the compiled Cypher runtime has improved the speed of a subset of basic queries by ~300%, and the introduction of native label indexes has also improved the write speed.

They have also added composite indexes (which InterMine uses a lot) and the use of indexes with the OR operator. We had highlighted that problem months ago on Stack Overflow and were pleasantly surprised to see it fixed. We have to update our “What we didn’t like about Neo4j” list by removing two items. We’re really happy about that!
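For illustration, composite indexes in Neo4j 3.2 are declared on a label with multiple properties; a minimal sketch, with property names borrowed from our own model:

    CREATE INDEX ON :Gene(primaryIdentifier, symbol)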

It was a pleasure to attend Jesus Barrasa’s talk debunking some RDF-versus-property-graph “alternative facts”. He demoed how an RDF resource does not necessarily have to live in a triple store but can also be stored in Neo4j. Here are part 1 and part 2 of “Neo4j is your RDF store”, a nice post where he describes his work in more detail.

Another nice tool they have implemented is the ETL tool, which imports data from a relational database into Neo4j by applying some simple mapping rules.

The solution-based talks demonstrated how Neo4j is being used to solve complex, real-world problems ranging from travel recommendation engines to measuring the impact of slot machine locations on casino floors. While the topics were diverse, a common theme across their respective architectures was the use of GraphAware’s plugins, some of which are free. One plugin that looks particularly interesting is the Neo4j2Elastic tool, which transparently pushes data from Neo4j to ElasticSearch.

During the conference, we discovered that there is a Neo4j Startup Program that provides the Neo4j enterprise edition for free. Not sure if we count as a startup though!

Overall, we’re super happy with the improvements Neo4j has made, and super impressed with Neo4j’s growing community. We’re looking forward to meeting the Neo4j team at their meetup in London and sharing our experience with the community!

California Dreaming: InterMine Dev Conf 2017 Report – Day 1

2017’s developer conference has been and gone; time to pay my dues in a blog post or two.

Day 0: Welcome dinner, 29 March 2017

The Cambridge InterMine arrived at Walnut Creek without a hitch, and after a jetlagged attempt at a night’s sleep we sat down to a mega-grant-writing session in the hotel lobby, fuelled by several pots of coffee and plates of nachos.

By 7PM, people had begun to gather in the lobby to head to the inaugural conference dinner at the delicious Walnut Creek Yacht Club. We had to change the venue quite late in the game, so we wandered down the street to collect some of the InterMiners who had ended up at the original venue (sorry!!). By the end of the meal, most of the UK contingent was dead on their feet – 10pm California time worked out to be 6am according to our body clocks – so when Joe offered several of us a lift back to the hotel, it was impossible to decline.


Day 1: Workshop Intro

The day started with intros from our PI, Gos, and our host, David Goodstein. 

Josh and I followed up by introducing BlueGenes, the UI we’ve been working on to replace InterMine’s older JSP-based UI. You can view Josh’s slide deck, try out a live demo, or browse the source on GitHub.

Next came one of my favourite parts: short talks from InterMiners.

Short community talks

Doppelgangers – Joel Richardson, MGI

Joel gave a great presentation about doppelgangers in InterMine – that is, how, depending on your data sets and config, you can occasionally end up with duplicate or strange/incomplete InterMine objects in your mine. He followed up with explanations of the root causes and mitigation methods – a great resource for any InterMiner working on data source integration!

Genetic data in Mines – Sam Hokin, NCGR/LegFed

Next up was Sam’s talk about his various beany mines, including CowpeaMine, which has only genetics data, rather than the more typical InterMine genomic data. He’s also implemented several custom data visualisations on gene report pages – check out the slides or mines for more details.

JBrowse and Inter-mine communication – Vivek Krishnakumar, JCVI

Vivek focused on some great cross-InterMine collaborations (slides here), including the technical challenges integrating JBrowse into InterMine, as well as a method to link to other InterMines using synteny rather than InterMine’s typical homology approach.

InterMine at JGI – Joe Carlson, Phytozome, JGI

Joe has the privilege of running the biggest InterMine, currently covering 72 data sets for 69 organisms. Compared to most InterMines, this is massive! Unsurprisingly, this scale comes with a few hitches many of the other mines don’t encounter. Joe’s slides give a great overview of the problems you might encounter in a large-scale InterMine, and their solutions.

Afternoon sessions

FAIR and the semantic web – Daniela & Justin

After a yummy lunch at a nearby cafe, Justin introduced the concept of FAIR, and discussed InterMine’s plans for a FAIRer future (slides). Discussion topics included:

  • How to make stable URIs (InterMine object IDs are transient and will change between builds)
  • Enhanced embedded metadata in webpages and query results (data provenance, licencing)
  • Better findability (the F in FAIR) by registering InterMine resources with external registries
  • RDF generation / SPARQL querying

This was followed by Daniela’s introduction to RDF and SPARQL, which provided a great basic intro to the two concepts in an easily understood manner. I really loved these slides, and I reckon they’d be a good introduction for anyone interested in learning more about what RDF and SPARQL are, whether or not you’re interested in InterMine.

Extending the InterMine Core Data Model – Sergio

Sergio ran the final session, “Extending the InterMine Core Data Model”. Shared models allow for easier cross-InterMine queries, as demoed in the GO tool prototype.

This discussion raised several interesting talking points:

  • Should model extensions be created via community RFC?
  • If so, who is involved? Developers, community members, curators, other?
  • Homologue or homolog? Who knew a simple “ue” could cause incompatibility problems? Most InterMines use the “ue” variation, with the exception of PhytoMine. An answer to this problem was presented in the “friendly mine” section of Vivek’s talk earlier in the day.

Another great output was Siddartha Basu’s gist on setting up InterMine – outlining some pain points and noting the good bits.

Most of us met up for dinner afterwards at Kevin’s Noodle House – highly recommended for meat eaters, less so for veggies.

Exploring Blazegraph

While we’ve been testing Neo4j with all the FlyMine data, and with PhytoMine to verify how well it performs and scales with big databases, we have started exploring another open source graph database: Blazegraph.

Blazegraph overview

Blazegraph is an open source, high-performance graph database supporting the RDF data model.

RDF is a model to describe and store data: in this model, you express facts, also known as “statements”, each composed of three parts and therefore known as triples. Each triple is composed of a subject (the resource), a predicate (the property name of the resource) and an object (the property value). For this reason, Blazegraph is also called a “triple store”.

Subject                                        Predicate    Object
http://flymine.intermine.org/flymine/1007664   :hasSymbol   "zen"

Blazegraph supports SPARQL (pronounced “sparkle”), a rich and expressive query language for RDF standardised by the W3C. Using query operations like union, sort, filter and aggregation, the user can query the data in a very flexible way. With federated queries, the user can aggregate information by executing queries distributed over different SPARQL endpoints, and consequently discover more data across the web.
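To give a flavour of the language, here is a minimal SPARQL sketch over data shaped like the example above (the vocabulary prefix and predicate names are illustrative, not a final design):

    PREFIX : <http://flymine.intermine.org/flymine/vocabulary#>

    # Every gene symbol together with the name of the gene's organism
    SELECT ?symbol ?organismName
    WHERE {
      ?gene :hasSymbol   ?symbol ;
            :hasOrganism ?organism .
      ?organism :hasName ?organismName .
    }
    ORDER BY ?symbol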

Blazegraph provides a SPARQL endpoint where the user can remotely explore, access and download the stored data using the SPARQL language; the Blazegraph workbench provides a graphical interface for the REST APIs.

Blazegraph and Neo4j: different graph modelling

In Neo4j, a node in the graph corresponds to an entity in a domain. Nodes, and also the relationships between them, can contain properties describing the object they represent.

By contrast, in Blazegraph nodes don’t contain properties: a node is either a resource (identified by a URI) or a primitive value such as a string, integer or date, and properties become edges pointing at those values.

In Neo4j we’ve represented the gene entity and its relation with the organism in this way:

[figure: a Gene node and an Organism node, with their properties, connected by a relationship]

In Blazegraph the same concept will be represented as:

[figure: the same gene and organism represented as an RDF graph of triples]

with the following statements:

Only one statement represents the relation between the gene and the organism (the one containing the predicate hasOrganism); the others describe the properties of the two entities.

The resources represented in RDF are identified by unique HTTP URIs (in the example, http://flymine.intermine.org/flymine/1007664).
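As a rough Turtle sketch of those statements (the organism URI, the vocabulary prefix and the predicate names are invented for the example; taxon 7227 is Drosophila melanogaster):

    @prefix fm: <http://flymine.intermine.org/flymine/> .
    @prefix :   <http://flymine.intermine.org/flymine/vocabulary#> .

    # properties of the gene, plus the single statement linking it to its organism
    fm:1007664 a :Gene ;
        :hasSymbol   "zen" ;
        :hasOrganism fm:1007675 .

    # properties of the organism
    fm:1007675 a :Organism ;
        :hasName    "Drosophila melanogaster" ;
        :hasTaxonId 7227 .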

Exporting FlyMine data: InterMine-RDFizer

We have exported all the FlyMine data using InterMine-RDFizer.

InterMine-RDFizer can query any InterMine endpoint via the InterMine API, download the tables as TSV files and transform them into RDF N-Quads based on the XML object model file.


The InterMine-RDFizer script converts every row in a table into an RDF resource. The resource type is based on the class name (e.g. Gene, Organism) and the resource URI is built using the “id” column. The script converts the remaining columns into properties of the resource, building RDF literals typed by the column names.

For FlyMine, we created roughly 365 million triples and imported them into Blazegraph using the REST APIs provided.

Benchmarking

We’ve started testing Blazegraph’s performance using all the FlyMine data imported via InterMine-RDFizer, and comparing the results with Neo4j.

As usual, we will keep you updated!

Queryathlon: racing Neo4j against PostgreSQL

As discussed in a previous post, Exploring graph databases for biological data models, we’ve started evaluating Neo4j as a possible alternative to the current relational database for the InterMine system.

In that post we talked about the Neo4j features we really liked and found to be a good fit for our project, such as:

  • The Neo4j Browser UI, which is very neat and clear;
  • The way in which biological data could be represented as a graph structure in an intuitive way that is easy to browse;
  • The fact that a gene node which is a “Gene” is also a “BioEntity” and a “SequenceFeature” (the parent classes of “Gene”), which is supported by the multi-label feature. In the current InterMine PostgreSQL database, Gene, BioEntity and SequenceFeature are three separate tables.

This is all very well, but in the end, we all know that once you start crunching the real data it’s all about performance. So, after several weeks spent exploring Neo4j features, it was time to start benchmarking Neo4j performance against PostgreSQL.

Use cases

We identified the following queries to be part of our benchmark (the first two are sketched in Cypher after the list):

  • Simple basic queries: return all genes, return genes given an organism;
  • Typical queries: return genes associated with a specific GO term, return GO terms applied to orthologues of a specific gene;
  • Overlapping queries: return the sequence features overlapping the coordinates of a specific gene.
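A rough Cypher sketch of the two simple cases (labels, relationship types and property names follow the model described in the previous post, and are assumptions rather than the exact benchmark queries):

    // All genes
    MATCH (g:Gene)
    RETURN g;

    // Genes for a given organism, via the PART_OF relationship
    MATCH (g:Gene)-[:PART_OF]->(o:Organism {name: "Drosophila melanogaster"})
    RETURN g.primaryIdentifier, g.symbol;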

We imported the subset of FlyMine data involved in the benchmark queries; this created 3.7 million nodes.

For the overlapping queries, the current system uses a “view”, a sort of temporary table. For this test we only included genes (~600,000) rather than all the sequence features in FlyMine.

We created indexes only on the properties relevant to the queries run for the comparison. Unfortunately, we could create neither indexes using functions (e.g. lower(gene.name)) nor composite indexes, as this is not possible using the Cypher query language.
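The indexes themselves are one-liners in Cypher; a sketch, with property names per our model:

    // Single-property indexes on the fields the benchmark queries filter on
    CREATE INDEX ON :Gene(symbol);
    CREATE INDEX ON :Gene(primaryIdentifier);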

Method

Neo4j provides different tools and languages to retrieve the stored data. We used Neo4j’s REST API endpoint, which allows querying with Cypher, Neo4j’s query language.

All queries were executed five times after warming up the Neo4j cache; the values reported are averages over the five executions.

We used curl’s timing options to check how long queries took. The execution time was calculated as time_starttransfer - time_pretransfer.
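Concretely, a timing run looked something like this sketch (the endpoint URL and query are illustrative; %{time_pretransfer} and %{time_starttransfer} are standard curl write-out variables):

    curl -s -o /dev/null \
      -H "Content-Type: application/json" \
      -w "pretransfer: %{time_pretransfer}s  starttransfer: %{time_starttransfer}s\n" \
      -d '{"statements": [{"statement": "MATCH (g:Gene) RETURN count(g)"}]}' \
      http://localhost:7474/db/data/transaction/commit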

For PostgreSQL, we’ve used psql with timing turned on (\timing).

In some cases, we were not able to compare Cypher and SQL queries on a strictly like-for-like basis; for example, in the current system, more than one SQL query is executed to retrieve the GO terms applied to orthologous genes, versus a single Cypher query in Neo4j.

In these cases, we wrote Neo4j server REST extensions using the Neo4j Java APIs to implement the queries, and compared them with the InterMine web services. We know it’s not a fair comparison: the Neo4j server extension was implemented to execute only one specific query, whereas the InterMine web service (WS) is able to run any query; but we wanted to experiment and see how far apart Neo4j and Postgres are in terms of performance. For Neo4j, we’d also eventually need to add a Java layer to manage dynamic models and queries, which would necessarily slow down query execution.

The scripts and server REST extensions written for the benchmarking are on GitHub.

Results

All genes

Show all genes.

psql (SQL)   Neo4j endpoint (Cypher)   Notes
1200 ms      5 ms                      Return all properties
1400 ms      1400 ms                   Return all properties, ordered by primary identifier
360 ms       12 ms                     Return primary identifier and symbol
85 ms        5 ms                      Return gene count

Genes given an organism

Show all genes given a specific organism: Drosophila melanogaster.

[figure: representative example of the gene query – the real one has thousands of results!]
psql (SQL)   Neo4j endpoint (Cypher)   Notes
80 ms        4 ms                      Return all properties
110 ms       84 ms                     Return all properties, ordered by primary identifier
20 ms        10 ms                     Return primary identifier and symbol

GO term -> Gene

Show genes annotated with a specified GO term: protein binding, cellular_component and nucleoplasm.


psql (SQL)   Neo4j endpoint (Cypher)   InterMine web services   Notes
15 ms        16 ms                     37 ms                    protein binding
28 ms        15 ms                     38 ms                    cellular_component
4.7 ms       6 ms                      29 ms                    nucleoplasm

Gene -> Orthologue + GO term

Show GO terms applied to orthologues of a specific gene.

We cannot compare the complete queries exactly, but we can compare a simplified version. The table below shows the execution time to retrieve all the orthologues of the gene with symbol “tws” (and the organism each orthologue belongs to), but not the GO terms.

psql (SQL)   Neo4j endpoint (Cypher)   Notes
2 ms         3 ms                      No JOIN with organism
3 ms         4 ms                      JOIN with organism

To obtain the GO terms associated with the orthologues, we ran the Cypher query via the Neo4j endpoint and via the server REST extension implemented with the Neo4j Java APIs, and compared both with the InterMine WS.

Neo4j endpoint (Cypher)   Server extension (Java API)   InterMine web services
11.3 ms                   12 ms                         35 ms

As we said before, we have to keep in mind that the InterMine WS accepts any query, so the comparison is not entirely fair.

Gene -> Overlapping Genes

For a particular gene, search for overlapping genes.

We created 32,405 OVERLAPS relationships (genes only) to replace the view in the current database. Traversing pre-computed OVERLAPS relationships is faster than computing the overlaps within the query.
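A sketch of what this query might look like in Cypher (the identifier property used for the lookup is an assumption):

    // Genes overlapping CG11566, via the pre-computed OVERLAPS relationship
    MATCH (g:Gene {secondaryIdentifier: "CG11566"})-[:OVERLAPS]-(other:Gene)
    RETURN other.primaryIdentifier, other.symbol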

The table below shows the execution time using the constraint lookup=CG11566.

Neo4j endpoint (Cypher)   Server extension (Java API)   InterMine WS
3.5 ms                    3.5 ms                        30 ms

Conclusions

Given the way we were able to run the experiments, with the “runners” sometimes having to run different routes or under different conditions, we cannot really draw any definitive conclusions based on hard evidence. Having said this, what we have seen is quite encouraging: Neo4j has performed well enough with real InterMine data and typical queries to warrant further and more thorough investigation.

Exploring graph databases for biological data models


In order to keep InterMine up to date with the latest technologies and integrated with the best solutions offered by the open source community, we always keep an eye on emerging products and explore new tools and platforms. These days, our attention could hardly fail to be caught by NoSQL databases.

What is NoSQL?

As the name suggests, NoSQL databases referred, at least originally, to “non-SQL” or “non-relational” databases: unlike relational databases, they don’t organise data into tables of rows and columns. More recently, the term has also come to mean “not only SQL”, because some tools have started introducing SQL-like query languages.

In NoSQL databases, there are many approaches to managing data using different structures:

  • key-value databases, the simplest NoSQL databases, where every single item is stored as an attribute name (or “key”) together with its value;
  • wide-column databases, using tables, rows and columns, where the column names and formats can change from row to row within the same table;
  • document databases, pairing each key with a complex data structure known as a document;
  • graph databases, where the data are modelled as graphs composed of nodes and edges (or “relationships”).

As usual, there is no silver bullet, and the best approach depends on the specific data model. If we needed to implement a content management system or a blogging platform, we would avoid key-value databases, which are more suitable for storing simple data (e.g. session information), and be more inclined towards document databases.

In our specific case, because we have to handle complex biological data and relations, graph databases seem to be the most suitable candidate, worth considering as a possible alternative to the current relational database.

Experiment: InterMine + Neo4j

There are several open source graph database implementations; we decided to start by evaluating Neo4j, the most popular: it is very well established, has good documentation and a big, active community supporting it, is simple to use, and has regular meetups and events organised around the world.

The Neo4j Browser is a great tool to query the data (using the simple Cypher language) and visualise it in different formats: graph, table and text. The graph view in particular is really neat and intuitive: in just a few clicks you have a lot of information. Clicking on any node or relationship shows the properties of that element, and starting from a node you can expand all the relationships associated with it. It is possible to rearrange the graph, dragging or deleting nodes from the view, or to customise the colours, sizes and captions of nodes. Amazing!

Any time you run a Cypher query in the editor at the top, the result is displayed in a new frame below; type another query, get another frame. Love it! The “history” command is also really useful, and it persists across browser restarts. A really delightful and intuitive user interface.

But let us explain, in more detail, how the data are organized.

Neo4j graphs are composed of nodes and relationships: nodes generally represent entities, and they are connected by relationships. Both can contain properties.

For example, the “zen” gene, represented as a row in the “gene” table in the current relational model, will be re-modelled as a node in the new graph model, containing properties such as symbol, primaryIdentifier and secondaryIdentifier. The same applies to the organism the gene belongs to: it too is now a node (in Postgres, organism is a separate table). The relationship PART_OF connects the gene node with its organism node; Postgres requires a JOIN to query these two tables.

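As a minimal Cypher sketch of this shape (property names follow our model; taxonId 7227 is Drosophila melanogaster’s NCBI taxon):

    // Gene and organism become nodes, linked by a PART_OF relationship
    CREATE (o:Organism {name: "Drosophila melanogaster", taxonId: 7227})
    CREATE (g:Gene {symbol: "zen"})
    CREATE (g)-[:PART_OF]->(o)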

Relationships can also have properties: the fact that a gene is located at a specific position on a chromosome can be represented by the relationship LOCATED_ON, with the properties start, end and strand.

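Sketched in Cypher (the chromosome and the coordinate values are made up for illustration):

    // The gene's position lives on the LOCATED_ON relationship itself
    MATCH (g:Gene {symbol: "zen"})
    CREATE (c:Chromosome {primaryIdentifier: "3R"})
    CREATE (g)-[:LOCATED_ON {start: 100000, end: 101000, strand: "1"}]->(c)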

Each node can have a label, so the node containing the gene will have the label “Gene”, and the node with the organism the label “Organism”. Nice!

A node can have more than one label, so a gene node will have the labels BioEntity, SequenceFeature and Gene. No more duplication of the same gene across the BioEntity, SequenceFeature and Gene tables, as in the current model: just one node with several labels. This will certainly save some database space.
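In Cypher, multiple labels are simply chained on the node (a sketch):

    // One node carries the whole class hierarchy as labels...
    CREATE (g:BioEntity:SequenceFeature:Gene {symbol: "zen"});

    // ...so a query against a parent class also matches the gene
    MATCH (f:SequenceFeature) RETURN f.symbol;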

Modelling the data

We have imported part of the FlyMine data into a new Neo4j database, using the neo4j-shell tool and implementing new Cypher scripts.

Importing FlyMine data has been not only a necessary step before starting benchmarking, but also very useful for recognising the importance of re-thinking our data model (a Cypher sketch of the first example follows this list):

  • Some associative tables have disappeared, replaced by relationships (e.g. the table genegoannotation has been replaced by the ANNOTATED_WITH relationship between the node Gene and the node GoAnnotation)
  • Some tables have been replaced by multiple relationships (e.g. the table homologue has been replaced by the relationships IS_ORTHOLOGOUS, IS_PARALOGOUS and IS_LEAST_DIVERGED_ORTHOLOGOUS, depending on the type), while the table’s columns have become relationship properties (e.g. LOCATED_ON above)
  • The view overlappingfeaturessequencefeature has been replaced by the OVERLAPS relationship between two genes.
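For instance, what used to require the genegoannotation join table becomes a direct traversal in the new model (a sketch, reusing the zen example):

    // GO annotations for zen, via the ANNOTATED_WITH relationship
    MATCH (g:Gene {symbol: "zen"})-[:ANNOTATED_WITH]->(a:GoAnnotation)
    RETURN a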

Summary

These are just examples and maybe not the best approach to modelling our data, but they have helped us to imagine how our model could be represented in the Neo4j graph world and…we liked it!


Our first impressions of Neo4j have been very positive! We are very excited.

We are currently benchmarking the query execution times against PostgreSQL. We still have a lot of tuning and configuration settings to try out in order to obtain the best from Neo4j, which will be a challenge, but it is certainly worth the effort!

We will keep you updated.