Researchers connected in Berlin

[Image: researchers connected at the workshop]

I really enjoyed attending the Neo4j Life & Health Sciences Workshop, organized in Berlin this week by Michael and Petra: a day rich with great presentations about the application and utility of graph technology in several research areas. Here are just a few examples:

  • The Ontology Lookup Service, a repository for biomedical ontologies, implemented with the support of graph databases and Apache Solr for indexing: different technologies for different purposes.
  • In the Lamond lab (University of Dundee), they model proteomics data with graph databases in order to understand protein behaviour under different conditions and dimensions of analysis.
  • MetaProteomeAnalyzer (MPA), a tool for analysing and visualising metaproteomics data, uses Neo4j as the backend of its data analysis software.
  • Tabloid Proteome is a database of associated protein pairs derived from mass-spectrometry-based proteomics experiments, implemented using a graph database; it can also help to discover proteins that are indirectly connected, or surface information you weren’t looking for!
  • Reactome is a pathway database which has recently migrated from MySQL to Neo4j, with significant performance improvements. You can access the data via the GraphCore open source Java library, developed with Spring Data Neo4j, or via the Neo4j browser.

I’ve lost count of how many times I heard sentences like “Biology systems are complex and growing, and graphs are the native data model” or “Graph database technology is an effective tool for modelling highly connected data, as we have in biology systems”. We already knew it, but it was very encouraging and promising to hear it again from so many researchers and practitioners with more experience in graph technologies than us.

In the afternoon, I attended the workshop “Data modelling with Neo4j”: starting from the data sources we usually work with, we tried to model the entities and relationships needed to answer some relevant questions. Modelling can be very challenging and, in some cases, the right model depends on the questions you have to answer!

Before the end, I had the chance to give a short presentation about our experience with Neo4j.

Thanks again Michael and Petra for organizing such a great event!


GraphConnect – a Neo4j conference

[Image: GraphConnect conference]

We were in London to attend GraphConnect, the annual conference organised by Neo4j.
It was fantastic to meet so many people from around the world enthusiastic about graph databases, and many people who, like us, are prototyping and exploring Neo4j as a possible alternative to relational databases.

They announced the release of Neo4j 3.2, which promises to bring a huge improvement in terms of performance: the compiled Cypher runtime improves speed by roughly 300% for a subset of basic queries, and the introduction of native label indexes has also improved write speed.

They have also added composite indexes (which InterMine uses a lot) and support for using indexes with the OR operator. We highlighted that problem months ago on Stack Overflow and were pleasantly surprised to see it fixed. We can now update our “What we didn’t like about Neo4j” list by removing two items. We’re really happy about that!

It was a pleasure to attend Jesus Barrasa’s talk debunking some RDF versus property graph “alternative facts”. He demoed how an RDF resource does not necessarily have to live in a triple store but can also be stored in Neo4j. See part 1 and part 2 of “Neo4j is your RDF store”, a nice post where he describes his work in more detail.

Another nice tool they have implemented is the ETL tool, which imports data from a relational database into Neo4j by applying some simple mapping rules.

The solution-based talks demonstrated how Neo4j is being used to solve complex, real-world problems ranging from travel recommendation engines to measuring the impact of slot machine locations on casino floors. While the topics were diverse, a common theme across their respective architectures was the use of GraphAware’s plugins, some of which are free. One plugin that looks particularly interesting is the Neo4j2Elastic tool, which transparently pushes data from Neo4j to Elasticsearch.

During the conference, we discovered that there is a Neo4j Startup Program that allows startups to use Neo4j Enterprise Edition for free. Not sure if we count as a startup though!

Overall, we’re super happy with the improvements Neo4j has made, and super impressed with Neo4j’s growing community. Looking forward to meeting the Neo4j team in London at their meetup, and sharing our experience with the community!

The FAIR journey

[Image: backpack]

It’s time to celebrate. After some hectic weeks preparing and organizing the InterMine conference, we can now take a deep breath and get ready for our new BBSRC-funded project to make data more FAIR.

It will be a short rest: just enough time to check that we have everything we need for this long journey, and to savour the excitement before departing towards FAIRness, a destination that will enhance InterMine’s compliance with the FAIR principles. InterMine has been at the forefront of getting research data into the hands of scientists for over 10 years, and we’re excited to support the formalisation of these principles.

As a team, we clearly recognized the need to implement the FAIR data principles, making the biological data stored in InterMine instances more Findable, Accessible, Interoperable and Reusable by both humans and machines, as well as the huge impact this could have on the quality of the biological data InterMine serves. By implementing mechanisms that make data stored in InterMine FAIR, we have a unique opportunity to make FAIR all the data collections served by nearly 30 public biological InterMine instances worldwide.

This is a great chance and we didn’t want to miss it!

Here are some important milestones we want to achieve along the journey:

  • Generate globally unique and stable URIs to identify InterMine data objects and register them in community bioinformatics repositories (for instance bio.tools and Identifiers.org) in order to provide more findable and accessible data.
  • Apply suitable ontologies to the core InterMine data model to make the semantics of InterMine data explicit and to facilitate data exchange and interoperability.
  • Embed metadata in InterMine web pages and add extra metadata to InterMine’s existing structured query results to make data more findable.
  • Provide an RDF representation of stored data, lists and query results, plus bulk download of all InterMine data in RDF form, so that users can import InterMine resources into their local triple stores.
  • Provide the infrastructure for a SPARQL endpoint where users can perform federated queries over multiple data sets.
  • Improve the accessibility of data licences for integrated sources via the web interface and REST web services.

It will be an exciting challenge. Follow us on this blog.

Where’s Wally?

Last week InterMine attended the first RSE (Research Software Engineer) conference (look at the picture… we are there!).

[Image: RSE conference group photo]

But what is an RSE? In the opening talk on the first day, Caroline Jay from the University of Manchester defined RSEs as “the coalface of ensuring that computational science is accurate, reliable and reproducible, and their views on making progress in this domain are therefore particularly valuable.” Particularly valuable because, as a slogan everyone loved puts it: “Software can exist without a paper, but a paper and its results can’t. If the software is wrong, the science is wrong”.

As promised by the organizers, the conference focused exclusively on the issues that affect people who write and use software in research, not people who write papers. Over two days there were a lot of interesting talks and workshops about how research software engineers can grow a project for science, best practices, the software development process, Docker…

As the InterMine team, we contributed to the conference by sharing our story: what “open” really means to us, why we chose open source and how we try to be open. The image below shows our vision of open source.

[Image: Open source – more than just a licence on GitHub]

We also shared the best practices we’ve learned over the years in designing, writing and maintaining open source software for science, hoping that people embarking on their first open source project can benefit from them. [Slides from our talk]

We had a great time talking and meeting with a lot of very friendly and passionate people, sharing ideas, best practices, issues and doubts.

Thanks to the UK-RSE folks for organizing such a great event!

See you next year!

Exploring Blazegraph

While we’ve been testing Neo4j with all the FlyMine data, and with PhytoMine to verify how well it performs and scales with big databases, we have started exploring another open source graph database: Blazegraph.

Blazegraph overview

Blazegraph is an open source, high-performance graph database supporting the RDF data model.

RDF is a model for describing and storing data: you express facts, also known as “statements”, as triples composed of three parts. Each triple consists of a subject (the resource), a predicate (the property name of the resource) and an object (the property value). For this reason, Blazegraph is also called a “triple store”.

| Subject | Predicate | Object |
| --- | --- | --- |
| http://flymine.intermine.org/flymine/1007664 | :hasSymbol | "zen" |

Blazegraph supports SPARQL (pronounced “sparkle”), a rich and expressive W3C-standard query language for RDF. Using query operations like union, sort, filter and aggregation, users can query the data in a very flexible way. With federated queries, users can aggregate information by executing queries distributed over different SPARQL endpoints, and consequently discover more data across the web.
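As a flavour of the language, here is a minimal SPARQL sketch against the toy triple above (the :hasSymbol, :hasOrganism and :hasName predicates are the illustrative names used in this post, not an established vocabulary):

```sparql
PREFIX : <http://flymine.intermine.org/flymine/>

# Symbols of all genes belonging to Drosophila melanogaster, sorted alphabetically.
SELECT ?gene ?symbol
WHERE {
  ?gene :hasSymbol   ?symbol ;
        :hasOrganism ?organism .
  ?organism :hasName "Drosophila melanogaster" .
}
ORDER BY ?symbol
```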

Blazegraph provides a SPARQL endpoint where users can remotely explore, access and download the stored data using the SPARQL language; the Blazegraph workbench provides a graphical interface for the REST APIs.

Blazegraph and Neo4j: different graph modelling

In Neo4j, a node in the graph corresponds to an entity in the domain. Nodes, and also the relationships between them, can contain properties describing the objects they represent.

By contrast, in Blazegraph nodes don’t contain properties: a node is either a resource or a primitive value such as a string, an integer or a date.

In Neo4j, we represented the gene entity and its relation to the organism like this:

[Image: gene node with its properties]

[Image: gene–organism relationship in Neo4j]

In Blazegraph, the same concept is represented as:

[Image: the same gene–organism data modelled in Blazegraph]

with the following statements:

[Image: triples]

Only one statement represents the relation between the gene and the organism (the one containing the predicate hasOrganism); the others describe the properties of the two entities.

The resources represented in RDF are identified by unique HTTP URIs (in the example, http://flymine.intermine.org/flymine/1007664).
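As a minimal sketch, the statements above might be serialised in Turtle like this (the predicate names and the organism URI are illustrative assumptions, not the exact output of our tooling):

```turtle
@prefix : <http://flymine.intermine.org/flymine/> .

# Statements describing the gene resource.
:1007664 a :Gene ;
    :hasSymbol   "zen" ;
    :hasOrganism :1007665 .   # the single statement relating the gene to its organism

# Statements describing the organism resource.
:1007665 a :Organism ;
    :hasName   "Drosophila melanogaster" ;
    :hasTaxonId 7227 .
```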

Exporting FlyMine data: Intermine-RDFizer

We have exported all FlyMine data using Intermine-RDFizer.

The InterMine-RDFizer can query any InterMine endpoint via the InterMine API, download the tables as TSV files and transform them into RDF N-Quads based on the XML object model file.

[Image: InterMine-RDFizer workflow]

The InterMine-RDFizer script converts every row in a table into an RDF resource. The resource type is based on the class name (e.g. Gene, Organism) and the resource URI is built from the “id” column. The script converts the remaining columns into resource properties, building RDF literals typed with the column names.

[Image: triple count in Blazegraph]

For FlyMine, we created roughly 365 million triples and imported them into Blazegraph using the provided REST APIs.

Benchmarking

We’ve started testing Blazegraph performance using all FlyMine data imported via InterMine-RDFizer and comparing the results with Neo4j.

As usual, we will keep you updated!

 

Queryathlon: racing Neo4j against PostgreSQL

[Image: foot race on a Panathenaic vase (1911 Encyclopædia Britannica, “Greek art”)]

As discussed in a previous post, Exploring graph databases for biological data models, we’ve started evaluating Neo4j as a possible alternative to the current relational database for the InterMine system.

In that post we talked about the Neo4j features we really liked and found to be a good fit for our project, such as:

  • The Neo4j Browser UI, which is very neat and clear;
  • The way in which biological data can be represented as a graph structure in an intuitive way that is easy to browse;
  • The fact that a gene node labelled “Gene” is also a “BioEntity” and a “SequenceFeature” (parent classes of “Gene”), thanks to the multi-label feature; in the current InterMine PostgreSQL database, Gene, BioEntity and SequenceFeature are three separate tables.

This is all very well, but in the end, we all know that once you start crunching the real data it’s all about performance. So, after several weeks spent exploring Neo4j features, it was time to start benchmarking Neo4j performance against PostgreSQL.

Use cases

We identified the following queries to be part of our benchmark:

  • Simple basic queries: return all genes, return genes given an organism;
  • Typical queries: return genes associated with a specific GO term, return GO terms applied to orthologues of a specific gene;
  • Overlapping queries: return the sequence features overlapping the coordinates of a specific gene.
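In Cypher, two of these look roughly like the sketch below (the labels, relationship types and property names follow the experimental model from our earlier post, so treat them as assumptions rather than the exact benchmark queries):

```cypher
// Genes for a given organism.
MATCH (g:Gene)-[:PART_OF]->(o:Organism {name: 'Drosophila melanogaster'})
RETURN g.primaryIdentifier, g.symbol;

// Genes annotated with a given GO term (HAS_TERM is an assumed relationship name).
MATCH (g:Gene)-[:ANNOTATED_WITH]->(:GoAnnotation)-[:HAS_TERM]->(:GoTerm {name: 'protein binding'})
RETURN g.primaryIdentifier;
```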

We imported the subset of FlyMine data involved in the benchmark queries, creating 3.7 million nodes.

For the overlapping queries, the current system uses a “view”, a sort of temporary table. For this test we only included genes (~600,000) and not all the sequence features in FlyMine.

We created indexes only on the properties relevant to the comparison queries. Unfortunately, we could create neither indexes using functions (e.g. lower(gene.name)) nor composite indexes, as these are not possible in the Cypher query language.
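For reference, single-property schema indexes in Neo4j 3.x are created like this (the indexed property names are the ones assumed in our model):

```cypher
// Schema indexes on the properties used by the benchmark queries.
CREATE INDEX ON :Gene(primaryIdentifier);
CREATE INDEX ON :Gene(symbol);
```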

Method

Neo4j provides different tools and languages for retrieving stored data. We used Neo4j’s REST API endpoint, which allows querying with Cypher, Neo4j’s query language.

All the queries were executed 5 times after warming up the Neo4j cache; the reported values are averages over the 5 executions.

We used some curl options to check how long queries took. The execution time was calculated as time_starttransfer - time_pretransfer.
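A minimal sketch of the kind of call we used, assuming a local Neo4j 3.x transactional HTTP endpoint and one of the benchmark queries (add -u user:password if authentication is enabled):

```bash
# Time a Cypher query; execution time ~= time_starttransfer - time_pretransfer.
curl -s -o /dev/null \
  -H "Content-Type: application/json" \
  -d '{"statements": [{"statement": "MATCH (g:Gene) RETURN count(g)"}]}' \
  -w "pretransfer: %{time_pretransfer}s  starttransfer: %{time_starttransfer}s\n" \
  http://localhost:7474/db/data/transaction/commit
```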

For PostgreSQL, we used psql with timing turned on (\timing).

In some cases, we were not able to compare Cypher and SQL queries on a strictly like-for-like basis; for example, in the current system, retrieving the GO terms applied to orthologue genes executes more than one SQL query, versus a single Cypher query in Neo4j.

In these cases, we wrote Neo4j server REST extensions using the Neo4j Java APIs to implement the queries, and compared them with the InterMine web services. We know this is not a fair comparison: each Neo4j server extension was implemented to execute only one specific query, whereas an InterMine web service (WS) can run any query. Still, we wanted to experiment and see how far apart Neo4j and PostgreSQL are in terms of performance. For Neo4j, we would also eventually need to add a Java layer to manage dynamic models and queries, which would necessarily slow down query execution.
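As a rough illustration of the approach, an unmanaged server extension in Neo4j 3.x looks something like this sketch (the mount path, query and property names are illustrative assumptions, not our exact benchmarking code, which is linked below):

```java
import java.util.Collections;

import javax.ws.rs.GET;
import javax.ws.rs.Path;
import javax.ws.rs.PathParam;
import javax.ws.rs.Produces;
import javax.ws.rs.core.Context;
import javax.ws.rs.core.MediaType;
import javax.ws.rs.core.Response;

import org.neo4j.graphdb.GraphDatabaseService;
import org.neo4j.graphdb.Result;
import org.neo4j.graphdb.Transaction;

// Unmanaged extension answering one fixed question, e.g. GET /orthologues/tws
@Path("/orthologues")
public class OrthologueResource {

    private final GraphDatabaseService db;

    public OrthologueResource(@Context GraphDatabaseService db) {
        this.db = db;
    }

    @GET
    @Path("/{symbol}")
    @Produces(MediaType.TEXT_PLAIN)
    public Response orthologues(@PathParam("symbol") String symbol) {
        StringBuilder out = new StringBuilder();
        // One hard-wired Cypher query, in contrast to the generic InterMine WS.
        try (Transaction tx = db.beginTx();
             Result result = db.execute(
                 "MATCH (g:Gene {symbol: {symbol}})-[:IS_ORTHOLOGOUS]->(o:Gene) "
                     + "RETURN o.primaryIdentifier AS id",
                 Collections.<String, Object>singletonMap("symbol", symbol))) {
            while (result.hasNext()) {
                out.append(result.next().get("id")).append('\n');
            }
            tx.success();
        }
        return Response.ok(out.toString(), MediaType.TEXT_PLAIN).build();
    }
}
```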

The scripts and server REST extensions written for benchmarking are on GitHub.

Results

All genes

Show all genes.

| psql (SQL) | Neo4j endpoint (Cypher) | Notes |
| --- | --- | --- |
| 1200 ms | 5 ms | Return all properties |
| 1400 ms | 1400 ms | Return all properties, ordered by primary identifier |
| 360 ms | 12 ms | Return primary identifier and symbol |
| 85 ms | 5 ms | Return gene count |

Genes given an organism

Show all genes given a specific organism: Drosophila melanogaster.

[Image: gene–organism query graph. Representative example of the gene query – the real one has thousands of results!]
| psql (SQL) | Neo4j endpoint (Cypher) | Notes |
| --- | --- | --- |
| 80 ms | 4 ms | Return all properties |
| 110 ms | 84 ms | Return all properties, ordered by primary identifier |
| 20 ms | 10 ms | Return primary identifier and symbol |

GO term -> Gene

Show genes annotated with a specified GO term; we tested protein binding, cellular_component and nucleoplasm.

[Image: gene–GO term query graph]

| psql (SQL) | Neo4j endpoint (Cypher) | InterMine web services | Notes |
| --- | --- | --- | --- |
| 15 ms | 16 ms | 37 ms | protein binding |
| 28 ms | 15 ms | 38 ms | cellular_component |
| 4.7 ms | 6 ms | 29 ms | nucleoplasm |

Gene -> Orthologue + GO term

Show GO terms applied to orthologues of a specific gene.

[Image: orthologue–GO term query graph]

We cannot compare the complete queries exactly, but we can compare a simplified version. The table below shows the execution time to retrieve all the orthologues of the gene with symbol “tws” (and the organisms the orthologues belong to), but not the GO terms.

| psql (SQL) | Neo4j endpoint (Cypher) | Notes |
| --- | --- | --- |
| 2 ms | 3 ms | No JOIN with organism |
| 3 ms | 4 ms | JOIN with organism |

To obtain the GO terms associated with the orthologues, we ran the Cypher query via the Neo4j endpoint and via the server REST extension implemented with the Neo4j Java APIs, and compared both with the InterMine WS.

| Neo4j endpoint (Cypher) | Server extension (Java API) | InterMine web services |
| --- | --- | --- |
| 11.3 ms | 12 ms | 35 ms |

As we said before, we have to keep in mind that the InterMine WS accepts any query, so the comparison is not entirely fair.

Gene -> Overlapping Genes

For a particular gene, search for overlapping genes.

[Image: overlapping genes query graph]

 

We created 32,405 OVERLAPS relationships (only between genes) to replace the view in the current database. Using OVERLAPS relationships is faster than doing the overlap calculations at query time.
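A hedged Cypher sketch of how these relationships can be precomputed, assuming genes are attached to chromosomes by LOCATED_ON relationships carrying start and end properties, as in our earlier modelling post:

```cypher
// Precompute OVERLAPS between genes on the same chromosome whose
// coordinate intervals intersect (each pair is matched in both directions).
MATCH (g1:Gene)-[l1:LOCATED_ON]->(:Chromosome)<-[l2:LOCATED_ON]-(g2:Gene)
WHERE g1 <> g2
  AND l1.start <= l2.end
  AND l2.start <= l1.end
CREATE (g1)-[:OVERLAPS]->(g2);
```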

The table below shows the execution time using the constraint lookup=CG11566.

| Neo4j endpoint (Cypher) | Server extension (Java API) | InterMine WS |
| --- | --- | --- |
| 3.5 ms | 3.5 ms | 30 ms |

Conclusions

Given the way we were able to run the experiments, with the “runners” sometimes having to run different routes or under different conditions, we cannot really draw any definitive conclusions based on hard evidence. Having said this, what we have seen is quite encouraging: Neo4j has performed well enough with real InterMine data and typical queries to warrant further, more thorough investigation.

 

Exploring graph databases for biological data models

[Image: graph]

In order to keep InterMine up to date with the latest technologies and integrated with the best solutions offered by the open source community, we always keep an eye on emerging products and explore new tools and platforms. These days, our attention could not fail to be caught by NoSQL databases.

What is NoSQL?

As the name says, NoSQL databases refer, at least originally, to “non-SQL” or “non-relational” databases, where data are not organised into the tables of the relational model. More recently, the term has also come to stand for “not only SQL”, because some tools have started introducing SQL-like query languages.

In NoSQL databases, there are many approaches to managing data using different structures:

  • key-value databases, the simplest NoSQL databases, where every single item is stored as an attribute name (or “key”) together with its value;
  • wide-column databases, using tables, rows and columns, where the column names and formats can vary from row to row within the same table;
  • document databases, pairing each key with a complex data structure known as a document;
  • graph databases, where data are modelled as graphs composed of nodes and edges (or “relationships”).

As usual, there is no silver bullet and the best approach depends on the specific data model. If we needed to implement a content management system or a blogging platform, we would avoid key-value databases, which are better suited to storing simple data (e.g. session information), and would be more inclined toward document databases.

In our specific case, because we have to handle complex biological data and relations, graph databases seem to be the most suitable candidate, worth considering as a possible alternative to the current relational database.

Experiment: InterMine + Neo4j

There are several open source graph database implementations; we decided to start by evaluating Neo4j, the most popular: it is well established, has good documentation and a big, active community supporting it, is simple to use, and there are regular meetups and events organized around the world.

The Neo4j Browser is a great tool for querying data (using the simple Cypher language) and visualising the results in different formats: graph, table and text. In particular, the graph view is really neat and intuitive: in just a few clicks you have a lot of information. Clicking on any node or relationship shows the properties of that element, and starting from a node you can expand all the relationships associated with it. It is possible to rearrange the graph, dragging or deleting nodes from the view, or to customise the colours, sizes and captions of nodes. Amazing!

Any time you run a Cypher query in the editor at the top, the result is displayed in a new frame below; type another query, get another frame. Love it! The :history command is also very useful, and the history persists across browser restarts. A really delightful and intuitive user interface.

But let us explain, in more detail, how the data are organized.

Neo4j graphs are composed of nodes and relationships: nodes generally represent entities, and they are connected by relationships. Both can contain properties.

For example, the “zen” gene, represented as a row in the “gene” table in the current relational model, is remodelled as a node in the new graph model, containing properties such as symbol, primaryidentifier and secondaryidentifier. The same applies to the organism the gene belongs to: it also becomes a node (in Postgres, organism is a separate table). The relationship PART_OF connects the gene node with its organism; Postgres requires a JOIN to query these two tables.

[Image: gene and organism nodes connected by PART_OF]

Relationships can also have properties: the fact that a gene is located at a specific position within a chromosome can be represented by a LOCATED_ON relationship with the properties start, end and strand.

[Image: LOCATED_ON relationship with start, end and strand properties]

Each node can have a label, so the gene node has the label “Gene” and the organism node the label “Organism”. Nice!

A node can also have more than one label, so a gene node has the labels BioEntity, SequenceFeature and Gene. No more duplication of the same gene across the BioEntity, SequenceFeature and Gene tables, as in the current model: just one node with several labels. This will certainly save some database space.
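Putting this together, a minimal Cypher sketch of the zen example (the identifiers and coordinates are placeholders, not real values):

```cypher
// One gene node with three labels: no duplication across tables.
CREATE (zen:BioEntity:SequenceFeature:Gene {
         symbol: 'zen',
         primaryIdentifier: 'FBgn...'   // placeholder identifier
       })
CREATE (dmel:Organism {name: 'Drosophila melanogaster', taxonId: 7227})
CREATE (zen)-[:PART_OF]->(dmel)
CREATE (chr:Chromosome {primaryIdentifier: '3R'})
CREATE (zen)-[:LOCATED_ON {start: 1, end: 100, strand: '+'}]->(chr);
```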

Modelling the data

We imported part of the FlyMine data into a new Neo4j database, using the neo4j-shell tool and newly written Cypher scripts.

Importing the FlyMine data was not only a necessary step before benchmarking, but also very useful for recognising the importance of rethinking our data model.

  • Some associative tables have disappeared, replaced by relationships (e.g. the genegoannotation table has been replaced by the ANNOTATED_WITH relationship between the Gene node and the GoAnnotation node)
  • Some tables have been replaced by multiple relationships (e.g. the homologue table has been replaced by the IS_ORTHOLOGOUS, IS_PARALOGOUS and IS_LEAST_DIVERGED_ORTHOLOGOUS relationships, depending on the type), while the tables’ columns have become relationship properties (e.g. LOCATED_ON in the picture above)
  • The overlappingfeaturessequencefeature view has been replaced by the OVERLAPS relationship between two genes.

Summary

These are just examples and maybe not the best approach to modelling our data, but they have helped us to imagine how our model could be represented in the Neo4j graph world and…we liked it!

[Image: part of the remodelled FlyMine data as a Neo4j graph]

Our first impressions of Neo4j have been very positive! We are very excited.

We are currently benchmarking the query execution times against PostgreSQL. We still have a lot of tuning and configuration settings to try out in order to obtain the best from Neo4j, which will be a challenge, but it is certainly worth the effort!

We will keep you updated.