InterMine 4.0 – InterMine as a FAIR framework

We are excited to publish the latest version of InterMine, version 4.0.

It’s a collection of our efforts to make InterMine more “FAIR”. As an open source data warehouse, InterMine’s raison d’être is to be a framework that enables people to quickly and easily provide public access to their data in a user-friendly manner. InterMine has therefore always striven to make data Findable, Accessible, Interoperable and Reusable, and this push is meant to formally apply the FAIR principles to InterMine.

What’s included in this release?

  1. Generate globally unique and stable URLs to identify InterMine data objects in order to provide more findable and accessible data.
  2. Apply suitable ontologies to the core InterMine data model to make the semantics of InterMine data explicit and facilitate data exchange and interoperability.
  3. Embed metadata in InterMine web pages to make data more findable.
  4. Improve accessibility of data licenses for integrated sources via web interface and REST web-service.

More details below!

How to upgrade?

This is a non-disruptive release, but there are additions to the data model. Therefore, you’ll want to increment your version, then build a new database when upgrading. No other action is required.

However, keep reading to learn how to take advantage of the new FAIR features in this release.

Unique and stable URLs

We’ve added a beautiful new user-friendly URL.

Example: http://beta.flymine.org/beta/gene:FBgn0000606

Currently this is used only in the “share” button in the report pages and in the web pages markup. In the future, this will be the only URL seen in the browser location bar.

For details on how to configure your mine’s URLs, see the docs here.

See our previous blog posts on unique identifiers.

Decorating the InterMine data model with ontology terms

InterMine 4.0 introduces the ability to annotate your InterMine data model with ontology terms.

While these data are not used (yet), it’s an important feature in that it’s going to facilitate cross-InterMine querying, and eventually cross-database analysis — allowing us to answer questions like “Is the ‘gene’ in MouseMine the same ‘gene’ at the EBI?”.

For details on how to add ontologies to your InterMine data model, see the docs here.

Embedding metadata in InterMine webpages

We’ve added structured data in JSON-LD format to web pages to make data more findable; these data are indexed by Google Dataset Search. Bioschemas.org extends Schema.org with life science-specific types, adding required properties and cardinality to those types. For more details see the docs here.

By default this feature is disabled. For details on how to enable embedding metadata in your webpages, see the docs here.
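
To give a feel for what embedded metadata looks like, here is a minimal sketch that builds a JSON-LD description of a gene report page and wraps it in the script element that search engines read. The BioChemEntity type and the choice of properties are assumptions made for illustration, and the helper names are hypothetical; the markup InterMine actually emits may well differ.

```python
import json


def gene_jsonld(name: str, identifier: str, url: str) -> dict:
    """Build a minimal JSON-LD description of a gene report page.

    The @type and property names are illustrative assumptions,
    not the exact markup InterMine generates.
    """
    return {
        "@context": "http://schema.org",
        "@type": "BioChemEntity",
        "name": name,
        "identifier": identifier,
        "url": url,
    }


def as_script_tag(data: dict) -> str:
    """Wrap JSON-LD in the <script> element embedded in the page."""
    return '<script type="application/ld+json">' + json.dumps(data) + "</script>"


markup = as_script_tag(
    gene_jsonld(
        "FBgn0000606",
        "gene:FBgn0000606",
        "http://beta.flymine.org/beta/gene:FBgn0000606",
    )
)
print(markup)
```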

Data licences

In our ongoing effort to make the InterMine system more FAIR, we have started working on improving the accessibility of data licences, retaining licence information supplied by the data sources integrated in InterMine, and making it available to humans via our web application and machines via queries.

See our previous blog post on data licences.

For details on how to add data licences to your InterMine, see the docs.

Future FAIR plans

  1. Provide an RDF representation of stored data, lists and query results, plus a bulk download of all InterMine data in RDF form, so that users can import InterMine resources into their local triplestore.
  2. Provide an infrastructure for a SPARQL endpoint where the user can perform federated queries over multiple data sets.

Upcoming Releases

The next InterMine version will likely be ready in the Fall/Winter and include some user interface updates.

Docs

To update your mine with these new changes, see upgrade instructions. This is a non-disruptive release.

See release notes for detailed information.

InterMine 3.1.2 – patch release

We’ve released a small batch of bug fixes and small features. Thank you so much to our contributors: Sam Hokin, Arunan Sugunakumar and Joe Carlson!

Features

  • Templates can be tagged by any user, not just the super user. (Via webservice only – for now)

Fixes

  • When searching our docs, sometimes the “.html” extension was dropped. This was fixed by our beautiful documentation hosts – readthedocs.org
  • Installing the “bio” project via Gradle does not fail if you do not have the test properties file.
  • Gradle logs error fixed
  • Removed old GAF 1.0 code
  • Fixed XML library issue:  java.lang.ClassCastException for org.apache.xerces
  • Set converter.class correctly
  • Updated the protein atlas expression graph
  • Handle NULL values returned by NCBI web services
  • Updated Solr to support new Solr versions
  • Removed unneeded Gretty plugin
  • Better error handling for CHEBI web services
  • Handle publication abstracts that are longer than the Postgres index limit
  • Removed phenotype key, it’s not in the core model and has conflicting key
  • Updated ObjectStoreSummary to handle ignored fields consistently.

Upcoming Releases

InterMine 4.0 is scheduled for release the week of 7 May 2019.

Docs

To update your mine with these new changes, see upgrade instructions. This is a non-disruptive release.

See release notes for detailed information.

Data integration and Machine Learning for drug target validation

Hi!

In this blog post I would like to give a brief overview of what I’m currently working on.

Knowledge Transfer Partnership: what & why?

First, in order to give context to this post, last year InterMine at University of Cambridge and STORM Therapeutics, a spin-out of University of Cambridge working on small modulating RNA enzymes for the treatment of cancer, were awarded a Knowledge Transfer Partnership (KTP) from the UK Government (read this post for more information). With this award, the objective is to help STORM Therapeutics advance their efforts in cancer research, and contribute to their ultimate goal of drug target validation.

As part of the KTP Award, a KTP Associate needs to be appointed by both the knowledge base (University of Cambridge) and the company (STORM). The KTP Associate acts as the KTP Project Manager and is in charge of the successful delivery of the project. For this project, I was appointed as the KTP Associate, with a Research Software Engineer / Research Associate role at the University of Cambridge, for the total duration of the project: 3 years.

Machine learning and a new mine: StormMine

Now that you know what the KTP project is about, and who is delivering it, let’s move on to more interesting matters. In order to successfully deliver this project, the idea is to use the InterMine data warehouse to build a knowledge base for the company, STORM, that gives their scientists all the relevant data for their research in a single, integrated place. For this reason, several new data sources will be integrated into STORM’s deployment of the InterMine data warehouse (StormMine, from now on), and appropriate data visualizations will be added.

Then, once the data is integrated, we can think towards analysing the data to gather insights that may help the company goals, such as applying statistical and Machine Learning methods to gather information from the data, as well as building computational intelligence models. This leads the way towards what I’ve been working on since my start in February, and will continue until July 2019.

In general terms, I’m currently focused on building Machine Learning models that are able to learn how to differentiate between known drug targets and non-targets from available biological data. This part of the work is going to be used as my Master’s Thesis, which I will hopefully deliver in July! Moreover, with this analysis, we will be able to answer three extremely relevant questions for STORM, which are the questions leading the current work on the project. These questions are:

  1. Which are the most promising target genes for a cancer type?
  2. Which features are most informative in predicting novel targets?
  3. Given a gene, for which cancer types is it most relevant?

If you are interested in learning more about this work, stay tuned for the next posts, and don’t hesitate to contact me, either by email (ar989@cam.ac.uk) or by connecting with me on LinkedIn (click here)!

Persistent identifiers (URI) and navigable URLs in InterMine

Local unique identifiers (LUIs)

A LUI (Local Unique Identifier) is an identifier guaranteed to be unique in a given local context (e.g. a single data collection) [Ref. https://doi.org/10.1371/journal.pbio.2001414]. InterMine’s existing local identifiers are based on an internal database ID; they are unique but they are not preserved across database releases. For example, the ID 1007854, which currently identifies the gene zen in FlyMine, is not persistent; after the next build the link http://www.flymine.org/flymine/report.do?id=1007854 will no longer be valid.

In InterMine, we have implemented new persistent local unique identifiers which are preserved across releases; they are based on the class types, defined in the InterMine core model, and the external IDs from the main data source provider integrated.

Some examples are:
protein:P31946 (protein identifier)
publication:8829651 (PubMed identifier)
gene:MGI:1924206 (gene identifier)
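
As a small illustration of the format, the sketch below builds and parses identifiers of this shape. The helper names are hypothetical and this is not InterMine’s own implementation; the one detail worth noting is that only the first colon separates the class from the external ID, since the external ID may itself contain colons (as in MGI:1924206).

```python
from typing import Tuple


def make_lui(class_name: str, external_id: str) -> str:
    """Join a lower-cased class name and an external ID into a LUI."""
    return f"{class_name.lower()}:{external_id}"


def parse_lui(lui: str) -> Tuple[str, str]:
    """Split a LUI into (class name, external ID) at the first colon only."""
    class_name, _, external_id = lui.partition(":")
    if not class_name or not external_id:
        raise ValueError(f"not a valid LUI: {lui!r}")
    return class_name, external_id


print(make_lui("Protein", "P31946"))
print(parse_lui("gene:MGI:1924206"))
```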

Persistent URIs

A URI (Uniform Resource Identifier) is an identifier which is unique on the web, not only within the local context as the LUI is, and actionable, so if you paste it into the browser address bar you are redirected to the source. URIs need to be persistent in order to provide reliable sources that are always findable and accessible.

Some examples of persistent URIs are:
http://purl.uniprot.org/uniprot/P05455 where P05455 is the LUI for UniProt
http://identifiers.org/biosample/SAMEA104559033 where SAMEA104559033 is the LUI for biosample.

Where are Permanent URIs going to be used by InterMine?
1. To mark up the web pages for search engines with Bioschemas.org or Schema.org types: set the identifier attribute with the persistent URI in the DataCatalog, Dataset and BioChemEntity types.
2. To generate RDF: we need a persistent URI to set the subject in the triples generated.

We need to generate persistent URIs only if we create new entities. If a mine instance DOES NOT create new entities, it needs to re-use the existing URIs provided by the main source provider.
In FlyMine, for example, the RDF generated for the protein P05455 integrated from UniProt, which is the main resource provider for that data type, should be:

<http://purl.uniprot.org/uniprot/P05455> rdf:type <http://semanticscience.org/resource/SIO_010043> .
<http://purl.uniprot.org/uniprot/P05455> rdfs:label "Protein P05455" .
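
A minimal sketch of how such triples could be produced, re-using the provider’s URI as the subject, might look like the following. It writes the predicates as full IRIs (N-Triples style) rather than with the rdf: and rdfs: prefixes, and the function name is hypothetical; it is not how InterMine generates RDF.

```python
from typing import List

RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
RDFS_LABEL = "http://www.w3.org/2000/01/rdf-schema#label"
SIO_PROTEIN = "http://semanticscience.org/resource/SIO_010043"


def protein_triples(
    accession: str, uri_base: str = "http://purl.uniprot.org/uniprot/"
) -> List[str]:
    """Return triples whose subject is the existing UniProt URI."""
    subject = f"<{uri_base}{accession}>"
    return [
        f"{subject} <{RDF_TYPE}> <{SIO_PROTEIN}> .",
        f'{subject} <{RDFS_LABEL}> "Protein {accession}" .',
    ]


for line in protein_triples("P05455"):
    print(line)
```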

But how to generate persistent URIs?

There are different options to generate persistent URIs, and the mine administrator will choose the option which is more suitable to the mine instance.

Option 1: Generate Persistent URIs using third party resolvers

In order to provide permanent URIs, we can configure the mine instance to use Identifiers.org as the PURL (permanent URI) provider. These are the steps to follow:

1. register the mine instance in Identifiers.org as data collection

Namespace/prefix: LegumeMine
URI (assigned by Identifiers.org): http://identifiers.org/legumemine
Primary Resource: https://mines.legumeinfo.org/legumemine

2. set, in the mine instance, the property identifier.uri.base with the URI assigned by Identifiers.org (e.g. http://identifiers.org/legumemine).

The URI generated by LegumeMine for the entity GeneticMarker with primary identifier 118M3 will be: http://identifiers.org/legumemine:geneticmarker:118M3. This is persistent, unique and actionable: if you paste it into the browser address bar, identifiers.org will redirect you to the navigable URL: https://mines.legumeinfo.org/legumemine/geneticmarker:118M3.
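
The relationship between the two addresses can be sketched in a few lines. The function names are hypothetical and the concatenation rules are inferred from the LegumeMine example only; the real logic lives in InterMine and in the identifiers.org resolver.

```python
def persistent_uri(uri_base: str, class_name: str, primary_id: str) -> str:
    """Persistent URI: the identifier.uri.base value, then class and ID."""
    return f"{uri_base.rstrip('/')}:{class_name.lower()}:{primary_id}"


def navigable_url(mine_base: str, class_name: str, primary_id: str) -> str:
    """Navigable URL: the page on the mine that the resolver redirects to."""
    return f"{mine_base.rstrip('/')}/{class_name.lower()}:{primary_id}"


print(persistent_uri("http://identifiers.org/legumemine", "GeneticMarker", "118M3"))
print(navigable_url("https://mines.legumeinfo.org/legumemine", "GeneticMarker", "118M3"))
```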

Option 2 – Generate Persistent URIs setting a redirection system

A mine administrator might prefer to implement an in-house redirection system (a couple of lines in the Apache or nginx configuration files), setting up a PURL system similar to purl.uniprot.org.

In LegumeMine, for example, the permanent URI, for the entity GeneticMarker with identifier 118M3 might be: http://purl.legumemine.org/legumemine/geneticmarker:118M3.

Option 3 – Use navigable URLs

A mine administrator might decide to use the navigable URLs (see the next section) as permanent URIs.
An example is given by ZFIN where the URI http://zfin.org/ZDB-GENE-040718-423 coincides with the navigable URL.

Navigable persistent URLs

The navigable or access URLs are the URLs of the web pages to which users are redirected.
Examples of the new navigable URLs in FlyMine:
http://flymine.org/flymine/protein:P31946
http://flymine.org/flymine/publication:781465
http://flymine.org/flymine/gene:MGI:1924206

The new navigable URLs will not change at every build! We can guarantee they will be persistent by setting up a redirection system which resolves old URLs.

Navigable URLs usage

1. Permanent link button in the current report page
2. Permanent link button in BlueGenes, InterMine’s new user interface.
3. To markup the web pages with Bioschemas.org type: url attribute will be set with the navigable URL
4. To generate RDF: the field schema:mainEntityOfPage will be set with the persistent URL. For example:

<http://identifiers.org/MGI:97490> schema:mainEntityOfPage  <https://mousemine.org/gene:MGI:97490>

The new permanent URIs and URLs are scheduled for the InterMine 4.0 release.

BlueGenes OAuth2 Authentication: Community feedback requested!

BlueGenes development is at the point where we need to store BlueGenes specific data to a database. This is an important step because it paves the way for customisation, branding, and tool configuration, and an enhanced My Data section to let users manage all of their InterMine assets.

There are a few architecture and design decisions that need to be made now, and be made correctly. In particular: OAuth2 Authentication. If you’re up to speed on how InterMine and BlueGenes authenticate then feel free to skip to the bottom.

Background

The current InterMine web application is a monolith. Users log in to the UI with a username and password and their identity gets stored in memory on the server (called the “session”). When they perform a query or upgrade a list, the JSP code sends messages to the Java layer along with the user’s identity, which is used to retrieve data from the object store and user profile.

For example, when Sally views her list page today, the workflow looks something like:

Figure 1

today.png

Everything you see in InterMine today lives somewhere layered between the JSP Web App and the Object Store.

BlueGenes works differently. It communicates with the Java layer, object store, and user profile entirely through web services known as the InterMine API. No exceptions. This cleaves the dependency between the visual tools that we develop and the lower level operations of InterMine such as handling queries.

When Sally views her list page in BlueGenes, the workflow looks more like this:

Figure 2

tomorrow.png

BlueGenes lives in the browser, not on the server. InterMine’s web services respond with raw data about her lists in JSON format and BlueGenes renders the page in the browser. This is equivalent to running Python scripts in your console to fetch your lists, resolve IDs, perform a search, etc.

Web services (InterMine or otherwise) are stateless by design. They can’t tell if requests are made by a new user or a revisiting one. In order for a web service to authorise a user the request must contain some sort of secret token as seen in Figure 2. Like any good web application, InterMine provides web services for authenticating a user and retrieving their identity token which can be used in future requests rather than a username and password.
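
To make the token idea concrete, here is a minimal sketch that prepares (but does not send) a web service request carrying an identity token. The “Authorization: Token …” header format and the /lists path are assumptions for illustration, and the token value is a placeholder; check the web service docs of your mine for the exact details.

```python
from urllib.request import Request


def authorised_request(service_root: str, path: str, token: str) -> Request:
    """Prepare a request that identifies the user with a token.

    Every request has to carry the token because the service keeps
    no memory of earlier requests.
    """
    url = f"{service_root.rstrip('/')}/{path.lstrip('/')}"
    # Assumed header format; nothing is sent over the network here.
    return Request(url, headers={"Authorization": f"Token {token}"})


req = authorised_request(
    "https://www.flymine.org/flymine/service", "/lists", "HYPOTHETICAL-TOKEN"
)
print(req.full_url)
print(req.get_header("Authorization"))
```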

BlueGenes Authentication

Now it gets a bit trickier. BlueGenes has its own small web server to provide the actual javascript application, and it requires database access to store BlueGenes specific information such as additional MyMine data, tool config, etc. It really looks more like this:

Figure 3

blugenes_server.png

A user can authenticate using InterMine’s web services via the browser, but if they want to save user-specific data to BlueGenes’s database using BlueGenes’s web services then they need to provide an identity. BlueGenes does not have access to the user profile directly, so the authentication request needs to be piped through the BlueGenes server.

Figure 4

auth.png

When Sally logs into BlueGenes she provides her username and password, which are sent to the BlueGenes server rather than the InterMine server. If BlueGenes successfully authenticates as Sally then it sends her back her InterMine API token embedded in a signed JSON Web Token (JWT). All future requests between BlueGenes and InterMine will contain her API token, and all requests to the BlueGenes server will contain the signed JWT.

It sounds a bit complicated, but this only happens when logging in and remains hidden from the user. This configuration protects BlueGenes from storing passwords and doesn’t require direct access to the user profile.
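
For readers who haven’t met JWTs before, the sketch below shows the signing and verification idea using only the Python standard library and the HS256 algorithm. The claim names and the secret are made up, and it deliberately skips things a real server needs (expiry, checking the header’s algorithm); it is not the BlueGenes implementation, and a maintained JWT library is the better choice in practice.

```python
import base64
import hashlib
import hmac
import json


def _b64url(raw: bytes) -> str:
    """Base64url-encode without padding, as JWTs require."""
    return base64.urlsafe_b64encode(raw).rstrip(b"=").decode("ascii")


def _b64url_decode(text: str) -> bytes:
    """Restore the stripped padding, then decode."""
    return base64.urlsafe_b64decode(text + "=" * (-len(text) % 4))


def sign_jwt(claims: dict, secret: bytes) -> str:
    """Return header.claims.signature, signed with HMAC-SHA256."""
    header = {"alg": "HS256", "typ": "JWT"}
    signing_input = ".".join(
        _b64url(json.dumps(part, separators=(",", ":")).encode("utf-8"))
        for part in (header, claims)
    )
    signature = hmac.new(secret, signing_input.encode("ascii"), hashlib.sha256)
    return signing_input + "." + _b64url(signature.digest())


def verify_jwt(token: str, secret: bytes) -> dict:
    """Return the claims if the signature matches, else raise ValueError."""
    parts = token.split(".")
    if len(parts) != 3:
        raise ValueError("malformed token")
    signing_input = f"{parts[0]}.{parts[1]}".encode("ascii")
    expected = hmac.new(secret, signing_input, hashlib.sha256).digest()
    if not hmac.compare_digest(expected, _b64url_decode(parts[2])):
        raise ValueError("invalid signature")
    return json.loads(_b64url_decode(parts[1]))


# Hypothetical claims: the API token travels inside the signed JWT.
token = sign_jwt({"user": "sally", "api_token": "HYPOTHETICAL"}, b"server-secret")
print(verify_jwt(token, b"server-secret"))
```

Only a server that holds the secret can produce or check the signature, which is why the shared-secret question matters as soon as more than one server is involved.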

The problem: OAuth2 Authentication

Logging into InterMine using your Google account uses the OAuth2 framework. For it to work you must configure Google’s developer console with a hardcoded URL that redirects users back to the application after they’ve authenticated. This redirection page is given a token that is exchanged by the servers for the user’s Google identity (email address and Google ID). We can do the same in BlueGenes:

  1. We put a Google Signin button in BlueGenes.
  2. Sally clicks it and is redirected to Google.
  3. Upon authentication Sally is sent back to BlueGenes with an authentication token.
  4. BlueGenes server exchanges the token for Sally’s Google ID.

So far so good. She can update her tool configurations and tags which are stored in the BlueGenes database.

Now Sally wants to save a list which is an action performed in InterMine, not BlueGenes. This requires an API token which she doesn’t yet have.

  • She can’t authenticate with InterMine using a username and password because she doesn’t have one (she’s a Google user).
  • She has no way of exchanging her Google ID with InterMine’s web services for an API token because InterMine has no way of trusting who she is. Anyone could access the end point and get a user’s API token if they knew their Google ID.
  • BlueGenes can’t fetch her API token from the user profile because it doesn’t have access (by design).

There are a few workaround solutions, but they couple BlueGenes to a single InterMine instance to varying degrees.

Solution 1: JWTs and sharing secrets

InterMine server gets a new end point that accepts a user ID and a JSON Web Token. The user’s API token is returned only if the signature on the JWT is valid.

Pain point: Both BlueGenes server and InterMine server will need matching secret keys. A third party cannot host their own BlueGenes and point it at a remote mine while supporting OAuth2 without knowing that mine’s secret key (aka access to all accounts).

InterMine admins could potentially whitelist third party instances of BlueGenes by generating secret keys for them, but this would be an active process of curation and would still give third parties full access to all Google accounts.

Solution 2: Shared database

BlueGenes accesses the user profile directly.

Pain point: This requires database access, which entirely rules out remote instances of BlueGenes.

Solution 3: Double Login

InterMine has a URL redirect for Google authentication. It accepts a URL of a BlueGenes instance and generates a link with an embedded API key.

  1. A user clicks Google Login on BlueGenes and is redirected to Google
  2. After authenticating the user is redirected back to the BlueGenes server.
  3. BlueGenes generates a JWT containing the user’s identity.
  4. A mandatory button is then shown to “Authorise My Account to use Remote Data Sources” (which means InterMine server).
  5. Clicking the button sends the user to a /service/google-auth end point on the remote mine with a return_to parameter containing the URL of BlueGenes.
  6. The return_to parameter is stored in the session and the user is sent back to Google Login where they authorise for the second time.
  7. After authenticating the user is redirected to an InterMine /service/google-auth-redirect end point.
  8. The /service/google-auth-redirect page automatically redirects the user back to the BlueGenes URL stored in the session with the API token as a parameter.

A workflow would look something like this:

solution3.png

There are quite a few steps, but steps 5+ are automatic.
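
The two redirects at the heart of this proposal (steps 5 and 8) can be sketched as URL construction. The end point and the return_to parameter come from the steps above, but they are only a proposal; the name of the query parameter carrying the API token (“token”) and the example host names are assumptions.

```python
from urllib.parse import urlencode, urlsplit


def google_auth_url(mine_service_root: str, bluegenes_url: str) -> str:
    """Step 5: send the user to the mine, remembering where to return."""
    query = urlencode({"return_to": bluegenes_url})
    return f"{mine_service_root.rstrip('/')}/google-auth?{query}"


def redirect_back_url(return_to: str, api_token: str) -> str:
    """Step 8: return to BlueGenes with the API token as a parameter."""
    query = urlencode({"token": api_token})  # parameter name is assumed
    separator = "&" if urlsplit(return_to).query else "?"
    return f"{return_to}{separator}{query}"


print(google_auth_url("https://mine.example.org/mymine/service",
                      "https://bluegenes.example.org/"))
print(redirect_back_url("https://bluegenes.example.org/", "HYPOTHETICAL-TOKEN"))
```

A real implementation would presumably also want to check return_to against a list of known BlueGenes instances before redirecting, so that tokens are not handed to arbitrary sites.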

Pain point: Users will have to authenticate twice the first time they log in to BlueGenes, but we can make this as painless as possible. Also, if an admin is running both InterMine server and BlueGenes server then they’ll need two OAuth2 projects in their Google developer console (also a one-time activity).

Solution 4: Outsource

We use a third party single sign-on vendor such as https://auth0.com/

Pain point: We can’t guarantee that InterMine admins will remain within the Terms of Service for their free offering to open source projects. Otherwise it’s very expensive.

Solution 3 seems to be the most feasible and keeps InterMine and BlueGenes completely decoupled. (Thanks, Yo!)

Does anyone feel strongly about a particular solution, or have other advice for bridging the OAuth2 gap? Feel free to leave a comment or join in the discussion on our mailing list (mailing list subscription link is here: https://lists.intermine.org/mailman/listinfo/dev)

InterMineR package

InterMine data can be accessed via command line programs like cURL and client libraries for five programming languages (Java, JavaScript, Perl, Python and Ruby). Aiming to expand the functionality of the InterMine framework, an R package, InterMineR, had been started that provided basic access to InterMine instances through the R programming environment. (You could run template queries, but not much else!)

However, in order to fully utilize the statistical and graphical capabilities of the R language and make the InterMine framework available to an even greater number of life scientists, the goals were set to:

  1. Further develop and publish the InterMineR package to Bioconductor, a widely used, open source software project based in R, which aims to facilitate the integrative analysis of biological data derived from high-throughput assays.
  2. Add visualisation capabilities, e.g. “What features are close to my feature of interest?”
  3. Add enrichment analysis in InterMineR, a feature that will provide R users with access to the InterMine enrichment analysis widgets and can be effectively combined with the graphical capabilities of R libraries.

InterMineR performs a call to the InterMine Registry to retrieve up-to-date information about the available Mines. The information retrieved is then used to connect the Mines with the R environment using the InterMine web services.

Queries

The InterMineR package can be used to perform complicated queries on a Mine. The process is facilitated by the retrieval of the data model and the ready-to-use template queries of the respective Mine. The R functions setConstraints and setQuery have been created, along with the formal class InterMineR, to create new queries or modify existing ones, store them as InterMineR-class objects and apply them to the Mine with the runQuery method.

Genomic Coordinates

Figure 1: Gene visualisation done via InterMineR and Gviz

InterMineR can retrieve genomic coordinates and gene expression analysis data, which can be converted to GRanges and RangedSummarizedExperiment objects with the R functions convertToGRanges and convertToRangedSummarizedExperiment respectively. This way an interaction layer between InterMineR and other Bioconductor packages (e.g. GenomicRanges and SummarizedExperiment) is established, allowing for rapid analysis of the retrieved InterMine data.

Enrichment + GeneAnswers

InterMineR also retrieves InterMine enrichment widgets and facilitates the enrichment analysis on an InterMine instance using the R functions getWidgets and doEnrichment, respectively. With the usage of the R function convertToGeneAnswers the results of the enrichment analysis are converted to a GeneAnswers-class object, therefore allowing the visualization of:

  • Pie charts
  • Bar plots
  • Concept-gene networks
  • Annotation category (e.g. GO terms, KEGG pathways) – interaction networks
  • Gene interaction networks

by using R functions from the GeneAnswers R package.

Figure 2: GeneAnswers GO structure network, generated via InterMineR

Figure 3: GeneAnswers gene network generated using InterMineR

Final steps: Bioconductor & Vignettes

The updated InterMineR package complies with the instructions for submitting new packages to Bioconductor, has passed all automated checks (R CMD build, check and BiocCheck) and is currently undergoing manual review for Bioconductor submission.

Documentation of each function along with examples of its usage are available in the GitHub repo and as help files upon the installation of the package. Furthermore, a detailed vignette and tutorials concerning the new functionality of InterMineR package are currently available at the intermine/InterMineR/vignettes folder of the GitHub dev branch, and will be shortly available on the GitHub master branch as well.

This project is part of Google Summer of Code, still under development by me, Konstantinos Kyritsis, PhD student at the Aristotle University of Thessaloniki, under the mentoring of Julie Sullivan and Rachel Lyne. The GitHub repository of the InterMineR package can be found at https://github.com/intermine/InterMineR.

Commits made by Konstantinos can be found here: https://github.com/intermine/InterMineR/commits/master?author=kostaskyritsis

Google Analytics in BlueGenes: what should we track?

TL;DR: We’re implementing analytics tracking in BlueGenes. We can probably track anything you like, within reason. Leave a comment [comments now closed] or email us if you have anything you’d like to see! Anything we track must adhere to our privacy policy.

Longer version:

InterMine’s JSP pages (the current, older UI) are set up with a couple of different types of tracking:

  1. Google Analytics, which currently anonymously records things like:
    1. Number of users and their locations
    2. Pages viewed
    3. With a bit of effort you can figure out what items were searched for by analysing query strings.
  2. InterMine home-brew internal analytics (to view in your own mine, log in as the super user and select the “usage” tab.) It tracks:
    1. Logins (anonymously)
    2. Keyword search terms
    3. Popular templates
    4. Count of custom queries executed
    5. List views by InterMine object type (but not list contents)
    6. Count of lists created, by type

So we have a couple of questions we’d love some feedback on, as we implement Google Analytics in BlueGenes:

  1. Do you use the current analytics? Which, or both?
  2. What would you *like* to record? Here’s a list of ideas

Things that are probably okay to track

  • Pageviews including counts and times – e.g. “17 views for /region-search on Monday the 13th at 10 pm”
  • Logins (anonymously)
  • Visitor location
  • Tools used (e.g. report page tools interacted with)
  • Popular templates
  • Mine used / switched to a different mine

Things we’re not sure about – what do you think?

  • Keyword search contents (anonymously). Pros: interesting analyses like this one. Cons: Could someone avoid InterMine out of fear someone would notice their gene is getting too much attention?
  • List contents (anon, as above).
  • What about mistyped identifier names in list upload?
  • Region search
  • Queries built in the query builder

I’m sure I’ve missed quite a few things from both lists. We’d love to hear your input and feelings, both with regards to privacy and with ideas about useful trackable events and pages. Tweet us, comment on the web services tracking GitHub issue, email the dev group, or contact us some other way: http://intermine.readthedocs.io/en/latest/about/contact-us/

InterMine Registry

At the beginning of the development of this project, there was no place from which all the up-to-date information about InterMine instances (name, URL, description, versions, organism, colors, logo) could be retrieved at once. This led to hard-coded information and inefficient processes to get these data. Motivated by these problems, the idea of the InterMine Registry was conceived. The InterMine Registry is a place where up-to-date information about all instances is stored and can be consumed by applications like BlueGenes, the iOS app, InterMineR and the friendly mine tool, or by anyone else who needs it.

The core of the InterMine Registry is its RESTful API (http://registry.intermine.org/api-docs/). Running on Node.js with MongoDB, it provides methods (endpoints) to administer the instances in the registry (add, update & delete) and to search among them. Keeping the registry up to date is critical. To achieve this, the Registry automatically updates all the instances every 24 hours. In addition, one or all instances can be manually updated using the API synchronization methods. Note that administering instances requires authentication.
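
As a rough idea of how a client application might consume the registry, the sketch below builds a search URL and pulls instance names and URLs out of a JSON response. The /service/instances path, the q parameter and the shape of the payload are assumptions made for illustration (the API docs linked above are the authority), and the sample payload is hypothetical rather than a real response.

```python
import json
from typing import Dict
from urllib.parse import urlencode

REGISTRY_ROOT = "http://registry.intermine.org/service"  # assumed service root


def search_url(query: str = "") -> str:
    """URL listing all instances, or only those matching a search term."""
    url = f"{REGISTRY_ROOT}/instances"
    if not query:
        return url
    return url + "?" + urlencode({"q": query})


def instance_urls(payload: str) -> Dict[str, str]:
    """Map instance names to URLs from an (assumed) response payload."""
    data = json.loads(payload)
    return {item["name"]: item["url"] for item in data.get("instances", [])}


# Hypothetical payload, standing in for a real registry response.
sample = '{"instances": [{"name": "FlyMine", "url": "http://www.flymine.org/flymine"}]}'
print(search_url("fly"))
print(instance_urls(sample))
```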

To complement the API, a fully responsive front-end web application is being developed (http://registry.intermine.org/), where everyone can see all the InterMine instances and search among them. Instances are presented in a list view and a grid view, which serve the same purpose but look different. There is also a world view, in which users can see the locations of InterMine instances on a world map. In addition, authenticated users can administer the instances (add, update & delete) through a nice user interface.

This project is part of Google Summer of Code, still under development by me, Leonardo Kuffó, undergraduate student at ESPOL university (Guayaquil, Ecuador), under the mentoring of Daniela Butano. The source code of the application can be found at https://github.com/intermine/intermine-registry

GraphConnect – a Neo4j conference

We were in London to attend GraphConnect, the annual conference organised by Neo4j.
It was fantastic to meet so many people from around the world who are enthusiastic about graph databases, and a lot of people who, like us, are prototyping/exploring Neo4j as a possible alternative to relational databases.

They have announced the release of Neo4j 3.2, which promises to bring a huge improvement in terms of performance; the compiled Cypher runtime has improved the speed by ~300% for a subset of basic queries and the introduction of native label indexes has also improved the write speed.

They have also added composite indexes (which InterMine uses a lot) and the use of indexes with the OR operator. We highlighted the problem months ago on Stack Overflow and we were so surprised to see it fixed. We have to update our “What we didn’t like about Neo4j” list by removing 2 items. We’re really happy about that!

It was a pleasure to attend Jesus Barrasa’s talk on debunking some RDF versus property graph alternative facts. He demoed how an RDF resource does not necessarily have to live in a triple store but can also be stored in Neo4j. Here are part1 and part2 of “Neo4j is your RDF store”, a nice post where he describes his work in more detail.

Another nice tool they have implemented is the ETL tool to import data from a relational database into Neo4j by applying some simple rules.

The solution-based talks demonstrated how Neo4j is being used to solve complex, real world problems ranging from travel recommendation engines to measuring the impact of slot machine locations on casino floors. While the topics were diverse, a common theme across their respective architectures was the use of GraphAware’s plugins, some of which are free. One plugin that looks particularly interesting is the Neo4j2Elastic tool which transparently pushes data from Neo4j to ElasticSearch.

During the conference, we discovered that there is a Neo4j Startup Program that allows startups to use the Neo4j enterprise edition for free. Not sure if we count as a startup though!

Overall, we’re super happy with the improvements Neo4j has made, and super impressed with Neo4j’s growing community. Looking forward to meeting the Neo4j team in London, at their meetup, and sharing our small experience with the community!

GSoCers Assemble! Announcing the InterMine GSOC 2017 students

Google Summer of Code is officially open as of 16:00 UTC today! This year InterMine will have five students coding over the summer, with five projects:

  • InterMineR will be getting better docs and hopefully submitted to R repos. Konstantinos Kyritsis will be working on this with the help of InterMine mentors Julie and Rachel.
  • Our Android App will get a younger sibling in the form of an iOS app, thanks to Nadia Yudina. I’ll be the primary mentor for this project.
  • We’ll finally have a proper registry of all the great InterMines out there, brought to you by Leonardo Kuffo with Daniela mentoring the project.
  • Samyadeep Basu will be looking at an ‘InterMine Similarity project’ – given a Gene (or other entity) from InterMine – are there any other interesting entities related to it in some way? Josh is the lead mentor on this project.
  • Yash Sharma will be working on creating Neo4j-InterMine API endpoints under Sam Hokin‘s mentorship.

We wish we could have accepted more of you. In total we had more than 40 students interested in GSoC 2017 with InterMine, resulting in around 30 finalised applications. Many of the applications were brilliant – far more than we could possibly have accepted. Deciding who to accept was really tough, and even if you didn’t get a place in GSoC with us you’re still entirely welcome to contribute to any of our projects if you had any ideas.

Suggestions for accepted students

Congratulations on being accepted. We’re really glad to have you on board. Please have a quick read through our GSoC guidelines to get started.

During the community bonding period, here are a few ideas for getting involved.

  • Find out more details that might pertain to your project (obviously) – investigate the API or work on bugs
  • Project management – in your project’s GitHub repo create milestones, tickets, project boards as appropriate.
  • Write an intro blog post about yourself & your planned work (to be posted here and/or a personal blog we could link to).
  • Come hang in the chat (below).

Non-GSoC InterMine community: you can play too!

We’ve created a couple of chat rooms at chat.intermine.org. We’ll be encouraging our GSoC students to hang out in the #general channel, and you’re welcome to, as well. The students are from all around the world – come make them feel at home!