GSOC 2018 – Improved InterMine Search with Solr

Currently InterMine uses the Apache Lucene (v3.0.2) library to index the data and provide a keyword-style search over all data. The goal of this project is to introduce Apache Solr in InterMine so that indexing and searching can happen even more quickly. Unlike Lucene, which is a library, Apache Solr is a separate server application, similar to a database server. We set up and configure Solr (v7.2.1) independently from the application, and use Solr clients to communicate between the application and the Solr instance.

Here, SolrJ (v7.2.1), a Java client for Solr, is used to communicate between InterMine and Solr. We also removed the Bobo facet library, which was used with Lucene, since Solr itself provides faceted search. The implementation has been designed so that InterMine is not heavily coupled with Solr. If you want to change your search engine to something else in the future, you just have to write different implementations of the interfaces defined.
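The decoupling idea can be sketched roughly as follows. This is an illustrative Python sketch, not InterMine's actual (Java) code: the names KeywordSearchEngine, InMemorySearchEngine and run_search are hypothetical, and the in-memory class only stands in for a real Solr-backed implementation.

```python
from abc import ABC, abstractmethod
from typing import Dict, List


class KeywordSearchEngine(ABC):
    """Hypothetical interface: the only thing application code depends on."""

    @abstractmethod
    def index(self, doc_id: str, text: str) -> None:
        """Add one document to the index."""

    @abstractmethod
    def search(self, term: str) -> List[str]:
        """Return the ids of documents matching the term."""


class InMemorySearchEngine(KeywordSearchEngine):
    """Toy stand-in for a Solr-backed implementation (assumption)."""

    def __init__(self) -> None:
        self._docs: Dict[str, str] = {}

    def index(self, doc_id: str, text: str) -> None:
        self._docs[doc_id] = text.lower()

    def search(self, term: str) -> List[str]:
        needle = term.lower()
        return sorted(d for d, t in self._docs.items() if needle in t)


def run_search(engine: KeywordSearchEngine, term: str) -> List[str]:
    # Application code only sees the interface, so swapping the engine
    # means writing another subclass, not changing this function.
    return engine.search(term)
```

With this shape, replacing the search engine would mean adding one more subclass of the interface while callers such as run_search stay unchanged.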

Currently the search index and the autocomplete index processes use Solr to index the data. Indexing time has improved significantly compared to previous indexing times. For example, FlyMine currently takes around 1900 seconds (32 mins) to index the data, but with Solr it takes only 1250 seconds (21 mins), which is a 34% reduction in time. Query time has also improved: a query of “*:*” in FlyMine used to take around 30-40 seconds, and with Solr it takes less than 1 second. Previously, with Lucene, the indexed data had to be retrieved from the database during the first search after starting the webapp. This took some time, but with Solr this is no longer the case and the results are returned instantly.
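For readers who want to check the arithmetic behind the FlyMine figures, here is a small sketch. The two timings come from the text above; the helper name percent_reduction is only illustrative.

```python
def percent_reduction(before_seconds, after_seconds):
    """Return the relative time saving as a percentage of the original time."""
    return (before_seconds - after_seconds) / before_seconds * 100


lucene_seconds = 1900  # FlyMine indexing time with Lucene, from the text
solr_seconds = 1250    # FlyMine indexing time with Solr, from the text

# Convert to minutes and compute the relative saving.
print(f"{lucene_seconds / 60:.0f} min -> {solr_seconds / 60:.0f} min")
print(f"reduction: {percent_reduction(lucene_seconds, solr_seconds):.0f}%")
```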

In addition to the above, two web services have been implemented. The Facet service returns only the facet counts for a particular query rather than returning all the results. The Facet List service is similar, but it returns all the facets available in a mine. It is useful when you want to know all the facets in a mine before you run an actual search.
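To give a feel for what "facet counts only" means at the Solr level, here is a minimal sketch that builds a Solr select URL asking for zero result rows plus facet counts. The parameters q, rows, facet, facet.field and wt are standard Solr query parameters; the base URL, the core name example-search and the field names are assumptions for illustration, and this is not the InterMine web service itself.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def facet_only_url(base_url, query, facet_fields):
    """Build a Solr select URL that returns facet counts but no documents."""
    params = [
        ("q", query),
        ("rows", 0),          # no result documents, only counts
        ("facet", "true"),
        ("wt", "json"),
    ]
    # One facet.field parameter per field we want counts for.
    params += [("facet.field", field) for field in facet_fields]
    return f"{base_url}/select?{urlencode(params)}"


# Hypothetical core and field names, for illustration only.
url = facet_only_url(
    "http://localhost:8983/solr/example-search", "*:*", ["Category", "organism"]
)
decoded = parse_qs(urlsplit(url).query)
print(decoded["facet.field"])
```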

All these changes are made against InterMine version 2.0. They will be included in an InterMine release in the near future, but those who want to try them immediately can check out this branch on GitHub and follow these instructions. All these changes have been tested with Apache Solr (v7.2.1).

InterMine 2.0

We are excited to announce the official release of InterMine 2.0!

InterMine 2.0 includes some model updates, a big change in how InterMine itself is built, lots of new features, like a new UI, and a long list of bug fixes. See the full list of updates here.

This release represents a large milestone for the InterMine team! Not only because we made big fundamental changes to the core InterMine data model and build system, but also because this release represents a major shift in philosophy for us. Previously InterMine was a big, monolithic, single piece of software. You downloaded the whole InterMine, you compiled the whole InterMine, you got the whole of InterMine. Instead, we are moving towards modularity and responsiveness: smaller, independent libraries that are interconnected but can be used for tools and features separately or linked together.

Smaller decoupled InterMine packages will allow us to develop more features faster with fewer errors. InterMine maintainers might then have the flexibility to include (or not) the features in their mine, plug in their own tools, etc.

Version 2.0 represents a big step towards this goal!

A New Interface

A new feature in InterMine 2.0 is the ability to run our new UI, nicknamed “Blue Genes”. This app is in addition to the current webapp and offers a new and responsive search environment for your InterMine data.

Blue Genes is a UI built in Clojure that provides a modern user experience:

  • Super fast response times
  • Interactive list upload
  • Redesigned “My account” section
  • Search autocomplete
  • Template and query builder result previews
  • .. lots more!

Once you have your InterMine updated to InterMine 2.0, there is a single command that will launch Blue Genes for your mine.

We are actively seeking feedback on Blue Genes. It’s still very much in the beta phase, so please get in touch once you have some opinions!

Special Thanks

Thanks to everyone who helped test this release! Thanks Howie Motenko at MGI for your alpha testing and model insights. And a BIG thank you goes to Sam Hokin from the NCGR who spent a lot of time and effort helping improve InterMine! Thanks Sam and Howie! You are much appreciated.

Helpful Links

What exactly we changed (blog post)

Full list of GitHub tickets included in this release

Docs on how to upgrade to 2.0

 

As always, please contact us if you have any questions or comments! We have an active Twitter account, a Discord server at chat.intermine.org, and a low-traffic mailing list.

Google Summer of Code 2018 - Wrapping up the final results for the Data Browser

 

GSoC 2018 is coming to an end, and after three months of hard work, I can proudly present a summary of all our achievements during this summer of code.

 

Summary of Project Goals

InterMine is an open source data warehouse intended for the integration and analysis of complex biological data. With InterMine, you can explore organism and other research data provided by different organizations, moving between databases using criteria such as homology.

The existing query builder in InterMine requires some experience to obtain the desired data from a mine, which can be overwhelming for new users. For instance, a user interested in searching data in HumanMine using its query builder would need to browse through the different classes and attributes, choosing between the available fields and adding constraints to each of them, in order to get the desired output.

For example, a simple query you might want to glean from InterMine might be as follows:

Query: Given the human gene symbol “GATA1”, show all homologous genes in other organisms. (More on Homologues, also spelled homologs: https://en.wikipedia.org/wiki/Homology_(biology))

Problem: This query sounds simple-ish, but building it in our query builder requires a strong familiarity with the data model, and can be confusing for anyone new to InterMine. We would like the data browser to offer more than a simple keyword search, while being less complex than the current query builder. For context, here’s the HumanMine query builder: http://www.humanmine.org/humanmine/customQuery.do. We have attached a screenshot of what it looks like for the homology query mentioned above, which shows why it can look a little intimidating.

This requires the user to have a decent knowledge of the model schema in order to build a query that returns the expected results. For new users, searching for specific information this way can indeed become overwhelming.
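To make the "knowledge of the model schema" point concrete, here is a rough sketch of the GATA1 homologue query written as InterMine PathQuery-style XML, built with Python's standard library. The exact paths (Gene.homologues.homologue.symbol and so on) and the function name homologue_query_xml are assumptions for illustration; check the model of the mine you are querying before relying on them.

```python
import xml.etree.ElementTree as ET


def homologue_query_xml(gene_symbol):
    """Build a PathQuery-style XML string for homologues of one human gene."""
    # Output columns: every path must match the mine's data model
    # (these paths are assumptions, not verified against HumanMine).
    view = [
        "Gene.symbol",
        "Gene.homologues.homologue.symbol",
        "Gene.homologues.homologue.organism.name",
    ]
    query = ET.Element("query", {"model": "genomic", "view": " ".join(view)})
    # Constraint 1: the gene we start from.
    ET.SubElement(
        query, "constraint",
        {"path": "Gene.symbol", "op": "=", "value": gene_symbol},
    )
    # Constraint 2: restrict the starting gene to human.
    ET.SubElement(
        query, "constraint",
        {"path": "Gene.organism.name", "op": "=", "value": "Homo sapiens"},
    )
    return ET.tostring(query, encoding="unicode")


xml_text = homologue_query_xml("GATA1")
print(xml_text)
```

Even in this small sketch, the user has to know that homologues hang off Gene through a nested reference, which is exactly the kind of model knowledge the data browser tries to spare newcomers.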

For this reason, the goal of this project is to implement a faceted search tool to display the data from an InterMine database, allowing users to search easily within the different mines available across InterMine, without requiring extensive knowledge of the data model.

 

Summary of Project Achievements

In order to maintain a good workflow, the project was divided into three major versions or milestones, with the deadline of each one coinciding with a GSoC evaluation phase. The main developments in each milestone are listed below, and comprise a total of 67 closed issues and 194 commits.

Milestone 1 (June 11). In the first milestone (related GitHub issue here), the following features were added:

  1. Initial environment setup (#1)
  2. Server routes to query for ‘counts’ of data (#2)
  3. Use InterMine im-tables for dynamically loading the data on the views (#3)
  4. Unit testing for the server-side routes (#4)
  5. Functional general statistics of the data with graphics and plots (#6)
  6. Searching data on HumanMine (basic ontology concepts) + Unit testing (#8, #9)
  7. Further User Interface improvements (#22, #23, #24, #25, #26, #27, #28)
  8. Code documentation and optimization (#7)

Milestone 2 (July 9). In the second milestone (related GitHub issue here), the following features were added:

  1. Query builder with automatically filled selectable fields + Unit testing (#11)
  2. Filtering options when searching for data + Unit testing (#13)
  3. Improvements to adapt to small devices (#29)
  4. Handling SSL errors (#30, #31)
  5. User interface improvements (#32, #34, #35, #36, #38)
  6. Show counts beside typeahead filters (#39)
  7. Allow users to add and remove filters (only one) (#40)
  8. Make the Dataset filter a multiple-checkbox filter (#42)
  9. Adding formal documentation using documentation.js (#43)
  10. Interactions filter (#44)
  11. Chromosome Locations filter
  12. Code documentation and optimization

Milestone 3 (August 6). In the third milestone (related GitHub issue here), the following features were added:

  1. Finish main dashboard interface (#5)
  2. Use InterMine color palette (#10)
  3. Enable save as list functionality (#16, #15)
  4. Add a ‘switch’ functionality to change between mines (#21)
  5. Option to show multiple filters for the same filter type at once (#41)
  6. ClinVar filter (#52)
  7. OMIM Disease filter (#53)
  8. Expression: Illumina bodymap filter (#55)
  9. Protein localisation: Protein Atlas filter (#56)
  10. JSON config file for mines to handle extra filters (#57)
  11. Show only default filters for mines not defined in JSON config and available in the registry (#58)
  12. Add running instructions (#61)
  13. Deep link to specific mine (#62)
  14. Remember which mine you were looking at last time (#63)
  15. User interface improvements (#66, #68)
  16. Handle an InterMine being down (#67)
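The per-mine JSON config with a default fallback (items 10 and 11 above) could work along these lines. This is only a sketch under assumptions: the key name extraFilters, the default filter names and the function filters_for are hypothetical and do not reflect the browser's real config format.

```python
import json

# Hypothetical defaults shown for every mine in the registry.
DEFAULT_FILTERS = ["Organism", "Dataset"]

# Hypothetical config: only mines listed here get extra filters.
CONFIG_JSON = """
{
  "HumanMine": {"extraFilters": ["ClinVar", "OMIM Disease", "Protein Atlas"]}
}
"""


def filters_for(mine_name, config):
    """Return default filters plus any extras configured for this mine."""
    extra = config.get(mine_name, {}).get("extraFilters", [])
    return DEFAULT_FILTERS + extra


config = json.loads(CONFIG_JSON)
print(filters_for("HumanMine", config))
print(filters_for("FlyMine", config))  # not in the config: defaults only
```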

 

Brief Overview of the Final Product

In the following figure, the different elements available on the browser interface are displayed and further explained.

As depicted above, the user can switch between the different mines available in the InterMine registry by using the dropdown box in (1). Next, in (2) the viewed class can be changed; currently it can be either Genes or Proteins. In (3), the filters available in the currently explored mine are displayed, letting the user narrow the data shown in the table at (6) to what best meets their requirements. Some plots of the data in the table are displayed in (4). Currently this shows a pie chart of individuals per organism, but it will be extended with more plots in the future. Finally, the user has the option to save the table as a list, generate the code to embed it elsewhere, or export the results by using the options in (5).

 

Future Work and Plans

Some features are already planned for the browser after GSoC. One is to allow users to add their personal InterMine API tokens for each mine and use them for the Save as List functionality of the table (link). Another useful feature, which I wasn’t able to implement because of a temporary disabling in the InterMine ontology, is a Phenotypes filter (link). Next, a new histogram plot of gene length will be added to the top section (link). Furthermore, a “current class” filter will be added to the sidebar (link). Finally, another desirable feature would be to refactor the per-mine filters to use path query (link).

 

In conclusion, the fact that the final product has been tested and is going to be truly helpful for its target community of users is enough for me to be proud of the tool developed during this Google Summer of Code. The results of this project will also, hopefully, allow us to publish a paper describing the new InterMine browser.

FlyMine 46.0 Released!

FlyMine has been updated to the latest version of FlyBase. All other data sets have also been updated to the newest versions and we have fixed a few bugs. See the data sources page for a full list of data and their versions. All data can be accessed through our comprehensive library of template searches or by building your own queries using the query builder.

 

Data model changes!

Our data model has changed slightly to make querying easier between mines.

  • Protein molecular weight is a float instead of an integer
  • We’ve added URLs for GO term evidence codes
  • Sequence Ontology (the basis for the InterMine data model) has been updated, so lots of new data types have been added.

See our previous blog post for complete details of updates.

No more Anopheles

After listening to community feedback, we have decided to stop loading Anopheles data into FlyMine. However, as always, if there is a specific data set you are interested in, please contact us!

 

We have docs and videos, and for a full list of data sources available in FlyMine see the data sources list.

However, please do not hesitate to contact us should you require any further assistance. For all types of help and feedback email info@intermine.org.

HumanMine 5.0 released!

HumanMine has been updated to the latest version of NCBI Entrez Gene. All other data sets have also been updated to the newest versions and we have fixed a few bugs. See the data sources page for a full list of data and their versions. All data can be accessed through our comprehensive library of template searches or by building your own queries using the query builder.

 

New Data Source: GTEx

We added a new expression data set, GTEx. Here’s an example search:

Gene –> Tissue Expression

Tissue –> Gene expression

Data model changes!

Our data model has changed slightly to make querying easier between mines.

  • Protein molecular weight is a float instead of an integer
  • We’ve added URLs for GO term evidence codes
  • Sequence Ontology (the basis for the InterMine data model) has been updated, so lots of new data types have been added.

See our previous blog post for complete details of updates.

 

We have docs and videos, and for a full list of data sources available in HumanMine see the data sources list.

However, please do not hesitate to contact us should you require any further assistance. For all types of help and feedback email info@intermine.org.

Coming up soon: InterMine 2.0 release webinar, community calls, and GSoC presentations

What’s coming up soon in InterMineLand? Here are a few of the highlights:

Upgrading to 2.0 – Thursday 2nd August

With the release of InterMine 2.0 RC1, we’ll be dedicating the InterMine Developer call to an InterMine 2.0 Upgrade Webinar, spending around 20 minutes discussing how one upgrades an InterMine 1.x installation to use the newer (and much more easygoing) Gradle dependency management system. Q&A afterwards so you can learn everything you’ve been burning to know. [Call in information]

This call will be recorded so anyone who couldn’t make it can catch up.

GSoC Student project presentations – Thursday 16th August

Six students, six awesome projects. Our students have been blogging prolifically while working over the last three months, and they’ll be presenting their work on the developer call, with five-minute slots per student + time for Q&A afterwards. [Agenda here]

This call will be recorded so anyone who couldn’t make it can catch up.

Community Outreach Call – 6th September 

Once a quarter we host non-techie calls where we focus on interesting things the community has been doing as well as community engagement in general. This time we’ll be featuring Kevin Macpherson, who runs some fantastic community outreach at SGD, including amazing webinar video use-cases.  [Agenda, still work in progress]

Previous featured speakers include Jacqueline Campbell talking about her approach to community engagement, Wayne Decateur demonstrating InterMine code in Jupyter notebooks, and Abby Cabunoc Mayes, Mozilla’s Working Open Practice Lead.

We’re still looking for speakers for this call and the next one, in December. If you have a topic you’d like to share about InterMine, open science/source, or bioinformatics in general, ping yo@intermine.org to pitch the idea.

InterMine at #GCCBOSC Portland – 7 days of fun, sun, and code…

BOSC (the Bioinformatics Open Source Conference) is normally part of ISMB (Intelligent Systems for Molecular Biology), but for the first time this year, it teamed up with The Galaxy Community Conference (GCC) instead. For us, this presented an exciting opportunity – like a regular BOSC but with the added bonus of training days and the chance to interact with Galaxy contributors during the CollaborationFest hackathon (and the rest of the conference too).

Our agenda at the conference ended up being quite full:

Handling integrated biological data using Python (or R) and InterMine

We delivered a training session on the 26th of June: Handling integrated biological data using Python (or R) and InterMine. Leyla Ruzicka from ZFIN was kind enough to travel up from Eugene to Portland, to help us deliver the UI portion of the training. Once we’d familiarised users with how InterMine worked a little bit, Daniela introduced the API side of things, and then we spent the remainder of the session working through a series of exercises in Jupyter notebooks, live-coding on a projector so others could learn about our code and follow along themselves.

While we did recommend to people that they try to install the InterMine Python client, we also managed to work around the issue for anyone who didn’t have things installed, thanks to binder. You can still see the tutorial exercise notebooks and work through them, and we have the same set of notebooks with answers if you get stuck or need a hint. This was the first time we worked through the exercises interactively onscreen this way, but it seemed to work well! I’m hopeful we can continue providing the API portion of our tutorial this way in the future.

We had planned to do an R section, but actually ran out of time, as the tutorial was about two and a half hours in total. If an R tutorial is something of interest in the future though, please do let us know! You can do this via comments on this article, on Twitter, by popping by chat.intermine.org, or by emailing us at info – at – intermine – dot – org.

InterMine 2.0: More than fifteen years of open biological data integration

[Slides link] We were very pleased to have a talk accepted as well as the training, giving us a chance to introduce InterMine to others and talk about its history. While I was talking I mentioned that we were ranking at just under 300 stars on our main GitHub repo, and the audience kindly helped bump it up and over 300!

One of the things I focused on during the talk was a massive thanks for all of the work our broader community does to help InterMine become and remain a great resource. Afterwards, Lorena Pantano raised the question: how do you get others to adopt your work and contribute to it?

Personally, I’ve been working at InterMine for three years now, so I certainly can’t attest to the entirety of the history. Much of this is doubtless down to the team’s great work and Gos’s great vision (and grant writing!), but I also think one of the most important parts is making it easy for others to use your work: good developer docs, tickets that explain issues clearly, help documentation for end-users, etc. I’d love to hear more thoughts about this in the comments!

Birds of a Feather sessions

Daniela and Yo both ran separate Birds of a Feather unconference-style sessions over lunch. Yo’s BoF focused on getting (and keeping) more open source contributors – Nicole Vasilevsky was kind enough to keep notes for this session. Thanks, Nicole!

Meanwhile, Daniela shared the InterMine approach to implementing stable and persistent URIs and the possible related issues, inspired by other data integrators and the lessons learnt in the Identifiers for the 21st century paper. Some attendees also contributed their own solutions.

Hackathon

Group meeting session at CoFest. Try to spot Daniela! 😉

During the CollaborationFest hackathon, Daniela and Yo were able to complete (yeahhhh!!) the integration between Galaxy and InterMine, thanks to the invaluable help of Daniel Blankenberg.
In the next Galaxy release, the new InterMine plugin will be available. It will allow users to import data from InterMine into Galaxy and to export lists of identifiers (e.g. proteins, genes) from Galaxy into InterMine by selecting the mine instance from the InterMine registry. Watch this space: we’ll hopefully arrange to get some details on the Galaxy training network to explain how to run the data imports in each direction.

All GCCBOSC photographs in this post are from Berenice Batut’s Flickr album, under a CC-BY-SA licence