Google Summer of Code 2018 - Wrapping up the final results for the Data Browser

 

GSoC 2018 is coming to an end, and after three months of hard work I can proudly present a summary of everything we achieved during this Summer of Code.

 

Summary of Project Goals

InterMine is an open source data warehouse intended for the integration and analysis of complex biological data. With InterMine, you can explore organism and other research data provided by different organizations, moving between databases using criteria such as homology.

The existing query builder in InterMine requires some experience to obtain the desired data from a mine, which can be overwhelming for new users. For instance, a user who wants to search HumanMine with the query builder needs to browse through the different classes and attributes, choose among the available fields and add constraints to each of them in order to get the desired output.

For example, a simple question you might want to ask InterMine is the following:

Query: Given the human gene symbol “GATA1”, show all homologous genes in other organisms. (More on homologues, also spelled homologs: https://en.wikipedia.org/wiki/Homology_(biology))

Problem: This query sounds fairly simple, but building it in our query builder requires strong familiarity with the data model and can be confusing for anyone new to InterMine. We would like the data browser to be more powerful than a simple keyword search, but less complex than the current query builder. For context, here’s the HumanMine query builder: http://www.humanmine.org/humanmine/customQuery.do. We have attached a screenshot of what it looks like for the homology query mentioned above, which shows why it can look a little intimidating.

Building the query correctly requires a decent knowledge of the model schema. For new users, this workflow can indeed make searching for specific information in the data overwhelming.
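To make the difficulty more concrete, here is a rough sketch of the GATA1 homology query written as a path-based structure, similar in spirit to what a user assembles by hand in the query builder. The attribute paths (such as `homologues.homologue.symbol`) are assumptions about the HumanMine data model, and `build_homologue_query` and `full_paths` are hypothetical helpers written for this post, not part of any InterMine library.

```python
# Hypothetical sketch of the GATA1 homologue query as a path-based structure.
# The class and attribute paths are assumptions about the HumanMine model.

def build_homologue_query(symbol):
    """Return a dict describing a query rooted at the Gene class."""
    return {
        "from": "Gene",
        # Columns to show: the gene itself, then each homologue.
        "select": [
            "symbol",
            "organism.name",
            "homologues.homologue.symbol",
            "homologues.homologue.organism.name",
        ],
        # Constraints: the gene symbol, restricted to human genes.
        "where": [
            {"path": "symbol", "op": "=", "value": symbol},
            {"path": "organism.name", "op": "=", "value": "Homo sapiens"},
        ],
    }


def full_paths(query):
    """Prefix every selected path with the root class name."""
    return [f"{query['from']}.{path}" for path in query["select"]]


query = build_homologue_query("GATA1")
for path in full_paths(query):
    print(path)
```

Even in this small form, the user has to know that homologues are reached through a nested path from Gene, which is exactly the kind of model knowledge the data browser tries to hide.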

For this reason, the goal of this project is to implement a faceted search tool that displays the data from an InterMine database and lets users search easily across the different mines available, without requiring extensive knowledge of the data model.

 

Summary of Project Achievements

To maintain a good workflow, the project was divided into three major versions, or milestones, with each deadline coinciding with a GSoC evaluation phase. The main developments in each milestone are listed below; together they comprise 67 closed issues and 194 commits.

Milestone 1 (June 11). In the first milestone (related GitHub issue here), the following features were added:

  1. Initial environment setup (#1)
  2. Server routes to query for ‘counts’ of data (#2)
  3. Use InterMine im-tables for dynamically loading the data on the views (#3)
  4. Unit testing for the server-side routes (#4)
  5. Functional general statistics of the data with graphics and plots (#6)
  6. Searching data on HumanMine (basic ontology concepts) + Unit testing (#8, #9)
  7. Further User Interface improvements (#22, #23, #24, #25, #26, #27, #28)
  8. Code documentation and optimization (#7)

Milestone 2 (July 9). In the second milestone (related GitHub issue here), the following features were added:

  1. Query builder with automatically filled selectable fields + Unit testing (#11)
  2. Filtering options when searching for data + Unit testing (#13)
  3. Improvements to adapt to small devices (#29)
  4. Handling SSL errors (#30, #31)
  5. User interface improvements (#32, #34, #35, #36, #38)
  6. Show counts beside typeahead filters (#39)
  7. Allow users to add and remove filters (only one) (#40)
  8. Make the Dataset filter a multiple-checkbox filter (#42)
  9. Add formal documentation using documentation.js (#43)
  10. Interactions filter (#44)
  11. Chromosome Locations filter
  12. Code documentation and optimization

Milestone 3 (August 6). In the third milestone (related GitHub issue here), the following features were added:

  1. Finish main dashboard interface (#5)
  2. Use InterMine color palette (#10)
  3. Enable save as list functionality (#16, #15)
  4. Add a ‘switch’ functionality to change between mines (#21)
  5. Option to show multiple filters for the same filter type at once (#41)
  6. ClinVar filter (#52)
  7. OMIM Disease filter (#53)
  8. Expression: Illumina bodymap filter (#55)
  9. Protein localisation: Protein Atlas filter (#56)
  10. JSON config file for mines to handle extra filters (#57)
  11. Show only default filters for mines not defined in JSON config and available in the registry (#58)
  12. Add running instructions (#61)
  13. Deep link to specific mine (#62)
  14. Remember which mine you were looking at last time (#63)
  15. User interface improvements (#66, #68)
  16. Handle an InterMine being down (#67)
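
The per-mine configuration from items 10 and 11 above can be pictured roughly as follows. This is only a sketch under stated assumptions: the JSON layout, the `extraFilters` key, the default filter names and the `filters_for_mine` helper are all hypothetical and do not reflect the actual config file in the repository. The idea it shows is that a mine listed in the config gets its extra filters on top of the defaults, while any other mine in the registry gets the defaults only.

```python
import json

# Hypothetical config: the schema and key names are illustrative assumptions.
MINE_CONFIG_JSON = """
{
  "humanmine": {"extraFilters": ["ClinVar", "OMIM Disease"]}
}
"""

# Assumed set of filters that every mine is expected to support.
DEFAULT_FILTERS = ["Organism", "Dataset"]


def filters_for_mine(mine_name, config, defaults=DEFAULT_FILTERS):
    """Return default filters plus any extras configured for this mine."""
    extra = config.get(mine_name, {}).get("extraFilters", [])
    return list(defaults) + list(extra)


config = json.loads(MINE_CONFIG_JSON)
print(filters_for_mine("humanmine", config))  # defaults plus extras
print(filters_for_mine("flymine", config))    # not in config: defaults only
```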

 

Brief Overview of the Final Product

In the following figure, the different elements available on the browser interface are displayed and further explained.

As depicted above, the user can switch between the mines available in the InterMine registry using the dropdown box in (1). In (2), the viewed class can be changed; currently it can be either Genes or Proteins. In (3), the filters available for the currently explored mine are displayed, and the user can apply them to narrow the data shown in the table at (6) to what best fits their needs. Some plots about the data in the table are displayed in (4). Currently this is a pie chart of the number of individuals per organism, but more plots will be added in the future. Finally, the options in (5) let the user save the table as a list, generate the code to embed it elsewhere, or export the results.
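
As a rough illustration of how the sidebar filters, the table and the pie chart relate, the sketch below applies faceted filters to a handful of made-up rows and counts the remaining rows per organism. It assumes the usual faceted-search convention (values within one filter are alternatives, different filters must all match), which may differ from what the browser actually does; the rows, field names and helper functions are hypothetical, and the real tool queries the mine rather than filtering in memory.

```python
from collections import Counter

# Made-up rows for illustration only; these are not real query results.
ROWS = [
    {"symbol": "gene-1", "organism": "Homo sapiens", "dataset": "dataset-A"},
    {"symbol": "gene-2", "organism": "Mus musculus", "dataset": "dataset-A"},
    {"symbol": "gene-3", "organism": "Homo sapiens", "dataset": "dataset-B"},
    {"symbol": "gene-4", "organism": "Drosophila melanogaster", "dataset": "dataset-B"},
]


def apply_filters(rows, selected):
    """Keep rows that match every active facet.

    `selected` maps a field name to a set of accepted values; a facet
    with no selected values is treated as inactive.
    """
    return [
        row for row in rows
        if all(row[facet] in values
               for facet, values in selected.items() if values)
    ]


def organism_counts(rows):
    """Count rows per organism, i.e. the numbers behind the pie chart."""
    return Counter(row["organism"] for row in rows)


selected = {
    "organism": {"Homo sapiens", "Mus musculus"},
    "dataset": {"dataset-A"},
}
visible = apply_filters(ROWS, selected)
print([row["symbol"] for row in visible])
print(dict(organism_counts(visible)))
```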

 

More Screenshots of the Final Product

 

 

Related Blog Posts (in chronological order)

 

Links to Final Tool Code and Deployments

 

Future Work and Plans

Some features are already planned for the browser after GSoC. One is to let users add their personal InterMine API tokens for each mine and use them for the Save as List functionality of the table (link). Another useful feature, which I wasn’t able to implement because of a temporary disabling in the InterMine ontology, is a Phenotypes filter (link). A new histogram of gene length will be added to the top section (link), and a “current class” filter will be added to the sidebar (link). Finally, it would be desirable to refactor the per-mine filters to use path queries (link).

 

In conclusion, the fact that the final product has been tested and is going to be truly helpful for its target community of users is enough for me to be proud of the tool developed during this Google Summer of Code. The results of this project will also, hopefully, allow us to publish a paper describing the new InterMine browser.
