Toxygates: exposing toxicogenomics datasets and linking with InterMine

This is a guest post from our colleague Johan Nyström-Persson, who works with Toxygates and the NIBIOHN in Japan.

Toxygates (http://toxygates.nibiohn.go.jp) has been developed as a user-friendly toxicogenomics analysis platform at the Mizuguchi Lab, National Institutes of Biomedical Innovation, Health and Nutrition (NIBIOHN) in Osaka since 2012. The first public release was in 2013. At that time, the main focus of Toxygates was exposing the Open TG-GATEs dataset, a large, systematically organised toxicogenomics dataset compiled over more than a decade by the Japanese Toxicogenomics Project (http://toxico.nibiohn.go.jp). This dataset consists of over 24,000 microarray samples. To make use of such a large dataset without time-consuming data manipulation and programming, a rich user interface and access to many kinds of secondary data are necessary.

Toxygates allows anyone with a web browser to explore and analyse this data in context. Various kinds of filtering and statistical testing are available, allowing users to discover and refine gene sets of interest, with respect to particular compounds. For a reasonably sized data selection, hierarchical clustering and heat-maps can be displayed directly in the browser. Through TargetMine (http://targetmine.nibiohn.go.jp) integration (based on the InterMine framework), enrichment of various kinds is possible. Compounds can also be ranked according to how they influence genes of interest.

To support all of these functions, we came up with the concept of a “hybrid” data model which recognises that, while gene expression values by themselves may be viewed as a large matrix with a flat structure, secondary annotations of genes and samples, such as
proteins, pathways, GO terms or pathological findings, have an open-ended structure. Thus, we combine an efficient key-value store (for gene expressions) with RDF and linked data (for gene and sample annotations) to allow for both high performance and a flexible data structure.
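As a toy illustration of this hybrid idea (this is our sketch, not Toxygates code; the sample and probe identifiers are made up), a flat key-value matrix for expression values can be paired with a triple set, RDF-style, for open-ended annotations:

```python
# Illustrative sketch of a "hybrid" store: expression values in a flat
# key-value matrix, annotations as open-ended subject-predicate-object triples.

class HybridStore:
    def __init__(self):
        self.expression = {}   # (sample_id, probe_id) -> expression value
        self.triples = set()   # (subject, predicate, object) annotation facts

    def set_expression(self, sample_id, probe_id, value):
        self.expression[(sample_id, probe_id)] = value

    def annotate(self, subject, predicate, obj):
        # Open-ended: any number of predicates can be attached to a subject.
        self.triples.add((subject, predicate, obj))

    def expression_value(self, sample_id, probe_id):
        return self.expression[(sample_id, probe_id)]

    def annotations(self, subject):
        return {(p, o) for s, p, o in self.triples if s == subject}

store = HybridStore()
store.set_expression("sample1", "1367452_at", 7.82)
store.annotate("1367452_at", "mapsTo", "Sumo2")
store.annotate("Sumo2", "inPathway", "Protein sumoylation")

print(store.expression_value("sample1", "1367452_at"))
print(sorted(store.annotations("1367452_at")))
```

The point of the split is that matrix lookups stay fast regardless of how many kinds of annotation are later attached to genes or samples.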

Today, the project continues to evolve in new directions as a general transcriptomics data analysis platform. We have integrated Toxygates not only with TargetMine, but also with HumanMine, RatMine and MouseMine. Users can now also upload their own transcriptomics data and analyse it in context alongside Open TG-GATEs data. We may also add more datasets in the future.

The current project members are Kenji Mizuguchi (project leader) and Chen Yi-An (NIBIOHN), Johan Nyström-Persson and Yuji Kosugi (Level Five), and Yayoi Natsume-Kitatani and Yoshinobu Igarashi (NIBIOHN).


InterMine 2.0 – Gradle

NB: To upgrade to InterMine 2.0 you must not have custom code in the core InterMine repository.

We have been planning out the tasks for future InterMine development, and there are a lot of exciting projects on the horizon: making InterMine more FAIR, putting InterMine in Docker and the cloud, building our beautiful new user interface, embracing the Semantic Web, and so on.

However, a prerequisite for these exciting features is updating our build system. We are still using Ant, and the build has grown, let’s say, “organically” over the years, making updates and maintenance expensive and tedious.

After careful consideration, and after looking very seriously at other build and dependency management systems, we’ve decided on Gradle. Gradle is hugely popular, has a great community, and is used by projects such as Android, Spring and Hibernate. We were really impressed with Gradle’s power and flexibility; the ability to run scripts in Gradle will give us what we need to accomplish all our lofty goals.

Our goals for moving to Gradle

Managed dependencies

Our dependencies are currently managed by hand: if we need a third-party library, we copy the JAR manually into our /lib directory. This is unsustainable for modern software and has resulted in lots of duplication and general heartache. With Gradle we can instead fetch dependencies automatically from online repositories.
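For illustration (the coordinates shown are examples, not InterMine’s actual dependency list), a Gradle build script declares each dependency by coordinate, and Gradle fetches it from a repository such as Maven Central:

```groovy
repositories {
    mavenCentral()
}

dependencies {
    // Fetched automatically by coordinate; no JARs checked into /lib.
    compile group: 'org.postgresql', name: 'postgresql', version: '9.4.1212'
    testCompile group: 'junit', name: 'junit', version: '4.12'
}
```

Transitive dependencies come along automatically, which is exactly the bookkeeping we currently do by hand.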

A smaller repository

Implementing Gradle will allow us to replace many of our custom Ant-based facilities with Gradle infrastructure and widely-supported plugins. Our codebase will become smaller and more maintainable as a result.

A faster build

Currently, due to the way that InterMine implemented a custom project dependency system in Ant, every InterMine JAR is compiled on every build and every time a webapp is deployed. This is unnecessary and wastes developer time. We will use Gradle’s sophisticated dependency management system to make the InterMine build more robust and efficient.

Maintainable, extensible, documented

The current Ant-based InterMine build system has been extended over the years as needed in an ad-hoc manner, and unfortunately no documentation exists. Adding a new ant task is a challenge, and debugging the current build process is time consuming and difficult. Moving to Gradle will base InterMine on a well maintained, extensible, documented and widely-used build system.

Simpler to run test suite

Currently, developers have to create property files and databases to run the full system tests, steps that are not straightforward to perform. With Gradle’s help we hope to make this much easier, so that the wider InterMine community can benefit from running the InterMine test suite on their own installations and code patches.

Simplicity

Finally, under Gradle, tests live in the same project as the main source, so the number of separate projects will be cut in half. In addition, tests will be run automatically as part of the build.

As an example, here is a new standard Gradle directory layout:

src/main/java
src/main/resources
src/test/java
src/test/resources

Currently our main and test code live in separate projects, but in InterMine 2.0 these will be unified under single projects, as per standard practice.

What does this mean for you and your InterMine?

If you are currently maintaining an InterMine, moving to InterMine 2.0 is going to require a bit of effort on your part.

Operationally, commands such as database building and web application publishing are very likely to use Gradle commands rather than Ant targets or custom scripts. Users who have scripts to manage InterMine installations will need to adjust them appropriately. This shouldn’t require too much work.
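As a rough sketch of what this might look like in practice (the task names here are illustrative, not final), commands that today run Ant targets would instead invoke the Gradle wrapper:

```shell
# Today: Ant targets run from the project checkout
ant build-db
ant default remove-webapp release-webapp

# InterMine 2.0 (Gradle wrapper; task names illustrative)
./gradlew builddb
./gradlew cargoDeployRemote
```

Scripts would mostly need a one-for-one substitution of commands rather than a redesign.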

InterMine users who have custom projects in the bio/sources directory to load data sources will need to make more adjustments. Project structures in InterMine 2.0 will not be the same as in earlier versions, since they will follow Gradle conventions rather than custom InterMine ones. However, the changes will not be major and we will provide a script to do as much automatic updating of custom sources as possible.

The greatest migration work will come for the most sophisticated operators who have directly patched core InterMine code. In this case, there are two options. Firstly, they can continue to patch and build core InterMine JARs themselves, though they will need to make adjustments for the Gradle build process. Secondly, we can work with them to add new configuration parameters to core InterMine to make such patching unnecessary, wherever possible. In both cases work will be required but the effort should not be large, since it is largely the structure of code that is changing rather than core logic or functionality.

 

This is a significant transition but one that should put InterMine on a solid base that lowers long-term maintenance costs and makes lots of exciting stuff possible in the future. As ever, please contact us if you have any concerns and we look forward to discussing this and any other subjects on community calls, blog comments, in our Discord chat and on the mailing list!

Where to find InterMiners: September-December 2017 edition

We’re busy as ever, and Gos is away at the #biohack2017 in Japan right now – you can spot him in a gold shirt sitting towards the back of the room here:

Other places to find InterMiners over the next few months include:

September:

12 September: FAIR in practice focus group – Research support professionals. Daniela will be at the British Library participating in this consultation. You may also be interested in the researchers focus group on the 13th. It looks like tickets are still available! (More)

21 September: **Cancelled** The usual community call is cancelled this week. We’ll be back as normal with updates in October, though!

25-27 September: Justin and Yo will be attending the Cambridge Bioinformatics Hackathon.

October:

2-3 October: You’ll be able to find Justin at the Bioschemas Elixir implementation meeting in Hinxton.

5 October: InterMine dev community call – back to our normally scheduled calls. Agenda.

13-15 October: Find Yo at the 2017 GSoC mentors summit in Sunnyvale, California.

19 October: It’s another community developer call, yay! 

21-25 October: Justin will be representing us at ISWC in Vienna.

25 October: Better Science through Better Data in London – we’ll be sharing the story of InterMine in a lightning talk. Open data is awesome and InterMine couldn’t exist without it!

27 October: We’ll be delivering an InterMine training course in Cambridge, including an all-new API training section. Please spread the word about this one!

November:

November 1-2: You’ll be able to spot Justin at the Elixir UK all hands in Edinburgh.

December:

December 4-7: Get your Semantic Web on with Daniela at SWAT4LS in Rome!

Phew, that’s a lot!

 

 

 

InterMine 2.0: Proposed Model Changes (III)

We have several new additions and changes to the InterMine core data model coming in InterMine 2.0 (due Fall 2017).

We had a great discussion on Thursday about the proposed changes. Below are the decisions we made.

Multiple Genome Versions, Varieties / Subspecies / Strains

 

We were not able to come to an agreement, but everyone felt there may still be a core data model that can accommodate both single and multiple genomes and be useful for all InterMines.

The fundamental question is whether we want an organism to correspond to a single genome version, or to allow an organism to have several genome versions. In the latter case, we’d also need a helper class, e.g. “Strain”, that would contain information about the genome.

This topic is sufficiently complex that we’ve agreed to try a more formal process here, listing our different options, their potential impact etc. More information on this process soon!

Syntenic Regions

Proposed addition to the InterMine core data model

<class name="SyntenicRegion" extends="SequenceFeature" is-interface="true">
  <reference name="syntenyBlock" referenced-type="SyntenyBlock" reverse-reference="syntenicRegions"/>
</class>

<class name="SyntenyBlock" is-interface="true">
  <collection name="syntenicRegions" referenced-type="SyntenicRegion" reverse-reference="syntenyBlock"/>
  <collection name="dataSets" referenced-type="DataSet"/>
  <collection name="publications" referenced-type="Publication"/>
</class>

  • We decided against making SyntenyBlock a BioEntity, even though it would benefit from inheriting some references.
  • We also decided against a SyntenicRegion1 / SyntenicRegion2 format; instead, the regions will be held in a single collection.

GO Evidence Codes

Currently the GO evidence codes are only a controlled vocabulary and are limited to the code abbreviation, e.g. IEA. However, UniProt and other data sources have started to use ECO ontology terms to represent the GO evidence codes instead.

We decided against changing the GO Evidence Code to be an ECO ontology term.

  • The ECO ontology is not comprehensive
  • Some mines have a specific data model for evidence terms

Instead we are going to add attributes to the GO Evidence Code:

  • Add a link to more information on the GO evidence codes
  • Add the full name of the evidence code.
  • Change GOEvidenceCode to be OntologyAnnotationEvidenceCode

We decided against loading a full description of each evidence code. The description on the GO site is a full page; we tried shortening it, but then it didn’t really add much information. Also, the descriptions are not available as a text file.

We are also going to move evidence to Ontology Annotation.

GOEvidenceCode will be renamed OntologyAnnotationEvidenceCode:

<class name="OntologyAnnotationEvidenceCode" is-interface="true">
 <attribute name="code" type="java.lang.String" />
 <attribute name="name" type="java.lang.String" />
 <attribute name="URL" type="java.lang.String" />
</class>

GOEvidence will be renamed OntologyEvidence:

<class name="OntologyEvidence" is-interface="true">
 <reference name="code" referenced-type="OntologyAnnotationEvidenceCode"/>
 <collection name="publications" referenced-type="Publication"/>
</class>

Evidence will move to OntologyAnnotation from GOAnnotation:

<class name="OntologyAnnotation" is-interface="true">
 <collection name="evidence" referenced-type="OntologyEvidence"/>
</class>

 

Ontology Annotations – Subject

Currently, annotations can only reference BioEntities, e.g. proteins and genes. This is too restrictive, as any object in InterMine can be annotated, e.g. protein domains. To solve this problem, we will add a new data type, Annotatable.

<class name="Annotatable" is-interface="true">
 <collection name="ontologyAnnotations" referenced-type="OntologyAnnotation" reverse-reference="subject"/>
</class>

<class name="OntologyAnnotation" is-interface="true">
 <reference name="subject" referenced-type="Annotatable" reverse-reference="ontologyAnnotations"/>
</class>

<class name="BioEntity" is-interface="true" extends="Annotatable"/>

This will add complexity to the data model but this would be hidden from casual users with templates.

Protein molecular weight

Protein.molecularWeight is going to be changed from an integer to a float.
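In model XML terms this is a one-line change; the Java type names shown here are our assumption of the likely mapping:

```xml
<!-- Before -->
<attribute name="molecularWeight" type="java.lang.Integer"/>
<!-- After -->
<attribute name="molecularWeight" type="java.lang.Double"/>
```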

Timeline

October

  • Julie makes changes to core InterMine data model and parsers
  • On ‘model-changes’ branch

November

  • Release beta FlyMine with new model changes for community review
    • Sam will help test Synteny changes
  • Finalise changes. Move changes from ‘model-changes’ branch to ‘release-candidate’ branch
  • InterMine 2.0 will be tested on a staging branch (‘release-candidate’) because the changes are so disruptive:
    • New software build system – Gradle
    • Updated software dependencies required, e.g. Java 8, Tomcat 8, Postgres 9.x
    • Model changes

December

  • “Code freeze”
    • All 2.0 changes tested on ‘release-candidate’ branch
    • Need help testing!
  • InterMine 2.0 release
    • Move changes from dev branch to master branch
    • Before Xmas

If you would like to be involved in these discussions, please do join our community calls or add your comments to the GitHub tickets. We want to hear from you!

InterMineR package

InterMine data can be accessed via command-line programs like cURL and via client libraries for five programming languages (Java, JavaScript, Perl, Python and Ruby). Aiming to expand the functionality of the InterMine framework, an R package, InterMineR, was started to provide basic access to InterMine instances through the R programming environment. (You could run template queries, but not much else!)

However, in order to fully utilize the statistical and graphical capabilities of the R language and make the InterMine framework available to an even greater number of life scientists, the goals were set to:

  1. Further develop and publish the InterMineR package to Bioconductor, a widely used, open source software project based in R, which aims to facilitate the integrative analysis of biological data derived from high-throughput assays.
  2. Add visualisation capabilities, e.g. “What features are close to my feature of interest?”
  3. Add enrichment analysis in InterMineR, a feature that will provide R users with access to the InterMine enrichment analysis widgets and can be effectively combined with the graphical capabilities of R libraries.

InterMineR performs a call to the InterMine Registry to retrieve up-to-date information about the available Mines. The information retrieved is then used to connect the Mines with the R environment using the InterMine web services.

Queries

The InterMineR package can be used to perform complicated queries on a Mine. The process is facilitated by the retrieval of the data model and of the ready-to-use template queries of the respective Mine. The R functions setConstraints and setQuery have been created, along with the formal class InterMineR, to create new or modify existing queries, store them as InterMineR-class objects and apply them to the Mine with the runQuery method.
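Under the hood, InterMine’s web services accept queries as PathQuery XML documents, which is the form such queries take on the wire. As a minimal sketch (the paths and values are illustrative, and this uses only the Python standard library rather than any InterMine client), assembling such a document looks like:

```python
import xml.etree.ElementTree as ET

def build_path_query(views, constraints):
    """Assemble an InterMine-style PathQuery XML document.

    views: list of output paths, e.g. ["Gene.symbol"]
    constraints: list of (path, op, value) triples
    """
    query = ET.Element("query", model="genomic", view=" ".join(views))
    for path, op, value in constraints:
        ET.SubElement(query, "constraint", path=path, op=op, value=value)
    return ET.tostring(query, encoding="unicode")

xml_query = build_path_query(
    ["Gene.symbol", "Gene.organism.name"],
    [("Gene.organism.name", "=", "Drosophila melanogaster")],
)
print(xml_query)
```

Functions like setConstraints and setQuery spare R users from having to think about this serialisation at all.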

Genomic Coordinates


Figure 1: Gene visualisation done via InterMineR and Gviz

InterMineR can retrieve genomic coordinates and gene expression analysis data, which can be converted to GRanges and RangedSummarizedExperiment objects with the R functions convertToGRanges and convertToRangedSummarizedExperiment respectively. This way an interaction layer between InterMineR and other Bioconductor packages (e.g. GenomicRanges and SummarizedExperiment) is established, allowing for rapid analysis of the retrieved InterMine data.
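As a language-neutral sketch of what such a conversion does conceptually (this is not the R implementation; the field names and values are assumptions), turning a table of retrieved coordinates into sorted range records might look like:

```python
from collections import namedtuple

# Illustrative stand-in for a GRanges-like record.
GenomicRange = namedtuple("GenomicRange", "chrom start end strand name")

def to_ranges(rows):
    """Convert result rows (dicts of strings) into sorted range records."""
    ranges = [
        GenomicRange(r["chrom"], int(r["start"]), int(r["end"]),
                     r.get("strand", "*"), r["name"])
        for r in rows
    ]
    # Sorting by chromosome then start position mimics a coordinate-sorted
    # ranges container, ready for overlap-style queries.
    return sorted(ranges, key=lambda g: (g.chrom, g.start))

rows = [
    {"chrom": "2L", "start": "120000", "end": "125000", "name": "geneB"},
    {"chrom": "2L", "start": "100000", "end": "101000", "name": "geneA"},
]
print(to_ranges(rows))
```

The real value of convertToGRanges is that the resulting object plugs directly into the Bioconductor ecosystem, with no such hand-rolled conversion needed.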

Enrichment + GeneAnswers

InterMineR also retrieves InterMine enrichment widgets and facilitates the enrichment analysis on an InterMine instance using the R functions getWidgets and doEnrichment, respectively. With the usage of the R function convertToGeneAnswers the results of the enrichment analysis are converted to a GeneAnswers-class object, therefore allowing the visualization of:

  • Pie charts
  • Bar plots
  • Concept-gene networks
  • Annotation category (e.g. GO terms, KEGG pathways) – interaction networks
  • Gene interaction networks

by using R functions from the GeneAnswers R package.


Figure 2: GeneAnswers GO structure network, generated via InterMineR


Figure 3: GeneAnswers gene network generated using InterMineR

Final steps: Bioconductor & Vignettes

The updated InterMineR package complies with the instructions for submitting new packages to Bioconductor, has passed all automated checks (R CMD build, R CMD check and BiocCheck) and is currently undergoing manual review for Bioconductor submission.

Documentation of each function, along with examples of its usage, is available in the GitHub repo and as help files installed with the package. Furthermore, a detailed vignette and tutorials covering the new functionality of the InterMineR package are currently available in the intermine/InterMineR/vignettes folder of the GitHub dev branch, and will shortly be available on the master branch as well.

This project is part of Google Summer of Code, still under development by me, Konstantinos Kyritsis, PhD student at the Aristotle University of Thessaloniki, under the mentoring of Julie Sullivan and Rachel Lyne. The GitHub repository of the InterMineR package can be found at https://github.com/intermine/InterMineR.

Commits made by Konstantinos can be found here: https://github.com/intermine/InterMineR/commits/master?author=kostaskyritsis

Blog: InterMine Cloud + ISATools: coming to a cloud near you

As many InterMiners may remember, due to some unlucky timing, we had a grant deadline that occurred during the InterMine Developer Workshop earlier this year, with seemingly countless group-work sessions like the one pictured here:

Thankfully, it looks like all our caffeine-fuelled hard work paid off the way we hoped: we are extremely excited to announce that the Wellcome Trust awarded us the grant! Here are a few highlights of what we plan to work on when the project starts in April next year:

Collaboration and metadata

This grant was written with Susanna-Assunta Sansone (Oxford e-Research Centre, University of Oxford) to support a collaboration with the ISA-Tools group. ISA provides tools to structure metadata, covering the Investigation, Study and Assay of experiments; we will integrate the ISA format into InterMine, and the ISA team will develop web-based tools to make metadata creation easier than ever.

Make your own InterMine with less sweat and toil

InterMines can be a challenge to set up right now unless you’re a developer. We’d like to do better. For simpler data formats, we’re hoping to create a UI-based wizard that allows you to drag and drop your files, select a few settings, and start a build – no need to touch a text editor. The more advanced / custom data formats will probably still require hands-on developer time.

InterMine in the cloud

If setting up your own InterMine becomes pleasantly easy, why not do it online? It’d be awesome if you could go to a website, click the “New InterMine” button, upload a few files (or paste their URLs!), and end up with my-new-experiment.some-cloud.org for all your lab’s InterMine needs. Data will be merged with external supplementary data sources, and you’ll be able to analyse your data using visualisations (we’ll dedicate a chunk of time specifically to adding more datavis), our famous results tables, and familiar tools like gene set enrichment.

Enhanced import and export

We’d like to smooth out your way to sharing data with the community. Maybe you want to import data from Galaxy, or perhaps you want to export your InterMine as a virtual machine to be powered up and re-examined at a later date. Maybe you’d prefer to export data for publication as an ISA archive, bundled neatly with metadata, or maybe even generate the scaffold of a data paper automatically from your datasets. These are all things we’ll be working on with Susanna’s group.

Tell us what you think

As ever we are keen for input from the community and as we gear up for the next phase of development, now would be a particularly good time to hear from you! Tweet us, leave a comment, email the developer list, or pop by for a chat.

Google Analytics in BlueGenes: what should we track?

TL;DR: We’re implementing analytics tracking in BlueGenes. We can probably track anything you like, within reason. Leave a comment [comments now closed] or email us if you have anything you’d like to see! Anything we track must adhere to our privacy policy.

Longer version:

InterMine’s JSP pages (the current, older UI) are set up with a couple of different types of tracking:

  1. Google Analytics, which currently anonymously records things like:
    1. Number of users and their locations
    2. Pages viewed
    3. With a bit of effort you can figure out what items were searched for by analysing query strings.
  2. InterMine’s home-brew internal analytics (to view in your own mine, log in as the super user and select the “usage” tab). It tracks:
    1. Logins (anonymously)
    2. Keyword search terms
    3. Popular templates
    4. Count of custom queries executed
    5. List views by InterMine object type (but not list contents)
    6. Count of lists created, by type

So we have a couple of questions we’d love some feedback on, as we implement Google Analytics in BlueGenes:

  1. Do you use the current analytics? Which, or both?
  2. What would you *like* to record? Here’s a list of ideas

Things that are probably okay to track

  • Pageviews, including counts and times – e.g. “17 views for /region-search on Monday the 13th at 10pm”
  • Logins (anonymously)
  • Visitor location
  • Tools used (e.g. report page tools interacted with)
  • Popular templates
  • Mine used / switched to a different mine

Things we’re not sure about – what do you think?

  • Keyword search contents (anonymously). Pros: interesting analyses like this one. Cons: Could someone avoid InterMine out of fear someone would notice their gene is getting too much attention?
  • List contents (anon, as above).
  • What about mistyped identifier names in list upload?
  • Region search
  • Queries built in the query builder

I’m sure I’ve missed quite a few things from both lists. We’d love to hear your input and feelings, both with regard to privacy and with ideas about useful trackable events and pages. Tweet us, comment on the web services tracking GitHub issue, email the dev group, or contact us some other way: http://intermine.readthedocs.io/en/latest/about/contact-us/