GSoC Interview: InterMine Schema Validator with Deepak Kumar

This is our blog series interviewing our 2019 Google Summer of Code students, who are working remotely for InterMine for 3 months on a variety of projects. We’ve interviewed Deepak Kumar, who will be working on the InterMine Schema Validator.

Hi Deepak! We’re really excited to have you on board as part of the team this summer. Can you introduce yourself?

Hi, thank you for this opportunity. Let me first talk about myself. My name is Deepak Kumar, and I live in Ahmedabad, India with my family. I started coding when I was 17. I had two great teachers in my school days who introduced me to computer programming, and since then I have been interested in this field.

I completed my degree in Computer Applications at St. Xavier’s College, Ahmedabad, and I’m currently pursuing a postgraduate MSc IT (Information Technology) program at DA-IICT, Gandhinagar, India.

Turning to the technical side, I love working on challenging projects, and I’ve worked on several. One of my favourite projects, which I created while pursuing my bachelor’s degree, is ‘Smallscript’. It’s a compiled programming language that compiles to bytecode and runs on the JVM, which makes it platform-independent. It’s my favourite project because it was challenging: when I started, I didn’t know any technical details about compilers, so I had to start from scratch.

I’ve also worked at a startup as a backend developer in a team of 8 people, and our team was really fantastic. I worked on two projects there and really enjoyed it; working with a big team was a wonderful experience.

I’ve recently started my open source journey with GSoC 2019. Though I’m new to open source, I’ve started contributing to ‘JabRef’, and as I’ve been selected for GSoC 2019, I’m also going to work with InterMine this summer. I plan to keep contributing to InterMine after GSoC is over. I also regularly participate in coding contests and hackathons. In one AI contest, I built an AI game that ranked 68th among thousands of participants.

Currently, I’m working at OpenXcell Technolabs as an intern, which is part of my MSc IT Master’s program. I love reading, travelling, table tennis and working with new technologies.

What interested you about GSoC with InterMine?

When GSoC 2019 was about to start, I had already bookmarked a few of the previous year’s organizations I was interested in, and I was hoping that InterMine would be part of GSoC 2019 too. When the organization list came out, I was super excited to see InterMine on it. After going through InterMine’s idea list, I found myself very interested in the ‘InterMine Schema Validator’ project. So it was really InterMine’s project that got me interested in the community.

Tell us about the project you’re planning to do for InterMine this summer.

I’ll be working on a project named ‘Schema Validator’ for InterMine this summer. The project is quite simple to explain: it’s going to be a library that takes a file as input and reports whether that file follows a particular schema. From the first day, my goal will be to make the project as general as possible, so that it can easily be extended to support other schemas as well.
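To make the idea of a general, extensible validator concrete, here is a minimal sketch in Python. It is not the project’s actual design: the registry, the function names and the toy "toy-tsv" schema are all assumptions made for illustration. Each schema is just a list of rule functions, so supporting a new schema means registering new rules rather than changing the validator.

```python
import io
from typing import Callable, Dict, List, Optional, TextIO

# A rule inspects one line and returns an error message, or None if the line passes.
Rule = Callable[[str], Optional[str]]

# Hypothetical registry mapping a schema name to the rules that define it.
SCHEMAS: Dict[str, List[Rule]] = {}


def register_schema(name: str, rules: List[Rule]) -> None:
    """Make a schema available to validate() under the given name."""
    SCHEMAS[name] = rules


def validate(stream: TextIO, schema: str) -> List[str]:
    """Return error messages for the input; an empty list means it is valid.

    The stream is read one line at a time, so memory use does not grow
    with the size of the file.
    """
    rules = SCHEMAS[schema]
    errors: List[str] = []
    for line_number, line in enumerate(stream, start=1):
        text = line.rstrip("\n")
        for rule in rules:
            message = rule(text)
            if message is not None:
                errors.append(f"line {line_number}: {message}")
    return errors


# Toy schema (an assumption, not a real InterMine format):
# every line has exactly three non-empty tab-separated fields.
def three_columns(line: str) -> Optional[str]:
    return None if len(line.split("\t")) == 3 else "expected 3 tab-separated columns"


def no_empty_fields(line: str) -> Optional[str]:
    return None if all(f.strip() for f in line.split("\t")) else "empty field"


register_schema("toy-tsv", [three_columns, no_empty_fields])

if __name__ == "__main__":
    sample = io.StringIO("a\tb\tc\nd\te\n")
    for error in validate(sample, "toy-tsv"):
        print(error)
```

Keeping rules as plain functions is only one possible design; a class hierarchy per schema would work just as well.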

Are there any challenges you anticipate for your project? How do you plan to overcome them?

Yes, there are a few challenges that I will face while working on this project. One of the biggest, which I’m currently trying to solve, is performance. As the purpose of this project is to validate files against a schema, the problem is how to handle large files with 10GB of content or more. I need to discuss with my mentors what they expect from the library in terms of performance.

Currently, I’m thinking about the solution to this problem. Maybe I can boost performance by running multiple instances of the Schema Validator concurrently. However I implement it, though, validating a 10GB file is definitely going to take some time.
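One way to picture the concurrent approach is to cut the input into batches of lines and check several batches at once. The sketch below is an assumption-laden illustration, not the planned implementation: the three-column check, the batch size and the function names are all hypothetical. Note also that in CPython, threads do not speed up CPU-bound checks because of the global interpreter lock, so a real implementation would more likely use worker processes.

```python
import io
from collections import deque
from concurrent.futures import ThreadPoolExecutor
from itertools import islice
from typing import Iterable, Iterator, List, Tuple

Batch = Tuple[int, List[str]]  # (number of the first line, lines in the batch)


def batches(lines: Iterable[str], size: int) -> Iterator[Batch]:
    """Yield consecutive batches lazily, without loading the whole input."""
    iterator = iter(lines)
    start = 1
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield start, chunk
        start += len(chunk)


def check_batch(batch: Batch) -> List[int]:
    """Return the numbers of lines that break the (hypothetical) 3-column rule."""
    start, lines = batch
    return [
        number
        for number, line in enumerate(lines, start=start)
        if len(line.rstrip("\n").split("\t")) != 3
    ]


def invalid_lines(lines: Iterable[str], batch_size: int = 1000, workers: int = 4) -> List[int]:
    """Check batches concurrently while keeping only a few batches in memory."""
    pending = deque()
    bad: List[int] = []
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for batch in batches(lines, batch_size):
            pending.append(pool.submit(check_batch, batch))
            # Bound the queue so a very large file is never fully buffered.
            if len(pending) >= workers * 2:
                bad.extend(pending.popleft().result())
        while pending:
            bad.extend(pending.popleft().result())
    return bad


if __name__ == "__main__":
    data = io.StringIO("a\tb\tc\n" * 5 + "broken\n")
    print(invalid_lines(data, batch_size=2, workers=2))
```

Results are collected in submission order, so reported line numbers stay sorted even though batches run concurrently.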

Then there are also a few challenges regarding the implementation of the schema rules.

GSoC Interview: Ankur Kumar on putting InterMine in the cloud

This is our blog series interviewing our 2019 Google Summer of Code students, who are working remotely for InterMine for 3 months on a variety of projects. We’ve interviewed Ankur Kumar, who will be working on the project “InterMine Cloud: Making InterMine cloud native and easing deployments”.

Hi Ankur! We’re really excited to have you on board as part of the team this summer. Can you introduce yourself?

Namaste everyone! I am a second-year undergraduate student at the Indian Institute of Engineering Science and Technology, Shibpur, pursuing a Bachelor’s degree in Mechanical Engineering. Honestly, introducing myself properly is always hard for me. I do not associate myself with a single identity defined by a particular subject, stream of study or profession.

I design bike frames, refrigeration systems and power generation plants. But I also code control algorithms for the motors that power those bikes, and path planning algorithms used by autonomous bikes and robots. I grow plants in a controlled environment with the help of various sensors and actuators, to enhance their yield and study their response to different stresses, and I connect those sensors to the cloud as IoT devices to analyse the collected data. I have a huge interest in commerce, how businesses work and financial markets, and I spend a good amount of my time learning about these things.

This list is not exhaustive. But, as a mandatory disclaimer, I have not yet figured out everything about the things I just mentioned. I hope that one day I will, and then I will move on to new projects. So, to put it poetically, I am a curious explorer who is ready to embark on any journey without even knowing the destination, as long as the journey has plenty of surprises to momentarily satisfy my curiosity.

I know what you are thinking after reading this: why and how do you do all this? (Apart from being too ambitious, a show-off or just insane 😅) Well, I do not have a proper or detailed answer to these questions. I just keep trying to do things and they eventually happen. But I have a better question for everyone instead: why not? It is too much fun to live this way. I promise!

What interested you about GSoC with InterMine?

I always wanted to work on a project at the intersection of computer science and biology. Both of these fields attract me equally. I had a really hard time choosing between them when I was filling in my admission form for senior secondary. I eventually went for biology, if you are wondering. InterMine is a perfect place for me to explore both of these fields. But this is not the most important thing that made me choose InterMine. The most important thing is the people at InterMine. InterMine has an awesome and very friendly community. The mentors are very supportive and responsive, and I had a great experience discussing the details of my project with them. I can confidently say that my mentors are the best. If anyone thinks otherwise, I am ready for a debate!!

Tell us about the project you’re planning to do for InterMine this summer.

My project forms part of a larger effort by the InterMine team to make InterMine more accessible to its users. More specifically, my project aims to create a service that offers managed InterMine instances on the cloud. The work done on my project will also be used to create a CLI tool that eases the creation of local InterMine instances, using the same cloud technologies.

Are there any challenges you anticipate for your project? How do you plan to overcome them?

The most important one is time. I have a long list of tasks that need to be completed. I also need to coordinate with two other projects, which can be tricky. To overcome these challenges, I worked hard to come up with a very detailed timeline and design documentation. So my plan for the coding period is simple: while tasks remain, pick one task at a time, work hard on it, complete it on time and then party hard on weekends.

Share a meme or gif that represents your project

Replacing a lightbulb - Imgur

InterMine 4.0 – InterMine as a FAIR framework

We are excited to publish the latest version of InterMine, version 4.0.

It’s a collection of our efforts to make InterMine more “FAIR”. As an open source data warehouse, InterMine’s raison d’être is to be a framework that enables people to quickly and easily provide public access to their data in a user-friendly manner. InterMine has therefore always strived to make data Findable, Accessible, Interoperable and Reusable, and this push is meant to formally apply the FAIR principles to InterMine.

What’s included in this release?

  1. Generate globally unique and stable URLs to identify InterMine data objects, in order to provide more findable and accessible data.
  2. Apply suitable ontologies to the core InterMine data model to make the semantics of InterMine data explicit and facilitate data exchange and interoperability.
  3. Embed metadata in InterMine web pages to make data more findable.
  4. Improve accessibility of data licences for integrated sources via the web interface and REST web service.

More details below!

How to upgrade?

This is a non-disruptive release, but there are additions to the data model. Therefore, you’ll want to increment your version, then build a new database when upgrading. No other action is required.

However, keep reading to find out how to take advantage of the new FAIR features in this release.

Unique and stable URLs

We’ve added a beautiful new user-friendly URL.

Example: http://beta.flymine.org/beta/gene:FBgn0000606

Currently this is used only in the “share” button on the report pages and in the web page markup. In the future, this will be the only URL seen in the browser location bar.
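The example URL above ends in a segment that looks like a class name and an identifier joined by a colon. Assuming that layout holds in general (it is inferred from this single example; the docs describe the actual configuration), a client could split such a URL as in this small Python sketch. The function name is hypothetical.

```python
from typing import Tuple
from urllib.parse import urlsplit


def split_permanent_url(url: str) -> Tuple[str, str]:
    """Return (class_name, identifier) taken from the last path segment.

    Assumes the last segment has the form "type:identifier",
    as in the example URL from the post.
    """
    last_segment = urlsplit(url).path.rstrip("/").rsplit("/", 1)[-1]
    class_name, separator, identifier = last_segment.partition(":")
    if not separator or not class_name or not identifier:
        raise ValueError(f"no 'type:identifier' segment in {url!r}")
    return class_name, identifier


if __name__ == "__main__":
    print(split_permanent_url("http://beta.flymine.org/beta/gene:FBgn0000606"))
```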

For details on how to configure your mine’s URLs, see the docs here.

See our previous blog posts on unique identifiers.

Decorating the InterMine data model with ontology terms

InterMine 4.0 introduces the ability to annotate your InterMine data model with ontology terms.

While these data are not used (yet), it’s an important feature in that it’s going to facilitate cross-InterMine querying, and eventually cross-database analysis — allowing us to answer questions like “Is the ‘gene’ in MouseMine the same ‘gene’ at the EBI?”.

For details on how to add ontologies to your InterMine data model, see the docs here.

Embedding metadata in InterMine webpages

We’ve added structured data to web pages in JSON-LD format to make data more findable, and these data are indexed by Google data search. Bioschemas.org is extending Schema.org with life science-specific types, adding required properties and cardinality on the types. For more details see the docs here.

By default this feature is disabled. For details on how to enable embedding metadata in your webpages, see the docs here.
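For readers who have not seen JSON-LD before, the following Python sketch shows roughly what embedded structured data looks like. It is an illustration only: the choice of the "Gene" type, the properties used and the helper name are assumptions, and the markup InterMine actually generates may differ (see the docs).

```python
import json


def gene_json_ld(identifier: str, name: str, url: str) -> str:
    """Build a <script> element carrying JSON-LD for one gene.

    The "@type" and the property names below are illustrative assumptions,
    not a copy of InterMine's real output.
    """
    markup = {
        "@context": "https://schema.org",
        "@type": "Gene",
        "identifier": identifier,
        "name": name,
        "url": url,
    }
    body = json.dumps(markup, indent=2)
    return '<script type="application/ld+json">\n' + body + "\n</script>"


if __name__ == "__main__":
    # The name is a placeholder, not the real gene name.
    print(gene_json_ld(
        "FBgn0000606",
        "placeholder gene name",
        "http://beta.flymine.org/beta/gene:FBgn0000606",
    ))
```

Search engines read the JSON inside the script element; it is not rendered on the page.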

Data licences

In our ongoing effort to make the InterMine system more FAIR, we have started working on improving the accessibility of data licences, retaining licence information supplied by the data sources integrated in InterMine, and making it available to humans via our web application and machines via queries.

See our previous blog post on data licences.

For details on how to add data licences to your InterMine, see the docs.

Future FAIR plans

  1. Provide an RDF representation of stored data, lists and query results, and a bulk download of all InterMine data in RDF form, so that users can import InterMine resources into their local triplestore.
  2. Provide an infrastructure for a SPARQL endpoint where the user can perform federated queries over multiple data sets.

Upcoming Releases

The next InterMine version will likely be ready in the Fall/Winter and include some user interface updates.

Docs

To update your mine with these new changes, see upgrade instructions. This is a non-disruptive release.

See release notes for detailed information.

InterMine 3.1.2 – patch release

We’ve released a small batch of bug fixes and small features. Thank you so much to our contributors: Sam Hokin, Arunan Sugunakumar and Joe Carlson!

Features

  • Templates can be tagged by any user, not just the super user. (Via webservice only – for now)

Fixes

  • When searching our docs, sometimes the “.html” extension was dropped. This was fixed by our beautiful documentation hosts, readthedocs.org
  • Installing the “bio” project via Gradle does not fail if you do not have the test properties file.
  • Gradle logs error fixed
  • Removed old GAF 1.0 code
  • Fixed XML library issue:  java.lang.ClassCastException for org.apache.xerces
  • Set converter.class correctly
  • Updated the protein atlas expression graph
  • Handle NULL values returned by NCBI web services
  • Updated Solr to support new Solr versions
  • Removed unneeded Gretty plugin
  • Better error handling for CHEBI web services
  • Publication abstract is longer than postgres index
  • Removed phenotype key; it’s not in the core model and has a conflicting key
  • Updated ObjectStoreSummary to handle ignored fields consistently.

Upcoming Releases

InterMine 4.0 is scheduled for release the week of 7 May 2019.

Docs

To update your mine with these new changes, see upgrade instructions. This is a non-disruptive release.

See release notes for detailed information.

InterMine 3.1.1 – patch release

We’ve released a small batch of bug fixes and small features. Thank you so much to our contributors: Sam Hokin, Paulo Nuin and Joel Richardson!

Features

  • Added access to the GFF header in the GFF parser
  • GFF sequence handler has access to feature now
  • Added DOES NOT CONTAIN constraint
  • Added a few end points for BlueGenes

Fixes

  • InterPro source handles DTD correctly
  • Updated to new GitHub URL for Gretty Plugin
  • Fixed OMIM link outs
  • NCBI is going to update their GFF files (at our request! Thanks Wayne!)
    • fix spelling on feature “DNaseI_hypersensitive_site”
    • change “recombination_region” to “recombination_feature”
  • Updated external links on enrichment widget
  • Handle NULL search index correctly
  • Fix publication with NULL title
  • Fixed log library dependency conflict
  • Removed deprecated Yahoo login link
  • Fixed Panther source to handle proteins

Upcoming Releases

  • 3.1.2 – More small bug fixes
  • 4.0.0 – FAIR release

Docs

To update your mine with these new changes, see upgrade instructions. This is a non-disruptive release.

See release notes for detailed information.

Being FAIR – data licences in InterMine

Image Licence: CC BY-ND 2.0, via flickr https://www.flickr.com/photos/juditk/4499834152/. No changes made to image.

In our ongoing effort to make the InterMine system more FAIR, we have started working on improving the accessibility of data licences, retaining licence information supplied by the data sources integrated in InterMine, and making it available to humans via our web application and machines via queries.

Open data licences

If you want to make your software open you need to:

  1. publish your software in a public space
  2. apply a suitable free and open source licence

The absence of licence means that nobody can legally use, copy, reproduce or distribute your software.

The same goes for data. If you want to make your data open you need to:

  1. publish your data
  2. apply a suitable open data licence

Without a licence, users don’t know how they can use and re-use your data! Choosing a licence is not always easy, but there are already some open licences designed exclusively for data, developed by Open Data Commons (https://www.opendatacommons.org) and Creative Commons (https://creativecommons.org).

Data licences in InterMine

InterMine provides a library of data parsers for 26 popular data sets, e.g. NCBI, UniProt etc. We went through each of these core InterMine data sources and recorded the data licence for each. During this process we identified 3 cases:

Pie chart showing that 34.6% of data sources had licences, 53.8% had some licence info, and 11.5% had no licensing info at all.

Case 1: Data source had a data licence (34.6%)

Example: http://creativecommons.org/licenses/by/4.0/

Perfect, ideally all data sets would have licenced data!

Case 2: Data source had some information about how data can be reused (53.8%)

Example: https://www.ncbi.nlm.nih.gov/home/about/policies/

Good to have information on how to reuse the data, but these URLs might change. Also, in some cases the wording was vague or confusing, and the page itself was hard to find.

For example, one data provider has a statement “This work by our lab is licensed under …”. What does “this work” mean? Software? Data? Both? It wasn’t clear. Another data provider offers their data “free of all copyright restrictions”. How do we represent that?

Case 3: Data source had no information about how data can be reused (11.5%)

Example: Experimental data which has no data licence.

In cases where no data licence is listed and there is no information about how the data can be reused, we have emailed the data providers and asked for clarification.

Solutions

We have to find a way to provide data licensing information even though these data are inconsistent. And regardless of how popular data licences become in the future, due to the integrative nature of InterMine, we’ll always have to handle all three cases.

What’s the best way to present these data in InterMine so that data consumers can easily understand how they can re-use data?

Possible options:

  1. Only provide URL to official data licence as recommended by voiD, the “Vocabulary Of Interlinked Datasets”
    1. URL will not change, e.g. http://creativecommons.org/licenses/by/4.0/
    2. Easy to ascertain permissiveness
    3. Easy to compare across data sets
  2. Provide URL to data licence OR to more information
    1. URL might change
    2. Useful because people can get details on allowed usage, even if there is no data licence
  3. Provide licence text and URL. Would provide more information immediately to users where there isn’t a licence
    1. Danger of being inaccurate or out of date
    2. User would not have to leave InterMine to see what’s allowed
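Whichever option is chosen, the three cases identified earlier have to be representable. As a thinking aid only, here is a hypothetical Python record that covers all three; the class, its field names and the data set labels are assumptions and do not reflect how InterMine stores licence information.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class LicenceInfo:
    """Hypothetical record of what is known about one data set's reuse terms."""

    data_set: str
    licence_url: Optional[str] = None  # case 1: URL of an official data licence
    info_url: Optional[str] = None     # case 2: page describing allowed reuse

    def case(self) -> int:
        """Return 1, 2 or 3, matching the three cases described in the post."""
        if self.licence_url:
            return 1
        if self.info_url:
            return 2
        return 3

    def summary(self) -> str:
        """Short text a web page or web service could show to data consumers."""
        if self.case() == 1:
            return f"{self.data_set}: licensed under {self.licence_url}"
        if self.case() == 2:
            return f"{self.data_set}: no formal licence, see {self.info_url}"
        return f"{self.data_set}: no reuse information available"


if __name__ == "__main__":
    records = [
        LicenceInfo("Example data set A", licence_url="http://creativecommons.org/licenses/by/4.0/"),
        LicenceInfo("Example data set B", info_url="https://www.ncbi.nlm.nih.gov/home/about/policies/"),
        LicenceInfo("Example data set C"),
    ]
    for record in records:
        print(record.summary())
```

Storing the licence text as well (option 3) would mean adding another optional field, with the staleness risk noted above.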

What do you think? Please let us know your opinion – leave a comment on this post, pop by chat, or email our developer lists to discuss this further.

InterMine 3.1 – Extending the Core InterMine Data Model with Multiple Genome Versions, Strains

Advances in sequencing technologies mean that genome sequence and annotation data for multiple strains of a species are now often available. It was decided to update the InterMine core data model so that Strain data can be added where available, without affecting InterMines which do not have this data.

It was decided that adding a new class, Strain, which is referenced by Organism and SequenceFeature and vice versa, would provide the flexibility required and leave room for further data and expansion.


The Strain class has the following features/advantages:

  • SequenceFeature entities, such as Genes, would continue to reference Organism, but would also reference the new Strain class, allowing for queries returning SequenceFeatures for a specific strain.
  • Providing strain information as a separate class allows individual InterMines to reference other information as required, such as Genotype and Stocks.
  • The Strain class extends BioEntity so will include strain-relevant attributes such as PrimaryIdentifier and Name and will reference other collections such as synonym.
  • Minimal changes to the user interface will be required as, to our knowledge, SequenceFeatures in individual strains always have a unique identifier. With the help of templates if necessary, users will be able to identify particular SequenceFeatures and which strain they originate from.
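The relationships described above can be pictured with a few plain Python classes. This is a simplified sketch, not the InterMine model definition: the real model is not written in Python, Strain extends BioEntity there, and the field names and example values below are assumptions.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Organism:
    name: str
    # Back-reference to strains; left out of repr/eq to avoid cycles.
    strains: List["Strain"] = field(default_factory=list, repr=False, compare=False)


@dataclass
class Strain:
    primary_identifier: str
    name: str
    organism: Organism


@dataclass
class SequenceFeature:
    primary_identifier: str
    organism: Organism
    # Optional, so mines without strain data are unaffected.
    strain: Optional[Strain] = None


def features_for_strain(features: List[SequenceFeature], strain_id: str) -> List[SequenceFeature]:
    """Return the features that reference the strain with the given identifier."""
    return [
        f for f in features
        if f.strain is not None and f.strain.primary_identifier == strain_id
    ]


if __name__ == "__main__":
    organism = Organism("Example organism")
    strain_a = Strain("STRAIN:A", "strain A", organism)
    strain_b = Strain("STRAIN:B", "strain B", organism)
    organism.strains.extend([strain_a, strain_b])

    features = [
        SequenceFeature("GENE:1", organism, strain_a),
        SequenceFeature("GENE:2", organism, strain_b),
        SequenceFeature("GENE:3", organism),  # no strain information
    ]
    print([f.primary_identifier for f in features_for_strain(features, "STRAIN:A")])
```

Every feature still references its Organism, while the strain reference is filled in only where the data exists.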

To update your mine with these new changes, see upgrade instructions. This is a non-disruptive release.

See release notes and the notes from the community call for more details. Please join our community calls if you’d like to be part of future data model decisions! (Details of upcoming calls are available via our developer mailing list).

InterMine, Oracle and the Future of Java

There have been a few questions about Oracle’s announcements on the future of Java, so this post hopes to cover what actually has changed and how this impacts InterMine as a software package.

In short, these changes do not impact InterMine negatively, but we should be aware of these issues.

Oracle JDK 11 is not free for use in production; Use OpenJDK instead

Oracle changed its licensing a bit. Starting with Java 11, Oracle now releases its two JDKs under different licences:

  1. OpenJDK (open source under GPL)
  2. Oracle JDK (commercial licence)

(Previously, Oracle had released these both under the BCL licence which allows a mix of free and commercial use, so you only had to pay “sometimes”).

To use the Oracle JDK 11 in a production environment, you now need to purchase a commercial licence. You are still allowed to use this JDK in development, for demos etc but the Oracle JDK 11 is NOT free to use in production.

We develop InterMine against (and recommend people use) OpenJDK instead of the commercial JDK Oracle provides. As of Java 11, these two JDKs are now virtually identical so this is safe.

Oracle JDK 8 — “End of Public Updates”; Use OpenJDK instead

Oracle will provide public updates of Oracle JDK 8 through at least December 2020 for personal desktop use and January 2019 for commercial use. You can continue to use Oracle’s JDK indefinitely without updates, but that’s a bad idea for security and functionality reasons. If you want updates to Java 8, switch to OpenJDK; there are free OpenJDK builds from other providers like AdoptOpenJDK, Azul, IBM, Red Hat, other Linux distros etc.

OpenJDK binaries from Oracle will only be provided until the next JDK release; Use OpenJDK from a non-Oracle provider

Oracle changed their release schedule to be twice a year, and they will not provide an LTS release for OpenJDK. Oracle will not provide updates to older OpenJDK versions, e.g. versions older than six months. This includes security fixes!

This is troubling as the InterMine release schedule is such that it’s not feasible to update Java versions every six months. But we can’t ignore needed security fixes.

However, Red Hat announced in September that they would take a leadership role in this area. Some providers, e.g. https://adoptopenjdk.net, plan to offer OpenJDK LTS releases for free. So there will be OpenJDK LTSs available, just not from Oracle.

What does this all mean for InterMine? Not Much!

We’ll keep monitoring the situation but this seems like the usual way that companies manage open source projects — providing open software and additional paid support. So nothing to be alarmed about. OpenJDK is open source, so we are safe.

People are (rightly?) concerned about Oracle’s true commitment to Java and open source going forward. What if they change their mind and don’t release updates to OpenJDK? For InterMine this isn’t too scary because worst case scenario we could use an older stable version of Java. However in this nightmare scenario it’s likely that Java would be forked and we could carry on.

Future InterMine plans

We have no plans to migrate away from Java and will continue to develop using the OpenJDK as normal. We develop against the Java specification not the version so we aren’t tied to a specific Java version. For now, we’re recommending staying with OpenJDK 8 but plan to start testing with Java 11 soon.

Although some are suspicious of Oracle due to past experiences, we are optimistic about the future of Java, as the community really seems to be responding to the need for a secure and open Java.


InterMine Releases – Winter 2018 Update (Solr, Strains and being more FAIR)

Here’s a list of recent and upcoming InterMine releases.

InterMine 3.0 – Solr

Just released! This is the Solr project we discussed over the summer that was done as part of Google Summer of Code (Thanks again Arunan!). See our blog post for details.

InterMine 3.1 – Strains

This will be released next week. The release will include the data model changes we discussed on the last community call. We’ve added Strain to the core data model, with references to Organism and SequenceFeature.

Sam’s built a test mine you can query to preview the updates.

This will not be a disruptive release, except you may want to update your strains to match the core InterMine data model.

InterMine 3.1.1 – Bug fixes

3.1.1 is a small release comprising a few very small but useful bug fixes and features. If you have something specific you need done, please ask!

This will not be a disruptive release.

InterMine 4.0 – FAIR

We’ve been making InterMine more FAIR! This release will include things like adding licence information to data sets, adding ontologies to describe the data model etc. More details soon! We’re hoping this release is ready late January 2019.

This will not be a disruptive release.

Thanks for reading! As always, if you have any questions, please hop onto our discord server (chat.intermine.org) or drop us an email.

Helpful Links:

Release Notes

Upgrade Instructions

 

InterMine 3.0 – Solr search

InterMine 3.0 is now available and features a brand new search powered by Solr.

The default search configuration will work well, but Solr allows for endless configuration for your specific needs.

Now the first search after deployment is instant, you can inspect the search index directly (via http://localhost:8983/solr/) and there’s a facet web service (via /service/facet-list and /service/facets?q=gene). Certain bugs, e.g. searching for the gene “OR”, are also now fixed.
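The two facet paths mentioned above are relative to your mine’s web application. As a small illustration, this Python sketch builds the full URLs for a given mine; the base URL in the example is a made-up local address and the helper name is hypothetical.

```python
from typing import Dict
from urllib.parse import urlencode


def facet_urls(mine_base_url: str, query: str) -> Dict[str, str]:
    """Build the facet web service URLs for a mine.

    mine_base_url is the root of the mine's web application
    (a hypothetical local address in the example below).
    """
    base = mine_base_url.rstrip("/")
    return {
        "facet_list": f"{base}/service/facet-list",
        "facets": f"{base}/service/facets?{urlencode({'q': query})}",
    }


if __name__ == "__main__":
    for name, url in facet_urls("http://localhost:8080/examplemine", "gene").items():
        print(name, url)
```

Using `urlencode` keeps search terms with spaces or special characters valid in the query string.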

New Configuration Option – optimize

There is a new keyword search configuration setting: index.optimize. If set to `true`, the index is reorganised so that chunks are placed together in storage, which might improve search time. (This is similar to defragmenting a hard disk.) See the configuration docs for more details.

Docs

Installing Solr

Configuring the keyword search

InterMine 3.0 upgrade instructions | release notes

A big thank you to our clever and hard-working 2018 Google Summer of Code student Arunan Sugunakumar — who did the bulk of the work as part of his summer project. Great job!