Google Summer of Code: Let the enquiries commence!

Last month we applied for InterMine to join Google Summer of Code (GSoC) as a mentor organisation, and we’re pleased to report that we have officially been accepted!

Students: Interested in working with us for GSoC?

Our GSoC site has a project ideas list and the student application guidance, which hopefully will answer most of your questions.

Want to learn more?

  • You can also read our GSoC blog posts from last year to learn more about how things went.
  • If you still have questions:
    • If the question is project-specific: email both listed mentors of the given project.
    • If the question is about GSoC in general, see the student manual.
    • We’ll be running a GSoC question and answer video call session where students can learn more about the specific projects. Updates about the exact date and time will go out on this blog, our mailing lists, and twitter.

We’ll look forward to hearing from you!

 


Community Spotlight: Correlating fly gene expression with FlyCAGE – Interview with @codingbash

For today’s blog we’ve interviewed Basheer Becerra, an undergrad bioinformatics researcher at Illinois State University, about the application he’s created, FlyCAGE (Correlation Analysis on Gene Expression). (Live Demo, GitHub)

Hi Basheer! We love hearing about new and interesting uses of InterMine. Can you give us a brief non-technical intro to FlyCAGE? What inspired you to make it?

FlyCAGE is a web application that allows users to search for genes in Drosophila melanogaster that follow a specific mRNA expression pattern. The user can either enter a known gene name to find other genes with similar expression profiles, or enter a custom expression pattern based on experimental data to find genes that follow the pattern. FlyCAGE would be useful in identifying candidate genes involved in a given process and discovering regulatory interactions in genetic networks.

Tell us a little about yourself.

My name is Basheer Becerra, and I am currently an undergraduate junior at Illinois State University double majoring in Computer Science & Statistics and minoring in Biological Sciences. Programming is something I love doing, specifically web development and data science. What makes me even more excited is using programming and mathematics to answer difficult questions in biology! When I’m not on the computer, you will usually see me reading, training for upcoming marathons, or spending time with friends and family.

You’ve been a friendly presence in our Twitter feed for a while now. How did you hear about InterMine originally?

InterMine was originally introduced to me by my advisor, Dr. Nathan Mortimer, as a tool to help scientists quickly review information about any gene. While InterMine is a helpful tool for scientists to browse gene information, I’ve also realized that InterMine is incredibly useful for developers and data scientists. When my advisor and I came up with the idea of FlyCAGE, InterMine was chosen to be the best solution for retrieving data due to its data integration features and ease of use.

Can you tell us a bit about the technical implementation of FlyCAGE?

The technologies used to implement FlyCAGE include Spring Framework (Java) for the web back-end, Thymeleaf for template resolving, and HTML/CSS/Bootstrap and JS/jQuery for front-end development. After the expression information is extracted from the entered gene or pattern, Pearson’s correlation is performed on every gene stored in FlyBase for Drosophila melanogaster. Genes with the highest Pearson’s correlation coefficient relative to the input expression are returned to the user. InterMine plays a significant role in the operation of FlyCAGE since FlyMine is the only resource used to retrieve the gene information and mRNA expression data. With FlyMine’s modern HTTP API, only a single query is needed to retrieve all the necessary data for FlyCAGE to operate. Without FlyMine’s data integration, FlyCAGE would have to manually integrate data from several different data sources such as FlyBase, FlyAtlas, BDGP, etc., which would slow down development significantly.
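To make the ranking step concrete, here is a minimal sketch in Python of correlating one input expression pattern against a set of gene profiles and returning the best matches. It is not FlyCAGE’s actual code (which is written in Java): the function names, the gene names, and the expression values are all invented for illustration, and flat profiles are arbitrarily scored as 0.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson's correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length, at least 2 values")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        # A flat profile has no defined correlation; this sketch scores it as 0.
        return 0.0
    return cov / sqrt(var_x * var_y)


def rank_genes(query, profiles, top_n=3):
    """Return up to top_n (gene, coefficient) pairs, highest correlation first."""
    scored = [(gene, pearson(query, values)) for gene, values in profiles.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]


# Hypothetical expression levels across four developmental stages.
profiles = {
    "geneA": [1.0, 2.0, 3.0, 4.0],
    "geneB": [4.0, 3.0, 2.0, 1.0],
    "geneC": [2.0, 4.0, 6.0, 9.0],
}
query = [10.0, 20.0, 30.0, 40.0]  # a custom pattern entered by the user
ranking = rank_genes(query, profiles)
```

Because Pearson’s coefficient ignores scale, geneA ranks first even though its values are ten times smaller than the query, while geneB, which follows the opposite trend, ranks last.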

What are your future plans for FlyCAGE? Are you going to expand to other organisms apart from flies?

As for the program logic, the plan is to include more complex analysis of expression data, such as including other data features to help determine “gene similarity”, predict regulatory interactions and unknown gene functions, and explain subtle differences in gene pairs with correlated expression patterns. There has also been a lot of interest in expanding CAGE towards other organisms such as plant genomes. With InterMine’s standardized API across its several resources, I predict that scaling the functionality of CAGE towards other organisms should be a relatively feasible task.

FlyCAGE is currently in alpha and can be accessed via this link. However, the link is likely to change as FlyCAGE gets closer to release. If you would like to stay up-to-date with FlyCAGE or if you would like to help us with usability tests to improve FlyCAGE, please enter your email in this Google Form. If you’ve already looked at FlyCAGE and would like to send some feedback, send me a quick note at bbecer2@ilstu.edu. Any help is appreciated!

‘Twas the week before Christmas… [aka InterMine availability over the holiday period]

… and all through the lab, not an organism was stirring, not even a… crab?*

Emails and support: Just a quick blog reminder that the office will be pretty empty from around now until the second of January, so don’t be surprised if we take a while to reply to messages. Some of us may be in the office or working from home, but it’s pretty patchy over the holiday season, and I don’t think any of us will be answering emails between the 23rd and 26th of December, nor on the first of January.

Developer calls: There is no developer call this week (normally it would be scheduled for Thursday the 21st). I’m not sure at this point, but the call on the 4th of January may (or may not) be cancelled as well.

Be good, have fun, and we’ll see you next year!

*I can only apologise for the terrible rhyme. Apparently nothing rhymes with “Cambridge” except “drainage” (and even then it’s somewhat weak), so I tried “uni” and that only rhymed with “loony” and “goonie”. Finding rhymes for the word “office” was no better.

Looking ahead: InterMine+Google Summer of Code 2018. Could you be a mentor?

2017 is coming to an end, and I have to say it’s been a fabulous one! I’ll probably post a “cool things InterMine did this year” round-up in a week or two – but in the meantime, here’s my final Google Summer of Code blog for you all!  We’ll cover the InterMine swag just sent out across the globe, as well as plans for next year – and how you can help out.

Thank-you gifts for mentors and students

Last week, we posted care packages to all our GSoC mentors and summer students, in the form of t-shirts, stickers, and pens. The postal-service-wrinkled shirt shown above is the women’s fit shirt printed on black; unisex shirts are a slightly lighter grey colour. If you filled out the swag survey when it was sent to you, your gift should be with you soon! Tweet us your images of the items in use for extra InterMine Cool Points 😎.

GSoC 2018 – call for project ideas and mentors!

In early 2017, we put together an ideas list for GSoC projects – InterMine’s projects are numbers 3 to 9. If you want to get more of an idea of what it’s like to apply (or be a mentor), read our application guidance from last year.

Do you have a nifty idea, or an InterMine itch you’d like to scratch?

Please share it with us! Add it to our 2018 Google Summer of Code ideas list, or if you need to sound things out and discuss them a little bit, comment on the GitHub issue, or email the dev list. You can even propose several ideas, if you like! Please add all ideas by the end of 14th of December (end of this week).

Would you like to try mentoring?

Fancy a chance to earn some nifty exclusive swag like that pictured above? Add your name as a possible mentor to an existing idea (or your own new idea). You can always drop us a line if you want to discuss things first. We like projects to have more than one mentor if possible.

Maybe you’re a student thinking of GSoC?

Awesome! If you have your own InterMine project idea (whether it’s brand new or you’ve already started it), or if one of the ideas on our ideas list lights your fire, it’s not too early to start talking with potential mentors about it. The application guidance we mentioned above would be a good read, too.

 

 

#OpenConCam: Where open (science | access | source | data) meet.

What is OpenCon?

OpenCon is a yearly event designed to bring together people who are dedicated to open in all its incarnations. It’s in such high demand, the only way to get in is by application, and most attendees are provided with scholarships to help with travel/accommodation costs.

We weren’t able to attend the international event, but thankfully there was a great satellite event running in Cambridge – OpenConCam.

OpenConCam was in itself a day filled with memorable talks and worthwhile collaborations, including:

PeerJ – (Sierra Williams)

PeerJ is an open access journal which focuses on methodological rigour when publishing, rather than preferring groundbreaking new science – something particularly important for early career researchers. One of my favourite points from Sierra’s talk was when she demonstrated the checklist that PeerJ uses to help authors disseminate their content effectively.

Open access in developing nations (Tapoka Mkandawire)

Many of us know from personal experience that accessing scientific publications even in wealthier western countries can be controversially difficult, so it’s hard to imagine how much more difficult this must be in developing countries. Thankfully, there are initiatives such as Africa Information Highway, Eifl, and Hinari which aim to make data and publications more accessible. She also discussed the cultural concept of ubuntu – sharing and caring for each other as a concept that works hand-in-hand with the open* movement.

Bullied into Bad Science (Laurent Gatto)

Bullied Into Bad Science is a campaign to help early career researchers who may be under pressure to omit or tweak their scientific results in order to gain a desired outcome or exciting publication. Laurent was clearly passionate about this subject: Sometimes the system pressures mean that successful academics are not necessarily good scientists – and things really shouldn’t be this way.

Queen B

This session was frantic! The basic premise was that the room divided into groups of 4, nominated a “queen bee” who presented a problem (in one minute), and then the group broke up and discussed possible solutions with others in the room for three minutes, reporting back over the span of two minutes. Lather, rinse, repeat until all members in a group have been queen bees. Topics I recall discussing included getting humanities more involved in open science, open source code in science, how to inspire people to publish in journals with strict open policies when they could go for a less principled journal more easily, and how to sell open* to the disinterested.

Hitting a moving target in Open Access advocacy (Danny Kingsley)

Danny shared something dear to our hearts: Getting others involved in open. While she was specifically referring to open access, most points could easily be applied to open science, data, and source too. Her focus was on figuring out how to get the most “bang for buck” – that is, find and influence people who will pay off the most for the least effort.

Undergrads, for example, aren’t great targets as they mostly don’t continue in academia, but PIs and government bodies may be more useful, because they have much more influence if they’re sold on open access. Similarly, if you don’t have enough authority to impress people, sometimes it makes more sense to influence decision makers and get them to evangelise for you. Make sensible decisions, and don’t run up against brick walls repeatedly if it isn’t paying off!

Focus Groups

After lunch, we had an unconference-style set of sessions, where everyone nominated topics they were interested in, and added stars beside ideas they themselves were interested in attending. The resulting sessions were:

  • Self-care in Open: Many of us volunteer time outside a normal 9-5 job to help promote open, and the environment can be discouraging or rough sometimes – not everyone is as keen on open as we are! Suggestions presented by Kirstie Whitaker included working with micro-ambitions (turning your work into small, achievable chunks rather than trying to conquer everything), and thinking of success as a spectrum. A small win is still a win!
  • Open + inclusive: Laurent Gatto pointed out in a blog post earlier this year that the Open movements aren’t always as…. open as they should be. Sometimes Open Science can fall down in the same places less open science falls down – not making sure to have a decent balance of ethnicities, genders, sexual orientation, etc. Can we do better?

  • Open source code in science: If you’re an InterMiner, you’re probably already pretty keen on open source scientific software and can see the benefit of it – but not everyone does. Many, many papers that use code to produce their scientific results don’t expose that code. But if the code isn’t in the paper, or linked to it openly in some way… how was it peer reviewed? If the code is wrong, so is the science it produces. I proposed this discussion topic, and really enjoyed perspectives from my team mates. Some of the ideas generated included:
    • Share dummy data to run your code on, if the data are proprietary or there are privacy issues.
    • Try to encourage journals to have software availability statements.
    • Encourage researchers to share their code, even if it’s only a few lines. After all, if you’ve written 6 lines of code to configure an R plot, whilst it might seem insignificant – that’s actually really easy to peer review and correct mistakes! By comparison, bigger software packages can be hundreds, thousands, or even millions of lines of code. The thought of trying to review that (beyond reviewing quality metrics like testing, documentation, and commenting) makes me a bit scared.
  • Open in the humanities: This is a fascinating subject, and I don’t think many (any?) of the audience members were in the humanities. We raised a lot of questions about the shape of humanities data.

Opening the lab door (Christie Bahlai)

After the focus groups, Christie Bahlai skyped in to talk about running an open lab. She shared some of the different types of pushback against open science:

  • Those who consider themselves too busy to share
  • People who have been pushed from ‘busy’ status to actively hostile against open science, perhaps when they were asked to participate further and didn’t wish to
  • The worried –  people who have legitimate concerns about open science (I’m sure I’m not the only person who doesn’t really believe in “anonymised personal data”).
  • The unheard – those who are disadvantaged and marginalised already worry that practising open will marginalise them further. How can we protect these people?

She also talked about getting people involved in open as early as possible, including introductions to open as part of the undergrad curriculum.

A few more of her tips:

  • Get students’ feet wet in open science by slowly introducing them to the concepts using examples in their own fields – examples they’ll care about.
  • Share your lab policies openly and don’t tolerate the “brilliant jerk” – at the end of the day no matter how productive they are, they’re still jerks.
  • Keep science a kind place. Show others that you too can fail publicly, and fail often.
  • Share your lesson plans openly, too! Christie’s “Reproducible quantitative methods” curriculum is designed to provide a good introduction to open, reproducible data wrangling using R and GitHub.

The open source investigation revolution (Eliot Higgins)

This talk was an out-of-the-blue surprise. Rather than focusing on academia like most of the previous talks, Eliot shared how open videos, photos, and “facts” on the web can be verified for journalism. If you’ve heard of doxxing, you’ll know a bit about the techniques Eliot described, using social media, satellite imagery, and other online tools to track people who don’t want to be tracked – but this time, for Good. He described how some of the white supremacist rally leaders were identified, as well as verifying missile attacks in Syria – including who perpetrated them and who was lying about it.

This talk brought Twitter’s usually vibrant #OpenConCam discussions to a halt, probably due to the riot of emotions it induced in most of the participants. We’d been shown highly disturbing images, felt fear wondering how these techniques could be misused, and were awed by the massive importance of what we were seeing, no matter how awful it was. I’m sure I wasn’t the only person torn between wishing I’d never seen it and knowing that I had to watch it, because burying our heads in the sand isn’t an option either.

Wrap-up

OpenCon 2018 hasn’t been announced yet, but this year, all around the world, there are still satellite events like the one I attended. If you haven’t attended a conference about working openly before, this is a great way to get a taste – or if you’re a die-hard enthusiast, you’ll get the chance to meet like-minded individuals and be inspired!

Community Outreach: What we’re up to & how you can participate

A large part of working in open source and science is sharing what you do with others – it’s not just about code and papers. We have quite a bit going on and coming up that we’d like to share and get your ideas about.

Community outreach calls

We’ll be trialling a community outreach call on December 7th at 5PM GMT. This happens at the time our normal developer call usually would, but we’re specifically focusing on community members and ways to communicate with them and help them out. It will not have a focus on technical issues or code.

Developers are still entirely welcome to come along, but please encourage your curators, enthusiastic users, and outreach people to come along too! Agenda

Open outreach repo on GitHub

We’ve created a GitHub repository dedicated to outreach-related topics. The idea is to take discussions about what we’re doing out into the open so others can chime in and/or re-use our work. Examples include:

Science Festival – March 2018

We’ll be participating in the Cambridge Science Festival, teaching about better data enabling better science. The basic idea is to teach this through gameplay with puzzles, rewarded with candy and stickers. Do you have kids who might be willing to playtest our ideas? Let us know!

Webinars and tutorials

We’ve done workshops in person and we’ve run a developer workshop, so we’d like to try something online this time! What formats interest you / your users the most?

  • A series of short 5-minute-ish webinars covering various topics
  • A longer training session, covering querying InterMine via website and/or API? Perl, Python or R?
  • Other? Share your feelings in a comment, contact us, or add to the GitHub issue
  • Maybe you’d like to volunteer to run one!

Google Summer of Code

Do you have an idea for a fun InterMine project that would only take a couple of months? Or maybe you would like to mentor a project over the summer? We had a great time during GSoC this year, and we’re planning to apply to do it again next year. Interested? More info on GitHub.

Rachel’s world tour of the UK

As part of the upcoming ISA-InterMine cloud grant, Rachel will be visiting bioinformatics cores and labs to try and solicit use-cases from people who are working with biological data right at the front line. Want to help out or invite us to your lab? Get in touch.

Guest blogging

Come tell our followers about the awesome InterMine thing you just did. A conference? A talk? A new feature or exciting dataset in your mine? We’d love to be the platform for your voice!

 

 

BlueGenes OAuth2 Authentication: Community feedback requested!

BlueGenes development is at the point where we need to store BlueGenes-specific data in a database. This is an important step because it paves the way for customisation, branding, tool configuration, and an enhanced My Data section to let users manage all of their InterMine assets.

There are a few architecture and design decisions that need to be made now, and be made correctly. In particular: OAuth2 Authentication. If you’re up to speed on how InterMine and BlueGenes authenticate then feel free to skip to the bottom.

Background

The current InterMine web application is a monolith. Users log in to the UI with a username and password and their identity gets stored in memory on the server (called the “session”). When they perform a query or upgrade a list, the JSP code sends messages to the Java layer along with the user’s identity, which is used to retrieve data from the object store and user profile.

For example, when Sally views her list page today, the workflow looks something like:

Figure 1

today.png

Everything you see in InterMine today lives somewhere layered between the JSP Web App and the Object Store.

BlueGenes works differently. It communicates with the Java layer, object store, and user profile entirely through web services known as the InterMine API. No exceptions. This cleaves the dependency between the visual tools that we develop and the lower level operations of InterMine such as handling queries.

When Sally views her list page in BlueGenes, the workflow looks more like this:

Figure 2

tomorrow.png

BlueGenes lives in the browser, not on the server. InterMine’s web services respond with raw data about her lists in JSON format and BlueGenes renders the page in the browser. This is equivalent to running Python scripts in your console to fetch your lists, resolve IDs, perform a search, etc.

Web services (InterMine or otherwise) are stateless by design. They can’t tell if requests are made by a new user or a revisiting one. In order for a web service to authorise a user the request must contain some sort of secret token as seen in Figure 2. Like any good web application, InterMine provides web services for authenticating a user and retrieving their identity token which can be used in future requests rather than a username and password.
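As a rough illustration of what “the request must contain a secret token” means in practice, the sketch below builds (but does not send) a request for a user’s lists. The host name, the `/lists` path, the `format` parameter, and the `Authorization: Token …` header scheme are assumptions made for the example, so check your mine’s web service documentation for the exact form it expects.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Hypothetical values, for illustration only.
MINE_SERVICE_ROOT = "https://mine.example.org/service"
API_TOKEN = "sallys-secret-token"


def build_lists_request(service_root, token):
    """Build (but do not send) a request for the lists belonging to a user."""
    url = service_root + "/lists?" + urlencode({"format": "json"})
    # The server keeps no session, so every request has to carry the token
    # that identifies the user. The header scheme here is an assumption.
    return Request(url, headers={"Authorization": "Token " + token})


request = build_lists_request(MINE_SERVICE_ROOT, API_TOKEN)
```

Passing `request` to `urllib.request.urlopen` would perform the call. The point is that nothing about Sally is remembered between calls: drop the header and the service sees an anonymous visitor.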

BlueGenes Authentication

Now it gets a bit trickier. BlueGenes has its own small web server to provide the actual JavaScript application, and it requires database access to store BlueGenes-specific information such as additional MyMine data, tool config, etc. It really looks more like this:

Figure 3

blugenes_server.png

 

A user can authenticate using InterMine’s web services via the browser, but if they want to save user-specific data to BlueGenes’s database using BlueGenes’s web services then they need to provide an identity. BlueGenes does not have access to the user profile directly, so the authentication request needs to be piped through the BlueGenes server.

Figure 4

auth.png

 

When Sally logs into BlueGenes she provides her username and password, which are sent to the BlueGenes server rather than the InterMine server. If BlueGenes successfully authenticates as Sally then it sends her back her InterMine API token embedded in a signed JSON Web Token (JWT). All future requests between BlueGenes and InterMine will contain her API token, and all requests to the BlueGenes server will contain the signed JWT.

It sounds a bit complicated, but this only happens when logging in and remains hidden from the user. This configuration protects BlueGenes from storing passwords and doesn’t require direct access to the user profile.

The problem: OAuth2 Authentication

Logging into InterMine using your Google account uses the OAuth2 framework. For it to work you must configure Google’s developer console with a hardcoded URL that redirects users back to the application after they’ve authenticated. This redirection page is given a token that is exchanged by the servers for the user’s Google identity (email address and Google ID). We can do the same in BlueGenes:

  1. We put a Google Signin button in BlueGenes.
  2. Sally clicks it and is redirected to Google.
  3. Upon authentication Sally is sent back to BlueGenes with an authentication token.
  4. BlueGenes server exchanges the token for Sally’s Google ID.
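Step 2 above boils down to sending the browser to Google with a handful of query parameters. The sketch below builds such a URL. The client ID and callback URL are placeholders, and in OAuth2 terms the “authentication token” Sally comes back with in step 3 is an authorization code. The `state` value is a random string the application checks on return to make sure the response belongs to the login it started.

```python
import secrets
from urllib.parse import urlencode

GOOGLE_AUTH_ENDPOINT = "https://accounts.google.com/o/oauth2/v2/auth"


def build_google_login_url(client_id, redirect_uri, state):
    """Build the URL the Google sign-in button sends the browser to."""
    params = {
        "client_id": client_id,
        # Must exactly match a redirect URL registered in the developer console.
        "redirect_uri": redirect_uri,
        "response_type": "code",
        "scope": "openid email",
        "state": state,
    }
    return GOOGLE_AUTH_ENDPOINT + "?" + urlencode(params)


# Hypothetical client ID and callback URL.
state = secrets.token_urlsafe(16)
login_url = build_google_login_url(
    "hypothetical-client-id.apps.googleusercontent.com",
    "https://bluegenes.example.org/auth/google/callback",
    state,
)
```

The hardcoded `redirect_uri` is what ties an OAuth2 project to one particular application instance, which is where the trouble described next comes from.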

So far so good. She can update her tool configurations and tags which are stored in the BlueGenes database.

Now Sally wants to save a list which is an action performed in InterMine, not BlueGenes. This requires an API token which she doesn’t yet have.

  • She can’t authenticate with InterMine using a username and password because she doesn’t have one (she’s a Google user).
  • She has no way of exchanging her Google ID with InterMine’s web services for an API token because InterMine has no way of trusting who she is. Anyone could access the end point and get a user’s API token if they knew their Google ID.
  • BlueGenes can’t fetch her API token from the user profile because it doesn’t have access (by design).

There are a few workaround solutions, but they couple BlueGenes to a single InterMine instance to varying degrees.

Solution 1: JWTs and sharing secrets

InterMine server gets a new end point that accepts a user ID and a JSON Web Token. The user’s API token is returned only if the signature on the JWT is valid.

Pain point: Both BlueGenes server and InterMine server will need matching secret keys. A third party cannot host their own BlueGenes and point it at a remote mine while supporting OAuth2 without knowing that mine’s secret key (aka access to all accounts).

InterMine admins could potentially whitelist third party instances of BlueGenes by generating secret keys for them, but this would be an active process of curation and still give third parties full access to all Google accounts.

Solution 2: Shared database

BlueGenes accesses the user profile directly.

Pain point: This requires database access, which entirely rules out remote instances of BlueGenes.

Solution 3: Double Login

InterMine has a URL redirect for Google authentication. It accepts a URL of a BlueGenes instance and generates a link with an embedded API key.

  1. A user clicks Google Login on BlueGenes and is redirected to Google
  2. After authenticating the user is redirected back to the BlueGenes server.
  3. BlueGenes generates a JWT containing the user’s identity.
  4. A mandatory button is then shown to “Authorise My Account to use Remote Data Sources” (which means InterMine server).
  5. Clicking the button sends the user to a /service/google-auth end point on the remote mine with a return_to parameter containing the URL of BlueGenes.
  6. The return_to parameter is stored in the session and the user is sent back to Google Login where they authorise for the second time.
  7. After authenticating the user is redirected to an InterMine /service/google-auth-redirect end point.
  8. The /service/google-auth-redirect page automatically redirects the user back to the BlueGenes URL stored in the session, with the API token as a parameter.

A workflow would look something like this:

solution3.png

There are quite a few steps, but steps 5+ are automatic.
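The InterMine side of steps 5 to 8 could be sketched roughly as below, with a plain dictionary standing in for the server session. This is only an outline of the proposal, not existing InterMine code: the function names, the `token` query parameter, and the example URLs are assumptions, and the exchange with Google in between is left out.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

GOOGLE_AUTH_ENDPOINT = "https://accounts.google.com/o/oauth2/v2/auth"


def google_auth(session, return_to):
    """Sketch of /service/google-auth (steps 5 and 6).

    Remembers which BlueGenes instance asked, then hands the user to Google.
    """
    session["return_to"] = return_to
    return GOOGLE_AUTH_ENDPOINT  # real code would add the OAuth2 parameters


def google_auth_redirect(session, api_token):
    """Sketch of /service/google-auth-redirect (steps 7 and 8).

    Called once Google has confirmed the user; sends them back to BlueGenes
    with their API token appended as a query parameter.
    """
    parts = urlsplit(session.pop("return_to"))
    query = parse_qsl(parts.query) + [("token", api_token)]
    return urlunsplit(parts._replace(query=urlencode(query)))


# Hypothetical walk-through with a dict standing in for the server session.
session = {}
first_hop = google_auth(session, "https://bluegenes.example.org/login-complete")
final_hop = google_auth_redirect(session, "hypothetical-api-token")
```

A real end point would presumably also want to check `return_to` against a list of known BlueGenes instances, since otherwise any site could ask for a user’s token to be delivered to it.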

Pain point: Users will have to authenticate twice the first time they log in to BlueGenes, but we can make this as painless as possible. Also, if an admin is running both InterMine server and BlueGenes server then they’ll need two OAuth2 projects in their Google developer console (also a one-time activity).

Solution 4: Outsource

We use a third party single sign-on vendor such as https://auth0.com/

Pain point: We can’t guarantee that InterMine admins will remain within the Terms of Service for their free offering to open source projects. Otherwise it’s very expensive.

Solution 3 seems to be the most feasible and keeps InterMine and BlueGenes completely decoupled. (Thanks, Yo!)

Does anyone feel strongly about a particular solution, or have other advice for bridging the OAuth2 gap? Feel free to leave a comment or join in the discussion on our mailing list (mailing list subscription link is here: https://lists.intermine.org/mailman/listinfo/dev)