InterMine at #GCCBOSC Portland – 7 days of fun, sun, and code…

BOSC (the Bioinformatics Open Source Conference) is normally part of ISMB (Intelligent Systems for Molecular Biology), but for the first time this year, it teamed up with The Galaxy Community Conference (GCC) instead. For us, this presented an exciting opportunity – like a regular BOSC but with the added bonus of training days and the chance to interact with Galaxy contributors during the CollaborationFest hackathon (and the rest of the conference too).

Our agenda at the conference ended up being quite full:

Handling integrated biological data using Python (or R) and InterMine

We delivered a training session on the 26th of June: Handling integrated biological data using Python (or R) and InterMine. Leyla Ruzicka from ZFIN was kind enough to travel up from Eugene to Portland, to help us deliver the UI portion of the training. Once we’d familiarised users with how InterMine worked a little bit, Daniela introduced the API side of things, and then we spent the remainder of the session working through a series of exercises in Jupyter notebooks, live-coding on a projector so others could learn about our code and follow along themselves.

While we did recommend to people that they try to install the InterMine Python client, we also managed to work around the issue for anyone who didn’t have things installed, thanks to binder. You can still see the tutorial exercise notebooks and work through them, and we have the same set of notebooks with answers if you get stuck or need a hint. This was the first time we worked through the exercises interactively onscreen this way, but it seemed to work well! I’m hopeful we can continue providing the API portion of our tutorial this way in the future.

We had planned to do an R section, but actually ran out of time to do this – the tutorial was about two and a half hours in total. If an R tutorial is something of interest in the future though, please do let us know! You can do this via comments on this article, twitter, pop by chat.intermine.org, or email us at info – at – intermine – dot – org.

InterMine 2.0: More than fifteen years of open biological data integration

[Slides link] We were very pleased to have a talk accepted as well as the training, giving us a chance to introduce InterMine to others and talk about its history. While I was talking I mentioned that we were ranking at just under 300 stars on our main GitHub repo, and the audience kindly help bump it up and over 300!

intermine-stars

One of the topics I focused on during the talk included a massive thanks to all of the work our broader community does to help keep InterMine become and remain a great resource. Afterwards, Lorena Pantano raised the question: how do you get others to adopt your work and contribute to it?

Personally, I’ve been working at InterMine for three years now, so I certainly can’t attest to the entirely of the history – much of this is doubtless down to the team’s great work and Gos’s great vision (and grant writing!) – but I also think one of the most important parts is probably down to making it easy for others to use your work: good developer docs, tickets that explain issues clearly, help documentation for end-users, etc. I’d love to hear more thoughts about this in the comments!

Birds of a Feather sessions

Daniela and Yo both ran separate Birds of a Feather unconference-style sessions over lunch. Yo’s BoF focused on getting (and keeping) more open source contributors – Nicole Vasilevsky was kind enough to keep notes for this session. Thanks, Nicole!

Meanwhile Daniela shared  the InterMine approach to implement stable and persistent URIs and the possible related issues, inspired by other data integrators and the lessons learnt in the Identifiers for the 21st century paper; some attendees have also contributed providing their own solutions.

Hackathon

42394043775_eeb59807ee_o
Group meeting session at CoFest. Try to spot Daniela! 😉

During the CollaborationFest hackathon, Daniela and Yo were able to complete (yeahhhh!!) the integration between Galaxy and InterMine thanks to invaluable help of Daniel Blankenberg.
On the next Galaxy release, the new InterMine plugin will be available and will allow to import data (from InterMine) into Galaxy and export lists of identifiers (e.g. proteins, genes) from Galaxy (into InterMine) by selecting the mine instance from the InterMine registry. Watch this space – we’ll hopefully arrange to get some details on the Galaxy training network to explain how to run the data imports in each direction.

All GCCBOSC photographs in this post are from Berenice Batut’s Flickr album, under a CC-BY-SA licence

Advertisements

Cambridge Science Festival 2018: A fruity crime of passion 🍏🍋🍊🍓

TL;DR: Science Festival was great & kids loved it. You can re-use our materials, here.

Longer version: Last weekend was InterMine’s very first year at Cambridge’s famous Science Festival, an event designed to enthuse younger people and adults alike with awe for science. We split our time across two locations,working at our home department, Genetics, on the Saturday, and at the Cambridge Guildhall on the Sunday.

Our theme was around open science, with an activity designed to reinforce the idea that shared data (and therefore more data from different sources) results in better science. For adults we had a couple of great posters about the importance of data sharing, designed by Julie and Rachel. The posters are available freely online for re-use under a CC0 licence.

The Story: A party is rudely interrupted

Meanwhile, for kids (and some adults too!) we had a crime-solving activity. In our scenario, a dastardly fruit villain had stolen the passionfruit in the midst of an otherwise enjoyable soirée. In their haste to flee, the culprit knocked over a tin of blue paint, leaving tracks behind, as well as injuring themselves and leaving DNA evidence behind as they jumped out the window. We had four fruity suspects:

suspects-sheet.png

Solving the crime

Step 1: footprints in the paint

In order to solve the crime using science, our young detectives were invited to examine the footprints left by the culprit:

Fruit tracks at the crime scene. Excuse the glare from the plastic!
Fruit tracks at the crime scene. Excuse the glare from the plastic!

It was usually pretty easy to rule out the apple, and after thinking a little more, the strawberry could be ruled out too, but the orange and the lemon both looked rather similar.

Step 2: Juice found at the scene

Since the devilish thief had hurt themselves, we had samples to analyse. Our criminal investigators took strips of litmus paper and carefully examined the evidence:

20180318_105143

Once again, the evidence wasn’t quite conclusive (and was very sticky). Still, it was fun! Let’s move on to the next bit of evidence…

Step 3: the skin

With sample fruits to compare, our enterprising criminologists got a step closer to the solution. Could the skin be from a lemon? Hmmm.

20180318_105151

Step 4: We have samples, so let’s sequence the DNA!

Okay, so you may have guessed that we didn’t sequence the DNA of the suspects ourselves – but thankfully the lab had four profiles for us to compare to and they managed to quickly provide a DNA fragment from the crime scene evidence, too. This fragment was far more conclusive than the others, pointing unequivocally to the shadiest character of the bunch – Lithium Lemon.

Step 5: Putting the puzzle pieces together, and sabotage!

As our sleuths solved each different activity, we gave them a puzzle piece. At this stage they had four pieces of the puzzle, but they were still missing a couple of critical bits: the two central pieces. It turns out there had been some CCTV footage – but it had been stolen! After looking around, our vigilant investigators discovered where the crime scene video had been hidden (under the table) and managed to put the entire story together. Once again, shown front and centre of the puzzle was our suspect, Lithium Lemon.

fruit-bowl

 

Wrap up

While the shady character wad hauled off in cuffs to the county jail, successful detectives were rewarded with candy, some awesome stickers,  and a handout that had a child-oriented activity sheet on one side, with a small copy of our open knowledge posters on the other side, for the slightly more grown-up folks.

What we learned

Our tables were generally very busy, and the kids seemed to have a great time examining the evidence and putting together the puzzle pieces one by one. I’m not sure how many of them quite perceived the data sharing theme, but some of the adults definitely did, and appreciated the posters as well.

I think one of the biggest surprises for use was how busy we all were! Genetics had a steady flow of people, but the Guildhall had even more. We haven’t heard numbers for this year yet, but in 2017 apparently there were around 3,000 people. What that meant in practical terms for us: Two tables with identical versions of the activity, two InterMine team members acting as detective wranglers at each table, and often two separate groups of people working through the activity simultaneously at each table. After several hours of this we were all ready for a nap! Next time, six staff might be better to allow people to have a breather.

We also learned to keep a good eye on our puzzles: Five puzzles left the office on Sunday morning but only four returned. Hopefully it’ll be cherished at someone’s house as memories of a great activity…. ?

Our materials are open!

Given that our activity was designed to advocate openly sharing your science, we’ve shared our materials online too, and you’re welcome to re-use them.

https://github.com/intermine/science-festival/

This includes:

  • The fruit images (lovingly created by Rachel’s daughter!)
  • Handouts
  • Posters
  • Guidance sheets and in-depth “sciencey details” about each activity.

If you do re-use them, we’d love to hear about it! You can email info@intermine.org, tweet @intermineorg, or even open an issue on the GitHub repository.

Finally, I’d like to thank Rachel again for all the work she put into designing this scenario. It was creative, exciting, and overall seemed to be a hit!

 

#OpenConCam: Where open (science | access | source | data) meet.

What is OpenCon?

OpenCon is a yearly event designed to bring together people who are dedicated to open in all its incarnations. It’s in such high demand, the only way to get in is by application, and most attendees are provided with scholarships to help with travel/accommodation costs.

We weren’t able to attend the international event, but thankfully there was a great satellite event running in Cambridge – OpenConCam.

OpenConCam was in itself a day filled with memorable talks and worthwhile collaborations, including:

PeerJ – (Sierra Williams)

PeerJ is an open access journal which focuses on methodological rigour  when publishing, rather than preferring groundbreaking new science – something particularly important for early career researchers. One of my favourite points from her talk was when she demonstrated the checklist that PeerJ uses to help authors disseminate their content effectively:

Open access in developing nations (Tapoka Mkandawire)

Many of us know from personal experience that accessing scientific publications even in wealthier western countries can be controversially difficult, so it’s hard to imagine how much more difficult this must be in developing countries. Thankfully, there are initiatives such as Africa Information Highway, Eifl, and Hinari which aim to make data and publications more accessible. She also discussed the cultural concept of ubuntu – sharing and caring for each other as a concept that works hand-in-hand with the open* movement.

Bullied into Bad Science (Laurent Gatto)

Bullied Into Bad Science is a campaign to help early career researchers who may be under pressure to omit or tweak their scientific results in order to gain a desired outcome or exciting publication. Laurent was clearly passionate about this subject: Sometimes the system pressures mean that successful academics are not necessarily good scientists – and things really shouldn’t be this way.

Queen B

This session was frantic! The basic premise was that the room divided into groups of 4, nominated a “queen bee” who presented a problem (in one minute), and then the group broke up and discussed possible solutions with others in the room for three minutes, reporting back over the span of two minutes. Lather, rinse, repeat until all members in a group have been queen bees. Topics I recall discussing included getting humanities more involved in open science, open source code in science, how to inspire people to publish in journals with strict open policies when they could go for a less principled journal more easily, and how to sell open* to the disinterested.

Hitting a moving target in Open Access advocacy  (Danny Kingsley)

Danny shared something dear to our hearts: Getting others involved in open. While she was specifically referring to open access, most points could easily be applied to open science, data, and source too. Her focus was on figuring out how to get the most “bang for buck” – that is, find and influence people who will pay off the most for the least effort.

Undergrads, for example, aren’t great targets as they mostly don’t continue in academia, but PIs, and government bodies may be more useful, because they have much more influence if they’re sold on open access. Similarly, sometimes it makes more sense to influence decision makers and get them to evangelise for you, if you don’t have enough authority to impress people. Make sensible decisions, and don’t run up against brick walls repeatedly if it isn’t paying off!

Focus Groups

After lunch, we had an unconference-style set of sessions, where everyone nominated topics they were interested in, and added stars beside ideas they themselves were interested in attending. The resulting sessions were:

  • Self-care in Open: Many of us volunteer time outside a normal 9-5 job to help promote open, and the environment can be discouraging or rough sometimes – not everyone is as keep on open as we are! Suggestions presented by Kirstie Whitaker included working with micro-ambitions (turning your work into small, achievable chunks rather than trying to conquer everything), and thinking of success as a spectrum. A small win is still a win!
  • Open + inclusive: Laurent Gatto pointed out in a blog post earlier this year that the Open movements aren’t always as…. open as they should be. Sometimes Open Science can fall down in the same places less open science falls down – not making sure to have a decent balance of ethnicities, genders, sexual orientation, etc. Can we do better?

  • Open source code in science: If you’re an InterMiner, you’re probably already pretty keen on open source scientific software and can see the benefit of it – but not everyone does. Many, many papers that use code to produce their scientific results don’t expose that code. But if the code isn’t in the paper, or linked to it openly in some way… how was it peer reviewed? If the code is wrong, so is the science it produces. I proposed this discussion topic, and really enjoyed perspectives from my team mates. Some of the ideas generated included:
    • Share dummy data to run your code on, if the data are proprietary or there are privacy issues.
    • Try to encourage journals to have software availability statements
    • Encouraging researchers to share their code, even if it’s only a few lines. After all, if you’ve written 6 lines of code to configure an R plot, whilst it might seem insignificant – that’s actually really easy to peer review and correct mistakes! By comparison, bigger software packages can be hundreds, thousands, or even millions of lines of code. The thought of trying to review that (beyond reviewing quality metrics like testing, documentation, and commenting) makes me a bit scared.
  • Open in the humanities: This is a fascinating subject, and I don’t think many (any?) of the audience members were in the humanities. We raised a lot of questions about the shape of humanities data.

Opening the lab door (Christie Bahlai)

After the focus groups, Christie Bahlai skyped in to talk about running an open lab. She shared some of the different types of pushback against open science:

  • Those who consider themselves too busy to share
  • People who have been pushed from ‘busy’ status to actively hostile against open science, perhaps when they were asked to participate further and didn’t wish to
  • The worried –  people who have legitimate concerns about open science (I’m sure I’m not the only person who doesn’t really believe in “anonymised personal data”).
  • The unheard – those who are disadvantaged and marginalised already worry that practising open will marginalise them further. How can we protect these people?

She also talked about getting people involved in open as early as possible, including introductions to open as part of the undergrad curriculum:

A few more of her tips:

  • Get students’ feet wet in open science by slowly introducing them to the concepts using examples in their own fields – examples they’ll care about.
  • Share your lab policies openly and don’t tolerate the “brilliant jerk” – at the end of the day no matter how productive they are, they’re still jerks.
  • Keep science a kind place. Show others that you too can fail publicly, and fail often.
  • Share your lesson plans openly, too! Christie’s “Reproducible quantitative methods” curriculum is designed to provide a good introduction to open, reproducible data wrangling using R and GitHub.

The open source investigation revolution (Eliot Higgins)

This talk was an out-of-the-blue surprise. Rather than focusing on academia like most of the previous talks, Eliot shared how open videos, photos, and “facts” on the web can be verified for journalism. If you’ve heard of doxxing, you’ll know a bit about the techniques Eliot described, using social media, satellite imagery, and other online tools to track people who don’t want to be tracked – but this time, for Good. He described how some of the white supremacist rally leaders were identified, as well as verifying missile attacks in Syria – including who perpetrated them and who was lying about it.

This talk stilled twitter’s usually vibrant #OpenConCam discussions to a halt, probably due to the riot of emotions it induced in most of the participants. We’d been shown highly disturbing images, felt fear wondering how these techniques could be misused, and we awed by the massive importance of what we’re seeing, no matter how awful it was. I’m sure I wasn’t the only person torn between wishing I’d never seen it and knowing that I had to watch it, because burying our heads in the sand isn’t an option either.

Wrap-up

OpenCon 2018 hasn’t been announced yet, but this year, all around the world, there are still satellite events like the one I attended. If you haven’t attended a conference about working openly before, this is a great way to get a taste – or if you’re a die-hard enthusiast, you’ll get the chance to meet like-minded individuals and be inspired!

Researchers connected in Berlin

researchersConnected.png

I really enjoyed attending the Neo4j Life & Health Sciences Workshop, organized in Berlin, this week, by Michael and Petra: a day rich with great presentations about the application and utility of graph technology in several research areas. Here are only few examples:

  • The Ontology Lookup Service, a repository for biomedical ontologies, implemented with the support of graph databases and Apache Solr for indexing, different technologies for different purposes.
  • In the Lamond lab (University of Dundee), they model proteomics data with graph databases in order to understand protein behaviour under different conditions and dimensions of analysis.
  • MetaProteomeAnalyzer (MPA), a tool for analyzing & visualizing metaproteomics, uses Neo4j as the backend for metaproteomics data analysis software.
  • Tabloid Proteome is a database of associated protein pairs, derived from mass-spectrometry based proteomics experiments, implemented using a graphdb, which can help also to discover proteins that are connected indirectly, or may have information that you are not looking for!
  • Reactome is a pathway database which has recently migrated from MySQL to Neo4j, with relevant performance improvement. You can access data via the GraphCore open source Java library, developed  with Spring Data Neo4j, or via Neo4j browser.

I’ve lost count of how many times I heard sentences like: “Biology systems are complex and growing and graphs are the native data model” or “Graph database technology is an effective tool for modelling highly connected data as we have in biology systems”. We already knew it, but it’s been very encouraging and promising hearing it again from so many researchers and practitioners with higher experience than us in graph technologies.

In the afternoon, I attended the workshops “Data modelling with Neo4j”; starting from the data sources we usually work with, we have tried to model the entities and the relationships in order to answer some relevant questions. Modelling can be very challenging and, in some cases, it might depend on the questions you have to answer!

Before the end, I had the chance to give a short presentation about our experience with Neo4j.

Thanks again Michael and Petra for organizing such a great event!

Out and about: where to find InterMiners over June and July 2017

We recently added a public google calendar you can subscribe to if you’re interested in knowing what we’re up to, or when public holidays might mean we’re out of the office. Here’s a quick lowdown on upcoming events:

20 June 2017: InterMine community dev call.

21 June 2017: Neo4j Life and Health sciences day in Berlin. Keep your eyes peeled for Daniela!

28 June 2017: Daniela will be presenting on our experiences with Neo4j at the London Neo4J GraphDB meetup.

4 and 18 July 2017: InterMine community dev calls.

22-23 July 2017: I’ll be presenting a poster at BOSC/ISMB about BlueGenes, with the fantastically witty title “Forever in BlueGenes: a next-generation genomic data interface powered by InterMine”. 👖


If you’re a GSoC student or mentor, there will also be the evaluation periods at the end of each month, but you’re doubtless well aware of those!

Further in the future, you may find us at SWAT4LS, ISWC, and further Bioschemas events. We’ll keep you posted!

Are you attending any fun events? Let us know!

If you’re going to be at an event this year where you’ll be telling others about your work with InterMine and might like some InterMine stickers or handouts – or perhaps you’d like to guest-blog about it or share your slides – please ping us.

 

 

 

Bioschemas Summer Progress and InterMine

A couple of weeks ago we took part in the May ELIXIR Bioschemas meeting, along with representatives from Google, the European Bioinformatics Institute (EBI) and other participating organizations from the UK and beyond.

To give some background, Bioschemas is based on schema.org, an initiative to produce schemas that can be directly embedded in websites to give more structure to data. Search engines can understand this more easily than simple text, and it’s the stuff that powers a proportion of Google snippets (those box-outs you see on Google search results when you search for something popular). For example, let’s suppose I wanted to tell search engines more about my Jazz event. This is what I would embed in the webpage for the event.

<script type="application/ld+json">
{
  "@context": "http://schema.org",
  "@type": "Event",
  "name": "Hot Digits Jazz Afternoons",
  "startDate": "2017-04-24T14:30-17:00",
  "location": {
    "@type": "Place",
    "name": "Hot Digits",
    "address": {
      "@type": "PostalAddress",
      "streetAddress": "444 Trumpington St",
      "addressLocality": "Cambridge",
      "postalCode": "CB2 1QA",
      "addressCountry": "UK"
    }
  },
  "image": "http://www.example.com/event_image/12345",
  "description": "Join us for an afternoon of Jazz with Tom Colborn (aka 'Delta Tom').",
  "performer": {
    "@type": "PerformingGroup",
    "name": "Tom Colborn"
  }
}

Bioschemas wants to do the same but for biological information (like genes, proteins, samples, etc.). So in InterMine, for the CHEY_BACSU protein report page in SynBioMine we might have something like this:

<script type="application/ld+json">
{
  "@context":"http://schema.org",
  "@type":"BiologicalEntity",
  "biologicalType":"protein",
  "name":"CHEY_BACSU",
  "url":"http://beta.synbiomine.org/synbiomine/report.do?id=111921899",
  "about":"Integrated InterMine information for Protein CHEY_BACSU",
  "keywords":"protein, CHEY_BACSU",
    "inDataset": {
      "@type":"Dataset",
      "url":"http://beta.synbiomine.org/synbiomine/release-5"
    },
  "crossReference": {
    "@type":"Thing",
    "url":"http://beta.synbiomine.org/synbiomine/report.do?id=6010402"
  },
  "taxon":"https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?mode=Info&id=224308&lvl=3&lin=f&keep=1&srchmode=1&unlock",
  "taxon":"http://www.uniprot.org/taxonomy/224308"
  "sequence":"MAHRILIVDDAAFMRMMIKDILVKNGFEVVAEAENGAQAVEKYKEHSPDLVTMDITMPEM
 DGITALKEIKQIDAQARIIMCSAMGQQSMVIDAIQAGAKDFIVKPFQADRVLEAINKTLN",
  "datePublished":"2017-05-26",
  "citation": {
    "@type":"CreativeWork",
    "name":"UniProt",
    "url":"http://www.uniprot.org"
  },
  "citation": {
    "@type":"CreativeWork",
    "name":"Ecocyc",
    "url":"http://ecocyc.org"
  },
}

A search engine (or a specialized life sciences search tool) can then crawl and aggregate the structures embedded in a wide range of life sciences websites (particular those with lots of small sites such as biological samples in biobanks). The goal is to make it considerably easier for scientists to find information relevant to their research without having to visit lots of sites individually.

The job of Bioschemas is to go through the existing schema.org schemas and decide what existing stuff we can use (such as Dataset) and what we need to propose as new schemas (such as BiologicalEntity). schema.org schemas are big bags of attributes with no cardinality constraints as they need to satisfy a lot of different use cases, so another job of Bioschemas is to recommend which attributes to use and at what cardinality, both for data in general (DataSet, for example) and for specific life sciences entities, such as proteins and biological samples.

We made some great progress at this meeting and the results, such as draft schemas specifications, are going up on the Bioschemas groups page. The next phase is for specific resources, such as Uniprot and the Protein Data Bank in Europe to try out these schemas on real data and catch the obvious problems so that we can refine the specifications further. At InterMine we’ve also done some extremely prototype work on testing these ideas and we’ll continue to participate enthusiastically, particularly as this is an important component of our coming work to make InterMine-hosted data more Findable, Accessible, Interoperable and Reuseable.

Bioschemas work is at an early and draft stage, but it’s an open community that welcomes anybody who wants to join in the effort. You can find more details on how to participate in our mailing list and issue tracker at Bioschemas.

GraphConnect – a Neo4j conference

neo4jconference

We were in London to attend GraphConnect, the annual conference organised by Neo4j.
It was fantastic to meet so many people around the world enthusiastic about graph databases, and a lot of people that, like us, are prototyping/exploring Neo4j as possible alternative to relational databases.

They have announced the release of Neo4j 3.2 which promises to bring a huge improvement in term of performance; the compiled Cypher runtime has improved the speed by ~300% for a subset of basic queries and the introduction of native label indexes has also improved the write speed.

They have also added the composite indexes (that InterMine uses a lot) and the use of indexes with the OR operator. We highlighted the problem months ago on stackoverflow and we were so surprised to see it fixed. We have to update our “What we didn’t like about Neo4j” list by removing 2 items. We’re really happy about that!

It was a pleasure to attend Jesus Barrasa’s talk on debunking some RDF versus property graph alternative facts. He demoed how a RDF resource does not necessarily have to live in a triple store but can also be stored in Neo4j. Here are part1 and part2 of “Neo4j is your RDF store”, a nice post where he describes his work in more detail.

Another nice tool they have implemented is the ETL tool to import data from a relational database into Neo4j by applying some simple rules.

The solution-based talks demonstrated how Neo4j is being used to solve complex, real world problems ranging from travel recommendation engines to measuring the impact of slot machine locations on casino floors. While the topics were diverse, a common theme across their respective architectures was the use of GraphAware’s plugins, some of which are free. One plugin that looks particularly interesting is the Neo4j2Elastic tool which transparently pushes data from Neo4j to ElasticSearch.

During the conference, we discovered that there is a Neo4j Startup Program that allows to have Neo4j enterprise edition for free. Not sure if we count as a start up though!

Overall, we’re super happy with the improvements Neo4j has made, and super impressed with Neo4j’s growing community. Looking forward to meeting with Neo4j team in London, at their meetup, and sharing our small experience with the community!