Every year, the University of Queensland organises the Mathematical and Computational Biology Winter School, which is designed to give advanced undergraduate and postgraduate students, postdoctoral researchers and others working in the life sciences an overview of the discipline.
With big-name international presenters, the school regularly attracts 280+ attendees, and this year it was at absolute capacity. This attests to the increasing popularity of bioinformatics and related skills in modern life science analysis.
Denis was invited to deliver the "Future of Bioinformatics" talk, in which an "eminent" person in the field gives their view on the direction the field is taking. It will come as no surprise to anyone that Denis' topic was "The dawn of cloud native bioinformatics". Denis' presentation can be found here.
Denis' main message was that bioinformatics has become increasingly collaborative, because demands have grown so dramatically that no single group can excel in all domains. Specifically, workflows need to fulfil reproducibility and compliance standards, data sets keep growing, and algorithms are becoming more complex and interconnected.
Working together to meet these demands will increasingly be sustainable only in the cloud, which in turn will create more opportunities for people to contribute and help build something that is greater than the sum of its parts.
In fact, the cloud has already been demonstrated to create new jobs in industry: "48% of businesses using cloud services reported an increase in IT staff and 41% reported a rise in non-IT staff since using cloud services" [ComputerWeekly]. This trend is likely especially positive for the research space.
Denis also showcased how serverless technology can bring communities together online by computing shared phenotypes cost-effectively. The fun demo-application for this, of course, is CSIRO's Hitchhiker's Thumb app.
Machine learning on high-dimensional data
One of the main themes of this year's school was Machine Learning, and Arash presented on Random Forests and their application to genome-wide association studies (presentation). The presentation focuses on the core operation of the Random Forest algorithm to describe how interactions between genetic markers are taken into account. Arash discussed some of the weaknesses of Random Forests and pointed out possible solutions for them. The slides include the implementation details of VariantSpark, where proper partitioning and advanced parallelisation allow large-scale genomic data to be processed.
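As a rough illustration of that core operation, here is a minimal sketch of how a single decision tree scores a split on a genetic marker by the decrease in Gini impurity, and how a second split further down a branch can pick up an interaction that neither marker shows on its own. The toy genotypes, the 0/1 coding and the function names are all assumptions made for this example; this is not VariantSpark's implementation.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels (0.0 for a pure node)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def impurity_decrease(genotypes, labels, threshold):
    """Impurity decrease from splitting samples on genotype <= threshold."""
    left = [y for g, y in zip(genotypes, labels) if g <= threshold]
    right = [y for g, y in zip(genotypes, labels) if g > threshold]
    n = len(labels)
    weighted = (len(left) / n) * gini(left) + (len(right) / n) * gini(right)
    return gini(labels) - weighted


# Hypothetical toy data: the phenotype is the XOR of two markers,
# so neither marker is informative by itself.
snp_a = [0, 0, 1, 1, 0, 0, 1, 1]
snp_b = [0, 1, 0, 1, 0, 1, 0, 1]
phenotype = [a ^ b for a, b in zip(snp_a, snp_b)]

# At the root, splitting on either marker alone gains nothing.
root_gain_a = impurity_decrease(snp_a, phenotype, 0)
root_gain_b = impurity_decrease(snp_b, phenotype, 0)

# Inside the branch where snp_a == 0, splitting on snp_b separates
# the classes perfectly: the tree has captured the interaction.
branch = [(b, y) for a, b, y in zip(snp_a, snp_b, phenotype) if a == 0]
branch_b = [b for b, _ in branch]
branch_y = [y for _, y in branch]
branch_gain = impurity_decrease(branch_b, branch_y, 0)

print(root_gain_a, root_gain_b, branch_gain)
```

In this toy case the gain at the root is zero for both markers, while the gain inside the branch is 0.5. A forest builds many such trees on random subsets of samples and markers and aggregates the impurity decreases into an importance score per marker.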
Arash also gave a live demo, "VariantSpark: A cloud-based machine learning approach for big genomic data", which we will make available shortly. In the meantime, the technical paper for VariantSpark is available on bioRxiv. In the demonstration, Arash illustrated the process of deploying VariantSpark on the AWS and Databricks clouds. On AWS, a CloudFormation template facilitates the configuration of an EMR compute cluster with VariantSpark and Hail installed. Arash also introduced ViGWAS, an analysis pipeline for quality control of genomic data.
Image credit: interestedbystandr