VariantSpark is a genomic analysis platform built for the cloud. VariantSpark performs genome-wide association study (GWAS) analysis using a random forest machine-learning algorithm, so that complex genomic interactions are taken into account when scoring individual SNPs (see our recent technical pre-print).
Although VariantSpark runs on a personal computer or a high-performance compute (HPC) cluster, harnessing its full potential on large-scale datasets calls for a distributed computing system, ideally in the cloud. Configuring and managing such a system requires system-level expertise that can seem onerous to scientists and bioinformaticians.
To address this blocker and make using VariantSpark in the cloud a breeze, we created two short video tutorials (each under five minutes). They show how to use VariantSpark on the AWS and Databricks clouds with a few clicks.
VariantSpark on Databricks
Databricks is a platform that gives researchers access to a web-based analytics notebook backed by a compute cluster on either AWS or Azure. Users choose the notebook language (R, Python, etc.) and the cluster size for their analysis. The video below shows how to import the VariantSpark library into your Databricks account, how to create a cluster and how to run our example notebook.
VariantSpark has several interfaces: you can run it from a Linux terminal or from a Python or Scala notebook, and it is also integrated with the Hail library. Our example Databricks notebook uses VariantSpark's Scala interface and shows you how to import data, process it and visualise the results. Don't worry if you don't have any data ready: the notebook uses a publicly available synthetic dataset called HipsterIndex, and the dataset and all analysis steps are documented in the notebook. Edit the notebook to point at your own dataset and see how VariantSpark performs on your data.
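The notebook itself walks through the exact calls, but the import-process-visualise flow it follows can be sketched roughly as below. This is an illustrative pseudocode sketch in Python rather than the Scala notebook's actual code: the class, method and file names (`VarsparkContext`, `import_vcf`, `load_labels`, `importance_analysis`, the HipsterIndex file names) are hypothetical stand-ins, so follow the notebook for the real API.

```python
# Illustrative sketch only -- names are hypothetical, not the exact VariantSpark API.
from varspark import VarsparkContext          # hypothetical import

vc = VarsparkContext(spark)                   # wrap the notebook's Spark session
features = vc.import_vcf("hipster.vcf")       # genotypes: one feature per SNP
labels = vc.load_labels("hipster_labels.csv", "HipsterScore")   # phenotype column
analysis = vc.importance_analysis(features, labels, n_trees=1000)
top_snps = analysis.important_variables(20)   # top 20 SNPs by random-forest importance
```

The key idea the sketch captures is that, unlike single-SNP association tests, the random forest scores each SNP's importance in the context of all the others.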
Using VariantSpark on Databricks is the easiest way to get started.
In the above video, we use the Databricks Community account, which is free and hence only allows a small cluster to be created. You may choose a far larger cluster with a Databricks enterprise account. Please note that the VariantSpark documentation page on Databricks describes a previous version, which is not compatible with the new configuration. While we work with Databricks on resolving this issue, we recommend following the instructions in our video tutorial rather than cloning the Databricks version of the notebook.
VariantSpark on AWS
Amazon Web Services (AWS) is one of the largest cloud providers, with a long list of services. Elastic MapReduce (EMR) is an AWS service that lets users create and access managed clusters of computers with Spark pre-installed. Configuring an EMR cluster can be a complicated task, so we developed a "one-button" deployment mechanism through CloudFormation, an AWS service that hides all the complexity behind a template.
The video below shows how to use the VariantSpark CloudFormation template. The template creates an EMR cluster with VariantSpark, Hail v0.1 and Jupyter Notebook installed, and copies an example notebook into your notebook directory; the video shows how to run that notebook.
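The video uses the one-button launch in the AWS console, but the same kind of stack can also be created programmatically. The sketch below uses boto3 (AWS's Python SDK) to illustrate what the button does under the hood; the stack name, template URL and region are placeholders, not the real values from our template, and running it requires AWS credentials.

```python
# Sketch only: stack name, template URL and region are placeholders.
import boto3

cf = boto3.client("cloudformation", region_name="us-east-1")
cf.create_stack(
    StackName="variantspark-demo",            # any unique stack name
    TemplateURL="https://s3.amazonaws.com/your-bucket/variantspark.template",  # placeholder
    Capabilities=["CAPABILITY_IAM"],          # the stack creates IAM roles for EMR
)
# Block until the EMR cluster and notebook environment are ready.
cf.get_waiter("stack_create_complete").wait(StackName="variantspark-demo")
```

Either way, CloudFormation provisions the EMR cluster, installs the software and wires everything together from the single template.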
In this example, we use VariantSpark's Hail interface (https://hail.is). Hail is a library for genomic operations that lets you annotate the samples and variants in a dataset. You may use the notebook as a template to process your own datasets.
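As an illustration of that pattern, a Hail v0.1-era analysis looks roughly like the sketch below. The Hail calls follow the 0.1 API, but the file names are placeholders and the final `importance_analysis` call is a hypothetical stand-in for the VariantSpark hook, so treat this as pseudocode and follow the example notebook for the exact calls.

```python
# Sketch only: Hail v0.1-style calls with a hypothetical VariantSpark hook.
from hail import HailContext

hc = HailContext(sc)                          # Hail 0.1 entry point on Spark
vds = hc.import_vcf("hipster.vcf.bgz")        # load the genotypes
vds = vds.annotate_samples_table(             # attach the phenotype to each sample
    hc.import_table("hipster_labels.tsv").key_by("sample"),
    root="sa.pheno")
# Hypothetical VariantSpark call: rank variants by random-forest importance.
result = vds.importance_analysis("sa.pheno.HipsterScore", n_trees=1000)
```

The annotation step is where Hail earns its keep: phenotypes and variant metadata live alongside the genotypes, so the importance analysis can refer to them by expression.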
Spin up a VariantSpark cluster on AWS to analyse your data with ease, in time, and securely.
For legacy reasons, the HipsterIndex dataset used in our AWS example differs from the one used in our Databricks example: both use the same four SNPs to simulate the phenotype, but with different simulation models.
Looking for a more managed approach?
If you prefer a more managed approach, for security or convenience reasons, over the DIY approach described here, please check our VariantSpark product page.