VariantSpark

New approaches are needed to process modern datasets that are big and “wide” where millions of samples (rows) can have millions of information points each (columns, features).

We developed VariantSpark, a tailored Apache Spark-based machine learning framework that creates insights from high-dimensional data.

VariantSpark: Creating insights from high dimensional data

Value Proposition

Fast and sensitive end-to-end approach for high-dimensional data

VariantSpark is 90% faster than traditional compute frameworks and requires 80% fewer samples to detect statistical significant signal. The benefits of speed and higher sensitivity enables VariantSpark to open up the usage of advanced, efficient machine learning algorithms on high dimensional data.

Performance — Runtime vs accuracy of six available implementations showing that VariantSpark has the highst accuracy and is substantially faster than its competitors, enabling point-of-care diagnostics witin 30 minutes instead of 24h.

Detect stronger predictive markers

VariantSpark’s machine learning method overcomes the limitation of traditional approaches that requires data to be eliminated or identifying only independent markers. Especially complex events are triggered by multiple contributing factors. VariantSpark is able to detect such sets of interacting features thereby identifying more accurate predictive markers.

Explainable ML

VariantSpark builds on the Random Forest Machine Learning method, which allows to interrogate the tree-based models and identify which features contributed in what proportion to the overall prediction outcome. We also provide a visualization engine that shows the interplay between features and their label association.

Awards

Pricing

DIY

Free

Open Source access
Full functionality
Documentation access

GitHub

SaaS

Free

MarketPlace service
Full functionality
Example notebooks
Managed updates

AWS

R&D Support

On request

Custom implementation
Bespoke solutions
Product workshops
Managed updates
Full support

Enquire

Do business with us

Let us be your innovation catalyst by helping you understand the health space, solve your pain-points and innovate to keep you ahead. Read more

Our Approach

Traditionally the columns (features) of large “raw” datasets needed to be scaled down to allow advanced analytics to take place. This process of eliminating information potentially biases the result and impacts on the sensitivity of the analysis. VariantSpark is different, it can deal with the raw data directly resulting in a more accurate output.

Variant Spark was tested on datasets with 3000 samples, each containing 80 Million features, in either unsupervised clustering approaches (e.g. k-means) or supervised applications (e.g. Random Forests) with target/truth values that are categorical (classification) or continuous (regression). It can cluster data according to required profiles or identify predictive markers of events in just 30 minutes.
By building a model on the full dataset, this allows, for the first time, to identify sets of markers that together have a stronger predictive power than the previously identified independent markers.

Application cases

Disease gene detection where genomic loci can be detected that contribute jointly or individually to disease.
Polygenic Risk Scores where genomic loci need to be prioritized for building a polygenic risk model, VariantSpark can select the top X features that individually or jointly influence outcome.
Internet-of-Things where data extracted from IoT devices provides datasets with millions of observed features per sample, VariantSpark can provide companies with fresh insights into predictive markers about their customers and behaviours.
Processing plant optimisation where millions of observations about the product property and supply chain can influence the quality of the resulting element, VariantSpark can identify specific components that contribute to optimal outcome.
Churn rate optimisation where millions of factors can play into a customer's decision to churn. VariantSpark can predict which factors or combination of factors are most predictive of churn for your marketing and salesforce department to create an early warning sign indicator to prevent customer churn.

Case Studies

Disease gene detection with the Hipster data on Databricks

VariantSpark is developed for 'big' (many samples, n) and 'wide' (large feature vector per sample, p) data. It was tested on datasets with n=3000 samples each containing p=80Million features in either unsupervised clustering approaches (e.g. k-means) or supervised applications with target/truth values that are categorical (classification) or continuous (regression). Though VariantSpark was originally developed for genomic variant data, where p={0,1,2}, it can cater for every feature-based dataset, e.g. methylation, transcription, non-biological applications.

Accessibility

Click on the icons to see all our activities with these vendors.

Collaboration Partners

Getting Started

To help you get started we've put together a couple of videos.

1) Using VariantSpark on AWS Marketplace

Video coming soon. Meanwhile read Lynn Langit's Medium article.

2) Using VariantSpark on Databricks

3) Spinning up VariantSpark on AWS using cloudformation

Arash Bayat, Piotr Szul, Aidan R O’Brien, Robert Dunne, Brendan Hosking, Yatish Jain, Cameron Hosking, Oscar J Luo, Natalie Twine, Denis C Bauer, VariantSpark: Cloud-based machine learning for association study of complex phenotype and large-scale genomic data, GigaScience, Volume 9, Issue 8, August 2020 DOI: doi.org/10.1093/gigascience/giaa077

Bayat et al. VariantSpark, A Random Forest Machine Learning Implementation for Ultra High Dimensional Data BioRxiv 2019

O'Brien et al. VariantSpark: population scale clustering of genotype information BMC Genomics 2015

VariantSpark

Value Proposition

Fast and sensitive end-to-end approach for high-dimensional data

Detect stronger predictive markers

Explainable ML

Awards

Pricing

Do business with us

Our Approach

Application cases

Case Studies

Disease gene detection with the Hipster data on Databricks

Accessibility

Collaboration Partners

Getting Started

Case study: Identifying Population-specific Genetic Markers of Metabolic Syndrome in South Koreans

BitEpi, a software tool to find hidden epistatic interactions in the genome

The Lighthouse: World-first software could revolutionise how we understand disease

CSIRO awarded NSW iAward for VariantSpark Innovation

Trillion data points to identify disease-causing genes

GenomeWeb: Machine Learning Tool Enables Analysis of Complex, Polygenic Phenotypes With Epistastic Interactions

ZDNet: Aussie researchers leverage compute power to analyse genomic data and match donors

itNews: CSIRO uses AI to crunch a trillion genomic data points

Bioinformatics AWS Podcast episode

CSIRO a global pioneer in offering health solutions on the AWS Marketplace

Value Proposition

Fast and sensitive end-to-end approach for high-dimensional data

Detect stronger predictive markers

Explainable ML

Awards

Pricing

Do business with us

Our Approach

Application cases

Case Studies

Disease gene detection with the Hipster data on Databricks

Accessibility

Collaboration Partners

Getting Started

Subscribe

Subscribe to Transformational Bioinformatics