New approaches are needed to process modern datasets that are big and “wide” where millions of samples (rows) can have millions of information points each (columns, features).

We developed VariantSpark, a tailored Apache Spark-based machine learning framework that creates insights from high-dimensional data.

VariantSpark: Creating insights from high dimensional data

Value Proposition

Fast and sensitive end-to-end approach for high-dimensional data

VariantSpark is 90% faster than traditional compute frameworks and requires 80% fewer samples to detect statistical significant signal. The benefits of speed and higher sensitivity enables VariantSpark to open up the usage of advanced, efficient machine learning algorithms high dimensional data.

Performance
Runtime vs accuracy of six available implementations showing that VariantSpark has the highst accuracy and is substantially faster than its competitors, enabling point-of-care diagnostics witin 30 minutes instead of 24h.

Detect stronger predictive markers

VariantSpark’s machine learning method overcomes the limitation of traditional approaches that requires data to be eliminated or identifying only independent markers. Especially complex events are triggered by multiple contributing factors. VariantSpark is able to detect such sets of interacting features thereby identifying more accurate predictive markers.

Explainable ML

VariantSpark builds on the Random Forest Machine Learning method, which allows to interrogate the tree-based models and identify which features contributed in what proportion to the overall prediction outcome. We also provide a visualization engine that shows the interplay between features and their label association.

Our Approach

Traditionally the colums (features) of large “raw” datasets needed to be scaled down to allow advanced analytics to take place. This process of eliminating information potentially biases the result and impacts on the sensitivity of the analysis. VariantSpark is different, it can deal with the raw data directly resulting in a more accurate output.

Variant Spark was tested on datasets with 3000 samples, each containing 80 Million features, in either unsupervised clustering approaches (e.g. kmeans) or supervised applications (e.g. random forest) with target/truth values that are categorical (classification) or continuous (regression). It can cluster data according to required profiles or identify predictive markers of events in just 30 minutes.
By building a model on the full dataset, this allows, for the first time, to identify sets of markers that together have a stronger predictive power than the previously identified independent markers.

Application cases

  • Disease gene detection where genomic loci can be detected that contribute jointly or individually to disease.
  • Polygenic Risk Scores where genomic loci need to be prioritized for building a polygenic risk model, VariantSpark can select the top X features that individually or jointly influence outcome.
  • Internet-of-Things where data extracted from IoT devices provides datasets with millions of observed features per sample, VariantSpark can provide companies with fresh insights into predictive markers about their customers and behaviours.
  • Processing plant optimization where millions of observed about the product property and supply chain can influnece the quality of the resulting element, VariantSpark can identify specific components that contribute to optimal outcome.

Case Studies

Disease gene detection with the Hipster data on Databricks

VariantSpark is developed for 'big' (many samples, n) and 'wide' (large fature vector per sample, p) data. It was tested on datasets with n=3000 samples each containing p=80Million features in either unsupervised clustering approaches (e.g. kmeans) or supervised applications with target/truth values that are categorical (classification) or continuous (regression). Though VariantSpark was originally developed for genomic variant data, where p={0,1,2}, it can cater for every feature-based dataset, e.g. methylation, transcription, non-biological applications.

Accessibility

Amazon Web Services logo Terra.Bio logo Databricks logo

Click on the icons to find out more.

Collaboration Partners

Macquarie University logo Australian National University logo


Bayat et al. VariantSpark, A Random Forest Machine Learning Implementation for Ultra High Dimensional Data BioRxiv 2019

O'Brien et al. VariantSpark: population scale clustering of genotype information BMC Genomics 2015