CSIRO Bioinformatics


Genomic information is increasingly used in medical research, creating a need for efficient analysis methods that can cope with thousands of individuals and millions of variants. We developed VariantSpark, a machine learning analysis framework for genomic data that uses the Apache Spark big data engine to enable real-time analysis.


Performance

Runtime versus accuracy of six available implementations, showing that VariantSpark has the highest accuracy and is substantially faster than its competitors, enabling point-of-care diagnostics within 30 minutes instead of 24 hours.

Application area

VariantSpark is developed for 'big' (many samples, n) and 'wide' (large feature vector per sample, p) data. It was tested on datasets with n = 3,000 samples, each containing p = 80 million features, in both unsupervised clustering approaches (e.g. k-means) and supervised applications where the target/truth values are categorical (classification) or continuous (regression). Although VariantSpark was originally developed for genomic variant data, where each feature takes a value in {0, 1, 2}, it can cater for any feature-based dataset, e.g. methylation data, transcription data, or non-biological applications.
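To make the 'wide' data layout concrete, the following is a minimal sketch of how genotype calls could be turned into a samples-by-features matrix with values in {0, 1, 2}. It assumes the common convention that the value counts the alternate alleles in a diploid genotype (0 = homozygous reference, 1 = heterozygous, 2 = homozygous alternate). The function names and the toy input are hypothetical, and this is not VariantSpark's own loader.

```python
from typing import List


def encode_genotype(gt: str) -> int:
    """Count alternate alleles in a diploid genotype string such as '0/1' or '1|1'.

    Simplifying assumption: a missing allele ('.') is counted like a reference allele.
    """
    alleles = gt.replace("|", "/").split("/")
    return sum(1 for allele in alleles if allele not in ("0", "."))


def build_matrix(calls: List[List[str]]) -> List[List[int]]:
    """Turn per-sample genotype calls into an n x p matrix of alternate-allele counts."""
    return [[encode_genotype(gt) for gt in sample] for sample in calls]


# Toy input: n = 2 samples (rows), p = 3 variants (columns).
calls = [
    ["0/0", "0/1", "1/1"],
    ["0/1", "0/0", "1|1"],
]
matrix = build_matrix(calls)
print(matrix)
```

In a real dataset the matrix would have a few thousand rows and tens of millions of columns, which is why the number of features, rather than the number of samples, tends to dominate memory and compute cost.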

Subprojects

Cursed Forest

Current Spark-based machine learning libraries are optimized for big customer data, which has very different properties from genomic data. Most notably, genomic data has vastly more information per sample (the curse of dimensionality). We therefore developed a new random forest implementation that allows the dataset to be split not only vertically (per sample) but also horizontally (within samples), while ensuring appropriate information exchange between the parts. This allows efficient use of memory and compute resources and makes advanced multivariate machine learning applications possible for the 'wide' and big data of genomics.
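The idea of splitting the data within samples can be illustrated with a small sketch. It is not the Cursed Forest implementation: it is a single-machine toy that assumes features are stored feature by feature, divides them into chunks, lets each chunk propose its best decision-tree split by Gini impurity reduction, and then combines the proposals. The final combination step stands in for the information exchange between partitions. All function names and the toy data are hypothetical.

```python
from collections import Counter
from typing import List, Sequence, Tuple


def gini(labels: Sequence[int]) -> float:
    """Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def best_split_in_chunk(
    chunk: List[Tuple[int, List[int]]], labels: List[int]
) -> Tuple[float, int, int]:
    """Best (gain, feature_index, threshold) among the features held by one chunk."""
    parent = gini(labels)
    n = len(labels)
    best = (0.0, -1, -1)
    for index, values in chunk:
        for threshold in (0, 1):  # candidate splits for values in {0, 1, 2}
            left = [y for v, y in zip(values, labels) if v <= threshold]
            right = [y for v, y in zip(values, labels) if v > threshold]
            if not left or not right:
                continue
            child = (len(left) * gini(left) + len(right) * gini(right)) / n
            gain = parent - child
            if gain > best[0]:
                best = (gain, index, threshold)
    return best


def best_split(
    features: List[List[int]], labels: List[int], n_chunks: int
) -> Tuple[float, int, int]:
    """Partition features into chunks, find local bests, then reduce to a global best."""
    indexed = list(enumerate(features))
    chunks = [indexed[i::n_chunks] for i in range(n_chunks)]
    local_bests = [best_split_in_chunk(chunk, labels) for chunk in chunks]
    return max(local_bests)  # only small summaries are exchanged, not the raw features


# Toy data: features[j] holds the value of feature j for each of 4 samples.
labels = [0, 0, 1, 1]
features = [
    [0, 1, 0, 1],  # uninformative
    [0, 0, 2, 2],  # separates the two classes perfectly
    [1, 1, 1, 0],  # weakly informative
]
gain, feature_index, threshold = best_split(features, labels, n_chunks=2)
print(gain, feature_index, threshold)
```

The point of the design is that each chunk only needs the labels and its own slice of the features, so a very wide dataset never has to fit in one place; only one small tuple per chunk is compared at the end.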

Download

See VariantSpark's GitHub page

Team

Previous members

In the Press