CSIRO Bioinformatics


Genomic information is increasingly being used for medical research, creating a need for efficient analysis methods that can cope with thousands of individuals and millions of variants. We developed VariantSpark, a machine learning analysis framework for genomic data that uses the Apache Spark big-data engine to enable real-time analysis.



Runtime vs. accuracy of six available implementations, showing that VariantSpark has the highest accuracy and is substantially faster than its competitors, enabling point-of-care diagnostics within 30 minutes instead of 24 hours.

Application area

VariantSpark is developed for 'big' (many samples, n) and 'wide' (large feature vector per sample, p) data. It was tested on datasets with n = 3,000 samples, each containing p = 80 million features, in both unsupervised clustering approaches (e.g. k-means) and supervised applications with target/truth values that are categorical (classification) or continuous (regression). Though VariantSpark was originally developed for genomic variant data, where each feature takes a value in {0,1,2}, it can cater for any feature-based dataset, e.g. methylation, transcription, or non-biological applications.
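To make the {0,1,2} feature encoding concrete, here is a minimal sketch of how genotype calls could be turned into a sample-by-feature matrix. It is an illustration only, not VariantSpark's own loader: the function names `encode_genotype` and `encode_samples` are hypothetical, the input is assumed to be VCF-style diploid genotype strings, and missing alleles (`.`) are counted as reference for simplicity.

```python
# Illustrative sketch only; these helper names are hypothetical and are not
# part of the VariantSpark API.

def encode_genotype(gt: str) -> int:
    """Count non-reference alleles in a diploid genotype such as '0/1' or '1|1'.

    Assumption: a missing allele ('.') is treated like the reference allele.
    """
    alleles = gt.replace("|", "/").split("/")
    return sum(1 for allele in alleles if allele not in ("0", "."))


def encode_samples(genotypes):
    """Turn rows of genotype strings (one row per sample) into 0/1/2 features."""
    return [[encode_genotype(gt) for gt in row] for row in genotypes]


# Two samples (n = 2), three variants (p = 3).
raw = [
    ["0/0", "0/1", "1/1"],
    ["0/1", "0/0", "1|1"],
]
matrix = encode_samples(raw)
print(matrix)  # [[0, 1, 2], [1, 0, 2]]
```

Each row is one sample and each column one variant, so a real dataset of the size mentioned above would be a matrix with thousands of rows and tens of millions of columns.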


A fun illustrative example is our hipster-index, where we use VariantSpark's association testing feature to find genes associated with a 'Hipster' phenotype (i.e. coffee consumption, facial hair, etc.). See more at our Databricks page.

Want a high-resolution image? Join the mailing list.


Cursed Forest

Current Spark-based machine learning libraries are optimized for big customer data, which has very different properties from genomic data. Most notably, genomic data has vastly more information per sample (the curse of dimensionality). We therefore developed a new random forest implementation that allows the dataset to be split not only vertically (per sample) but also horizontally (within samples), while ensuring appropriate information exchange. This makes efficient use of memory and compute resources and makes advanced multivariate machine learning applications possible for the ‘wide’ and ‘big’ data of genomics.
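The idea of splitting the data within samples can be sketched as follows. This is a toy, single-machine illustration under our own assumptions, not the Cursed Forest implementation: the features are divided into blocks, each block finds its best local split by Gini impurity, and only that small summary (score, feature index, threshold) is exchanged to pick the overall winner. All function names here are hypothetical.

```python
from collections import Counter

# Toy sketch of a feature-partitioned split search for one tree node.
# Assumption: each block would live on a different worker; here they are
# simply processed in a loop.


def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def partition_features(matrix, n_blocks):
    """Split a sample-by-feature matrix into blocks of feature columns."""
    blocks = [dict() for _ in range(n_blocks)]
    for j in range(len(matrix[0])):
        blocks[j % n_blocks][j] = [row[j] for row in matrix]
    return blocks


def best_split_in_block(block, labels):
    """Return (weighted impurity, feature index, threshold) for the best
    split found among the features held in this block, or None."""
    best = None
    n = len(labels)
    for feature, values in block.items():
        for threshold in (0, 1):  # candidate splits for 0/1/2 features
            left = [y for v, y in zip(values, labels) if v <= threshold]
            right = [y for v, y in zip(values, labels) if v > threshold]
            if not left or not right:
                continue
            score = (len(left) * gini(left) + len(right) * gini(right)) / n
            candidate = (score, feature, threshold)
            if best is None or candidate < best:
                best = candidate
    return best


matrix = [
    [0, 2, 1, 0],
    [1, 2, 0, 0],
    [0, 0, 1, 2],
    [1, 0, 0, 2],
]
labels = ["case", "case", "control", "control"]

blocks = partition_features(matrix, n_blocks=2)
# Each block reports only its best candidate, not its feature columns.
local_best = [b for b in (best_split_in_block(blk, labels) for blk in blocks) if b]
score, feature, threshold = min(local_best)
print(score, feature, threshold)  # 0.0 1 0
```

In this sketch the information exchanged per block is a single tuple, which is why the wide feature matrix itself never has to be held in one place.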

Cursed Forest will be the next version of VariantSpark.


GitHub Repositories

Stay in touch

Join the mailing list.

In the News



Previous members