Genetic analyses aim to deepen our understanding of the genetic architecture of complex diseases, including Alzheimer’s disease (AD) and cardiovascular disease, thereby improving the clinical care and treatment of these diseases. The aims of genetic analyses tend to fall into two categories: inference and prediction. An example of a question of inference is “which genes are associated with Alzheimer’s disease?”, whereas one of prediction might be “what is the likelihood of a particular person developing Alzheimer’s disease?”. Inference focuses on understanding and formalising genetic relationships, while prediction focuses on determining the next best steps to combat a disease.

These two questions – of inference and prediction – can be resolved using two different approaches: statistical analysis or machine learning. Traditionally, statistical analyses focus on inference, filtering out the true disease gene (i.e., the signal) from the rest of the genome (i.e., the noise). On the other hand, machine learning algorithms were developed with prediction in mind, learning and finding patterns in big datasets.

The key difference between the two approaches is the interpretability of their results. Statistical analyses return interpretable parameters and values, such as β coefficients and p-values, and show how they relate to one another. Machine learning algorithms, by contrast, are ‘black boxes’: how the algorithm arrived at its predictions cannot be fully understood. Thus, conventionally, the line between the two methodologies is clear: statistics for inference and machine learning for prediction.
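To make the 'interpretable parameters' point concrete, here is a minimal sketch of a single-variant association test in plain Python. Everything in it is an assumption made for illustration: the data are simulated, the names `ols_slope` and `permutation_p_value` are hypothetical, and the p-value comes from a permutation test rather than the t-test a statistics package would normally report.

```python
import random


def ols_slope(x, y):
    """Least-squares slope (the beta coefficient) of y regressed on x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    return sxy / sxx


def permutation_p_value(x, y, n_perm=1000, rng=None):
    """Two-sided permutation p-value for the slope.

    Shuffling y breaks any real link with x, so the shuffled slopes
    show what 'no association' looks like for this dataset.
    """
    rng = rng or random.Random(0)
    observed = abs(ols_slope(x, y))
    shuffled = list(y)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if abs(ols_slope(x, shuffled)) >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (hits + 1) / (n_perm + 1)


# Simulated data (assumption): 500 people, one variant with allele
# frequency 0.3, and a true effect of 0.5 per copy of the allele.
rng = random.Random(42)
n_people = 500
genotype = [sum(rng.random() < 0.3 for _ in range(2)) for _ in range(n_people)]
phenotype = [0.5 * g + rng.gauss(0, 1) for g in genotype]

beta = ols_slope(genotype, phenotype)
p_value = permutation_p_value(genotype, phenotype, rng=rng)
print(f"beta = {beta:.3f}, permutation p-value = {p_value:.4f}")
```

Both numbers can be read directly: β estimates the change in the trait per copy of the allele, and the p-value says how often an effect at least that large would appear by chance alone.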

That said, we have developed a novel approach called RFlocalfdr that blurs that line. Published in the Computational and Structural Biotechnology Journal, and developed in collaboration with Rob Dunne from CSIRO's Data61, RFlocalfdr is a statistical approach for thresholding variable importance measures. It identifies significant associations while reducing false positives by building on the empirical Bayes argument of Efron. The RFlocalfdr approach requires only a single fitting of a random forest and does not use ‘shadow variables’, making it applicable to extremely high-dimensional datasets, including genomic data.
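The empirical Bayes idea behind local false discovery rates can be sketched in a few lines of plain Python. This is an illustration of the general argument only, not the RFlocalfdr implementation; see Dunne et al. for how the package models the null distribution of importances. The assumptions here are ours: the scores are simulated and already standardised, the null density f0 is a standard normal, the mixture density f is a simple Gaussian kernel estimate, and the null proportion is set conservatively to 1. All function names are hypothetical.

```python
import math
import random


def normal_pdf(x, mu=0.0, sigma=1.0):
    """Density of a normal distribution with mean mu and s.d. sigma."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def mixture_density(x, scores, bandwidth=0.3):
    """Gaussian kernel density estimate of f, the density of all scores."""
    return sum(normal_pdf(x, s, bandwidth) for s in scores) / len(scores)


def local_fdr(x, scores, pi0=1.0, bandwidth=0.3):
    """Local fdr at x: pi0 * f0(x) / f(x), capped at 1.

    f0 is the assumed null density (standard normal) and f is estimated
    from every score, null and non-null alike.
    """
    f0 = normal_pdf(x)
    f = mixture_density(x, scores, bandwidth)
    return min(1.0, pi0 * f0 / f)


# Simulated scores (assumption): 950 uninformative variables and
# 50 informative ones whose scores are shifted upwards.
random.seed(0)
n_null, n_signal = 950, 50
scores = [random.gauss(0, 1) for _ in range(n_null)]
scores += [random.gauss(4, 1) for _ in range(n_signal)]
is_signal = [False] * n_null + [True] * n_signal

fdr = [local_fdr(s, scores) for s in scores]

# Keep variables whose estimated chance of being null is below 20%.
selected = [i for i, value in enumerate(fdr) if value < 0.2]
true_hits = sum(is_signal[i] for i in selected)
print(f"selected {len(selected)} variables, {true_hits} truly informative")
```

The threshold of 0.2 is an arbitrary choice for the example. The point is that each variable receives a probability-like score of being a false discovery, computed from one set of importance values rather than from repeated model fits.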

In concert with our machine learning GWAS platform, VariantSpark, which is capable of processing one trillion data points, this is a step towards interpretable machine learning in genomics research. Indeed, we have applied this approach to AD (published in Nature's Scientific Reports) and have found two novel genes associated with the disease, which are likely to work in epistasis with the well-established AD risk gene, APOE.

By integrating machine learning and statistics, we have added an important tool to our arsenal in the fight against genetic disease.

The RFlocalfdr approach is available as an R package or as a Python script, and is fully incorporated into VariantSpark’s API.

PS: A joint article by Letitia Sng & Lewis Vincent


Dunne et al. Thresholding Gini variable importance with a single-trained random forest: An empirical Bayes approach. Computational and Structural Biotechnology Journal, 2023.

Lundberg et al. Novel Alzheimer’s disease genes and epistasis identified using machine learning GWAS platform. Scientific Reports, 2023.