Trillion data points to identify disease-causing genes

Researchers from CSIRO, Australia’s national science agency, have accomplished a world-first by processing one trillion points of genomic data through VariantSpark, an artificial intelligence-based platform, which can help pinpoint the location of specific disease-causing genes in the human genome.

The human genome is a person's complete set of DNA, which contains more than three billion DNA base pairs.

CSIRO Bioinformatics Group leader Dr Denis Bauer said artificial intelligence (AI) could give a deeper understanding of complex diseases, in a fraction of the time compared to traditional approaches, by analysing immense genomic datasets.

"Our VariantSpark platform can analyse traits, such as diseases or susceptibilities, and uncover which genes may jointly cause them," Dr Bauer said.

"This can provide valuable information about how the disease works on a molecular level, which can ultimately lead to better treatments.

"VariantSpark is already being used to help determine what genes might be linked to cardiovascular disease, motor neurone disease, dementia, and Alzheimer's disease."

In a new study published in the technical journal Giga Science  , the researchers analysed a synthetic dataset of 100,000 individuals, enabled by Amazon Web Services (AWS).

Dr Bauer said no other technology platform had been able to process one trillion data points of genomic data, over ten million variants and 100 thousand samples at once.

"Our research shows VariantSpark is the only method able to scale to ultra-high dimensional genomic data in a manageable time," Dr Bauer said.

"It was able to process this information in 15 hours while it would take the fastest competitors likely more than 100,000 years to process such a volume of data.

"This is a significant milestone, as it means VariantSpark can be scaled up to analyse population-level datasets and drive better healthcare outcomes."

CSIRO's Australian e-Health Research Centre CEO Dr David Hansen said AI technologies were crucial to the future of healthcare in Australia.

"Artificial intelligence is a critical component of understanding genomic information, which is increasingly being used to guide healthcare delivery in Australia and around the world," Dr Hansen said.

"Despite recent technology breakthroughs with whole genome sequencing studies, the molecular and genetic origins of complex diseases are still poorly understood which makes prediction, application of appropriate preventive measures and personalised treatment difficult."

VariantSpark was developed by CSIRO's digital health research team at the Australian e-Health Research Centre with support from CSIRO's digital specialist arm, Data61.

It is also one of the first machine-learning based health products which allows researchers around the world to access important data critical for developing treatments and accelerating research capabilities, that is available on the AWS Marketplace.

VariantSpark is available for download on AWS Marketplace

Originally run on CSIRO.au

Arash Bayat, Piotr Szul, Aidan R O’Brien, Robert Dunne, Brendan Hosking, Yatish Jain, Cameron Hosking, Oscar J Luo, Natalie Twine, Denis C Bauer, VariantSpark: Cloud-based machine learning for association study of complex phenotype and large-scale genomic data, GigaScience, Volume 9, Issue 8, August 2020 DOI: doi.org/10.1093/gigascience/giaa077

Image from https://www.maxpixel.net/