Genomic research is growing fast, and so is the amount of data it produces. This boom in data requires innovative solutions to unlock meaningful insights while also protecting sensitive information like people’s genetic data. To address this, large biobanks such as the UK Biobank (UKB) are now sharing their data through secure cloud-based Trusted Research Environments (TREs).

In November 2023, the UKB released the world’s largest single unified whole genome sequencing (WGS) dataset, covering 500,000 people. This groundbreaking project generated 27.5 petabytes of data (that’s 27.5 million gigabytes!). Approved researchers can now access and analyse this data through UKB’s TRE, the Research Analysis Platform (RAP), which is powered by DNAnexus and Amazon Web Services.

While this new way of accessing data through the RAP improves security, it also brings new challenges. Researchers unfamiliar with cloud computing or bioinformatics may face a steep learning curve. Additionally, running large-scale analyses on the cloud can get expensive, particularly with sample sizes as large as the UKB’s.

To aid researchers in overcoming these challenges, we developed RAPpoet (RAP parallelisation orchestration engine template). RAPpoet enables massively parallel workloads with centralised coordination.
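To give a feel for what "massively parallel workloads with centralised coordination" can look like, here is a minimal Python sketch of the general pattern: one coordinator hands out one job per file to a pool of workers and gathers the results in a single place. This is not RAPpoet's code and does not use the RAP or DNAnexus. The names `process_file` and `run_all`, the file identifiers, and the placeholder computation are all hypothetical and chosen only for illustration.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def process_file(file_id: str) -> tuple[str, int]:
    """Hypothetical per-file job; a real job would analyse one data file."""
    return file_id, len(file_id)  # placeholder computation, not a real analysis


def run_all(file_ids, max_workers=4):
    """Coordinator: fan out one job per file and collect results centrally."""
    results = {}
    failed = []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Map each submitted job back to the file it belongs to.
        futures = {pool.submit(process_file, f): f for f in file_ids}
        for future in as_completed(futures):
            file_id = futures[future]
            try:
                _, value = future.result()
                results[file_id] = value
            except Exception:
                # Record the failure centrally so it can be retried later.
                failed.append(file_id)
    return results, failed


results, failed = run_all(["chr9_block1", "chr9_block2", "chr9_block3"])
```

Keeping the bookkeeping in one coordinator means that a failed job can be spotted and retried without rerunning the files that already finished, which is one reason this pattern tends to suit large batches of independent files.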

In a case study on coronary artery disease (CAD), RAPpoet reduced runtime by 94% (from 30 minutes to just 2 minutes). Because faster computing on the cloud usually means lower costs, the study also showed the value of compute optimisation, which cut costs by 44% (from £0.052 to £0.029 per file, saving at least £1,260 across the whole dataset). RAPpoet also makes better use of available computing resources, which helps lower costs even further.
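The cost figures above can be checked with a few lines of arithmetic. The sketch below uses only the per-file costs and the total saving quoted in the text. The implied number of files is our own back-of-the-envelope inference from those rounded figures, not a number reported in the study, and the variable names are purely illustrative.

```python
# Figures quoted in the text (GBP).
cost_before = 0.052          # cost per file before optimisation
cost_after = 0.029           # cost per file after optimisation
total_saving_floor = 1260.0  # "at least" saving across the whole dataset

# Saving per file and relative reduction.
saving_per_file = cost_before - cost_after
reduction_pct = 100 * saving_per_file / cost_before

# Assumption: the total saving is the per-file saving times the file count,
# so dividing gives a rough lower bound on the number of files.
implied_files = total_saving_floor / saving_per_file

print(f"Saving per file: £{saving_per_file:.3f}")
print(f"Cost reduction: {reduction_pct:.0f}%")
print(f"Implied number of files: about {implied_files:,.0f}")
```

The reduction works out to about 44%, matching the figure in the text, and the quoted saving would correspond to roughly 54,800 files under the stated assumption.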

Faster and cheaper analysis on the RAP makes it easier to find new scientific insights in this unprecedented dataset. Our CAD case study produced two:

1. Machine learning approaches were more sensitive at identifying genetic associations, finding significant CAD risk variants that conventional methods overlooked.

2. Through fine mapping we were able to pinpoint the likely causal genetic variant, rs10757274, in a known CAD risk gene on chromosome 9.

As more mega-biobanks are established, secure cloud-based TREs integrated with tools like RAPpoet will be essential. They’ll help ensure that researchers can run large, powerful, and cost-effective studies with these growing datasets.


Sng, L.M., Kaphle, A., O’Brien, M.J. et al. Optimizing UK biobank cloud-based research analysis platform to fine-map coronary artery disease loci in whole genome sequencing data. Sci Rep 15, 10335 (2025).