Case study: Synthetic data for quality control in genomic pathology

The client

The Royal College of Pathologists of Australasia (RCPA) is a leading non-profit organisation representing pathologists and senior scientists in Australasia, with the aim of improving the use of pathology testing to achieve better healthcare. Since its inception in 1956, the college has been dedicated to the training and professional development of pathologists. The Quality Assurance Programs (QAP) initiative began in 1968, with the goal of working collaboratively with clients and advisory committees to develop new Quality Assurance Programs that meet the evolving needs in the field of pathology. The RCPA Quality Assurance Programs (RCPAQAP) Pty Ltd, a company that operates independently yet in alignment with the Royal College of Pathologists of Australasia, provides a comprehensive range of external quality assurance services for pathology laboratories in Australia and over 80 countries internationally.

The challenge

The bioinformatics pipelines used by clinical laboratories to analyse next-generation sequencing (NGS) data must constantly evolve to keep pace with wet-lab innovation and to incorporate new knowledge. These continuous changes increase the risk of error and make standardisation between laboratories challenging. Quality control (QC) of these bioinformatics pipelines is therefore essential, providing clinical laboratories with checkpoints to ensure that the analysis results of clinical genetic datasets are accurate and reliable.


However, running QC checks requires test datasets that model the different types of variants and simulate the evolving complexities of genomic testing. Furthermore, the selection of genomic data for QC must adhere to ethical and privacy considerations, since real genomic samples cannot be shared freely between laboratories for standardised assessment.

The solution

The Commonwealth Scientific and Industrial Research Organisation (CSIRO) developed an automated QC pipeline to create synthetic patients, enabling RCPAQAP administrators to customise the synthetic genome to fit any desired form of inheritance. The pipeline begins by generating synthetic genomes that are indistinguishable from real genomes and are free of mutations known to cause Mendelian disorders, as reported in the ClinVar database. Artificial mutations can then be spiked into these "healthy" genomic backbones at specified genomic locations, allowing laboratories to evaluate their detection capabilities against known targets. The genomic backbones represent different ethnicities to simulate the diversity of the Australian population. Additionally, the pipeline supports trio sets (both parents and one child), allowing clinical testing of inherited genomic traits.
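The spike-in idea and the evaluation against known targets can be illustrated with a minimal sketch. This is not the CSIRO pipeline: the `Snv` type and the `spike_in` and `score_calls` functions are hypothetical names introduced here, the backbone is a toy string rather than a genome, only single-nucleotide substitutions are handled, and Python 3.9 or later is assumed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Snv:
    """A single-nucleotide variant at a 0-based position (hypothetical type)."""
    pos: int
    ref: str
    alt: str


def spike_in(backbone: str, variants: list[Snv]) -> str:
    """Return a copy of `backbone` with each SNV applied.

    Raises ValueError when a variant's reference base does not match the
    backbone, which guards against spiking at the wrong coordinate.
    """
    seq = list(backbone)
    for v in variants:
        if seq[v.pos] != v.ref:
            raise ValueError(
                f"ref mismatch at {v.pos}: expected {v.ref}, found {seq[v.pos]}"
            )
        seq[v.pos] = v.alt
    return "".join(seq)


def score_calls(truth: set[Snv], calls: set[Snv]) -> dict[str, float]:
    """Compare a laboratory's variant calls with the spiked-in truth set."""
    tp = len(truth & calls)   # spiked variants that were detected
    fn = len(truth - calls)   # spiked variants that were missed
    fp = len(calls - truth)   # reported variants that were never spiked
    return {
        "recall": tp / (tp + fn) if truth else 0.0,
        "precision": tp / (tp + fp) if calls else 0.0,
    }


if __name__ == "__main__":
    targets = [Snv(2, "G", "T"), Snv(7, "T", "A")]
    print(spike_in("ACGTACGTAC", targets))  # ACTTACGAAC
    print(score_calls(set(targets), {Snv(2, "G", "T"), Snv(5, "C", "G")}))
```

Because the spiked positions are known in advance, the truth set is exact, which is what makes recall and precision meaningful for a challenge dataset of this kind.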

The quality control pipeline

The pipeline enables RCPAQAP administrators to spike a desired number of specified variants, as well as different types or levels of complex variants, into a "healthy" synthetic patient, and then generate the FASTQ files necessary for laboratory testing and downstream QC analyses. For ease of use, the pipeline has been packaged into a container, complete with the necessary data and software dependencies.
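To make the FASTQ step concrete, the sketch below samples reads from a sequence and writes them as four-line FASTQ records. It is an illustration only, under simplifying assumptions: `simulate_fastq` is a hypothetical function, reads are error-free and single-end, and every base receives the same placeholder quality score. The source does not say which read simulator the pipeline uses, and a real one would model sequencing errors and coverage.

```python
import random


def simulate_fastq(sequence: str, read_length: int, n_reads: int, seed: int = 0) -> str:
    """Sample fixed-length reads uniformly from `sequence` as FASTQ text.

    Each record has four lines: '@' header, bases, '+' separator, qualities.
    The quality character 'I' encodes Phred 40 in the Phred+33 encoding.
    """
    if read_length > len(sequence):
        raise ValueError("read_length exceeds sequence length")
    rng = random.Random(seed)  # seeded so a dataset can be regenerated
    records = []
    for i in range(n_reads):
        start = rng.randint(0, len(sequence) - read_length)
        read = sequence[start:start + read_length]
        records.append(f"@read_{i}_pos{start}\n{read}\n+\n{'I' * read_length}\n")
    return "".join(records)


if __name__ == "__main__":
    print(simulate_fastq("ACTTACGAAC", read_length=4, n_reads=3), end="")
```

Seeding the random number generator means the same inputs always produce the same FASTQ file, which fits the reproducibility goal of a containerised pipeline.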

The outcomes

CSIRO was able to contribute to RCPAQAP's genomic QC pipeline by enabling:

  • Genomic data privacy - The QC pipeline generates ethnically diverse privacy-protected synthetic genomes without known pathogenic variants, as reported in the ClinVar database.
  • Genomic data complexity - The QC pipeline provides the capability to "spike" different types of variants at specific locations, creating challenge datasets with increasing complexity.
  • Real-world dataset - The generation of trios, representing a family of three individuals: both parents and a child (male or female), mimics a real-world setting where parents are often tested in the clinic.
  • Reproducibility and flexibility - The containerised pipeline allows RCPAQAP's administrators to easily generate multiple challenge datasets with different levels of complexity for laboratory testing. Generating FASTQ files makes reads carrying the spiked variants indistinguishable from real reads.
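The trio outcome above rests on Mendelian inheritance: at each site the child carries one allele from each parent. A minimal sketch of that rule follows. It is not the pipeline's method: `make_child` and `is_mendelian_consistent` are hypothetical names, sites are treated as independent (no linkage or recombination model), and sex chromosomes and de novo mutations are ignored. Python 3.9 or later is assumed.

```python
import random

Genotype = tuple[str, str]  # the two alleles at one autosomal site, e.g. ("A", "G")


def make_child(mother: list[Genotype], father: list[Genotype], seed: int = 0) -> list[Genotype]:
    """Build a child's genotypes by drawing one allele per parent per site."""
    if len(mother) != len(father):
        raise ValueError("parents must be genotyped at the same sites")
    rng = random.Random(seed)
    return [(rng.choice(m), rng.choice(f)) for m, f in zip(mother, father)]


def is_mendelian_consistent(child: Genotype, mother: Genotype, father: Genotype) -> bool:
    """True if the child's alleles can be split as one from each parent."""
    a, b = child
    return (a in mother and b in father) or (b in mother and a in father)


if __name__ == "__main__":
    mother = [("A", "A"), ("C", "T")]
    father = [("G", "G"), ("C", "C")]
    child = make_child(mother, father)
    print(child)
    print(all(is_mendelian_consistent(c, m, f) for c, m, f in zip(child, mother, father)))
```

A check like `is_mendelian_consistent` is also how a trio dataset can be validated after variants are spiked in: a variant placed in the child but in neither parent would be flagged as inconsistent unless it is intended to be de novo.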