This Explainer series aims to describe the 'Beacon Protocol' and provide a brief introduction to what a Beacon is and how it can help with genomic data sharing.
In the context of bioinformatics, a 'Beacon' refers to a tool that facilitates the discovery of genomic variant information, without revealing any identifiable information about the data sources. This makes it easier for researchers and clinicians to access and use genomic data for research and clinical purposes. The 'Beacon Protocol' is a set of guidelines put forward by the Global Alliance for Genomics and Health (GA4GH), which specifies the requirements for a Beacon implementation, i.e., the actual software that should be used.
History of Beacon
The Beacon Protocol defines the schemas and specifies how they can be queried using an API. The first release of the Beacon API dates back to 2016. However, the first Beacon to be approved by the GA4GH Product Approval Process was not released until late 2018, and was labeled as Beacon V1, or the first version of the Beacon Protocol.
Beacon V1
The Beacon V1 was limited in its capabilities to provide genomic variant discovery. This means users were only able to query the API for:
- Exact search using variant positions (start and end) and alleles (reference and alternate). For example start=[100] end=[200].
- Query using base ranges, with an interval for each of the start and end positions. For example start=[80,120] end=[250,300].
- Query using an exact start position and a range for the end position. For example start=[100] end=[200,300].
- Query using intervals, matching variants where at least one base lies within the queried interval. For example start=[100,200] end=[100,200].
A Beacon V1 API will respond to these queries using a simple response with a boolean indicating the presence of the variant and its frequency in each dataset. The complete response objects are presented in detail in the official documentation.
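The query forms listed above can be illustrated with a small in-memory sketch. It is not the Beacon V1 API: the `Variant` class and the `matches` and `beacon_exists` functions are hypothetical names, and the matching rule (the variant's start must fall in the start interval and its end in the end interval) is a simplifying assumption that ignores alleles, reference names, and coordinate conventions.

```python
from dataclasses import dataclass
from typing import Iterable, Sequence, Tuple


@dataclass(frozen=True)
class Variant:
    """Hypothetical stored variant, reduced to its start and end positions."""
    start: int
    end: int


def _bounds(values: Sequence[int]) -> Tuple[int, int]:
    """Turn [100] into (100, 100) and [80, 120] into (80, 120)."""
    if len(values) == 1:
        return values[0], values[0]
    if len(values) == 2 and values[0] <= values[1]:
        return values[0], values[1]
    raise ValueError("expected [position] or [min, max]")


def matches(variant: Variant, start: Sequence[int], end: Sequence[int]) -> bool:
    """Assumed rule: start lies in the start interval, end in the end interval."""
    s_lo, s_hi = _bounds(start)
    e_lo, e_hi = _bounds(end)
    return s_lo <= variant.start <= s_hi and e_lo <= variant.end <= e_hi


def beacon_exists(variants: Iterable[Variant], start: Sequence[int], end: Sequence[int]) -> bool:
    """Boolean answer in the spirit of a V1 response: is any matching variant present?"""
    return any(matches(v, start, end) for v in variants)
```

For example, with stored variants `Variant(100, 200)` and `Variant(110, 260)`, the call `beacon_exists(variants, [80, 120], [250, 300])` reports a hit because of the second variant, without disclosing which record matched.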
Limitations of Beacon V1
The development of Beacon V1 centered around the idea of making genomic variant information accessible. However, the sharing of associated metadata was not defined.
Beacon V2
Beacon V2 was published and approved by GA4GH in 2022. It implements extensive metadata querying capabilities using ontology terms through CURIEs (Compact Uniform Resource Identifiers) on top of genomic variant discovery. Beacon V2 extends genomic variant querying to retrieve the actual variant information along with variant annotations. Beacon V2 has two aspects to it:
- The Beacon Framework: The Framework outlines the format of API requests, responses, parameters, and other common components.
- The Beacon Model: The Model describes the entities of a Beacon, such as individuals, biosamples, runs, analyses, cohorts, and datasets. The relations between these entities are described under Beacon V2 entities.
Beacon V2 entities
Beacon V2 outshines V1 in terms of metadata completeness. The Beacon V2 model has a broader scope of data associated with genomic variants, and as a result, the entities in the model are defined to have relations as depicted below.
Datasets represent the physical organisation and grouping of data, while cohorts represent logical groupings, for example those based on clinical studies. The entity relations can be used to query any aspect of a Beacon V2 instance by specifying filters along with a scope. For example, one can query for genomic variants while subjecting the associated individuals to a set of filters.
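A scoped query of this kind can be sketched with a toy in-memory model. The classes, field names, and the `variants_for_individuals` function below are hypothetical and far simpler than the Beacon V2 Model, and `EX:0001` is a placeholder CURIE rather than a real ontology term. The sketch only shows how a filter applied to individuals can constrain which variants are returned.

```python
from dataclasses import dataclass, field
from typing import Callable, Iterable, List, Set


@dataclass
class Individual:
    id: str
    diseases: Set[str] = field(default_factory=set)  # CURIE-style terms


@dataclass
class Biosample:
    id: str
    individual_id: str  # relation: biosample -> individual


@dataclass
class VariantCall:
    variant: str        # e.g. "chr1:100:A>G"
    biosample_id: str   # relation: variant call -> biosample


def variants_for_individuals(
    individuals: Iterable[Individual],
    biosamples: Iterable[Biosample],
    calls: Iterable[VariantCall],
    predicate: Callable[[Individual], bool],
) -> List[str]:
    """Return variants seen in biosamples of individuals that pass the filter."""
    selected = {i.id for i in individuals if predicate(i)}
    samples = {b.id for b in biosamples if b.individual_id in selected}
    return sorted({c.variant for c in calls if c.biosample_id in samples})
```

Here the filter is scoped to individuals, while the result is a list of variants; the relations between entities are what connect the two.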
Beacon V2 filters
Beacon V2 can handle several types of filters, including:
- Bio-ontology filters: These filters use ontology terms to query data. Beacon implementers have the flexibility to choose an ontology that suits their needs and data.
- Custom terms: A Beacon instance can define its own terms based on the data and domain.
- Numerical values: These filters perform numeric comparisons.
- Alphanumeric values: Filters that can take the form of strings, including numbers and English characters.
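The four filter types can be illustrated with a minimal sketch that evaluates filters against a single record. The `Filter` class, its `id`, `operator`, and `value` fields, the set of operators, and the record layout are all assumptions made for illustration; the official Beacon V2 documentation defines the actual filter schema. `EX:0001` and `local:cohortA` are placeholder terms.

```python
import operator
from dataclasses import dataclass
from typing import Any, Dict, Optional, Union

# Comparison operators assumed for numerical and alphanumeric filters.
_OPS = {
    "=": operator.eq,
    "!": operator.ne,
    "<": operator.lt,
    ">": operator.gt,
    "<=": operator.le,
    ">=": operator.ge,
}


@dataclass(frozen=True)
class Filter:
    """Hypothetical filter: a bare term, or a field compared against a value."""
    id: str
    operator: Optional[str] = None
    value: Union[str, int, float, None] = None


def record_matches(record: Dict[str, Any], flt: Filter) -> bool:
    if flt.operator is None:
        # Ontology or custom term: match if the record is annotated with it.
        return flt.id in record.get("terms", ())
    if flt.id not in record:
        return False
    # Numerical or alphanumeric comparison on a named field.
    return _OPS[flt.operator](record[flt.id], flt.value)
```

In this sketch, `Filter("EX:0001")` stands for an ontology filter, `Filter("local:cohortA")` for a custom term, `Filter("age", ">", 40)` for a numerical filter, and `Filter("tissue", "=", "blood")` for an alphanumeric one.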
Beacon V2 networks
A Beacon Network is an interface that extends querying beyond a single Beacon instance. Individual Beacon instances can register themselves with a Beacon Network. The network distributes each query it receives across the registered instances and responds to the user with a combined result. Beacon Networks therefore provide a single convenient interface for interacting with multiple Beacon instances, without having to repeat the request against each endpoint.
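The register, distribute, and combine steps can be sketched as follows. This is a toy model rather than an actual Beacon Network implementation: the `BeaconNetwork` class is hypothetical, each registered beacon is represented by a plain Python callable instead of an HTTP endpoint, and the response shape with an `exists` key is an assumption.

```python
from typing import Callable, Dict

# A registered beacon is modelled as a function from a query to a response.
BeaconFn = Callable[[dict], dict]


class BeaconNetwork:
    """Hypothetical network that fans a query out to registered beacons."""

    def __init__(self) -> None:
        self._beacons: Dict[str, BeaconFn] = {}

    def register(self, name: str, beacon: BeaconFn) -> None:
        self._beacons[name] = beacon

    def query(self, request: dict) -> dict:
        responses = {}
        for name, beacon in self._beacons.items():
            try:
                responses[name] = beacon(request)
            except Exception as exc:
                # One failing beacon should not break the combined result.
                responses[name] = {"exists": False, "error": str(exc)}
        return {
            # The network reports a hit if any registered beacon reports one.
            "exists": any(r.get("exists") for r in responses.values()),
            "responses": responses,
        }
```

A real network would send the requests concurrently over HTTP and handle timeouts; the sequential loop here keeps the sketch short.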
Beacon Serverless Implementation
There are several Beacon implementations that support the Beacon V2 protocol. These are freely available, with documentation, for genomic consortia to deploy and share data.
We have implemented the Serverless Beacon (sBeacon), which uses serverless technologies to implement the Beacon V2 specification. sBeacon is the only serverless Beacon V2 implementation available at the time of writing. sBeacon implements a large number of features outlined in the Beacon V2 specification and has zero idle compute costs due to its serverless architecture. Furthermore, sBeacon handles VCF files as they are, without transforming them into intermediate formats. Hence, sBeacon has minimal storage costs compared to other database methods.
sBeacon is fast, scalable, and easy to deploy because its architecture is defined in Terraform. We aim to further enhance the Beacon V2 experience for clinicians and the scientific community by delivering a highly scalable cloud-native solution.
Try sBeacon Today!
https://github.com/aehrc/terraform-aws-serverless-beacon/