This Explainer series describes the “many ways to the cloud”. It helps you choose the best way of porting your research onto the cloud.
Running your research on public cloud infrastructure (AWS, Azure, GCP) has many benefits, such as cost-efficiency, access to the latest technology, and high security standards (for more, please see our cloud-native Explainer). But getting started often involves a steep learning curve, especially since there is no single way of tackling the move.
In fact, there are many “ways to the cloud”, ranging from installing existing code on cloud-based virtual machines (lift-and-shift) to completely (re-)designing workflows to be cloud-native. Choosing one, the other, or a middle way depends on the situation. We tailor our approach for every project to efficiently set up a solution that produces outcomes fast and scales to future challenges.
In this blog post we start with the simple “lift-and-shift”, using the example of our COVID-19 cloud platform.
The research problem
We have developed a web service that can analyse, on a daily basis, the new samples of the virus causing COVID-19 collected around the world. Like other RNA viruses, the COVID-19 virus is mutating, so understanding how much it is changing and where the “mutational drift” is heading is important for clinical decisions and vaccine development.
Our approach creates a unique fingerprint from the genome of the virus. From this we can quantify the distance between any two viruses, which allows us to place them on a 2D map. Researchers around the world can then inspect how much of the “evolutionary space” the virus has already claimed.
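To make the fingerprint-and-distance idea concrete, here is a minimal sketch using k-mer counts as the fingerprint and Euclidean distance between count vectors. The k-mer approach and the choice of k are illustrative assumptions, not our actual fingerprinting method:

```python
from collections import Counter
from math import sqrt

K = 4  # k-mer length; a hypothetical choice for illustration


def fingerprint(genome: str) -> Counter:
    """Count all overlapping k-mers in a genome sequence."""
    return Counter(genome[i:i + K] for i in range(len(genome) - K + 1))


def distance(fp_a: Counter, fp_b: Counter) -> float:
    """Euclidean distance between two k-mer count vectors."""
    kmers = set(fp_a) | set(fp_b)
    return sqrt(sum((fp_a[k] - fp_b[k]) ** 2 for k in kmers))


a = fingerprint("ACGTACGTAA")
b = fingerprint("ACGTACGTTA")  # one-base "mutation" relative to a
print(distance(a, a))      # 0.0: identical genomes
print(distance(a, b) > 0)  # True: the mutated genome sits some distance away
```

Pairwise distances like these are what a dimensionality-reduction step can then project onto the 2D evolutionary map.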
The cloud architecture
Analysing the thousands of new samples every day and integrating them with the millions of virus records already in the database is a memory-intensive task. We therefore use moderately beefy virtual machines (AWS EC2 instances) to support the main computational load. To accommodate a typical lift-and-shift, we installed all the relevant pipeline dependencies and packaged them in a virtual machine image, which can be called on demand.
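In practice the lift-and-shift boils down to baking a machine image with all dependencies and launching instances from it on demand. A minimal sketch of assembling such a launch request (the AMI ID and instance type are placeholders, not our actual configuration; in production the resulting dict would be passed to boto3's `ec2.run_instances(**params)`):

```python
def build_launch_request(ami_id: str, instance_type: str = "r5.2xlarge") -> dict:
    """Assemble the parameters for an on-demand EC2 launch.

    Building the dict rather than calling AWS keeps this sketch
    self-contained; the keys match the EC2 RunInstances API.
    """
    return {
        "ImageId": ami_id,              # pre-baked image with all pipeline dependencies
        "InstanceType": instance_type,  # memory-optimised for the integration step
        "MinCount": 1,
        "MaxCount": 1,
        # Terminate (rather than stop) on shutdown, so nothing is billed after a run
        "InstanceInitiatedShutdownBehavior": "terminate",
    }


params = build_launch_request("ami-0123456789abcdef0")
print(params["InstanceType"])  # r5.2xlarge
```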
A successful lift-and-shift approach allows users to easily port their workflow onto the cloud, meeting the compute demands while still operating at an acceptable cost. With the above lift-and-shift approach we were able to successfully scale up compute resources with the growing data sizes.
Cloud-native automation to lift and shift
The workflow is sporadic: it completes the analysis of the newly submitted samples and then waits until the next samples have been submitted. We hence do not need the EC2 instances running 24/7 and can instead use the power of cloud architecture to ramp compute resources up and down.
To keep the idle cost of our COVID-19 cloud platform at $0, we wanted to trigger and terminate the pipeline automatically, paying only for the analysis, the validation of the data and the governance of the information flow.
The COVID-19 cloud pipeline initially used AWS Batch with a Docker image. Although this allowed us to spin up resources on demand, every part of the workflow was given the same resources, which was more than most of it needed. A more cost-effective solution was to break the pipeline into two parts, each with different resource requirements, and control them via Lambda functions.
For this we are using so-called “serverless” components (AWS Lambda functions), which can monitor the sample submission feed, kick off the analysis and shut down the architecture upon completion.
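The trigger logic itself is small. A hedged sketch of such a Lambda handler, assuming a hypothetical feed format; the Batch client is injected so the sketch can run without AWS (in production it would be `boto3.client("batch")`, and the job names are placeholders):

```python
def handler(event: dict, batch_client) -> dict:
    """Kick off the analysis only when new samples have arrived.

    `event` is assumed to carry a snapshot of the submission feed;
    if nothing new has come in, the handler returns without
    starting any compute, so idle cost stays at zero.
    """
    new_samples = [s for s in event.get("samples", [])
                   if s["submitted_after_last_run"]]
    if not new_samples:
        return {"started": False, "reason": "no new samples"}

    # Submit the heavy analysis as a Batch job (keyword arguments
    # follow the Batch SubmitJob API).
    batch_client.submit_job(
        jobName="covid-fingerprint-run",
        jobQueue="analysis-queue",
        jobDefinition="fingerprint-pipeline",
    )
    return {"started": True, "count": len(new_samples)}
```

A second function of the same shape can watch for job completion and tear the resources back down.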
The resulting data stays on the cloud (an AWS S3 bucket) and is permanently queryable through the webpage, which is itself a static page served from an S3 bucket. The data is not particularly sensitive, so we did not encrypt it; S3's own security controls are sufficient for this application.
Moving to Infrastructure as code
Bioinformatics solutions are constantly evolving, so frequent updates to the pipeline or to the webpage need to be staged and tested in the development environment before being pushed out to the production environment. Moreover, all the bioinformatics solutions we develop need to be reproducible for clients. To be able to easily reproduce the solution infrastructure in different environments, we write infrastructure as code using Terraform. The full architecture is implemented as code, so everything from the roles and permissions to the actual analytics code can be defined and rolled out by a single command.
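As an illustration, a single Lambda function and its deployment package might be declared like this (the resource names, paths and IAM role are placeholders, not our actual configuration):

```hcl
# Package the handler source into a zip Terraform can deploy
data "archive_file" "trigger" {
  type        = "zip"
  source_file = "${path.module}/src/trigger.py"
  output_path = "${path.module}/build/trigger.zip"
}

resource "aws_lambda_function" "trigger" {
  function_name    = "covid-pipeline-trigger"
  role             = aws_iam_role.lambda_exec.arn  # defined elsewhere in the module
  handler          = "trigger.handler"
  runtime          = "python3.12"
  filename         = data.archive_file.trigger.output_path
  source_code_hash = data.archive_file.trigger.output_base64sha256
}
```

A single `terraform apply` then rolls the function out together with the rest of the stack, identically in the development and production environments.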
Lambda functions are designed, updated and deployed using Terraform, which allows easier debugging and validation.
Navigating security as a researcher
The web service needs to provide security in two areas: first, protecting the data, and second, providing a safe webpage for users.
The COVID-19 cloud platform uses GISAID copyrighted data, which is downloaded securely to an S3 bucket using credentials stored in the Parameter Store of AWS Systems Manager. GISAID allows analysis of the data as long as the dataset is not distributed or exposed to re-creation. Our static webpage doesn’t display any genomic sequence information and adheres to GISAID guidelines for acknowledgements and data distribution.
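Fetching the download credentials then comes down to a couple of Parameter Store calls. A sketch with the SSM client injected so it stays self-contained (the parameter names are hypothetical; in production the client would be `boto3.client("ssm")`):

```python
def get_gisaid_credentials(ssm_client) -> dict:
    """Read the GISAID username/password from Parameter Store.

    WithDecryption=True asks SSM to decrypt SecureString values on
    the fly, so the plaintext never lives in code or config files.
    """
    creds = {}
    for name in ("/covid/gisaid/user", "/covid/gisaid/password"):
        resp = ssm_client.get_parameter(Name=name, WithDecryption=True)
        # Key the result by the last path segment, e.g. "user"
        creds[name.rsplit("/", 1)[-1]] = resp["Parameter"]["Value"]
    return creds
```

Because the secrets are resolved at runtime, rotating them in Parameter Store requires no redeployment of the pipeline.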
The traditional lift-and-shift approach allows researchers to port their automated pipelines to the cloud and have something up and running very rapidly, as with our bioinformatics analysis pipeline for COVID-19 viral evolution. With this you benefit from the cloud almost instantly: efficiency, scalability and reproducibility, for example scaling to hundreds of thousands of samples. As you gain more cloud-savviness, improvements to the architecture can be added retrospectively, such as making it more cost-efficient by using serverless components to orchestrate resource allocation.