Guest post by Dr Christopher Stewart, Sydney University, accompanying Denis's talk at the International Science School (ISS) 2019 symposium. The ISS brings 140 top science students from Australia and overseas together to hear talks by world-renowned scientists. This year's lineup included science titans like Dr Karl and Prof Kathryn North.
Each of us is genetically unique, with millions of tiny differences in our genes that make us who we are. Then there are the mutations that cause disease: some of them are simple, a single altered base-pair at a specific site; others are much more complex, with multiple mutations acting in concert.
Start doing the maths and you quickly realise that the computation needed to understand our genomes is staggeringly complex. Old methods are giving way to new, cloud-based, massively-distributed and entirely abstracted computation, with machine learning at its core.
Dr Denis Bauer is CSIRO’s Principal Research Scientist in transformational bioinformatics and an internationally recognised expert in machine learning and cloud-based genomics. Her achievements include developing open-source machine-learning cloud services that accelerate disease research, used by 10,000 researchers annually.
Big Data and Genomics
ISS: Denis, you sent me some information before we started this interview, which contained the statement, “Genomics produces more data than astronomy, Twitter and YouTube combined.” Now, YouTube claims that 300 hours of video are uploaded to their site every minute. You’re saying that genomics is bigger than YouTube, and Twitter—and throw modern astronomy in there too, which is producing ridiculous amounts of data. How is that possible?
DENIS: There are three billion letters in the genome, and stretched out, the genome is two metres long. There are 100 trillion cells in an average human body. If you do the calculation, two metres times 100 trillion—that's 200 billion kilometres of DNA, more than a thousand times the distance from the Earth to the Sun!
Here’s another statistic for you. The global market research firm Frost & Sullivan estimates that by 2025, 50% of the world’s population will have been sequenced.
Half of the world, in just six years?
Half of the world’s population. And the reason for this is that the genome holds the blueprint for everything—our future disease risks and previous conditions are encoded in the DNA. If we want to know about our risk for Alzheimer’s disease, or our ability to metabolise a certain drug, analysing the genome is the key.
So, in terms of the future of health education, the genome will be one of the first steps for almost everything. It will be standard procedure to read out the genome in order to get your general health profile. With that in mind, and the amount of effort that it involves, it’s quite easy to see how this will end up as more data than people uploading videos to YouTube about their pets and their hobbies.
In a way, it’s nice to think that this vast amount of data is being created for something more useful than cat videos. Don’t get me wrong, I think the astronomical data is important too! But I’m happy to know that YouTube and Twitter are being outdone by genomics.
And on that comparison to astronomy—there’s one universe, which you can measure, sure. But there are several billion people on Earth, and all of them have their own genome. Not only that, but every cell in their body will have some quirk in the genome—so it’s a staggering amount of information that you have to pull together if you want to look at the ecosystem of human health from the point of view of genomics.
When you consider that the Human Genome Project was not that long ago, in 2000.
Right, 19 years ago.
That’s a lifetime ago for our ISS students, but it’s really not so long ago in the history of science, and in the history of computing. You’re talking about sequencing 50% of the world’s population within the next six years. That’s a huge change!
It definitely is. When you look at other breakthroughs like X-rays and CT scans, they have taken decades to be used commonly in the health ecosystem—whereas genomics is almost a routine application already, just within 20 years.
Everything’s accelerating. For example, gene therapies like CRISPR are being adopted even faster, moving from basic science to application even more rapidly. That’s how the world works today.
Sequencing the very first human genome was a very difficult and expensive process. Now we’re able to analyse vast quantities of data for both statistical differences across populations, and individual differences for specific human beings. Can you talk about how that’s actually accomplished? How do you work with these vast amounts of data to extract the information that you need about someone’s health?
The processing of the raw data is quite routine by now—taking the bio-specimen, running it through the machine, pushing it through the computations. There’s a procedure around that. Any identified differences in that individual are compared to a reference genome. On average there are two million differences between individuals, and those two million mutations play a role in who you are: they define the way you look, your ethnicity, and then of course your risk of disease and other predispositions you might have.
So as I said, identifying these differences is relatively straightforward now—you can use high-performance compute clusters, and the Broad Institute has released its Genome Analysis Toolkit (GATK) on a public cloud provider, which many people have adopted as their method of choice—but the point is, it’s routine.
Where research still needs to be done is in how to pinpoint locations in the genome that are statistically different between populations. This is important for two areas: the first is the rare genetic diseases, which the ISS2019 speaker Kathryn North is working on, and where relatively well-established processes exist.
How does that work, then?
On average we have around two hundred mutations in our genome that break things, though most of us are healthy individuals because there is so much redundancy in the system. If we want to identify the genetic driver of a rare genetic disease, you go about eliminating those two hundred mutations that you found, one by one.
That can be done by comparing to other large cohorts, because if you see a particular mutation appearing in another person without the rare genetic disease, then it’s unlikely to be the causative mutation for your patient. So that is pretty straightforward.
The area that I’m looking into is in complex genetic diseases, where the disease is associated with multiple mutations in the genome—so you can’t rule out anything! Any mutation, in combination with another resilience factor, or another exacerbating factor, might be the driver of a disease for that individual.
And this is where machine learning comes in.
OK, so with these more complex diseases, how do you use machine learning to work out what’s going on?
The typical approach is that you go location by location and look at how mutations in your cases compare to your controls—that is, how sick individuals compare to healthy ones. And then you select the ones that have the highest skewness—the ones that are most enriched in one or the other class—and then from there you would build your multi-genic model.
The problem with that is that your initial screening of the mutations was based on individual contribution to the disease—and as I said before, a mutation might not be the driver of the disease individually, but in combination with resilience factors or exacerbating factors it might be the one that is actually causing the disease. So individually ruling out or ruling in mutations like this is a very crude approach.
The difference with machine learning methods is that they can build models over the full data set at the same time. You don’t have to preselect information anymore because you can build your machine learning model over the full data set and see which combinations of mutation together are driving the disease.
The machine learning algorithm is able to consider the whole data set, looking for the patterns. It’s churning away, looking for connections that wouldn’t be obvious, or would be much harder to find.
Yes, that’s exactly right. Clearly there needs to be a bit of trickery, because typical machine learning methods cannot deal with the kind of data that we are talking about here. We don’t have many samples, but we have huge amounts of describing points per sample.
Can you explain that a bit more? What’s the issue with the genomic data that makes this difficult?
Typically with machine learning, the rule of thumb is that you need as many samples as you have features, or describing elements, per sample. For us, the cohort size is typically, if we’re lucky, in the thousands—maybe two thousand individual genomes to work with. But the full genomic profile is three billion letters. Clearly not all of those differ—it’s the differences, the mutations, that we’re interested in—and there are typically two million of those between any two individuals.
For a cohort size of two thousand, that adds up to around 80 million distinct variant sites—orders of magnitude more features than the number of samples we’re dealing with.
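To put those numbers in perspective, here is a quick back-of-the-envelope calculation using the figures Denis quotes (the exact counts vary from cohort to cohort; these are the round numbers from the interview):

```python
# Back-of-the-envelope: samples vs features, using the figures quoted above.
n_samples = 2_000         # cohort size: individual genomes, if we're lucky
n_features = 80_000_000   # distinct variant sites across the cohort

ratio = n_features / n_samples
print(f"{ratio:,.0f} features per sample")
```

With tens of thousands of features per sample, this sits far outside the "as many samples as features" rule of thumb.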
You’re saying the data is way out of the normal parameter range for machine learning problems, where you’d normally have as many samples as features. What do you have to do in order to get around that?
Yes, these problems sit very deep in what is called the ‘curse of dimensionality’, where machine learning methods would not typically be useful, because there are so many combinatorial factors that they can’t robustly discriminate.
The way that we go about it is, first of all, we choose a method that is less prone to this curse of dimensionality, known as ‘random forests’.
OK, I don’t think I have ever heard of that, so I need to ask you to explain: what’s a random forest?
A random forest is basically a forest of decision trees, so let’s start with those.
You have a sample of data, and you’re trying to find the best descriptor you can for that sample. In our case, you have cases of disease and people without the disease, and you’re looking for the combination of mutations that best accounts for the disease in the sample.
A decision tree is a classifier. You pick a feature, a describing element about the sample—for example, if I observe a G at a particular position in the genome rather than an A, I might find that 50% of the G-carriers have the disease, compared to far fewer of the A-carriers—that sort of thing. You try to find the locations in the genome with the most discriminative power.
OK, got you so far—so a G in that location has a bit of descriptive power for cases of disease in your sample.
Now, we know that multiple locations in the genome work together with these diseases, so we look again and find that particular position in the genome, together with a different position, a mutation somewhere else, gives an even better split. It’s even better at describing the sample.
And you traverse through different features and try to split your data set, to sort it out as purely as possible, case versus control. That is one decision tree.
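To make that concrete, here is a minimal sketch of a single decision tree in scikit-learn, trained on synthetic genotype data. The cohort size, the variant sites, and the two "causal" positions are all invented for illustration—this is not Denis's actual pipeline:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# Synthetic cohort: 200 individuals x 50 variant sites, genotypes coded 0/1/2.
X = rng.integers(0, 3, size=(200, 50))

# Invent a disease driven by two sites acting in concert (positions 3 and 17).
y = ((X[:, 3] > 0) & (X[:, 17] > 0)).astype(int)

# One decision tree: greedily picks the most discriminative sites to split on,
# sorting the sample into case vs control as purely as possible.
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)
print("training accuracy:", tree.score(X, y))
```

Because the synthetic signal here is clean, a single shallow tree separates cases from controls perfectly; real cohorts are far noisier.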
So, one decision tree is working through the data, trying out differences in the genome and sorting out the sample from those differences, trying to find which group or chain of mutations in the genome are able to differentiate cases of the disease against those who don’t have the disease.
Right. Then you can build multiples of those, giving different portions of the data to different trees. To get around the dimensionality problem, each tree is not dealing with the whole genome, it’s given a chunk of the genome—but if you bring a whole forest of trees together, then the forest has seen the full genome! We can get the results effectively over the full genome, because you’re looking at the aggregate result from the forest, rather than any one tree.
OK, I think I get this. You’ve got an enormous amount of activity happening in parallel, each of these trees sorting through its subset of the data and finding connections between locations in the genome. None of them is complete, but combined together, the picture that the whole forest produces gives you information about the entire sample, over the entire genome. No individual bit will give you the answer—but taken all together, you find a pattern of mutations that account for the disease.
Exactly. That’s a random forest.
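In code, the forest Denis describes maps naturally onto scikit-learn's RandomForestClassifier. One caveat on the analogy: scikit-learn subsamples the candidate features at every split (via max_features) rather than handing each tree a fixed chunk of the genome, but the effect is the same—no single tree sees everything, while the forest in aggregate covers it all. The data below is synthetic and the two "causal" sites are invented for illustration:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(1)

# Synthetic cohort: 300 cases/controls, 1,000 variant sites each.
X = rng.integers(0, 3, size=(300, 1000))
y = ((X[:, 10] > 0) & (X[:, 500] > 0)).astype(int)  # two interacting sites

# Many trees, each seeing a random subset of features at every split;
# the forest as a whole covers the full feature space.
forest = RandomForestClassifier(
    n_estimators=500,      # many trees, built in parallel
    max_features="sqrt",   # ~32 of the 1,000 sites considered per split
    n_jobs=-1,
    random_state=0,
).fit(X, y)

# Feature importances aggregate over the whole forest: the two causal
# sites should rank near the top.
top = np.argsort(forest.feature_importances_)[::-1][:5]
print("top-ranked sites:", top)
```

No individual tree identifies the pattern; it is the aggregate importance ranking across all 500 trees that surfaces the interacting sites.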
The other piece of the puzzle is computational power—only with the power we have today do we have the capacity to build that many trees in parallel. In the ’60s, when the ideas behind decision trees were first developed, and in the ’80s, when they were first seriously applied, the number of trees used ranged up to a thousand at most—which means the maximum number of features you would typically look at was around ten thousand. That’s several orders of magnitude smaller than the data we have to deal with here.
Now we’re able to use cloud computing and frameworks like Apache Hadoop and Spark—routine parallelisation methods that give us the last piece of the puzzle to build huge numbers of decision trees—ten thousand, a hundred thousand—and almost exhaustively search the space of combinations in the data.
A key part of this is that we’re offloading the computational power to the cloud, rather than needing to have an actual bit of dedicated hardware ourselves. Even the high-performance compute clusters that we used just five years ago, their way of parallelising was very bespoke. You had to encode every communication between the nodes, and orchestrate the parallel computation yourself.
The huge advancement that Google's MapReduce framework, and later Apache Spark, have brought in is being able to use commodity hardware—not specialised high-performance multi-CPU nodes, just a regular CPU off the shelf—and stringing hundreds, or thousands, or tens of thousands of those together, and then writing an instruction library that dissolves the boundary between those nodes, allowing the CPUs to exchange information seamlessly.
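The programming model behind MapReduce can be sketched in a few lines of plain Python. This toy genotype count runs on one machine, but the point of the model is that the same map and reduce functions could be shipped unchanged to thousands of commodity nodes, each holding one shard of the data (this illustrates the model only, not any specific framework's API):

```python
from collections import Counter
from functools import reduce

# Toy genotype calls, split into shards; each node would hold one shard.
shards = [
    ["AG", "GG", "AG"],
    ["AA", "AG", "GG"],
]

def map_shard(shard):
    # Map step: each node independently counts genotypes in its own shard.
    return Counter(shard)

def merge_counts(a, b):
    # Reduce step: merge partial counts from two nodes.
    return a + b

partials = [map_shard(s) for s in shards]   # would run in parallel in practice
total = reduce(merge_counts, partials)
print(total)  # Counter({'AG': 3, 'GG': 2, 'AA': 1})
```

The framework's job is everything this sketch leaves out: distributing the shards, scheduling the map tasks, and shuffling the partial results to the reducers.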
This feels like a theme in a few of the ISS topics this year—the incredible pace of advancement in modern computing technology converging with scientific applications hungry for processing power. If there are enough groups in the world who have a need for high intensity processing power, then someone like AWS or Google or Azure can happily make good profits building huge clusters, and selling that computing power to you, relatively cheaply.
Exactly. The cost of innovation is drastically reduced if you don’t have to buy your own compute cluster—you can just rent it!
That makes sense.
And then there’s serverless computing.
As in, no more servers. It’s a new kind of cloud architecture that has the potential to change everything yet again. It’s sort of like prefabrication, which was a game-changer in the building industry—similarly, serverless will be a game-changer in the cloud and IT industry, with the potential to be a 20 billion dollar market in the next two years.
Right, we should talk about that then! Tell me about serverless computing.
The concept is a step further along the path of wanting to utilise every CPU on our cluster. If you have your own machines, you have to know how they network and communicate. With Docker containers, for example, we abstract the application away from the environment it will run on. Serverless goes another step further, abstracting away the other elements of the computer as well. It’s like the servers don’t exist.
The idea is that you write your code, and you just say, go execute that in the cloud. I don’t care where it’s executed, or on what processors it’s executed, or what the communication is between disc and processor. All I want to have back is the result.
I don’t want to manage any databases, I don’t want to manage the communications between the user and the cloud infrastructure—everything becomes modular, a set of services that you can seamlessly string together to solve your problems in the cloud.
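The shape of serverless code is strikingly small: a cloud function is just a handler that the platform invokes on demand. The sketch below mimics the AWS Lambda handler convention in Python; the event fields and the "work" inside are made up for illustration, not taken from Denis's actual service:

```python
import json

def handler(event, context=None):
    """A Lambda-style function: stateless, invoked per request.
    The platform decides where, and on what hardware, this runs."""
    gene = event.get("gene", "unknown")
    # ... real work (e.g. searching a gene for target sites) would go here ...
    return {
        "statusCode": 200,
        "body": json.dumps({"gene": gene, "status": "ok"}),
    }

# Locally we can call it like any function; in the cloud, the platform does.
resp = handler({"gene": "BRCA1"})
print(resp["statusCode"])  # 200
```

Scaling is the platform's problem: whether one researcher or a hundred thousand call this, the provider spins up as many instances as needed, and nothing runs (or costs anything) when no one calls it.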
OK, that’s the analogy to prefabrication in the building industry—where you can say, I don’t care how you make it, I just want concrete panels of this size and shape, I want doors and windows and floors, send them to me and I’ll build a house.
If you’re Google or Amazon, it must be an enormous amount of work to set all this up—but once you’ve got the framework in place, you can roll it out to anyone who has an interesting research question and say, look, we’ve done all the background framework for you! All you have to do is throw us your question and we can give you an answer.
It’s not only in the research space—it’s basically everyone.
Everyone can become a builder, because the tools of how to build have become generalised. In the cloud, anyone can start up a web service in a matter of minutes now—and that web service is not just a static webpage, it can be anything you can think of.
It could have a sophisticated compute engine behind it like our genome search engine, which uses serverless technology to sift through the whole genome in order to find the best target site for CRISPR, for example. And all of that is scalable—if one researcher or a hundred thousand researchers want to search one gene or a whole genome, the system itself scales automatically. And if no one is using it, we don’t pay anything for it!
It’s a true game-changer for the IT space, and how people will think about setting up and building projects going forward. I like to look out for technological advances that really are a step-change, that leapfrog over multiple generations of linear improvements—and cloud computing, serverless, ultimately quantum computing too, will be some of those step-changes that will make things possible that were definitely impossible before. That’s what I’m looking for, that’s what I get excited about.
Can you give me an example of the sort of thing you’re talking about? Something that was impossible before, but serverless computing, the cloud technologies, now makes possible?
We talked about sequencing becoming so cheap now that you can easily identify genetic mutations that might cause a disease. But in order to really verify that they do cause the disease, you have to prove it in the lab—and this is typically now done with CRISPR-based nucleotide editing. There are millions of diseases out there suspected to have a genetic component, and huge numbers of laboratories interested in validating that—but they’re not currently capable of doing so because there’s this problem of finding reliable, efficient ways of inserting that mutation into the genome of their experimental model.
Our genome search engine is targeting that exact use case. Any researcher in the world can go to our web service and say, I want to edit this particular gene—what is the most reliable, most effective way of doing it? It’s basically democratising access to high quality, safe options for editing a genome—and by ‘safe’, I mean finding places on the genome that will work well, that are not off-target where CRISPR would do more harm than good, or just make a hash out of your precious samples because it would not do the cutting properly.
Previously, in order to find that sort of information, researchers would need to have a high-performance compute cluster themselves—or, for us to host the search engine service for them, we would have to foot a bill of thousands of dollars a month, which we could not have afforded. So we would have to make it a pay-by-service approach, which would be expensive for the users, or we just wouldn’t have been able to provide the search engine in the first place. Either way, people would not have had access to safe CRISPR predictions the way they have now.
Which means that the research would have been much harder, or much slower, or just wouldn’t have happened at all.
Exactly. But with serverless technologies, all those laboratories that want to help in verifying genetic diseases can now go on our webpage and find the best way of doing it, and then write a paper showing that this mutation is producing the symptoms associated with that genetic disease.
Implications of Genome Engineering
Given the pace that genomics has been advancing—and you mentioned CRISPR, which is moving even faster!—these very new capabilities are revolutionising the ways we think about human health, the ways we diagnose, treat and think about disease.
It’s really easy to get caught up in how exciting and powerful these new technologies are—but has there been enough time for us to consider the ramifications and the ethical implications of what we’re doing?
It’s a tough question and there’s no right or wrong answer, because depending on which perspective you’re looking at, you can easily make arguments either way.
If you look from the point of view of the parent of a child with a rare genetic disease, these advances can’t come fast enough! Clearly they want to be able to help that child—and technically, that sort of thing is possible today.
It’s very difficult to argue against a parent who just wants their child to be healthy. If we can do something about it, there’s an ethical argument that we should do something about it.
Exactly, it almost seems a human right to do it.
At the same time, there are very good arguments against moving as fast as we can to save the individual, many of which have been outlined in a recent moratorium paper on human genome editing.
Our understanding of how the human genome works is very limited, so making dramatic changes now in order to save an individual could have future ramifications that we have no concept of.
I don’t think any individual alone can come up with the right answer to how this technology should progress or be regulated. We need as many people involved as possible, to put their brainpower towards exactly these questions—because there needs to be a consensus, we need to think every possibility through, and work out how to mitigate any risks.
Another ethical dimension is, how do we treat the potential of genetic disease?
What do you mean by ‘potential’? You mean, if someone might get the disease?
If there are no symptoms yet, but a disease shows up in their genome, how should someone be treated? Should their job or their opportunities be limited just because there’s a risk in them developing a certain genetic disease? Which all begins to sound like a sort of dystopia, something from the film GATTACA...
Of course, yes! If you know someone could get sick in the future with a genetic disease, would you employ them? Would you marry them? Should they have children? It’s an ethical time bomb!
And again, every layer of society needs to get involved, needs to be educated, needs to form an opinion about how they want the world to look in the future.
Those are some very very big issues. Is that a bit daunting for you? I mean, this is obviously an exciting field to be in, but it’s also got a very high level of responsibility.
Absolutely. I also see a responsibility for researchers to make this work accessible, helping to provide accurate information to the general population. We have to make this subject as interesting as possible, we have to convey the difficulties and opportunities as objectively as we can, so people can make up their own minds.
It sounds like it’s an area with enormous potential for growth, a research field ready to vacuum up a wide range of very intelligent people—from basic data wrangling and cloud computing, through genetics and biotechnology, through to the ethical issues. It sounds like there are a lot of possibilities in genomics and machine learning for someone just starting their career.
Absolutely, yes! I might be biased, but to me it’s an absolute privilege to work in this area.