As technology improves and costs fall, DNA sequencing is becoming more prevalent in health care settings, and so is its impact on our lives. Our genome shapes many of our traits, such as hair color or height, but also which diseases we are going to develop, or are more likely to develop, in our lifetime. Each of us carries about 3 billion pairs of letters, and the differences in them help make us unique. However, humans cannot comprehend such complexity: 3 billion letters are far too many for a doctor to analyze and draw conclusions from. But fear no more! Over the last decades, researchers have developed hardware and software to read and analyze such big datasets. Genome-Wide Association Studies (GWAS) search across the genome to find genetic variants that are associated with traits of interest (e.g. diseases).
Thanks to these technologies, we have found that one size does not fit all. We are all unique in our own way, and that is what precision medicine takes pride in. Precision medicine aims to find the best care for each one of us. While previous approaches provided a good generalization of what may work for most people, current technologies open the door to more complex analyses. For example, we can group patients by how likely they are to respond to a specific medication or by their quality-of-life prognosis. Moreover, since we have gathered vast amounts of data, we can also analyze it using the systems biology paradigm.
In the past, researchers used a reductionist approach built on the premise that a single cause has a single effect. In genetics, that would mean each individual letter of our DNA that varies between people (a single-nucleotide polymorphism, or SNP) has its own quantifiable effect on our health. The systems biology paradigm argues that biology is much more complex than this one-to-one view: different parts of our genome interact and alter outcomes in ways that cannot be identified by looking at each part in isolation. Each component in the system interacts with other parts of the system. The properties of a system are better understood when it is looked at holistically, using a broader perspective and more complex algorithms.
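To make the idea of interacting SNPs concrete, here is a minimal sketch in Python. The toy dataset, the XOR-style rule, and all names are assumptions made up for illustration, not real data. Two SNPs jointly determine the trait, yet each one looked at on its own carries no signal at all.

```python
from itertools import product

# Toy data: every combination of two binary SNPs (0 = reference, 1 = variant),
# repeated 25 times. Hypothetical rule: the trait appears only when exactly
# one of the two SNPs carries the variant (an XOR-style interaction).
samples = [(a, b) for a, b in product((0, 1), repeat=2)] * 25
trait = [a ^ b for a, b in samples]


def trait_rate(rows, labels, keep):
    """Fraction of samples with the trait among rows where keep(row) is true."""
    picked = [y for row, y in zip(rows, labels) if keep(row)]
    return sum(picked) / len(picked)


# One SNP at a time: carriers and non-carriers show the same trait rate.
rate_a0 = trait_rate(samples, trait, lambda r: r[0] == 0)
rate_a1 = trait_rate(samples, trait, lambda r: r[0] == 1)

# Both SNPs together: the trait rate is fully determined by the combination.
joint = {
    combo: trait_rate(samples, trait, lambda r, c=combo: r == c)
    for combo in product((0, 1), repeat=2)
}

print("SNP A alone:", rate_a0, rate_a1)
print("SNP A and B together:", joint)
```

A test that looks at SNP A alone sees a 50% trait rate in both groups and concludes there is nothing there. Only the joint view reveals the effect.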
Currently, most GWAS are done using regressions on individual SNPs, and thus miss interactions between different parts of our genome. Running regressions for all possible SNP combinations is currently infeasible. To overcome this, we implemented VariantSpark, a random forest model that identifies the most important SNPs affecting the outcome. Here, a random forest is an ensemble of classification trees that take the genomic mutations in our DNA and try to classify the given phenotype (generally a disease or trait). The beauty of this method is that it takes interactions between SNPs into account, so we no longer isolate single regions of our DNA but instead consider how they act together. With this tool, we can identify the variants that have the greatest impact on the outcome, based on the model's variable importance.
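A quick back-of-the-envelope count shows why exhaustive regression does not scale. The sketch below assumes a study with one million SNPs, a round number chosen purely for illustration, and counts how many models would have to be fitted at each interaction order.

```python
from math import comb

n_snps = 1_000_000  # assumed study size, for illustration only

# Number of distinct SNP sets of each size, i.e. the number of models to fit.
counts = {order: comb(n_snps, order) for order in (1, 2, 3, 4)}

for order, count in counts.items():
    print(f"{order}-SNP combinations: {count:.3e}")
```

Under this assumption, pairs alone already amount to roughly half a trillion models, and every extra SNP in the combination multiplies the count by several orders of magnitude.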
Now you may be thinking: what about the interactions and the network effects? We can analyze those with BitEpi, a tool that lets the researcher exhaustively search for and test higher-order epistatic interactions. That means the software tests ALL 2-SNP (pairs), 3-SNP (trios), and 4-SNP (higher-order) interactions. It is written in C++ for efficiency and parallelized to use all available computing power. It can be run from the command line or from Python.
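The idea of an exhaustive search can be sketched in a few lines of Python. This is not BitEpi's code or API: the toy data, the `purity_score` function, and every name below are assumptions for illustration, and BitEpi's actual statistics and implementation may differ. The sketch enumerates every SNP pair and trio and scores each by how well the combined genotype separates cases from controls.

```python
import random
from collections import defaultdict
from itertools import combinations

random.seed(0)
N_SAMPLES, N_SNPS = 400, 8

# Hypothetical toy genotypes: each SNP has 0, 1 or 2 copies of the variant.
genotypes = [[random.randint(0, 2) for _ in range(N_SNPS)] for _ in range(N_SAMPLES)]
# Assumed ground truth: a case only if SNP 2 and SNP 5 both carry the variant.
phenotype = [int(g[2] > 0 and g[5] > 0) for g in genotypes]


def purity_score(snp_set):
    """Share of samples correctly labelled by a majority vote within each
    combined-genotype cell of the given SNPs (a simple, made-up score)."""
    cells = defaultdict(lambda: [0, 0])  # cell -> [controls, cases]
    for g, y in zip(genotypes, phenotype):
        cells[tuple(g[i] for i in snp_set)][y] += 1
    return sum(max(c) for c in cells.values()) / N_SAMPLES


def exhaustive_search(order):
    """Score every SNP combination of the given order, best first."""
    scored = [(purity_score(s), s) for s in combinations(range(N_SNPS), order)]
    return sorted(scored, reverse=True)


pairs = exhaustive_search(2)
trios = exhaustive_search(3)
best_pair_score, best_pair = pairs[0]

print("best pair:", best_pair, "score:", best_pair_score)
print("pairs tested:", len(pairs), "trios tested:", len(trios))
```

With only 8 SNPs this runs instantly, but the number of combinations grows very quickly with the number of SNPs, which is why an efficient, parallel implementation matters. Any trio that contains both causal SNPs also scores perfectly with this naive score, so a real analysis would need a statistic that accounts for that.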