Nowadays, a growing range of parallel architectures, from general-purpose platforms to domain-specific architectures and programmable devices, is at our disposal to undertake grand computational challenges in science. The increasing compute capabilities of parallel devices have led to important advances in the solution of real-world optimization problems, supporting computational intelligence approaches (e.g. bioinspired metaheuristics) in the pursuit of faster and more accurate search methods. However, the choice of the parallel programming model represents a difficult issue, given the plethora of alternatives currently available. Moreover, the device-specific (and even vendor-specific) nature of some popular programming models imposes portability restrictions that limit the practical applicability of the proposed parallel methods. To address these issues and orchestrate the use of such a wide spectrum of parallel devices, the high-performance computing (HPC) community has been pushing the definition of cross-architecture, unified programming models. Initiatives like SYCL [RAB+21] represent an opportunity to attain this goal, facilitating parallel code development and tuning across heterogeneous devices. Applications in tsunami simulation, cosmology, and benchmarking show the potential of cross-architecture models [NB21].
From a cross-architecture perspective, attaining consistent compute performance across devices represents a key goal. However, cross-architecture parallel methods should be devised not only with the aim of achieving major performance improvements, but also considering the increasing concerns about energy consumption [MBB22]. In the current world context, accomplishing a successful paradigm shift from red computing to green computing is of the utmost importance, especially in the case of artificial intelligence applications due to their large carbon footprint [SDS+20]. Therefore, in the pursuit of sustainable computational methods, strategies that leverage the available resources for improved performance and energy efficiency must be defined.
This project proposal is aimed at extensively exploring the benefits of defining cross-architecture and energy-aware parallel metaheuristics in an important application domain: bioinformatics. More specifically, we are interested in investigating how cross-architecture programming models and green heterogeneous computing can support bioinspired metaheuristics in overcoming the limitations of current device-tailored red bioinformatics codes. As our main case study, this proposal will target real-world bioinformatics problems derived from the area of phylogenetics, the centrepiece of research in evolutionary biology.
Phylogenetic inference deals with the identification of the evolutionary events that gave rise to the diversity observed in the molecular characteristics of current organisms or genes [War17]. Evolutionary relationships are represented as tree structures T = (V,E), known as phylogenetic trees. The node set V contains: 1) internal nodes describing hypothetical ancestral organisms, and 2) leaf nodes representing the extant organisms under study. The phylogenetic topology is consequently defined by the node-to-node relationships specified in the branch set E. Multiple applications (e.g. conservation biology, forensics, drug discovery, and epidemiology [ABA+21]) show the relevance of the information discovered by phylogenetic analyses. In fact, this problem is considered not only a key topic in molecular biology but also a major computational challenge in computer science.
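As a minimal illustration of this structure (all node names and branch lengths below are hypothetical, not taken from any cited study), a phylogenetic tree can be encoded as a node set plus a branch set annotated with evolutionary distances:

```python
# Minimal sketch of a phylogenetic tree T = (V, E); values are illustrative only.
leaves = {"orgA", "orgB", "orgC", "orgD"}      # extant organisms (leaf nodes)
internal = {"anc1", "anc2"}                    # hypothetical ancestors (internal nodes)
V = leaves | internal
E = {                                          # branch set with branch lengths
    ("anc1", "orgA"): 0.12, ("anc1", "orgB"): 0.09,
    ("anc2", "orgC"): 0.15, ("anc2", "orgD"): 0.07,
    ("anc1", "anc2"): 0.21,
}

def adjacency(branches):
    """Undirected adjacency map derived from the branch set (the topology)."""
    adj = {}
    for (u, v), length in branches.items():
        adj.setdefault(u, []).append((v, length))
        adj.setdefault(v, []).append((u, length))
    return adj

print(adjacency(E)["anc1"])   # neighbours of an internal node and their branch lengths
```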
Finding optimal phylogenetic trees implies the exploration of a complex search space that grows exponentially with the number N of input sequences. More specifically, the number of possible candidate solutions is given by the double factorial (2N−5)!!, thus surpassing the Eddington number (~10^80) for little more than N = 50. Nowadays, the application of third-generation sequencing tools has given rise to complex genome-scale datasets several orders of magnitude larger than traditional ones. The need to find evolutionary relationships derived from such genome-scale data has motivated a major turning point in the field, from phylogenetics to phylogenomics, where new computational challenges arise [YG20]. We will herein focus on two main subproblems: phylogenetic placement and missing data imputation.
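A quick calculation illustrates this growth; the sketch below (a hypothetical helper, not part of any cited tool) evaluates (2N−5)!! for several values of N and compares the counts against ~10^80 as a rough order of magnitude for the Eddington number:

```python
from math import prod

def num_unrooted_topologies(n: int) -> int:
    """Number of unrooted binary tree topologies on n taxa: (2n-5)!!."""
    if n < 3:
        return 1
    return prod(range(3, 2 * n - 4, 2))   # 3 * 5 * ... * (2n-5)

EDDINGTON = 10 ** 80   # rough order of magnitude, used only for comparison

for n in (10, 20, 50, 60):
    trees = num_unrooted_topologies(n)
    print(f"N={n:3d}: {trees:.3e} candidate topologies  (exceeds ~1e80: {trees > EDDINGTON})")
```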
Phylogenetic placement: Given a reference phylogenetic tree T, inferred from N aligned sequences, and a set of K query sequences, phylogenetic placement consists of determining the most likely location of the organisms characterized by the query sequences within the reference tree. The output of a placement algorithm is a probability distribution that denotes the degree of evolutionary proximity between the clades of the reference tree and the query sequence. Applications in ecological profile estimation, studies of microbial diseases, and the identification of low-coverage viral strain genomes underline the importance of this problem [CSD+22]. In fact, phylogenetic placement provided the means to update the viral phylogeny with SARS-CoV-2 strains [TTH+21].
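To make the notion of a placement distribution concrete, the sketch below normalizes per-branch attachment scores (made-up log-likelihoods) into a probability distribution over candidate attachment branches; this softmax-style normalization is an assumed illustration of how such distributions can be obtained, not the procedure of any specific cited tool:

```python
import math

# Illustrative log-likelihood scores for attaching one query sequence to each
# candidate branch of the reference tree (values are made up).
branch_log_likelihoods = {"branch_1": -1204.7, "branch_2": -1201.3,
                          "branch_3": -1209.9, "branch_4": -1202.0}

# Normalize the scores into a probability distribution over attachment points.
best = max(branch_log_likelihoods.values())
weights = {b: math.exp(ll - best) for b, ll in branch_log_likelihoods.items()}
total = sum(weights.values())
placement_distribution = {b: w / total for b, w in weights.items()}

for branch, p in sorted(placement_distribution.items(), key=lambda kv: -kv[1]):
    print(f"{branch}: {p:.3f}")   # higher p = closer evolutionary proximity
```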
The attainment of optimal placements is a complex issue, since the search algorithms must deal with a high number of query sequences (9,437 for SARS-CoV-2 [TTH+21]) and large reference trees with hundreds of thousands of leaves [CSD+22]. The likelihood function L(T, θ) can be employed to measure the quality of the attachments. However, the use of this criterion in large-scale scenarios is constrained by its intensive computational costs and memory requirements [BS21]. Alternatively, distance-based criteria (e.g. least squares) can be adopted to soften such computational restrictions. The goal is to minimize the weighted least-squares difference between the pairwise distances across organisms in the tree and the distances observed in the sequences. Although faster placements can be attained through this method, distance-based criteria are very sensitive to the presence of missing data, the second subproblem to tackle.
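As a sketch of this distance-based criterion (assuming a common Fitch-Margoliash-style weighting of 1/δ², with fully illustrative matrices), the weighted least-squares score between tree-induced and sequence-based distances can be evaluated as follows:

```python
import numpy as np

def weighted_least_squares(tree_dist: np.ndarray, obs_dist: np.ndarray) -> float:
    """Weighted least-squares fit between pairwise distances induced by a tree and
    distances estimated from the sequences (weights 1/obs^2 are one common choice)."""
    iu = np.triu_indices_from(obs_dist, k=1)       # visit each unordered pair once
    d_tree, d_obs = tree_dist[iu], obs_dist[iu]
    weights = 1.0 / np.square(d_obs)
    return float(np.sum(weights * np.square(d_tree - d_obs)))

# Illustrative 4-taxon example (all values are made up).
obs = np.array([[0.00, 0.30, 0.45, 0.50],
                [0.30, 0.00, 0.40, 0.55],
                [0.45, 0.40, 0.00, 0.25],
                [0.50, 0.55, 0.25, 0.00]])
tree = np.array([[0.00, 0.28, 0.47, 0.52],
                 [0.28, 0.00, 0.43, 0.48],
                 [0.47, 0.43, 0.00, 0.23],
                 [0.52, 0.48, 0.23, 0.00]])
print(f"WLS score: {weighted_least_squares(tree, obs):.4f}")   # lower = better fit
```

A distance-based placement search would recompute such a score for each candidate attachment point and retain the lowest-scoring one.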
Missing data imputation: Data missingness occurs when the nucleotides associated with a genetic locus for an organism are not successfully sequenced. Incomplete data assembly, assembly errors, and redundancy are some of the problems that give rise to missing data in sequence alignments, a situation that is commonly encountered in real-world analyses [KSM+17]. In the case of distance-based phylogenetic methods, the presence of unknown data in the sequences can lead to failures in the calculation of the sequence distance matrix. If the data available for two organisms i,j do not overlap in any subsequence, the entry of the associated dissimilarity matrix for i,j cannot be directly calculated, leading to missing distances. Figure 2 shows an example of this situation.
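The sketch below reproduces this situation on a toy alignment (hypothetical sequences): pairwise distances are computed only over sites known in both organisms, and a pair with no overlapping known site yields an undefined (missing) entry in the distance matrix:

```python
import numpy as np

# Toy aligned sequences with unknown characters ('-', '?'); values are illustrative.
alignment = {
    "org1": "ACGT--??",
    "org2": "ACTT--??",
    "org3": "????GGTA",   # shares no known site with org1/org2 -> missing distances
}
MISSING = set("-?NnXx")

def p_distance(s1: str, s2: str) -> float:
    """Proportion of differing sites over the sites known in both sequences;
    returns NaN when the two sequences share no overlapping known site."""
    pairs = [(a, b) for a, b in zip(s1, s2) if a not in MISSING and b not in MISSING]
    if not pairs:                     # no overlap: the distance cannot be computed
        return float("nan")
    return sum(a != b for a, b in pairs) / len(pairs)

names = list(alignment)
D = np.array([[p_distance(alignment[i], alignment[j]) for j in names] for i in names])
print(names)
print(D)   # NaN entries are the missing distances to be imputed
```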
The probability of inferring wrong evolutionary relationships increases when missing data is encountered in large sparse matrices [RBP13], thus justifying the interest in applying data imputation and optimization approaches. Likelihood and least-squares measurements can be adopted to guide the search towards satisfactory imputations. However, the performance and convergence speed of current phylogenetic imputation methods highly depend on the percentage of missing data [BB20], which can account for up to 80% of the data in phylogenomic scenarios. Novel approaches are therefore needed to address this issue in current real-world analyses.
Several references provide a general perspective on different aspects of bioinformatics, metaheuristics, and parallel computing. However, we focus here on the background and state of the art closely related to the bioinformatics problems that we will tackle.
In the general case of phylogenetic inference, the literature documents the strong relationship between HPC and phylogenetic search methods. Two of the most important methods for maximum likelihood inference, RAxML-NG [KDF+19] and IQ-TREE 2 [MSC+20], implement parallel heuristic schemes for CPU architectures, combining POSIX or OpenMP threads with MPI message passing and AVX SIMD extensions. The matOptimize tool mixes MPI with Threading Building Blocks to enable phylogenetic analyses on Intel CPU clusters [YTH+22]. FPGA implementations of the phylogenetic likelihood function were proposed in [ABA+21]. FPGAs were also employed to define near-memory processing models for phylogenetics, using Xilinx Vivado HLS [ASP+21]. NVIDIA GPUs were explored to accelerate popular software like MrBayes [PSR+15], which implements parallel Bayesian inference using CUDA. Furthermore, in [SVS22] we explored the definition of metaheuristic approaches for phylogenetic reconstruction, enabling heterogeneous computing supported on NVIDIA GPUs through different parallel schemes based on MPI, OpenMP, and CUDA.
These previous methods are tailored to specific hardware (CPU, GPU, FPGA…) or vendor technology (Intel, NVIDIA, Xilinx…), so they do not provide a unified solution to exploit the current spectrum of heterogeneous resources. Moreover, since cross-architecture approaches have not been implemented, the state-of-the-art parallel tools are not portable to other configurations or platforms (e.g. AMD GPUs), thus imposing restrictions on the specific hardware to be used. To the best of our knowledge, the only research that considers the cross-architecture question in the field is due to Ayres et al. [ACB+19], who proposed a library to calculate the likelihood function with OpenCL. However, this library only exploits the data parallelism exhibited by this particular function at the sequence level, so other sources of parallelism (e.g. parallel topological rearrangements, parallel independent runs and bootstrapping) are not considered. Moreover, energy efficiency has played a secondary role in this area (only investigated in the context of FPGAs [ASP+21]). In-depth research on cross-architecture and energy-aware algorithms for phylogenetics is therefore required.
Regarding phylogenetic placement, the vast majority of approaches in the literature are devised for execution on CPUs. For example, EPA-NG [BKC+19] conducts likelihood-based phylogenetic placements on multicore CPUs and clusters by implementing two levels of parallelism: MPI to distribute independent query sequences and OpenMP to parallelize the placement calculations for a given query sequence. This approach was refined in [BS21] to reduce the memory footprint, observing memory savings of up to 96% at the expense of increased execution times. The RAPPAS tool [LSP19] offered an alternative approach based on the identification of k-mers in the reference sequences, without implementing explicit parallel strategies. APPLES-2 [BJR+22] implements a least-squares algorithm to handle placements in large trees, integrating job-level parallelization in which each query is independently handled by a different CPU thread. A similar job-level parallel approach is implemented in App-SpaM, an alignment-free placement algorithm for short sequencing reads [BM21]. Alternatively, the UShER tool [TTH+21] adopts the phylogenetic parsimony function to guide the method, using CPU threads to parallelize independent Fitch-Sankoff computations. Finally, SCAMPP [WCW22] defines a serial heuristic based on Hamming distances to identify promising attachment points prior to the execution of a user-defined placement tool.
It can be concluded that most placement algorithms only incorporate basic parallelization strategies (or no parallelization at all), targeting CPU devices for execution. An exception to this rule is observed in [JBZ+22], where neural networks were adopted to learn patterns from the reference tree and the sequences, calculating embeddings that were later employed for query placement. To accelerate the network training, NVIDIA 2080Ti GPUs were used. Other hardware platforms (e.g. FPGAs) and energy-aware heterogeneous methods that efficiently orchestrate multiple devices remain unexplored. Another interesting aspect is that multiobjective optimization has never been investigated for this problem, which opens the door to parallel multiobjective metaheuristics that perform placements according to different criteria (e.g. likelihood and least squares) simultaneously.
As for the missing data imputation problem, two main classes of methods can be distinguished in the state of the art: 1) optimization methods and 2) machine learning approaches. Among the proposed optimization methods, we can highlight tools like LASSO [KDR+15], which defines heuristic strategies to infer rooted phylogenetic trees from distance matrices with missing values, under the molecular clock assumption. Rphylopars imputes missing observations by optimizing the log-likelihood of trait covariance using the available data [GBA17]. DAMBE [Xia18] uses a downhill simplex method based on the least-squares criterion for imputing distances. Mixed integer non-linear programming is employed in imPhy to infer missing distances in a set of gene trees [YVY+20]. Regarding machine learning approaches, the use of matrix factorization stands out: it was initially proposed to fill gaps in sparse trait matrices [SKS+15] and later extended to handle other sorts of missing data, such as missing distances [BB20]. The latter work also explored the use of autoencoders to reconstruct phylogenies from partial distance matrices, showing that machine learning approaches are able to handle matrices with over 50% of missing entries. We proposed in [PSI22] a random forest method guided by the least-squares criterion, which allowed the imputation of lost phylogenetic distances in scenarios with >60% of missing data.
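To illustrate the matrix-factorization idea in the simplest possible terms (a deliberately simplified sketch under assumed settings, not the actual algorithms of [SKS+15] or [BB20]), a low-rank factorization can be fitted to the observed entries of a symmetric distance matrix and then used to fill the missing ones:

```python
import numpy as np

def low_rank_impute(D: np.ndarray, rank: int = 2, iters: int = 2000,
                    lr: float = 0.01, seed: int = 0) -> np.ndarray:
    """Fill NaN entries of a symmetric distance matrix by fitting W @ W.T to the
    observed entries via gradient descent on the masked squared error."""
    rng = np.random.default_rng(seed)
    n = D.shape[0]
    mask = ~np.isnan(D)
    target = np.where(mask, D, 0.0)
    W = 0.1 * rng.standard_normal((n, rank))
    for _ in range(iters):
        error = (W @ W.T - target) * mask        # residuals on observed entries only
        W -= lr * 4.0 * error @ W                # gradient step (error is symmetric)
    completed = np.where(mask, D, W @ W.T)       # keep observed distances untouched
    np.fill_diagonal(completed, 0.0)
    return completed

# Illustrative 4x4 distance matrix with one missing pair (all values are made up).
D = np.array([[0.0,    0.3, np.nan, 0.5],
              [0.3,    0.0, 0.4,    0.6],
              [np.nan, 0.4, 0.0,    0.2],
              [0.5,    0.6, 0.2,    0.0]])
print(low_rank_impute(D))
```

In a real analysis, the imputed entries would then feed a distance-based reconstruction or placement method.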
These two classes of methods have different advantages and drawbacks, as shown in previous research [PSI22]. The optimization algorithms allow fast imputation of missing data. However, some of these methods are constrained by assumptions that may not hold in real-world data (e.g. molecular clocks in LASSO). More importantly, the current optimization methods do not lead to satisfactory solutions in scenarios with >20% of missing data. On the other hand, the machine learning tools can effectively conduct imputations with larger missing data percentages, at the expense of prohibitive execution times. For a moderate dataset with 201 sequences, the times required vary from 1.5h (autoencoder) to more than 48h (matrix factorization). It is worth noting that these state-of-the-art tools do not incorporate explicit parallelization strategies. Consequently, the development of cross-architecture and energy-aware parallel methods represents an open research issue in this problem. The combination of metaheuristics, specialized operators, and other search techniques is a promising strategy to overcome the deficiencies of the optimization methods. Finally, the use of multiobjective optimization to handle imputations under several criteria is also yet to be investigated.
[ABA+21] Alachiotis, N., Brokalakis, A., Amourgianos, V., Ioannidis, S., Malakonakis, P., Bokalidis, T. (2021) Accelerating Phylogenetics Using FPGAs in the Cloud. IEEE Micro 41(4), 24-30 (http://doi.org/10.1109/MM.2021.3075848)
[ASP+21] Alachiotis, N., Skrimponis, P., Pissadakis, M., Pnevmatikatos, D. (2021) Scalable Phylogeny Reconstruction with Disaggregated Near-memory Processing. ACM Trans. Reconfigurable Technol. Syst. 15(3), Art. 25, 1-32 (http://doi.org/10.1145/3484983)
[ACB+19] Ayres, D. L., Cummings, M. P., Baele, G. et al. (2019) BEAGLE 3: Improved performance, scaling, and usability for a high-performance computing library for statistical phylogenetics. Systematic Biology 68(6), 1052-1061 (http://doi.org/10.1093/sysbio/syz020)
[KDF+19] Kozlov, A. M., Darriba, D., Flouri, T., Morel, B., Stamatakis, A. (2019) RAxML-NG: a fast, scalable and user-friendly tool for maximum likelihood phylogenetic inference. Bioinformatics 35(21), 4453-4455 (http://doi.org/10.1093/bioinformatics/btz305)
[MSC+20] Minh, B. Q., Schmidt, H. A., Chernomor, O., Schrempf, D., Woodhams, M. D., von Haeseler, A., Lanfear, R. (2020) IQ-TREE 2: New Models and Efficient Methods for Phylogenetic Inference in the Genomic Era. Mol. Biol. Evol. 37(5), 1530-1534 (http://doi.org/10.1093/molbev/msaa015)
[MBB22] Muralidhar, R., Borovica-Gajic, R., Buyya, R. (2022) Energy Efficient Computing Systems: Architectures, Abstractions and Modeling to Techniques and Standards. ACM Computing Surveys 54(11s), Art. 236, 1-37 (http://doi.org/10.1145/3511094)
[NB21] Nozal, R., Bosque, J. L. (2021) Exploiting Co-execution with OneAPI: Heterogeneity from a Modern Perspective. In: L. Sousa et al. (Eds): Euro-Par 2021, LNCS 12820, 501-516 (http://doi.org/10.1007/978-3-030-85665-6_31)
[PSR+15] Pang, S., Stones, J. R., Ren, M.-M. et al. (2015) GPU MrBayes V3.1: MrBayes on Graphics Processing Units for Protein Sequence Data. Mol. Biol. Evol. 32(9), 2496-2497 (http://doi.org/10.1093/molbev/msv129)
[RAB+21] Reinders, J., Ashbaugh, B., Brodman, J., Kinsner, M., Pennycook, J., Tian, X. (2021) Data Parallel C++: Mastering DPC++ for Programming of Heterogeneous Systems using C++ and SYCL. Apress Open, Springer (http://doi.org/10.1007/978-1-4842-5574-2)
[SVS22] Santander-Jiménez, S., Vega-Rodríguez, M. A., Sousa, L. (2022) Exploiting Multi-level Parallel Metaheuristics and Heterogeneous Computing to Boost Phylogenetics. Future Generation Computer Systems 127, 208-224 (http://doi.org/10.1016/j.future.2021.09.011)
[SDS+20] Schwartz, R., Dodge, J., Smith, N. A., Etzioni, O. (2020) Green AI. Communications of the ACM 63(12), 54-63 (http://doi.org/10.1145/3381831)
[War17] Warnow, T. (2017) Computational Phylogenetics: An Introduction to Designing Methods for Phylogeny Estimation. Cambridge University Press (http://doi.org/10.1017/9781316882313)
[YTH+22] Ye, C., Thornlow, B., Hinrichs, A. et al. (2022) matOptimize: a parallel tree optimization method enables online phylogenetics for SARS-CoV-2. Bioinformatics 38(15), 3734-3740 (http://doi.org/10.1093/bioinformatics/btac401)
[YG20] Young, A. D., Gillung, J. P. (2020) Phylogenomics – principles, opportunities and pitfalls of big-data phylogenetics. Systematic Entomology 45(2), 225-247 (http://doi.org/10.1111/syen.12406)
[BJR+22] Balaban, M., Jiang, Y., Roush, D., Zhu, Q., Mirarab, S. (2022) Fast and accurate distance-based phylogenetic placement using divide and conquer. Mol. Ecol. Resour. 22(3), 1213-1227 (http://doi.org/10.1111/1755-0998.13527)
[BKC+19] Barbera, P., Kozlov, A. M., Czech, L., Morel, B., Darriba, D., Flouri, T., Stamatakis, A. (2019) EPA-ng: Massively Parallel Evolutionary Placement of Genetic Sequences. Systematic Biology 68(2), 365-369 (http://doi.org/10.1093/sysbio/syy054)
[BS21] Barbera, P., Stamatakis, A. (2021) Efficient Memory Management in Likelihood-based Phylogenetic Placement. In: 2021 IEEE IPDPS Workshops (IPDPSW), 218-227 (http://doi.org/10.1109/IPDPSW52791.2021.00041)
[BM21] Blanke, M., Morgenstern, B. (2021) App-SpaM: phylogenetic placement of short reads without sequence alignment. Bioinformatics Advances 1(1), Art. vbab027, 1-9 (http://doi.org/10.1093/bioadv/vbab027)
[CSD+22] Czech, L., Stamatakis, A., Dunthorn, M., Barbera, P. (2022) Metagenomic Analysis Using Phylogenetic Placement - A Review of the First Decade. Frontiers in Bioinform. 2, 871393, 1-25 (http://doi.org/10.3389/fbinf.2022.871393)
[JBZ+22] Jiang, Y., Balaban, M., Zhu, Q., Mirarab, S. (2022) DEPP: Deep Learning Enables Extending Species Trees using Single Genes. Systematic Biology, Art. syac031, 1-63 (http://doi.org/10.1093/sysbio/syac031)
[LSP19] Linard, B., Swenson, K., Pardi, F. (2019) Rapid alignment-free phylogenetic identification of metagenomic sequences. Bioinformatics 35(18), 3303-3312 (http://doi.org/10.1093/bioinformatics/btz068)
[TTH+21] Turakhia, Y., Thornlow, B., Hinrichs, A. S. et al. (2021) Ultrafast Sample placement on Existing tRees (UShER) enables real-time phylogenetics for the SARS-CoV-2 pandemic. Nature Genetics 53(6), 809-816 (http://doi.org/10.1038/s41588-021-00862-7)
[WCW22] Wedell, E., Cai, Y., Warnow, T. (2022) SCAMPP: Scaling Alignment-based Phylogenetic Placement to Large Trees. IEEE/ACM Trans. Comput. Biol. Bioinform. (Early Access), 1-14 (http://doi.org/10.1109/TCBB.2022.3170386)
[BB20] Bhattacharjee, A., Bayzid, M. S. (2020) Machine learning based imputation techniques for estimating phylogenetic trees from incomplete distance matrices. BMC Genomics 21, Art. 497, 1-14 (http://doi.org/10.1186/s12864-020-06892-5)
[GBA17] Goolsby, E. W., Bruggeman, J., Ané, C. (2017) Rphylopars: fast multivariate phylogenetic comparative methods for missing data and within-species variation. Methods Ecol. Evol. 8(1), 22-27 (http://doi.org/10.1111/2041-210X.12612)
[KDR+15] Kettleborough, G., Dicks, J., Roberts, I. N., Huber, K. T. (2015) Reconstructing (Super)Trees from Data Sets with Missing Distances: Not All Is Lost. Mol. Biol. Evol. 32(6), 1628-1642 (http://doi.org/10.1093/molbev/msv027)
[KSM+17] Kocot, K. M., Struck, T. H., Merkel, J. et al. (2017) Phylogenomics of Lophotrochozoa with Consideration of Systematic Error. Systematic Biology 66(2), 256-282 (http://doi.org/10.1093/sysbio/syw079)
[PSI22] Pinheiro, D., Santander-Jiménez, S., Ilic, A. (2022) PhyloMissForest: a random forest framework to construct phylogenetic trees with missing data. BMC Genomics 23, Art. 377, 1-21 (http://doi.org/10.1186/s12864-022-08540-6)
[RBP13] Roure, B., Baurain, D., Philippe, H. (2013) Impact of Missing Data on Phylogenies Inferred from Empirical Phylogenomic Data Sets. Mol. Biol. Evol. 30(1), 197-214 (http://doi.org/10.1093/molbev/mss208)
[SKS+15] Schrodt, F., Kattge, J., Shan, H. et al. (2015) BHPMF – a hierarchical Bayesian approach to gap-filling and trait prediction for macroecology and functional biogeography. Glob. Ecol. Bio. 24, 1510-1521 (http://doi.org/10.1111/geb.12335)
[Xia18] Xia, X. (2018) Imputing missing distances in molecular phylogenetics. PeerJ 6, Art. e5321, 1-17 (http://doi.org/10.7717/peerj.5321)
[YVY+20] Yasui, N., Vogiatzis, C., Yoshida, R., Fukumizu, K. (2020) imPhy: Imputing Phylogenetic Trees with Missing Information Using Mathematical Programming. IEEE/ACM Trans. Comput. Biol. Bioinform. 17(4), 1222-1230 (http://doi.org/10.1109/TCBB.2018.2884459)