Genomic Analysis at Scale: Mapping Irregular Computations to Advanced Architectures
Genomic data sets are growing dramatically as the cost of sequencing continues to decline and community databases are built to store and share this data with the research community. Some of the resulting data analysis problems require large-scale parallel platforms to meet both the memory and computational requirements of these data sets. These applications differ from the scientific simulations that dominate the workload on high-end parallel systems today, and they place different requirements on programming support, software libraries, and parallel architectural design. The tools in common use often run only on shared-memory machines; on distributed-memory systems, they involve irregular communication patterns such as asynchronous updates to shared data structures. The ExaBiome project at Berkeley Lab, part of the Department of Energy’s Exascale Computing Project, is developing high-performance tools for analyzing microbial data. I will give an overview of several high-performance genomic analysis problems, including alignment, profiling, clustering, and assembly, and describe some of the challenges and opportunities of mapping these to petascale and exascale architectures. I will also describe some of the common computational patterns or “motifs” that inform parallelization strategies and can be useful in understanding architectural requirements, algorithmic approaches, and benchmarking of current and future systems. The project team is pursuing two general approaches to these problems: one based on asynchronous one-sided communication in UPC++ to build counting data structures, hash tables, and graphs, and another based on bulk-synchronous collectives to build sparse matrix analogs of these data structures. In both cases, as with all computations on modern systems, locality optimization and communication avoidance are key, but the optimization approaches differ somewhat given the sparse and unstructured nature of the underlying data structures.
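As a rough illustration of the counting pattern mentioned above, the following Python sketch simulates a hash-partitioned k-mer count in a bulk-synchronous style. It is not ExaBiome code and does not use UPC++: the ranks are simulated with plain lists in one process, the exchange step merely stands in for an all-to-all collective, and the function names (`kmers`, `owner`, `count_kmers_partitioned`) and the choice of CRC32 as the partitioning hash are assumptions made for this example.

```python
from collections import Counter
import zlib


def kmers(read, k):
    """Yield every length-k substring (k-mer) of a read."""
    for i in range(len(read) - k + 1):
        yield read[i:i + k]


def owner(kmer, nranks):
    """Map a k-mer to the simulated rank that owns its table entry.

    CRC32 is used only because it is deterministic across runs;
    the choice of hash is an assumption of this sketch.
    """
    return zlib.crc32(kmer.encode("ascii")) % nranks


def count_kmers_partitioned(reads_per_rank, k):
    """Count k-mers with one hash-table partition per simulated rank."""
    nranks = len(reads_per_rank)

    # Phase 1 (local): each rank buckets its k-mers by destination owner.
    outgoing = [[[] for _ in range(nranks)] for _ in range(nranks)]
    for src, reads in enumerate(reads_per_rank):
        for read in reads:
            for km in kmers(read, k):
                outgoing[src][owner(km, nranks)].append(km)

    # Phase 2 (exchange) and Phase 3 (local count): reading
    # outgoing[src][dst] stands in for an all-to-all collective,
    # after which each owner counts only the k-mers it received.
    tables = [Counter() for _ in range(nranks)]
    for dst in range(nranks):
        for src in range(nranks):
            tables[dst].update(outgoing[src][dst])
    return tables


if __name__ == "__main__":
    # Two simulated ranks, one short read each, k = 3.
    for rank, table in enumerate(count_kmers_partitioned([["ACGTAC"], ["GTACGT"]], 3)):
        print(rank, dict(table))
```

Because every occurrence of a given k-mer hashes to the same owner, each partition can be updated without coordination between owners. In an asynchronous one-sided variant, each rank would instead send an update to the owner as soon as a k-mer is produced, rather than batching all updates into a single exchange.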
Katherine Yelick is the Vice Chancellor for Research at the University of California, Berkeley, where she is also the Robert S. Pepper Distinguished Professor of Electrical Engineering and Computer Sciences. She is a Senior Faculty Scientist at Lawrence Berkeley National Laboratory. Her research is in high performance computing, programming systems, parallel algorithms, and computational genomics, and she currently leads the ExaBiome project on Exascale Solutions for Microbiome Analysis.
Yelick was Director of the National Energy Research Scientific Computing Center (NERSC) from 2008 to 2012 and led the Computing Sciences Area at Lawrence Berkeley National Laboratory from 2010 through 2019, where she oversaw NERSC, the Energy Sciences Network (ESnet), and the Computational Research Division. She earned her Ph.D. in Electrical Engineering and Computer Science from MIT and is a member of the National Academy of Engineering and the American Academy of Arts and Sciences. She is a Fellow of the Association for Computing Machinery (ACM) and the American Association for the Advancement of Science (AAAS), and she is a recipient of the ACM/IEEE Ken Kennedy Award and the ACM-W Athena Award. Yelick was an Associate Dean in the Division of Computing, Data Science, and Society from 2020 to 2021 and became Vice Chancellor for Research at UC Berkeley in January 2022.