Scientists Release New Pangenome Reference to Reflect Human Diversity
May 10, 2023
Rare Daily Staff
Researchers have released a new collection of reference human genome sequences that they say captures substantially more diversity from different human populations than what was previously available, a development expected to improve the ability to diagnose genetic diseases.
The new pangenome reference includes genome sequences of 47 people, with researchers setting a goal of increasing that number to 350 by mid-2024. With each person carrying a paired set of chromosomes, the current reference actually includes 94 distinct genome sequences, with a goal of reaching 700 distinct genome sequences by the completion of the project. The work, led by the international Human Pangenome Reference Consortium, appears in the journal Nature.
The Human Pangenome Reference Consortium is funded by the National Institutes of Health’s National Human Genome Research Institute with about $40 million over five years. The funding includes efforts to create the human pangenome reference, improve DNA sequencing technology, operate a coordinating center, conduct outreach and create resources for the research community to use the pangenome reference.
A genome is the set of DNA instructions that helps each living creature develop and function. Genome sequences differ slightly among individuals. In the case of humans, any two people’s genomes are, on average, more than 99 percent identical. The differences can provide insights about a person’s health, helping to diagnose disease, predict outcomes and guide medical treatments.
To understand genomic differences, scientists create reference human genome sequences for use as a standard—a digital amalgamation of human genome sequences that can be used as a comparison to align, assemble, and study other human genome sequences. The original reference human genome sequence is nearly 20 years old. While it has been regularly updated to fix errors and include newly discovered regions of the human genome, scientists say it is fundamentally limited in its representation of the diversity of the human species. The current reference genome is constructed from the genomes of about 20 people, but most of the reference sequence is from one person.
The current reference human genome sequence has gaps that reflect missing information, especially in regions that are repetitive and hard to read. Recent technological advances, such as long-read DNA sequencing, which reads longer stretches of DNA at a time, have helped researchers fill in those gaps to create the first complete human genome sequence. This complete human genome sequence, released last year by the NIH-funded Telomere-to-Telomere (T2T) consortium, is incorporated into the current pangenome reference. In fact, many of the T2T researchers are also members of the Human Pangenome Reference Consortium.
“Everyone has a unique genome, so using a single reference genome sequence for every person can lead to inequities in genomic analyses,” said Adam Phillippy, senior investigator in the Computational and Statistical Genomics Branch within NHGRI’s Intramural Research Program and a co-author of the main study. “For example, predicting a genetic disease might not work as well for someone whose genome is more different from the reference genome.”
Using advanced computational techniques to align the various genome sequences, the researchers constructed a new human pangenome reference with each assembly in the pangenome covering more than 99 percent of the expected sequence with more than 99 percent accuracy. It also builds upon the previous reference genome sequence, adding more than 100 million new bases.
While the previous reference genome sequence was single and linear, the new pangenome represents different versions of the human genome sequence at the same time. Scientists said this gives researchers a wider range of options for using the pangenome in analyzing other human genome sequences.
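To make the contrast with a single linear reference concrete, here is a minimal Python sketch of the general idea of a sequence graph, in which several genomes are stored as paths through shared segments. It is an illustration only, not the consortium's actual data model or software: the segment contents, the haplotype names (`hap_A`, `hap_B`, `hap_C`) and the helper functions `spell` and `shared_nodes` are all hypothetical.

```python
# Toy sequence graph (hypothetical data, for illustration only).
# Each node holds a short DNA segment.
nodes = {
    1: "ACGT",
    2: "G",         # one version of a variable site
    3: "T",         # the alternative version of the same site
    4: "CCA",
    5: "TTTTGGGG",  # a longer segment present in only some genomes
    6: "AT",
}

# Each genome (haplotype) is an ordered path of node ids through the graph.
# A single linear reference would correspond to just one of these paths.
paths = {
    "hap_A": [1, 2, 4, 6],
    "hap_B": [1, 3, 4, 5, 6],
    "hap_C": [1, 2, 4, 5, 6],
}


def spell(path):
    """Concatenate node segments along a path to recover one genome sequence."""
    return "".join(nodes[node_id] for node_id in path)


def shared_nodes(all_paths):
    """Return the node ids that every path passes through."""
    return set.intersection(*(set(p) for p in all_paths.values()))


sequences = {name: spell(path) for name, path in paths.items()}
core = shared_nodes(paths)    # segments common to all genomes
variable = set(nodes) - core  # segments where the genomes differ

for name, seq in sequences.items():
    print(name, seq)
print("shared segments:", sorted(core))
print("variable segments:", sorted(variable))
```

In this toy example the longer segment in node 5 stands in for a structural variant: it is simply an extra stretch of the path for the genomes that carry it, rather than something that must be inferred against a reference that lacks it.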
“By using the pangenome reference, we can more accurately identify larger genomic variants called structural variants,” said Mobin Asri, a Ph.D. student at the University of California, Santa Cruz, and co-first author of the paper. “We are able to find variants that were not identified using previous methods that depend on linear reference sequences.”
Structural variants can involve thousands of bases. Until now, researchers using short-read sequencing have been unable to identify the majority of structural variants in each human genome because of the bias introduced by relying on a single reference sequence.
“The human pangenome reference will enable us to represent tens of thousands of novel genomic variants in regions of the genome that were previously inaccessible,” said Wen-Wei Liao, a Ph.D. student at Yale University and co-first author of the paper. “With a pangenome reference, we can accelerate clinical research by improving our understanding of the link between genes and disease traits.”