I studied Latin for four years. Reading it was not a problem. Translating it, though, was another issue.

Once we moved on from Tres Porci Parvi, it was a long and painful process that was only cut short by locating Michael Raship in the cafeteria and asking this gifted student what a passage meant.

It can be a lot like that when it comes to using whole genome sequencing to diagnose rare disease patients. One of the most frustrating realities rare disease patients may confront when they finally get their genome sequenced is that rather than providing the answer they seek, the process often leaves questions unresolved.

When Gill Bejerano started tinkering with the problem about five years ago, about 10 percent of patients with a rare genetic disease would find that sequencing resulted in a definitive diagnosis. Now, he said, results provide an answer between 30 percent and 50 percent of the time.

Bejerano, an associate professor of developmental biology and computer science at Stanford University, has seen the rapid pace of improvement in sequencing. The increase in the speed of sequencing and the rapidly falling cost will now allow hundreds of thousands of patients, and eventually millions, to get sequenced. The problem, though, is that while sequencing itself has become much faster, interpreting the results still relies on a skilled professional engaging in a labor-intensive process.

Though a patient may have a genetic disease caused by a mutation in a single gene, genetic sequencing typically identifies scores of genes with mutations. Diagnosticians are left with the task of finding the most likely genetic culprit by comparing the patient’s symptoms to identified genetic variants and seeking to match genotype to phenotype.

In the absence of computer assistance, it can take 20 to 40 hours to scour the scientific literature, look up each suspect gene, and match it against a patient’s symptoms to identify the gene most likely to be responsible for causing a condition and provide a diagnosis. Though there are some computational tools that have helped automate the process, for now they are imperfect, and only as good as the data with which they are working.

In a paper published in Genetics in Medicine in July, Bejerano and his colleagues described Phrank, an algorithm they developed to automate the process of matching genotype to phenotype and ranking the most likely disease. Phrank can also be used to determine the most likely disease-causing gene in a patient. Unlike other similar tools that exist today, it is not married to a single database, which means a researcher is free to use Phrank with the database of his or her choosing.

Bejerano compares the decoupling of Phrank to the components of an automobile, saying that rather than being stuck with the manufacturer’s choices, researchers should be able to select the best engine, wheels, and axles to suit their needs. “That’s not done so much in this field,” he said. “Stop offering me solutions that have seven components mushed together where I can’t separate them.”
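
For readers who think in code, the general idea of ranking candidate genes against a patient’s symptoms, with the knowledge base supplied as a swappable input, might be sketched roughly as follows. This is a toy illustration and not Phrank’s actual scoring method: the gene names, the phenotype sets, the `rank_genes` and `information_content` functions, and the choice to weight rarer phenotypes more heavily are all assumptions made for the example.

```python
import math


def information_content(phenotype, gene_to_phenotypes):
    """Weight a phenotype by rarity: the fewer genes annotated with it, the higher the weight."""
    n_genes = len(gene_to_phenotypes)
    n_with = sum(1 for phs in gene_to_phenotypes.values() if phenotype in phs)
    return -math.log2(n_with / n_genes) if n_with else 0.0


def rank_genes(patient_phenotypes, candidate_genes, gene_to_phenotypes):
    """Order candidate genes by the summed weight of phenotypes shared with the patient.

    The knowledge base is passed in as an argument, so any gene-to-phenotype
    mapping could be plugged in without changing the ranking code.
    """
    scores = {}
    for gene in candidate_genes:
        shared = patient_phenotypes & gene_to_phenotypes.get(gene, set())
        scores[gene] = sum(information_content(p, gene_to_phenotypes) for p in shared)
    return sorted(candidate_genes, key=lambda g: scores[g], reverse=True)


# Hypothetical knowledge base; real ones hold thousands of genes and phenotype terms.
KNOWLEDGE_BASE = {
    "GENE_A": {"seizures", "microcephaly"},
    "GENE_B": {"seizures"},
    "GENE_C": {"short stature"},
    "GENE_D": {"seizures", "hearing loss"},
}

patient = {"seizures", "microcephaly"}
ranking = rank_genes(patient, ["GENE_B", "GENE_C", "GENE_A"], KNOWLEDGE_BASE)
print(ranking)  # ['GENE_A', 'GENE_B', 'GENE_C']
```

In this made-up case GENE_A rises to the top because it shares the rarer symptom with the patient, while a common symptom such as seizures contributes little on its own.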

Phrank, in simple terms, uses a ranking process to identify the most likely disease or gene underlying a patient’s condition. To test how well it worked, Bejerano compared it to a handful of other similar ranking tools that researchers use. The one catch: while other tools have been put to the test with simulated data, Bejerano used actual patient data to put the different tools through their paces.

“Real patients are more complicated,” he said. “They don’t manifest what’s in the textbook.”

In the past, when such tools were tested, researchers relied on simulated patients. In those circumstances, some of these methods ranked the causative gene in 90 percent of the cases. Real-world data is another story.

Bejerano and his team curated a set of 169 patients from the Deciphering Developmental Disorders study. On average, each patient in the final set has 7.5 phenotypes and a candidate list of 278.8 genes. Phrank on average ranked the correct disease in the top four. That compares with an average rank of 32 for the next-best tool.
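
The kind of comparison described here, the average position at which a tool places the known answer, can be illustrated with a few lines of code. The cases below are invented for the example and are not the study’s data, and the function names `causative_rank` and `average_rank` are hypothetical.

```python
from statistics import mean


def causative_rank(ranked_candidates, causative):
    """Return the 1-based position of the known answer in a tool's ranked list."""
    return ranked_candidates.index(causative) + 1


def average_rank(cases):
    """Average the position of the known answer across patients; lower is better."""
    return mean(causative_rank(ranked, answer) for ranked, answer in cases)


# Hypothetical output of one tool for three patients: (ranked list, known answer).
cases = [
    (["G1", "G2", "G3"], "G1"),  # known answer ranked 1st
    (["G4", "G5", "G6"], "G6"),  # known answer ranked 3rd
    (["G7", "G8", "G9"], "G8"),  # known answer ranked 2nd
]
print(average_rank(cases))  # 2
```

A tool with an average rank of four asks a diagnostician to review far fewer candidates per patient than one with an average rank of 32.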

Bejerano is quick to note that Phrank has room for improvement, particularly in the ability to rank genes. It was able to rank the causative gene in the top 9 on average. Bejerano said this should be in the top 5 with 90 percent certainty. Still, it outperformed competitors. He said such improvements will be critical in addressing the flood of genomic data that is coming.

The Phrank code is available on the Bejerano Lab website under the publications tab for researchers to download for free. For those without the technical prowess to incorporate the code as needed, the website also provides a link to the AMELIE database of Mendelian disease genes, which incorporates Phrank.

As the knowledge of genetic diseases grows, researchers will be better able to diagnose these conditions. Already researchers have been able to revisit patient sequencing data that a few years ago was inconclusive and provide a diagnosis without sequencing the patient again. Bejerano envisions a constant process where intelligent systems can provide new answers as information becomes available.

“We are realizing a small part of what will become the future of diagnostic systems,” said Bejerano. “Smart systems will sit in hospitals and look at the patients at all times. When they have something valuable to say, they will alert the medical staff.”

August 29, 2018
Photo: Gill Bejerano, associate professor of developmental biology and computer science at Stanford University