RARE Daily

Changing the Understanding of Rare Diseases with Data

April 3, 2024

Last year, Global Genes launched the inaugural Xcelerate RARE Open Science Data Challenge. The event brought together 24 teams of clinical researchers, data scientists, and biostatisticians from around the world. The teams competed to make the best use of patient-provided data to address significant challenges to research and medical management for patients with rare neurological diseases. A total of 27 different ultra-rare neurodevelopmental disease communities collaborated in the challenge. The results showed the value of open science environments where stakeholders with diverse backgrounds and perspectives come together to answer important healthcare questions. We spoke to Karmen Trzupek, senior director of scientific programs for Global Genes, about Xcelerate RARE, how the effort is helping to change the understanding of the neurodevelopmental diseases involved in the challenge, and why it more broadly validates RARE-X’s approach to collaborating with patient communities to collect high-quality, research grade patient data.

Daniel Levine: Karmen, thanks for joining us.

Karmen Trzupek: Thank you so much. It’s nice to be here.

Daniel Levine: We’re going to talk about Xcelerate RARE, RARE-X’s open science data challenge, the impressive results that came from that, and what this might suggest as validation for the broader work RARE-X is doing. For listeners not familiar with RARE-X, can you explain what it is?

Karmen Trzupek: Absolutely. So RARE-X is a data collection and data sharing platform. RARE-X was founded to solve a lot of the problems and barriers that occur in rare disease data today. So it’s very common in rare diseases that it’s hard to find data, and when you do find data, it’s hard to find good data. So this can look like several things. Often data in a rare disease may exist in a natural history study that’s at a particular academic medical center or maybe a couple of academic medical centers, but it can be really hard to access that data, if not impossible, or it might occur in a registry through a biopharma company that needs to do that for post-marketing reasons. But broadly, it’s hard for many researchers to access that data. So RARE-X really exists to provide a platform where any rare disease group, no matter how small, can come and start to collect really structured research grade data and have that data made available on an open science platform where any qualified researcher can access it.

Daniel Levine: RARE-X held its first open science data challenge from May to August last year. Can you explain what the open science data challenge was and how it worked?

Karmen Trzupek: Sure. So, I think when RARE-X was started, there was this idea that there are so many researchers who want this data that as soon as we get it out there, it’s going to be used all the time. And the reality is that researchers are really, really busy and it’s hard to access a new data set and to learn about the structure of that data and to really dive deeply into it. So we definitely have some researchers who are actively using this data. We have more researchers becoming interested all the time, but one of the reasons we decided to host an open science data challenge was to get more researchers just using this data, knowing that it exists, using it, starting to become comfortable with it so that they would think about it for their future research. The other reason we decided to do it honestly is that the kinds of researchers who traditionally would request access to the RARE-X dataset are clinical researchers who are already familiar with these rare diseases.

But there are lots of hackathons and other types of data challenges that have been held that have shown us that if you make data available and you make it interesting and even kind of fun and you make it available to any kind of researcher and you don’t put any restrictions on what that researcher is supposed to look like or what kind of environment they’re supposed to come from, what you find is that you have lots of different types of people with lots of different types of backgrounds who start to look at the data in new and interesting ways. So we were really interested in seeing what would happen, what would happen if you make that data available to people with say, a statistics background but not a biostatistics background, or if you make it available to teams of people where you have different individuals coming in with different kinds of skill sets and backgrounds that can complement each other and look at the data and challenge each other in different ways.

So, I guess when we think about how that works, we worked with a platform partner called Sage Bionetworks, and they have a nonprofit platform called the DREAM Challenge platform. And so we worked with them to hold the data there and to structure the data in a way that it could be used easily by research groups. And then we created challenge questions and we created kind of a challenge atmosphere where the idea is for it to be a little bit competitive, a little bit fun, and of course to come up with some solutions or challenge answers that are really beneficial to these rare disease families.

Daniel Levine: As part of this effort, there was a push to get patients and their families to engage in the RARE-X platform. How did that work out?

Karmen Trzupek: Yeah, that’s right. I mean, ultimately there is no data if you don’t have people giving you the data. And this is a platform that is fueled entirely by patients and families participating. So if we’re talking about a pediatric neurologic disease, we’re really talking about caregivers providing this data. The type of data that we collect, I think of it in kind of two different types. We have the type of data that is more symptom-based data that participants would provide (I’ll say participants here, so that could be a patient or a caregiver), where that really provides a real description of the disease. And when I say description, I mean that in a cumulative way. So, all of the data that’s collected is individual discrete data elements that can be analyzed really mathematically, analytically. So, this is not paragraph answers.

The other type of data that patients and families respond to, sort of generate themselves, is what’s called a clinical outcome assessment. So, there are lots of clinical outcome assessments in medicine. Many of them are administered in the clinical setting, often in an academic medical institution designed to assess the impact of particular symptoms or disease characteristics. And those are often highly standardized, validated for certain disease areas. And we license and implement those on our platform so that we get this clinical grade data from participants who can answer it from home instead of traveling to a medical center to collect this data.

There is a third type of data that they don’t generate themselves, but they share with us, and that meaningfully includes genetic data. So if they’ve had prior genetic testing, which most of the participants in this open science data challenge, most of them had genetic testing done before, they can upload a genetic test report. And it doesn’t matter if that’s in a PDF or it’s a picture they took on their phone, our team curates that data. So, we have an expert curation process where we go through and curate that data and make that data available on the platform. And so, when we worked with the patient communities to prepare for the open science data challenge, we really looked critically at these three types of data: at the symptom-based data, clinical outcome assessment measures, and the genetic data. And we came up with a prioritized list of what would be most impactful for those communities to try to complete, because the reality is that a lot of these families have children with very, very severe and complex medical conditions, and despite best intentions, they might not get it all done. So we really worked with the patient communities to develop what is a prioritized kind of minimum data set that would be really important to get as many people in your community to complete as possible. And then that’s going to be our core data set for the challenge. So it was a very, I would say, communicative back and forth process with those rare disease communities and trying to reach those goals.

Daniel Levine: What types of information did RARE-X seek from participants?

Karmen Trzupek: Well, I think when I talk about the prioritization, getting that genetic data of course was really valuable and important. The symptom-based data is also highly structured, just in a different way. So we really prioritized getting some of that symptom-based data. And then we worked together with some researchers to determine of the many different clinical outcome assessment measures, what are the ones that are likely most critical across this group of rare diseases? Because one thing that’s quite different about this challenge and the way that we approached it, compared to many others, is that we really looked at it from a cross-disorder perspective. So there have been some open science data challenges in, for example, neurofibromatosis. So they were very valuable to us. Actually, their chief scientific officer served as an advisor to us because they’ve done like five years in a row, they’ve done these evolving open science data challenges, and in that condition, they’re really focused on one condition—neurofibromatosis. But here our open science data challenge included 27 different rare disorders. So we had to look at what are the through lines, what are the common threads between them when we create that minimum dataset.

Daniel Levine: The data challenge focused on neurodevelopmental diseases. Why that focus for this first challenge?

Karmen Trzupek: So RARE-X is not very old. RARE-X has only been in existence for a few years live collecting data, and we started in pediatric neurodevelopmental disorders. We are a rare disease platform. If you think about the impact of rare diseases broadly, about 80 percent of all rare disorders include a significant neurologic component and neurologic symptoms. Also, if you look across all rare diseases, the vast majority of them are pediatric onset. So as a platform, we really started out in pediatric onset neurologic diseases, and we take the approach as we grow the RARE-X platform, we take the approach of taking on groups of disorders at a time and in order, we took on neurodevelopmental disorders first, then pediatric onset neurodegenerative disorders, then neuromuscular conditions across the lifespan, and now we’re continuing to grow beyond that. But because we began in neurodevelopmental disorders, that’s where we had the most data to start with. So it made the most sense to start there.

Daniel Levine: I think you mentioned there were 27 diseases. Can you give some sense of how broad the participation was in the type of conditions that were represented?

Karmen Trzupek: Sure. In general, a neurodevelopmental disorder is one in which a child has a pretty complex neurologic condition that can look like a lot of things. There are some neurodevelopmental conditions where children really struggle to walk and they have a pretty significant kind of motor condition associated with their disease. There are other conditions that have much less of a motor condition, but they might have a pretty severe seizure disorder. Nearly all of these are going to have impacts to brain and cognition. So many of these children, of course, have impaired cognitive functioning. Some of them have extremely impaired speech to the point that they might not have any verbal communication, though of course many of these children do still communicate. But it is a really wide spectrum, from conditions with a severe seizure disorder and severe cognitive impacts to much milder conditions, though still quite significant to these families.

Daniel Levine: The challenge had three separate tasks. What were they?

Karmen Trzupek: Yeah, that’s right. So we had three separate challenge questions or tasks that we posed, and we decided to do this for a couple of reasons. One, of course, is to make the most of it. So we wanted to try to get as much value for these communities out of the challenges we could. But the other is because we really were thinking about the different types of researchers who might participate. So, the first task, the challenge question, we asked researchers if they could identify previously underrecognized or under-reported symptoms associated with a particular condition. And this is something that we knew going in is a task question that is really amenable to a variety of different statistical methods. And from our perspective of bringing in different kinds of researchers, that was super exciting to think about getting people looking at this data who wouldn’t normally think about this data.

The other thing that’s really important about this particular question is that it is so meaningful to some of these ultra-rare conditions. Rare conditions really span the gamut. Some of the conditions on our platform are so ultra-rare that we may only know of a hundred people in the world affected with this condition, sometimes less. And so for some of these more newly recognized conditions that are starting to be diagnosed through whole exome or whole genome sequencing, there’s just no data out there. And being able to really characterize this disease for the first time and get something out in the literature about it is really important. And developing really statistically valid, rigorous methods of how we define the phenotypic spectrum of these conditions is something that we’re really excited to figure out how to do and how to do it reproducibly. So that was task question number one.

The second challenge question was one that we really designed thinking about how a lot of hackathons work. So in a lot of hackathons, one, if not the primary challenge question or task, is related to developing some kind of machine learning model or tool or algorithm that enables you to take a test dataset and to develop a predictive algorithm and then to use it on a withheld dataset, and to see if your predictions ring true. And so this is what a lot of people are used to if they’ve participated in hackathons before. It’s really fun for a lot of people who like to participate in hackathons to develop these predictive analytics. So we thought about how can we use that on our dataset? And what we decided to do was to challenge researchers to come up with a machine learning algorithm that would try to predict the molecular diagnosis given phenotype information.

So, I already told you here that we have 27 rare diseases that are all neurodevelopmental conditions. Some of them are associated with pretty severe seizure disorders, some are not. Some are associated with more of a motor kind of phenotype, some are not. So there are clear differences, but there’s also a ton of overlap. So if we put machine learning methodologies to work against the dataset, can we meaningfully predict the molecular diagnosis? Because if we can, that might be useful in thinking about clinical algorithms in how to work up patients.
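For readers who have not taken part in this kind of challenge, the setup described here (fit a model on one set of records, predict on a withheld set, and compare against a baseline) can be sketched in a few lines of Python. This is a minimal illustration only, not any team's submission: the symptom names, the `GENE_A`/`GENE_B` labels, the nearest-neighbor rule, and all function names are invented for the example, and the toy accuracy figures say nothing about the actual challenge results.

```python
from collections import Counter

# Hypothetical toy records: (set of reported symptoms, molecular diagnosis).
# Symptom names and GENE_A / GENE_B labels are invented for illustration.
train = [
    ({"seizures", "speech_delay"}, "GENE_A"),
    ({"seizures", "bruxism"}, "GENE_A"),
    ({"seizures", "hypotonia"}, "GENE_A"),
    ({"motor_delay", "hypotonia"}, "GENE_B"),
    ({"motor_delay", "speech_delay"}, "GENE_B"),
]
# Withheld records: never used to build the model, only to score it.
held_out = [
    ({"seizures", "bruxism", "speech_delay"}, "GENE_A"),
    ({"motor_delay", "hypotonia", "speech_delay"}, "GENE_B"),
    ({"motor_delay"}, "GENE_B"),
    ({"seizures"}, "GENE_A"),
]


def majority_baseline(train_rows):
    """Baseline: always predict the most common diagnosis in the training set."""
    most_common = Counter(label for _, label in train_rows).most_common(1)[0][0]
    return lambda symptoms: most_common


def nearest_neighbor(train_rows):
    """Predict the diagnosis of the training record with the most similar symptoms."""
    def jaccard(a, b):
        # Overlap of two symptom sets; assumes at least one set is non-empty.
        return len(a & b) / len(a | b)

    def predict(symptoms):
        return max(train_rows, key=lambda row: jaccard(symptoms, row[0]))[1]

    return predict


def accuracy(predict, rows):
    """Fraction of rows whose predicted diagnosis matches the known one."""
    return sum(predict(symptoms) == label for symptoms, label in rows) / len(rows)


baseline_acc = accuracy(majority_baseline(train), held_out)
model_acc = accuracy(nearest_neighbor(train), held_out)
print(f"baseline accuracy: {baseline_acc:.2f}, model accuracy: {model_acc:.2f}")
```

The point of the baseline is that a model only counts as useful if it beats a trivial guess on data it has never seen; with heavily overlapping phenotypes, that margin can be small.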

And then the third challenge question or task that we posed was very open-ended and very much intended for clinical research groups who were already doing research in rare diseases. And that was to take a previous hypothesis in any way related to a therapy-minded idea. So this can be a therapeutic hypothesis, or it can be sort of an early idea leading to a potential type of therapeutic approach, and then use our dataset to either further or to refute that hypothesis. And they didn’t have to come in with their own hypothesis to begin with. We actually worked with a partner at NetraMark who teed up a couple of hypotheses for groups to chew on.

Daniel Levine: And while you had the challenge questions, I take it there was a greater goal beyond that. What was that?

Karmen Trzupek: I think the greater goals for sure are that we learn how we can apply some of these methodologies to rare diseases, and really to put a point on it, to smaller data sets than what’s been done in the past. Prior to the work that we did here, when I think about some kind of a data challenge related to healthcare, I think about things like how can you look at MRI images and try to predict progression in cancer using tens of thousands of MRIs and known outcomes in those patients. So these are massive data sets, just massive. And I think one of the things that is really important to us is how can we take some of these approaches that are being utilized today—some of these AI-based, sometimes machine learning, sometimes not, approaches. How can we take that and think about how to use it on smaller data sets in rare diseases and get something truly meaningful out of it, because that can change the face of rare disease research.

Daniel Levine: I wanted to focus on the first of the three tasks because that was the one where you sought to identify previously unrecognized symptoms associated with different neurodevelopmental conditions. One of the interesting aspects of this challenge was that it brought together interdisciplinary teams and involved people who don’t normally work on rare disease. What unique approaches did participants use?

Karmen Trzupek: Well, I have to tell you, they were really interesting and unique. The winning teams for this task used a variety of different statistical methods, ranging from natural language processing to much more traditional biostatistics approaches. They pulled in multiple external data sets, which was part of the task given to them, because if we’re going to say this is unrecognized or underrecognized, you only know that if you compare it to what’s already known and existing. Some of them pulled in data from Orphanet and OMIM, some of them did much, much deeper kind of data pulls out of PubMed. One of the solutions even looked at MGI, which is a database that pulls in mouse model data, the phenotypes seen in mouse models. And then there were some really interesting statistical approaches that are not typically used on healthcare data.

I’ll give you an example. The Chong lab team out of the University of Washington used a statistical approach called TF-IDF, which stands for “term frequency-inverse document frequency.” And this is a tool that’s really frequently used in search engine optimization. Now I’m going to come clean and tell you that I am not a statistician. I did not even know what TF-IDF was before I reviewed all of the submissions and read the results from this team. But this is a technique that was developed for web document search and information retrieval. And the way it works is that it scores particular words: the score increases proportionally to the number of times a word appears in a document, but it is offset by the number of documents that contain the word. That’s probably confusing. So I’m going to give you an example. You can imagine when I described to you the validated structured survey instruments that we have on the platform, those measure the impact of particular symptoms. So, take a word like “difficulty” or “trouble.” Those occur a lot in those measures, but they also occur all the time across all 27 of these groups, so those words probably are not very significant. But the word “bruxism,” which means teeth grinding, shows up a lot in the dataset related to just a few of the neurologic diseases on our platform and it doesn’t show up at all in others. So that kind of technique will allow the identification of individual symptoms that are disproportionately common in a particular document or in a particular data set. And this happens all the time in search engine optimization, but I had never heard of it being used in this kind of a data set, and now that’s available to us to reuse. So pretty cool.
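The scoring idea described here can be made concrete with a short Python sketch. It is a minimal textbook form of TF-IDF, not the Chong lab team's actual code: the group names, the symptom terms, and the `tf_idf` function are all hypothetical, and the weighting (term count divided by document length, multiplied by the log of total documents over documents containing the term) is one common variant among several.

```python
import math
from collections import Counter

# Hypothetical toy data: each "document" pools the symptom terms reported for
# one disease group. Group names and terms are invented for illustration.
groups = {
    "group_a": ["difficulty", "trouble", "bruxism", "bruxism", "seizures"],
    "group_b": ["difficulty", "trouble", "seizures", "seizures"],
    "group_c": ["difficulty", "trouble", "hypotonia"],
}


def tf_idf(docs):
    """Return {doc_name: {term: score}}.

    tf  = occurrences of the term in the document / document length
    idf = log(number of documents / number of documents containing the term)
    """
    n_docs = len(docs)
    # Count, for each term, how many documents contain it at least once.
    doc_freq = Counter()
    for terms in docs.values():
        doc_freq.update(set(terms))

    scores = {}
    for name, terms in docs.items():
        counts = Counter(terms)
        total = len(terms)
        scores[name] = {
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        }
    return scores


scores = tf_idf(groups)
# The highest-scoring term for group_a is the one that is frequent there
# but rare across the other groups.
top_term = max(scores["group_a"], key=scores["group_a"].get)
print(top_term, round(scores["group_a"][top_term], 3))
```

In this toy data, "difficulty" and "trouble" appear in every group, so their inverse document frequency is log(3/3) = 0 and they score zero everywhere, while "bruxism" appears only in `group_a` and scores highest there.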

Daniel Levine: The results were actually quite stunning. Can you explain what happened?

Karmen Trzupek: Sure, yeah. Delighted to. I agree. I thought it was really terrific. For this first challenge question, I would say what we really hoped for in the results were two things. We hoped that number one, we would get some meaningful results for some, ideally most, of the disorders that participated. So, there are 27 groups on the platform. If we had gotten really valuable, interesting results for 15 or so of them, I would’ve been happy. The second goal that we had that we were really hoping for out of this particular challenge question was that we would get something that’s reproducible. And I don’t just mean reproducible by us internally. Our goal was that at least one of these solutions would be then available in a totally open science manner, available like in a GitHub repository, where other researchers can go and use it, where they can rerun it in other rare disease communities, where we can rerun it in these rare disease communities when we get more and more data over time. Those goals were absolutely exceeded.

So for the first goal, between the different winning solutions, there were novel or under-recognized symptoms described in every single one of the 27 groups. Every single one. We are working on a publication now where our intent is to make all of that data available in supplementary content that will essentially serve as a publication for each of these rare disease communities to help get this information out there and to help get medical guideline documents created. It’s really very important to the medical management of a lot of these rare diseases that we just get this information out there. The other goal was also exceeded. So, we ended up with two different solutions that were validated to be highly reproducible, and they already exist in our GitHub repository, so any researcher who is accessing the RARE-X dataset today can go in and they can just access these algorithms and they can rerun them.

Daniel Levine: What does this suggest about the utility of doing an open data science challenge?

Karmen Trzupek: Well, I think to the point that I described earlier, I think the idea that you can apply these kinds of methodologies in rare disease data sets is just really critical. I think the idea that we can engage communities beyond what we typically think about as healthcare researchers and that that’s valuable. We’re not the first ones to show that by any stretch, but I think it further validates that in this cross-disorder manner. So I think there’s no question that it’s useful.

Daniel Levine: And have you learned anything from doing this data challenge that you’re going to apply to the next?

Karmen Trzupek: I feel like maybe this is the opportunity for me to list out all the things that we did wrong, and therefore we learned from. Gosh, there were a lot of things that we learned in doing this for the very first time that I’m just so delighted we only had to learn once. For example, keeping this data safe is really, really critical. We knew that going in. That wasn’t a learning, but figuring out how to do that and figuring out how to protect the data in the kind of environment where people can go in and really work on it and yet ensure that the data is ultra-safe and protected, that’s a hard balance. It’s a hard line to figure out. And so I think we’ve just learned a lot along the way that’s helping us as we think about moving forward.

I’ll say that in the second challenge question that I described, where we asked researchers to use machine learning techniques to try to predict a particular diagnosis, some of the winning solutions were significantly better than the baseline that was initially generated that they were challenged to exceed. So I would say the researchers did a great job and yet in the vast majority of diseases, it was just not possible to do. And so I think what that tells us is even with some really advanced technologies and methodologies, we just need to be doing comprehensive genome sequencing in kids with complex neurologic diseases as early as possible.

Daniel Levine: And I’m assuming that you are doing another open data science challenge, is that correct?

Karmen Trzupek: That’s right.

Daniel Levine: Is that scheduled yet?

Karmen Trzupek: It is. It’s planned. We’re doing another one this year. It will launch this fall. I don’t have the exact date yet, but it will launch this fall. And we are actively working with a number of different external partners, some of whom are participating as advisors, and a couple of groups that might be contributing data so that we can have multiple data sets involved in the challenge. So, we’re deep, deep in planning mode.

Daniel Levine: And beyond the challenge itself, does this validate what RARE-X is doing in any way?

Karmen Trzupek: Yeah, I think so, absolutely. I think that the idea that a really rare disease community can come on board this platform and can not only collect really good quality, highly structured research grade data, but attract researchers who’ve never heard of their disease before to start working on it, that’s a huge win, right? That’s really important. I think the other thing is not all of the rare disorders on our platform are ultra, ultra-rare, but some of these techniques should be even more applicable in those kinds of conditions. So, I’ll give you a couple of examples—this year, we’re doing a big launch in two different disease communities that are rare genetic disorders, but I would just say far less rare than some of the ones we’ve been talking about. One is Pompe disease, which is a lysosomal storage disorder, and the other is Huntington’s disease. And both of these conditions are ones where there are lots of researchers currently working on these conditions and yet most of the research that has happened to date has been on data that has been collected at academic medical centers. It’s often difficult to share really broadly, and it’s only collected from the individuals who can get there, who can either afford to or who have the circumstances to travel to those academic medical centers and to collect that kind of data. So the fact that we can do this from anywhere, and I really mean anywhere, this is a global platform, and our open science data challenge last year included global data meeting GDPR requirements and other global requirements. That will only continue. I think that’s really important.

Daniel Levine: Karmen Trzupek, senior director of scientific programs for RARE-X. Karmen, thanks as always.

Karmen Trzupek: Thank you so much, Danny.

Daniel Levine: Thanks for listening. RARE-X is a collaborative platform for global data sharing and analysis to accelerate treatments for rare disease. RARE-X is adapting proven technologies and partnering with leading experts to create a federated data analysis platform, specifically designed by rare community leaders and scaled to support the diverse and expanding needs of rare disease research, development, and care. To learn more about RARE-X, go to rare-x.org. This podcast is produced for RARE-X by the Levine Media Group. Music is courtesy of the Jonah Levine Collective.

This transcript has been lightly edited for clarity and readability.
