Cracking the Cancer Code
In an age where data are everywhere, harnessing the power of data science can be a catalyst for groundbreaking discoveries in the fight against cancer. Welcome to the Cracking the Cancer Code podcast where we explore the latest in cancer data science. As a part of the ITCR Training Network (itcrtraining.org), we’re a small team of individuals who are working to democratize data science education in the hopes of catalyzing cancer research and ultimately fighting health inequities in cancer.
The ITCR Training Network (and this podcast) is supported by NCI UE5CA254170 but the views expressed on this podcast are those of the individuals who expressed them and do not reflect the views of our funders.
Find out more about the ITCR Training Network at https://www.itcrtraining.org/
Sharing Data for the Common Good
This episode of "Cracking the Cancer Code" explores the critical role of data sharing in cancer research, particularly in pediatric oncology. Featuring interviews with experts from the "Data for the Common Good" initiative at the University of Chicago, the podcast highlights how collaborative data sharing is essential for meaningful research in rare diseases. The initiative focuses on building community, creating data standards, and developing governance structures to facilitate effective data sharing across institutions and countries.
The episode discusses the challenges of changing academic culture from data hoarding to sharing, and proposes various incentives to encourage researchers to participate. These efforts have already led to improvements in disease treatment and patient outcomes. The podcast emphasizes that proper data sharing requires specific skills and training, and that data science and management are as crucial as clinical research in advancing cancer treatment. Overall, the episode underscores the transformative potential of data sharing in accelerating progress in cancer research and improving patient care.
0:00: It's too often that researchers will say, well, I just didn't know where to turn or I didn't know how to do this.
0:05: I never knew I should have looked up how to build a standard form.
0:08: The corollary to that is we need to keep researchers from hurting themselves with data.
0:12: And I didn't make up either of those expressions.
0:14: I've heard people say that. You're listening to Cracking the Cancer Code, a podcast series about the researchers who use data to fight cancer.
0:31: I'm Doctor Carrie Wright, a senior staff scientist at the Fred Hutchinson Cancer Center.
0:35: I'm the head of content development for the ITCR Training Network,
0:39: a collaborative effort funded by the National Cancer Institute to support researchers around the United States with cancer informatics and data science training.
0:48: And I'm Candace Savonen.
0:49: I'm a data scientist at the Fred Hutchinson Cancer Center and I'm the tech lead of the ITCR Training Network.
0:55: We work closely with a variety of dedicated cancer researchers who are shaping the field's future.
1:03: Last episode, we talked about some of the unusual challenges that imaging data and free text data can pose. One theme that came up is how there's too much data in this very technological age for one researcher, or even one group of researchers, to tackle alone. We need the right tools and collaboration.
1:22: But ultimately, it comes down to good communication and community working together. And really, to effectively and efficiently do cancer research,
1:31: we need to share data and collaborate and communicate regularly.
1:35: One consortium that is leading the way in effective data sharing is Data for the Common Good, based at the University of Chicago.
1:42: This initiative started with pediatric cancer data but has now expanded to other rare diseases and is even studying the social determinants of health.
1:51: Data for the Common Good is an initiative to share useful, high-quality data between institutions, groups, and countries.
1:58: We talked with Dr. Sam Volchenboum and Ellen Cohen, two of the key members, about the importance of data sharing and their process for making that happen.
2:09: Hi, I'm Sam Volchenboum.
2:10: I'm a pediatric oncologist at the University of Chicago.
2:13: I'm a professor here in pediatrics and I also direct a group called Data for the Common Good, which is my research group, and the largest part of that group is called the Pediatric Cancer Data Commons.
2:22: My name is Ellen Cohen.
2:23: I am the Deputy Director of Data for the Common Good, which is housed at the University of Chicago.
2:29: How did your work on pediatric oncology lead to working on data sharing initiatives?
2:36: As a pediatric oncologist, I've obviously been involved a lot with clinical trials and understanding how to collect data and how we use data from trials.
2:42: And even the process of helping write clinical trials and administer them
2:47: really introduced me to the world of clinical trials and data collection, which is really broken.
2:51: I mean, the way that we write trials in a word processor, the way we collect data manually, the way that we put order sets in the electronic health record by hand, it just screamed that it was ripe for innovation and thinking about different ways to do things.
3:05: Why is data sharing so crucial to pediatric oncology research?
3:09: They have the problem of not being able to establish really consequential and significant outcomes because there are so few cases for them to work with in their own locale.
3:22: The pediatric cancer community has always known that it had to work together across boundaries by collecting large amounts of data, one or two patients at a time from multiple institutions, which is the only way we have been able to get the numbers needed to study pediatric cancer.
3:36: And that's allowed us to make these incredible strides in cure rates over the last 60 years.
3:41: So applying those principles to thinking about data commons, we knew that we wanted to engage groups from all over the world and collect a larger set of data.
3:50: But we want to take a different approach than had been taken by most groups.
3:54: And instead, we wanted to concentrate more on the data models and the quality of the data.
3:59: And so that really got me thinking about how do we build community, how do we build partnerships?
4:03: How do we get the world thinking about better ways to collect and use data?
4:07: Data for the Common Good is honestly a really impressive initiative where they are trying to tackle these problems in a practical sense, but also by gathering experts together to make decisions about how we can increase these opportunities for scientific discovery through data sharing.
4:25: How did Data for the Common Good come to be?
4:28: So it came from Sue Cohn, who is a venerated researcher in neuroblastoma.
4:35: She came to S A with an Excel spreadsheet of 12,000 cases of neuroblastoma, and it was harmonized to a standard that they had developed.
4:45: That was very, I won't say primitive, but it wasn't like you'd build a national data dictionary.
4:50: It was like zeros, ones, and twos for different data elements.
4:52: And it was a fine standard.
4:53: And what that allowed them to do was to harmonize data from all over the world. They had about 10,000 patients in this database, and for a disease that only affects a couple of hundred kids in the US every year,
5:04: it was an enormous collection of data,
5:05: going back to the nineties. When we saw that data set as an Excel spreadsheet, we understood that there was something bigger we could do here, which was make the data available through an interface for researchers.
5:14: And that's really how the data commons was born.
5:16: It was the confluence of my thinking about it, having the right platform to do this from in running this informatics group and then taking this large set of neuroblastoma patients and putting it into a data commons format.
5:28: And then we just started to think more about how do we develop a scalable way to do governance, to do data standards, to do data modeling.
5:34: And then here we are now 10 years later, you know, doing this over a dozen or more pediatric cancer groups with tens of thousands of patients with multiple data models and dictionaries and now way beyond pediatric cancer into other areas.
5:47: So it's really been a fun journey.
5:49: And I think having seen how broken the system was, was really the impetus to try to change it.
5:54: What led to the name Data for the Common Good? Because it is a great name.
5:58: The original name of the group was the Pediatric Cancer Data Commons.
6:02: And it was because we started forming these data consortia around pediatric cancer diseases.
6:09: And that worked for about four or five years.
6:12: Some collaborators within the University of Chicago, one of whom is an international figure in a rare form of type one diabetes called monogenic diabetes.
6:23: And he said, I see what you're doing.
6:25: I want to do that in monogenic diabetes.
6:27: Can you help us?
6:29: And so we put together a proposal and got some funding to launch a data commons for monogenic diabetes.
6:37: And then a collaborator in pediatrics said, hey, I see what you're doing and I want to do it for myogenic epilepsies.
6:47: And we realized what we're doing is now really no longer a pediatric cancer data commons.
6:54: And in turn, we convened as the full group and said, how do we want to be known?
7:00: And that was the consensus agreement on the new name, Data for the Common Good.
7:06: What does Data for the Common Good do now?
7:08: How do you help groups share data and collaborate?
7:12: So what we do first is we build community.
7:15: We call everybody to the table and say here is what we do.
7:18: We want you to be a part of it.
7:20: We want you to own your data and make decisions about your data together with the other people at the table.
7:27: We want you as a group to decide what are your research priorities?
7:32: What is it that you want to do with these data once we get through the whole process?
7:38: And then we start by building committees, executive committees who make decisions on behalf of the group, such as: who are we going to release our data to when they propose to use it for something? Who represents the group?
7:54: We start basically in two ways: with governance and with building a data dictionary.
7:59: So everybody signs a common memorandum of understanding if they want to be a part.
8:05: And then when anybody is ready to contribute their data to us, they sign a data contributor agreement with us and the institution that allows us to be good stewards of their data.
8:19: And what we do when they give us their data is we standardize it according to the data dictionary that everybody agreed upon.
8:27: We send the data back to them to harmonize it with their previous data.
8:31: And then we put it into our data commons and publish it.
8:34: Not at the line level, it's all deidentified.
8:37: But people can go to our portal and they can do some cohort discovery on all of the cases that are in our data portal.
8:47: And right now, we have something like 42,000 cases in the Pediatric Cancer Data Commons, representing either five or almost six diseases right now.
8:58: Even though we have 13 or 14 consortia, a lot of the other consortia are in much earlier stages.
9:16: So data harmonization is the process of combining multiple data sets together in a way that allows them to be usable together.
9:26: So if we combine different data from different sources and things that mean the same thing are coded differently, it can be really challenging to understand that downstream.
9:37: So we try to get things into standardized formats so that we can analyze the data much more simply.
9:43: So for example, if we're trying to look for a specific cancer and the way that it's coded in the data is different from one data set to another.
9:51: It can be really hard to collect those particular patients and subset them out of a larger data set
9:57: if we have multiple names, compared to if we just have one standardized way. And to use data effectively, it's not just about the data itself, it's about something called metadata, which describes the data.
10:12: It's the data about the data descriptions of what each variable is.
10:17: For example, maybe you have an age column, maybe you have a column about when they were diagnosed, there's all kinds of examples.
10:25: And so a metadata table would explain what each of those columns of data, or variables, actually means, with a little more information about how to interpret them and how potentially to use them,
10:38: maybe even what kind of transformations have been done to that data. And data dictionaries, like the kind that Data for the Common Good is making sure folks all agree upon, are then a collection of metadata and documents that basically help make that data actually usable by others.
10:57: So it should describe the database itself, how different data pieces are related to each other, potentially
11:04: what kind of formats the data are in, just everything that you might need to know in order to effectively use that database.
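To make the harmonization and data dictionary ideas above concrete, here is a minimal Python sketch. It is not Data for the Common Good's actual pipeline or data model: the site names, column names, and codes below are all hypothetical, chosen only to show how an agreed-upon dictionary lets differently coded records be combined and then subset by diagnosis.

```python
# Hypothetical data dictionary: the variables everyone agreed on,
# with metadata describing what each one means.
DATA_DICTIONARY = {
    "diagnosis": {
        "description": "Primary cancer diagnosis",
        "allowed": {"neuroblastoma", "wilms_tumor"},
    },
    "age_at_diagnosis_months": {
        "description": "Age at diagnosis, in whole months",
    },
}

# Hypothetical per-site mappings from local coding to the shared standard.
SITE_MAPPINGS = {
    "site_a": {
        "diagnosis_column": "dx",
        "diagnosis_codes": {"0": "neuroblastoma", "1": "wilms_tumor"},
        "age_column": "age_yrs",
        "months_per_unit": 12,  # site A records age in years
    },
    "site_b": {
        "diagnosis_column": "diagnosis_text",
        "diagnosis_codes": {"NBL": "neuroblastoma", "Wilms": "wilms_tumor"},
        "age_column": "age_months",
        "months_per_unit": 1,  # site B already records age in months
    },
}


def harmonize(site, records):
    """Convert one site's records to the shared data dictionary format."""
    mapping = SITE_MAPPINGS[site]
    harmonized = []
    for record in records:
        local_code = record[mapping["diagnosis_column"]]
        diagnosis = mapping["diagnosis_codes"][local_code]
        # Reject anything the dictionary does not allow.
        if diagnosis not in DATA_DICTIONARY["diagnosis"]["allowed"]:
            raise ValueError(f"{diagnosis!r} is not in the data dictionary")
        age = float(record[mapping["age_column"]]) * mapping["months_per_unit"]
        harmonized.append(
            {
                "site": site,
                "diagnosis": diagnosis,
                "age_at_diagnosis_months": round(age),
            }
        )
    return harmonized


# Made-up example records, coded differently at each site.
site_a_records = [{"dx": "0", "age_yrs": "1.5"}, {"dx": "1", "age_yrs": "3"}]
site_b_records = [{"diagnosis_text": "NBL", "age_months": "20"}]

combined = harmonize("site_a", site_a_records) + harmonize("site_b", site_b_records)

# With one standard code, subsetting a disease is a single comparison.
neuroblastoma_cases = [r for r in combined if r["diagnosis"] == "neuroblastoma"]
print(neuroblastoma_cases)
```

Once both sites are expressed in the same terms, the neuroblastoma patients can be pulled out without knowing that one site wrote "0" and the other wrote "NBL".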
11:12: So making these data standards is actually really not trivial.
11:17: Some of these data standards don't exist yet.
11:20: So what Sam has done is really innovative because he puts together all these people that are working on this type of research, working on this type of clinical work and asks them to come up with standards together that they can all agree on.
11:34: And as you can imagine, that's very tricky.
11:37: It's hard to get consensus across a large group.
11:40: But eventually, after lots of discussion, they come up with data standards that they use to make the data consistent across all of the groups so that the data can be shared effectively.
11:57: So historically, academia has really incentivized this sort of private lab research, which means that people would collect data and do the research on that data and may not share that data publicly.
12:11: But things have changed a lot.
12:13: The field has understood that, especially for things like rare diseases and rare cancers,
12:18: we need to share that data to have numbers that can allow appropriate power for our statistical studies.
12:26: And for us to really make advances in cancer research.
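The point about statistical power can be illustrated with a rough calculation. The sketch below is only an illustration, not an analysis from the episode: it assumes a simple two-sided comparison of two group means, uses the standard normal approximation, and the function name and default values are our own.

```python
import math
from statistics import NormalDist


def sample_size_per_group(effect_size, alpha=0.05, power=0.80):
    """Approximate patients needed per group to detect a standardized
    effect size in a two-sided, two-sample comparison of means
    (normal approximation)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # significance threshold
    z_power = NormalDist().inv_cdf(power)          # desired power
    return math.ceil(2 * (z_alpha + z_power) ** 2 / effect_size ** 2)


for effect_size in (0.8, 0.5, 0.2):
    needed = sample_size_per_group(effect_size)
    print(f"effect size {effect_size}: about {needed} patients per group")
```

Under these assumptions, a medium effect (0.5) needs about 63 patients per group and a small effect (0.2) about 393. For a disease that affects only a couple of hundred children a year nationwide, no single institution sees numbers like that, which is why pooling data across institutions matters.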
12:29: The concern used to be particularly before the age of the internet that if you shared your data, someone else could claim that that data was theirs and they collected it.
12:38: And I think that fear sometimes of what used to be called scooping still exists.
12:44: But nowadays, there are systems and ways that we can share the work that we've done before we've published it, to prove that we have done that work.
12:53: And there's no reason to hog credit because credit is not a thing that is limited.
12:59: We can attribute people's work and it doesn't cost us anything.
13:02: This is also really important for institutions that may be under-resourced.
13:06: Because now if we share our data, there might be brilliant minds at under-resourced institutions that could make findings that the original creators of that data did not. Having more diverse access to this kind of data can lead to more innovative ways of using the data.
13:24: It's been shown that having more diverse research teams leads to more innovation.
13:29: And this can also lead to more research in health disparities and health inequities to help mitigate those issues.
13:37: So now we're trying to incentivize researchers to share data. Academia has had a history of data silos in the past, where each institution might have its own compartment of data that is not necessarily shared.
13:53: And that still persists to this day somewhat.
13:55: What are some of the motivations for researchers to push for more data sharing efforts?
14:00: There is a real human exercise to it: can we get people to want to work together?
14:04: Especially if you think about how the incentives that we've built in academics are all aligned against doing something like this, which is: hoard your data, silo your efforts, you know, publish or perish, get promoted.
14:14: So we have to try to change that paradigm and we're starting to see some of those benefits, but it's still very difficult to get people out of their old fashioned molds of, you know, why would I want to share my data when I should just hoard it and do my own work with it?
14:26: And then I think in parallel, we're seeing parent groups and advocates demand this.
14:29: I think we're starting to see the RFAs come out with language that mandates better data sharing and open access.
14:36: And then I think we're just starting to see chipping away at the old academic structure.
14:40: So you know, maybe when tenure committees look at packages, they're not just looking at senior author publications in high-tier journals, they're looking at the degree to which somebody's enabled data sharing or they're on papers that show their contributions to data sharing.
14:53: So I think we're starting to see movements in that direction.
14:56: We're far from it being the norm though to share data.
14:58: But I think we're getting there slowly and I think you just remind them that it's in their interest, right?
15:05: If they want other people's data and they want to do their science better, they need to come to some kind of consensus. In addition to motivation, what are some possible incentives for researchers?
15:19: Well, again, it has to be both a carrot and a stick.
15:21: And so I think the grant mechanisms, and I've helped some foundations write their data sharing language.
15:26: I think they have to be, I think you have to tie future funding to showing successful data sharing.
15:31: I think people shouldn't be afraid to put those in their grants.
15:33: I think to get year two funding, you should show that you put your data into a publicly accessible resource
15:39: after year one. I don't think there's anything wrong with that.
15:41: So I think that approach is helpful.
15:43: I think we have to reward people who do share data.
15:46: I think having tenure and promotion committees start to recognize data sharing as a tenurable element,
15:52: I think, will be really helpful.
15:54: I think rewarding clinicians and researchers in some way, whether it's through something like ORCID or something around, you know, who's a good sharer of data.
16:03: I know folks like Melissa Haendel have thought a lot about these issues.
16:06: I think there has to be both.
16:07: But I think starting with the granting mechanisms is, is a great way to get people to realize how important this is and how essential it is to their continued funding.
16:15: The incentives have to be both top down.
16:17: So they have to be mandated by the groups that have the power, like the NIH and NCI, HHS, the ONC.
16:24: All these top-down groups have to mandate it, and then bottom up, we have to see researchers at a grassroots level push each other to do better data sharing and be part of this community.
16:33: Often it is so punitive in the work that we do, like, you know, you will be punished if you do not do it this way.
16:38: I think we have to incentivize people to do the right thing.
16:40: I think in the end that is going to make for a better community. These data sharing efforts are wonderful.
16:45: What sort of impact do they have on cancer research at large?
16:49: What impact do they have on the patients?
16:51: Often those publications are related to improvements in disease treatment.
16:59: Like how do you reduce radiation?
17:01: Right.
17:02: How do you improve longer term outcomes?
17:04: How do you do the best science with the least harm, you know, the best medicine with the least harm?
17:08: And I would say that's a general approach to the output so far in the Pediatric Cancer Data Commons. I always say, like, we have to make it easy to do the right thing.
17:19: So that means the data literacy and the informatics training
17:23: part of our work is incredibly important because it empowers the community to make better decisions when it comes to utilizing data, even collecting data.
17:31: And I want every clinician and researcher out there, when they go to build a data collection form, to stop and think, you know, why would I come up with my own list of drop-downs for this menu when I can look them up somewhere and somebody else has done this heavy lift that's going to make it interoperable?
17:45: So that's a huge part of it that we're now starting to see more groups take this as a mandate to make their own data more shareable.
17:52: Data sharing is a skill set. In order to do it properly
17:55: and in a way that's effective, we need to be trained on it, whether we train ourselves or work with other folks, particularly, hopefully, experts. Not only do we need to be concerned that the way we share the data is effective, but we also want to make sure that sensitive data, which needs to be protected in order to protect the well-being of our patients and other folks who've contributed to research, is not shared when it shouldn't be.
18:19: Yeah, we need to provide access to only the people that should have access.
18:23: So people who are doing research, we need it to be shared in a way that's safe and secure.
18:28: Some researchers might think the solution is to not share data at all because it might feel like that's safer.
18:34: However, that is also potentially a disservice to the patients whose data we have, because many patients are disappointed if they hear that their data hasn't been used to its full extent. Their data reflects their lived experience.
18:49: And it was a really hard time, potentially the hardest time in their lives and some of them did not continue on.
18:55: And so in that spirit of making sure that future people don't have to go through that same experience.
19:02: We need to make sure that the data that we've collected is used to its full potential.
19:07: And that means making sure it's shared and making sure it's shared properly.
19:11: And cancer patients and their families and support systems are really excited about their data being reused at times because this means that we can make faster advances in cancer research so that we can hopefully improve treatment for these diseases.
19:26: And maybe one day eradicate some of them. Data science, as well as good data infrastructure and data management, is really important to research.
19:46: It should not be thought of as an afterthought.
19:48: It's actually just as important as the actual clinical and wet lab research itself.
19:53: And these data science issues require people to actively work on them and work together.
19:58: So it really means that it needs the support of the community, the proper tools, and appropriate funding to help support it.
20:05: In our next episode, we'll look at the efforts of the National Cancer Institute of the NIH to address the need for community tools and the funding to support both.
20:14: Thank you for listening to Cracking the Cancer Code.
20:17: New episodes are released every other Monday,
20:19: wherever you get your podcasts. You can find out more about our work at itcrtraining.org.
20:25: This podcast is sponsored by the National Cancer Institute through the Informatics Technology for Cancer Research Program, grant number UE5CA254170.
20:36: The views expressed in this podcast do not reflect those of our funders or employers.
20:41: We'd like to thank everyone who graciously lent us their time for making this podcast. Without their contributions,
20:46: it would not be possible.