Cracking the Cancer Code
In an age where data are everywhere, harnessing the power of data science can be a catalyst for groundbreaking discoveries in the fight against cancer. Welcome to the Cracking the Cancer Code podcast where we explore the latest in cancer data science. As a part of the ITCR Training Network (itcrtraining.org), we’re a small team of individuals who are working to democratize data science education in the hopes of catalyzing cancer research and ultimately fighting health inequities in cancer.
The ITCR Training Network (and this podcast) is supported by NCI UE5CA254170 but the views expressed on this podcast are those of the individuals who expressed them and do not reflect the views of our funders.
Find out more about the ITCR Training Network at https://www.itcrtraining.org/
More than numbers
This episode of "Cracking the Cancer Code" focuses on the challenges and advancements in working with different types of cancer data, particularly free text clinical data and imaging data. The hosts, Dr. Carrie Wright and Candace Savonen, interview experts in these fields to discuss the current state of cancer informatics and the tools being developed to improve data analysis.
A special thanks to Dr. David Hanauer and Dr. Gordon Harris for taking the time out of their busy schedules to talk with us for this episode.
0:08: You're listening to Cracking the Cancer Code, a podcast series about the researchers who use data to fight cancer.
0:15: We support cancer informatics through informatics and data science training.
0:20: I'm Dr. Carrie Wright, a senior staff scientist at the Fred Hutchinson Cancer Center, and I lead content development for the ITCR Training Network,
0:27: a collaborative effort funded by the National Cancer Institute. And I'm Candace Savonen.
0:32: I'm a data scientist at the Fred Hutchinson Cancer Center and I'm the tech lead for the ITCR Training Network.
0:37: We work closely with a variety of dedicated cancer researchers on the forefront of cancer informatics, shaping our understanding of the field and shaping the field's future.
0:48: Last episode, we discussed how advances in laboratory techniques and computational technology have led to a revolution in how much data cancer researchers have to work with.
0:59: There's basically been an explosion of data, but the challenge with cancer research isn't just the amount of data.
1:05: It's also the type of data.
1:07: Often when we say the word data, we automatically think numbers.
1:11: But this is only half the story.
1:13: When we discuss our health with practitioners, they are taking loads of notes, some that are numbers like weight and blood pressure, but most that are in the form of words.
1:22: So what about this data
1:23: that isn't numbers?
1:25: We spoke to Dr. David Hanauer, an expert in parsing electronic health record data. Free text is an absolute mess.
1:35: But it's also the most rich, dense, probably valuable data that anybody can work with.
1:42: David Hanauer.
1:43: I'm an associate professor here at the University of Michigan in the Department of Learning Health Sciences.
1:48: My work is primarily in the area of clinical informatics, mostly on the research side.
1:54: So I really work a lot with clinical data, mostly data coming from the electronic health record system.
1:59: The area I mostly work in is with free text data, and mostly around a software tool that we have developed and have been supporting called EMERSE, which is the Electronic Medical Record Search Engine.
2:11: So what's the difference between structured data and unstructured data?
2:15: So structured data are things that are often numeric or they could be codes.
2:20: They are often the kind of things that I would say fit easily within a spreadsheet. If something can fit easily within a column and can be easily tabulated, or you can get a mean or a median or some other easy metric, that's usually structured data.
2:34: The free text are the unstructured data.
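As a small illustration of the distinction being described here, the sketch below puts a few spreadsheet-style rows next to a free-text note. All patient values and the note text are invented for illustration and do not come from the episode; the helper name `column` is hypothetical.

```python
from statistics import mean, median

# Structured data: values that fit in spreadsheet-style columns.
# All values below are invented for illustration.
structured_rows = [
    {"patient_id": "A", "weight_kg": 70.0, "systolic_bp": 120},
    {"patient_id": "B", "weight_kg": 82.5, "systolic_bp": 135},
    {"patient_id": "C", "weight_kg": 64.0, "systolic_bp": 128},
]


def column(rows, name):
    """Pull one column out of a list of row dictionaries."""
    return [row[name] for row in rows]


# Easy metrics such as a mean or a median are one line each.
mean_weight = mean(column(structured_rows, "weight_kg"))
median_bp = median(column(structured_rows, "systolic_bp"))

# Unstructured data: a free-text note (also invented). There is no column
# to average; the information is carried by open-ended language.
free_text_note = (
    "Pt reports worsening nausea since last cycle. Concerned about cost of "
    "travel to infusion center. Tumor appears smaller on exam."
)

print(f"mean weight: {mean_weight:.1f} kg, median systolic BP: {median_bp}")
print(f"note length: {len(free_text_note.split())} words, no columns to tabulate")
```

The structured rows can be summarized directly, while the note needs some form of text processing before anything can be counted.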
2:37: So free text and unstructured are often used interchangeably, which is kind of the open-ended language that people use to convey their thoughts about what's going on. What's in clinical notes that's so important?
2:50: Why should we analyze them? Those other, in a sense, omic data, or whatever one might call them, are really kind of more about aspects of the patient that they don't really have much control over.
3:01: It's sort of whatever their body is expressing in some ways, and they would really not
3:05: be aware of these things. In the EHR, a lot of it really covers more about what the patient is experiencing and what matters to them, right?
3:14: So there is lots of, I would say, phenomic data that goes with it, the phenotype of the patient. It can be about side effects they're experiencing, concerns
3:23: they have. It can even be about economic issues they're having, which can impact their ability to get care, their ability to travel to places to receive the care.
3:31: And every one of these elements is actually quite important, because it almost doesn't matter sometimes what your tests are
3:38: if you can't get the proper care, can't pay for your treatments, or you don't even understand what's going on with your conditions.
3:45: Let's talk a little bit about some of this omic data and phenomic data that David was talking about.
3:54: So omic data often refers to the collection of perhaps genomic data, meaning information about gene expression. It could also mean information about metabolomics, meaning information from possibly blood tests, things that we might collect data on to better understand how the cancer is happening in an individual.
4:16: In terms of phenomic data, David's referring to phenotype or the way in which genes may be manifesting something in an individual.
4:26: So the way they're presenting a trait or a disease. That's right.
4:29: So if we go back to how genes work, we inherit these genes, and ultimately their expression leads to, usually, the classic traits we think of, like eye color or hair color.
4:41: But obviously, there are all kinds of ways genes are expressed: a widow's peak, an absence of a widow's peak.
4:47: And sometimes the way they are expressed leads to errors in the original DNA, which can lead to cancerous cells.
4:55: What makes free text data so hard to analyze?
4:59: There's I think a lot of reasons people don't use it.
5:01: One is that of course, it does take a lot more work to get.
5:04: You need to often have or partner with an expert.
5:09: Oftentimes you have to use natural language processing tools.
5:11: These tools are not easy to use.
5:13: They are actually quite challenging to use.
5:16: There's lots of problems with it because it's not coded to any standards.
5:19: Usually it is full of misspellings, it's full of incorrect information, it's full of discordant contradictory information.
5:28: I think it's also challenging because, unlike structured data, it's very hard to remove all the identifiers from free text data.
5:35: And it's also important to know that things change over time.
5:38: So you can look at one note at one point in time, but you sometimes have to look at the entire sequence of notes because the course changes over time, what they were thinking and what they were concerned about changed over time.
5:49: The diagnosis may be very unclear at first and may become more clear later on.
5:53: So there's a lot of those things that are really hard to do unless you're really sort of holistically looking at everything and considering all the details in the entire package for each patient to know what's going on.
6:05: It's also important to remember there's a lot of authors of these notes, right?
6:08: So you might just think, oh, well, the doctor writes a note. But, you know, the doctor does, the medical assistant might, a nurse will, a social worker will. I mean, there's a whole lot of different people on a care team who will write a note for whatever parts they are working on.
6:23: And therefore it's hard to know which is the note, which is the source of truth, which one has been transcribed from another one.
6:29: So it can be really challenging to use.
6:32: And that's often why people don't seek out those kinds of data. With all these challenges,
6:36: why is it still worth analyzing free text data?
6:39: So the clinical notes are really more than just statistics or numbers and there is plenty of that, right?
6:45: But there's just a lot of details that may not be present elsewhere.
6:47: The size of the tumor will be in a note, it won't be in some of these other data systems. The progression of the disease, you know, how well someone is doing, how they're responding to treatments, and anything else that might be there.
7:00: These are all in the notes and just are not available anywhere else. It also depends on what kind of research you're doing.
7:05: If you're only interested in, if someone had a certain genetic mutation, then the notes may not matter.
7:11: But if you're interested in how that mutation impacts their outcomes or other aspects of how well they respond to a medication, you probably have to go to the note to get that kind of detail because it's not found anywhere else.
7:26: And I think there's probably just a lack of awareness of how valuable those data are because people do generally tend to like and use the structured data because it is so readily available.
7:37: And I think there's probably an assumption that because it's structured data coming from the EHR, it's probably real and valid,
7:44: even though many times we've shown it's actually not. With the advancements of computers being able to understand language better
7:52: these days, we can think about how the world of these clinical data might be opened up more than it ever has been.
8:00: We have actually helped demonstrate the EMERSE tool in collaboration with David Hanauer.
8:05: And it's really cool because people can put in a term that they're interested in and it will come up with a full dictionary of all sorts of related terms and people can modify those and add terms that they think are relevant.
8:19: And then it can search for that information among a giant collection of electronic health records data.
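The workflow described here (start from one term, expand it to related terms that the user can edit, then search a collection of notes) can be sketched in a few lines. This is a toy illustration only and is not how EMERSE is implemented; the synonym dictionary, the notes, and the function names `expand` and `search` are all hypothetical.

```python
import re

# Hypothetical synonym dictionary, including a common misspelling.
# A user could add or remove terms.
RELATED_TERMS = {
    "nausea": ["nausea", "nauseated", "nauseous", "queasy", "nausia"],
}

# Invented notes standing in for a large collection of clinical notes.
NOTES = {
    "note-1": "Patient feels queasy after chemotherapy.",
    "note-2": "No complaints today. Appetite is good.",
    "note-3": "Reports nausia and fatigue since Tuesday.",
}


def expand(term, extra_terms=()):
    """Return the search term plus its related terms and any user additions."""
    base = RELATED_TERMS.get(term.lower(), [term.lower()])
    return sorted(set(base) | {t.lower() for t in extra_terms})


def search(term, notes, extra_terms=()):
    """Return IDs of notes containing any expanded term as a whole word."""
    terms = expand(term, extra_terms)
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(t) for t in terms) + r")\b",
        re.IGNORECASE,
    )
    return sorted(note_id for note_id, text in notes.items() if pattern.search(text))


print(search("nausea", NOTES))
print(search("fatigue", NOTES, extra_terms=["complaints"]))
```

A search for "nausea" alone would miss both matching notes in this toy collection, since one uses a synonym and the other a misspelling; the expanded term list is what finds them.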
8:26: This kind of task would take so much time for a human to do by themselves.
8:31: It also enables people to identify individuals that might be really appropriate for a clinical trial.
8:37: So if they have certain traits, that would make them especially good for that trial, those people can be identified much more easily than it would traditionally take.
8:47: So in my mind, the beauty of things like EMERSE, where we are processing these clinical data really fast, is not necessarily that they are going to do it better than an individual human, but that they can go through a lot of information at the same time. The amount of information that an individual human would not have the time in their lifetime to go through, tools like EMERSE can do that for us.
9:12: So we can really mine what we need out of it to inform cancer research. Free text data isn't the only type of data that's been traditionally a challenge to parse and analyze. Images are another type of data that can be very tricky.
9:30: So when we think of cancer imaging data or medical imaging in general,
9:35: there is a variety of ways that we can take images of patients and their tumors or any other part of their body that we need to look at for a clinical purpose or research purpose.
9:47: There are X-rays, which are really good for seeing bones, for example.
9:51: There is also magnetic resonance imaging, or MRI, which can give us more information beyond the bones.
9:58: We can see other types of tissue like soft tissue.
10:01: There are also things like CT; people informally also call these CAT scans.
10:05: We can also talk about histology data where folks do different stains on a tissue and see what kind of reactions come up.
10:14: This is especially helpful to figure out what kind of markers the cancer tumor is expressing and then those proteins can potentially be used as targets for therapy.
10:24: Dr. Gordon Harris is an expert in cancer imaging data and in rethinking the way in which we collect and collaborate to analyze cancer imaging data.
10:38: You know, there's been talk about decentralizing clinical trials, but they've never talked about doing that for oncology.
10:44: Because for oncology to decentralize clinical trials, you need to be able to decentralize the imaging.
10:50: And until the availability of a platform that can do that, which we've been discussing, that's been just a pipe dream, but we can actually make that a reality now.
10:59: I'm Gordon Harris. I am the co-director of the Tumor Imaging Metrics Core for the Dana-Farber/Harvard Cancer Center and Director of the 3D Imaging Service at Mass General Hospital.
11:11: I'm also the principal investigator for the Open Health Imaging Foundation U24 grant from the Informatics Technology for Cancer Research program from NCI.
11:24: And while I'm part time at Mass General now, I am also part-time co-founder and Chief Science Officer of a clinical trials imaging informatics platform called Yunu.
11:33: How is imaging data used in a clinical trial setting?
11:36: Ideally, clinical trials imaging assessments are getting the images from the scanners to where the reader is, with all the information they need to do the imaging assessment, and having the imaging assessment done on time, correctly, and getting that information back to the site wherever the patient is being treated, and being able to have that communication, in case changes are needed in that imaging assessment, that can be done back and forth.
12:03: So what exactly is a clinical trial? A clinical trial is where researchers are interested, perhaps, in a new therapy or treatment and they want to assess if it's actually helpful for patients.
12:19: So they will get a number of patients that have the appropriate type of cancer that might be mitigated or benefited by this treatment.
12:29: And then they test it on these patients to see how effective it is.
12:33: They also test for side effects and other issues.
12:36: There's also multiple phases to clinical trials and the way these phases work is if they don't pass the first phase, they don't go on to the next phases.
12:44: The very first phase of a clinical trial is just to see if it can be tolerated as a treatment or therapy.
12:51: And so this determines safety, it determines dosage, how much can people safely take.
12:57: And of course, the word safety is a little bit confusing in terms of cancer treatment, because chemotherapy and other treatments are actually very rough on individuals and their bodies.
13:09: But at the end of the day, we're also working to combat the cancer.
13:13: It's not even getting to the point of asking whether it works.
13:17: It's just asking will people not be harmed by this?
13:20: And that's phase one. Phase two is where they start to look at the effectiveness of a clinical trial.
13:25: Is it actually doing some good against whatever it's trying to treat?
13:30: If not, then again, clinical trials won't go to the next phase in any of these phases if they don't pass the phase prior. Next, we want to know, is it effective?
13:39: But also is anybody having bad reactions to it?
13:42: Because again, we'll stop these clinical trials if at any point a treatment or therapy is not passing these tests.
13:50: And then lastly, it starts to get expanded out to more and more folks who are going to try it, and we're going to continue to see if it is effective
13:58: and also safe,
13:58: most importantly. A lot of the people that Gordon works with aren't necessarily collecting this type of imaging data for clinical trials.
14:08: They might typically be doing it for actual clinical work, just to get people the treatment and help that they need.
14:14: So they don't necessarily have all the information to help optimize the collection and movement of that data in the best way.
14:29: In addition to the training that's needed to interpret the images, what are some of the reasons why imaging data are so difficult to work with?
14:36: So there is certainly a training component, but it's also the tools. Like, if you think about what a radiologist is faced with, they're usually left to reading imaging assessments on their PACS system, and the PACS system is a clinical system and it's not geared towards clinical trials imaging research.
14:53: And so they have no way of knowing what the protocol is or knowing how to make the assessment for that trial.
15:04: And usually they're just making a mark on a tumor and they're not even able to identify whether it's the long axis or the short axis.
15:11: And the criteria have a requirement to measure the long axis on solid tumors and the short axis on lymph nodes.
15:17: If you're just doing one measurement, you know, there's no way to confirm that it was done correctly and then that's just being transferred onto spreadsheets and paper forms.
15:24: And there's no way to track back to how it was done. And also, calculations have to get made relative to the lowest point in that clinical trial, whereas clinical reads are comparing
15:36: usually just to the prior. And in a clinical trial assessment, you have to go to the nadir, usually, which may be eight time points prior, and a clinical radiologist isn't gonna be looking at all the prior time points and checking which one had the lowest point of the tumor.
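To make the measurement and nadir points concrete, here is a simplified sketch of the calculation being described: choose the long or short axis per lesion, sum the diameters, and compare the latest time point both to the prior one and to the nadir. The lesion fields, function names, and all numbers are assumptions made up for illustration; this is not a full implementation of any response criteria.

```python
def measured_diameter(lesion):
    """Pick the axis named in the transcript: short axis for lymph nodes,
    long axis for other solid tumors. Values are in millimetres."""
    if lesion["is_lymph_node"]:
        return lesion["short_axis_mm"]
    return lesion["long_axis_mm"]


def sum_of_diameters(lesions):
    """Add up the measured diameter of every lesion at one time point."""
    return sum(measured_diameter(lesion) for lesion in lesions)


def percent_change(current, reference):
    """Percent change of the current value relative to a reference value."""
    return 100.0 * (current - reference) / reference


def compare_to_prior_and_nadir(sums):
    """Given sums ordered by time point, compare the latest one to the
    immediately prior time point and to the nadir of all earlier ones."""
    current, earlier = sums[-1], sums[:-1]
    nadir = min(earlier)
    return {
        "vs_prior_pct": percent_change(current, earlier[-1]),
        "vs_nadir_pct": percent_change(current, nadir),
        "nadir_mm": nadir,
    }


# Invented sums of diameters (mm) at five time points.
sums = [100.0, 60.0, 50.0, 58.0, 61.0]
result = compare_to_prior_and_nadir(sums)
print(f"vs prior: {result['vs_prior_pct']:+.1f}%, vs nadir: {result['vs_nadir_pct']:+.1f}%")
```

With these invented numbers, the latest scan is only about 5% larger than the prior one but 22% larger than the nadir. RECIST 1.1 defines progression of target lesions relative to the smallest sum on study (an increase of at least 20% and at least 5 mm), so a read that only compares to the prior scan could miss it.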
15:50: In fact, just because of the lack of technology, people downgrade imaging data in clinical trials to just numbers on a paper form or a spreadsheet, which is unconnected to the images and unable to really demonstrate in audits how you got those numbers and whether they're done correctly.
16:08: And so you're looking at having these imaging assessments done to make decisions about whether the patient should be treated on an experimental therapeutic.
16:17: And so you know, there have been other papers published showing about a 10% censoring rate of patients that get enrolled for central review and then have to get taken off trial.
16:27: So that's both really poor for the patients.
16:31: And it's also really expensive for the companies running the trials to have patients that are not actually valid for the trial.
16:40: What are the unique challenges with analyzing imaging data in a clinical trial?
16:44: Some of the things that are unique about imaging data are the size of the files.
16:48: So you've got these huge imaging files, the scanners have become able to work faster and faster to produce more and more slices and provide more and more data that have to get read and reviewed.
17:03: So that's been an added burden because the imaging workload has grown much faster than the radiology resource.
17:12: So there's a lot of strain on radiologists clinically, which means that for clinical trials, they're really strapped, because they have an expanding clinical workload and they're getting pressured to keep up their RVUs and their productivity on the clinical side.
17:26: And so they're being asked to read all these cases where they're doing a qualitative read and then getting interrupted to do a few clinical trials assessments that are quantitative and where they don't have the tools and they don't have the information to know what the criteria are.
17:39: So it's really disruptive to their workflow. And there's the complexity of doing the imaging assessments for clinical trials in oncology, because there are like 30-plus different primary response assessment criteria.
17:53: The most common one is RECIST 1.1, but that's only one of 30 different criteria for different kinds of tumors.
18:00: And then each trial has its own modifications and different sponsors have different modifications.
18:05: But without a platform to manage all of that workflow and make sure that the images are assessed correctly and compliantly with each trial's trial-specific protocol, it's really an impossible task for the radiologists.
18:18: And you know, everyone wants to do a good job.
18:20: Everyone I think feels like they're trying to do their best, but most people just are in a situation where they don't have tools to meet the needs of what's being asked of them.
18:32: And so it's just frustrating for everybody.
18:34: What's the biggest data problem with imaging data?
18:37: And what impact does that have on research and patients?
18:40: Well, from my perspective, from the imaging side of things, I think the biggest data problem is the quality and timeliness of the data for clinical trials.
18:51: So the norm, I would say, right now, you know, not counting the sites that we're working with and maybe a few sites that have developed their own in-house solutions,
19:00: The norm is a broken workflow, unreliable data, turnaround times that are not meeting the needs of the patients or the investigators.
19:11: You would be amazed at how many sites that we talk to are using manual data entry and paper forms and spreadsheets and have really cumbersome workflows.
19:22: And you know, I mean, there's not enough people to waste all the time and effort doing manual workflows.
19:27: Patients are coming in getting scanned and then they have an office visit with the investigators who have to make a decision about whether they meet the clinical trial criteria, which involve imaging assessments according to specific criteria.
19:40: And they are generally not getting those in time and even when they are, they are not getting compliant assessments.
19:46: And so, you know, you're having patients falling off trial because they're not getting their results in time, or you're having investigators do their own imaging assessments and then later getting the radiology results, and then often they don't match, and then they have protocol violations because they treated a patient that didn't meet the criteria to be treated.
20:03: Someone from a cancer center in New York City recently told me that it took them four weeks to get a clinical trials imaging assessment from their radiology practice.
20:13: Now, they are not waiting four weeks to get the patient treatment decision.
20:16: So what's happening there?
20:18: You know, it's really the whole motivation for what we've been doing in both the clinical trials imaging assessment realm and, you know, in the open source realm for data. Patients are really being ill served.
20:30: And that's what really kind of motivates me and keeps me going when I'm running into these frustrations with institutions that have people who understand the benefit and want to get it done
20:40: but can't seem to have the institutions getting out of their way.
20:43: What are some solutions to addressing all of these barriers and obstacles? Have a way of enabling people to share resources and work together across institutions.
20:53: So for example, at the Dana-Farber/Harvard Cancer Center, we provide imaging assessments for other cancer centers.
20:59: So we have about a half a dozen cancer centers that we do their reads for where they don't have the infrastructure or the radiologists to do it but they use our platform, upload scans to us.
21:08: We do the assessments and they view the results on our platform.
21:11: Now, we also have other cancer centers that we collaborate with where sometimes they're short staffed and sometimes we're short staffed.
21:19: So the University of Washington in Seattle, sometimes they're short of radiologists to read, but they have image analysts to do the preliminary assessments.
21:29: And so their image analysts do preliminary assessments.
21:31: And our radiologists from DF/HCC will read for UW.
21:36: Now, this summer, we're short staffed on image analysts because we have a few out on leave.
21:41: And so we're having UW image analysts do the preliminary assessments for us and our radiologists are reading those.
21:48: So we're working back and forth sharing those, while at the same time, we're reading for Massey Cancer Center, and UW is helping us with preliminary assessments, and we're doing reads for UW.
21:58: So we've really created a whole community that can collaborate and work together because it's not a problem that there aren't people who can do the reads.
22:06: It's the problem that not every site has people who can do the reads at their site at the time they need them.
22:10: So, you know, we've really created a whole ecosystem where people can collaborate, people can work together.
22:17: We can also implement a clinical trials protocol on the platform once and deploy to multiple sites on that trial.
22:24: So never before
22:26: have people been able to have a common framework where all the sites on the trial could have a system that implemented the imaging assessment workflow the same and made sure that the imaging assessments were all done the same regardless of where you were scanned on that trial.
22:41: So, you know, these are the things like you say, well, shouldn't people be able to do that?
22:44: Well, they should, and they haven't been able to. What Gordon Harris's group is doing is really revolutionary in terms of making things really streamlined and easy for people to do these clinical trials, which will really help us advance cancer research.
23:01: What's really cool about what David and Gordon are doing is their projects are what's called open source projects, meaning that they have made their software available, the code that is for their software available to the public.
23:18: So other people can take their software and use it for something else that can also be used to benefit cancer research or maybe even some other type of medical research.
23:27: What's amazing when talking to these researchers is that for research to work and to work well, so many pieces need to align, and without the right tools and the right community buy-in, it can't happen.
23:39: And so we're starting to see a theme here.
23:41: Our ability to explore new types of data means an exponential increase in that amount of data.
23:47: And also challenges, of course, that follow. How are we going to deal with this data?
23:51: How are we going to really get everything we can to ultimately help cancer patients? In our next episode,
23:58: we'll discuss a groundbreaking data sharing initiative and explore how their experiences can shape how we think about data sharing in the future.
24:08: Thank you for listening to Cracking the Cancer Code.
24:10: New episodes are released every other Monday
24:12: wherever you get your podcasts. You can find out more about our work at itcrtraining.org.
24:18: This podcast is sponsored by the National Cancer Institute through the Informatics Technology for Cancer Research program, grant number UE5CA254170.
24:30: The views expressed in this podcast do not reflect those of our funders or employers.
24:34: We'd like to thank everyone who graciously lent us their time for making this podcast. Without their contributions,
24:40: it would not be possible.
24:43: And I know this exists for other software too, where I've been sort of shocked to realize, like, you're just like us.
24:48: Like, we think that you're this gigantic big program that everybody uses.
24:52: Like, REDCap is a perfect example of that.
24:54: And you realize that again, like, for that software, it's Paul Harris who keeps it going, and he is constantly trying to pull together funding to keep everything going.
25:03: Just like, we as a nation rely on this software. Like, every academic medical center uses it, and we all just assume it's there. Nobody thinks about, well, how is it being supported?
25:11: So it's not really an ideal way to do things.