Big Data in Biology: Big Data Challenges and Solutions: Control Access to Individual Genomes
Series Introduction: I attended the Keystone Symposia Conference: Big Data in Biology as the Conference Assistant last week. I set up an Etherpad during the meeting to take live notes during the sessions. I’ve compiled all the abstracts, notes and slides (where available) here. Shout-out to David Kuo for helping edit the notes.
Warning: These notes are somewhat incomplete and mostly written in broken English.
Other posts in this series: #
- Big Data in Biology
- Big Data in Biology: Large-scale Cancer Genomics
- Big Data in Biology: Databases and Clouds
- Big Data in Biology: Personal Genomes
- Big Data in Biology: Imaging/Pharmacogenomics
- Big Data in Biology: The Next 10 Years of Quantitative Biology
Panel: Big Data Challenges and Solutions: Control Access to Individual Genomes #
Monday, March 24th, 2014 2:15pm - 4:00pm
Panel members #
- Moderator - Doreen Ware (DW), Cold Spring Harbor Laboratory, USA
- David Haussler (DH), University of California, Santa Cruz, USA
- Laura Clarke (LC), European Bioinformatics Institute, UK
- Jill P. Mesirov (JM), Broad Institute, USA
- Andrew Carroll (AC), DNAnexus, USA
- Lincoln D. Stein (LS), Ontario Institute for Cancer Research, Canada
- Mark Gerstein (MG), Yale University, USA
DW: Interaction between panel members/audience
Started planning this meeting almost 1.5 years ago - we decided that controlled access would be a main talking point
Challenges and opportunities #
- scale (volume)
- variety - the heterogeneity of the data. Representation and analysis of this. How will we deal with metadata? how to integrate?
- timeliness - velocity, getting data, operated on, updates
- privacy, topic of this session - key point
- usability, want this data to be useful - accept human input and support collaborations. Interpretation of the data.
Personal Genomes #
- 1000 genomes
- publishing their own genomes
- Personal Genome Project (George Church)
- GigaDB - liver cancer patients
- more examples: having access to this data is not easy! Privacy and bioethics. Nature - “Privacy protections: The genome hacker”. Some of the privacy we think we have isn’t as private as we think. ‘Anonymized’ sets can be re-identified by combining the data. How will we handle integration?
Panel Introductions #
Mark Gerstein (MG) - Yale (bioinformatics)
- originally worked in model organisms
- transitioned to human genomic - scale, but not really privacy issue
- disease genomics - privacy issues!
David Haussler (DH) - UCSC
- running into all kinds of data issues
- go through long protocol to get to all data sets
- cancer datasets - didn’t make it clear it was a childhood cancer study, was rejected
- subtle consents get crazy
Laura Clarke (LC) 1000 genomes project
- managed access data on some projects - trying to make applications as lightweight as possible
- new open/managed accessed project - not clear how to make useful
Lincoln Stein (LS) OICR
- works with ICGC DCC - make cancer genome datasets available as frictionless as possible
- open and controlled tiers
- main concern: maximize access to data - make it useful - do not violate donors’ trust. Data was donated under agreement that it is used for research and no other purposes (identification)
Jill Mesirov (JM) Broad
- most of the time collaborators worry about permissions etc.
- there’s a tension
- mostly clinical studies - patients want to do whatever they can to help us understand their diseases. BUT: learned that consents that they sign aren’t necessarily consumable by the average person
- many patients don’t understand - if I share my data it doesn’t just affect me, but my relatives and other people that share large pieces of their genome with me
- other issue: ethical/legal. A lot of the problem with disclosing the identity behind patients’ data, clinical info and genetic info is that it can affect things like hiring, insurance, liability. These risks need to be clear to them
Andrew Carroll (AC) DNAnexus
- used 1000 genomes data
- used CHARGE consortium data - under IRB restrictions. only combine in appropriate way, keep data flowing consistently
- used pharma company sequencing data - internal for r&d
DW: Q: Are the current support systems right now sufficient?
LS: The issue a lot of researchers are encountering - like cell phone makers, every phone contains thousands of licensed technologies: Need to negotiate with each maker of a hardware/software component. It’s beginning to get a lot like that in genomics - each dataset is consented under different rules. Cancer research, pediatric research, general research… must observe restrictions on each of the components. Makes it very difficult to combine two datasets. Even using controls - can’t use other sets as normal controls in a cancer study if they only consented for a diabetes (or other) study. Need uniform consent - stop focusing on dataset and focus on researcher - have an ethically approved researcher status. If I pass every year
JM: One of the things I observed at Broad - datasets will come in to Broad and will take on a life of their own. Shared in ways that are not appropriate for the consent - through ignorance. Implications for the data not made clear. We put in place a training program around how you handle this kind of data, and to minimize the replication of this kind of data. Got authorisation - did not duplicate the data. Track better who is accessing what.
LC: As we move towards centralised compute and moving analysis to compute - these sorts of challenges will be easier. One of the key points of making this data useful is better defined consent.
MG: I second the points of LS and JM. Most inappropriate use of private data is accidental - ignorance. People do that because it’s easier - just copy the dataset, don’t go through protocols. Need to make good tools and infrastructure so there’s no incentive to do it wrong.
DW: Moving forward (question to JM), do you think there’s a need for some sort of education on handling this data?
JM: Yes, especially trainees who are beginning their research career. Human subject certification test - goes on forever. These are the key important things you need to understand: These people are giving you a gift, something very personal about themselves for you to further your knowledge and help treat the disease. In turn you have to respect that, and here are some simple rules on how to do this.
AC: Looking at this in a technical sense. Many people working on this in a flexible way, someone will make a mistake. We need to architect technical solutions that make it easier for the graduate student to not make a mistake.
MG: Not only the grad students - in clinical orientation. Clinicians sloppy about where they put their data and how they move it. Need to educate.
LS: There’s a lot of debate in the USA on the legality on putting genomic data in clouds. Misguided debate - more secure than letting grad students play with it on laptops.
DW: Do you think the compute clouds are secure enough to share among collaborators?
DH: Appropriate levels, the cloud vendors can be more secure than the NSA. It’s going to be so much more secure than at any medical institution. Need to work with the cloud vendors to come to terms with a compliance framework.
Institutions may not want to change for historical reasons (e.g. consent forms specifying where data is stored). Why does banking accept cloud and not NIH?
DW: Are the current restrictions on whole genomes too restrictive?
AC: depends on what you want to do, how ambitious you want to be. There’s an immense amount to discover. If it’s not acted on in an academic setting, pharma will go out and sequence their own pools and make their own discoveries. The value is there - if there isn’t a means to get at it they’ll find their own way.
DH: There is a willingness to use controls and share in Pharma
LC: Pharma doesn’t want to make this massive investment by themselves individually.
LS: Technology is enabling lots of things. Patients that have a serious disease are very willing to share their genomic data for the greater good if it’s handled appropriately. PMH (Princess Margaret Hospital, Toronto) study - has a sociology group to gauge attitudes on genomic sequencing. When patients were asked:
- ‘would you be willing to share your mutations with researchers?’: 100% positive response rate.
- ‘would you share your germline polymorphism around areas relevant to your cancer’: still positive responses
- ‘would you share incidental findings’: Complete drop off - almost nobody in the study wanted incidental findings disclosed.
Need to rework the regulatory framework and the way consents are posed in order to address the real and perceived harms to patients/donors/family members
JM: This is an area of intense activity. Regulations/consents/risks conveyed. It’s a tricky business.
Q: (Ouellette) What if people were told that these germline/incidental findings would help others?
LS: the way you ask greatly affects response. Wording. We want to look directly at what the short term and potential long term harms are. Short term - non-paternity. There will be people trying to figure out if their friends/neighbours are in a cancer db.
MG: I am a privacy advocate in this context. What is the harm that can happen? People don’t know exactly what the disclosure of their genetic info will affect. There will be a major harm to genomics and bioinformatics as a field if people commit stunts or a db gets hacked and privacy is broken. We have to think about how this reflects on the field. It’s potentially a bad thing - consent implies they’re really trusting us. You really have to understand the trust. If you breach the trust, everyone looks bad.
LC: Mark (MG), what do you think will be appropriate consequences? How do we maintain society’s trust?
MG: Concept of license (LS) - you are a responsible researcher, prove it and update it.
JM: Read Yaniv Erlich’s paper. We need to understand what data we can and what data we shouldn’t share. Some data was on ancestry.com - the db had addresses (city and state locations). It wasn’t the case that he went to a repository of data that was just genomic data. He was very clever, used a lot of ancillary information around the particular genomes to get that information. Great paper, raises a lot of issues. We could be disclosing identity.
NB: Paper mentioned:
Gymrek, Melissa, et al. “Identifying personal genomes by surname inference.” Science 339.6117 (2013): 321-324.
DH: We need to start thinking about privacy in terms of granular facts and how they are linked. Separate the idea of what information can be public from the associations between those facts, which are private. Internet of things/facts - if you can link multiple facts to the same person it causes a violation of privacy. Sharing private information means sharing the linkage -> previously anonymous data becomes controlled information.
- who can see it
- for what purpose.
Need to have remedies and ways of looking at controlling and approaching privacy.
Linking too many facts about one person
AC: Where this chain is broken - where someone is able to tie outside information to a piece of genomic sequence. It becomes easy to identify everyone related. For example, Bitcoin - if you can break some of these hashes, you can determine entire transactional history. Single break in the link will expose many people.
LS: So we need to ban the genealogy databases (laughter). That will break all the links and allow any piece of information be linked
DH: If you break it up to enough pieces, each piece will be uninformative
MG: There’s still the issue of outliers: you’re going to have outliers. Maximum income in a survey - you know who that is. Correlation - a lot of these factoids have subtle correlations that can undo de-identification. A few bits of information, some simple correlation
DH: I disagree. Suppose there’s a position on the genome where only one person has an A. Suppose I publicize that only one person has an A at this position.
LS: But what’s the usefulness of this for research - one single position? Once we get to the usefulness part, we run privacy risk.
DH: Yes, when we link things. dbSNP is fine, no one argues it ruins our privacy. There are stats
LC: There’s a lot of info they won’t put in dbSNP - because it becomes identification. These are the pieces that are important for research
DH: once we establish world of anonymous facts - can have private exchange of links
LC: are the barriers too high to establish that system
DH: it’s all out there with UUID, everywhere. Whole protocol is based on keeping secure private key chains
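NB: A minimal Python sketch of the “anonymous facts, private links” idea DH describes, not anything shown by the panel. The `FactStore` class, its method names and the example facts are all hypothetical; the only assumption taken from the discussion is that each fact sits under a random UUID, while the table linking facts to a person is released only to an approved requester for a stated purpose.

```python
import uuid


class FactStore:
    """Toy store: facts are public and anonymous, links to people are private."""

    def __init__(self):
        self.public_facts = {}  # fact_id -> fact text, safe to publish
        self._links = {}        # person_id -> set of fact_ids, kept private
        self._grants = set()    # (requester, purpose) pairs allowed to see links

    def add_fact(self, person_id, fact):
        """Store a fact under a random UUID that reveals nothing about the person."""
        fact_id = str(uuid.uuid4())
        self.public_facts[fact_id] = fact
        self._links.setdefault(person_id, set()).add(fact_id)
        return fact_id

    def grant(self, requester, purpose):
        """Record who can see the links, and for what purpose."""
        self._grants.add((requester, purpose))

    def linked_facts(self, person_id, requester, purpose):
        """Return one person's facts, but only under an existing grant."""
        if (requester, purpose) not in self._grants:
            raise PermissionError(f"{requester} has no grant for {purpose!r}")
        return sorted(self.public_facts[i] for i in self._links.get(person_id, ()))


# Hypothetical example data
store = FactStore()
store.add_fact("donor-1", "chr17 position X: allele A")
store.add_fact("donor-1", "diagnosis: breast cancer")
store.grant("approved_researcher", "cancer research")
```

Anyone can read `store.public_facts`, but without the private link table the two facts cannot be tied to the same donor. MG’s and LC’s objections still apply: a sufficiently rare fact can be identifying on its own.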
Q: (Schatz) Do you think the perception in the popular press of accurate identification is a problem?
LS: The issue in the popular press is that the informative power was oversold during the Human Genome Project. The promised results are not happening at the rate people expect
AC: Scope, sensitivity isn’t great. The problem will take care of itself
DH: people tend to overestimate the impact in the 5-10 year range, underestimate in 20+ range
LC: Global alliance, verification
Q: How do we contain/quantify the privacy that is consented for? Can we come up with metrics that quantify 1) uniqueness 2) identifiability? Actuarial tables to find uniqueness?
DH: We need to come up with categories - at this granularity, if it’s anonymous, it is not identifiable in itself. Then, the only thing that’s private is the linking of the pieces. Don’t think it’s a matter of counting how many people have that type of value. We can make assessments where it’s granular enough.
MG: Agree with DH, make a few observations: theory of information. Risk: relationship between amount of info leaked and amount of risk taken.
When we talk about this information leakage, we’re talking about identifiability risk AND characterization risk. But people don’t consent to having all their proclivities/characteristics unearthed over time.
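NB: One way to make the audience question about uniqueness metrics concrete is a small Python sketch. This is not a method proposed by the panel: it uses the standard k-anonymity idea (size of the smallest group sharing the same attribute values) plus a simple fraction-unique measure, and the records and field names (`zip3`, `birth_year`, `genotype`) are made up for illustration.

```python
import math
from collections import Counter


def class_sizes(records, quasi_identifiers):
    """Count how many records share each combination of quasi-identifier values."""
    return Counter(tuple(r[q] for q in quasi_identifiers) for r in records)


def uniqueness(records, quasi_identifiers):
    """Fraction of records whose quasi-identifier combination is unique."""
    sizes = class_sizes(records, quasi_identifiers)
    unique = sum(1 for n in sizes.values() if n == 1)
    return unique / len(records)


def k_anonymity(records, quasi_identifiers):
    """Smallest group size: each record is hidden among at least k records."""
    return min(class_sizes(records, quasi_identifiers).values())


def bits_to_identify(population_size):
    """Bits of information needed to single out one person in a population."""
    return math.log2(population_size)


# Hypothetical toy dataset
records = [
    {"zip3": "805", "birth_year": 1970, "genotype": "AA"},
    {"zip3": "805", "birth_year": 1970, "genotype": "AG"},
    {"zip3": "805", "birth_year": 1982, "genotype": "AG"},
    {"zip3": "100", "birth_year": 1982, "genotype": "GG"},
]

print(uniqueness(records, ["birth_year"]))              # nobody is unique
print(uniqueness(records, ["birth_year", "genotype"]))  # everybody is unique
```

In this toy set, birth year alone leaves everyone in a group of two, but adding a single genotype makes every record unique. That is the linkage effect DH and MG describe: each fact is harmless alone, and the combination is identifying.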
Q: Danger of privacy - when the first db gets hacked. Are we selling these databases as being secure to the public? Change legislation - can’t be discriminated against. This will eventually be leaked - add more security or lessen consequence of being identified.
JM: This is critical - legislation. If I can’t get health insurance because of BRCA mutation, it’s important. Can’t get a job because genetics are known. Some initial legislation has been passed, but it’s up to us to lay out what this will look like. Serious risks to people in terms of daily life.
AC: Everyone agrees we have to have the greatest legislative protection for people. But even if it’s passed, it will not be sufficient - discrimination still happens.
Q: Whole genome will be cheap and accessible and non-scientists will be able to get this done. In that world, if you can get a hair from someone and collect the data yourself, you can circumvent all these security issues.
LS: Real and scary scenario - happening in law enforcement. Suspects are routinely genotyped without their knowledge.
JM: You should all watch the movie GATTACA - logical and scary extension of all of this.
LS: The federal databases of genotype data are extremely well thought out. # of SSRs is just enough to identify people to narrow down a suspect pool, but not enough to pick out a single person in the whole US. But the state databases are unregulated.
MG: Privacy bias - genetics has a checkered history. Darwin, 1920, etc. Given that history it’s good to reflect on this future.
Q: There are other communities that have faced this: Census department. They understand the benefit to provide data to researchers (summary statistics, de-identified sub-sets, experimenting with creating simulated sets where the probabilities of the data are mirrored and operated on). Can there be a parallel track where we start to experiment?
MG: Big Data is about data not simulations. Very doubtful if a simulation could recreate the linkage.
DH: The linkage of all of our genomes is a product of our common heritage, as it gets dense we’ll reach a critical point where we can do a lot of inference.
LC: If someone comes up with a Facebook for human genomes, way more useful
LS: I once ran a thought experiment across my wife’s relatives (South Indians). Imagine you have a cell phone app where you can search for all your relatives within an nth degree radius. “Oh yes, love it! This would save so much time!” - Then I say, all you have to do is donate a bit of DNA - “sure!”
JM: I look at my children and their friends - their notion of privacy is very different. Sharing their genomes would be a drop in the bucket.
AC: Enough people would share so that if your genome data is linked they could identify you.
MG: Then when you’re the only person left who cares about privacy, you’re identified as the one person who hasn’t shared.
Q: We talked in circles around the legal issues - imagine the outcomes. Danger: if we don’t do this, we’ll end up in a dystopian situation where we can’t talk to each other.
Q: At a big data conference. Difficult to link these entities - that’s why we’re here, to make these links. How privacy affects the downstream. Should there be a consideration ‘upon your expiration your data is withdrawn’?
LC: I had 23andMe done. I discussed this with my parents, but not sisters.
JM: Watson’s personal genome was published, but not his APOE status
MG: People were able to trivially determine his APOE status.
JM: Concerning: people often don’t understand that a lot of these genetic variants are just a predisposition to a certain endpoint. The kind of education that is required is huge - help people understand probabilistic risk.
LS: Recently discussed Canada’s policy on withdrawal of genetic information. Proposed to allow 1st degree relatives to withdraw on demise. This was unworkable - what if siblings disagree?
LC: We’re getting paternalistic - remember getting this data out and easy to use will be of such benefit to health and science. We shouldn’t put too many barriers up.
DH: We owe it to our grandchildren to do our best to understand how genomes and disease are related
MG: One thing that’s important - there are a lot of countries that don’t care about privacy. Their legal systems are not set up to worry about this. I can imagine a future where most of the genomics and discoveries are centred in places that don’t put up these barriers
Q: We should license people to use big data analysis. Age of big data - privacy is an illusion. You can go to someone’s home and know everything about them. If someone wilfully wants to know you - it doesn’t cost that much.
AC: It all comes down to how many people have access to data. We want to provide a technical solution robust enough to share with researchers and help cut some of that off.
DW: What are some of the technical barriers in the next 5 years?
LS: enabling people to get into cloud (or whatever) and use it. Accessible to as many people as possible in a secure manner
JM: How do I find out what data is out there that’s relevant to my particular project/study. Better metadata. If I could find the sets I need - I don’t mind going to whoever owns them and get permission. There’s a lot of data that’s acquired that people don’t know about and it’s not described. Description, registry and search - without command line.
AC: Making sure everyone is doing what they know how to do best. Bioinformaticians aren’t tied down doing things outside their expertise, biologists have access, researchers have access
MG: Having lots of worked out exchange standards for secondary analysis files. Want to share reads/BAMs, but secondary (summarized) data sets are very useful. Very little standardization now.
DH: technology moving so fast - have to be nimble. Have flexible standards/evolving. Up to speed to transfer/process/exchange data. APIs are important. Metadata is important. Require goodwill, work together to create standards. e.g. W3C - internet standards. Not easy.
LS: analytic pipelines are complicated and finicky. Small changes get dramatically different results. Projects like Galaxy and synapse - keep track of steps of a workflow. Track the output/input files - human and machine readable and reproducible.
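NB: A rough Python sketch of what “keep track of steps of a workflow” can look like. It is not how Galaxy or Synapse actually work; `run_step`, the two toy steps and the log format are hypothetical. The idea it illustrates is recording, for every step, the parameters plus a hash of the input and output so a run can be checked and reproduced.

```python
import hashlib
import json


def sha256_hex(data: bytes) -> str:
    """Hex digest used as a fingerprint of a step's input or output."""
    return hashlib.sha256(data).hexdigest()


def run_step(name, func, data: bytes, params: dict, log: list) -> bytes:
    """Run one workflow step and append a provenance record to `log`."""
    output = func(data, **params)
    log.append({
        "step": name,
        "params": params,
        "input_sha256": sha256_hex(data),
        "output_sha256": sha256_hex(output),
    })
    return output


# Two toy steps standing in for real pipeline tools
def uppercase(data: bytes) -> bytes:
    return data.upper()


def trim(data: bytes, n: int) -> bytes:
    return data[:n]


log = []
reads = b"acgtacgtnn"
out = run_step("uppercase", uppercase, reads, {}, log)
out = run_step("trim", trim, out, {"n": 8}, log)

# The log is machine readable (JSON) and still legible to a human
print(json.dumps(log, indent=2))
```

Because each record stores both hashes, a changed parameter or input shows up as a different output hash, which is one simple way to catch the “small changes, dramatically different results” problem LS mentions.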
DW: Any other points? Any predictions for the next 5-10 years?
LS: In the next 10-15min, we’ll all enjoy a nice reception.
MG: sports genomics and superstar genomics
DH: I see turmoil and opportunity - research projects talking to each other at a large scale. Work with clinical world.
JM: Great promise for translation. we’re doing better at identifying the genetic variants and signatures associated with disease. Beginning to make progress on mechanism. Treatment is a greater challenge - hopefully it will come.
LS: The nature of the clinical trial is going to change - not just a single region/centre with 100 patients. Globally distributed clinical trials - networks of independent physicians. Patients with rare genetic variants enrolled. Precision genetics clinical trials.
LC: Hope: we can start answering basic biological questions and providing clinical outcomes
AC: Predict: tools will become more robust: Clinical applications - cancer will lead the way. Drug companies will combine genotype and phenotype data. The majority of sequencing will be cattle, plants ($2 a plant!)- humans are backwards.