The 'Genome Hacker' Who Mapped a 13-Million-Person Family Tree
Huge crowdsourced genealogy databases are inspiring new genetics research.
Yaniv Erlich has been a white-hat hacker and a geneticist at Columbia University, and now he works for a genealogy company.
This unusual career trajectory has led, most recently, to a 13-million-person family tree unveiled today in Science.
The massive trove of data comes from public profiles on the crowdsourced genealogy website Geni.com, and it sheds light on human longevity and dispersal over time. (I wrote about a preprint of this paper last year.) But most of all, Erlich is excited about overlaying DNA information on top of family trees to study genes implicated in disease.
MyHeritage, the company behind Geni.com, also sells DNA ancestry tests. And since 2017, Erlich has been on leave from Columbia working as MyHeritage’s chief scientific officer to develop those DNA tests.
If that sounds like a lot of data going into the hands of one company, well, it is. Erlich has been deeply involved in discussions about DNA research and privacy. In 2013, he showed it was possible, using only public information in places like consumer genealogy databases, to identify certain study participants who had contributed their DNA to research projects. For this feat, Nature dubbed him the “genome hacker.”
I talked to Erlich about how he thinks about privacy in the era of big-data genetics research—he’s actually published his genome online—and how genealogists have inspired his research. This interview has been condensed and edited for clarity.
Sarah Zhang: You’ve constructed a family tree connecting 13 million people. That’s quite an accomplishment, and so rather than let you bask in it, let me ask: Would it be possible to construct a family tree that connects every single living person in the world?
Yaniv Erlich: There is a theory you need to go 75 generations or so to connect everyone in the world. By everyone, I mean everyone. I’m talking about people in some tribes in the Amazon to someone in Iceland. So it’s possible, but there are no really good ways to trace so deeply. Maybe with genetics we can start to bridge gaps where the genealogy is not there.
Zhang: How did you get interested in genealogy?
Erlich: It’s a bit of a long story. Every kid in Israel in seventh grade needs to do a genealogy project. I did my genealogy and I was so excited about it. In fact, I won the school award for the best project.
Now at the end of 2008, my third cousin, who is really into genealogy, told me about this website Geni.com. He emailed me like, “Oh, do you want to put in some people in your family?” I was looking at the data and I was thinking (this was toward the end of my Ph.D.) that somebody should download the data and do something cool with it. I didn’t have any application in mind.
Then in 2010, I started my own lab at the Whitehead Institute [at MIT]. I was sitting there thinking about what I could do and reading a bit about how to mine social media. There was a book, Mining the Social Web. I sent a cold email to the CTO of Geni, asking if I could download the data. It’s a cold email, who knows, right? He got back to me saying, “Yeah, you can download the data, no problem,” and gave me some pointers on how to do that.
Zhang: Your study is published now, but it seems like this is a beginning rather than an end. I’d imagine what you’re really interested in is overlaying genetic data on top of the family tree.
Erlich: Exactly. At MyHeritage, we started to offer DNA tests to users in November 2016. Since then we’ve collected 1.2 million DNA profiles of users.
Zhang: And why make the jump to MyHeritage? Are there things you can do at a company you couldn’t do in academia?
Erlich: I think this is a model for the future. There are certain things that you can only do in academia. There are certain things you can only do in companies. If you want to move in scientific endeavors, collaborating with companies is a very fruitful direction.
I could not do this study if I was just in a company because it’s years and years of process, and this is academic freedom that I could actually take this time. On the other hand, if there are no companies, nobody would collect this data. This amount of data, you cannot get it in academic studies. Companies have the ability to reach out to millions of genealogists, to work with them, to convince them to give the data, to give them the good feeling about it. You need a company that has websites that are perfect, that are responsive, that are fun to use. Not PubMed, which is a nice website but has a very geeky look.
Zhang: So what are you going to study now with the combination of DNA and genealogy data?
Erlich: Even better, we also have phenotypes now. Since October, we started to allow users to fill out surveys about themselves. So we have the genealogy and the surveys and the DNA. Our surveys are modeled after the U.K. Biobank surveys. We’re asking, did you have a heart attack? Are your parents suffering from Alzheimer’s?
About a year ago, Joe Pickrell and I had a paper in Nature Genetics on a genome-wide association study by proxy. Say we want to look at genes related to Alzheimer’s in our data set. If I go to our users and ask whether they have Alzheimer’s, they are healthy people; otherwise they wouldn’t be buying the test. So for certain diseases, it’s quite hard to get the information. What we show is that you can ask users about their first-degree relatives [parents, siblings, and children], and since users share half of their genome with those relatives, you lose half of the signal, but you get so many people to answer the question that you get back the power needed to implicate genetic variants.
Zhang: Let’s talk about privacy. Senator Chuck Schumer recently held a press conference calling for more scrutiny of DNA tests. You have a history of thinking about DNA and privacy, so how has that informed MyHeritage’s practices?
Erlich: It’s part of the challenge of this new era. At MyHeritage, we take it very seriously. We allow people to delete profiles. There are settings—you can have your profile private or public. People can delete their DNA data, and we’ll go to the lab and we’ll even wash away the tube. So we take these things very seriously.
But if you ask me, do you want to share with me your genealogy or your cellphone records or search-engine records, I will share my genealogy.
Zhang: In fact, you’ve put your own genome online.
Erlich: Because I feel like I don’t have a lot to risk in general. If you ask me do you want your search-engine data or data your ISP sees or your bank account versus your genome, your genome is actually quite—I don’t think it’s very interesting.
Zhang: In 2013, you actually published a paper finding that it’s possible to identify some DNA donors from publicly available information. I think this study still gets talked about a lot. Do you think it changed anything?
Erlich: I think it changed the way policy makers think about how we communicate risks to participants. I think previously the prevailing thought was we just promised them everything will be okay. Now we promise you 100 percent effort, but we are also learning.
I think the other interesting thing is that in 2013, people didn’t understand why I did this study. I got many questions: “Why even do something like that?” And now that we’ve matured into this data-intensive world, it has become very clear this is the right research to do.
Zhang: That study was actually inspired by a mother-and-son pair who tracked down the son’s anonymous sperm donor using consumer genealogy databases, right?
Erlich: Yeah, the mother worked in Cold Spring Harbor [Laboratory in New York] or she used to work there 20 years before, and she contacted Cold Spring Harbor. I did my Ph.D. at Cold Spring Harbor, so I met her. I was like, “Wow, that’s crazy.” It was really mind-blowing.