Usually, a phrase like “OkCupid data leak” would refer to a hacking incident done with a stated intent to harm. In this case, the leaking of data from 70,000 profiles from the popular dating site happened in the name of scientific research. But does that make it okay? Not necessarily.
Danish researchers from Aarhus University scraped the data from the site without asking OkCupid, or any of the users, for their consent. Then, they published the data set on the Open Science Framework, where it has already been downloaded hundreds of times. Emil Kirkegaard, the lead researcher of the group, published a paper about the data in a scientific journal called Open Differential Psychology—a journal for which he serves as editor.
The data dump doesn’t include any of the users’ real names, but it does include enough information about each user that they could be identifiable. Vox reports that the data set contains users’ locations, their demographics (e.g. age, gender, sexuality), as well as their answers to personal questions about which traits they’re looking for in a partner.
Kirkegaard and his collaborators have framed the data dump as an important tool for social science research, but their decision to publish it has come under fire from their peers for a potential breach in ethics. When questioned about the morality of releasing the data without asking either OkCupid or the users about their willingness to take part in a scientific data set, Kirkegaard responded: “Data is already public.”
That doesn’t make it right, though. The American Psychological Association makes clear that participants in studies have the right to informed consent; the European Federation of Psychologists’ Associations has a very similar clause in its ethical guidelines.
So, the users should definitely have had the opportunity to consent to their profiles appearing in the data set. But would OkCupid have allowed this, even if their users were okay with it? An OkCupid spokesperson told Vox, “This is a clear violation of our terms of service — and the Computer Fraud and Abuse Act — and we’re exploring legal options.” Sounds like that’s a no, then.
Even if OkCupid does issue a takedown of the dataset, it won’t wrap up the larger question about the ethics of studying “public” internet behavior. Even when internet behavior is nominally “public,” we don’t always see it that way in practice. All of these “leaked” OkCupid profiles were already available to view by the public, as Kirkegaard stated. But now that they’ve been exported and repackaged into a dataset for scientific use, it makes us feel uncomfortable. The only difference is the context. But context is everything.
Some believe that since these profiles can already be accessed by the public, they should be fair to use without question (clearly, Kirkegaard and his colleagues think so). Most people seem to believe that the users should have had the option to consent before their data was released. But still others have pointed out that OkCupid and other sites like it already perform studies on their users without asking them, and scientists can’t replicate or verify any of these studies because they don’t have access to the data. The fact that OkCupid, as opposed to the users, has the final say about what happens to the data might not necessarily be a comfort.
Anil Dash’s essay about internet privacy, titled “What Is Public?”, is over two years old now, yet it’s still the best navigation of the topic that I’ve seen. Dash points out that it’s in the media’s best interest to continue to frame “public” as a gray area, since it allows journalists to profit off of sharing other people’s content from social media (e.g. embedding a series of tweets written by someone else as an “article”). Meanwhile, the tech industry benefits from framing “public” as a binary, because it means companies can monetize their own data. In this case, OkCupid owns the data and doesn’t want anyone to scrape it, no matter how “public” that data may be. But that means that it’s not possible for anyone but OkCupid to study this data and analyze it.
Luckily, this ethical quandary is not being left up to me to solve, because I don’t have a good answer. I don’t like that these researchers published this data dump without asking anybody if it was okay. I don’t like that OkCupid (and other websites) experiment on their users without asking. Yet I also enjoy reading studies of human behavior, and I’m sure that I’ve read studies that were created from scraped information in the past. In all my coverage of social media and the tech industry here, I often cite studies describing demographic information about users. Did social scientists have to pay Facebook and Twitter in order to get that information? Did websites collect it and publish it without asking? How is my own data being used?
As a lowly internet user, I realize that I’m probably under a microscope all the time, and that I have no idea where my data’s ending up, even if I think I have a lock on it. It’s still nice to see that at least some researchers don’t see all people as data points, though, and they’ve pushed back against this data set being published. There’s got to be some way to study human behavior online (an important task, no doubt!) while still collecting the data ethically. That goes for researchers and tech companies and, yes, journalists like yours truly.
Normally I’d close with a joke here, but I got nothing. The problems of the internet are just overwhelming as heck, sometimes. Yeesh!