How Well Can Machine Learning Predict Sexual Orientation, Pregnancy, and Other Sensitive Information?

If machine learning models predict personal information about you, even unintentionally, what sort of ethical dilemma does that create? Where should the line be drawn? There have already been many such cases; some have become overblown folklore, while others are potentially serious overreaches by governments.


This article is based on a transcript from Eric Siegel’s Machine Learning for Everyone (fully free to access on Coursera). View the video version of this specific article.

Machine learning predicts sensitive information about you, such as your sexual orientation (a featured example of the accuracy fallacy), whether you’re pregnant, whether you’ll quit your job, and whether you’re going to die. Researchers predict race based on Facebook likes, and officials in China use facial recognition to identify and track the Uighurs, a minority ethnic group. Some of the most promising predictive models deliver dynamite.

Now, do the machines actually “know” these things about you, or are they only making informed guesses? And, if they’re making an inference about you, just the same as any human you know might do, is there really anything wrong with them being so astute?

In this article, I’ll separate fact from fiction and provide some more details on these cases.

In the U.S., the story of Target predicting who’s pregnant is probably the most famous. In 2012, a big media storm led with the story of a father learning his teen daughter was pregnant because Target had sent her coupons for baby items in an apparent act of premonition. I’d say at least one in three of my friends and family who aren’t in the industry have heard this story.

Let me tell you what’s false and then what’s real. The teenager’s story is apocryphal. It was I who told the New York Times reporter who broke the story about Target predicting pregnancy in the first place. The revelation originated from a Target database marketing manager whom I invited to keynote at the Predictive Analytics World conference. All the news subsequently circulated about the story stemmed entirely from that keynote speech’s video (free to view here). The main problem with the story is that there’s no substantiated connection between the alleged, unnamed teen — if she even exists — and Target’s pregnancy prediction project. If some particular teen happened to receive materials that included those coupons, there’s no way to know that was a result of a pregnancy model. The New York Times reporter leads you to assume it, but if you read carefully, you’ll see it’s only cleverly implied there.

On the other hand, even if there’s no known substantiated case of a pregnancy model violating someone’s privacy, that doesn’t mean there’s no risk to privacy. After all, if a company’s marketing department predicts who’s pregnant, they’ve ascertained medically sensitive, unvolunteered data that only healthcare staff are normally trained to appropriately handle and safeguard.

As one concerned citizen posted online, imagine a pregnant woman’s “job is shaky, and your state disability isn’t set up right yet and … to have disclosure could risk the retail cost of a birth ($20,000), disability payments during time off ($10,000 to $50,000), and even her job.”

This isn’t a case of mishandling, leaking, or stealing data. Rather, it is the generation of new data, the indirect discovery of unvolunteered truths about people. Organizations predict these powerful insights from existing innocuous data as if creating them out of thin air.

So are we ironically facing a downside when predictive models perform too well? We know there’s a cost when models predict incorrectly, but is there also a cost when they predict correctly?

Even if the model isn’t highly accurate per se, it may still be confident in its predictions for a certain subgroup. Let’s say that 2% of the female customers between the ages of 18 and 40 are pregnant. If the model identifies a group of customers who are, say, three times more likely than average to be pregnant, only 6% of those identified will actually be pregnant. That’s a lift of three. But if you look at a much smaller, more focused group, say the top 0.1% most likely to be pregnant, you may have a much higher lift of, say, 46, which would make women in that group 92% likely to be pregnant. In that case, the system would be capable of revealing those women as very likely to be pregnant.
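
To make the arithmetic concrete, here’s a minimal sketch in Python using the hypothetical base rate and lift figures from the paragraph above (none of these numbers come from a real Target model). It shows how a lift translates into the share of flagged customers who really are pregnant.

```python
# A minimal sketch of how "lift" translates into precision, using the
# illustrative numbers from the example above (not figures from any real model).

base_rate = 0.02  # assume 2% of female customers aged 18-40 are pregnant


def share_pregnant(lift, base_rate):
    """Share of flagged customers who are actually pregnant."""
    return lift * base_rate


# Broad segment: customers three times more likely than average to be pregnant.
print(f"{share_pregnant(3, base_rate):.0%}")   # 6% -- a lift of three

# Narrow segment: the top 0.1% scored by the model, with a hypothetical lift of 46.
print(f"{share_pregnant(46, base_rate):.0%}")  # 92%
```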

The same concept applies when predicting sexual orientation, race, health status, location, and your intention to leave your job. Even if a model isn’t highly accurate in general, it can still reveal, with high confidence and for a limited group, things like sexual orientation, race, or ethnicity. This is because there is typically a small portion of the population for whom prediction is easier. The model may only be able to predict confidently for a relatively small group, but even just the top 0.1% of a population of a million would mean 1,000 individuals have been confidently identified.
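
Continuing with hypothetical numbers, here’s a short sketch of why a model that’s confident about only a tiny slice of the population still matters at scale: the fraction is small, but the absolute count is not.

```python
# Hypothetical numbers: even when the model is confident only about the
# top 0.1% of scores, a population of a million still yields a large
# absolute number of confidently identified individuals.

population = 1_000_000       # assumed population size
confident_fraction = 0.001   # top 0.1%, scored most confidently
precision = 0.92             # e.g., 92% of that group correctly identified

flagged = int(population * confident_fraction)
print(flagged)                   # 1000 people confidently flagged
print(int(flagged * precision))  # 920 of them correctly identified
```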

It’s easy to think of someone whom you wouldn’t want to know these things. As of 2013, Hewlett-Packard was predictively scoring its more than 300,000 workers with the probability they’d quit; it called this the Flight Risk score. These scores were delivered to managers, and if you were planning to leave, your boss would probably be the last person you’d want to find out before it’s official.

And face recognition can serve as a way to track location, eroding the fundamental freedom to move about without disclosure. I certainly don’t sweepingly condemn face recognition, but know that the CEOs of both Microsoft and Google have publicly raised concerns about it.

There’s also the risk of predicting mortality. A consulting firm I know was modeling employee attrition for an HR department and realized it could also model death, since that’s one way you lose an employee. The HR folks said, “Don’t show us!” They didn’t want the liability of knowing.

Research studies have shown that models can also discern other protected classes, such as race and ethnicity, based on, for example, Facebook likes.

A concern here is how marketers may be making use of this. As Latanya Sweeney, a Harvard professor of government and technology, put it, “Online advertising is about discrimination… The question is, when does that discrimination cross the line from targeting customers to negatively impacting an entire group of people?”

And indeed, a study by Sweeney showed that Google searches for “black-sounding” names were 25% more likely to show an ad suggesting the person had an arrest record, even when the advertiser had nobody with that name in its database of arrest records.

“If you make a technology that can classify people by ethnicity, someone will use it to repress that ethnicity,” according to Clare Garvie, an associate at the Center on Privacy and Technology at Georgetown Law.

And this brings us to China, where the government applies facial recognition to identify and track members of the Uighurs, an ethnic group systematically oppressed by the government. This is the first known case of a government using machine learning to profile by ethnicity.

This flagging of individuals by ethnic group is designed specifically to be used as a factor in discriminatory decisions — that is, decisions based at least in part on a protected class. In this case, members of this group, once identified, will be treated or considered differently on that basis.

One Chinese start-up valued at more than $1 billion said its software could recognize “sensitive groups of people.” Its website said, “If originally one Uighur lives in a neighborhood, and within 20 days six Uighurs appear, it immediately sends alarms” to law enforcement.

By implementing the differential treatment of an ethnic group at full scale, data-driven technology takes this to a new level. Jonathan Frankle, a deep learning researcher at MIT, warns that this potential extends beyond China. “I don’t think it’s overblown to treat this as an existential threat to democracy. Once a country adopts a model in this heavy authoritarian mode, it’s using data to enforce thought and rules in a much more deep-seated fashion… To that extent, this is an urgent crisis we are slowly sleepwalking our way into.”

It’s a real challenge to draw the line as to which predictive objectives pursued with machine learning are unethical, let alone which should be legislated against, if any. But, at the very least, it’s important to stay vigilant for when machine learning serves to empower a preexisting unethical practice and also for when it generates data that must be handled with care.
