Alphabet’s Hate-Fighting AI Doesn’t Understand Hate Yet
Hate is tricky to teach to software.
Yesterday, Google and its sister Alphabet company Jigsaw announced Perspective, a tool that uses machine learning to police the internet against hate speech. The company heralded the tech as a nascent but powerful weapon in combating online vitriol, and opened the software to developers so websites can use it in their own commenting systems.
However, computer scientists and others on the internet have found the system unable to identify a wide swath of hateful comments, while categorizing innocuous word combinations like “hate is bad” and “garbage truck” as overwhelmingly toxic. The Jigsaw team acknowledges the problem, but stresses that the software is still in an “alpha stage,” meaning experimental software that isn’t yet ready for mass deployment.
In tandem with the announcement that its project would be open to developers through an application programming interface (API), Jigsaw posted a simple text box that calls the API and returns what the system thinks of words and phrases. Sentences and phrases are given a toxicity score based on whether respondents to Survata surveys rated similar examples as “a rude, disrespectful, or unreasonable comment that is likely to make you leave a discussion.”
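For developers, a call to the alpha API looks roughly like the sketch below. It’s a minimal sketch, assuming the v1alpha1 Comment Analyzer endpoint and field names Jigsaw documented at launch; an alpha API’s details can change, and the key is a placeholder.

```python
# A minimal sketch of scoring a comment with Perspective's alpha API,
# using Python's `requests` library. Endpoint and field names follow the
# v1alpha1 Comment Analyzer docs at launch; treat them as illustrative.
import requests

API_KEY = "YOUR_API_KEY"  # placeholder; issued through the Google API console
URL = ("https://commentanalyzer.googleapis.com/v1alpha1/"
       "comments:analyze?key=" + API_KEY)

def toxicity(text):
    """Ask Perspective to score a comment; returns a value from 0 to 1."""
    payload = {
        "comment": {"text": text},
        "requestedAttributes": {"TOXICITY": {}},
    }
    response = requests.post(URL, json=payload)
    response.raise_for_status()
    scores = response.json()["attributeScores"]
    return scores["TOXICITY"]["summaryScore"]["value"]

print(toxicity("garbage truck"))  # the public demo scored this around 0.78
```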
David Auerbach, a writer for MIT Technology Review and a former Google engineer, ran a list of hateful and non-hateful phrases through the system:
“I fucking love you man. Happy birthday.” = 93% toxic
“Donald Trump is a meretricious buffoon.” = 85% toxic
“few muslims are a terrorist threat” = 79% toxic
“garbage truck” = 78% toxic
“You’re no racist” = 77% toxic
“whites and blacks are not inferior to one another” = 73% toxic
“I’d hate to be black in Donald Trump’s America.” = 73% toxic
“Jews are human” = 72% toxic
“I think you’re being racist” = 70% toxic
“Hitler was an anti-semite” = 70% toxic
“this comment is highly toxic” = 68% toxic
“You are not being racist” = 65% toxic
“Jews are not human” = 61% toxic
“Hitler was not an anti-semite” = 53% toxic
“drop dead” = 40% toxic
“gas the joos race war now” = 40% toxic
“genderqueer” = 34% toxic
“race war now” = 24% toxic
“some races are inferior to others” = 18% toxic
“You are part of the problem” = 16% toxic
Like all machine-learning systems, the Perspective API works better the more data it has. The Alphabet subsidiary worked with partners like Wikipedia and The New York Times to gather hundreds of thousands of comments, then crowdsourced 10 judgments for each comment on whether or not it was toxic. The effort was intended to kickstart the deep neural network that forms the backbone of the Perspective API.
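Jigsaw hasn’t said exactly how those 10 judgments are combined into a training target, but a hypothetical sketch is simply to average the votes into a fraction between 0 and 1:

```python
# A hypothetical sketch of collapsing 10 crowd votes per comment into a
# single training target. The averaging rule here is an assumption, not
# Jigsaw's published method.
def aggregate_votes(votes):
    """votes: list of booleans, True meaning a rater judged the comment toxic."""
    return sum(votes) / len(votes)

# Each comment pairs its text with its ten crowdsourced judgments.
labeled = [
    ("thanks for fixing the citation", [False] * 10),
    ("you are an idiot", [True] * 9 + [False]),
]
training_data = [(text, aggregate_votes(votes)) for text, votes in labeled]
print(training_data)
# [('thanks for fixing the citation', 0.0), ('you are an idiot', 0.9)]
```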
“It’s very limited to the types of abuse and toxicity in that initial training data set. But that’s just the beginning,” CJ Adams, Jigsaw product manager, told Quartz. “The hope is over time, as this is used, we’ll continue to see more and more examples of abuse, and those will be voted on by different people and improve its ability to detect more types of abuse.”
Previous research published by Jigsaw and Wikimedia details an earlier attempt at finding toxicity in comments. Jigsaw crowdsourced ratings of Wikipedia comments, asking Crowdflower users to gauge whether a comment was an attack on or harassment of its recipient or a third party, or whether the commenter was quoting someone else’s attack. The researchers then broke the attacking comments into snippets of one to five characters, called character-level n-grams, and trained a machine-learning algorithm to correlate those n-grams with toxic activity.
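As a rough sketch of what that featurization looks like (the function name is illustrative, not from the paper):

```python
# Break a comment into character-level n-grams of length 1 through 5,
# the feature scheme the Jigsaw/Wikimedia research describes.
def char_ngrams(text, n_min=1, n_max=5):
    """Return every character n-gram of length n_min..n_max."""
    text = text.lower()
    return [text[i:i + n]
            for n in range(n_min, n_max + 1)
            for i in range(len(text) - n + 1)]

print(char_ngrams("drop dead")[:8])
# ['d', 'r', 'o', 'p', ' ', 'd', 'e', 'a']
```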
Yoav Goldberg, a senior lecturer at Bar Ilan University and a former post-doc research scientist at Google who was not associated with the research, says the previous system lacked the ability to represent subtle differences in the text.
“This is enough to capture information about single words, while allowing also to capture word variations, typos, inflections and so on,” Goldberg told Quartz. “This is essentially finding ‘good words’ and ‘bad words,’ but it is clear that it cannot deal with any nuanced (or even just compositional) word usage.”
For example, “racism is bad” triggers the old system into giving an overwhelmingly negative score because the words “racism” and “bad” are seen as negative, Goldberg says.
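Goldberg’s point is easy to reproduce with a toy model: if each word carries a fixed weight and a comment’s score is just their sum, then “racism is bad” inherits the toxicity of “racism” and “bad” no matter what the sentence actually says. The weights below are invented for the illustration:

```python
# A toy "good words / bad words" scorer illustrating the failure mode
# Goldberg describes. Weights are made up; only the additive scoring,
# which ignores how the words compose, is the point.
WEIGHTS = {"racism": 0.5, "bad": 0.3, "good": -0.2}

def bag_of_words_score(text):
    """Sum fixed per-word weights; word order and negation are invisible."""
    return sum(WEIGHTS.get(word, 0.0) for word in text.lower().split())

print(bag_of_words_score("racism is bad"))   # 0.8 -- flagged as highly toxic
print(bag_of_words_score("racism is good"))  # 0.3 -- scored milder, wrongly
```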
The Perspective API is not necessarily a huge improvement on previous efforts quite yet, and is a step back in some ways. In a version demonstrated to Wired’s Andy Greenberg in September 2016, the phrase “You’re such a bitch” rated as 96 percent toxic. In the new system’s public API, it’s 97 percent. Good!
But take Greenberg’s example of a more colloquial (yet still aggravatingly misogynistic) phrase, “What’s up bitches? :)” The old system ranked it at 39 percent toxicity, while the new public version released yesterday ranks the phrase as 95 percent toxic.
Lucas Dixon, chief research scientist at Jigsaw, says there are two reasons for this. First, the system shown to Greenberg was a research model specifically trained to detect personal attacks, meaning it would be much more sensitive to words like “you” or “you’re.” Second, and potentially more importantly, that system was using the n-gram technique detailed above.
“Character-level models are much better able to understand misspellings and different fragments of words, but overall it’s going to do much worse,” Dixon told Quartz.
That’s because, while that technique can be efficiently pointed at a very specific problem, like figuring out that smiley faces correlate with someone being nice, the deep neural network being trained through the API now has a much greater capacity to understand the nuances of the entire language.
Jigsaw’s “Writing Experiment” makes it easy to see that certain words are now correlated with negative comments while others are not. The single word “suck” scores 93 percent toxicity. On its own, “suck” doesn’t necessarily mean anything negative, but the system still associates it with every negative comment it’s seen containing the word. “Nothing sucks” has a toxicity of 94 percent. So does “dave sucks.”
Other examples on the internet:
“Hate is stupid” garners 97 percent toxicity.
“Black Trans Woman Eats Can of Pears, Really Enjoys It” scores 61 percent toxicity.
And “racism is bad” scores as 60 percent toxic, while “racism is good” only gets 35 percent.
To fix this, Jigsaw’s API has two features: one that rates a comment’s toxicity, and another that lets developers report that a rating was wrong. With that feedback, over time, the Perspective API should be able to recognize more and more forms of hate speech and toxicity.
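The feedback half works through a companion method. A sketch, assuming the alpha API’s comments:suggestscore endpoint, whose fields mirror the scoring request (and which, like everything else in the alpha, may change):

```python
# A sketch of reporting a corrected toxicity score back to Perspective.
# Assumes the v1alpha1 `comments:suggestscore` method; field names are
# taken from the alpha docs and may shift as the API matures.
import requests

API_KEY = "YOUR_API_KEY"  # placeholder
URL = ("https://commentanalyzer.googleapis.com/v1alpha1/"
       "comments:suggestscore?key=" + API_KEY)

def suggest_score(text, corrected_value):
    """Tell Perspective what a comment's toxicity score should have been."""
    payload = {
        "comment": {"text": text},
        "attributeScores": {
            "TOXICITY": {"summaryScore": {"value": corrected_value}},
        },
    }
    requests.post(URL, json=payload).raise_for_status()

# "Hate is stupid" is clumsy, not hateful -- nudge its rating downward.
suggest_score("Hate is stupid", 0.1)
```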