Social media posts often contain abusive content that is overlooked by machine learning algorithms.
No technological innovation since the telephone has influenced human communication like social media. Never before has it been easier to keep in touch with old friends from high school, remember your great uncle’s birthday or waste study time watching short videos.
However, despite their intention to bring together members of distant communities, it is no secret that applications such as Twitter and Facebook have also opened deep chasms in American society in recent years.
At the bottom of these chasms lie thousands of hateful and abusive social media posts that perpetrators often defend by labeling their words as “free speech.”
Lately, businesses have been on a crusade to root out these sources of hate using algorithms designed to detect toxic rhetoric, with mixed results.
According to a collection of research reports on the subject, it appears that current machine learning algorithms are easily tricked into putting the stamp of approval on otherwise intolerable posts.
Researchers from the University of Washington in 2017 found that simple typos could severely compromise artificial intelligence (AI) as powerful as Google’s Perspective, an application programming interface (API) that determines the level of toxicity in a public post.
In this case, although a simple statement such as “I hate you” was labeled toxic, muddier renditions such as “Ihate you” and “I hateyou” were considered acceptable.
Even more confusing to these algorithms is the inclusion of benign words such as “love.” Researchers from the Aalto University Secure Systems Research Group noted that Perspective was easily deceived by the statements “I love to hate you” and “I love hateyou.”
Although the test inputs these scholars used are uncommon and unlikely to appear in real-world social media posts, they are still fairly indicative of a greater failure on the part of software engineers to design programs that can effectively vet hateful speech.
These mistakes may be the result of structural deficiencies in the way mainstream algorithms are designed. According to a report by EurekAlert!, most widely used speech detection algorithms interpret a string of text by identifying key words and phrases.
Algorithms are trained using lists or dictionaries of these commonly used hate terms. Once confronted with unfamiliar input, the program does not recognize the words hidden beneath the typos.
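A minimal sketch illustrates why this design is so brittle. The word list and tokenization below are hypothetical simplifications, not any vendor's actual implementation, but they reproduce the failure mode the Washington researchers observed: a single deleted space hides the flagged word from the filter.

```python
# A toy dictionary-based filter. FLAGGED_WORDS stands in for the much
# larger hate-term lists real systems are trained on.
FLAGGED_WORDS = {"hate", "despise"}

def is_toxic(text: str) -> bool:
    # Split on whitespace, strip trailing punctuation, and check each
    # lowercased token against the list.
    return any(token.strip(".,!?").lower() in FLAGGED_WORDS
               for token in text.split())

print(is_toxic("I hate you"))   # True: the token "hate" matches the list
print(is_toxic("Ihate you"))    # False: the typo fuses two words into an
                                # unfamiliar token the filter cannot match
```

Because the filter only ever sees whole whitespace-delimited tokens, "Ihate" is simply an unknown word, and the message passes.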
To combat these vulnerabilities, programmers from the University of Copenhagen have been testing a unique algorithm that evaluates input based on characters.
These developers identified offensive tweets as those that include racist or sexist slurs, blatantly misrepresent the truth, negatively stereotype minorities, support xenophobia, promote violent crimes or language, or use problematic hashtags such as “#BanIslam.”
Following these criteria, they surveyed thousands of existing tweets to uncover the three- to four-character sequences, or "n-grams," most commonly present in hate messages, dividing them into two categories based on whether they are associated with sexist or with racist comments.
The contents of these n-grams are often obvious references to anti-feminist or Islamophobic speech, including common strings such as “wom,” “mus” and “irl.”
This method vastly outperformed list-based word filters largely because it is not confined to a fixed vocabulary: it disregards whitespace and adapts well to sentences outside its immediate training data.
Still, it may be impossible for an algorithm of any caliber to match a human's ability to evaluate these discriminatory messages. Hate can take many forms, and although it may be simple to detect posts that share a common set of words or characters, more elusive comments will likely continue slipping under the radar.
Still, as the clamor on Twitter and Facebook rages on, innovation awaits, and the first step may be updating text filters from word-based to character-based systems.