Secure Systems: How to evade hate speech filters with "love"

It has often been suggested that text classification and machine learning techniques can be used to detect hate speech. Such tools could then be used for automatic content filtering, and perhaps even law enforcement. However, we show that existing datasets are too domain-specific, and the resulting models are easy to circumvent by applying simple automatic changes to the text.

The problem of hate speech and its detection

Hate speech is a major problem online, and a number of machine learning solutions have been suggested to combat it.

The vast majority of online material consists of natural language text. This brings undeniable benefits, but also negative side-effects. Hate speech is rampant in discussion forums and blogs, and can have real-world consequences. While individual websites can have specific filters to suit their own needs, the general task of hate speech detection involves many unanswered questions. For example, it is unclear where we should draw the line between hateful and merely offensive material.

Despite this, multiple studies have claimed success at detecting hate speech using state-of-the-art natural language processing (NLP) and machine learning (ML) methods. All these studies involve supervised learning, where an ML model is first trained with data labeled by humans, and then tested on new data that it has not seen during training. The more accurately the trained model classifies unseen data, the better it is considered.

While a number of labeled hate speech datasets exist, most of them are relatively small, containing only a few thousand hate speech examples. So far, the largest dataset has been drawn from Wikipedia edit comments, and contains around 13 000 hateful sentences. In the world of ML, these are still small numbers. Other datasets are typically taken from Twitter. Datasets also differ in the type of hate speech they focus on, such as racism, sexism, or personal attacks.

Google has also developed its own “toxic speech” detection system, called Perspective. While the training data and model architecture are unavailable to the public, Google provides black-box access via an online UI.

We wanted to compare existing solutions with one another, and test how resistant they are against possible attacks. Both issues are crucial for determining how well suggested models could actually fare in the real world.

Applying proposed classifiers to multiple datasets

Datasets are too small and specific for models to scale beyond their own training domain.

Most prior studies have not compared their approach with alternatives. In contrast, we wanted to test how all state-of-the-art models perform on all state-of-the-art datasets. We gathered five datasets and five model architectures, seven combinations of which had been presented in recent academic research.

The model architectures differed in input features and the ML-algorithm. They either looked at characters and character sequences, or alternatively at entire words and/or word sequences. Some models used simple probabilistic ML algorithms (such as logistic regression or a multilayered perceptron network), while others used state-of-the-art deep neural networks (DNNs). More details can be found in our paper.
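To make the distinction between character-level and word-level input features concrete, here is a minimal sketch of n-gram extraction. The function names and the choice of n = 2 are illustrative assumptions, not the feature pipeline of any model discussed here.

```python
def char_ngrams(text, n):
    """Return every overlapping character n-gram in the text."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def word_ngrams(text, n):
    """Return every overlapping word n-gram, splitting on whitespace."""
    tokens = text.lower().split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


print(char_ngrams("hate", 2))        # ['ha', 'at', 'te']
print(word_ngrams("I hate you", 2))  # ['i hate', 'hate you']
```

A character-based model sees the first kind of unit, a word-based model the second. Everything a word-based model computes depends on the whitespace split in word_ngrams.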

To begin with, we were interested in understanding how ML-algorithms differ in performance when trained and tested on the same type of data. This would give us some clue as to what kinds of properties might be most relevant for classifying something as hate speech. For example, if character frequencies were sufficient, simple character-based models would suffice. In contrast, if complex word-relations are needed, deep neural network models (such as LSTMs or CNNs) would fare better.

We took all four two-class model architectures, and trained them on all four two-class datasets, yielding eight models in total. Next, we took the test sets from each dataset, and applied the models to those. The test sets were always distinct from the training set, but derived from the same dataset. We show the results in the two figures below (datasets used in the original studies are written in bold.)

Performance of ML-algorithms on different datasets (F1-score)

Performance of models with different test sets (F1-score)

Our results were surprising on two fronts. First, all models trained on the same dataset performed similarly. In particular, there was no major difference between using a simple probabilistic classifier (logistic regression: LR) that looked at character sequences, or using a complex deep neural network (DNN) that looked at long word sequences.

Second, no model performed well outside of its training domain: models performed well only on the test set that was taken from the same dataset type that they were trained on.
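For readers unfamiliar with the F1-score used in these comparisons, the following sketch computes it from scratch for a two-class problem. Treating label 1 as the hateful class and scoring only that class are assumptions made for illustration; the labels and predictions are made up.

```python
def f1_score(y_true, y_pred, positive=1):
    """Harmonic mean of precision and recall for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        return 0.0  # no correct positive predictions at all
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Made-up labels: 2 true positives, 1 false positive, 1 false negative.
y_true = [1, 1, 1, 0, 0]
y_pred = [1, 1, 0, 1, 0]
print(round(f1_score(y_true, y_pred), 3))  # 0.667
```

A cross-domain test of the kind described above amounts to computing this score once for every pair of training dataset and test dataset.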

Our results indicate that training data was more important in determining performance than model architecture. The main implication is that effort should go into collecting and labeling better datasets, not only into refining the details of the learning algorithms. Without proper training data, even the most sophisticated algorithms can do very little.

Attacking the classifiers

Hate speech classifiers can be fooled by simple automatic text transformation methods.

Hate speech can be considered an adversarial setting, with the speakers attacking the people their text targets. However, if automatic classifiers are used for filtering, the classifiers themselves might also become attack targets. In particular, they could be circumvented by text transformation. We wanted to find out whether this is feasible in practice.

Text transformation involves changing words and/or characters in the text with the intent of altering some property while retaining everything else. For evading hate speech detection, this property is the classification the detector makes.
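Character- and word-level changes of this kind can be sketched as plain string functions. The function names, the fixed swap and split positions, and the three-letter leetspeak table are illustrative assumptions; an automated attack would choose positions and words programmatically.

```python
# Hypothetical leetspeak table covering only I, A and O (both cases).
LEET = str.maketrans("IiAaOo", "114400")


def swap_adjacent(word, i):
    """Typo: swap the characters at positions i and i + 1."""
    return word[:i] + word[i + 1] + word[i] + word[i + 2:]


def leetspeak(text):
    """Replace selected letters with look-alike digits."""
    return text.translate(LEET)


def add_whitespace(text, index):
    """Insert a single space at the given character index."""
    return text[:index] + " " + text[index:]


def remove_whitespace(text):
    """Delete all whitespace, fusing the text into one token."""
    return "".join(text.split())


def append_words(text, words):
    """Append extra words to the end of the text."""
    return text + " " + " ".join(words)


print(swap_adjacent("hate", 1))                    # htae
print(leetspeak("I hate you"))                     # 1 h4te y0u
print(add_whitespace("I hate you", 4))             # I ha te you
print(remove_whitespace("I hate you"))             # Ihateyou
print(append_words("I hate you", ["good", "nice"]))  # I hate you good nice
```

Each function changes the surface form while a human reader can still recover the original message, which is the property an evasion attack needs.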

We experimented with three transformation types, and two variants of each. The types were:

word-internal changes:

typos (I htae you)

leetspeak (1 h4te y0u)

word boundary changes:

adding whitespace (I ha te you)

deleting whitespace (Ihateyou)

word appending:

random words (I hate you dog cat...)

"non-hateful" words (I hate you good nice...)

Details of all our experiments are in our paper. Here, we summarize three main results.

Character-based models were significantly more robust against word-internal and word boundary transformations.
Deleting whitespace completely broke all word-based models, regardless of their complexity.
Word appending systematically hindered the performance of all models. Further, while random words did not fare as well as “non-hateful” words from the training set, the difference was relatively minor. This indicates that the word appending attack is feasible even in a black-box setting.

The first two results are caused by a fundamental problem with all word-based NLP methods: the model must first recognize the words on which all further computation is based. If word recognition fails, so does everything else. Deleting all whitespace makes the entire sentence look like a single unknown word, and the model can do nothing useful with that. Character-models, in contrast, retain the majority of original features, and thus take less damage.
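The difference can be seen on a single sentence by counting how many of the original features survive whitespace deletion. This is a toy sketch: the choice of character bigrams and the helper names are assumptions, and real models use richer feature sets.

```python
def word_features(text):
    """Set of lowercase whitespace-separated tokens."""
    return set(text.lower().split())


def char_bigrams(text):
    """Set of overlapping lowercase character bigrams."""
    text = text.lower()
    return {text[i:i + 2] for i in range(len(text) - 1)}


def retained(original, attacked):
    """Fraction of the original features still present after the attack."""
    return len(original & attacked) / len(original)


original = "I hate you"
attacked = "".join(original.split())  # "Ihateyou"

print(retained(word_features(original), word_features(attacked)))  # 0.0
print(retained(char_bigrams(original), char_bigrams(attacked)))    # 0.5555555555555556
```

None of the three original word tokens survives, while five of the nine character bigrams (ha, at, te, yo, ou) do.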

Word appending attacks take advantage of the fact that all suggested approaches treat the detection task as classification. The classifier makes a probabilistic decision of whether the text is more dominantly hateful or non-hateful, and simply including irrelevant non-hateful material will bias this decision to the latter side. This is obviously not what we want from a hate speech classifier, as the status of hate speech should not be affected by such additions. Crucially, this problem will not go away simply by using more complex model architectures; it is built into the very nature of probabilistic classification. Avoiding it requires re-thinking the problem from a novel perspective, perhaps by re-conceptualizing hate speech detection as anomaly detection rather than classification.
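A toy linear bag-of-words classifier shows why appended words shift the decision. The weights below are invented for illustration and do not come from any trained model; the point is only that every added token with a negative weight pulls the probability down.

```python
import math

# Hypothetical per-word weights: positive pushes towards "hateful".
WEIGHTS = {"hate": 2.0, "good": -1.0, "nice": -1.0, "love": -1.5}


def hate_probability(text):
    """Logistic score over the summed weights of all tokens."""
    total = sum(WEIGHTS.get(token, 0.0) for token in text.lower().split())
    return 1.0 / (1.0 + math.exp(-total))


plain = hate_probability("I hate you")
padded = hate_probability("I hate you good nice love")

print(round(plain, 2))   # 0.88
print(round(padded, 2))  # 0.18
```

The hateful content is identical in both inputs, yet the second falls below a 0.5 decision threshold because the classifier only weighs the evidence for one class against the other.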

The "love" attack

Deleting whitespace and adding "love" broke all word-based classifiers.

Based on our experimental results, we devised an attack that uses two of our most effective transformation techniques: whitespace deletion and word appending. Here, instead of adding multiple innocuous words, we only add one: “love”. This attack completely broke all word-based models we tested, and severely hindered character-based models. However, character-models were much more resistant against it, for reasons we discussed above. (Take a look at our paper for more details on models and datasets.)

The "love" attack applied to seven hate speech classifiers
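As a string operation, the attack is very small. The sketch below assumes the appended word is separated from the fused text by a single space; the function name and the default argument are illustrative.

```python
def love_attack(text, word="love"):
    """Fuse the text into one token, then append one innocuous word."""
    fused = "".join(text.split())
    return fused + " " + word


print(love_attack("I hate you"))  # Ihateyou love
```

A word-based model now sees one unknown token followed by one strongly non-hateful token.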

We also applied the “love” attack to Google Perspective, with comparable results on all example sentences. This suggests that the model is word-based. To improve readability, the attacker can use alternative methods of indicating word boundaries, such as CamelCase.

Example of the "love" attack applied to Google Perspective
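Capital letters can stand in for the deleted spaces, so that a human reader can still find the word boundaries. A minimal sketch, with hypothetical helper names; the splitting pattern assumes plain ASCII letters.

```python
import re


def camel_join(text):
    """Remove whitespace, marking each word start with a capital letter."""
    return "".join(w[:1].upper() + w[1:] for w in text.split())


def camel_split(fused):
    """Recover the words by splitting before each capital letter."""
    return re.findall(r"[A-Z][a-z]*", fused)


fused = camel_join("I hate you")
print(fused)               # IHateYou
print(camel_split(fused))  # ['I', 'Hate', 'You']
```

The fused string is still a single token to a whitespace tokenizer, so the attack keeps its effect on word-based models while staying legible.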

Conclusions: can the deficiencies be remedied?

Our study demonstrates that proposed solutions for hate speech detection are unsatisfactory in two ways. Here, we consider some possibilities for alleviating the situation with respect to both.

First, classifiers are too specific for particular text domains and hence do not scale across different datasets. The main reason behind this problem is the lack of sufficient training data. In contrast, model architecture had little to no effect on classification success. Alleviating this problem requires focusing further resources on the hard work of manually collecting and labeling more data.

Second, existing classifiers are vulnerable to text transformation attacks that can easily be applied automatically in a black-box setting. Some of these attacks can be partially mitigated by pre-processing measures, such as automatic spell-checking. Others, however, are more difficult to detect. This is particularly true of word appending, as it is very hard to evaluate whether some word is "relevant" or simply added to the text for deceptive purposes. As we mentioned above, avoiding the word appending attack ultimately requires re-thinking the detection process as something other than probabilistic classification.
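As one example of what a pre-processing measure might look like, the sketch below maps common leetspeak digits back to letters before the text reaches a classifier. This is an illustrative assumption, not a defense evaluated in our experiments, and the four-entry table is hypothetical. It would also corrupt genuine numbers, which is part of why such measures only mitigate the problem partially.

```python
# Hypothetical digit-to-letter table for undoing simple leetspeak.
UNLEET = str.maketrans("1430", "iaeo")


def normalize_leetspeak(text):
    """Lowercase the text and map look-alike digits back to letters."""
    return text.lower().translate(UNLEET)


print(normalize_leetspeak("1 h4te y0u"))  # i hate you
```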

Character-models are far more resistant to word boundary changes than word-models. This is as expected, since most character sequences are still retained even if word identities are destroyed. As character-models performed as well as word-models in our comparative tests, we recommend using them to make hate speech detection more resistant to attacks.

Secure Systems

Friday, 4 January 2019
