It has often been suggested that text classification and machine learning techniques can be used to detect hate speech. Such tools could then be used for automatic content filtering, and perhaps even law enforcement. However, we show that existing datasets are too domain-specific, and the resulting models are easy to circumvent by applying simple automatic changes to the text.
The problem of hate speech and its detection
Hate speech is a major problem online, and a number of machine learning solutions have been suggested to combat it.
The vast majority of online material consists of natural language text. Along with its undeniable benefits, this has negative side-effects. Hate speech is rampant in discussion forums and blogs, and can have real-world consequences. While individual websites can have specific filters to suit their own needs, the general task of hate speech detection involves many unanswered questions. For example, it is unclear where the line should be drawn between hateful and merely offensive material.
Despite this, multiple studies have claimed success at detecting hate speech using state-of-the-art natural language processing (NLP) and machine learning (ML) methods. All these studies involve supervised learning, where an ML model is first trained on data labeled by humans, and then tested on new data that it has not seen during training. The more accurately the trained model classifies unseen data, the better it is considered.
While a number of labeled hate speech datasets exist, most of them are relatively small, containing only a few thousand hate speech examples. So far, the largest dataset has been drawn from Wikipedia edit comments, and contains around 13 000 hateful sentences. In the world of ML, these are still small numbers. Other datasets are typically taken from Twitter. Datasets also differ in the type of hate speech they focus on, such as racism, sexism, or personal attacks.
Google has also developed its own “toxic speech” detection system, called Perspective. While the training data and model architecture are unavailable to the public, Google provides black-box access via an online UI.
We wanted to compare existing solutions with one another, and test how resistant they are to possible attacks. Both issues are crucial for determining how well the proposed models could actually fare in the real world.
Applying proposed classifiers to multiple datasets
Datasets are too small and specific for models to scale beyond their own training domain.
Most prior studies have not compared their approach with alternatives. In contrast, we wanted to test how all state-of-the-art models perform on all state-of-the-art datasets. We gathered five datasets and five model architectures, seven combinations of which had been presented in recent academic research.
The model architectures differed in their input features and in the ML algorithm. They looked either at characters and character sequences, or at entire words and/or word sequences. Some models used simple probabilistic ML algorithms (such as logistic regression or a multilayer perceptron network), while others used state-of-the-art deep neural networks (DNNs). More details can be found in our paper.
To begin with, we were interested in understanding how ML algorithms differ in performance when trained and tested on the same type of data. This would give us some clue as to what kinds of properties might be most relevant for classifying something as hate speech. For example, if character frequencies were sufficient, simple character-based models would suffice. In contrast, if complex word relations were needed, deep neural network models (like LSTMs or CNNs) would fare better.
We took all four two-class model architectures, and trained them on all four two-class datasets, yielding eight models in total. Next, we took the test sets from each dataset, and applied the models to those. The test sets were always distinct from the training set, but derived from the same dataset. We show the results in the two figures below (datasets used in the original studies are written in bold).
[Figure: Performance of ML-algorithms on different datasets (F1-score)]
[Figure: Performance of models with different test sets (F1-score)]
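The figures report F1-scores. As a reminder of what this metric measures, here is a minimal sketch, assuming binary labels where 1 marks the hateful class. The function name `f1_score` and the toy labels are illustrative only and are not taken from our experiments; the sketch also does not say anything about how scores were averaged across classes in the figures.

```python
def f1_score(y_true, y_pred, positive=1):
    """F1 is the harmonic mean of precision and recall for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        return 0.0  # no correct positive predictions: define F1 as zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy example (made-up labels): 2 true positives, 1 false positive, 1 false negative.
y_true = [1, 1, 1, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0]
print(round(f1_score(y_true, y_pred), 3))  # 0.667
```

A model that labels everything as non-hateful gets a high accuracy on an imbalanced dataset but an F1 of zero for the hateful class, which is why F1 is a common choice for this kind of task.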
Our results were surprising on two fronts. First, all models trained on the same dataset performed similarly. In particular, there was no major difference between using a simple probabilistic classifier (logistic regression: LR) that looked at character sequences, and using a complex deep neural network (DNN) that looked at long word sequences.
Second, no model performed well outside of its training domain: models performed well only on the test set that was taken from the same dataset type that they were trained on.
Our results indicate that training data was more important than model architecture in determining performance. The main lesson we draw from this finding is that the focus should be on collecting and labeling better datasets, not only on refining the details of the learning algorithms. Without proper training data, even the most sophisticated algorithms can do very little.
Attacking the classifiers
Hate speech classifiers can be fooled by simple automatic text transformation methods.
Hate speech can be considered an adversarial setting, with the speakers attacking the people their text targets. However, if automatic classifiers are used for filtering, the classifiers themselves might also become attack targets. In particular, they could be circumvented by text transformation. We wanted to find out whether this is feasible in practice.
Text transformation involves changing words and/or characters in the text with the intent of altering some property while retaining everything else. For evading hate speech detection, this property is the classification the detector makes.
We experimented with three transformation types, and two variants of each. The types were:
- word-internal changes: typos (I htae you) and leetspeak (1 h4te y0u)
- word boundary changes: adding whitespace (I ha te you) and deleting whitespace (Ihateyou)
- word appending: random words (I hate you dog cat...) and "non-hateful" words (I hate you good nice...)
Details of all our experiments are in our paper.
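To make the six transformation variants concrete, here is a minimal sketch of each. It is an illustration under simplifying assumptions, not the code used in our experiments: the function names, the three-letter `LEET` table, and the rule of splitting words in the middle are all hypothetical choices made so that the output reproduces the examples in the list above.

```python
import random

# Assumed minimal substitution table, chosen only to reproduce the example above.
LEET = {"a": "4", "i": "1", "o": "0"}


def typos(text, rng):
    """Swap two adjacent inner characters in every word of four or more letters."""
    out = []
    for word in text.split():
        if len(word) >= 4:
            i = rng.randrange(1, len(word) - 2)  # first and last letter stay in place
            word = word[:i] + word[i + 1] + word[i] + word[i + 2:]
        out.append(word)
    return " ".join(out)


def leetspeak(text):
    """Replace letters with look-alike digits (case-insensitive lookup)."""
    return "".join(LEET.get(ch.lower(), ch) for ch in text)


def add_whitespace(text):
    """Split every word of four or more letters in the middle."""
    out = []
    for word in text.split():
        if len(word) >= 4:
            mid = len(word) // 2
            word = word[:mid] + " " + word[mid:]
        out.append(word)
    return " ".join(out)


def remove_whitespace(text):
    """Delete all whitespace, fusing the sentence into one token."""
    return "".join(text.split())


def append_words(text, words):
    """Append extra words (random or 'non-hateful') to the end of the text."""
    return text + " " + " ".join(words)


text = "I hate you"
rng = random.Random(0)
print(typos(text, rng))                       # I htae you
print(leetspeak(text))                        # 1 h4te y0u
print(add_whitespace(text))                   # I ha te you
print(remove_whitespace(text))                # Ihateyou
print(append_words(text, ["dog", "cat"]))     # I hate you dog cat
print(append_words(text, ["good", "nice"]))   # I hate you good nice
```

None of these functions needs access to the classifier, which is what makes the transformations easy to automate.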
Here, we summarize three main results.
- Character-based models were significantly more robust against word-internal and word boundary transformations.
- Deleting whitespace completely broke all word-based models, regardless of their complexity.
- Word appending systematically hindered the performance of all models. Further, while random words did not fare as well as “non-hateful” words from the training set, the difference was relatively minor. This indicates that the word appending attack is feasible even in a black-box setting.
The first two results are caused by a fundamental problem with all word-based NLP methods: the model must first recognize the words, since all further computation is based on them. If word recognition fails, so does everything else. Deleting all whitespace makes the entire sentence look like a single unknown word, and the model can do nothing useful with that. Character-models, in contrast, retain the majority of the original features, and thus take less damage.
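The difference can be seen in a small sketch that counts how many features of the original sentence survive whitespace deletion. It assumes a word model that splits on whitespace and a character model that uses character bigrams; both feature extractors and the `retained` helper are simplified stand-ins, not the feature pipelines of the models we tested.

```python
def word_features(text):
    """Word-level view: the set of whitespace-separated tokens."""
    return set(text.lower().split())


def char_ngrams(text, n=2):
    """Character-level view: the set of overlapping character n-grams."""
    s = text.lower()
    return {s[i:i + n] for i in range(len(s) - n + 1)}


def retained(original, transformed):
    """Fraction of the original features still present after the transformation."""
    return len(original & transformed) / len(original)


text = "I hate you"
attacked = "".join(text.split())  # "Ihateyou"

word_kept = retained(word_features(text), word_features(attacked))
char_kept = retained(char_ngrams(text), char_ngrams(attacked))
print(f"words kept: {word_kept:.2f}")    # 0.00
print(f"bigrams kept: {char_kept:.2f}")  # 0.56
```

In this toy case the word-level view loses every feature, because the only remaining token is the unseen "ihateyou", while five of the nine original character bigrams are still present.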
Word appending attacks take advantage of the fact that all suggested approaches treat the detection task as classification. The classifier makes a probabilistic decision on whether the text is predominantly hateful or non-hateful, and simply including irrelevant non-hateful material biases this decision toward the latter. This is obviously not what we want from a hate speech classifier, as the status of hate speech should not be affected by such additions. Crucially, this problem will not go away simply by using more complex model architectures; it is built into the very nature of probabilistic classification. Avoiding it requires re-thinking the problem from a novel perspective, perhaps by re-conceptualizing hate speech detection as anomaly detection rather than classification.
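A toy example shows why appended words shift the decision. The sketch below assumes a logistic bag-of-words scorer with hand-picked weights; the `WEIGHTS` table and the function `p_hateful` are entirely hypothetical and stand in for whatever a trained model has learned.

```python
import math

# Hypothetical per-word weights toward the "hateful" class (illustrative values only).
WEIGHTS = {"hate": 3.0, "good": -1.0, "nice": -1.0}


def p_hateful(text, bias=0.0):
    """Logistic bag-of-words score: sigmoid of the summed word weights."""
    score = bias + sum(WEIGHTS.get(w, 0.0) for w in text.lower().split())
    return 1.0 / (1.0 + math.exp(-score))


original = "I hate you"
attacked = original + " good nice good nice"

print(p_hateful(original) > 0.5)  # True: classified as hateful
print(p_hateful(attacked) > 0.5)  # False: the appended words outweigh "hate"
```

The hateful content is unchanged, yet the label flips because the score is a sum over everything in the input. In this purely additive toy, unknown words contribute nothing; a model that averages or normalizes over the input can also be diluted by neutral words.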
The "love" attack
Deleting whitespace and adding "love" broke all word-based classifiers.
Based on our experimental results, we devised an attack that combines two of our most effective transformation techniques: whitespace deletion and word appending. Here, instead of adding multiple innocuous words, we only add one: “love”. This attack completely broke all word-based models we tested, and also severely hindered character-based models. However, character-models were much more resistant to it, for the reasons we discussed above. (Take a look at our paper for more details on models and datasets.)
[Figure: The "love" attack applied to seven hate speech classifiers]
We also applied the “love” attack to Google Perspective, with comparable results on all example sentences. This indicates that the model is word-based. To improve readability, the attacker can use alternative methods of indicating word boundaries, such as CamelCase.
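The whole attack fits in a few lines. The following is a minimal sketch, assuming the appended word is separated from the fused sentence by a single space; the function name `love_attack` and its `camel_case` flag are hypothetical, with the flag capitalizing the first letter of each word so that a human reader can still see the word boundaries.

```python
def love_attack(text, camel_case=False, word="love"):
    """Delete all whitespace from the text, then append a single innocuous word."""
    words = text.split()
    if camel_case:
        # Mark word boundaries by capitalization instead of spaces.
        words = [w[:1].upper() + w[1:] for w in words]
    return "".join(words) + " " + word


print(love_attack("I hate you"))                   # Ihateyou love
print(love_attack("I hate you", camel_case=True))  # IHateYou love
```

Like the individual transformations, this needs no knowledge of the target model, so it can be applied to a black-box system.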
[Figure: Example of the "love" attack applied to Google Perspective]
Conclusions: can the deficiencies be remedied?
Our study demonstrates that proposed solutions for hate speech detection are unsatisfactory in two ways. Here, we consider some possibilities for alleviating the situation with respect to both.
First, classifiers are too specific to particular text domains and hence do not generalize across different datasets. The main reason behind this problem is the lack of sufficient training data. In contrast, model architecture had little to no effect on classification success. Alleviating this problem requires focusing further resources on the hard work of manually collecting and labeling more data.
Second, existing classifiers are vulnerable to text transformation attacks that can easily be applied automatically in a black-box setting. Some of these attacks can be partially mitigated by pre-processing measures, such as automatic spell-checking. Others, however, are more difficult to detect. This is particularly true of word appending, as it is very hard to evaluate whether some word is "relevant" or simply added to the text for deceptive purposes. As we mentioned above, avoiding the word appending attack ultimately requires re-thinking the detection process as something other than probabilistic classification.
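As an illustration of what such pre-processing could look like, here is a minimal sketch that undoes common leetspeak substitutions and then snaps each token to its closest entry in a vocabulary, using `difflib.get_close_matches` from the Python standard library. The `UNLEET` table, the four-word `VOCAB`, and the 0.7 cutoff are assumptions made for the example; this is not a defense we evaluated.

```python
import difflib

# Assumed reverse mapping for common leetspeak digits (not exhaustive).
# Note that this would also rewrite genuine numbers in the text.
UNLEET = str.maketrans({"4": "a", "3": "e", "1": "i", "0": "o", "5": "s", "7": "t"})

# Hypothetical tiny vocabulary; a real system would use a full dictionary.
VOCAB = ["i", "hate", "you", "love"]


def normalize(text, vocab=VOCAB, cutoff=0.7):
    """Undo leetspeak, then replace each token with its closest vocabulary word."""
    out = []
    for token in text.lower().translate(UNLEET).split():
        match = difflib.get_close_matches(token, vocab, n=1, cutoff=cutoff)
        out.append(match[0] if match else token)  # keep the token if nothing is close
    return " ".join(out)


print(normalize("1 h4te y0u"))  # i hate you
print(normalize("I htae you"))  # i hate you
print(normalize("Ihateyou"))    # ihateyou (whitespace deletion is not repaired)
```

The last line shows the limit of this kind of measure: it handles word-internal changes but does nothing against fused words, and it cannot tell whether an appended word belongs in the text.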
Character-models are far more resistant to word boundary changes than word-models. This is as expected, since most character sequences are still retained even if word identities are destroyed. As character-models performed as well as word-models in our comparative tests, we recommend using them to make hate speech detection more resistant to attacks.