Most Text-Based ML Systems Are Broken.
Unlike human writing, modern computers can encode any given piece of text in a near-infinite number of distinct logical encodings. This follows from the fact that common text-encoding standards, such as Unicode, provide ways for the logical encoding of a string to differ from its visual rendering.
Since text-based ML models, and most NLP systems more broadly, operate upon the logical encoding of text as inputs, the difference between logical encoding and visual rendering can be used to deceive users and adversarially control the output of these systems.
These Differences Can Manipulate NLP Systems.
Machine translation systems, search engines, spam filters, text classifiers, and nearly any other system that processes raw text-based user input can be manipulated using these tactics, which we collectively label imperceptible perturbations.
Exploitation of differences between logical and visual representations of text can take different forms in different settings.
Machine learning-based NLP systems are vulnerable to adversarial perturbations that are imperceptible to human users. This means that a motivated adversary can use imperceptible perturbations to control a system's output for a fixed visual input. Similarly, such perturbations can be used to poison training data.
In an entirely different setting, these same methods can be used not only to degrade the performance of search engines, but also to let content publishers functionally hide content from search engine indexing systems.
There Are Four Flavors of Attack.
Four different classifications of imperceptible perturbations exist:
Invisible Characters are a subset of characters that are not intended to render as a visible glyph, such as zero-width spaces. These cross-platform characters can be injected into strings with no limit.
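As a concrete sketch (the example word and helper function here are illustrative, not taken from the paper), a zero-width space can be interleaved into a string without changing how it renders in most fonts, while changing its logical encoding entirely:

```python
ZWSP = "\u200b"  # ZERO WIDTH SPACE: renders to no visible glyph

def inject_invisible(text: str) -> str:
    """Interleave a zero-width space between every pair of characters."""
    return ZWSP.join(text)

original = "bank"
perturbed = inject_invisible(original)

print(len(original))          # 4
print(len(perturbed))         # 7 -- three invisible characters added
print(original == perturbed)  # False, despite identical rendering
```

Because the perturbed string is logically distinct, a model that hashes, tokenizes, or embeds the raw encoding sees a different input than the one the user sees on screen.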
Homoglyphs are distinct characters that render to the same or nearly the same glyph, such as the Latin a and the Cyrillic а. If any homoglyphs exist for a given character, they can be swapped freely in most fonts.
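A minimal Python sketch of a homoglyph swap (the example word is illustrative), replacing the Latin a (U+0061) with the visually near-identical Cyrillic а (U+0430):

```python
import unicodedata

latin = "paypal"
cyrillic = "p\u0430yp\u0430l"  # both "a"s replaced with Cyrillic а

# The two strings render near-identically but are logically distinct.
print(latin == cyrillic)              # False
print(unicodedata.name(cyrillic[1]))  # CYRILLIC SMALL LETTER A
```

Unlike invisible characters, a homoglyph swap keeps the string the same length; it is the identity of individual code points that changes.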
Reorderings are methods by which special control characters can be used to change the rendering order of encoded characters. Although rendering order implementations vary by platform, well-crafted reorderings will render as desired on most modern platforms and can be injected an arbitrary number of times.
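One common building block for reorderings is Unicode's Bidi override controls. The sketch below (an illustrative construction, not the paper's exact recipe) encodes a string in reverse, wrapped in a RIGHT-TO-LEFT OVERRIDE, so that Bidi-aware renderers display the characters in the original order while the logical encoding holds them reversed:

```python
RLO = "\u202e"  # RIGHT-TO-LEFT OVERRIDE: forces right-to-left rendering
PDF = "\u202c"  # POP DIRECTIONAL FORMATTING: ends the override

def visually_reverse(text: str) -> str:
    """Encode text reversed inside a Bidi override; on Bidi-aware
    platforms it typically renders in the original left-to-right order."""
    return RLO + text[::-1] + PDF

encoded = visually_reverse("abc")
print(list(encoded))      # ['\u202e', 'c', 'b', 'a', '\u202c']
print("abc" in encoded)   # False -- the logical substring never appears
```

A keyword search or tokenizer scanning the logical encoding will never match "abc", even though that is what a reader may see.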
Deletions are methods by which control characters designed to remove text, such as backspace, are used to hide characters within strings. Deletions are platform dependent and will only render as desired in some settings, such as strings passed through Python's
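As an illustrative sketch (the example word is hypothetical), a decoy character followed by a BACKSPACE control (U+0008) can be embedded in a string; in settings that honor backspace during rendering, the decoy disappears visually while remaining in the logical encoding:

```python
BKSP = "\x08"  # BACKSPACE control character

# Logically the string contains a decoy "x" followed by a backspace;
# renderers that honor backspace display only "hello".
perturbed = "hel" + "x" + BKSP + "lo"

print(len(perturbed))         # 7
print("hello" in perturbed)   # False -- the visible word never appears
                              # as a contiguous logical substring
```

Note that whether the decoy actually vanishes depends entirely on the rendering environment, which is why deletions are the least portable of the four attack classes.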
It's possible to defend against imperceptible perturbation attacks.
These defenses take different forms in different settings and can be quite nuanced. The proper defense for one setting, such as English language NLP systems, may not be appropriate for other settings, such as multilingual search engines.
The key defense takeaway for imperceptible perturbations is that user inputs must be sanitized before ingress into an NLP pipeline. Without this, users may be vulnerable to adversarially manipulated results. Much like the consequences of SQL injection, imperceptible perturbations require conscious design decisions for all systems using affected technologies.
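As one illustrative sanitization sketch (an assumed policy, not the paper's prescribed defense), stripping Unicode format and control characters on ingress neutralizes invisible characters, Bidi reorderings, and deletion controls; homoglyphs would still need separate handling, such as mapping confusable characters to a canonical script:

```python
import unicodedata

def sanitize(text: str) -> str:
    """Drop format (Cf) and control (Cc) characters from user input,
    keeping common whitespace, before it enters an NLP pipeline."""
    keep = {"\t", "\n", "\r"}
    return "".join(
        c for c in text
        if c in keep or unicodedata.category(c) not in ("Cf", "Cc")
    )

print(sanitize("ba\u200bnk"))        # "bank" -- zero-width space removed
print(sanitize("\u202ecba\u202c"))   # "cba"  -- Bidi overrides stripped
```

Whether such aggressive stripping is appropriate depends on the setting: multilingual systems legitimately use some format characters (for example, zero-width joiners in Arabic and Indic scripts), so the filter policy must be chosen per application.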
Try It Out.
Use the Perturbation Generator tool to generate your own imperceptible perturbations in the browser. If you'd like to check whether a specific string contains imperceptible perturbations, just paste it into the Attack Detector tool.
There's More to Know.
Read our paper to learn the details of crafting and defending against imperceptible perturbations.
If you use our paper or anything on this site in your own work, please cite the following: