Late Unicode normalisation as a DoS primitive
Filenames, email addresses, identifiers — anywhere user input gets normalised after a length check, the multiplier between characters in and characters out becomes a denial-of-service primitive. Three case studies.
There’s a class of denial-of-service bug that almost nobody thinks about until it lands on their disclosure desk: Unicode normalisation that happens after the validation check. The Unicode standard’s compatibility decompositions mean NFKC normalisation can expand a single codepoint into as many as eighteen — more than an order of magnitude. If your len(input) < MAX check runs before normalisation, an attacker only needs to send MAX / 18 codepoints to produce a MAX-sized output.
Multiply that by an unbounded operation downstream — a regex, a database insert, a filename write — and you have a DoS.
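The arithmetic is easy to verify in a REPL (the 100-codepoint payload here is arbitrary; any count that fits under the limit works):

```python
import unicodedata

payload = "\uFDFA" * 100                    # 100 codepoints: sails past a "< 256" check
expanded = unicodedata.normalize("NFKC", payload)
print(len(payload), len(expanded))          # 100 1800
```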
The expansion table
Some of the worst offenders, by ratio of input codepoints to output codepoints after NFKC:
| Codepoint | After NFKC | Ratio |
|---|---|---|
| U+FDFA | 18 chars | 18× |
| U+FDFB | 8 chars | 8× |
| U+FB2C | 3 chars | 3× |
U+FDFA is the classic — the Arabic ligature for “ṣallā Allāhu ʿalayhi wa-sallam.” A single codepoint, eighteen characters after normalisation. Send a million of them and you’re feeding eighteen million characters into whatever happens next.
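The table is reproducible with nothing but the standard library:

```python
import unicodedata

for cp in ("\uFDFA", "\uFDFB", "\uFB2C"):
    out = unicodedata.normalize("NFKC", cp)
    print(f"U+{ord(cp):04X} -> {len(out)} chars")
# U+FDFA -> 18 chars
# U+FDFB -> 8 chars
# U+FB2C -> 3 chars
```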
Where this has actually landed
Three CVEs I’ve filed against real software, all with the same root cause:
- CVE-2024-0081 (NVIDIA NeMo) — user-controlled Unicode filenames cause server-side DoS during preprocessing.
- CVE-2024-32874 (Frigate) — multiple application-level DoS via long Unicode filenames.
- CVE-2024-45412 (Yeti Platform) — one-million-Unicode-character attack on input handling.
In each case, the application accepted a filename through a “max 256 chars” validator that ran on the raw input, then handed the result to a code path that normalised, indexed, and re-encoded it. The DoS appeared at the indexing step.
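Reduced to a sketch (hypothetical code, not lifted from any of the three projects), the vulnerable shape looks like this:

```python
import unicodedata

MAX = 256

def accept_filename(raw: str) -> str:
    if len(raw) > MAX:                      # the validator sees the raw input...
        raise ValueError("too long")
    # ...but the downstream path sees the normalised form, up to 18x larger
    return unicodedata.normalize("NFKC", raw)
```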
The fix is one line
Normalise first, then validate. That’s it. If you can’t move the normalisation step earlier, validate against the normalised length:
```python
import unicodedata

MAX = 256

def safe_filename(raw: str) -> str:
    # Normalise before checking: the check must see the expanded length.
    normalised = unicodedata.normalize('NFKC', raw)
    if len(normalised) > MAX:
        raise ValueError("too long")
    return normalised
```
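Against the earlier payload, the reordered check rejects the input instead of amplifying it:

```python
safe_filename("a" * 200)         # fine: 200 chars before and after NFKC
safe_filename("\uFDFA" * 100)    # raises ValueError: 100 chars raw, 1800 normalised
```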
This shouldn’t be controversial. It is, somehow, controversial.
Why this keeps happening
I think there are two reasons. First, the normalisation step is often implicit — a filesystem call, a database driver, a regex engine — and not visible in the code path the developer is reviewing. Second, the failure mode is asymmetric: the input looks tiny, the output is huge, and the developer’s mental model of “I’m validating a 256-byte string” stays intact even after the bug ships.
If any operation downstream of your length check can allocate memory based on the input’s content rather than the length you measured, your length check is wrong.
Self-test
Take any field where a user can supply free-form text. Send "\uFDFA" * 60_000: sixty thousand codepoints that normalise to just over a million characters. If the server takes longer than a second to respond, or returns an error that smells like memory pressure, you have the bug.
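A rough harness for that probe (the URL and field name here are placeholders for whatever free-form field your application exposes):

```python
import time
import requests

payload = "\uFDFA" * 60_000      # ~1.08M characters after NFKC

start = time.monotonic()
resp = requests.post("https://example.test/upload",
                     data={"filename": payload}, timeout=30)
print(f"{resp.status_code} after {time.monotonic() - start:.2f}s")
```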
If your application takes user-supplied filenames or identifiers and you want me to look at it, email’s open.