Real-world String Comparison

How to handle Unicode sequences correctly

Torsten Ullrich, Fraunhofer Austria and Technische Universität Graz

In many languages a string comparison is a pitfall for beginners. With any Unicode string as input, a comparison often causes problems even for advanced users. The semantic equivalence of different characters in Unicode (represented by different byte sequences) requires a normalization of the strings before comparing them. This article shows how to handle Unicode sequences correctly; although all examples are illustrated in Java, the problem and the solution are language-independent.

One of the many built-in data types in Java is the String, which represents text or sequences of characters. The comparison of two strings for equality often raises questions concerning the difference between comparison by value, comparison of object references, strict equality, and loose equality. Unfortunately, the questions are usually answered only superficially, with many aspects not taken into account. The most important aspect, which is almost always forgotten, is semantic equivalence.

Object-Oriented Programming

Strings are objects. To test objects (i.e., two reference types) for equality, the Java Language Specification² states the following:

"The result of == is true if the operand values are both null or both refer to the same object or array; otherwise, the result is false. [. . . ] While == may be used to compare references of type String, such an equality test determines whether or not the two operands refer to the same String object. The result is false if the operands are distinct String objects, even if they contain the same sequence of characters [. . . ]. The contents of two strings s and t can be tested for equality by the method invocation s.equals(t)."

This behavior is shown in figure 1, an example application illustrating a simple string comparison in Java. Its output result is:

 Simple Test: 
   String #1: 'John' [74, 111, 104, 110] 
   String #2: 'John' [74, 111, 104, 110] 
   #1 == #2 : false 
   #1 equals #2 : true

Does this definition resolve everything? Unfortunately, it's only half the truth. The problem is that the same letter can be represented by different sequences of characters.

From ASCII to Unicode

Computers deal only with numbers and not letters. Electronic communication therefore requires a standardized mapping between characters and numbers (and vice versa). In the 1960s, ASCII (American Standard Code for Information Interchange) was created by the American Standards Association, now called ANSI (American National Standards Institute). The standard encodes 128 characters into seven-bit integers. The printable characters, based on the English language, include the Latin alphabet in upper- and lowercase, the 10 Arabic digits, and some punctuation marks, word marks, and special characters.

The microprocessors of the 1970s preferred to work with eight bits. Representing a character with eight bits results in 256 possible values. While the first 128 values of the eight-bit encoding correspond to ASCII encoding, the values from 128 to 255 have different interpretations, called code pages.

Unicode was proposed as a new standard in the 1980s. In Unicode the letters are called code points, which are usually represented by four hexadecimal digits. The standard tries to define codes for all meaningful text elements of all known cultures. The ASCII characters have been transferred into the Unicode standard and form the first 128 code points. In version 13.0, published in March 2020, Unicode had 143,859 characters.³

While ASCII contains too few characters, Unicode probably contains too many. Regardless of the number of characters, the lack of uniqueness is a problem. For example, the letter é can be represented in Unicode in many ways. The previous example with the strings

String rene0 = "nu0052nu0065nu006enu00e9";

and

String rene1 = "nu0052nu0065nu006enu0065nu0301";

demonstrates this problem using the name René. Unfortunately, the results of

compare("Unicode Test", rene0, rene1);

using both the equality operator and the equals method are not satisfactory. Neither the operator nor the method is able to identify equivalent strings. Therefore, both approaches return false.

 Unicode Test: 
   String #1: 'René' [82, 101, 110, -61, -87] 
   String #2: 'René' [82, 101, 110, 101, -52, -127] 
   #1 == #2 : false 
   #1 equals #2 : false

The solution to this problem is a normalization routine—an algorithm that rewrites a string into a canonical, normal, "standard" form. In Java, the class java.text.Normalizer provides text transformations into the standard normalization forms described in Unicode Standard Annex #15.³ As recommended by the W3C (World Wide Web Consortium), the following solution uses canonical decomposition, followed by canonical composition—NFC (Normalization Form Canonical Composition). Figure 2 demonstrates the string normalization in Java. It returns

 Unicode Test: 
   String #1: 'René' [82, 101, 110, -61, -87] 
   String #2: 'René' [82, 101, 110, 101, -52, -127] 
   #1 == #2 : false 
   #1 equals #2 : false 
  
Normalized Unicode Test: 
   String #1: 'René' [82, 101, 110, -61, -87] 
   String #2: 'René' [82, 101, 110, -61, -87] 
   #1 == #2 : false 
   #1 equals #2 : true

The example in Figure 2 is the preferred way to compare strings in a "friendly" environment.

The way to handle strings and string comparisons in an "unfriendly" environment is described in the Unicode Technical Standard #39^4,5 on Unicode Security Mechanisms. If characters have to be differentiated according to visual appearance (e.g., in order to detect phishing attacks), then the problem becomes even more complicated. (Irongeek.com offers a rather playful, practical take on this topic.)

A Language-Independent Problem

Although all of these examples have been illustrated by source code in Java, the Unicode normalization problem is independent of any programming language:

• In Python the module unicodedata offers a function called normalize that transforms strings into normal form—NFC, NFKC (Normalization Form Compatibility Composition), NFD (Normalization Form Canonical Decomposition), and NFKD (Normalization Form Compatibility Decomposition).

• The C# function public string Normalize() in namespace System returns a new string whose textual value is equivalent to the input string, but whose binary representation is in Unicode normalization form NFC.

• In C++ Unicode string normalization is not part of the standard libraries. As a consequence, external libraries have to be used—such as g_utf8_normalize in Glib; QString::normalized in Qt; or NormalizeString in Windows.h.

In JavaScript the ECMAScript 6 standard introduced the normalization function String.prototype.normalize(), which takes care of Unicode normalization.

String Manipulation

In addition to the test for equality, all other string functions should, of course, also support Unicode. For example, the length of a word should not depend on its Unicode representation; the reverse of a string should not reverse code units that have to be interpreted in a particular order; and regular expressions that match semantic units (e.g., punctuation marks) have to be aware of Unicode. For example, the list of Unicode space characters comprehends:

 SPACE U+0020 
NO BREAK SPACE U+00A0 
PUNCTUATION SPACE U+2008 
THIN SPACE U+2009 
HAIR SPACE U+200A 
. . . . . .

The Wikipedia article on whitespace characters (https://en.wikipedia.org/wiki/Whitespace_character) lists all semantically equivalent (and more) whitespaces (as of April 2021).

An interesting collection of curiosities (test cases) can be found in the Big List of Naughty Strings, which includes strings that have a high probability of causing issues when used as user-input data.¹ A look at this list gives an idea of what is possible with Unicode (see figure 3, which shows how Unicode contains several variants of letters—especially italics, bold, etc.).

In short, string comparisons are anything but easy.

References

1. Big List of Naughty Strings. GitHub; https://github.com/minimaxir/big-list-of-naughty-strings/blob/master/blns.txt.

2. Gosling, J., Joy, B., Steele, G., Bracha, G., Buckley, A., Smith, D., Bierman, G. 2021. The Java Language Specification, Java SE 16 Edition. Oracle America, Inc.; https://docs.oracle.com/javase/specs/jls/se16/jls16.pdf.

3. Unicode Consortium. 2018. The Unicode Standard, Version 11.0 Core Specification; http://www.unicode.org/versions/components-11.0.0.html.

4. Unicode Consortium. 2020. Unicode Standard Annex #15, Unicode Normalization Forms; https://www.unicode.org/reports/tr15/tr15-50.html.

5. Unicode Consortium. 2020 Unicode Technical Standard #39, Unicode Security Mechanisms; https://www.unicode.org/reports/tr39/tr39-22.html.

Torsten Ullrich is a researcher in the visual computing department of Fraunhofer Austria Research GmbH and at the Institute of Computer Graphics and Knowledge Visualization of Graz University of Technology, Austria. He studied mathematics at the University Karlsruhe (TH) and received a Ph.D. in reconstructive geometry in computer science from Graz University of Technology in 2011. His main research areas are visual computing in combination with numerical optimization, statistics, and data analysis. He is the deputy head of the business area in visual computing at Fraunhofer Austria Research GmbH, where he is responsible for scientific research coordination.

Originally published in Queue vol. 19, no. 3—
Comment on this article in the ACM Digital Library