Download PDF version of this article PDF

Real-world String Comparison

How to handle Unicode sequences correctly

Torsten Ullrich, Fraunhofer Austria and Technische Universität Graz

 

In many languages a string comparison is a pitfall for beginners. With any Unicode string as input, a comparison often causes problems even for advanced users. The semantic equivalence of different characters in Unicode (represented by different byte sequences) requires a normalization of the strings before comparing them. This article shows how to handle Unicode sequences correctly; although all examples are illustrated in Java, the problem and the solution are language-independent.

One of the many built-in data types in Java is the String, which represents text or sequences of characters. The comparison of two strings for equality often raises questions concerning the difference between comparison by value, comparison of object references, strict equality, and loose equality. Unfortunately, the questions are usually answered only superficially, with many aspects not taken into account. The most important aspect, which is almost always forgotten, is semantic equivalence.

 

Object-Oriented Programming

Strings are objects. To test objects (i.e., two reference types) for equality, the Java Language Specification2 states the following:

 

"The result of == is true if the operand values are both null or both refer to the same object or array; otherwise, the result is false. [. . . ] While == may be used to compare references of type String, such an equality test determines whether or not the two operands refer to the same String object. The result is false if the operands are distinct String objects, even if they contain the same sequence of characters [. . . ]. The contents of two strings s and t can be tested for equality by the method invocation s.equals(t)."

 

This behavior is shown in figure 1, an example application illustrating a simple string comparison in Java. Its output result is:

Real-world String Comparison

 

 Simple Test: 
   String #1: 'John' [74, 111, 104, 110]
   String #2: 'John' [74, 111, 104, 110]
   #1 == #2 : false
   #1 equals #2 : true

 

Does this definition resolve everything? Unfortunately, it's only half the truth. The problem is that the same letter can be represented by different sequences of characters.

 

From ASCII to Unicode

Computers deal only with numbers and not letters. Electronic communication therefore requires a standardized mapping between characters and numbers (and vice versa). In the 1960s, ASCII (American Standard Code for Information Interchange) was created by the American Standards Association, now called ANSI (American National Standards Institute). The standard encodes 128 characters into seven-bit integers. The printable characters, based on the English language, include the Latin alphabet in upper- and lowercase, the 10 Arabic digits, and some punctuation marks, word marks, and special characters.

The microprocessors of the 1970s preferred to work with eight bits. Representing a character with eight bits results in 256 possible values. While the first 128 values of the eight-bit encoding correspond to ASCII encoding, the values from 128 to 255 have different interpretations, called code pages.

Unicode was proposed as a new standard in the 1980s. In Unicode the letters are called code points, which are usually represented by four hexadecimal digits. The standard tries to define codes for all meaningful text elements of all known cultures. The ASCII characters have been transferred into the Unicode standard and form the first 128 code points. In version 13.0, published in March 2020, Unicode had 143,859 characters.3

While ASCII contains too few characters, Unicode probably contains too many. Regardless of the number of characters, the lack of uniqueness is a problem. For example, the letter é can be represented in Unicode in many ways. The previous example with the strings

 

  String rene0 = "nu0052nu0065nu006enu00e9";

 

and

 

  String rene1 = "nu0052nu0065nu006enu0065nu0301";

 

demonstrates this problem using the name René. Unfortunately, the results of

 

  compare("Unicode Test", rene0, rene1);

 

using both the equality operator and the equals method are not satisfactory. Neither the operator nor the method is able to identify equivalent strings. Therefore, both approaches return false.

 

 Unicode Test: 
   String #1: 'René' [82, 101, 110, -61, -87]
   String #2: 'René' [82, 101, 110, 101, -52, -127]
   #1 == #2 : false
   #1 equals #2 : false

 

The solution to this problem is a normalization routine—an algorithm that rewrites a string into a canonical, normal, "standard" form. In Java, the class java.text.Normalizer provides text transformations into the standard normalization forms described in Unicode Standard Annex #15.3 As recommended by the W3C (World Wide Web Consortium), the following solution uses canonical decomposition, followed by canonical composition—NFC (Normalization Form Canonical Composition). Figure 2 demonstrates the string normalization in Java. It returns

Real-world String Comparison

 

 Unicode Test: 
   String #1: 'René' [82, 101, 110, -61, -87]
   String #2: 'René' [82, 101, 110, 101, -52, -127]
   #1 == #2 : false
   #1 equals #2 : false
 
Normalized Unicode Test:
   String #1: 'René' [82, 101, 110, -61, -87]
   String #2: 'René' [82, 101, 110, -61, -87]
   #1 == #2 : false
   #1 equals #2 : true

 

The example in Figure 2 is the preferred way to compare strings in a "friendly" environment.

The way to handle strings and string comparisons in an "unfriendly" environment is described in the Unicode Technical Standard #394,5 on Unicode Security Mechanisms. If characters have to be differentiated according to visual appearance (e.g., in order to detect phishing attacks), then the problem becomes even more complicated. (Irongeek.com offers a rather playful, practical take on this topic.)

 

A Language-Independent Problem

Although all of these examples have been illustrated by source code in Java, the Unicode normalization problem is independent of any programming language:

• In Python the module unicodedata offers a function called normalize that transforms strings into normal form—NFC, NFKC (Normalization Form Compatibility Composition), NFD (Normalization Form Canonical Decomposition), and NFKD (Normalization Form Compatibility Decomposition).

• The C# function public string Normalize() in namespace System returns a new string whose textual value is equivalent to the input string, but whose binary representation is in Unicode normalization form NFC.

• In C++ Unicode string normalization is not part of the standard libraries. As a consequence, external libraries have to be used—such as g_utf8_normalize in Glib; QString::normalized in Qt; or NormalizeString in Windows.h.

In JavaScript the ECMAScript 6 standard introduced the normalization function String.prototype.normalize(), which takes care of Unicode normalization.

 

String Manipulation

In addition to the test for equality, all other string functions should, of course, also support Unicode. For example, the length of a word should not depend on its Unicode representation; the reverse of a string should not reverse code units that have to be interpreted in a particular order; and regular expressions that match semantic units (e.g., punctuation marks) have to be aware of Unicode. For example, the list of Unicode space characters comprehends:

 

 SPACE U+0020 
NO BREAK SPACE U+00A0
PUNCTUATION SPACE U+2008
THIN SPACE U+2009
HAIR SPACE U+200A
. . . . . .

 

The Wikipedia article on whitespace characters (https://en.wikipedia.org/wiki/Whitespace_character) lists all semantically equivalent (and more) whitespaces (as of April 2021).

An interesting collection of curiosities (test cases) can be found in the Big List of Naughty Strings, which includes strings that have a high probability of causing issues when used as user-input data.1 A look at this list gives an idea of what is possible with Unicode (see figure 3, which shows how Unicode contains several variants of letters—especially italics, bold, etc.).

Real-world String Comparison

In short, string comparisons are anything but easy.

 

References

1. Big List of Naughty Strings. GitHub; https://github.com/minimaxir/big-list-of-naughty-strings/blob/master/blns.txt.

2. Gosling, J., Joy, B., Steele, G., Bracha, G., Buckley, A., Smith, D., Bierman, G. 2021. The Java Language Specification, Java SE 16 Edition. Oracle America, Inc.; https://docs.oracle.com/javase/specs/jls/se16/jls16.pdf.

3. Unicode Consortium. 2018. The Unicode Standard, Version 11.0 Core Specification; http://www.unicode.org/versions/components-11.0.0.html.

4. Unicode Consortium. 2020. Unicode Standard Annex #15, Unicode Normalization Forms; https://www.unicode.org/reports/tr15/tr15-50.html.

5. Unicode Consortium. 2020 Unicode Technical Standard #39, Unicode Security Mechanisms; https://www.unicode.org/reports/tr39/tr39-22.html.

 

Torsten Ullrich is a researcher in the visual computing department of Fraunhofer Austria Research GmbH and at the Institute of Computer Graphics and Knowledge Visualization of Graz University of Technology, Austria. He studied mathematics at the University Karlsruhe (TH) and received a Ph.D. in reconstructive geometry in computer science from Graz University of Technology in 2011. His main research areas are visual computing in combination with numerical optimization, statistics, and data analysis. He is the deputy head of the business area in visual computing at Fraunhofer Austria Research GmbH, where he is responsible for scientific research coordination.

Copyright © 2021 held by owner/author. Publication rights licensed to ACM.

acmqueue

Originally published in Queue vol. 19, no. 3
see this item in the ACM Digital Library


Tweet


Related:

Ashish Gehani, Raza Ahmad, Hassan Irshad, Jianqiao Zhu, Jignesh Patel - Digging into Big Provenance (with SPADE)
Several interfaces exist for querying provenance. Many are not flexible in allowing users to select a database type of their choice. Some provide query functionality in a data model that is different from the graph-oriented one that is natural for provenance. Others have intuitive constructs for finding results but have limited support for efficiently chaining responses, as needed for faceted search. This article presents a user interface for querying provenance that addresses these concerns and is agnostic to the underlying database being used.


Pat Helland - Data on the Outside vs. Data on the Inside
This article describes the impact of services and trust on the treatment of data. It introduces the notions of inside data as distinct from outside data. After discussing the temporal implications of not sharing transactions across the boundaries of services, the article considers the need for immutability and stability in outside data. This leads to a depiction of outside data as a DAG of data items being independently generated by disparate services.


Kate Matsudaira - The Science of Managing Data Science
What are they doing all day? When I first took over as VP of Engineering at a startup doing data mining and machine learning research, this was what the other executives wanted to know. They knew the team was super smart, and they seemed like they were working really hard, but the executives had lots of questions about the work itself. How did they know that the work they were doing was the "right" work? Were there other projects they could be doing instead? And how could we get this research into the hands of our customers faster?


Lucian Carata, Sherif Akoush, Nikilesh Balakrishnan, Thomas Bytheway, Ripduman Sohan, Margo Seltzer, Andy Hopper - A Primer on Provenance
Assessing the quality or validity of a piece of data is not usually done in isolation. You typically examine the context in which the data appears and try to determine its original sources or review the process through which it was created. This is not so straightforward when dealing with digital data, however: the result of a computation might have been derived from numerous sources and by applying complex successive transformations, possibly over long periods of time.





© 2021 ACM, Inc. All Rights Reserved.