Monday, 18 December 2017

Computer Science Internationalization - Grapheme Clusters

Unicode has sequences of code points that combine into a single human-perceived character. In Unicode parlance, these human-perceived characters are called grapheme clusters. For example:
  • s̄ sequence is U+0073 LATIN SMALL LETTER S, U+0304 COMBINING MACRON
  • 🇰🇷 sequence is U+1F1F0 REGIONAL INDICATOR SYMBOL LETTER K, U+1F1F7 REGIONAL INDICATOR SYMBOL LETTER R
  • 👩🏿‍🔬 sequence is U+1F469 WOMAN, U+1F3FF EMOJI MODIFIER FITZPATRICK TYPE-6, U+200D ZERO WIDTH JOINER, U+1F52C MICROSCOPE
  • 👨‍👩‍👧‍👧 sequence is U+1F468 MAN, U+200D ZERO WIDTH JOINER, U+1F469 WOMAN, U+200D ZERO WIDTH JOINER, U+1F467 GIRL, U+200D ZERO WIDTH JOINER, U+1F467 GIRL
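To make the gap between code points and perceived characters concrete, here is a minimal Python sketch that builds the four sequences above from explicit escapes and lists their code points. The dictionary labels and the describe helper are my own illustrative names, and the character names printed depend on the Unicode version bundled with the Python build in use.

```python
import unicodedata

# The four sequences from the list above, written as explicit escapes.
# The labels are illustrative only.
CLUSTERS = {
    "s with macron": "\u0073\u0304",
    "regional indicators K R": "\U0001F1F0\U0001F1F7",
    "woman, type-6 modifier, ZWJ, microscope": "\U0001F469\U0001F3FF\u200D\U0001F52C",
    "man, woman, girl, girl joined by ZWJ": (
        "\U0001F468\u200D\U0001F469\u200D\U0001F467\u200D\U0001F467"
    ),
}


def describe(text):
    """Return one 'U+XXXX NAME' string per code point in text."""
    return [
        "U+%04X %s" % (ord(ch), unicodedata.name(ch, "<unnamed>"))
        for ch in text
    ]


for label, text in CLUSTERS.items():
    # len() counts code points, not grapheme clusters.
    print(label, "->", len(text), "code points")
    for line in describe(text):
        print("   ", line)
```

Each entry is perceived as one character, yet len() reports every code point in it, which is why a grapheme-aware construct such as \X is needed.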
The regular expression (regex) construct \X matches a grapheme cluster, or rather it should. I tried several regex implementations, both in programming languages and from the Unix command line. None of them matched sequences containing U+200D ZERO WIDTH JOINER, although they worked fine with sequences that did not contain it. Then I discovered PCRE2 (Perl Compatible Regular Expressions), which worked with all of the grapheme clusters above, including those with U+200D ZERO WIDTH JOINER in the sequence. The one thing to remember is the -u option of its pcre2grep tool, which selects UTF-8 mode. For my own files I cannot recall ever using an encoding other than UTF-8.

My test command was pcre2grep -u '^\X{1}$', which matches any line consisting of exactly one grapheme cluster.
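The same test can be driven from a script. The following is only a sketch, assuming pcre2grep is on the PATH. The function name single_cluster_lines and the extra "ab" sample line are my own additions for illustration. Which lines come back depends on the installed PCRE2 version and the Unicode data it ships with.

```python
import shutil
import subprocess

# One candidate per line: the four sequences discussed above, plus "ab",
# which is two grapheme clusters and so cannot match ^\X{1}$.
SAMPLE_LINES = [
    "s\u0304",
    "\U0001F1F0\U0001F1F7",
    "\U0001F469\U0001F3FF\u200D\U0001F52C",
    "\U0001F468\u200D\U0001F469\u200D\U0001F467\u200D\U0001F467",
    "ab",
]


def single_cluster_lines(lines):
    """Return the lines pcre2grep matches, or None if it is not installed."""
    if shutil.which("pcre2grep") is None:
        return None
    data = "\n".join(lines) + "\n"
    result = subprocess.run(
        ["pcre2grep", "-u", r"^\X{1}$"],
        input=data.encode("utf-8"),
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
    )
    # Exit status 0 means matches were found and 1 means none were found;
    # anything else is treated as an error.
    if result.returncode not in (0, 1):
        raise RuntimeError(result.stderr.decode("utf-8", "replace"))
    return result.stdout.decode("utf-8").splitlines()


matched = single_cluster_lines(SAMPLE_LINES)
if matched is None:
    print("pcre2grep not found on PATH")
else:
    print(len(matched), "of", len(SAMPLE_LINES), "lines matched")
```

Passing the pattern as an argument list avoids shell quoting problems with the backslash and braces.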

If you would like to use pcre2grep, you have to do a little work, as it is not installed by default on OS X. First, you need a package manager. I chose Homebrew (brew.sh). With Homebrew, PCRE2 is installed with the command brew install pcre2 which, when I ran it, installed PCRE2 version 10.30.

Running the command pcre2test -C reveals that this build of PCRE2 supports Unicode version 10. It is therefore impressively up to date: Unicode 10 was released on 20 June 2017 and PCRE2 10.30 on 14 August 2017.