Monday 18 December 2017

Computer Science Internationalization - Grapheme Clusters

Unicode has sequences of code points that reduce to single human-perceived characters. In Unicode parlance, these human-perceived characters are referred to as grapheme clusters.
  • s̄ sequence is U+0073 LATIN SMALL LETTER S U+0304 COMBINING MACRON
  • 🇰🇷 sequence is U+1F1F0 REGIONAL INDICATOR SYMBOL LETTER K U+1F1F7 REGIONAL INDICATOR SYMBOL LETTER R
  • 👩🏿‍🔬 sequence is U+1F469 WOMAN U+1F3FF EMOJI MODIFIER FITZPATRICK TYPE-6 U+200D ZERO WIDTH JOINER U+1F52C MICROSCOPE
  • 👨‍👩‍👧‍👧 sequence is U+1F468 MAN U+200D ZERO WIDTH JOINER U+1F469 WOMAN U+200D ZERO WIDTH JOINER U+1F467 GIRL U+200D ZERO WIDTH JOINER U+1F467 GIRL
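As a small illustration of the sequences listed above, the following Python sketch prints the code points behind two of them using only the standard library. The helper name describe is hypothetical and used only for this example; the names printed depend on the Unicode version bundled with the Python interpreter.

```python
import unicodedata


def describe(text):
    """Return one 'U+XXXX NAME' string per code point in text (hypothetical helper)."""
    return [f"U+{ord(ch):04X} {unicodedata.name(ch, '<unnamed>')}" for ch in text]


# s with macron: a base letter followed by a combining mark
s_macron = "s\u0304"
# Family emoji: four people joined by three ZERO WIDTH JOINERs
family = "\U0001F468\u200D\U0001F469\u200D\U0001F467\u200D\U0001F467"

for sample in (s_macron, family):
    print(len(sample), "code points:")
    for line in describe(sample):
        print("  ", line)
```

Each sample is displayed as one character, yet len() reports the number of code points, which is why a plain "one character" test on a string is not enough.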
The regular expression (regex) construct \X should match a grapheme cluster. I tried several regex implementations, both in programming languages and from the Unix command line. None of those I tried matched sequences containing U+200D ZERO WIDTH JOINER, although they worked fine with sequences that did not contain it. Then I discovered PCRE2 (Perl Compatible Regular Expressions). It worked with all the grapheme clusters, including those with U+200D ZERO WIDTH JOINER in the sequence. The one thing to remember is the -u option, which specifies UTF-8 encoding. For my own files, I cannot recollect ever using an encoding other than UTF-8.

My test regex command was pcre2grep -u '^\X{1}$', which matches any line consisting of a single grapheme cluster.
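To show why sequences containing U+200D are the awkward case, here is a deliberately simplified Python sketch of grapheme clustering. It is not the full Unicode segmentation algorithm and not how PCRE2 implements \X; the function names are hypothetical, and the rules cover only combining marks, emoji skin-tone modifiers, regional indicator pairs, and ZERO WIDTH JOINER, on the assumption that whatever follows a joiner belongs to the same cluster.

```python
import unicodedata

ZWJ = "\u200d"  # ZERO WIDTH JOINER


def is_regional_indicator(ch):
    # REGIONAL INDICATOR SYMBOL LETTER A..Z
    return 0x1F1E6 <= ord(ch) <= 0x1F1FF


def is_extender(ch):
    # Combining marks (this covers variation selectors) and
    # the emoji skin-tone modifiers U+1F3FB..U+1F3FF
    return (unicodedata.category(ch) in ("Mn", "Mc", "Me")
            or 0x1F3FB <= ord(ch) <= 0x1F3FF)


def simple_clusters(text):
    """Split text into approximate grapheme clusters (simplified rules only)."""
    clusters = []
    for ch in text:
        prev = clusters[-1] if clusters else ""
        joins = bool(prev) and (
            ch == ZWJ                      # a joiner sticks to what precedes it
            or is_extender(ch)             # marks and modifiers extend the cluster
            or prev.endswith(ZWJ)          # assumption: anything after a joiner is joined
            or (len(prev) == 1             # regional indicators pair up two at a time
                and is_regional_indicator(prev)
                and is_regional_indicator(ch))
        )
        if joins:
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters


samples = [
    "s\u0304",                                # s with combining macron
    "\U0001F1F0\U0001F1F7",                   # regional indicators K + R
    "\U0001F469\U0001F3FF\u200D\U0001F52C",   # woman, skin tone, ZWJ, microscope
    "\U0001F468\u200D\U0001F469\u200D\U0001F467\u200D\U0001F467",  # family
]
for sample in samples:
    print(len(sample), "code points ->", len(simple_clusters(sample)), "cluster(s)")
```

An implementation that lacks the two joiner rules would split the last two samples at each U+200D, which is consistent with the behaviour I saw in the tools that failed.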

If you would like to use pcre2grep, you have to do a little work, as it is not a standard OS X app. First, you need a package manager. I chose to install the Homebrew package manager (brew.sh). With Homebrew, pcre2 is installed using the command brew install pcre2, which, when I ran it, installed pcre2 version 10.30.

Running the command pcre2test -C reveals that the version of Unicode supported by PCRE2 is version 10. It is therefore impressively up to date, as Unicode 10 was released on 20 June 2017 and PCRE2 10.30 was released on 14 August 2017.

Thursday 14 December 2017

Computer Science Internationalization - Googleʼs Japanese TLDs

Google currently has two Japanese TLDs (top-level domains): グーグル, meaning Google, and みんな, meaning everyone. グーグル is currently dormant, but みんな now has quite a number of registered domains. I used the site: search operator to look for みんな domains: google.co.uk/search?q=site:.みんな. Within the first five pages of results I found the following Japanese domains. The domain names are well integrated into these websites; my standard practice is to list on my blog only those sites that have well-integrated domain names.
  1. 車査定.みんな
  2. 思春期ニキビ洗顔料.みんな
  3. 母の日.みんな
  4. バイク売る.みんな
  5. 子供ニキビ.みんな
  6. 通信教育.みんな
  7. 車買取.みんな
  8. 二子玉川美容室.みんな
  9. 柏美容室.みんな
  10. ミュゼ.みんな
  11. 資格.みんな
  12. かくめい.みんな
  13. 渡辺工業.みんな
  14. 洋服お直し.みんな
  15. メンズヘアー.みんな
  16. 営業.みんな
  17. ワーホリデビュー.みんな
  18. おきなわ.みんな
  19. 育毛剤.みんな
  20. ドラマ動画ネタバレあらすじ感想.みんな
  21. 債務整理で借金返済.みんな
  22. あったかマネー.みんな