Monday 18 December 2017

Computer Science Internationalization - Grapheme Clusters

Unicode has sequences of codepoints which reduce to single human perceived characters. In Unicode parlance these human perceived characters are referred to as Grapheme Clusters.
  • s̄ sequence is U+0073 LATIN SMALL LETTER S U+0304 COMBINING MACRON
  • 🇰🇷 sequence is U+1F1F0 REGIONAL INDICATOR SYMBOL LETTER K U+1F1F7 REGIONAL INDICATOR SYMBOL LETTER R
  • 👩🏿‍🔬 sequence is U+1F469 WOMAN U+1F3FF EMOJI MODIFIER FITZPATRICK TYPE-6 U+200D ZERO WIDTH JOINER U+1F52C MICROSCOPE
  • 👨‍👩‍👧‍👧 sequence is U+1F468 MAN ‎U+200D ZERO WIDTH JOINER U+1F469 WOMAN U+200D ZERO WIDTH JOINER U+1F467 GIRL U+200D ZERO WIDTH JOINER U+1F467 GIRL
The Regular Expression (regex) construct \X will match grapheme clusters, or rather it should. I tried several regex implementations in programming languages and from the unix command line. None that I tried would match sequences containing U+200D ZERO WIDTH JOINER. They did, though, work fine with sequences not containing U+200D. Then I discovered PCRE2 (Perl Compatible Regular Expressions). It worked just fine with all the grapheme clusters, including those having U+200D ZERO WIDTH JOINER in the sequence. The one thing to remember is to use the -u option which specifies utf-8 encoding. For my own files I cannot recollect ever using an encoding other than utf-8.

My test regex command was pcre2grep -u '^\X{1}$' which will match with a single grapheme cluster.
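For comparison, JavaScript regex syntax has no \X, but grapheme clusters can be counted with Intl.Segmenter. The following is a minimal sketch and not the pcre2grep approach above. The function name countGraphemes is hypothetical, and the sketch assumes a runtime that provides Intl.Segmenter (recent Node.js and browsers). Results depend on the Unicode data shipped with that runtime.

```javascript
// Count human perceived characters (grapheme clusters) in a string.
// Assumption: the runtime provides Intl.Segmenter.
function countGraphemes(text) {
  const segmenter = new Intl.Segmenter(undefined, { granularity: "grapheme" });
  let count = 0;
  for (const _segment of segmenter.segment(text)) {
    count += 1;
  }
  return count;
}

// Three of the sequences listed above, written as escapes so the codepoints are visible.
const sMacron = "\u0073\u0304";                          // s + COMBINING MACRON
const koreaFlag = "\u{1F1F0}\u{1F1F7}";                  // two REGIONAL INDICATOR SYMBOLs
const scientist = "\u{1F469}\u{1F3FF}\u200D\u{1F52C}";   // contains ZERO WIDTH JOINER

console.log(countGraphemes(sMacron), countGraphemes(koreaFlag), countGraphemes(scientist));
```

Each of the three sequences should be reported as a single grapheme cluster, even though their length property values are 2, 4 and 7.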

If you would like to use pcre2grep, you have to do a little work as it is not a standard OSX App. Firstly, you need a package manager. I chose to install the homebrew package manager brew.sh. With homebrew, installation of pcre2 is achieved using the command brew install pcre2 which, when I did it, installed pcre2 version 10.30.

Running the command pcre2test -C reveals that the version of Unicode supported by PCRE2 is version 10. It is therefore impressively up to date, as Unicode 10 was released on 20 June 2017 and PCRE2 10.30 was released on 14 August 2017.

Thursday 14 December 2017

Computer Science Internationalization - Googleʼs Japanese TLDs

Google currently has two Japanese TLDs (Top Level Domains): グーグル meaning Google and みんな meaning everyone. グーグル is currently dormant but みんな now has quite a number of registered domains. I used the site command to search for みんな domains: google.co.uk/search?q=site:.みんな. Within the first five pages I found the following Japanese domains. The domain names are well integrated into these websites. My standard practice is to list on my blog only those sites that have well integrated domain names.
  1. 車査定.みんな
  2. 思春期ニキビ洗顔料.みんな
  3. 母の日.みんな
  4. バイク売る.みんな
  5. 子供ニキビ.みんな
  6. 通信教育.みんな
  7. 車買取.みんな
  8. 二子玉川美容室.みんな
  9. 柏美容室.みんな
  10. ミュゼ.みんな
  11. 資格.みんな
  12. かくめい.みんな
  13. 渡辺工業.みんな
  14. 洋服お直し.みんな
  15. メンズヘアー.みんな
  16. 営業.みんな
  17. ワーホリデビュー.みんな
  18. おきなわ.みんな
  19. 育毛剤.みんな
  20. ドラマ動画ネタバレあらすじ感想.みんな
  21. 債務整理で借金返済.みんな
  22. あったかマネー.みんな

Thursday 30 November 2017

Computer Science Internationalization - Korean IDNs

Below is a set of Korean IDNs (Internationalised Domain Names). I have listed them because they are good examples of well integrated IDNs. They do not redirect to an ASCII Domain Name. They do not require the prefix www, nor do they redirect to a www prefixed Domain Name. Forcing the www prefix means having mixed human language script Domain Names which I consider to be a bad thing. Single script Domain Names are a good thing. The IDN shows in the address bar and is correctly integrated into the navigation which means the site is using relative addressing which is also a good thing 😀 The TLD (Top Level Domain) 한국 means Korea.
  1. 한국인터넷정보센터.한국
  2. 로봇사이언스.한국
  3. 용늪펜션.한국
  4. 흙사랑부동산.한국
  5. 국립어린이청소년도서관.한국
  6. 학원관리no1.한국
  7. 판촉통.한국
  8. 알코올전문병원협의회.한국
  9. 장사도해운.한국
  10. 신흥프라스틱.한국 & 플라스틱용기.한국
  11. 삼청동맛집.한국
  12. 고려대학교.한국
판촉통.한국 (7th in the above list) is a Korean site selling promotional merchandise. Each product has photos and a twitter sharing button which when clicked results in the site composing a tweet which has a link to the product. Here is one such composed tweet:

오죽막선사군자 (1009044) 152,500원 http://www.87tong.com/product/product_view.php?code=1009044 via @판촉통 1544-8759

The important and relevant issue here is that the Domain Name in the composed tweet is ASCII instead of Korean. Fortunately, this problem can be easily fixed. Before tweeting I replace the ASCII Domain Name www.87tong.com with the Korean Domain Name 판촉통.한국. The edited tweet is now:

오죽막선사군자 (1009044) 152,500원 http://판촉통.한국/product/product_view.php?code=1009044 via @판촉통 1544-8759

...and here is a link to the tweeted tweet twitter.com/andreschappo/status/936633558923440128

판촉통.한국 has 7 pages of fans 부채 and some of them are very beautiful. I recommend you browse through their fans section, which you will find at 판촉통.한국/product/cate_mid.php?cate=11300

고려대학교.한국 is the Korean IDN for Korea University. Goryeo 고려 (the first 2 characters of this Korean IDN) is the name of an old Korean kingdom en.wikipedia.org/wiki/Goryeo

Sunday 26 November 2017

Computer Science Internationalization - identifiers

When programming I sometimes use Unicode identifiers instead of ASCII identifiers. The basic principle of Unicode identifiers in a programming context is that the characters used in a Unicode identifier should belong to some human language script. There are of course exceptions, such as Apple's Swift programming language which allows use of Emoji in program identifiers.

Egyptian Hieroglyphs have the Unicode general category Lo (Letter, other), so each one counts as a letter of the Egyptian Hieroglyphs script. I have just tested using an Egyptian Hieroglyph (codepoints.net/U+13001) as a variable name and it works fine.
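The hieroglyph test can be sketched in JavaScript. The text above does not say which language was used for the test, so JavaScript is an assumption here, and the helper name useAsVariableName is hypothetical. The sketch checks the general category and the ID_Start property with Unicode property escapes, then declares a variable whose name is the hieroglyph.

```javascript
// U+13001 is built from its codepoint so that no hieroglyph font is
// needed to read this sketch.
const hieroglyph = String.fromCodePoint(0x13001);

// General category Lo (Letter, other) and the ID_Start property.
const isLetterOther = /^\p{Lo}$/u.test(hieroglyph);
const canStartIdentifier = /^\p{ID_Start}$/u.test(hieroglyph);

// Declare and read back a variable whose name is the given character.
// (Hypothetical helper; it throws a SyntaxError if the name is not a valid identifier.)
function useAsVariableName(name) {
  return new Function(`const ${name} = 42; return ${name};`)();
}

console.log(isLetterOther, canStartIdentifier, useAsVariableName(hieroglyph));
```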

I recently decided to identify my html code files with my adopted Chinese name 小山. My standard working practice is to now have the start html tag of my code files as <html id="小山">. It is so cool to be able to write document.getElementById("小山").

Many programming languages now support Unicode identifiers. How good is the support and how far can one go? I chose one of my html code files in which, apart from <html id="小山">, all the identifiers are ASCII identifiers: ASCII Javascript variable names, ASCII Javascript function names and ASCII CSS class and id names. My aim was to replace all these ASCII identifiers with Unicode identifiers. Furthermore, I chose to have my Unicode identifiers written in CJK (Chinese, Japanese and Korean) scripts.

The end result was an html code file with all ASCII identifiers replaced by Unicode identifiers and it all works just fine. I was expecting it to work fine but this is the first time I have had an all Unicode identifiers code file so best to check that it does work fine.

My Unicode identifiers code file is a bit too long to include in this blog article. Instead, I list some of the original ASCII identifiers and their Unicode replacements.

Firstly CSS: I used CJK identifiers for class and id names. Below I show before (ASCII) and after (Unicode). I think you can guess that ✅️ means they worked just fine.
  • .keys .keys:hover & class="keys" ➽ .ボタン .ボタン:hover & class="ボタン" ✅️
  • .emphasise class="emphasise" ➽ .エンファサイズ class="エンファサイズ" ✅️
  • #footy id="footy" ➽ #바닥글 id="바닥글" ✅️
  • #earth id="earth" getElementById("earth") ➽ #地球 id="地球" getElementById("地球") ✅️

I changed all function names to Unicode CJK names. Here are 2 examples of my changes.
  • moveMoon() onclick="moveMoon()" ➽ 달을움직이다() onclick="달을움직이다()" ✅️
  • stopMoon() onclick="stopMoon()" ➽ 달을멈추다() onclick="달을멈추다()" ✅️

Finally, I changed all variable names to Unicode CJK names. Here are 2 examples.
  • increment ➽ 增量 ✅️
  • raceTrack1Width ➽ 跑道一宽度 ✅️

...and here is one of my JavaScript functions, with an embedded function
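function 달을움직이다(){
  // 月亮: the moon element, 位置左月亮: its left offset in px, 月亮增量: step per tick
  var 月亮=document.getElementById("まんげつ"),
      位置左月亮=0,
      月亮增量=增量;
  // Do nothing if the moon is already moving
  if(月亮在移动)return;
  // Call 달케이크 every 月亮时间 milliseconds and keep the timer id in 身份月亮
  身份月亮=setInterval(달케이크,月亮时间);
  月亮在移动=true;
  function 달케이크(){
    // Reverse direction at either edge of the track
    if(位置左月亮>跑道二宽度-50||位置左月亮<0)月亮增量*=(-1);
    位置左月亮+=月亮增量;
    月亮.style.left=位置左月亮+"px";
  }
}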

My Unicodified html code file was successfully validated by validator.w3.org

Here is the Unicode Consortium's take on Unicode identifiers:— Unicode Standard Annex #31 - Unicode Identifier and Pattern Syntax unicode.org/reports/tr31/

Monday 20 November 2017

Computer Science Internationalization - Teaching Excellence for Student Success

The Higher Education Academy have initiated a debate on Teaching Excellence for Student Success and are inviting comments www.heacademy.ac.uk/individuals/strategic-priorities/teaching-excellence-for-student-success. Below are my comments which I sent them a couple of weeks ago.

Computer Science departments should be teaching students how to build software for the world ie they should be internationalising their curricula. I am a long time practitioner of internationalised Computer Science teaching. I do though appear to be a solitary voice. Industry, on the other hand, is building software for the world and needs Computer Scientists with Global Skills.

Over the years, the Global Skills I have taught students include:

① Internationalisation and Localisation of software
② Character sets
③ Unicode and Unicode encodings
④ Internationalising websites and building Adaptive Internationalised websites
⑤ Usage of language tags
⑥ Fonts - glyph variants and relationship to Unicode
⑦ Keyboard Mappings and Input Methods
⑧ Internationalised Domain Names & Internationalised Email Addresses
⑨ Unicode Regular Expressions
⑩ Characteristics of English, Chinese, Japanese, Korean, Russian and Arabic languages/scripts

One misconception is that one needs to know multiple (human) languages in order to build software for the world. Not necessary. But one does need to have an understanding of the characteristics of (human) languages/scripts, for example:

• Chinese and Japanese do not have spaces separating words
• Arabic is written right to left
• Korean Hangeul letters, Jamo, are joined into syllabic blocks
• ...etc...

If one wants to internationalise Computer Science Curricula the obvious thing to do is to teach students Global Skills, as above.

Let me give an example. If you look at speakerdeck.com/andre_schappo/unicode-regular-expressions you will see that up to and including slide 10, I am using ASCII regex patterns and strings. If you search the internet you will find thousands of regex examples using ASCII only. Now go to slide 11 and suddenly the regex world changes dramatically. In slide 11, I have Chinese and Emoji text. Fast forward to slide 33 and you will see Egyptian Hieroglyphs.

In addition to making students World Ready, internationalising Computer Science Curricula makes for a richer and more interesting subject.

Time for Computer Science Departments to embrace the world by teaching students Global Skills.

Friday 10 November 2017

Computer Science Internationalization - html tables


Subtitled: html tables with Chinese characteristics


Subtitled: applying Chinese and Korean font metrics to html tables


I think html tables look much better when the table cells have a square aspect ratio ie a perfect square rather than rectangular.

There is actually a very easy way of making your html table cells square. Firstly one needs to have some understanding of the characteristics of human language scripts. One characteristic of the Chinese language script is that each character 汉字 occupies a square. We can use this characteristic when coding html tables.

I wrote some html code for a Sudoku board. In each cell I have an input. The relevant code, for this article, is:

<td class='bordy'><input type='text' class='bigcol' size='1'></td>

Let's look at relevant CSS in class bigcol, specifically the CSS property font-family —

font-family:Courier,monospace; and font-family:monospace; give a rectangular aspect ratio, the height being greater than the width. Not good.

Let's now use a Chinese font: font-family:"Hannotate SC",monospace;.  With Firefox, we now have a really good looking table with square aspect ratio table cells. Chrome, on the other hand, produces vertical rectangular aspect ratio table cells.

Using the template font-family: font name ,monospace;, let's try some more Chinese fonts.

Firefox + font PingFang SC produces a perfect square. Chrome + PingFang SC gives a vertical rectangle.

Firefox + Yuanti SC 圆体-简 produces a perfect square. Chrome + Yuanti SC 圆体-简 gives a vertical rectangle.

Firefox + Baoli SC 报隶-简 produces a perfect square. Chrome + Baoli SC 报隶-简 gives a vertical rectangle.

Firefox + Lantinghei SC 兰亭黑-简 produces a perfect square. Chrome + Lantinghei SC 兰亭黑-简 gives a vertical rectangle.

Using font Heiti SC, Chrome gives much better results than Firefox. Firefox is way out. Chrome + Heiti SC produces a near perfect square. Firefox + Heiti SC gives a horizontal rectangle.

My expectation is that all browsers, when Chinese fonts are specified, should give perfect squares for the table cells as the squared Chinese character is a fundamental characteristic of the Chinese language. From my experimentation, one can see that there are differences between browsers and a perfect square is not always produced.

One of the characteristics of the Korean language script Hangul (also romanised as Hangeul) 한글 is that the individual letters Jamo are formed into squared syllable blocks. Letʼs explore this characteristic with the Korean font  Nanum Myeongjo 나눔 명조.  Firefox + Nanum Myeongjo  produces a near perfect square. Chrome produces a vertical rectangle.

Using font "Noto Sans Korean"  results in really awful aspect ratios in both browsers. Chrome  + "Noto Sans Korean" produces vertical rectangles. Firefox + "Noto Sans Korean" gives horizontal rectangles.

I reason that there are dependencies between browsers and font metrics. In the "Noto Sans Korean" case, it seems that the 2 browsers have a 90 degree difference in interpretation of this fontʼs metrics.

My favourite combinations, so far, are: Firefox + Hannotate SC 手札体-简, Firefox + PingFang SC 苹方-简, Firefox + Yuanti SC 圆体-简, Firefox + Baoli SC 报隶-简, Firefox + Lantinghei SC 兰亭黑-简, Firefox + Nanum Myeongjo 나눔 명조 and Chrome + Heiti SC 黑体-简.

〖 thoughts — Should look at more fonts. Are there any other human language scripts that are squared?〗

Here is what my Sudoku board looks like with Firefox + Hannotate SC font. This font contains glyphs for English as well as Chinese. I do like the Hannotate SC font style used for English as well as the Hannotate SC font style used for Chinese. I had a bit of fun with the Emoji as I selected them by Chinese name. Actually, an Emoji Sudoku might be popular. Instead of the numbers 1 thru 9, an Emoji Sudoku would have 9 different Emoji.


Firefox + Hannotate SC
Environment: OSX Sierra 10.12.6, Firefox 56.0.2, Chrome 62.0.3202.89

Friday 3 November 2017

Computer Science Internationalization - Unicode Emoji

A commonly occurring problem is that of databases and/or associated code not being able to handle Unicode or only able to handle part of Unicode. I will be dealing with the case of partial support for Unicode in this article.

Letʼs examine MySQL. Prior to version 5.5, MySQL could only handle a part of Unicode, specifically the BMP (Basic Multilingual Plane). The reason being that its Unicode UTF8 encoding is a 3 byte encoding. In order to access the whole of Unicode, a 4 byte UTF8 encoding is required. MySQL version 5.5 and greater have a 4 byte UTF8 encoding called utf8mb4 and thus can address and store the whole of Unicode.

Unicode has 17 planes, each of which has 10000 (hex) codepoints, ie 65,536. The 2 most commonly used planes are: Plane 0, the BMP (Basic Multilingual Plane) and Plane 1, the SMP (Supplementary Multilingual Plane). The BMP codepoints range is 0➜FFFF and the SMP range is 10000➜1FFFF. A 3 byte UTF8 encoding can only address the BMP codepoints range 0➜FFFF. Therefore MySQL versions < 5.5 cannot handle SMP characters. More seriously, with MySQL versions < 5.5, if any SMP characters are used then on storage in a database the first encountered SMP character and all subsequent characters are discarded. The discarded characters cannot be recovered. There are still many systems with this problem. Even those systems that have been upgraded to MySQL 5.5 or greater can still have the problem because of associated code and/or tables not yet updated for 4 byte UTF8 encoding.
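Whether a piece of text would trip up a 3 byte UTF8 column can be checked before it is stored. The following is a minimal JavaScript sketch of such a check. The function names hasSupplementaryCharacters and utf8ByteLength are hypothetical, and the sketch says nothing about how any particular database driver behaves.

```javascript
// True when the string contains any codepoint outside the BMP (above U+FFFF).
function hasSupplementaryCharacters(text) {
  // Iterating a string with for...of walks codepoints, not UTF-16 units.
  for (const ch of text) {
    if (ch.codePointAt(0) > 0xFFFF) return true;
  }
  return false;
}

// Number of bytes UTF-8 needs for a single codepoint.
function utf8ByteLength(codePoint) {
  if (codePoint <= 0x7F) return 1;
  if (codePoint <= 0x7FF) return 2;
  if (codePoint <= 0xFFFF) return 3; // the whole BMP fits in 3 bytes
  return 4;                          // everything above the BMP needs 4
}

console.log(hasSupplementaryCharacters("\u8BFA\u4E01\u6C49")); // BMP only text
console.log(hasSupplementaryCharacters("\u{1F41F}"));          // an SMP Emoji
console.log(utf8ByteLength(0xFFFF), utf8ByteLength(0x10000));
```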

So, basically, SMP characters break some MySQL setups. I will refer to these breakable MySQL setups as BMP only MySQL. There are solutions and work arounds that can be used to make BMP only MySQL handle the whole of Unicode.

Letʼs first look at Emoji as they are hugely popular. Most Emoji are in the SMP but there are a small number of Emoji, which have been in Unicode for many a year, that are in the BMP. These can be safely used with BMP only MySQL. I have listed those I have found below. There may be more. I have appended to each of these Emoji the single BMP character variation selector FE0F. This variation selector directs a rendering agent to use, if available, an emojified glyph for the Emoji. What these Emoji look like on your device will depend on the fonts your device has and the browser you are using. On my OSX Mac, all these Emoji look really good and they all have emojified glyphs. With my Android phone, only about half of the below Emoji have emojified glyphs.

The SMP contains much more than Emoji. SMP characters include: Cuneiform, Egyptian Hieroglyphs, Byzantine Musical Symbols, Mahjong Tiles, Playing Cards and much more. All these SMP characters will break BMP only MySQL. WRT web technologies we can fix this SMP breaking problem. Rather than encoding these SMP characters as UTF8 we can instead encode them as NCRs (Numeric Character References). NCRs only use ASCII characters and so are BMP only MySQL safe. ASCII is a subset of Unicode and occupies the first 128 codepoints of the BMP. NCRs take the form &#xnumber; where number is the codepoint in hex. The NCR for the SMP character 🀣 is &#x1F023;, the NCR for the SMP character 𓀬 is &#x1302C; and the NCR for the SMP character 🎋 is &#x1F38B;.
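The NCR work around can be sketched in JavaScript as follows. The function name supplementaryToNcr is hypothetical. The sketch replaces every codepoint above U+FFFF with its hexadecimal NCR and leaves BMP text untouched. It assumes the stored text will later be output as HTML, where the NCRs are turned back into characters by the browser.

```javascript
// Replace every codepoint above U+FFFF with a hexadecimal NCR.
// BMP characters (including ASCII) are copied through unchanged.
function supplementaryToNcr(text) {
  let result = "";
  for (const ch of text) {
    const codePoint = ch.codePointAt(0);
    result += codePoint > 0xFFFF
      ? `&#x${codePoint.toString(16).toUpperCase()};`
      : ch;
  }
  return result;
}

console.log(supplementaryToNcr("\u{1F38B} tanabata")); // prints "&#x1F38B; tanabata"
```

One design point to be aware of: the result is only safe for HTML contexts. If the same column is also used for plain text output, the NCRs would be displayed literally.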

‼️ ⁉️ ℹ️ ↔️ ↕️ ↖️ ↗️ ↘️ ↙️ ↩️ ↪️ ⌚️ ⌛️ ⌨️ ⏩️ ⏪️ ⏫️ ⏬️ ⏭️ ⏮️ ⏰️ ⏱️ ⏲️ ⏳️ ⏸️ ⏹️ ⏺️ Ⓜ️ ▶️ ☀️ ☁️ ☂️ ☃️ ☄️ ☎️ ☑️ ☔️ ☕️ ☘️ ☝️ ☠️ ☢️ ☣️ ☦️ ☪️ ☮️ ☯️ ☸️ ☹️ ☺️ ♈️ ♉️ ♊️ ♋️ ♌️ ♍️ ♎️ ♏️ ♐️ ♑️ ♒️ ♓️ ♠️ ♣️ ♥️ ♦️ ♨️ ♻️ ♿️ ⚒️ ⚓️ ⚔️ ⚖️ ⚗️ ⚙️ ⚛️ ⚜️ ⚠️ ⚡️ ⚪️ ⚫️ ⚰️ ⚱️ ⚽️ ⚾️ ⛄️ ⛅️ ⛈️ ⛎️ ⛏️ ⛑️ ⛓️ ⛔️ ⛩️ ⛪️ ⛰️ ⛱️ ⛲️ ⛳️ ⛴️ ⛵️ ⛷️ ⛸️ ⛹️ ⛺️ ⛽️ ✂️ ✅️ ✈️ ✉️ ✊️ ✋️ ✌️ ✍️ ✏️ ✒️ ✝️ ✡️ ✨️ ✳️ ✴️ ❄️ ❇️ ❌️ ❎️ ❓️ ❔️ ❕️ ❗️ ❣️ ❤️ ➡️ ⬅️ ⬆️ ⬇️ ⭐️ ⭕️ 〽️ ㊗️ ㊙️ 

Wednesday 1 November 2017

Computer Science Internationalization - i18n links

In this article I list i18n (Internationalisation) relevant links. This article will evolve as and when I add and remove links.

Apple: OSX and iOS
  1. m10lmac.blogspot.co.uk — Multilingual Mac
Blogs
  1. mathiasbynens.be — Mathias Bynens: Unicode, HTML, CSS, JavaScript and more
Computer Science/IT/ICT Curricula Internationalisation
  1. groups.google.com/forum/#!forum/computer-science-curriculum-internationalization — I am administrator for this forum
Fonts
  1. babelstone.co.uk/Fonts — BabelStone Fonts: Centaurian, Goblin, Han, Khitan, Ogham, Tangut and more
  2. google.com/get/noto — Google Noto Fonts: Arabic, Armenian, Balinese, Buhid, Cherokee, Lao, Mongolian, Myanmar, Sinhala, Thai and much much more
IDNs (Internationalised Domain Names) and EAI (Email Address Internationalisation)
  1. uasg.tech — Universal Acceptance Steering Group - I am a member of their email discussion list
  2. idnforums.com — IDN Forums - a Domainers forum - I am a member of this forum.
  3. Get a free
    1. داده.امارات — Arabic ‫العربية‬ email address or
    2. 电邮.在线 — Chinese 中文 email address or
    3. ডাটামেল্.ভারত — Bengla বাংলা email address or
    4. ડાટામેલ.ભારત — Gujarati ગુજરાતી email address or
    5. डाटामेल.भारत — Hindi हिन्दी email address or
    6. データメール.コム — Japanese 日本語 email address or
    7. 우편.닷컴 — Korean 한국어 email address or
    8. डेटामेल.भारत — Marathi मराठी email address or
    9. ਡਾਟਾਮੇਲ.ਭਾਰਤ — Punjabi ਪੰਜਾਬੀ email address or
    10. දත්තතැපැල.ලංකා — Sinhala සිංහල email address or
    11. இந.இந்தியா — Tamil தமிழ் email address or
    12. డేటామెయిల్.భారత్ — Telugu తెలుగు email address or
    13. ดาต้าเมล.ไทย — Thai ไทย email address or
    14. ڈاٹامیل.بھارت — Urdu ‫اردو‬ email address or
    15. datamail.in — English email address
JavaScript
  1. jsfiddle.net/user/coas/fiddles — my internationalized programming challenges
Regular Expressions (regex)
  1. regular-expressions.info/refunicode.html — Regular Expression Unicode Syntax Reference
  2. speakerdeck.com/andre_schappo/unicode-regular-expressions — one of my presentations
twitter
  1. twitter.com/r12a — Richard Ishida
Unicode
  1. unicode.org — the definitive source for all things Unicode - I am a member of their email discussion list
  2. babelstone.co.uk/Unicode/unicode.html — unicode, the movie
Useful Utilities and Web Apps
  1. ftfy.now.sh — fix mojibake 文字化け
  2. scripts.sil.org/ukelele — an OSX keyboard layout editor
  3. r12a.github.io — Richard Ishida's treasure trove of Web Apps
  4. r12a.github.io/app-encodings — Richard Ishidaʼs Encoding Converter
  5. usefulwebtool.com — online keyboards, character sets and much more
W3
  1. w3.org/International — W3C Internationalization (i18n) Activity
  2. validator.w3.org/i18n-checker/ — i18n checker
  3. developer.mozilla.org/en-US/docs/Web/CSS/list-style-type — CSS list-style-type property

Wednesday 25 October 2017

Computer Science Internationalization - Experimentation

Subtitle: "How I Discovered the Undiscoverable!"

I was writing some demonstrator code for an Introductory JavaScript class. I intended the code to illustrate expected and unexpected behaviour of the length property. Expected behaviour is when the result of the length property is equal to the number of human perceived characters. Unexpected behaviour is when the result of the length property is not equal to the number of human perceived characters.

"诺丁汉".length returns 3 (3 encoding units)
"ノッティンガム".length returns 7 (7 encoding units)
"노팅엄".length returns 3 (3 encoding units)

All good so far. These are answers that anyone would expect. Now letʼs try some Unicode Emoji.

"🐟".length returns 2 (2 encoding units)
"🐕".length returns 2 (2 encoding units)

...and, some non Emoji SMP (Supplementary Multilingual Plane) Unicode characters

"𓀌".length returns 2 (2 encoding units)
"🀤".length returns 2 (2 encoding units)

And now we observe some weirdness. In terms of human perceived characters the answer should, of course, be 1, so for most people this behaviour is unexpected. It is not unexpected for me as I know that the length property counts in UTF-16 encoding units rather than human perceived characters. I have written the number of UTF-16 encoding units in brackets so that you can understand the answer the length property returns.
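The difference between UTF-16 encoding units and codepoints can be shown side by side. This is a minimal sketch and the function name lengthReport is hypothetical. It relies on the fact that spreading a string into an array splits it by codepoint rather than by encoding unit.

```javascript
// Report both counts for a string.
function lengthReport(text) {
  return {
    codeUnits: text.length,       // what the length property reports
    codePoints: [...text].length, // spreading a string walks codepoints
  };
}

console.log(lengthReport("\u8BFA\u4E01\u6C49")); // three BMP characters
console.log(lengthReport("\u{1F41F}"));          // one SMP character
```

For the BMP string both counts are 3. For the SMP character the counts are 2 encoding units and 1 codepoint.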

Before we proceed further I need to give you further information. I can write Chinese on a Computer and Emoji can be selected by Chinese name using OSX Sierra's Simplified Pinyin Input Method. See schappo.blogspot.co.uk/2016/01/emoji-by-name.html

When I want  Emoji I sometimes use OSX's Emoji and Symbol Viewer and sometimes select by Chinese name.

Now we come to the random bit. I typed yu in the Simplified Pinyin Input Method and there were 6 different Emoji to choose from. I chose 🌧️ . I had no reason to type yu nor to choose 🌧️ , I was just experimenting. Now we come back to the length property.

"🌧️".length returns 3 (??????????!) [U+1F327 U+FE0F]
"🌦️".length returns 3 (??????????!) [U+1F326 U+FE0F]

It was most definitely not the answer I was expecting. After some 10 minutes investigation I discovered the reason for this unexpected answer. With these two Emoji the variation selector U+FE0F codepoints.net/U+FE0F is being appended thus giving a count of 3. We now have the answer to the length anomaly. But why do some Emoji have the variation selector appended and not others?
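The kind of investigation described above can be done by listing the codepoints of a string. A minimal sketch, with the hypothetical function name listCodePoints, follows. The Emoji is written with escapes so that the otherwise invisible variation selector can be seen.

```javascript
// List the codepoints of a string in U+XXXX notation.
function listCodePoints(text) {
  return [...text].map(
    (ch) => "U+" + ch.codePointAt(0).toString(16).toUpperCase().padStart(4, "0")
  );
}

// CLOUD WITH RAIN followed by VARIATION SELECTOR-16
const cloudWithRain = "\u{1F327}\uFE0F";

console.log(cloudWithRain.length);          // prints 3
console.log(listCodePoints(cloudWithRain)); // [ 'U+1F327', 'U+FE0F' ]
```

U+1F327 takes 2 UTF-16 encoding units and U+FE0F takes 1, which accounts for the length of 3.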

Peter Edberg gives this excellent explanation.

This is about characters U+1F327,U+1F326

The variation selector FE0F is *not* unnecessary with these. Looking at unicode.org/Public/emoji/5.0/emoji-data.txt those characters do *not* have the Emoji-Presentation property set, and they do have variation sequences defined.

From unicode.org/reports/tr51/#Emoji_Variation_Selector_Notes, such singleton emoji characters “should have emoji presentation selectors on base characters with Emoji_Presentation=No whenever an emoji presentation is desired”

I stated: I see that U+1F321➜1F32C do not have the Emoji_Presentation property set.

Peter Edberg responded: From unicode.org/emoji/charts-5.0/emoji-versions-sources.html you can see that these characters came into Unicode as a result of their being in the Webdings/Wingdings set, where they had a prior history of being non-emoji text characters. That is why they have Emoji_Presentation=No by default.

Letʼs now examine my bold claim "I discovered the Undiscoverable"

In order to make this discovery there is a set of required knowledge, skills and personality traits. These include:
  • A Knowledge of JavaScript
  • A good understanding of Unicode
  • The ability to write Chinese using OSX Sierra's Pinyin Simplified Input Method
  • Knowing that Emoji can be selected by Chinese name using Sierra's Pinyin Simplified Input Method
  • Being aware of the JavaScript length property quirk
  • A desire to experiment and explore
Considering that the World population is less than 8 billion (estimate) I think it (near) impossible that any other person (in Academia, Staff or Student) would at the instant of time I made the discovery meet the requirements necessary to make the same discovery. By instant of time I do mean as perceived by a person 〖 say less than one second. I need to research this!! 〗 because, of course, our thought process is not instant even though we experience it as such.

Window of opportunity for my discovery — I reason that the window of opportunity for the discovery started when 🌧️  was available. It was added to Unicode version 7.0 in 2014. It would probably have been another year before it became available and integrated into Apple's OSX. I made the discovery on Saturday 21st October 2017. twitter.com/andreschappo/status/921722952504238081 Given this reasoning the window of opportunity for this discovery is approximately 3 years.

Consequences: My first thought was that this anomaly would cause problems with Emoji Domains. Using mothereff.in/punycode, 🌧️.ws (with variation selector) gives the punycode address xn--v86c7044b.ws and 🌧.ws (without variation selector) gives the punycode address xn--kh8h.ws So, these are obviously different addresses. When 🌧️.ws is pasted into the Firefox address bar it needs to convert from the Unicode form 🌧️.ws to the punycode form. The punycode form it uses is xn--kh8h.ws, it is therefore evident that Firefox disregards the variation selector on conversion. Computers and Routers use the punycode form, the Unicode form is used for display to humans.

I realise that Emoji Domains are IDNA2008 disallowed, but, I figure they will be around for a good number of years yet to come.

Why was this my first thought? I am a long time practitioner of internationalised Computer Science teaching and IDNs Internationalised Domain Names (Emoji Domains are a controversial subset of IDNs) are an important part of i18n. I am an active member of the UASG discussion email list uasg.tech. I am also an active member of IDN Forums idnforums.com. I have learned much from the Domainers on IDN Forums. Thanks guys 👋

Environment: OSX Sierra version 10.12.6, FireFox version 56.0.2

Monday 10 July 2017

Computer Science Internationalization - Unicode Encoding & Decoding

Several years ago I devised this visual and fun way to teach and practise encoding and decoding Unicode. I used this method in my International Computing class. This method involves use of pencil and eraser.

The codepoints and the UTF-8 are all written in hexadecimal (hex). The binary bits are an intermediate form for the purposes of encoding and decoding.

We start with the following form which is designed for encoding Unicode codepoints to UTF-8 and decoding UTF-8 to Unicode codepoints.
Encoding: We will start with encoding Unicode codepoints to UTF-8.

The first thing we can do is fill in the fixed bits. They are the fixed bits defined by the encoding scheme. I have entered the fixed bits in red to make them distinct from variable bits.
Now we will write one or more Unicode codepoints on the form. These will be the codepoints we will encode into UTF-8. The codepoints should be written in hexadecimal. I will use the codepoints U+0444 and U+597D.

So, how do we determine where the codepoints go on the form? We need to look at the free bits to determine the range of values that can be accommodated.

  • 1 byte row - 7 free variable bits giving a range of 0 ➔ 7F
  • 2 byte row - 11 free variable bits giving a range of 80 ➔ 7FF
  • 3 byte row - 16 free variable bits giving a range of 800 ➔ FFFF
  • 4 byte row - 21 free variable bits giving a range of 10000 ➔ 1FFFFF (the actual maximum value of a codepoint is 10FFFF)
Now we know the ranges we can put U+0444 and U+597D in the correct places of the form.

We have empty boxes into which we write the binary values of the codepoints.
Finally, we take the complete bytes and write them as hexadecimal values to form the UTF-8 encoded forms. U+0444 encoded is D184, U+597D encoded is E5A5BD.
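The pencil and eraser encoding steps above can also be sketched in JavaScript. The function name encodeUtf8 is hypothetical. The fixed bits of each row are the constants 0xC0, 0xE0, 0xF0 and 0x80, and the variable bits are taken from the codepoint six at a time. The sketch does no validation, so it assumes the input is a valid codepoint no greater than 10FFFF.

```javascript
// Encode one codepoint to UTF-8 and return the bytes as a hex string.
function encodeUtf8(codePoint) {
  let bytes;
  if (codePoint <= 0x7F) {
    // 1 byte row: 0xxxxxxx
    bytes = [codePoint];
  } else if (codePoint <= 0x7FF) {
    // 2 byte row: 110xxxxx 10xxxxxx
    bytes = [0xC0 | (codePoint >> 6), 0x80 | (codePoint & 0x3F)];
  } else if (codePoint <= 0xFFFF) {
    // 3 byte row: 1110xxxx 10xxxxxx 10xxxxxx
    bytes = [
      0xE0 | (codePoint >> 12),
      0x80 | ((codePoint >> 6) & 0x3F),
      0x80 | (codePoint & 0x3F),
    ];
  } else {
    // 4 byte row: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    bytes = [
      0xF0 | (codePoint >> 18),
      0x80 | ((codePoint >> 12) & 0x3F),
      0x80 | ((codePoint >> 6) & 0x3F),
      0x80 | (codePoint & 0x3F),
    ];
  }
  return bytes.map((b) => b.toString(16).toUpperCase().padStart(2, "0")).join("");
}

console.log(encodeUtf8(0x0444)); // prints "D184"
console.log(encodeUtf8(0x597D)); // prints "E5A5BD"
```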
Decoding: Now onto decoding from UTF-8 to Unicode codepoints. We will decode the UTF-8 F0AA9FB7 which I have entered onto the form. I have used spaces on the form to make the byte boundaries more obvious.
Complete the bytes by writing the binary variable values.
Extract the variable binary values to form the hex Unicode codepoint U+2A7F7.
Whilst I was at it, I completed a single byte entry. The single byte characters are ASCII characters. ASCII is a subset of Unicode.

It is a Unicode convention, when writing codepoints, to use a minimum of four hex digits. So for codepoints <1000, one should left pad with zeroes. Hence my entries U+0444 and U+0057 rather than U+444 and U+57.
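The decoding steps, together with the four hex digit convention, can be sketched in JavaScript as follows. The function name decodeUtf8Hex is hypothetical. The sketch assumes its input is the hex form of exactly one well formed UTF-8 character and does no validation.

```javascript
// Decode the hex form of one UTF-8 encoded character to U+XXXX notation.
function decodeUtf8Hex(hex) {
  // Split the hex string into bytes, two hex digits at a time.
  const bytes = hex.match(/../g).map((pair) => parseInt(pair, 16));
  const lead = bytes[0];

  // Strip the fixed bits from the lead byte, keeping the variable bits.
  let codePoint;
  if (lead < 0x80) codePoint = lead;             // 0xxxxxxx
  else if (lead < 0xE0) codePoint = lead & 0x1F; // 110xxxxx
  else if (lead < 0xF0) codePoint = lead & 0x0F; // 1110xxxx
  else codePoint = lead & 0x07;                  // 11110xxx

  // Each continuation byte (10xxxxxx) contributes 6 variable bits.
  for (const continuation of bytes.slice(1)) {
    codePoint = (codePoint << 6) | (continuation & 0x3F);
  }

  // Left pad with zeroes to a minimum of four hex digits.
  return "U+" + codePoint.toString(16).toUpperCase().padStart(4, "0");
}

console.log(decodeUtf8Hex("F0AA9FB7")); // prints "U+2A7F7"
console.log(decodeUtf8Hex("57"));       // prints "U+0057"
```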

Sunday 2 July 2017

Computer Science Internationalization - Text Search

So, you have just written some Cool Code which will search for and find occurrences of specified text strings. You have access to Big Data text eg all the text in all public webpages. You will, of course, want to test your Cool Code. Letʼs perform some, seemingly, very simple tests.

Letʼs search for the word 'Scorpion'. Your code works just fine and hence finds all occurrences of the word 'Scorpion'.

Now test with the following two words.

  • Scorpion
  • Scorpion

Your Cool Code works fine as all I have done is applied some CSS styling, thus giving each of the two words differing appearance.

Now test your Cool Code with the following two words.

  • 𝑆𝑐𝑜𝑟𝑝𝑖𝑜𝑛
  • 𝐒𝐜𝐨𝐫𝐩𝐢𝐨𝐧

If you have only programmed for ASCII text then your now not so Cool Code will fail. These two words have differing appearance because they are not made up of the ASCII characters you are familiar with. These words use characters from the Unicode Mathematical Alphanumeric Symbols block, U+1D400-1D7FF.

Should the Math Alphanumeric Symbols Scorpion be treated the same as the ASCII Scorpion wrt the search results of your code? In this context I think "Yes", most definitely. A person reading this blog, for example, will just perceive the word Scorpion whatever characters are used to write the word. The reader may well also visualise the arachnid with a "sting in the tail"😱

What of current working practice?

With twitter, a user has no means of changing text style within a tweet. It has thus become common to use Unicode Math Alphanumeric Symbols to change appearance. I could, for example, use Unicode Math Alphanumeric Symbols to emphasise a word (eg Scorpion) or phrase within a tweet. The meaning of the tweet remains the same.

Google returns the same number of search results whichever of the above forms of Scorpion I use. At time of writing this is "About 144,000,000 results". I deduce Google is treating ASCII Scorpion and Unicode Math Alphanumeric Symbols 𝑆𝑐𝑜𝑟𝑝𝑖𝑜𝑛 & 𝐒𝐜𝐨𝐫𝐩𝐢𝐨𝐧 as equivalent.

Sogou 搜狗 is a Chinese search engine. Using Sogou: ASCII Scorpion returns 93,341 results, Math Alphanumeric Symbols 𝑆𝑐𝑜𝑟𝑝𝑖𝑜𝑛 returns 4,738, Math Alphanumeric Symbols 𝐒𝐜𝐨𝐫𝐩𝐢𝐨𝐧 returns 61. I think it evident that Sogou does not treat my three forms of Scorpion as equivalent.

I side with Google on this.

Here is a taster of the behind-the-scenes technicalities of Unicode. Letʼs take just one of the Unicode Math Alphanumeric Symbols I have used, 𝐒 MATHEMATICAL BOLD CAPITAL S U+1D412. If you visit codepoints.net/U+1D412 you will see a wealth of information about this character. Of relevance to this blog is the Decomposition Mapping, which is to the, oh so familiar, ASCII capital S (U+0053). This is a compatibility decomposition, and such mappings can be used to compute equivalent strings which can then be used for search, thus providing all relevant results.
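One way to apply the Decomposition Mapping in practice is Unicode compatibility normalization (NFKC). The sketch below is my own illustration using the Python standard library; fold_for_search is a hypothetical helper name, and I am not claiming this is how Google or any other search engine actually does it.

```python
import unicodedata

def fold_for_search(text: str) -> str:
    """Reduce text to a comparison key: NFKC normalization, then case folding."""
    return unicodedata.normalize("NFKC", text).casefold()

variants = ["Scorpion", "𝑆𝑐𝑜𝑟𝑝𝑖𝑜𝑛", "𝐒𝐜𝐨𝐫𝐩𝐢𝐨𝐧"]

# All three spellings should collapse to one and the same key.
print({fold_for_search(word) for word in variants})

# The raw Decomposition Mapping for MATHEMATICAL BOLD CAPITAL S.
print(unicodedata.decomposition("𝐒"))
```

Indexing and querying with the same folded key means any of the three forms finds all of them. NFKC is deliberately lossy, so the original text should still be stored for display.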

The moral of this "Sting in the Tale" is: If you do not already know it, you must learn Unicode, it is essential.

Friday 28 April 2017

Computer Science Internationalization - Hieroglyphs in Domain Names

I have been aware for a long time that domains such as .com support many human language scripts. Verisign's .com includes support for Hiragana, Gurmukhi, Han, Tibetan, Sinhala, Devanagari, Hangul and many more.

But what of Verisign's .com equivalents .コム (Japanese) and .닷컴 (Korean)? Both of these support a multitude of human language scripts. The supported scripts for many, but not all, Domains are listed in the IANA Repository of IDN Practices iana.org/domains/idn-tables.

Whilst browsing this repository, I discovered there are sixteen domains, all belonging to Verisign, which support Egyptian Hieroglyphs which I think is totally cool! Verisign's .com, .コム and .닷컴 all support Egyptian Hieroglyphs. This means one can register domain names such as:-

  1. 𓇋𓈖𓏏𓂝𓂋𓈖𓄿𓏏𓇋𓍯𓈖𓄿𓃭.com
  2. 𓇋𓈖𓏏𓂝𓂋𓈖𓄿𓏏𓇋𓍯𓈖𓄿𓃭.コム
  3. 𓇋𓈖𓏏𓂝𓂋𓈖𓄿𓏏𓇋𓍯𓈖𓄿𓃭.닷컴

It is possible you do not have an Egyptian Hieroglyph font on your device so here are the domain names in image format.

Google provide a free Egyptian Hieroglyph font which you can download from google.com/get/noto/

Does the Egyptian Hieroglyph string I have used above mean anything? It is actually a transliteration of the English word international. I used ngm.nationalgeographic.com/ngm/egypt/translator.html for the transliteration. The hieroglyphs translator presents the Egyptian Hieroglyphs as images. So, no simple copy and paste of Egyptian Hieroglyph text. I had to match with the appropriate Unicode characters by visual inspection. I cannot guarantee I made all the correct matches but I think I have them correct.

Here are some registered and live Egyptian Hieroglyph Domain Names egyptianhieroglyphic.com/egypt/egyptian-hieroglyphics/

Friday 31 March 2017

Computer Science Internationalization - Adaptive URL

A URL can consist of a Domain Name and a pathname. In the examples below x.y.z represents the Domain Name, the remainder being the pathname. My experience of the internet is that the pathname is usually written in English or more accurately ASCII. The below ASCII pathname represents a multi-page website in the form of a journey from home to a hotel in Korea.

x.y.z/home/bus/airplane/korea/taxi/hotel

Websites, such as Google, adapt the language of their text content according to the browser preferred display language (BL). This browser preferred language can be set by the user. Letʼs go one step further than Google and adapt the language of the URL pathname according to the BL. Here is the ASCII pathname rewritten into Chinese, Japanese and Korean.

x.y.z/家/公共汽车/飞机/韩国/出租车/饭店

x.y.z/ホーム/バス/飛行機/韓国/タクシー/ホテル

x.y.z/홈/버스/비행기/한국/택시/호텔

So, how do we implement these language adaptive URL pathnames? Firstly, we need to programmatically determine the BL. One way of achieving this is to examine the Accept-Language HTTP header sent from the browser to the server. This will contain one or more language tags. If there is more than one language tag, each can carry a q-value weight that gives its priority, and browsers normally list the tags in priority order. Language tags can take many forms. They include: zh, zh-CN and cmn for Mandarin Chinese; ja for Japanese and ko for Korean. Now that we can determine the BL we can select the appropriate URL pathname, thus internationalizing our website with a language adaptive URL pathname.
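As a rough sketch of that selection step, the Python below parses an Accept-Language value and picks a pathname. The function names and the PATHNAMES table are hypothetical, invented for this illustration, and the parsing is simplified (no wildcard handling, only the primary language subtag is matched).

```python
def parse_accept_language(header: str) -> list[str]:
    """Return the language tags of an Accept-Language value, most preferred first."""
    weighted = []
    for position, item in enumerate(header.split(",")):
        parts = item.strip().split(";")
        tag = parts[0].strip()
        if not tag:
            continue
        quality = 1.0  # a tag without a q-value has weight 1
        for param in parts[1:]:
            name, _, value = param.strip().partition("=")
            if name.strip().lower() == "q":
                try:
                    quality = float(value)
                except ValueError:
                    quality = 0.0
        if quality > 0:
            weighted.append((-quality, position, tag))
    # Highest weight first; ties keep the order the browser sent.
    return [tag for _, _, tag in sorted(weighted)]

# Hypothetical lookup table built from the example pathnames in this post.
PATHNAMES = {
    "zh": "/家/公共汽车/飞机/韩国/出租车/饭店",
    "ja": "/ホーム/バス/飛行機/韓国/タクシー/ホテル",
    "ko": "/홈/버스/비행기/한국/택시/호텔",
    "en": "/home/bus/airplane/korea/taxi/hotel",
}

def choose_pathname(header: str, default: str = "en") -> str:
    """Pick the pathname for the first supported language in the header."""
    for tag in parse_accept_language(header):
        primary = tag.split("-")[0].lower()
        if primary == "cmn":  # treat Mandarin as Chinese
            primary = "zh"
        if primary in PATHNAMES:
            return PATHNAMES[primary]
    return PATHNAMES[default]

print(choose_pathname("ko-KR,ko;q=0.9,en;q=0.8"))
```

A production site would more likely lean on its web framework or server configuration for this negotiation; the point here is only to show the priority logic.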

On a Linux machine, each component of the pathname will be a directory. In my schema I am assuming an index.html or index.php, per directory. A requirement of this schema is that we do not want a directory hierarchy for each language, nor do we want an index.html or index.php for each language.

My native language is English so I will make my master pathname directory names English ie home, bus, airplane, korea, taxi and hotel. I will make the Chinese, Japanese and Korean directory names as aliases to the English named master directories. This can be easily achieved on Linux with the ln -s command, where ln means link and the -s option means create symbolic link, as opposed to a hard link.

ln -s home 家
ln -s home ホーム
ln -s home 홈

ln -s hotel 饭店
ln -s hotel ホテル
ln -s hotel 호텔
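
If there are many directories and languages, typing each ln -s by hand becomes tedious. Here is a small Python sketch that creates the same kind of symbolic links from a table. The ALIASES table and the create_aliases function are my own illustrative names, and the demo runs in a temporary directory rather than a real web root.

```python
import os
import tempfile

# Master directory name -> alias names (a subset of the examples above).
ALIASES = {
    "home": ["家", "ホーム", "홈"],
    "hotel": ["饭店", "ホテル", "호텔"],
}

def create_aliases(root: str, aliases: dict[str, list[str]]) -> None:
    """Create each master directory under root, plus a symbolic link per alias."""
    for master, names in aliases.items():
        os.makedirs(os.path.join(root, master), exist_ok=True)
        for name in names:
            link = os.path.join(root, name)
            if not os.path.lexists(link):
                # Relative target, equivalent to running `ln -s home 家` inside root.
                os.symlink(master, link, target_is_directory=True)

with tempfile.TemporaryDirectory() as root:
    create_aliases(root, ALIASES)
    print(sorted(os.listdir(root)))
    print(os.readlink(os.path.join(root, "호텔")))
```

This assumes a Linux or macOS filesystem with UTF-8 filenames, as in the rest of this post.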

What if your native language is not English? In that case, create the master pathname directory names in your native language. If your native language is Korean then the master directory names will be 홈, 버스, 비행기, 한국, 택시 and 호텔 and your links will be:

ln -s 홈 home
ln -s 홈 家
ln -s 홈 ホーム

ln -s 호텔 hotel
ln -s 호텔 饭店
ln -s 호텔 ホテル

Emoji are hugely popular so letʼs construct a totally cool Emoji pathname.

x.y.z/🏡/🚌/🛩/🇰🇷/🚕/🏨

ln -s home 🏡
ln -s bus 🚌
ln -s airplane 🛩
ln -s korea 🇰🇷
ln -s taxi 🚕
ln -s hotel 🏨

I have never encountered an Emoji URL pathname on a website and so implementing such a pathname on your website would be both totally cool and unique. You could also use an Emoji pathname for those languages your website does not support. My schema only supports Chinese, English, Japanese and Korean. If the BL was an unsupported language, such as Arabic, then the Emoji pathname could be displayed in the browser address bar instead of, for example, defaulting to English.
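
One technical point worth knowing: although the browser address bar shows the Emoji or CJK characters, the pathname travels over HTTP as percent-encoded UTF-8 bytes. The Python sketch below shows that conversion with the standard library; to_wire_path is a name I made up for this illustration.

```python
from urllib.parse import quote, unquote

def to_wire_path(path: str) -> str:
    """Percent-encode a Unicode pathname as UTF-8, leaving '/' untouched."""
    return quote(path, safe="/")

emoji_path = "/🏡/🚌/🛩/🇰🇷/🚕/🏨"
wire = to_wire_path(emoji_path)

print(wire)                         # pure ASCII, as sent in the HTTP request line
print(unquote(wire) == emoji_path)  # decoding restores the original pathname
```

A web server decodes the request back to the UTF-8 names on disk, which is why the symbolic links above can be found.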

I have used x.y.z to represent the Domain Name, the implication being it is ASCII. We can complete the language adaptive equation by having Domain Names in supported BL languages. Thus my completed equation schema would have Chinese, Japanese and Korean Domain Names in addition to an ASCII Domain Name.

Friday 17 March 2017

Computer Science Internationalization - EAI

As I stated in schappo.blogspot.co.uk/2017/01/chinese-email-address.html both DataMail and Google Mail support Email Address Internationalization (EAI). DataMail provides a complete EAI service which includes both support for and creation of internationalized email addresses. Google Mail provides a partial EAI service in that it supports EAI but does not yet provide for creation of internationalized email accounts with internationalized email addresses. Thus organisations using Google Mail have an advantage over those organisations having an ASCII-only email service, and have a head start in provision of a complete EAI service.

Given the Domain name of an organisation, the Unix host command can be used to determine the mail service provider. Here are some of the organisations using Google Mail:

苹果电脑 ~: host spotify.com
spotify.com has address 194.132.198.198
spotify.com has address 194.132.197.198
spotify.com has address 194.132.198.149
spotify.com mail is handled by 10 ASPMX3.GOOGLEMAIL.com.
spotify.com mail is handled by 1 ASPMX.L.GOOGLE.com.
spotify.com mail is handled by 10 ASPMX2.GOOGLEMAIL.com.
spotify.com mail is handled by 5 ALT2.ASPMX.L.GOOGLE.com.
spotify.com mail is handled by 10 ASPMX5.GOOGLEMAIL.com.
spotify.com mail is handled by 5 ALT1.ASPMX.L.GOOGLE.com.
spotify.com mail is handled by 10 ASPMX4.GOOGLEMAIL.com.
苹果电脑 ~: host twitter.com
twitter.com has address 104.244.42.129
twitter.com has address 104.244.42.1
twitter.com mail is handled by 30 aspmx3.googlemail.com.
twitter.com mail is handled by 10 aspmx.l.google.com.
twitter.com mail is handled by 20 alt1.aspmx.l.google.com.
twitter.com mail is handled by 30 aspmx2.googlemail.com.
twitter.com mail is handled by 20 alt2.aspmx.l.google.com.
苹果电脑 ~: host mixi.jp # ミクシィ
mixi.jp has address 52.198.59.66
mixi.jp has address 54.92.71.226
mixi.jp has address 52.198.89.90
mixi.jp mail is handled by 30 aspmx2.googlemail.com.
mixi.jp mail is handled by 10 aspmx.l.google.com.
mixi.jp mail is handled by 20 alt2.aspmx.l.google.com.
mixi.jp mail is handled by 20 alt1.aspmx.l.google.com.
mixi.jp mail is handled by 30 aspmx3.googlemail.com.
苹果电脑 ~: host bristol.ac.uk # University of Bristol
bristol.ac.uk has address 137.222.0.38
bristol.ac.uk mail is handled by 5 ALT1.ASPMX.L.GOOGLE.COM.
bristol.ac.uk mail is handled by 10 ASPMX2.GOOGLEMAIL.COM.
bristol.ac.uk mail is handled by 1 ASPMX.L.GOOGLE.COM.
bristol.ac.uk mail is handled by 10 ASPMX3.GOOGLEMAIL.COM.
bristol.ac.uk mail is handled by 5 ALT2.ASPMX.L.GOOGLE.COM.
苹果电脑 ~: host bathspa.ac.uk # Bath Spa University
bathspa.ac.uk has address 194.83.160.0
bathspa.ac.uk has address 162.13.24.154
bathspa.ac.uk has address 72.47.217.0
bathspa.ac.uk mail is handled by 10 ALT4.ASPMX.L.GOOGLE.COM.
bathspa.ac.uk mail is handled by 5 ALT2.ASPMX.L.GOOGLE.COM.
bathspa.ac.uk mail is handled by 1 ASPMX.L.GOOGLE.COM.
bathspa.ac.uk mail is handled by 5 ALT1.ASPMX.L.GOOGLE.COM.
bathspa.ac.uk mail is handled by 10 ALT3.ASPMX.L.GOOGLE.COM.
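
The check I did by eye on the listings above can be sketched in Python. The code below does not run the host command itself; it only parses text in the format shown above. The function names and the list of Google suffixes are my own assumptions for this illustration.

```python
def mail_exchangers(host_output: str) -> list[tuple[int, str]]:
    """Extract (preference, mail server) pairs from `host` output, best first."""
    marker = " mail is handled by "
    records = []
    for line in host_output.splitlines():
        if marker not in line:
            continue
        fields = line.split(marker, 1)[1].split()
        if len(fields) >= 2 and fields[0].isdigit():
            # Drop the trailing dot and ignore case, since DNS names are case-insensitive.
            records.append((int(fields[0]), fields[1].rstrip(".").lower()))
    return sorted(records)

def uses_google_mail(host_output: str) -> bool:
    """True if any mail exchanger sits under a Google mail domain (assumed suffixes)."""
    suffixes = (".google.com", ".googlemail.com")
    return any(server.endswith(suffixes) for _, server in mail_exchangers(host_output))

sample = """\
bristol.ac.uk has address 137.222.0.38
bristol.ac.uk mail is handled by 5 ALT1.ASPMX.L.GOOGLE.COM.
bristol.ac.uk mail is handled by 10 ASPMX2.GOOGLEMAIL.COM.
bristol.ac.uk mail is handled by 1 ASPMX.L.GOOGLE.COM.
"""

print(mail_exchangers(sample))
print(uses_google_mail(sample))
```

The lowest preference number is the server tried first, which is why the records are sorted.
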
Providing a full EAI service involves going beyond ASCII. It entails supporting Unicode email addresses, such as my Chinese email address 小山@电邮.在线

Tuesday 31 January 2017

Computer Science Internationalization - Unicode Terminal Session

Below is an OSX bash shell command line terminal session. It is a real, working terminal session using basic unix commands. It does, though, look significantly different from a standard terminal session. If you know basic unix commands such as ls and cd, you should/may be able to work out what is happening.

苹果电脑 ~: 妈 我的目录
苹果电脑 ~: 茶 我的目录
苹果电脑 我的目录: 丽
苹果电脑 我的目录: 头 文档一 文档二 文档三
苹果电脑 我的目录: 丽
文档一  文档三  文档二
苹果电脑 我的目录: 词 > 文档四
一
二
三
四
五
六
苹果电脑 我的目录: 词 文档四
一
二
三
四
五
六
苹果电脑 我的目录: 丽
文档一  文档三  文档二  文档四
苹果电脑 我的目录: ⇉ 文档四 文档五
苹果电脑 我的目录: 丽
文档一  文档三  文档二  文档五  文档四
苹果电脑 我的目录: → 文档一 文档六
苹果电脑 我的目录: 丽
文档三  文档二  文档五  文档六  文档四
苹果电脑 我的目录: 

So, what is happening!?

Firstly I am using Unicode characters. If you search the internet you will find many examples of terminal sessions but they will invariably be using ASCII characters only. In my above terminal session I am using Unicode characters, mostly Chinese/Japanese and two arrow symbol characters.

Where are the commands such as ls and cd? I have mapped a set of commands to Unicode characters using the alias command eg alias 丽='ls'

I have changed the command line prompt.

If you understand basic bash commands, I believe I have now given you sufficient information in order for you to work out what is happening in the terminal session. Knowing Chinese or Japanese gives a slight advantage but it is not essential to understanding this terminal session. The Chinese/Japanese characters I chose for the command mappings are somewhat random so it will not help you to google translate them.

I actually devised these command mappings and the terminal session several years ago. Today, I decided it was time to put it onto my blog. My main purpose was and still is, to encourage students to think beyond ASCII. I believe it has impact because it is so unexpected when one first sees this terminal session.

There can be many different permutations on the session using different human language scripts and Unicode symbols. It makes for an interesting and unusual exercise for students studying unix. There is absolutely no reason why one should not, for example, use emoji for the command mappings.

Monday 9 January 2017

Chinese Email Address

The latest and hottest news is that I now have a Chinese email address➜ 小山@电邮.在线 😄

  1. 小山 is my adopted Chinese name
  2. 电邮 means email
  3. 在线 means online

I acquired my free Chinese email address from DataMail which supports email addresses in twelve languages: العَرَبِيَّة‎‎ Arabic, বাংলা Bengali, 中文 Chinese, English, ગુજરાતી Gujarati, हिन्दी Hindi, मराठी Marathi, ਪੰਜਾਬੀ Punjabi, ру́сский Russian, தமிழ் Tamil, తెలుగు Telugu, اُردُو‎ Urdu.

Additionally, DataMail has an impressive family of IDNs (Internationalized Domain Names) with each language having its own IDN.
  1. Arabic داده.امارات
  2. Bengali ডাটামেল্.ভারত
  3. Chinese 电邮.在线
  4. English datamail.in
  5. Gujarati ડાટામેલ.ભારત
  6. Hindi डाटामेल.भारत
  7. Marathi डेटामेल.भारत
  8. Punjabi ਡਾਟਾਮੇਲ.ਭਾਰਤ
  9. Russian почта.рус
  10. Tamil இந.இந்தியா
  11. Telugu డేటామెయిల్.భారత్
  12. Urdu ڈاٹامیل.بھارت

If you would like your own DataMail email address in one of the above languages then just click one of the above links. The website directs you to download an Android or iOS App. One uses the App to actually register a DataMail email address.

The main points in the registration process using the DataMail App are:

  1. The crucial part of this process is that firstly you need to select the language for the email address you are about to register. Subsequent instructions will be in the language you have selected. So, I chose Chinese in order to register 小山@电邮.在线.
  2. Validation of your phone number - the DataMail App will, with your approval, send an SMS text to DataMail in India to confirm your phone number. If the validation process fails, it could be that your phone contract does not cover the sending of international SMS text.
  3. Choosing the local-part which in my case is 小山. The Domain Name part is fixed and is provided by DataMail. There is a Domain Name per language, as above.

I have successfully exchanged emails between Gmail ASCII email addresses and my DataMail Chinese email address. Gmail supports Internationalized Email Addresses (IEAs) but one cannot create IEAs in Gmail. DataMail, to my knowledge, is currently the only production email system that both supports and allows creation of IEAs. I used Gmail with a browser when testing exchange of IEAs. If you are accessing your Gmail using IMAP or POP then IEAs may or may not work. It all depends on whether or not your client software supports IEAs.

I have sent email from DataMail using my Chinese email address 小山@电邮.在线 to several Gmail users. My current experience is that for some of the Gmail users, my email goes to their spam folder instead of their primary inbox. If this is happening to you or your recipients, please mark the Gmail email as 'not spam' to help prevent reoccurrences of this problem.

In addition to the App, DataMail can be used with a web browser ➜ 邮.电邮.在线

Currently, the few systems supporting internationalized email addresses are DataMail, Gmail and Outlook 2016. So, what to do when exchanging email with a system that only supports ASCII email addresses? DataMail have thought about this issue and offer email aliasing. One can create ASCII email aliases and use them to exchange email with systems that do not yet support international email addresses. My DataMail mailbox has the Chinese email address 小山@电邮.在线 and ASCII @datamail.in addresses thus allowing me to communicate with any email system.

DataMail is a good example of an AI (Adaptive Internationalized) website. It adapts to the language of the web address used for access. The most obvious adaptation is the text content is in the language of the web address. Secondly, the appropriate language button is highlighted. Finally, and perhaps less obviously, in the top right corner there is a DataMail support email address which is in the current web address language. In the case of 电邮.在线 the DataMail support email address is 支持@电邮.在线

Letʼs examine some of the technicalities of EAI (Email Address Internationalization). The structure of an email address is local-part@Domain Name where the Domain Name identifies a mail server and local-part identifies a mailbox on said mail server. The email addresses you will be most familiar with are ASCII local-part@ASCII Domain Name. IEAs, on the other hand, are of the form Unicode local-part@Unicode Domain Name. In order to make this form work we need to encode both parts with one encoding for the Unicode local-part and a different encoding for the Unicode Domain Name. The encoded email address is UTF-8@punycode. Users see the Unicode email address and Computers work with the encoded address.
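
To make the two encodings concrete, here is a simplified Python sketch. The function names are hypothetical, and the domain conversion uses the raw punycode codec with the xn-- prefix added by hand, so it skips the normalization and validation steps that a real IDNA implementation performs.

```python
def encode_domain(domain: str) -> str:
    """Convert each non-ASCII label to its xn-- Punycode form (no IDNA validation)."""
    labels = []
    for label in domain.split("."):
        if label.isascii():
            labels.append(label)
        else:
            labels.append("xn--" + label.encode("punycode").decode("ascii"))
    return ".".join(labels)

def decode_domain(domain: str) -> str:
    """Reverse encode_domain so the Unicode form can be shown to the user."""
    labels = []
    for label in domain.split("."):
        if label.startswith("xn--"):
            labels.append(label[4:].encode("ascii").decode("punycode"))
        else:
            labels.append(label)
    return ".".join(labels)

def encode_address(address: str) -> tuple[bytes, str]:
    """Split at the last '@': UTF-8 bytes for the local-part, Punycode for the domain."""
    local_part, _, domain = address.rpartition("@")
    return local_part.encode("utf-8"), encode_domain(domain)

local_bytes, ascii_domain = encode_address("小山@电邮.在线")
print(local_bytes)   # the UTF-8 bytes of the local-part
print(ascii_domain)  # the ASCII form used for the DNS lookup
print(decode_domain(ascii_domain))
```

The local-part stays UTF-8 because only the receiving mail server interprets it, whereas the Domain Name must pass through the DNS, which is why it gets the ASCII-compatible encoding.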

For further technical reading, these are the primary EAI RFCs:

  1. tools.ietf.org/html/rfc6531
  2. tools.ietf.org/html/rfc6532
  3. tools.ietf.org/html/rfc6533
  4. tools.ietf.org/html/rfc6534