André 小山 Schappo: November 2017

Thursday 30 November 2017

Computer Science Internationalization - Korean IDNs

Below is a set of Korean IDNs (Internationalised Domain Names). I have listed them because they are good examples of well integrated IDNs. They do not redirect to an ASCII Domain Name. They do not require the www prefix, nor do they redirect to a www prefixed Domain Name. Forcing the www prefix produces Domain Names that mix human language scripts, which I consider to be a bad thing. Single script Domain Names are a good thing. The IDN shows in the address bar and is correctly integrated into the navigation, which means the site is using relative addressing, which is also a good thing 😀. The TLD (Top Level Domain) 한국 means Korea.
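The mixed script point can be checked mechanically. Here is a minimal sketch, assuming a small hand-picked set of scripts. The `SCRIPTS` table and the function names `scriptsIn` and `isSingleScript` are hypothetical, made up for this illustration. It uses JavaScript's Unicode property escapes to report which scripts appear in a Domain Name.

```javascript
// Hypothetical table: only the scripts this sketch knows about.
const SCRIPTS = {
  Latin: /\p{Script=Latin}/u,
  Hangul: /\p{Script=Hangul}/u,
  Han: /\p{Script=Han}/u,
  Hiragana: /\p{Script=Hiragana}/u,
  Katakana: /\p{Script=Katakana}/u,
};

// Return the names of the known scripts used anywhere in the Domain Name.
// Dots, digits and hyphens belong to the Common script and are ignored.
function scriptsIn(domain) {
  const found = new Set();
  for (const ch of domain) {
    for (const [name, pattern] of Object.entries(SCRIPTS)) {
      if (pattern.test(ch)) found.add(name);
    }
  }
  return [...found];
}

// A Domain Name counts as single script when at most one known script appears.
function isSingleScript(domain) {
  return scriptsIn(domain).length <= 1;
}

console.log(scriptsIn("판촉통.한국"));     // Hangul only
console.log(scriptsIn("www.판촉통.한국")); // Latin and Hangul, i.e. mixed
```

A real checker would need the full list of Unicode scripts; this one only distinguishes the five listed in the table.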

판촉통.한국 (7th in the above list) is a Korean site selling promotional merchandise. Each product has photos and a Twitter sharing button which, when clicked, results in the site composing a tweet that links to the product. Here is one such composed tweet:

오죽막선사군자 (1009044) 152,500원 http://www.87tong.com/product/product_view.php?code=1009044 via @판촉통 1544-8759

The important and relevant issue here is that the Domain Name in the composed tweet is ASCII instead of Korean. Fortunately, this problem can be easily fixed. Before tweeting I replace the ASCII Domain Name www.87tong.com with the Korean Domain Name 판촉통.한국. The edited tweet is now:

오죽막선사군자 (1009044) 152,500원 http://판촉통.한국/product/product_view.php?code=1009044 via @판촉통 1544-8759

...and here is a link to the tweeted tweet twitter.com/andreschappo/status/936633558923440128
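The manual edit I made before tweeting could also be done in code. This is only a sketch: the helper name `useIdnHost` is hypothetical, and I am assuming the ASCII host always appears directly after `http://` or `https://`.

```javascript
// Hypothetical helper: replace an ASCII host with its IDN form inside tweet text.
function useIdnHost(tweet, asciiHost, idnHost) {
  // Escape regex metacharacters in the host (mainly the dots).
  const escaped = asciiHost.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
  // Only touch the host when it directly follows the scheme and is followed
  // by a path, query, fragment, whitespace or the end of the text.
  const pattern = new RegExp("(https?://)" + escaped + "(?=[/?#\\s]|$)", "g");
  return tweet.replace(pattern, (match, scheme) => scheme + idnHost);
}

const composed =
  "오죽막선사군자 (1009044) 152,500원 " +
  "http://www.87tong.com/product/product_view.php?code=1009044 via @판촉통 1544-8759";

console.log(useIdnHost(composed, "www.87tong.com", "판촉통.한국"));
```

The path and query string are left untouched, so the edited link points at the same product page.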

판촉통.한국 has 7 pages of fans 부채 and some of them are very beautiful. I recommend you browse through their fans section, which you will find at 판촉통.한국/product/cate_mid.php?cate=11300

고려대학교.한국 is the Korean IDN for Korea University. Goryeo 고려 (the first 2 characters of this Korean IDN) is the name of an old Korean kingdom en.wikipedia.org/wiki/Goryeo

Sunday 26 November 2017

Computer Science Internationalization - identifiers

When programming I sometimes use Unicode identifiers instead of ASCII identifiers. The basic principle of Unicode identifiers in a programming context is that the characters used in a Unicode identifier should belong to some human language script. There are of course exceptions, such as Apple's Swift programming language, which allows use of Emoji in program identifiers.

Egyptian Hieroglyphs have the Unicode general category Lo (Letter, other), so each one counts as a letter of the Egyptian Hieroglyphs script. I have just tested using an Egyptian Hieroglyph (codepoints.net/U+13001) as a variable name and it works fine.
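For JavaScript, the test above can be approximated with a regular expression. This is a sketch rather than a full parser: the function name `looksLikeIdentifier` is hypothetical, and it only checks the character rules (ID_Start, then ID_Continue), not reserved words.

```javascript
// True when `name` has the shape of a JavaScript identifier:
// first character is ID_Start, "$" or "_";
// the rest are ID_Continue, "$", ZWNJ (U+200C) or ZWJ (U+200D).
// Reserved words (if, var, ...) are NOT checked in this sketch.
function looksLikeIdentifier(name) {
  return /^[\p{ID_Start}$_][\p{ID_Continue}$\u200C\u200D]*$/u.test(name);
}

const glyph = String.fromCodePoint(0x13001); // the Egyptian Hieroglyph U+13001

console.log(/^\p{Lo}$/u.test(glyph));        // general category Lo?
console.log(looksLikeIdentifier(glyph));     // usable as an identifier?
console.log(looksLikeIdentifier("小山"));
console.log(looksLikeIdentifier("😀"));      // Emoji are not letters
```

Unlike Swift, JavaScript does not accept Emoji here, because Emoji do not have the ID_Start property.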

I recently decided to identify my html code files with my adopted Chinese name 小山. My standard working practice is to now have the start html tag of my code files as <html id="小山">. It is so cool to be able to write document.getElementById("小山").

Many programming languages now support Unicode identifiers. How good is the support and how far can one go? I chose one of my html code files in which, apart from <html id="小山">, all the identifiers were ASCII: ASCII JavaScript variable names, ASCII JavaScript function names and ASCII CSS class and id names. My aim was to replace all these ASCII identifiers with Unicode identifiers. Furthermore, I chose to write my Unicode identifiers in CJK (Chinese, Japanese and Korean) scripts.

The end result was an html code file with all ASCII identifiers replaced by Unicode identifiers, and it all works just fine. I expected it to work, but this was my first code file with all Unicode identifiers, so it was best to check.

My Unicode identifiers code file is a bit too long to include in this blog article. Instead, I list some of the original ASCII identifiers and their Unicode replacements.

Firstly CSS: I used CJK identifiers for class and id names. Below I show before (ASCII) and after (Unicode). I think you can guess that ✅️ means they worked just fine.

.keys .keys:hover & class="keys" ➽ .ボタン .ボタン:hover & class="ボタン" ✅️
.emphasise class="emphasise" ➽ .エンファサイズ class="エンファサイズ" ✅️
#footy id="footy" ➽ #바닥글 id="바닥글" ✅️
#earth id="earth"　getElementById("earth") ➽ #地球 id="地球"　getElementById("地球") ✅️

I changed all function names to Unicode CJK names. Here are 2 examples of my changes.

moveMoon() onclick="moveMoon()" ➽ 달을움직이다() onclick="달을움직이다()" ✅️
stopMoon() onclick="stopMoon()" ➽ 달을멈추다() onclick="달을멈추다()" ✅️

Finally, I changed all variable names to Unicode CJK names. Here are 2 examples.

increment ➽ 增量 ✅️
raceTrack1Width ➽ 跑道一宽度 ✅️

...and here is one of my JavaScript functions, with a nested function:

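  // 달을움직이다 = "move the moon". The names 增量, 月亮在移动, 身份月亮,
  // 月亮时间 and 跑道二宽度 are defined elsewhere in the code file.
  function 달을움직이다(){
    var 月亮=document.getElementById("まんげつ"), // the moon element
        位置左月亮=0,                             // left position of the moon, in px
        月亮增量=增量;                            // step per tick
    if(月亮在移动)return;                         // already moving, so do nothing
    身份月亮=setInterval(달케이크,月亮时间);      // keep the timer id
    月亮在移动=true;
    function 달케이크(){                          // one animation tick
      // reverse direction at either edge of the track
      if(位置左月亮>跑道二宽度-50||位置左月亮<0)月亮增量*=(-1);
      位置左月亮+=月亮增量;
      月亮.style.left=位置左月亮+"px";
    }
  }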
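To try the bounce logic outside a browser, here is a minimal DOM-free sketch. The function name 下一步 ("next step") and its parameter names are hypothetical, introduced only for this illustration. The edge test mirrors the one in 달케이크 above.

```javascript
// Hypothetical pure function: one tick of the moon's movement.
// 位置 = current left position, 增量 = step, 跑道宽度 = track width.
function 下一步(位置, 增量, 跑道宽度) {
  // Reverse direction when the moon is past either edge of the track.
  if (位置 > 跑道宽度 - 50 || 位置 < 0) 增量 *= -1;
  return { 位置: 位置 + 增量, 增量: 增量 };
}

// Run a few ticks on a track 100 wide, starting at the left edge.
let 状态 = { 位置: 0, 增量: 5 };
for (let i = 0; i < 15; i++) {
  状态 = 下一步(状态.位置, 状态.增量, 100);
  console.log(状态.位置);
}
```

Object property names can be Unicode identifiers too, which is why `状态.位置` works without quotes.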

My Unicodified html code file was successfully validated by validator.w3.org

Here is the Unicode Consortium's take on Unicode identifiers:— Unicode Standard Annex #31 - Unicode Identifier and Pattern Syntax unicode.org/reports/tr31/

Monday 20 November 2017

Computer Science Internationalization - Teaching Excellence for Student Success

The Higher Education Academy have initiated a debate on Teaching Excellence for Student Success and are inviting comments www.heacademy.ac.uk/individuals/strategic-priorities/teaching-excellence-for-student-success. Below are my comments which I sent them a couple of weeks ago.

Computer Science departments should be teaching students how to build software for the world ie they should be internationalising their curricula. I am a long time practitioner of internationalised Computer Science teaching. I do though appear to be a solitary voice. Industry, on the other hand, is building software for the world and needs Computer Scientists with Global Skills.

Over the years, the Global Skills I have taught students include:

① Internationalisation and Localisation of software
② Character sets
③ Unicode and Unicode encodings
④ Internationalising websites and building Adaptive Internationalised websites
⑤ Usage of language tags
⑥ Fonts - glyph variants and relationship to Unicode
⑦ Keyboard Mappings and Input Methods
⑧ Internationalised Domain Names & Internationalised Email Addresses
⑨ Unicode Regular Expressions
⑩ Characteristics of English, Chinese, Japanese, Korean, Russian and Arabic languages/scripts

One misconception is that one needs to know multiple (human) languages in order to build software for the world. Not necessary. But one does need to have an understanding of the characteristics of (human) languages/scripts, for example:

• Chinese and Japanese do not have spaces separating words
• Arabic is written right to left
• Korean Hangeul letters, Jamo, are joined into syllabic blocks
• ...etc...
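The Korean point can be seen directly in code. This is a small sketch using Unicode normalisation, which is standard JavaScript; the two function names are hypothetical. NFC joins conjoining Jamo into a syllabic block and NFD splits a block back into its Jamo.

```javascript
// Compose conjoining Jamo into precomposed syllable blocks (NFC).
function composeJamo(jamo) {
  return jamo.normalize("NFC");
}

// Decompose syllable blocks back into conjoining Jamo (NFD).
function decomposeSyllables(text) {
  return text.normalize("NFD");
}

// U+1112 (ᄒ) + U+1161 (ᅡ) + U+11AB (ᆫ) form the single block U+D55C (한).
const block = composeJamo("\u1112\u1161\u11AB");
console.log(block, block.length);

// 한글 is 2 syllable blocks made from 6 Jamo.
console.log(decomposeSyllables("\uD55C\uAE00").length);
```

Software that counts, compares or searches Korean text needs to know which of the two forms it is dealing with.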

If one wants to internationalise Computer Science Curricula the obvious thing to do is to teach students Global Skills, as above.

Let me give an example. If you look at speakerdeck.com/andre_schappo/unicode-regular-expressions you will see that up to and including slide 10, I am using ASCII regex patterns and strings. If you search the internet you will find thousands of regex examples using ASCII only. Now go to slide 11 and suddenly the regex world changes dramatically. In slide 11, I have Chinese and Emoji text. Fast forward to slide 33 and you will see Egyptian Hieroglyphs.
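Here is a short JavaScript sketch of the same shift, written for this article and not taken from the slides. It compares ASCII-era regex habits with Unicode-aware patterns that use the `u` flag and Unicode property escapes.

```javascript
const chinese = "中文";
const emoji = "😀";
const hieroglyphChar = String.fromCodePoint(0x13001);

// ASCII habit: \w only covers A-Z, a-z, 0-9 and _, so Chinese text fails.
const asciiWord = /^\w+$/.test(chinese);

// Unicode-aware: match by script instead.
const hanWord = /^\p{Script=Han}+$/u.test(chinese);

// Without the u flag "." matches one UTF-16 code unit,
// so a single Emoji (two code units) is not matched as one character.
const dotWithoutU = /^.$/.test(emoji);
const dotWithU = /^.$/u.test(emoji);

// Scripts outside the BMP work the same way.
const isHieroglyph = /^\p{Script=Egyptian_Hieroglyphs}$/u.test(hieroglyphChar);

console.log({ asciiWord, hanWord, dotWithoutU, dotWithU, isHieroglyph });
```

The ASCII patterns do not raise an error. They quietly return the wrong answer for non-ASCII text.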

In addition to making students World Ready, internationalising Computer Science Curricula makes for a richer and more interesting subject.

Time for Computer Science Departments to embrace the world by teaching students Global Skills.

Friday 10 November 2017

Computer Science Internationalization - html tables

Subtitled: html tables with Chinese characteristics

Subtitled: applying Chinese and Korean font metrics to html tables

I think html tables look much better when the table cells have a square aspect ratio, i.e. a perfect square rather than a rectangle.

There is actually a very easy way of making your html table cells square. Firstly one needs to have some understanding of the characteristics of human language scripts. One characteristic of the Chinese language script is that each character 汉字 occupies a square. We can use this characteristic when coding html tables.

I wrote some html code for a Sudoku board. In each cell I have an input. The relevant code, for this article, is:

Let's look at relevant CSS in class bigcol, specifically the CSS property font-family —

font-family:Courier,monospace; and font-family:monospace; give a rectangular aspect ratio, the height being greater than the width. Not good.

Let's now use a Chinese font: font-family:"Hannotate SC",monospace;. With Firefox, we now have a really good looking table with square aspect ratio table cells. Chrome, on the other hand, produces vertical rectangular aspect ratio table cells.

Using the template font-family: font name ,monospace;, let's try some more Chinese fonts.

Firefox + font PingFang SC produces a perfect square. Chrome + PingFang SC gives a vertical rectangle.

Firefox + Yuanti SC 圆体-简 produces a perfect square. Chrome + Yuanti SC 圆体-简 gives a vertical rectangle.

Firefox + Baoli SC 报隶-简 produces a perfect square. Chrome + Baoli SC 报隶-简 gives a vertical rectangle.

Firefox + Lantinghei SC 兰亭黑-简 produces a perfect square. Chrome + Lantinghei SC 兰亭黑-简 gives a vertical rectangle.

Using font Heiti SC, Chrome gives much better results than Firefox. Firefox is way out. Chrome + Heiti SC produces a near perfect square. Firefox + Heiti SC gives a horizontal rectangle.

My expectation is that all browsers, when Chinese fonts are specified, should give perfect squares for the table cells as the squared Chinese character is a fundamental characteristic of the Chinese language. From my experimentation, one can see that there are differences between browsers and a perfect square is not always produced.

One of the characteristics of the Korean language script Hangul (also romanised as Hangeul) 한글 is that the individual letters Jamo are formed into squared syllable blocks. Letʼs explore this characteristic with the Korean font Nanum Myeongjo 나눔 명조. Firefox + Nanum Myeongjo produces a near perfect square. Chrome produces a vertical rectangle.

Using font "Noto Sans Korean" results in really awful aspect ratios in both browsers. Chrome + "Noto Sans Korean" produces vertical rectangles. Firefox + "Noto Sans Korean" gives horizontal rectangles.

I reason that there are dependencies between browsers and font metrics. In the "Noto Sans Korean" case, it seems that the 2 browsers have a 90 degree difference in interpretation of this fontʼs metrics.

My favourite combinations, so far, are: Firefox + Hannotate SC 手札体-简, Firefox + PingFang SC 苹方-简, Firefox + Yuanti SC 圆体-简, Firefox + Baoli SC 报隶-简, Firefox + Lantinghei SC 兰亭黑-简, Firefox + Nanum Myeongjo 나눔 명조 and Chrome + Heiti SC 黑体-简.

〖 thoughts — Should look at more fonts. Are there any other human language scripts that are squared?〗

Here is what my Sudoku board looks like with Firefox + Hannotate SC font. This font contains glyphs for English as well as Chinese. I do like the Hannotate SC font style used for English as well as the Hannotate SC font style used for Chinese. I had a bit of fun with the Emoji as I selected them by Chinese name. Actually, an Emoji Sudoku might be popular. Instead of the numbers 1 thru 9, an Emoji Sudoku would have 9 different Emoji.

Firefox + Hannotate SC

Environment: OSX Sierra 10.12.6, Firefox 56.0.2, Chrome 62.0.3202.89

Friday 3 November 2017

Computer Science Internationalization - Unicode Emoji

A commonly occurring problem is that of databases and/or associated code not being able to handle Unicode or only able to handle part of Unicode. I will be dealing with the case of partial support for Unicode in this article.

Letʼs examine MySQL. Prior to version 5.5, MySQL could only handle a part of Unicode, specifically the BMP (Basic Multilingual Plane). The reason is that its Unicode UTF8 encoding is a 3 byte encoding. In order to access the whole of Unicode, a 4 byte UTF8 encoding is required. MySQL version 5.5 and greater have a 4 byte UTF8 encoding called utf8mb4 and thus can address and store the whole of Unicode.

Unicode has 17 planes, each of which has 10000 (hexadecimal) codepoints, i.e. 65,536. The 2 most commonly used planes are: Plane 0, the BMP (Basic Multilingual Plane) and Plane 1, the SMP (Supplementary Multilingual Plane). The BMP codepoints range is 0➜FFFF and the SMP range is 10000➜1FFFF. A 3 byte UTF8 encoding can only address the BMP codepoints range 0➜FFFF. Therefore MySQL versions < 5.5 cannot handle SMP characters. More seriously, with MySQL versions < 5.5, if any SMP characters are used then, on storage in a database, the first encountered SMP character and all subsequent characters are discarded. The discarded characters cannot be recovered. There are still many systems with this problem. Even those systems that have been upgraded to MySQL 5.5 or greater can still have the problem because of associated code and/or tables not yet updated for 4 byte UTF8 encoding.
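The plane and byte arithmetic can be checked with a few lines of JavaScript. This is a sketch with hypothetical function names; `TextEncoder` is standard in browsers and Node.js and always encodes to UTF8.

```javascript
// Plane number of a codepoint: each plane spans 0x10000 codepoints.
function planeOf(codePoint) {
  return Math.floor(codePoint / 0x10000);
}

// Number of bytes the text occupies when encoded as UTF8.
function utf8ByteLength(text) {
  return new TextEncoder().encode(text).length;
}

// True when the text contains any codepoint beyond the BMP,
// i.e. anything a 3 byte UTF8 column cannot store.
function needsFourByteUtf8(text) {
  return /[\u{10000}-\u{10FFFF}]/u.test(text);
}

console.log(planeOf(0x4E2D), utf8ByteLength("中"));   // BMP character
console.log(planeOf(0x1F600), utf8ByteLength("😀"));  // SMP character
console.log(needsFourByteUtf8("中文"), needsFourByteUtf8("中文😀"));
```

A check like `needsFourByteUtf8` could be run before storing text, so that input is rejected or converted rather than silently cut short.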

So, basically, SMP characters break some MySQL setups. I will refer to these breakable MySQL setups as BMP only MySQL. There are solutions and work arounds that can be used to make BMP only MySQL handle the whole of Unicode.

Letʼs first look at Emoji as they are hugely popular. Most Emoji are in the SMP but there are a small number of Emoji, which have been in Unicode for many a year, that are in the BMP. These can be safely used with BMP only MySQL. I have listed those I have found below. There may be more. I have appended to each of these Emoji the single BMP character variation selector FE0F. This variation selector directs a rendering agent to use, if available, an emojified glyph for the Emoji. What these Emoji look like on your device will depend on the fonts your device has and the browser you are using. On my OSX Mac, all these Emoji look really good and they all have emojified glyphs. With my Android phone, only about half of the below Emoji have emojified glyphs.
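Appending the variation selector is a one line operation. In this sketch the function names `emojify` and `isBmpOnly` are hypothetical. It also confirms that the result stays within the BMP.

```javascript
const VS16 = "\uFE0F"; // variation selector FE0F, itself a BMP character

// Ask for the emojified glyph of a BMP character by appending FE0F.
function emojify(bmpChar) {
  return bmpChar + VS16;
}

// True when every codepoint of the text lies in the BMP (0 to FFFF).
function isBmpOnly(text) {
  return [...text].every((ch) => ch.codePointAt(0) <= 0xFFFF);
}

const heart = emojify("\u2764"); // U+2764 HEAVY BLACK HEART
console.log(heart, isBmpOnly(heart));
console.log(isBmpOnly("😀"));    // an SMP Emoji
```

Whether the emojified glyph is actually shown still depends on the fonts and the rendering agent.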

The SMP contains much more than Emoji. SMP characters include: Cuneiform, Egyptian Hieroglyphs, Byzantine Musical Symbols, Mahjong Tiles, Playing Cards and much more. All these SMP characters will break BMP only MySQL. With respect to web technologies we can fix this SMP breaking problem. Rather than encoding these SMP characters as UTF8 we can instead encode them as NCRs (Numeric Character References). NCRs only use ASCII characters and so are BMP only MySQL safe. ASCII is a subset of Unicode and occupies the first 128 codepoints of the BMP. NCRs take the form &#xnumber; where number is the hexadecimal codepoint. The NCR for the SMP character 🀣 is &#x1F023;, the NCR for the SMP character 𓀬 is &#x1302C; and the NCR for the SMP character 🎋 is &#x1F38B;.
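The conversion to NCRs is easy to automate. Here is a minimal sketch, with a hypothetical function name `toNcr`, that rewrites every codepoint above FFFF as a hexadecimal NCR and leaves BMP characters as they are.

```javascript
// Replace each character beyond the BMP with its hexadecimal NCR.
// BMP characters, including ASCII, are left untouched.
function toNcr(text) {
  return text.replace(/[\u{10000}-\u{10FFFF}]/gu, (ch) =>
    "&#x" + ch.codePointAt(0).toString(16).toUpperCase() + ";"
  );
}

console.log(toNcr(String.fromCodePoint(0x1F023))); // a Mahjong Tile
console.log(toNcr("Tanabata " + String.fromCodePoint(0x1F38B)));
console.log(toNcr("中文"));                        // BMP only, unchanged
```

The output is only meaningful where NCRs are interpreted, such as in html. Text that is shown elsewhere would need converting back.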

‼️ ⁉️ ℹ️ ↔️ ↕️ ↖️ ↗️ ↘️ ↙️ ↩️ ↪️ ⌚️ ⌛️ ⌨️ ⏩️ ⏪️ ⏫️ ⏬️ ⏭️ ⏮️ ⏯ ⏰️ ⏱️ ⏲️ ⏳️ ⏸️ ⏹️ ⏺️ Ⓜ️ ▶️ ☀️ ☁️ ☂️ ☃️ ☄️ ☎️ ☑️ ☔️ ☕️ ☘️ ☝️ ☠️ ☢️ ☣️ ☦️ ☪️ ☮️ ☯️ ☸️ ☹️ ☺️ ♈️ ♉️ ♊️ ♋️ ♌️ ♍️ ♎️ ♏️ ♐️ ♑️ ♒️ ♓️ ♠️ ♣️ ♥️ ♦️ ♨️ ♻️ ♿️ ⚒️ ⚓️ ⚔️ ⚖️ ⚗️ ⚙️ ⚛️ ⚜️ ⚠️ ⚡️ ⚪️ ⚫️ ⚰️ ⚱️ ⚽️ ⚾️ ⛄️ ⛅️ ⛈️ ⛎️ ⛏️ ⛑️ ⛓️ ⛔️ ⛩️ ⛪️ ⛰️ ⛱️ ⛲️ ⛳️ ⛴️ ⛵️ ⛷️ ⛸️ ⛹️ ⛺️ ⛽️ ✂️ ✅️ ✈️ ✉️ ✊️ ✋️ ✌️ ✍️ ✏️ ✒️ ✝️ ✡️ ✨️ ✳️ ✴️ ❄️ ❇️ ❌️ ❎️ ❓️ ❔️ ❕️ ❗️ ❣️ ❤️ ➡️ ⬅️ ⬆️ ⬇️ ⭐️ ⭕️ 〽️ ㊗️ ㊙️

Wednesday 1 November 2017

Computer Science Internationalization - i18n links

In this article I list i18n (Internationalisation) relevant links. This article will evolve as and when I add and remove links.

Apple: OSX and iOS

m10lmac.blogspot.co.uk — Multilingual Mac

Blogs

mathiasbynens.be — Mathias Bynens: Unicode, HTML, CSS, JavaScript and more

Computer Science/IT/ICT Curricula Internationalisation

groups.google.com/forum/#!forum/computer-science-curriculum-internationalization — I am administrator for this forum

Fonts

babelstone.co.uk/Fonts — BabelStone Fonts: Centaurian, Goblin, Han, Khitan, Ogham, Tangut and more
google.com/get/noto — Google Noto Fonts: Arabic, Armenian, Balinese, Buhid, Cherokee, Lao, Mongolian, Myanmar, Sinhala, Thai and much much more

IDNs (Internationalised Domain Names) and EAI (Email Address Internationalisation)

uasg.tech — Universal Acceptance Steering Group - I am a member of their email discussion list
idnforums.com — IDN Forums - a Domainers forum - I am a member of this forum.
Get a free
1. داده.امارات — Arabic ‫العربية‬ email address or
2. 电邮.在线 — Chinese 中文 email address or
3. ডাটামেল্.ভারত — Bengali বাংলা email address or
4. ડાટામેલ.ભારત — Gujarati ગુજરાતી email address or
5. डाटामेल.भारत — Hindi हिन्दी email address or
6. データメール.コム — Japanese 日本語 email address or
7. 우편.닷컴 — Korean 한국어 email address or
8. डेटामेल.भारत — Marathi मराठी email address or
9. ਡਾਟਾਮੇਲ.ਭਾਰਤ — Punjabi ਪੰਜਾਬੀ email address or
10. දත්තතැපැල.ලංකා — Sinhala සිංහල email address or
11. இந.இந்தியா — Tamil தமிழ் email address or
12. డేటామెయిల్.భారత్ — Telugu తెలుగు email address or
13. ดาต้าเมล.ไทย — Thai ไทย email address or
14. ڈاٹامیل.بھارت — Urdu ‫اردو‬ email address or
15. datamail.in — English email address

JavaScript

jsfiddle.net/user/coas/fiddles — my internationalized programming challenges

Regular Expressions (regex)

regular-expressions.info/refunicode.html — Regular Expression Unicode Syntax Reference
speakerdeck.com/andre_schappo/unicode-regular-expressions — one of my presentations

Twitter

twitter.com/r12a — Richard Ishida

Unicode

unicode.org — the definitive source for all things Unicode - I am a member of their email discussion list
babelstone.co.uk/Unicode/unicode.html — unicode, the movie

Useful Utilities and Web Apps

ftfy.now.sh — fix mojibake 文字化け
scripts.sil.org/ukelele — an OSX keyboard layout editor
r12a.github.io — Richard Ishida's treasure trove of Web Apps
r12a.github.io/app-encodings — Richard Ishidaʼs Encoding Converter
usefulwebtool.com — online keyboards, character sets and much more

w3.org/International — W3C Internationalization (i18n) Activity
validator.w3.org/i18n-checker/ — i18n checker
developer.mozilla.org/en-US/docs/Web/CSS/list-style-type — CSS list-style-type property