Monday, 14 January 2019

Computer Science Internationalization - Adaptive URL

I recently came across an impressive Serbian website, the Serbian National Internet Domain Registry (RNIDS/РНИДС). The siteʼs content is available, by user selection, in English, in Serbian written in Cyrillic script, or in Serbian written in Latin script.

What is happening in the browser address bar is what I find most interesting and impressive. Firstly, RNIDS has a Serbian Cyrillic Domain Name, рнидс.срб. This Domain Name is properly integrated into the site. It does not redirect to an ASCII domain, nor does it use a Frame redirect/forward. It is correctly displayed in the browser address bar.

Secondly, the pathname part of the URL is displayed in the currently selected language/script. If you browse round the site and change language/script you will see the URL pathname instantly adapt to your selected language/script. Here is an example page:

  1. Ћирилица (Serbian/Cyrillic Script): рнидс.срб/национални-домени/регистрација-националних-домена
  2. Latinica (Serbian/Latin Script): рнидс.срб/lat/nacionalni-domeni/registracija-nacionalnih-domena
  3. English: рнидс.срб/en/national-domains/registering-national-domains

I consider RNIDS to be an excellent example of FULLY internationalised URLs. There are, nowadays, many sites with an Internationalised Domain Name in a multitude of languages/scripts. Most, though, still have an ASCII/English pathname. I consider this to be a missed opportunity. I highly recommend that sites fully internationalise their URLs. One way of achieving this is by the use of aliases: schappo.blogspot.com/2017/03/computer-science-internationalization_31.html
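To see what a fully internationalised URL looks like underneath its display form, here is a minimal JavaScript sketch using the standard URL class and decodeURI. It is only an illustration and says nothing about how RNIDS implements its site. The https:// scheme is an assumption, as the addresses above are written without one.

```javascript
// Assumption: the https:// scheme is added purely for illustration.
const display =
  "https://рнидс.срб/национални-домени/регистрација-националних-домена";
const url = new URL(display);

// The host is converted to its ASCII-compatible (Punycode, "xn--") form.
const asciiHost = url.hostname;

// The pathname is percent-encoded as UTF-8 bytes.
const encodedPath = url.pathname;

// decodeURI turns the percent-encoded pathname back into readable Cyrillic.
const readablePath = decodeURI(encodedPath);

console.log(asciiHost);
console.log(encodedPath);
console.log(readablePath);
```

The browser address bar shows the readable form, while the ASCII host and the percent-encoded pathname are what is actually sent over the network.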

Sunday, 10 June 2018

BBC International Websites

If you are in the UK, most of you will be familiar with the BBC website bbc.co.uk. The BBC does have several localised websites for non-English languages and regional news. If you are browsing from the UK, it is not at all obvious how to visit their localised websites as there are no links on their UK website.

Here is how I found their localised websites. Firstly I visited bbc.co.uk and, as one would expect, I landed on their UK homepage. But there are no links to their localised sites. If there were, I would be happy and would not be writing this article. Next I visited the Wikipedia BBC article en.wikipedia.org/wiki/BBC and saw that there is a different address, bbc.com. I used this address but, as I am browsing from the UK, it redirects to bbc.co.uk, which is not at all helpful. So, I am back to where I started.

To the rescue comes the Opera browser with its built-in VPN. I set the VPN to a non-UK location and now when I use bbc.com I do not get redirected to bbc.co.uk. Scroll down to the bottom of the page and I now see links to their localised websites. If you are browsing from the UK without such a VPN service, these links will not redirect to bbc.co.uk.

For your convenience I list below all the links to the BBC localised websites.

  1. Arabic عربي bbc.com/arabic
  2. Azeri AZƏRBAYCAN bbc.com/azeri
  3. Bangla বাংলা bbc.com/bengali
  4. Burmese မြန်မာစာ bbc.com/burmese
  5. Chinese 中文 bbc.com/zhongwen/simp
  6. French bbc.com/afrique
  7. Hausa bbc.com/hausa
  8. Hindi हिन्दी bbc.com/hindi
  9. Indonesian bbc.com/indonesia
  10. Japanese 日本語 bbc.com/japanese
  11. Kinyarwanda & Kirundi bbc.com/gahuza
  12. Kyrgyz Кыргыз bbc.com/kyrgyz
  13. Marathi मराठी bbc.com/marathi
  14. Nepali नेपाली bbc.com/nepali
  15. Pashto پښتو bbc.com/pashto
  16. Persian فارسی bbc.com/persian
  17. Portuguese bbc.com/portuguese
  18. Russian ру́сский bbc.com/russian
  19. Sinhala සිංහල bbc.com/sinhala
  20. Somali bbc.com/somali
  21. Spanish bbc.com/mundo
  22. Swahili bbc.com/swahili
  23. Tamil தமிழ் bbc.com/tamil
  24. Turkish TÜRKÇE bbc.com/turkce
  25. Ukrainian УКРАЇНСЬКA bbc.com/ukrainian
  26. Urdu اردو bbc.com/urdu
  27. Uzbek O'ZBEK bbc.com/uzbek
  28. Vietnamese TIẾNG VIỆT bbc.com/vietnamese

Sunday, 22 April 2018

Computer Science Internationalization - Ideographic Description Characters

I recently created a new Chinese character. This is the very first time I have done so. I created the character to write on a farewell card.

Firstly, some background. About two years ago I gave a printout to a colleague, Katherine Hollingsworth, of the Chinese character 好 which means good. I explained how this character has two components, the left part meaning woman and the right part meaning child. Woman 女 + child 子 is something good, hence the meaning good. Katherine, at this time, had one child.

Fast forward two years and Katherine is leaving us. As is traditional, there was a farewell card for us to write our best wishes. That is when I had the idea of creating a new Chinese character. Katherine now has two children. The character I created was ⿰好子 which is a woman with two children. I handwrote this character onto the card. Chinese characters are written into a square which is what I did when handwriting the combination of 好 and 子.

Now to the Computer Science part. In Unicode there are twelve Ideographic Description Characters: ⿰ ⿱ ⿲ ⿳ ⿴ ⿵ ⿶ ⿷ ⿸ ⿹ ⿺ ⿻, U+2FF0➔2FFB. These can be used to construct new characters from combinations of existing characters and/or components. They represent the topological relationship between the components. ⿰ is used to represent a character with two components, a left part and a right part. The ideographic description sequence for my new character is thus ⿰好子. I have given this character the name 双好 which means double good😀

I told my Chinese project students about my new Chinese character. One of the students, 王国旭 Wang Guoxu, suggested an additional way of constructing the character using a left part, a middle part and a right part. The sequence for his suggestion is ⿲子女子. I really like this suggestion as we now have a woman surrounded by her two children.

Update: I have now devised a second new Chinese character. It is related to the above character and consists of four, side by side components. The ideographic description sequence is: ⿰⿰子男⿰女子. I will leave it to you, the reader, to determine what it represents.

Update 2: Another arrangement of the 双好 components is to have the woman above the children. The ideographic description sequence is: ⿱女⿰子子. See twitter.com/andreschappo/status/1046367105141153793 for a calligraphic version of this arrangement.
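The ideographic description sequences above can be checked mechanically. The following is a minimal JavaScript sketch, not a full implementation of the Unicode rules. It relies only on the fact that, of the twelve Ideographic Description Characters in U+2FF0 to U+2FFB, ⿲ and ⿳ take three operands and the other ten take two. The function names idcArity and isWellFormedIds are my own, chosen for illustration.

```javascript
// Number of operands taken by an Ideographic Description Character.
// Returns 0 for any character outside U+2FF0..U+2FFB (a plain component).
function idcArity(ch) {
  const cp = ch.codePointAt(0);
  if (cp < 0x2ff0 || cp > 0x2ffb) return 0;
  return cp === 0x2ff2 || cp === 0x2ff3 ? 3 : 2;
}

// Returns true if `ids` is exactly one complete description sequence:
// every description character is followed by the right number of operands,
// and nothing is left over at the end.
function isWellFormedIds(ids) {
  const chars = Array.from(ids); // iterate by code point, not code unit
  let pos = 0;
  function parse() {
    if (pos >= chars.length) return false; // ran out of operands
    const arity = idcArity(chars[pos++]);
    for (let i = 0; i < arity; i++) {
      if (!parse()) return false;
    }
    return true;
  }
  return parse() && pos === chars.length;
}

for (const ids of ["⿰好子", "⿲子女子", "⿰⿰子男⿰女子", "⿱女⿰子子", "⿰好"]) {
  console.log(ids, isWellFormedIds(ids));
}
```

The first four sequences are the ones described in this article. The last one is deliberately incomplete, as ⿰ is missing its right part.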

Saturday, 10 March 2018

Computer Science Internationalization - Validating People Names

When coding, it is essential to consider and account for edge cases. Whilst coding Internationalised Programming Challenge 17 jsfiddle.net/coas/4djhso1y I happened upon an unexpected and fascinating edge case.

Before I reveal the edge case I need to give you some background information, starting with some Chinese characters.

娥,鄂,鹅,仒,厄,戹*,屵*,阨*,阿*,呝,俄,砨*,偔,堨*,圔*,誒*,噁*,儑*,貖*,礘*,櫮,鰪*,岋,阸*,妸,咢,匎*,卾,隘*,廅*,僫,蕚,噩,鍔,額,鰐,讹,吪,妿,咹,胺,啞*,蛯*,搤,磀,遻*,嶭*,騀,顎,鶚 ...and many more at chinese-tools.com/tools/sinograms.html?p=e

All these Chinese characters can be written in pinyin as E or e. Those characters marked with *, have multiple meanings, hence multiple pronunciations, hence multiple ways of writing in pinyin. Those characters not marked with * are only written in pinyin as E or e. Some of you may well be thinking, what of tone marks. Well, unless I explicitly request it, I have never seen a Chinese person write pinyin with tone marks.

Some of these Chinese characters are family names and some would be suitable for given names.

So, now to the edge case. A Chinese name when written in pinyin could be E E or Ee E. I asked on Weibo 微博 whether anyone knew of any Chinese name which when written in pinyin is E E or Ee E. One person responded with the Chinese name 鄂娥 which written in pinyin is E E.

I reason that a person whose only language is English would think E E are initials and not the full name. Actually, before I considered name edge cases I would probably also have thought E E are initials. I have been aware for a long time that some Chinese characters can be written in pinyin as single letters such as e or a but I had not made the connection with people names.

This example illustrates that programmers need to thoroughly research naming conventions in different countries/cultures/languages before writing validation code.
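To make the pitfall concrete, here is a small JavaScript sketch of the kind of check a programmer might be tempted to write. The function looksLikeInitials is hypothetical and is not taken from Challenge 17. It treats a name as initials only when every space-separated part is a single letter.

```javascript
// Hypothetical heuristic: a name is "just initials" if every part,
// split on whitespace, is exactly one letter.
function looksLikeInitials(name) {
  return name
    .trim()
    .split(/\s+/)
    .every((part) => /^\p{L}$/u.test(part));
}

console.log(looksLikeInitials("E E"));  // true, yet E E is a complete name
console.log(looksLikeInitials("Ee E")); // false
```

Any validation that rejects names flagged by such a heuristic would wrongly reject 鄂娥 written in pinyin.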

I would like to encompass several international naming conventions in my Challenge 17. So far, I have coded validation rules for Chinese 中文, Khmer ភាសាខ្មែរ, Korean 한국어, Vietnamese Tiếng Việt and a catchall. I welcome contributions of international naming rules which I will code and incorporate into Challenge 17. You can email me or, if you do not know my email address, you can tweet me @andreschappo or contact me on Weibo 微博 @schappo

Techie stuff: The regex I use for Chinese name validation is:

XRegExp("^((?![\\p{InKangxi_Radicals}\\p{InCJK_Radicals_Supplement}\\p{InCJK_Symbols_and_Punctuation}])\\p{Han}){2,4}$","u")

My regex is using negative look-ahead, recognisable by the ?! construct. For each character, I am checking that it is a Han* character and is not a radical or symbol or punctuation character. This can be generalised to:

(?!Character_Set_B)Character_Set_A

which reads as: a character must not be in Character_Set_B and must be in Character_Set_A in order to be valid.

The same can be achieved using negative look-behind:

XRegExp("^(\\p{Han}(?<![\\p{InKangxi_Radicals}\\p{InCJK_Radicals_Supplement}\\p{InCJK_Symbols_and_Punctuation}])){2,4}$","u")

This can be generalised as:

Character_Set_A(?<!Character_Set_B)

which reads as: a character must be in Character_Set_A and not in Character_Set_B in order to be valid.
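Both generalised patterns can also be tried with native JavaScript regular expressions instead of XRegExp. This is a sketch under stated assumptions, not a drop-in replacement for the XRegExp versions above: native regexes have no \p{In...} block syntax, so I have written the three excluded blocks as explicit code point ranges, and I use \p{Script=Han} in place of \p{Han}.

```javascript
// Excluded blocks, written out by hand as code point ranges:
//   CJK Radicals Supplement      U+2E80..U+2EFF
//   Kangxi Radicals              U+2F00..U+2FDF
//   CJK Symbols and Punctuation  U+3000..U+303F
const excluded = "[\\u2E80-\\u2EFF\\u2F00-\\u2FDF\\u3000-\\u303F]";

// (?!Character_Set_B)Character_Set_A
const lookAhead = new RegExp(
  `^(?:(?!${excluded})\\p{Script=Han}){2,4}$`,
  "u"
);

// Character_Set_A(?<!Character_Set_B)
const lookBehind = new RegExp(
  `^(?:\\p{Script=Han}(?<!${excluded})){2,4}$`,
  "u"
);

// "\u2F25\u2F26" is the pair of Kangxi radicals for woman and child.
const samples = ["鄂娥", "王国旭", "\u2F25\u2F26", "E E", "王"];
for (const s of samples) {
  console.log(s, lookAhead.test(s), lookBehind.test(s));
}
```

Both forms should agree on every sample: the two real names pass, while the radicals, the pinyin form and the single character fail.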

The Chinese for Regular Expression (regex) is 正则表达式.

* A Han character, in this context, is actually a CJK (Chinese or Japanese or Korean) character but that is far too long a story for this blog article.

Update 30th March 2018: In my XRegExp, above, I am using the u flag. This enables Unicode but only the BMP (Basic Multilingual Plane). There is an A flag which enables the whole of Unicode, BMP + Astral characters, but, for quite some time, I could not get this to work. I did eventually find the problem and a solution. The problem is that jsdelivr minification breaks XRegExp. My solution was simply not to use the jsdelivr minified version of XRegExp. I am, therefore, using cdn.jsdelivr.net/npm/xregexp@4.1.1/xregexp-all.js and not cdn.jsdelivr.net/npm/xregexp@4.1.1/xregexp-all.min.js.

In addition to using the A flag I have had to make some minor changes to my regex. My updated regex is:

XRegExp("^((?!\\p{InKangxi_Radicals}|\\p{InCJK_Radicals_Supplement}|\\p{InCJK_Symbols_and_Punctuation})\\p{Han}){2,4}$","A")

The positive outcome for Chinese and Korean Hanja names validation is that these names can now contain characters from the Unicode SIP (Supplementary Ideographic Plane) in addition to characters in all the other Unicode planes.
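The difference between BMP and astral characters is easy to see in plain JavaScript. The sketch below uses native regular expressions only, so it does not exercise the XRegExp A flag. It simply illustrates why characters from the SIP need special care. U+20000 is chosen as an example because it is the first character of CJK Unified Ideographs Extension B.

```javascript
// U+20000 lives in the Supplementary Ideographic Plane (SIP).
const sip = "\u{20000}";

console.log(sip.length);             // 2 UTF-16 code units
console.log(Array.from(sip).length); // 1 code point

// Without the native u flag, "." matches one code unit, so a single
// astral character does not match /^.$/. With the u flag it does.
console.log(/^.$/.test(sip));  // false
console.log(/^.$/u.test(sip)); // true

// A native Han check with the u flag accepts a name containing the
// SIP character.
console.log(/^\p{Script=Han}{2,4}$/u.test("王" + sip)); // true
```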

Saturday, 3 March 2018

Computer Science Internationalization - Bidirectional Text

For this blog article I have taken the last two text boxes from my internationalised programming challenge 9 at jsfiddle.net/coas/qa8190kn. Both text boxes contain the same text which is a mix of English and Hebrew. English is written and read left to right. Hebrew is written and read right to left. We are going to look more closely at bidirectional (bidi) text in browsersʼ text boxes. The two text boxes are:

text box 1: direction is Left ➜ Right


text box 2: direction is Left ← Right

The order of the text in these two boxes is in display order which is the order presented to users. The order in which it is actually stored is called memory order or logical order. It is the order in which I typed the text. The memory order is:

So, a is the first character and h is the last character of the text. Selection of text is determined by memory order. Selecting characters 3 thru 6 will give cdבא, characters 6 thru 9 gives הדגב, characters 11 thru 14 gives תזef and so forth.

Now onto the selection process. One way of selecting text is to use shift in conjunction with the arrow keys. You should make the following associations:

  • text box 1, Left ➜ Right
    • right arrow key becomes forward
    • left arrow key becomes back
  • text box 2, Left ← Right
    • left arrow key becomes forward
    • right arrow key becomes back

Having made the above associations, forget all about left and right and think only of forward and back where forward is moving forwards through the text (memory order) and back is moving backwards through the text (memory order).

Now to selection, starting with the cursor between b and c:

  • shift forward forward forward, will select cdא
  • shift back back, will select ab

These key sequences will select the same text in both text boxes 1 & 2. The difference is in how it is presented to the user. It is presented to the user in display order.
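The forward and back model can be sketched in a few lines of JavaScript. This is an illustration of the idea only, not how browsers implement selection. The string memory is a hypothetical stand-in for the memory order text (four Latin letters followed by the Hebrew letters alef, bet, gimel and dalet), and the function names are my own.

```javascript
// Hypothetical memory-order text: a b c d, then alef bet gimel dalet.
const memory = "abcd\u05D0\u05D1\u05D2\u05D3";

// Map an arrow key to a step in memory order for a given box direction:
// in a left-to-right box the right arrow is forward, in a right-to-left
// box the left arrow is forward.
function arrowToStep(direction, key) {
  const forwardKey = direction === "rtl" ? "left" : "right";
  return key === forwardKey ? 1 : -1;
}

// Hold shift and press the given arrow keys, starting at `cursor`.
// The result is always a slice of the memory-order text.
function select(text, cursor, direction, keys) {
  let end = cursor;
  for (const key of keys) {
    end += arrowToStep(direction, key);
    end = Math.max(0, Math.min(text.length, end));
  }
  return text.slice(Math.min(cursor, end), Math.max(cursor, end));
}

// The cursor between b and c is index 2.
console.log(select(memory, 2, "ltr", ["right", "right", "right"]));
console.log(select(memory, 2, "rtl", ["left", "left", "left"]));
console.log(select(memory, 2, "ltr", ["left", "left"]));
console.log(select(memory, 2, "rtl", ["right", "right"]));
```

The first two calls select the same three characters, c, d and alef, and the last two both select ab. Only the keys pressed differ between the two directions.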

I suggest you practice selecting text in text boxes 1 & 2. Initially it will seem somewhat confusing. When you are selecting text in text boxes 1 & 2, use the memory order text box as a reference to more easily determine what should be selected.

Now try selecting text in the following two text boxes. They both contain the same text, which is a mix of Khmer and Arabic. Additionally, paste your selections into a word processor.

text box 3: direction is Left ➜ Right


text box 4: direction is Left ← Right

Onto a related topic which is cursor movement in browser text boxes. Forget text selection. Cursor movement with left and right arrow keys is in display order only. I think there should be a browser option to switch between cursor movement by display order and cursor movement by memory order. I would like the same option in word processors and text editors.

Tuesday, 20 February 2018

Computer Science Internationalization - Mandarin Chinese Tones

Standard Mandarin Chinese uses four tone marks for pronunciation: ¯ ´ ˇ `. Normally, one only encounters these tone marks with Chinese written in pinyin but there is absolutely no reason why these tone marks cannot be used with Chinese characters 汉字.

Letʼs use the sentence: Nottingham is the home of Robin Hood. In pinyin this would be written: nuò dīng hàn shì luó bīn hàn de gù xiāng. In Mandarin Chinese this would normally be written: 诺丁汉是罗宾汉的故乡.

Here are some more simple Chinese sentences with tone marks. I have made the text a little larger so you can see the tone marks more clearly.

For the four tone marks I am using Unicode Combining Diacritical Marks, specifically: U+0304 ¯ COMBINING MACRON, U+0301 ´ COMBINING ACUTE ACCENT, U+030C ˇ COMBINING CARON, U+0300 ` COMBINING GRAVE ACCENT. These diacritics combine with the immediately preceding character.

Some Chinese characters have different pronunciations (tones) with different meanings. 与 and 为 are two such characters. Now, suppose I am not sure which is the correct tone, which is highly likely as my knowledge of Chinese is only basic. A single base character can have more than one Unicode combining diacritical mark. So, when I am uncertain I can combine all the relevant diacritics and let people knowledgeable in Chinese decide which is the correct tone from the context. Letʼs take sentence 二 and apply multiple tone marks to 与 and 为.
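In code, adding tone marks is just string concatenation, because each combining mark follows its base character. Here is a minimal JavaScript sketch. The names TONE_MARKS and withTones are my own, and the choice of tones 2 and 4 for 为 is only an example of marking a character whose tone I am unsure of.

```javascript
// Combining marks for the four tones.
const TONE_MARKS = {
  1: "\u0304", // COMBINING MACRON
  2: "\u0301", // COMBINING ACUTE ACCENT
  3: "\u030C", // COMBINING CARON
  4: "\u0300", // COMBINING GRAVE ACCENT
};

// Append one or more tone marks directly after a base character.
function withTones(base, ...tones) {
  return base + tones.map((t) => TONE_MARKS[t]).join("");
}

const han = withTones("汉", 4);       // one tone mark
const unsure = withTones("为", 2, 4); // two candidate tones on one base

console.log(han, Array.from(han).length);       // 2 code points
console.log(unsure, Array.from(unsure).length); // 3 code points

// Latin letters have precomposed forms but Han characters do not, so
// NFC normalisation composes the first and leaves the second unchanged.
console.log(withTones("a", 1).normalize("NFC") === "\u0101"); // true
console.log(han.normalize("NFC") === han);                    // true
```

As the base character and its marks remain separate code points, how well they are displayed depends entirely on the font and the rendering software.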

Actually, I can imagine those knowledgeable in Chinese, using my multiple diacritics methodology illustrated in sentence 四, being able to write a sentence having multiple sensible meanings.

So, how to type the tone marks? Here is one method using OSX and the ABC - Extended keyboard. Firstly, type your Chinese text or copy and paste some Chinese text. Now switch to the ABC - Extended keyboard. Place your cursor immediately after a Chinese character and then use one of the following key combinations.

  • first tone ¯, use the key combination: alt ⇧ A
  • second tone ´, use the key combination: alt ⇧ E
  • third tone ˇ, use the key combination: alt ⇧ V
  • fourth tone `, use the key combination: alt ⇧ grave

Repeat for each Chinese character in your text, excepting those that do not have a tone mark. 的 when used as a possessive particle does not have a tone mark. If your OSX system is not set up for the ABC - Extended keyboard, go to System Preferences ➜ Keyboard ➜ Input Sources, and click + to add the ABC - Extended keyboard.

Here is a classic tongue twister: māma qímǎ, mǎ màn, māma mà mǎ.

There is much variance in how well or how badly browsers, word processors and text editors display Chinese characters with combining diacritical marks. Over the years, many times, I have found that TextEdit succeeds where other word processors and text editors fail. I have used TextEdit to produce the above five Chinese sentences and included them as images.

With HTML documents we can use ruby annotation and CSS to combine Chinese characters and tone marks. Here is sentence 三 rewritten using ruby annotation and CSS.

  1. ˍˎˏ ̬ˍˏˎˍˎˏˍˎ
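As a rough idea of how such ruby markup could be generated, here is a JavaScript sketch. It is an assumption on my part, not the markup or CSS actually used on this page. It pairs each Chinese character with a spacing tone mark (a modifier letter, not a combining mark) inside the standard ruby and rt elements. The names RUBY_TONES and toRuby are my own.

```javascript
// Spacing tone marks used as ruby text.
const RUBY_TONES = {
  1: "\u02C9", // MODIFIER LETTER MACRON
  2: "\u02CA", // MODIFIER LETTER ACUTE ACCENT
  3: "\u02C7", // CARON
  4: "\u02CB", // MODIFIER LETTER GRAVE ACCENT
};

// Build ruby markup from [character, tone] pairs.
// A tone of 0 means the character carries no tone mark.
function toRuby(pairs) {
  return pairs
    .map(([ch, tone]) =>
      tone ? `<ruby>${ch}<rt>${RUBY_TONES[tone]}</rt></ruby>` : ch
    )
    .join("");
}

// 诺丁汉 (nuò dīng hàn), taken from the Nottingham sentence.
console.log(toRuby([["诺", 4], ["丁", 1], ["汉", 4]]));
```

Positioning and sizing of the tone marks would then be a matter for CSS, which this sketch does not attempt.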

英国,诺丁汉市,秋月茶 ➜ augustmoontea.com

Environment: OSX High Sierra version 10.13.2

Friday, 16 February 2018

Computer Science Internationalization - Time Zones

Yesterday, I wrote some code especially for the Chinese New Year 狗年. Look at the date in the last text box at jsfiddle.net/coas/zvubxato. I wanted to test that my code worked correctly for browsers in different time zones. I am in the UK so my time zone is currently GMT+0000. It is actually really easy to change time zone in OSX.

Go to System Preferences ➜ Date & Time ➜ Time Zone. Change of time zone is live and immediate. No need to close the time zone window nor restart your Mac. In the below screenshot I have selected Australian Eastern Daylight Time. Using JavaScript Date() I get the output Fri Feb 16 2018 22:05:34 GMT+1100 (AEDT). This easy changing of time zones made testing of my code very easy and I did test my code with many different time zones. Why did I test with so many time zones? Well, because it was fun exploring time zones round the world😀 Did you know, for instance, that North Korea 조선민주주의인민공화국 and South Korea 대한민국 have different time zones? North Korea is currently GMT+0830 and South Korea is currently GMT+0900.
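It is also possible to view one instant in several time zones from code, without touching the system setting. The following JavaScript sketch uses the standard Intl.DateTimeFormat with its timeZone option. It complements, rather than replaces, testing with the real system time zone. The helper name timeIn and the choice of the en-GB locale are my own. The zone names are IANA time zone identifiers, and the results depend on the time zone data shipped with the JavaScript engine.

```javascript
// The instant from the Date() output above:
// 22:05:34 at GMT+1100 is 11:05:34 UTC.
const instant = new Date(Date.UTC(2018, 1, 16, 11, 5, 34));

// Format an instant as wall-clock time in a named IANA time zone.
function timeIn(timeZone, date) {
  return new Intl.DateTimeFormat("en-GB", {
    timeZone,
    hour: "2-digit",
    minute: "2-digit",
    second: "2-digit",
    hourCycle: "h23",
  }).format(date);
}

const zones = [
  "Australia/Sydney",
  "Asia/Pyongyang",
  "Asia/Seoul",
  "Europe/London",
];
for (const zone of zones) {
  console.log(zone, timeIn(zone, instant));
}
```

For Australia/Sydney this gives 22:05:34, matching the Date() output above. Asia/Pyongyang and Asia/Seoul come out half an hour apart for this date.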

There are differences between browsers in how well they pick up the OSX time zone. Safari and Google Chrome behave best: whenever I change time zone and run my JavaScript Date() code, the date is displayed in the new time zone. With Firefox, Opera and Yandex one has to either restart the browser or open a new browser window in order to get the date in the new time zone.

Environment: OSX High Sierra version 10.13.2