André 小山 Schappo: 2018

Sunday, 10 June 2018

BBC International Websites

If you are in the UK, most of you will be familiar with the BBC website bbc.co.uk. The BBC does have several localised websites for non English languages and regional news. If you are browsing from the UK it is not at all obvious how to visit their localised websites as there are no links on their UK website.

Here is how I found their localised websites. Firstly I visited bbc.co.uk and, as one would expect, I landed on their UK homepage. But, there are no links to their localised sites. If there were I would be happy and would not be writing this article. Next I visited the Wikipedia BBC article en.wikipedia.org/wiki/BBC and saw that there is a different address bbc.com. I used this address but as I am browsing from the UK it redirects to bbc.co.uk which is not at all helpful. So, I am back to where I started.

To the rescue comes the Opera browser with its built-in VPN. I set the VPN to a non UK location and now when I use bbc.com I do not get redirected to bbc.co.uk. Scroll down to the bottom of the page and I now see links to their localised websites. Even if you are browsing from the UK without such a VPN service, these links will not redirect to bbc.co.uk.

For your convenience I list below all the links to the BBC localised websites.

Arabic عربي bbc.com/arabic
Azeri AZƏRBAYCAN bbc.com/azeri
Bangla বাংলা bbc.com/bengali
Burmese မြန်မာစာ bbc.com/burmese
Chinese 中文 bbc.com/zhongwen/simp
French bbc.com/afrique
Hausa bbc.com/hausa
Hindi हिन्दी bbc.com/hindi
Indonesian bbc.com/indonesia
Japanese 日本語 http://bbc.com/japanese
Kinyarwanda & Kirundi bbc.com/gahuza
Kyrgyz Кыргыз bbc.com/kyrgyz
Marathi मराठी bbc.com/marathi
Nepali नेपाली bbc.com/nepali
Pashto پښتو bbc.com/pashto
Persian فارسی bbc.com/persian
Portuguese bbc.com/portuguese
Russian ру́сский bbc.com/russian
Sinhala සිංහල https://bbc.com/sinhala
Somali bbc.com/somali
Spanish bbc.com/mundo
Swahili bbc.com/swahili
Tamil தமிழ் bbc.com/tamil
Turkish TÜRKÇE bbc.com/turkce
Ukrainian УКРАЇНСЬКA bbc.com/ukrainian
Urdu اردو bbc.com/urdu
Uzbek O'ZBEK bbc.com/uzbek
Vietnamese TIẾNG VIỆT bbc.com/vietnamese

Sunday, 22 April 2018

Computer Science Internationalization - Ideographic Description Characters

I recently created a new Chinese character. This is the very first time I have done so. I created the character to write on a farewell card.

Firstly some background. About two years ago I gave a printout to a colleague, Katherine Hollingsworth, of the Chinese character 好 which means good. I explained how this character has two components, the left part meaning woman and the right part meaning child. Woman 女 + child 子 is something good, hence the meaning good. Katherine, at this time, had one child.

Fast forward two years and Katherine is leaving us. As is traditional, there was a farewell card for us to write our best wishes. That is when I had the idea of creating a new Chinese character. Katherine now has two children. The character I created was ⿰好子 which is a woman with two children. I handwrote this character onto the card. Chinese characters are written into a square which is what I did when handwriting the combination of 好 and 子.

Now to the Computer Science part. In Unicode there are twelve Ideographic Description Characters: ⿰ ⿱ ⿲ ⿳ ⿴ ⿵ ⿶ ⿷ ⿸ ⿹ ⿺ ⿻, U+2FF0➔2FFB. These can be used to construct new characters from combinations of existing characters and/or components. They represent the topological relationship between the components. ⿰ is used to represent a character with two components, a left part and a right part. The ideographic description sequence for my new character is thus ⿰好子. I have given this character the name 双好 which means double good😀
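As a minimal sketch of the code point range just described, the following JavaScript enumerates the twelve Ideographic Description Characters and concatenates one of them with components to form a description sequence. The helper name describe is hypothetical and only for illustration; it does no checking that the number of components suits the operator.

```javascript
// Enumerate the Ideographic Description Characters, U+2FF0 to U+2FFB.
const IDC_FIRST = 0x2FF0;
const IDC_LAST = 0x2FFB;

const idcs = [];
for (let cp = IDC_FIRST; cp <= IDC_LAST; cp++) {
  idcs.push(String.fromCodePoint(cp));
}

// Hypothetical helper: an ideographic description sequence is simply the
// operator followed by its components, in order.
function describe(operator, ...components) {
  return operator + components.join("");
}

const leftRight = idcs[0];        // U+2FF0, left part + right part
const leftMiddleRight = idcs[2];  // U+2FF2, left + middle + right parts

const doubleGood = describe(leftRight, "好", "子");
const surrounded = describe(leftMiddleRight, "子", "女", "子");

console.log(idcs.join(" "));
console.log(doubleGood, surrounded);
```

The sequences are ordinary strings, so they can be stored, searched and compared like any other text.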

I told my Chinese project students about my new Chinese character. One of the students, 王国旭 Wang Guoxu, suggested an additional way of constructing the character using a left part, a middle part and a right part. The sequence for his suggestion is ⿲子女子. I really like this suggestion as we now have a woman surrounded by her two children.

Update: I have now devised a second new Chinese character. It is related to the above character and consists of four, side by side components. The ideographic description sequence is: ⿰⿰子男⿰女子. I will leave it to you, the reader, to determine what it represents.

Update 2: Another arrangement of the 双好 components is to have the woman above the children. The ideographic description sequence is: ⿱女⿰子子. See 고.한국/hao2 or twitter.com/andreschappo/status/1046367105141153793 for calligraphic versions of this arrangement.

Update 3: Katherine now has 3 children and so it is time for a triple good 三好 new Chinese character with 3 children components. Possible arrangements of the components include ⿱⿰女子⿰子子 and ⿱女⿲子子子. Please see 고.한국/hao3 or twitter.com/andreschappo/status/1093833016353406976 for calligraphic versions of these arrangements.

There is a useful online utility which can be used to compose and visualise new characters. Here is a new character representing a woman with three children zi.tools/?secondary=ids&seq=⿱女⿲子子子

Saturday, 10 March 2018

Computer Science Internationalization - Validating People Names

When coding, it is essential to consider and account for edge cases. Whilst coding Internationalised Programming Challenge 17 jsfiddle.net/coas/4djhso1y I happened upon an unexpected and fascinating edge case.

Before I reveal the edge case I need to give you some background information, starting with some Chinese characters.

娥，鄂，鹅，仒，厄，戹*，屵*，阨*，阿*，呝，俄，砨*，偔，堨*，圔*，誒*，噁*，儑*，貖*，礘*，櫮，鰪*，岋，阸*，妸，咢，匎*，卾，隘*，廅*，僫，蕚，噩，鍔，額，鰐，讹，吪，妿，咹，胺，啞*，蛯*，搤，磀，遻*，嶭*，騀，顎，鶚 ...and many more at chinese-tools.com/tools/sinograms.html?p=e

All these Chinese characters can be written in pinyin as E or e. Those characters marked with * have multiple meanings, hence multiple pronunciations, hence multiple ways of writing in pinyin. Those characters not marked with * are only written in pinyin as E or e. Some of you may well be thinking, what of tone marks? Well, unless I explicitly request it, I have never seen a Chinese person write pinyin with tone marks.

Some of these Chinese characters are family names and some would be suitable for given names.

So, now to the edge case. A Chinese name when written in pinyin could be E E or Ee E. I asked on Weibo 微博 whether anyone knew of any Chinese name which when written in pinyin is E E or Ee E. One person responded with the Chinese name 鄂娥 which written in pinyin is E E.

I reason that a person whose only language is English would think E E are initials and not the full name. Actually, before I considered name edge cases I would probably also have thought E E are initials. I have been aware for a long time that some Chinese characters can be written in pinyin as single letters such as e or a but I had not made the connection with people names.

This example illustrates that programmers need to thoroughly research naming conventions in different countries/cultures/languages before writing validation code.

I would like to encompass several international naming conventions in my Challenge 17. So far, I have coded validation rules for Chinese 中文, Khmer ភាសាខ្មែរ, Korean 한국어, Vietnamese Tiếng Việt and a catchall. I welcome contributions of international naming rules which I will code and incorporate into Challenge 17. You can email me, or if you do not know my email address you can tweet me @andreschappo or contact me on Weibo 微博 @schappo.

Techie stuff: The regex I use for Chinese name validation is:

XRegExp("^((?![\\p{InKangxi_Radicals}\\p{InCJK_Radicals_Supplement}\\p{InCJK_Symbols_and_Punctuation}])\\p{Han}){2,4}$","u")

My regex is using negative look-ahead, recognisable by the ?! construct. For each character, I am checking that it is a Han* character and is not a radical or symbol or punctuation character. This can be generalised to:

(?!Character_Set_B)Character_Set_A

which reads as: a character must not be in Character_Set_B and must be in Character_Set_A in order to be valid.

The same can be achieved using negative look-behind:

XRegExp("^(\\p{Han}(?<![\\p{InKangxi_Radicals}\\p{InCJK_Radicals_Supplement}\\p{InCJK_Symbols_and_Punctuation}])){2,4}$","u")

This can be generalised as:

Character_Set_A(?<!Character_Set_B)

which reads as: a character must be in Character_Set_A and not in Character_Set_B in order to be valid.
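The two generalised forms above can also be tried without the XRegExp library. The following is a sketch only, under stated assumptions: it uses the native JavaScript RegExp with the u flag, \p{Script=Han} in place of \p{Han}, and explicit code point ranges in place of the three named blocks, because native JavaScript has no block property escapes. The names EXCLUDED, lookAhead, lookBehind and isValidName are introduced here for illustration and are not from Challenge 17.

```javascript
// Code point ranges of the three Unicode blocks excluded by the pattern:
//   CJK Radicals Supplement      U+2E80..U+2EFF
//   Kangxi Radicals              U+2F00..U+2FDF
//   CJK Symbols and Punctuation  U+3000..U+303F
const EXCLUDED = "[\\u2E80-\\u2EFF\\u2F00-\\u2FDF\\u3000-\\u303F]";

// Negative look-ahead form: (?!Character_Set_B)Character_Set_A
const lookAhead = new RegExp(
  `^(?:(?!${EXCLUDED})\\p{Script=Han}){2,4}$`, "u");

// Negative look-behind form: Character_Set_A(?<!Character_Set_B)
const lookBehind = new RegExp(
  `^(?:\\p{Script=Han}(?<!${EXCLUDED})){2,4}$`, "u");

// Hypothetical wrapper: a name is accepted only if both forms agree.
function isValidName(name) {
  return lookAhead.test(name) && lookBehind.test(name);
}

console.log(isValidName("鄂娥"));          // two Han characters
console.log(isValidName("E E"));           // Latin letters and a space
console.log(isValidName("\u2F25\u2F26"));  // two Kangxi radicals
```

Both forms accept and reject the same strings; which one reads better is a matter of taste.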

The Chinese for Regular Expression (regex) is 正则表达式.

* A Han character, in this context, is actually a CJK (Chinese or Japanese or Korean) character but that is far too long a story for this blog article.

Update 30th March 2018: In my XRegExp, above, I am using the u flag. This enables Unicode but only the BMP (Basic Multilingual Plane). There is an A flag which enables the whole of Unicode, BMP + Astral characters, but, for quite some time, I could not get this to work. I did eventually find the problem and a solution. The problem is that jsdelivr minification breaks XRegExp. My solution was simply not to use the jsdelivr minified version of XRegExp. I am, therefore, using cdn.jsdelivr.net/npm/xregexp@4.1.1/xregexp-all.js and not cdn.jsdelivr.net/npm/xregexp@4.1.1/xregexp-all.min.js.

In addition to using the A flag I have had to make some minor changes to my regex. My updated regex is:

XRegExp("^((?!\\p{InKangxi_Radicals}|\\p{InCJK_Radicals_Supplement}|\\p{InCJK_Symbols_and_Punctuation})\\p{Han}){2,4}$","A")

The positive outcome for Chinese and Korean Hanja names validation is that these names can now contain characters from the Unicode SIP (Supplementary Ideographic Plane) in addition to characters in all the other Unicode planes.

Saturday, 3 March 2018

Computer Science Internationalization - Bidirectional Text

For this blog article I have taken the last two text boxes from my internationalised programming challenge 9 at jsfiddle.net/coas/qa8190kn. Both text boxes contain the same text which is a mix of English and Hebrew. English is written and read left to right. Hebrew is written and read right to left. We are going to look more closely at bidirectional (bidi) text in browsersʼ text boxes. The two text boxes are:

text box 1: direction is Left ➜ Right

text box 2: direction is Right ➜ Left

The order of the text in these two boxes is in display order which is the order presented to users. The order in which it is actually stored is called memory order or logical order. It is the order in which I typed the text. The memory order is:

So, a is the first character and h is the last character of the text. Selection of text is determined by memory order. Selecting characters 3 thru 6 will give cdבא, characters 6 thru 9 gives הדגב, characters 11 thru 14 gives תזef and so forth.
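Selection by memory order can be sketched in JavaScript, where string indexing always follows memory order regardless of how the text is displayed. The sample string is an assumption: it is inferred from the selections quoted above, and its tenth character in particular is a guess. The Hebrew letters are written as escapes so that the source code itself reads in memory order. The helper names select and codePoints are hypothetical.

```javascript
// Assumed memory-order text: four Latin letters, eight Hebrew letters,
// four Latin letters. Only the positions quoted in the article are certain.
const memoryOrder =
  "abcd" + "\u05D0\u05D1\u05D2\u05D3\u05D4\u05D5\u05D6\u05EA" + "efgh";

// Select characters first..last (1-based, inclusive) in memory order.
function select(text, first, last) {
  return Array.from(text).slice(first - 1, last).join("");
}

// List code points so the result is unambiguous whatever the display order.
function codePoints(text) {
  return Array.from(text, ch =>
    "U+" + ch.codePointAt(0).toString(16).toUpperCase().padStart(4, "0"));
}

console.log(codePoints(select(memoryOrder, 3, 6)));
console.log(codePoints(select(memoryOrder, 11, 14)));
```

Printing code points rather than the characters avoids the terminal or browser reordering the output for display.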

Now onto the selection process. One way of selecting text is to use shift in conjunction with the arrow keys. You should make the following associations:

text box 1, Left ➜ Right
- right arrow key becomes forward
- left arrow key becomes back
text box 2, Right ➜ Left
- left arrow key becomes forward
- right arrow key becomes back

Having made the above associations, forget all about left and right and think only of forward and back where forward is moving forwards through the text (memory order) and back is moving backwards through the text (memory order).

Now to selection, starting with the cursor between b and c:

shift forward forward forward, will select cdא
shift back back, will select ab

These key sequences will select the same text in both textboxes 1 & 2. The difference is in how it is presented to the user. It is presented to the user in display order.

I suggest you practice selecting text in text boxes 1 & 2. Initially it will seem somewhat confusing. When you are selecting text in text boxes 1 & 2, use the memory order text box as a reference to more easily determine what should be selected.

Now try selecting text in the following two text boxes. They both contain the same text, which is a mix of Khmer and Arabic. Additionally, paste your selections into a word processor.

text box 3: direction is Left ➜ Right

text box 4: direction is Right ➜ Left

Onto a related topic which is cursor movement in browser text boxes. Forget text selection. Cursor movement with left and right arrow keys is in display order only. I think there should be a browser option to switch between cursor movement by display order and cursor movement by memory order. I would like the same option in word processors and text editors.

Tuesday, 20 February 2018

Computer Science Internationalization - Mandarin Chinese Tones

Standard Mandarin Chinese uses four tone marks for pronunciation: ¯ ´ ˇ `. Normally, one only encounters these tone marks with Chinese written in pinyin but there is absolutely no reason why these tone marks cannot be used with Chinese characters 汉字.

Letʼs use the sentence: Nottingham is the home of Robin Hood. In pinyin this would be written: nuò dīng hàn shì luó bīn hàn de gù xiāng. In Mandarin Chinese this would normally be written: 诺丁汉是罗宾汉的故乡.

Here are some more simple Chinese sentences with tone marks. I have made the text a little larger so you can see the tone marks more clearly.

For the four tone marks I am using Unicode Combining Diacritical Marks, specifically: U+0304 ¯ COMBINING MACRON, U+0301 ´ COMBINING ACUTE ACCENT, U+030C ˇ COMBINING CARON, U+0300 ` COMBINING GRAVE ACCENT. These diacritics combine with the immediately preceding character.
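Here is a minimal JavaScript sketch of appending these combining marks to Chinese characters programmatically. The names TONE_MARKS and addTones are hypothetical, and using 0 to mean no tone mark is an assumption made for this example.

```javascript
// The four combining diacritical marks listed above, keyed by tone number.
const TONE_MARKS = {
  1: "\u0304", // COMBINING MACRON
  2: "\u0301", // COMBINING ACUTE ACCENT
  3: "\u030C", // COMBINING CARON
  4: "\u0300", // COMBINING GRAVE ACCENT
};

// Append the mark for tones[i] after the i-th character of text.
// A tone of 0 (or any unknown tone) leaves the character unmarked.
function addTones(text, tones) {
  return Array.from(text)
    .map((ch, i) => ch + (TONE_MARKS[tones[i]] || ""))
    .join("");
}

// nuò dīng hàn: fourth, first and fourth tones.
const marked = addTones("诺丁汉", [4, 1, 4]);
console.log(marked);
console.log(marked.length); // 6: three base characters plus three marks
```

Because each mark is a separate code point following its base character, the marked string is twice as long as the original.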

Some Chinese characters have different pronunciations (tones) with different meanings. 与 and 为 are two such characters. Now, suppose I am not sure which is the correct tone, which is highly likely as my knowledge of Chinese is only basic. A single base character can have more than one Unicode combining diacritical mark. So, when I am uncertain I can combine all the relevant diacritics and let people knowledgeable in Chinese decide which is the correct tone from the context. Letʼs take sentence 二 and apply multiple tone marks to 与 and 为.

Actually, I can imagine those knowledgeable in Chinese, using my multiple diacritics methodology illustrated in sentence 四, being able to write a sentence having multiple sensible meanings.

So, how to type the tone marks? Here is one method using OSX and the ABC - Extended keyboard. Firstly, type your Chinese text or copy paste some Chinese text. Now switch to the ABC - Extended keyboard. Place your cursor immediately after a Chinese character and then use one of the following key combinations.

first tone ¯, use the key combination: alt ⇧ A
second tone ´, use the key combination: alt ⇧ E
third tone ˇ, use the key combination: alt ⇧ V
fourth tone `, use the key combination: alt ⇧ grave

Repeat for each Chinese character in your text, excepting those that do not have a tone mark. 的 when used as a possessive particle does not have a tone mark. If your OSX system is not set up for the ABC - Extended keyboard, go to System Preferences ➜ Keyboard ➜ Input Sources, and click + to add the ABC - Extended keyboard.

Here is a classic tongue twister: māma qímǎ, mǎ màn, māma mà mǎ.

There is much variance in how well or how badly browsers, word processors and text editors display Chinese characters with combining diacritical marks. Over the years, many times, I have found that TextEdit succeeds where other word processors and text editors fail. I have used TextEdit to produce the above five Chinese sentences and included them as images.

With html documents we can use ruby annotation and CSS to combine Chinese characters and tone marks. Here is sentence 三 rewritten using ruby annotation and CSS.

秋ˍ 月ˎ 茶ˏ 以 ̬ 英ˍ 国ˏ 诺ˎ 丁ˍ 汉ˎ 为ˏ 基ˍ 地ˎ 。

英国，诺丁汉市，秋月茶 ➜ augustmoontea.com

Environment: OSX High Sierra version 10.13.2

Friday, 16 February 2018

Computer Science Internationalization - Time Zones

Yesterday, I wrote some code especially for the Chinese New Year 狗年. Look at the date in the last text box at jsfiddle.net/coas/zvubxato. I wanted to test that my code worked correctly for browsers in different time zones. I am in the UK so my time zone is currently GMT+0000. It is actually really easy to change time zone in OSX.

Go to System Preferences ➜ Date & Time ➜ Time Zone. Change of time zone is live and immediate. No need to close the time zone window nor restart your Mac. In the below screen shot I have selected Australian Eastern Daylight Time. Using JavaScript Date() I get the output Fri Feb 16 2018 22:05:34 GMT+1100 (AEDT) . This easy changing of time zones made testing of my code very easy and I did test my code with many different time zones. Why did I test with so many time zones? Well, because it was fun exploring time zones round the world😀 Did you know, for instance, that North Korea 조선민주주의인민공화국 and South Korea 대한민국 have different time zones. North Korea is currently GMT+0830 and South Korea is currently GMT+0900.

There are differences between browsers in how well they pick up the OSX timezone. Safari and Google Chrome behave best as whenever I change timezone and run my JavaScript Date() code the new timezone date is displayed. With Firefox, Opera and Yandex one has to either restart or open up a new browser window in order to get the new timezone date.
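As a complement to changing the system time zone, JavaScript can also format one instant for several time zones directly, using the timeZone option of Intl.DateTimeFormat. This is a sketch only, not the code from the jsfiddle; the function name wallClock is hypothetical, and the instant is chosen to correspond to the Date() output quoted above.

```javascript
// Return the wall-clock time "HH:MM" of an instant in a given IANA time zone.
function wallClock(instant, timeZone) {
  const parts = new Intl.DateTimeFormat("en-US", {
    timeZone,
    hour: "2-digit",
    minute: "2-digit",
    hour12: false,
  }).formatToParts(instant);
  const get = type => parts.find(p => p.type === type).value;
  return get("hour") + ":" + get("minute");
}

// 16 February 2018, 11:05:34 UTC.
const instant = new Date(Date.UTC(2018, 1, 16, 11, 5, 34));

const zones = [
  "Europe/London",
  "Australia/Sydney",
  "Asia/Seoul",
  "Asia/Pyongyang",
];
for (const zone of zones) {
  console.log(zone, wallClock(instant, zone));
}
```

The results depend on the time zone data shipped with the JavaScript runtime, so an out-of-date runtime may give different answers for zones whose rules have changed.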

Environment: OSX High Sierra version 10.13.2

Sunday, 21 January 2018

Computer Science Internationalization - Exercises

My specialism is Computer Science internationalisation (i18n) and I encourage students to think globally when programming. Recently I gave an i18n workshop to school students. If you look at schappo.blogspot.co.uk/2018/01/computer-science-internationalization_17.html you will see that all the activities are international. Take special note of the feedback at the end of my blog article.

A couple of weeks ago I gave a programming challenge to University students. If you look at https://jsfiddle.net/coas/wda45gLp you will see that the focus is on diacritics with people and place names.

Another example is a use of CSS applied to the Chinese culture schappo.blogspot.co.uk/2018/01/computer-science-internationalization_16.html. One of the activities in the aforementioned workshop was to change 👀 from right looking to left looking 👀 . Unfortunately there was not sufficient time to task the students with writing CSS to produce the Chinese upside down 福.

One of the classic Computer Science exercises when teaching students a new computer language is to task them to write the code to print "Hello World". It has become almost traditional and expected. Why not task the students to write code to print "你好世界" which is Chinese for Hello World?

Another classic Computer Science exercise is to code a bubble sort. In addition to sorting ASCII text, why not also task the students to bubble sort Korean and Thai text? This will give students an appreciation of sorting order in non English languages. 〖 Chinese has at least 2 different sorting orders which will give different results. I will write about this another time. 〗
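One possible shape for such an exercise solution, sketched in JavaScript: a bubble sort that takes its comparison function as a parameter, so that a language-aware comparison from Intl.Collator can be plugged in. The function name bubbleSort and the sample words are chosen for illustration only.

```javascript
// Bubble sort that returns a sorted copy, using compare(a, b) to order items.
function bubbleSort(items, compare) {
  const result = items.slice();
  for (let end = result.length - 1; end > 0; end--) {
    let swapped = false;
    for (let i = 0; i < end; i++) {
      if (compare(result[i], result[i + 1]) > 0) {
        // Swap neighbours that are out of order.
        [result[i], result[i + 1]] = [result[i + 1], result[i]];
        swapped = true;
      }
    }
    if (!swapped) break; // already sorted, stop early
  }
  return result;
}

// Language-aware comparison functions for Korean and Thai.
const korean = new Intl.Collator("ko");
const thai = new Intl.Collator("th");

console.log(bubbleSort(["다", "가", "나"], korean.compare));
console.log(bubbleSort(["ข", "ค", "ก"], thai.compare));
```

Passing the comparison in as a parameter keeps the sorting algorithm itself unchanged from the ASCII version, which makes the role of the collation rules easy to see.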

This will be one of my final year projects which I will offer to students for the 2018/2019 academic year, and which I will supervise. Here is a draft specification:

< start student project specification >

Title: Internationalised Computer Science Exercises

One way in which Computer Science departments could encourage global thinking and develop global skills is to internationalise the exercises set to students. Your task will be to produce a set of exercises along with solutions. You can choose the computer languages but you should encompass at least 2 languages. Python and JavaScript would seem to be good choices. Try to find both a school and a university Computer Science department that would be willing to trial your internationalised Computer Science exercises. You should classify your exercises as beginner, intermediate or advanced so educators can use the most appropriate exercises. You should encompass several world cultures in your exercises. I can help you with the Chinese culture. Here are some examples of internationalised Computer Science exercises which I have made public as programming challenges. All my internationalised programming challenges are available at jsfiddle.net/user/coas/fiddles

Maybe you are not confident in your programming abilities and are thinking this project is beyond you. No worries! Just write internationalised exercises at the beginner level.

Maybe you are a gifted programmer and want to be challenged. If that is the case, write internationalised exercises at the advanced level.

I envisage you will make your exercises publicly available online but without the solutions. Some educators and students will request the solutions from you. Do not give them your solutions. Instead, guide them to a solution or solutions. They will learn much more this way. It will also give you an insight into how people solve problems and how much they know about internationalised programming. It is my experience that the majority in Academia know nothing about internationalised programming. You can use a method I devised for guiding people to solutions which I describe in my Internationalised Programming Challenge 1, above.

I will be attempting to solve some of your exercises using the methodology described above. I may well ask you for a clue or clues. I will give you feedback on how I tackled the exercise and I will give you my solution(s). If you are a gifted programmer try writing an exercise which you think I cannot solve or to put it another way, challenge me! (hint: I am rubbish at recursion😰 )

You should endeavour to obtain feedback from both educators and students.

I will be accepting more than one student for this project. Each student will write internationalised Computer Science exercises for a different set of world cultures. I will allow some overlap with other students but this will be strictly limited. I will be offering this project to students for a number of years. My vision is that we will build up an extensive library of exercises that cover many world cultures. Educators can then mix and match exercises from different cultures. One scenario that comes to mind is of a school that teaches both Geography and Computer Science. During term 1 the Geography teacher covers, say, Poland. The Computer Science teacher could, as a consequence, choose Polish culture Computer Science exercises.

< end student project specification >

I have described setting Computer Science exercises in the context of world cultures but there is no reason why the same principles could not be applied to other disciplines. So, letʼs also have a set of Computer Science exercises that encompass the Arts. It may well also have the added benefit of encouraging collaboration between Artists and Computer Scientists, Art organisations and Computer Science departments.

Here is a first thought for a Computer Science exercise that embeds art:

You are given a table of data. Each record in the table consists of four fields: colour name, red value, green value, blue value. Write a program to display the colour names of all those colours where, for example, the green value is greater than 200. You should use variables for the red, green, blue thresholds. You will then be able to run your program with different threshold values and see which colour names are displayed. Red, green and blue values are in the range 0 to 255, inclusive.
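One possible solution to this exercise, sketched in JavaScript. The table contents are sample data chosen for illustration (four CSS named colours), and using a threshold of -1 to mean that a channel is not restricted is an assumption of this sketch, not part of the exercise.

```javascript
// Sample table: colour name, red value, green value, blue value.
const colours = [
  { name: "lime", red: 0, green: 255, blue: 0 },
  { name: "tomato", red: 255, green: 99, blue: 71 },
  { name: "gold", red: 255, green: 215, blue: 0 },
  { name: "navy", red: 0, green: 0, blue: 128 },
];

// Thresholds held in variables so they are easy to change.
// -1 means "no restriction", as every value is greater than -1.
const redThreshold = -1;
const greenThreshold = 200;
const blueThreshold = -1;

// Names of all colours whose values are greater than each threshold.
function namesAbove(table, red, green, blue) {
  return table
    .filter(c => c.red > red && c.green > green && c.blue > blue)
    .map(c => c.name);
}

console.log(namesAbove(colours, redThreshold, greenThreshold, blueThreshold));
```

Running it again with different threshold values shows which colour names pass each combination.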

I may well set another student final year project to produce a set of Arts Computer Science exercises.

Thursday, 18 January 2018

Computer Science Internationalization - Kazakhstan

I discovered, just yesterday, that Қазақстан Kazakhstan is to change its writing system from the Cyrillic script to the Latin script. One proposal for the new Latin script based writing system makes extensive use of apostrophes nytimes.com/2018/01/15/world/asia/kazakhstan-alphabet-nursultan-nazarbayev.html

One argument against this apostrophe based system, from the above article, is that it will be impossible to create twitter hashtags. Well yes and no. If you use the standard apostrophe which most standard keyboards will produce, then yes it will break twitter hashtags. There is though more than one apostrophe in Unicode. If you look at twitter.com/andreschappo/status/953903964722024448 and twitter.com/andreschappo/status/953944089896083456 you will see that there is one apostrophe which does not break twitter hashtags.

With my OSX ABC - Extended keyboard I can, as with any standard keyboard, type the usual apostrophe ' U+0027 APOSTROPHE but also I can type the non breaking apostrophe ʼ U+02BC MODIFIER LETTER APOSTROPHE using the key combination alt i. There is one minor complication. If the next character you will be typing, after typing the non breaking apostrophe, is either u or o then you will end up with ư U+01B0 LATIN SMALL LETTER U WITH HORN or ơ U+01A1 LATIN SMALL LETTER O WITH HORN, respectively. A simple way round this problem is to type the key sequence alt i followed by space. Actually, I will make it my working practice to always use the key sequence because, inevitably, I will forget about the consequences of typing u or o after the non breaking apostrophe.
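The difference between the two apostrophes can be seen from their Unicode general categories: U+02BC is a letter whereas U+0027 is punctuation. The JavaScript sketch below uses a deliberately simplified hashtag pattern (a # followed by letters, digits or underscores). This is an assumption for illustration and not twitter's actual hashtag rules; the sample tags are made up.

```javascript
// Simplified hashtag pattern: '#' then letters, numbers or underscore.
// Assumption: real services apply more elaborate rules than this.
const HASHTAG = /#[\p{L}\p{N}_]+/u;

// Return the part of the text that the simplified pattern treats as a hashtag.
function extractHashtag(text) {
  const match = HASHTAG.exec(text);
  return match ? match[0] : null;
}

const withModifierLetter = "#ab\u02BCcd"; // U+02BC MODIFIER LETTER APOSTROPHE
const withApostrophe = "#ab\u0027cd";     // U+0027 APOSTROPHE

console.log(/\p{L}/u.test("\u02BC")); // is U+02BC a letter?
console.log(/\p{L}/u.test("\u0027")); // is U+0027 a letter?
console.log(extractHashtag(withModifierLetter));
console.log(extractHashtag(withApostrophe));
```

Under this pattern the tag containing U+02BC survives whole, while the tag containing U+0027 is cut short at the apostrophe.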

Your keyboard may well have some key combination or key sequence for typing the non breaking apostrophe. If not, then it is not actually that hard to modify a keyboard to produce the non breaking apostrophe. The last time I modified a keyboard I used Ukelele scripts.sil.org/ukelele

I have been searching for new Kazakh Latin script words and so far have found three. I have replaced breaking apostrophes with the non breaking apostrophe and put them in a tweet as hashtags ➜ twitter.com/andreschappo/status/954034152621387776

If you are wondering why I am using a lot of upper case words, it is a Unicode convention.

Wednesday, 17 January 2018

Computer Science Internationalization - Internationalization Workshop

This morning I gave a Computer Science Internationalization Workshop to school students. Twelve students attended my workshop which lasted just over two hours. As is my usual practice for such workshops, I made a list of topics to cover, with more topics than could be covered in the time available. I then adapt as necessary. I placed emphasis on incorporating world cultures into one's thinking and programming practices. This was a participatory workshop so the students were fully involved.

I started the workshop by saying: Bon Matin, Guten Morgen, 안녕하세요, おはようございます, 早上好. So, right from the start, it is evident that this workshop is international in nature. I explained that one does not need to know multiple languages in order to build software for the world but one does need to understand characteristics of human language scripts. Chinese is much more compact than English, which gives a big advantage when using microblog services which have a character limit; twitter now has a 280 character limit. Chinese and Japanese do not use spaces to separate words and so detecting word boundaries is much harder than in English, which uses spaces and punctuation characters to separate words.

Topics covered in this workshop included:

McDonalds: I explained that common practice with global companies is to have a global gateway which has a set of links to their localised websites. Starting at UK McDonalds I asked the students to, without googling, find the McDonalds global gateway. It used to be possible but is no longer. I am not sure why McDonalds removed the link. Next I asked the students to find the global gateway but this time they can google. They soon found it. I explained: the way I found it was by using Opera with VPN set to America, visiting McDonalds USA and on their page was a link to the global gateway corporate.mcdonalds.com/mcd/country/map.html. I then asked the students to explore the different localisations and look for differences in foods and styling to the UK website. Next was the Asia Pacific section of the global gateway. I told the students there is a problem and then asked them what the problem is. The problem is that for China the icon displayed is Youtube but the link is actually for the Chinese video service Youku i.youku.com/u/UMzg1MDY4MTg0. I explained: China blocks access to many sites and services, youtube and twitter are blocked in China.
Writing names correctly: Next was how to write my name correctly, i.e. André and not Andre. First I asked the students to write André. They discovered that by holding down the e key a popup appears showing e combined with various diacritics. Next, using the keyboard viewer, I showed the general procedure: hold down the alt key to see all the diacritics supported by that keyboard mapping, release the alt key to see all the letter/diacritic combinations supported by the keyboard mapping. We used the ABC - Extended keyboard mapping.
Writing Chinese: Next was the basics of writing Chinese using the pinyin Simplified Chinese Input Method. I tasked the students with writing: xiao shan (小山), dian you (电邮) and zai xian (在线). Some of you will have already worked out which way this is going😀 Put these Chinese characters together and we have my Chinese email address 小山@电邮.在线. I explained: "We are on the threshold of a huge growth in registration and usage of internationalised email addresses. Rajasthan recently launched a Hindi email service whereby residents can have a free Hindi email address. There is a group dedicated to promoting usage of internationalised domain names and internationalised email addresses by the name of UASG (Universal Acceptance Steering Group)." I then tasked the students to read and browse uasg.tech.
Character counting: Next was character counting. I had 2 text strings: ① one (two)! [three] four five. ② pear（plum）grape！apple. I asked: from a computer's point of view, how many characters are there in these text strings. Reader: I will let you count for yourselves😀 I explained: "the brackets are full width forms and so what might be a space character followed by an opening bracket is in actual fact just a single character. Such characters are used in Chinese and Japanese as every character, including punctuation, has the same width."
Numbered webpage lists: The next task was a numbered list. I had a pre-prepared html template which the students used. We started with a standard decimal ordered list. Then I showed the students the inline CSS: <ol style="list-style-type:thai;">. I then tasked the students: "Without googling, guess at language names and try them with the list-style-type." Next task was: "You can now google. Find the valid list-style-type language names." Several quickly found developer.mozilla.org/en-US/docs/Web/CSS/list-style-type which is a page I frequently use. I also showed codepoints.net/search?gc=Nd to show how many different numbering systems there are which will all, hopefully, be supported at some time in the future.
Rotating Unicode Characters: This programming challenge was, using CSS, change the Unicode character 👀 from right looking to left looking. I allowed googling right from the start. I did not use the word "rotate" or rotating when giving this challenge. Yesterday I wrote a blog article for this challenge. I showed this blog article briefly towards the end of the workshop 👀 ➜ schappo.blogspot.co.uk/2018/01/computer-science-internationalization_16.html. In my blog article I demonstrate a cultural application.
Language Adaptive Web App: In my introductory web programming module, I write demonstrator code for lectures which I then make the source available to the students to use or not use, as they wish. The web app I used in this workshop was one that is language adaptive, English and Chinese. I firstly showed the school students how to change the "preferred language for webpages" which is set by users in preferences. I used Firefox for this demo. I firstly showed my Web App with my Firefox set to preferred language = English. So the text in my buttons was in English. I then changed the preferred language to Chinese. Refresh the page and my button text is now in Chinese. I emphasised that this does not happen automatically and it does require programming to make it happen and again with most everything else I did in this workshop it is simple to do. I showed my source code. The crucial statement is: if(/zh|zh-CN|zh-TW|zh-SG|zh-HK|cmn/i.test(navigator.languages[0])){.... One thing I forgot to do was show my CJK version of my web app in which I have changed every identifier to Chinese, Japanese or Korean😀
Unicode: I explained: "There are 136000+ characters in Unicode. In ASCII there are just 128 characters. Each Unicode character has a unique codepoint and when represented is usually prefixed with U+. The Unicode Consortium guarantees that once a character is officially included into Unicode, its codepoint will never change. The Unicode character set is continually being developed and a new version is released every summer. Anyone can submit a proposal for character(s) to be included into the Unicode character set. There is a Unicode Consortium group which deals specifically with Emoji proposals as Emoji are hugely popular." I then tasked the students to browse the Unicode character set using the "Emoji & Symbols" viewer.
Regular Expressions (regex): We visited speakerdeck.com/andre_schappo/unicode-regular-expressions. I quickly went through some of the ASCII based slides. I stated: "If you search the net you will find thousands upon thousands of ASCII based regex examples and explanations but what you will not see is...". I then proceeded to slide 11. I explained the cultural references and humour in slides 11, 12 and 13. This is a good example of how world cultures can be incorporated into one's programming practices.
A programming challenge: I finished with a programming challenge ➜ jsfiddle.net/coas/wda45gLp. I briefly explained the code but did not have time to show my solutions to the challenge.
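The language check quoted in the Language Adaptive Web App item above can also be tried outside a browser. What follows is a minimal sketch, not the workshop code: the function names prefersChinese and buttonLabel and the two button labels are hypothetical, and the regular expression is an anchored variant of the one quoted above, so that it only matches language tags that begin with zh or cmn.

```javascript
// Hypothetical helper: does the user's first preferred language look Chinese?
// `languages` is an array shaped like navigator.languages, e.g. ['zh-CN', 'en'].
function prefersChinese(languages) {
  const first = (languages && languages[0]) || '';
  // Anchored so that only tags starting with "zh" or "cmn" match.
  return /^(zh|cmn)(-|$)/i.test(first);
}

// Hypothetical labels for a single button; a real app would hold many strings.
function buttonLabel(languages) {
  return prefersChinese(languages) ? '提交' : 'Submit';
}

console.log(buttonLabel(['zh-CN', 'en'])); // 提交
console.log(buttonLabel(['en-GB', 'zh'])); // Submit
```

In a browser the call would be buttonLabel(navigator.languages), so that changing the preferred language and refreshing the page changes the label.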

During the workshop I said: "One problem is that there is no culture of programming internationalisation in school, college and university Computer Science departments. Therefore, staff and students do not even think to ask themselves questions such as: 'I wonder if it is possible to number an html list with numbers from other languages?'. If they did ask themselves such a question then a quick google and they would soon discover that it is possible and that it is so easy to do."

I did not have much opportunity to chat with the students but one student told me: "In his school, they are only taught ASCII programming and ASCII text processing. He learned about Unicode by his own efforts and outside of the classroom." This is typical of schools, colleges and universities. Unicode is not being taught but it most definitely should be taught.

I concluded with: "I hope you will teach your classmates and teachers about Computer Science internationalisation. If your school would like me to visit and give a workshop or presentation, I am more than happy to do so. All I ask is that my visit expenses are paid by your school."

Readers of this article: I extend the same offer to you. If you would like me to give a workshop or presentation on Computer Science internationalisation at your company, organisation, school, university, I am happy to do so. All I ask is that my expenses are paid. Computer Science internationalisation is my specialism and my passion. I can talk for hours, days, weeks, months, years about Computer Science internationalisation😀 I am not hard to find on the internet but one way of contacting me is to tweet me at twitter.com/andreschappo

Workshop participants: I hope you enjoyed this workshop. I enjoyed giving it. If there is anything I have forgotten that you would like to see included in this article, please email or tweet me. During the workshop there was not sufficient time to show you a youtube video in which Chris Broad explains one of the ways in which Japanese people celebrate Christmas. See ➜ youtube.com/watch?v=SFw-TZzqX8M. Can you spot the word play in the name of his youtube channel?

Eddy Dunton, a workshop participant, emailed me this feedback: "Overall I thoroughly enjoyed it, it touched on a topic which I had never even thought of, never mind thought to address, furthermore I thought the structure was good and it never felt like it was dragging. Personally I think it could benefit from more practical activities, although the ones we did were good quality, I didn’t feel there was enough of them. It would also be interesting to comment on the cultural differences on the different McDonald’s pages (for example I noticed the US page focused on the price of the food whereas the UK page pushed the healthier options)."

Tuesday, 16 January 2018

Computer Science Internationalization - Rotating Unicode Characters

I sometimes encounter Unicode characters that have direction and sometimes that direction is not the direction I want. Letʼs take the character 👀 . The eyes are looking to the right. Recently I wanted the eyes to look to the left. Using CSS we can change the direction so that we have left looking eyes 👀 . Here is the inline CSS I am using in this blog article to produce left looking eyes 👀 .

<span style="display:inline-block;transform:rotate(180deg);vertical-align:15%;">
👀
</span>

福 is considered an auspicious character in the Chinese culture. If you search the net google.co.uk/search?q=福&tbm=isch you will find many images of 福 and many of these images are of an upside down 福。The CSS I use to display 福 is exactly the same as above.

Letʼs display a 福 that has a bit more impact.

福

Here is the CSS I use to display the big red 福.

<span style="display:block;transform:rotate(180deg);text-align:center;
text-shadow: -4px -4px 4px Grey;font-weight:800;font-size:100pt;
font-family:'Hannotate SC',cursive;color:Red;">
福
</span>

Why 福 ? It is all to do with homophones, words that sound the same(ish). Here is an explanation from en.wikipedia.org/wiki/Fu_(character)

When displayed as a Chinese ideograph, Fú (福) is often displayed upside-down on diagonal red squares. The reasoning is based on a wordplay: in nearly all varieties of Chinese: the words for "upside-down" (倒, Pinyin: dào) and "to arrive" (到, Pinyin: dào) are homophonous. Therefore, the phrase an "upside-down Fú （福）" sounds nearly identical to the phrase "Good luck arrives". Pasting the character upside-down on a door or doorpost thus translates into a wish for prosperity to descend upon a dwelling.

Sunday, 14 January 2018

Computer Science Internationalization - Email Address Internationalization (EAI)

An internationalised email address is of the form Unicode@IDN where Unicode is a name consisting of Unicode characters (excluding the ASCII subset of Unicode) and an IDN (Internationalised Domain Name) consists of Unicode characters (again excluding the ASCII subset of Unicode). There are hybrid email addresses, such as, ASCII@IDN. My focus for this blog article is fully internationalised email addresses eg a fully Hindi email address.
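The distinction between fully internationalised, hybrid and plain ASCII addresses can be sketched in code. This is a rough illustration under stated assumptions, not a validator: the function names are hypothetical, and I assume that separators such as the dot in a domain name do not count against an address being "fully" internationalised.

```javascript
// Hypothetical helper: true when every character is in the ASCII range.
function isAsciiOnly(text) {
  return /^[\x00-\x7F]*$/.test(text);
}

// Hypothetical helper: true when the text has no ASCII letters or digits.
// Separators such as "." are tolerated because domain names need them.
function hasNoAsciiAlnum(text) {
  return text.length > 0 && !/[A-Za-z0-9]/.test(text);
}

// Classify an address of the form name@domain.
function classifyEmail(address) {
  const at = address.lastIndexOf('@');
  if (at <= 0 || at === address.length - 1) return 'invalid';
  const local = address.slice(0, at);
  const domain = address.slice(at + 1);
  if (isAsciiOnly(local) && isAsciiOnly(domain)) return 'ascii';
  if (hasNoAsciiAlnum(local) && hasNoAsciiAlnum(domain)) {
    return 'fully internationalised';
  }
  return 'hybrid';
}

console.log(classifyEmail('小山@电邮.在线'));    // fully internationalised
console.log(classifyEmail('support@电邮.在线')); // hybrid
```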

Currently, internationalised email addresses are somewhat of a rarity and few people know they even exist or that they can register such an email. I believe, at time of writing, that we are on the threshold of a significant growth in registration and usage of internationalised email addresses.

I consider it is time to raise awareness of internationalised email addresses and that is the reason for this blog article. I will publish internationalised email addresses. I will only publish those emails which are in the public domain and which are fully internationalised. I will publish non public domain, personal email addresses but only if I am given explicit permission by the holder to publish in this blog article. This will be an ongoing article and I will endeavour to update on a regular basis.

I am aiming for some ten emails per language. When I reach my quota I will endeavour to change some of the emails so that if you revisit you will see a different set of emails.

If you do come across any fully internationalised email addresses in the public domain, please do let me know. You can tweet me at twitter.com/andreschappo.

Chinese 中文
1. 小山@电邮.在线 — my Chinese email address. 小山 is my adopted Chinese name. There is a story behind this name ➜ schappo.blogspot.co.uk/2012/07/my-adopted-chinese-name.html
2. 阿賈伊@电邮.在线 — Ajay Data अजय डाटा, CEO of Data Xgen Technologies. 阿賈伊 is a transliteration of अजय and written in hanyu pinyin is a jia yi.
Hindi हिन्दी
1. वसुंधरा@राजस्थान.भारत — Vasundhara Raje वसुंधरा राजे - At time of writing, the 13th Chief Minister of Rajasthan, India. Here is a video, from December 2017, of the official launch of the Hindi email service and of this Hindi email address ➜ youtube.com/watch?v=XmUW4n9UmFk
2. अजय@डाटा.भारत — Ajay Data अजय डाटा, CEO of Data Xgen Technologies.
Russian Русский
- аяйдата@датамэйл.рус — Ajay Data अजय डाटा, CEO of Data Xgen Technologies. аяйдата is a transliteration of अजय डाटा.
- поддержка@почта.рус — поддержка@почта.рус translates as support@mail.rus and is the email address of a Russian email service provider 👀 ➜ почта.рус. Launch of the Russian language email service, December 2016 👀 ➜ youtube.com/watch?v=urw0UYjkSx8

Friday, 12 January 2018

Copy and Paste — A Tip

My standard practice for mouse based copy & paste is:

move the mouse to the text I want to copy
mouse down and drag to select the target text
release the mouse
press the key combination ⌘C to copy the text
paste to destination

I have been using this method for many many years and so have billions of other people. My very first use of a mouse was on an early Apple Mac computer so I have been using a computer mouse since around 1985.

There is though a problem with this method which I usually encounter when selecting text on a webpage. Sometimes this copy & paste method can result in the unintentional activation of a link or too much text or too little text being selected. It can be quite irritating, especially when one is in a hurry.

Several months ago I accidentally discovered a solution to this problem which works every time, or at least it has for me. The solution requires a small change to the copy & paste method. The new method is:

move the mouse to the text I want to copy
mouse down and drag to select the target text
do NOT release the mouse ie keep your finger pressed on the mouse
press the key combination ⌘C to copy the text
paste to destination

On occasions, people have said that I am a slow learner. This certainly proves them right as it has taken me some 32 years to discover this copy & paste method 😀

Environment: OSX High Sierra 10.13.2

Wednesday, 10 January 2018

Computer Science Internationalization - A Programming Challenge

I recently issued a simple (or is it?) programming challenge to 140+ university students. I will be giving this same challenge to school students at a participatory internationalisation workshop. The challenge is on jsfiddle. You too can try this challenge. You do not need to join jsfiddle and when you connect to jsfiddle you will be given your own copy of my code. You can thus modify the code as you wish and no changes will be made to my master version. You can also run your code on jsfiddle.

Here is the challenge. Either click on this link ➜ jsfiddle.net/coas/wda45gLp or click "Edit in JSFiddle" in the top right of the following window. Once connected to jsfiddle you can execute your code by clicking Run in the top left corner.

There are several solutions to this challenge. I will put at least two solutions in this blog article but not until after my internationalisation workshop. If you would like to share your solution(s) with me, tweet me at twitter.com/andreschappo.

Sunday, 7 January 2018

Computer Science Internationalization - Presentation of Links

All of us encounter links in documents, webpages and email. They are readily identified as they are usually coloured blue if not visited and red if visited. In the case of IDNs (Internationalised Domain Names) there are several ways of presenting links to users and I have encountered all of these variants. Letʼs look at Korea University's IDN 고려대학교.한국. This can be presented as www.고려대학교.한국, http://고려대학교.한국, http://www.고려대학교.한국, www.xn--299a9hr4mn4fgs6b.xn--3e0b707e, http://xn--299a9hr4mn4fgs6b.xn--3e0b707e, http://www.xn--299a9hr4mn4fgs6b.xn--3e0b707e. If you click on these links you will see that they all work. I do not like any of these ways of presenting links. xn--299a9hr4mn4fgs6b.xn--3e0b707e is the punycode form of the domain name and should never be presented to users. It is used for behind the scenes communication between internet devices. (I still prefer the name punnycode 😁 )
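To see the punycode form for yourself, a URL parser will produce it. The following is a small sketch that assumes a JavaScript environment with the standard URL class (modern browsers and Node.js); the function name toAsciiHost is hypothetical.

```javascript
// Hypothetical helper: the ASCII (punycode) form of a host name,
// as reported by the standard URL parser.
function toAsciiHost(host) {
  return new URL('http://' + host).hostname;
}

// The article gives xn--299a9hr4mn4fgs6b.xn--3e0b707e as the punycode
// form of this domain name.
const asciiHost = toAsciiHost('고려대학교.한국');
console.log(asciiHost);

// A purely ASCII host name is returned unchanged.
console.log(toAsciiHost('bbc.com')); // bbc.com
```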

I have used different forms over the years and I do consider there is a best way of presenting links and I have done this on many occasions. But I have mostly done it by way of experimentation and I have not done it consistently. As of yesterday, I have decided to have a consistent working practice for presenting both IDN and ASCII links. My personal rules for presenting links are:-

I will not use the www or http(s) prefix and will most definitely not use the punycode form. The link for Korea University now becomes ➜ 고려대학교.한국. We now have a simple and elegant presentation of the link. Note also that it is a single human language script which in this case is Korean Hangeul. Therefore when one is typing this link there is no need to switch between English and Korean on oneʼs device. This, to me, is the most important part of presenting links as a single human language script.
Presentation of email addresses is well established and presented email address links are not prefixed with the "mailto:" scheme name. We can do the same thing with internationalised email addresses. The link for my Chinese email address is 小山@电邮.在线?Subject=你好小山😜. We can easily distinguish between website links and email links because an email link has the @ symbol. If your email client works correctly, then, when you click my Chinese email link the To: field should be filled in with 小山@电邮.在线 and the Subject: field with 你好小山😜
I will always show the real address in the link eg ko.wikipedia.org/wiki/고려대학교. I will never use anything of the form " … please click here for further information. ". I consider this to be extremely bad security. I abandoned this practice many years ago. How many of you hover over a link to determine the real address before you click the link? I do sometimes, but mostly I do not. With my links there are no surprises, what you see is what you will get. The one thing I have no control over is redirection. A website can, at any time, redirect to a different web address. Such redirection does sometimes happen though not very often and usually it is for legitimate reasons such as redirection to a new version of a website.
When a url is too long to fit reasonably into, say, a presentation slide, I will use an ellipsis to indicate that it is not the complete address eg once.upon/a/time/there/was/a/beautiful/…
So far we have only considered the two most common schemes, http(s) and mailto. There are many other schemes, such as smb, sftp and imap. There is a list of the registered schemes at iana.org/assignments/uri-schemes/uri-schemes.xhtml. For these other schemes I will include the scheme prefix in my link so that it can be easily distinguished from the aforementioned types of link eg sftp://some.fileserver.somewhere/freestuff/user-manual.txt
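The rules above can be approximated with a few string operations. This is only a sketch of my reading of the rules: the function name presentLink and the default length limit of 80 characters are assumptions, and the result is the text to display, not the address the link points to.

```javascript
// Hypothetical helper: turn an address into the text shown to the reader.
function presentLink(address, maxLength = 80) {
  let text = address
    .replace(/^https?:\/\//i, '') // no http(s) prefix
    .replace(/^mailto:/i, '')     // no mailto prefix on email addresses
    .replace(/^www\./i, '');      // no www prefix
  // Other schemes, such as sftp://, are left in place.
  const chars = Array.from(text); // count characters, not UTF-16 code units
  if (chars.length > maxLength) {
    // Mark a shortened address with an ellipsis.
    text = chars.slice(0, maxLength - 1).join('') + '…';
  }
  return text;
}

console.log(presentLink('http://www.고려대학교.한국'));  // 고려대학교.한국
console.log(presentLink('mailto:小山@电邮.在线'));       // 小山@电邮.在线
console.log(presentLink('sftp://some.fileserver.somewhere/freestuff/user-manual.txt'));
```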

Some systems and apps will break my working practice as they will not allow me to present my links as I consider they should be presented. I will endeavour to find work arounds for such systems.

Techie Tip: When I was setting up my Chinese email address in my email signature, my email client insisted on converting the domain name 电邮.在线 to the punycode form xn--wny099c.xn--3ds443g. When something like this happens I add an extra level of encoding so that the system decodes to the level I want, not the level the system wants. This sometimes works and sometimes not. In this case it worked. I added percent encoding. I percent encoded to %E5%B0%8F%E5%B1%B1@%E7%94%B5%E9%82%AE.%E5%9C%A8%E7%BA%BF and then my email client decoded to 小山@电邮.在线 which is precisely what I wanted. A really useful web app for doing such conversions is Richard Ishidaʼs Unicode code converter r12a.github.io/apps/conversion
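The same percent encoding can be produced with JavaScript's built-in encodeURIComponent. This is a minimal sketch; the function name percentEncodeAddress is hypothetical, and the address is split at the @ symbol because encodeURIComponent would otherwise encode it as %40.

```javascript
// Hypothetical helper: percent encode each side of an email address,
// leaving the @ symbol itself untouched.
function percentEncodeAddress(address) {
  return address
    .split('@')
    .map((part) => encodeURIComponent(part))
    .join('@');
}

const encoded = percentEncodeAddress('小山@电邮.在线');
console.log(encoded);
// %E5%B0%8F%E5%B1%B1@%E7%94%B5%E9%82%AE.%E5%9C%A8%E7%BA%BF

// Decoding reverses the process.
console.log(decodeURIComponent(encoded)); // 小山@电邮.在线
```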

Friday, 5 January 2018

Computer Science Internationalization - Composed v Decomposed Text

My given name André, has the diacritic, acute accent over the letter e. In Unicode there are two ways of constructing and storing é. Either as a single composed character U+00E9 LATIN SMALL LETTER E WITH ACUTE or decomposed as the two characters e U+0065 LATIN SMALL LETTER E and ´ U+0301 COMBINING ACUTE ACCENT. Processes can and do convert between composed and decomposed forms. How can we know which of these forms is being stored?
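For a string already held in a program, JavaScript's String.prototype.normalize offers one way to answer that question. The sketch below is an illustration only: the function name normalizationForm and its four labels are my own, and the test strings are written with \u escapes so that there is no doubt about which form is which.

```javascript
// The same name in the two forms described above.
const composed = 'Andr\u00E9';    // é as the single character U+00E9
const decomposed = 'Andre\u0301'; // e followed by U+0301

// Hypothetical helper: report which form a string is in.
function normalizationForm(text) {
  const isNFC = text === text.normalize('NFC');
  const isNFD = text === text.normalize('NFD');
  if (isNFC && isNFD) return 'unaffected'; // e.g. plain ASCII text
  if (isNFC) return 'composed';
  if (isNFD) return 'decomposed';
  return 'mixed';
}

console.log(composed === decomposed);                   // false
console.log(normalizationForm(composed));               // composed
console.log(normalizationForm(decomposed));             // decomposed
console.log(decomposed.normalize('NFC') === composed);  // true
```

The first comparison shows why this matters: the two strings look identical on screen but are not equal until one of them is normalised.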

Letʼs take the case of pasting into a browser web form. I will use André as my test data and I will make use of Richard Ishidaʼs Uniview r12a.github.io/uniview. To use Uniview, paste text into the largish white textbox and then click the white down arrow ⇪. You will then see information about all the characters in the textbox. Pasting André into Uniview gives the following results:-

Chrome — é is decomposed
Firefox — é is decomposed
Safari — é is composed

The text André in this blog article is in decomposed form, except where I indicate otherwise. In my rather limited tests, Chrome and Firefox do not convert the text, so if the text starts as decomposed it arrives in Uniview as decomposed. Safari, on the other hand converts decomposed text to composed.

Letʼs now go back a step. The copy operation involves copying text to the clipboard and the paste operation takes text from the clipboard. Can we determine which form is in the clipboard? Yes we can and here is one way of doing it.

We are now going to use the terminal app. Typing the command pbpaste|hexdump -C will show the contents of the clipboard at byte level. Copying André and running the command pbpaste|hexdump -C will give 41 6e 64 72 65 cc 81 . This is André displayed in Unicode UTF-8 encoding, where 41 = A; 6e = n; 64 = d; 72 = r; 65 = e; cc 81 = combining acute accent. If we copy the composed form of André (⬅︎ composed ⬅︎ André) we get 41 6e 64 72 c3 a9 where c3 a9 = composed é.
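The byte values shown by hexdump can be reproduced in code. This is a small sketch that assumes an environment with the standard TextEncoder class (modern browsers and Node.js); the function name utf8Hex is hypothetical, and the test strings use \u escapes to make the two forms explicit.

```javascript
// Hypothetical helper: the UTF-8 bytes of a string as space separated hex.
function utf8Hex(text) {
  const bytes = new TextEncoder().encode(text);
  return Array.from(bytes, (b) => b.toString(16).padStart(2, '0')).join(' ');
}

console.log(utf8Hex('Andre\u0301')); // 41 6e 64 72 65 cc 81 (decomposed)
console.log(utf8Hex('Andr\u00E9'));  // 41 6e 64 72 c3 a9 (composed)
```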

Conclusion: Itʼs complicated! Not being aware of these different forms of text can lead to bugs in code that are very difficult to find. Given different combinations of apps, versions and processes the results may well be different. One certainty, though, is that one needs to have a good understanding of Unicode.

Environment: OSX High Sierra 10.13.2, Chrome 63.0.3239.84, Firefox 57.0.4, Safari 11.0.2