André 小山 Schappo

Saturday, 26 November 2016

Domain Name Registrations

To keep up to date with Domain Name registrations I highly recommend gd-domains. It gives listings of newly registered Domain Names on a daily basis. Listings for individual TLDs (Top Level Domains) are available. It is thanks to this site that I discovered the below impressive and sizeable family of Korean Domain Names. They were registered on 11th and 22nd November 2016. The TLD used is 닷컴 which is Verisign's Korean equivalent to com.

I think embedding telephone numbers into these IDNs (Internationalized Domain Names) is clever marketing ☎️

www.gd-domains.com/20161111-229 is the link for 닷컴 registrations on 11th November 2016. www.gd-domains.com/20161122-229 is the direct link for all 닷컴 registrations on 22nd November 2016, 2016년 11월 22일 화요일.
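For readers curious how an IDN label such as 닷컴 travels through the DNS, here is a minimal Python sketch. It assumes Python's built-in idna codec, which implements the older IDNA 2003 rules, is adequate for this label. The variable names are mine and purely illustrative.

```python
# The Korean TLD label discussed above.
tld = "닷컴"

# Convert the Unicode label to its ASCII-compatible (Punycode) form,
# which is what is actually stored and looked up in the DNS.
ace = tld.encode("idna")
print(ace)

# Converting back recovers the original Unicode label.
round_trip = ace.decode("idna")
print(round_trip)
```

The ASCII form always starts with the prefix xn--, which is how software recognises that a label is an encoded IDN.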

Wednesday, 26 October 2016

Family of Korean IDNs

The following is a list of functioning Korean IDNs (Internationalized Domain Names). The TLD (Top Level Domain) used is 닷컴 which is Verisign's Korean language equivalent to their com TLD.

This family of Korean IDNs is concerned with Computer Repair 컴퓨터수리. The first two characters of the first 6 IDNs are names of Korean cities: 김포 Gimpo, 안양 Anyang, 용인 Yongin, 대구 Daegu, 파주 Paju, 성남 Seongnam. I think the first 2 characters of the next 2 IDNs are districts or neighbourhoods of 서울 Seoul: 용산 Yongsan, 종로 Jongno. The last one, 일산 Ilsan, is a 동 neighbourhood of 고양 Goyang.

The following family of Korean IDNs all resolve to Fun Design website. I discovered this family at newly.domains/20171128-229 홈페이지제작 means home page creation. The first 2 Korean characters of the first 9 IDNs are Korean cities: 과천 Gwacheon, 광주 Gwangju, 대구 Daegu, 대전 Daejeon, 부산 Busan, 서울 Seoul, 수원 Suwon, 울산 Ulsan and 인천 Incheon. The last one, 분당 Bundang, I am less certain about. I think it is a district of 성남 Seongnam.

Friday, 7 October 2016

Computer Science Internationalization — Bidi

Scripts such as Latin are written from Left to Right (L➡︎R). Scripts such as Arabic and Hebrew are written Right to Left (L⬅︎R). What happens when we mix L➡︎R and L⬅︎R scripts within a document? Here is an exercise in mixing scripts.

Take a mixed bidi (bidirectional) string consisting of Latin and Hebrew characters in a L➡︎R paragraph.

abcאבגdef

...and here is the same string in a L⬅︎R paragraph.

abcאבגdef

Now to the actual exercise. Copy the above strings to your text editor or word processor. You will need to set up the 2nd occurrence of the string as a L⬅︎R paragraph. I am assuming that your directionality is L➡︎R by default. Each string has two boundaries where the text changes direction. For each boundary you are going to insert a character, either a L➡︎R, such as x, or a L⬅︎R, such as ד. For each insertion operation use the initial mixed bidi string. There are two mixed strings above, each with two boundaries and two possible characters to insert, and so there are a total of 8 insertion operations. The challenge is to predict where in the strings the inserted character will appear before you actually insert the character. Give it a go! Good luck😀

If I did this exercise before I ever studied bidi, I would probably have scored 4/8. Now I understand how the computer is processing this bidi text and so I usually score full marks for such exercises. It is though not an intuitive process for me as I have spent most of my life reading and writing L➡︎R scripts only. I have to think very carefully as to how the computer does it in order to determine the correct answers.

The main purpose of this exercise is to think about the ordering of the characters in the strings. There are two orderings to consider: memory order and display order. Memory order is how it is logically saved in memory which in this case is the order in which I typed it. The memory order of the string I have used above is "abcגבאdef". Display order is how it is presented to the viewer. You have already seen, above, the two possible display orders for the single string "abcגבאdef".
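Memory order can be inspected directly in code. The following Python sketch is only an illustration and assumes the Hebrew letters were typed in the order alef, bet, gimel. It prints each character in memory order together with its Unicode bidi class (L for left to right, R for right to left) and finds the two boundaries where the direction changes. The variable names are mine.

```python
import unicodedata

# Memory (logical) order: Latin abc, Hebrew alef bet gimel, Latin def.
# Escapes are used so an editor's bidi display cannot disguise the order.
text = "abc\u05d0\u05d1\u05d2def"

# Bidi class of every character, in memory order.
classes = [unicodedata.bidirectional(char) for char in text]

for index, char in enumerate(text):
    print(index, f"U+{ord(char):04X}", classes[index], unicodedata.name(char))

# A boundary is an index where the bidi class differs from the previous one.
boundaries = [i for i in range(1, len(text)) if classes[i] != classes[i - 1]]
print("direction changes at indexes", boundaries)
```

Whatever the paragraph direction, the memory order printed here stays the same. Only the display order changes.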

I have used TextEdit for this exercise. In order to set paragraph text direction in TextEdit follow the path: "TextEdit➜ Format➜ Text➜ Writing Direction". Now set paragraph text direction to Right to Left. TextEdit correctly handles bidi text but that is not the case for all word processors or text editors.

There are several permutations of this exercise, including:

What happens at the boundaries with forward delete and back delete?
What happens if the initial memory order character(s) are L⬅︎R instead of L➡︎R?
Use Arabic instead of Hebrew as this introduces the additional challenge of letters changing shape according to preceding and following characters.

This article is aimed at L➡︎R reading/writing people. If you are a L⬅︎R person then you will need to invert some of my instructions. Actually, if you are a L⬅︎R person you will be totally familiar with mixing bidi text and so will fully understand this exercise.

Environment: OSX v10.12 (Sierra), TextEdit v1.12

Wednesday, 21 September 2016

Computer Science Curriculum Internationalization

I have been a long time practitioner and advocate of internationalising Computer Science teaching. My fundamental aim is to give students global computing skills. One such global skill, for example, is the processing of Unicode text rather than the very restricted ASCII text. Once one encompasses Unicode then one is encompassing most languages and scripts of the world.

Over the years I have tried to find other like minded Computer Science educators but have had no success. I had more or less concluded I am a solitary voice when it comes to Computer Science internationalisation. There does though appear to be some light as I recently discovered two organisations that promote internationalisation of teaching curricula.

The Centre for Curriculum Internationalisation (CCI) which is based at Oxford Brookes University, UK. brookes.ac.uk/services/cci/index.html In addition to their website they have a Google discussion group. I posted some information on my Computer Science Internationalisation initiatives and practices to this Google forum. Please see groups.google.com/forum/#!topic/cicin/6XJCrqcdLD4

Internationalisation of the Curriculum (IoC) in action which is based in Australia. ioc.global

Update 19th March 2017: I reached out to people and groups and I conclude I am still a solitary voice with respect to Computer Science Curricula Internationalization in UK Universities. I do believe UK Universities will have to embrace Computer Science Internationalization but I think it will be at least ten years before that happens. So, why do I persist? Am I wrong? Well, if I am wrong then so are Google, Wikipedia, Facebook, Nivea, Booking.com, Nestlé, Hotels.com, Pampers, Intel, Microsoft, Philips, Adobe, Twitter and many, many more companies. They all operate globally and are all producing software for the world. These global companies need graduates who have the skills and attitude necessary for building global software.
Note: I have taken these company names from The top 25 global websites from the 2017 Web Globalization Report Card globalbydesign.com/2017/02/16/the-top-25-global-websites-from-the-2017-web-globalization-report-card/

Update 1st October 2017: I recently created an open forum specifically for discussion on the topic of Computer Science/ICT/IT curricula internationalisation. If this topic interests you, please become a member and join in the discussions. It is open to all. Please see groups.google.com/forum/m/#!forum/computer-science-curriculum-internationalization

Thursday, 25 August 2016

Internationalizing Regular Expressions

The purpose of this post is to encourage all of you who are teaching Regular Expressions (RegExp) or are learning RegExp to think international. Think beyond ASCII. Thinking international means thinking Unicode instead of ASCII. Once one thinks Unicode then one is encompassing the world.

My RegExp teaching slides use ASCII only as a starting point. They then progress to Unicode. I give one of my slides as an example.

There is a lot of information packed into this one slide which needs some explanation. My example slide is using Unicode Chinese characters and Unicode Emoji characters.

人 is a Unicode Chinese character meaning person
鸭 is a Unicode Chinese character meaning duck
鸡 is a Unicode Chinese character meaning chicken

This slide also contains a cultural reference. Some time ago I came across a Weibo 微博 post about the visit to Hong Kong by the big floating yellow duck http://edition.cnn.com/2013/05/02/travel/hong-kong-giant-duck/ The Weibo post had a photo containing many people looking at the duck. The text of the Weibo post was:-

人人人人人人人人人人人人人人人人人人人人人人人人人人人人人人人人人人人人人人人人人人人人人人人人人人人人人人人人人人人人人人人人人人人人人人人人人人鸭人人人人人人人人人人人人人人人人人人人人人人人人人人人人人人人人人人人人人人人人人人人人人人人人人人人人人人人人人人人人人人人人人

When I saw this I thought it so funny and very clever. It just would not work in English but works so perfectly in Chinese. When writing my RegExp slides I remembered this Weibo post and thought this would make for an excellent cultural connection. Thus my slide is internationalized by using Unicode and incorporating a cultural reference. The use of Unicode is essential for internationalisation. Incorporating a cultural reference is optional but it does add an extra dimension that may well serve to make RegExp slides more interesting and encourage readers to explore the boundless potential of internationalized Regular Expressions.
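To give a flavour of what such an internationalized RegExp exercise might look like, here is a small Python sketch. It is not a reproduction of my slide. The crowd string is a shortened, made-up version of the Weibo text and the variable names are only for illustration.

```python
import re

# A shortened stand-in for the Weibo text: a crowd of people with one duck.
crowd = "人" * 5 + "鸭" + "人" * 3

# One or more people, then the duck, then one or more people.
match = re.fullmatch(r"(人+)鸭(人+)", crowd)
people_before = len(match.group(1))
people_after = len(match.group(2))
print(people_before, "people before the duck,", people_after, "after")

# A character class works with Chinese characters just as it does with ASCII.
poultry = re.findall(r"[鸭鸡]", "人鸡人鸭人")
print(poultry)
```

Python 3 strings are Unicode, so the quantifier and the character class each operate on whole Chinese characters rather than on bytes.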

Note: I have tried to find the Weibo post but have been unsuccessful so I cannot, unfortunately, provide a reference.

Sunday, 13 March 2016

Arabic Email Addresses

Most human language scripts are written from Left to Right (L➡︎R). Arabic is written Right to Left (L⬅︎R). An email address written in the Latin script would be displayed L➡︎R — username@domain-name. An Arabic email address, on the other hand, would normally and without intervention be displayed L⬅︎R as domain-name@username.

Letʼs take a fictitious Arabic email address — خالد@الدوحة.قطر

خالد is the username Khalid
الدوحة is the 2nd level domain name Doha
قطر is the Top Level Domain (TLD) Qatar. This part is not fictitious as قطر is a valid ccTLD.

Your browser should be displaying the email address خالد@الدوحة.قطر in L⬅︎R order which is not an order familiar to most L➡︎R readers and so requires some effort to parse.

When text has mixed L➡︎R and L⬅︎R characters it is referred to as Bidirectional (bidi) text. There is a complex Unicode algorithm specifically to determine display order of bidi text unicode.org/reports/tr9/ If you read this report you will see something called Directional Isolates.

In the html world there are tags and attributes specific to bidi. One such tag is <bdi> which is bidi isolate. Using such html bidi isolation one can incorporate Arabic email addresses that are natural to read for both L➡︎R and L⬅︎R readers. These addresses can be written such that their overall text direction adheres to the text direction of the context. This context may be the direction of the whole html document or some subpart such as a paragraph.

First we will set up our html with a L➡︎R context for L➡︎R readers. The below paragraph (p) is set up with dir (direction) set to ltr (left to right). The email address has 3 components: username, 2nd level domain name and TLD. Each component is direction isolated. This gives an email address whose overall direction is L➡︎R. The text of each component is, as it should be, L⬅︎R. I posit that this is much easier for a L➡︎R reader to comprehend. It is now easy, for instance, to determine which is the username and which is the TLD.

The html code
<p dir="ltr"><bdi>خالد</bdi>@<bdi>الدوحة</bdi>.<bdi>قطر</bdi></p>
displays the address as

خالد@الدوحة.قطر

But how will the address be displayed if the context is changed to rtl (right to left)? The same code correctly displays the whole address in L⬅︎R order, both the overall direction and the text direction of each component. Thus we have also catered for L⬅︎R readers without changing the html that marks up the address. Only the dir attribute of the paragraph changes.

The html code
<p dir="rtl"><bdi>خالد</bdi>@<bdi>الدوحة</bdi>.<bdi>قطر</bdi></p>
displays the address as

خالد@الدوحة.قطر
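Outside of html, the same isolation can be attempted in plain text with the Unicode Directional Isolate characters described in the bidi report. The Python sketch below is an illustration under the assumption that FSI (U+2068) and PDI (U+2069) behave like a bdi element without a dir attribute. The function names are hypothetical, and the resulting string is meant for display only, not for use as a deliverable address.

```python
FSI = "\u2068"  # FIRST STRONG ISOLATE: direction taken from the first strong character
PDI = "\u2069"  # POP DIRECTIONAL ISOLATE: ends the isolate


def isolate(component: str) -> str:
    """Wrap one component so its direction cannot affect its neighbours."""
    return f"{FSI}{component}{PDI}"


def isolated_address(username: str, domain: str, tld: str) -> str:
    """Build a display form of an email address with each component isolated."""
    return f"{isolate(username)}@{isolate(domain)}.{isolate(tld)}"


address = isolated_address("خالد", "الدوحة", "قطر")
print(address)

# Removing the invisible isolate characters gives back the plain address.
plain = address.replace(FSI, "").replace(PDI, "")
print(plain)
```

As with the html version, the overall order of the components then follows the direction of the surrounding paragraph, provided the displaying software supports isolates.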

Just in case your browser cannot, as yet, handle bidi isolates here are the 2 contexts in image format.

Friday, 1 January 2016

Emoji by Name

Here is a method for typing Emoji by name but not by English name. This method is for writing Emoji by Chinese name. OSX provides a Pinyin Input Method for writing Chinese. Pinyin is a romanization of Chinese. When writing in pinyin a candidate window pops up which lists all possible Chinese characters 汉字 and Emoji.

Candidate Window - Frequency

Candidate Window — Emoji

Here is a small sample of the Emoji which can be typed using this Pinyin Input Method. Each line below starts with the pinyin followed by the Emoji. The pinyin can have multiple meanings, multiple candidate Chinese characters and hence multiple Emoji. Hopefully for the examples I have given below you will be able to work out the meanings from the Emoji. Some of the below pinyin represent objects and some represent emotions.

ai — ❤️ 😘 💗 💓 😍
bei shang — 😢 😭
che — 🚗 🚘
hou — 🐒 🐵
hua — 🌹 🌼 💐 🌷 🌸 🌺
ka fei — ☕️
kai xin — 😄 😺 😃 😆 ☺️
mao — 🐱 🐈 ⚓️
niu — 🐂 🐃 🐄 🐮
pi jiu — 🍺 🍻
sheng qi — 😠 😡 💢 😾
shu — 🌲 🌳 🌴 🐭
shui guo — 🍉 🍊 🍇 🍈 🍌 🍍 🍎 🍑 🍒 🍓 🍅 🍆 🍋 🍏 🍐
tuo la ji — 🚜
xiang — 🐘
xiao — 😊 😄
xie — 👟 👠
xue ren — ⛄️ ☃️
yin yue — 🎵 🎷 🎶 🎸 🎹 🎺 🎻 🎼 🎤 🎧 📯
yu — 🐟 🐠

Environment: OSX El Capitan v10.11.2

Friday, 11 December 2015

Unicode Regular Expressions

I have long been familiar with processing Unicode characters with RegExp (Regular Expressions). I was also aware that RegExp could be used to match Unicode characters based upon their Unicode assigned character properties. I had not yet though coded such property based RegExp. A few days ago I decided to explore this area.

An interesting property, for example, is the script to which a character belongs. e.g. \p{Hangul} will match with any character which belongs to the Hangul script. Hangul is the script used to write Korean.

I started with Perl and here is my simple Perl program:

#!/usr/bin/perl

if("노팅엄"=~/^\p{Hangul}+$/){print "korean\n";}else{print "not korean\n";}

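...and this code did not work. I know that 노팅엄 is Korean hangul but my code disagreed. After much searching I discovered I needed to include the statement use utf8, which tells Perl that the source code itself is UTF-8 encoded. Without it, the string literal is treated as a sequence of individual bytes, none of which is a Hangul character. So my working version of the code is: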

#!/usr/bin/perl

use utf8;

if("노팅엄"=~/^\p{Hangul}+$/){print "korean\n";}else{print "not korean\n";}

...and now onto PHP using PCRE Perl Compatible RegExp. My initial RegExp was:

preg_match('/^\p{Hangul}+$/','노팅엄')

...and this did not work! We have already established that 노팅엄 is Korean hangul but my PHP code disagreed with me. After I investigated further I discovered there is a u modifier which directs PCRE to treat both the pattern and the subject string as UTF-8. So add the u modifier and we now have working code!

preg_match('/^\p{Hangul}+$/u','노팅엄')
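For comparison, here is a rough Python equivalent. Python 3 strings are already Unicode, so no pragma or modifier is needed, but the standard re module does not support \p{Hangul}. This sketch therefore approximates the script test by checking that each character's Unicode name starts with HANGUL, which is an assumption of mine and not a true Script property lookup. It would miss, for example, halfwidth or circled Hangul characters. The function name is hypothetical.

```python
import unicodedata


def is_all_hangul(text: str) -> bool:
    """Approximate /^\\p{Hangul}+$/ using Unicode character names."""
    # An empty string must not match, mirroring the + quantifier.
    if not text:
        return False
    # unicodedata.name(ch, "") returns "" for unnamed characters.
    return all(unicodedata.name(ch, "").startswith("HANGUL") for ch in text)


for sample in ["노팅엄", "Nottingham", "노팅엄8"]:
    print(sample, "korean" if is_all_hangul(sample) else "not korean")
```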

For several years now, my standard practice has been to save text files as Unicode UTF-8 encoded files. This includes code files. One still, though, has to repeatedly and explicitly tell systems, programs, functions, utilities and processes to use Unicode. We seem still to be a long way from a total Unicode environment with everything being seamlessly and natively Unicode.

Environments: Perl v5.18.2; PHP v5.5.29

Sunday, 27 September 2015

Nottingham on Weibo

Sina Weibo 新浪微博 is a Chinese microblogging service en.wikipedia.org/wiki/Sina_Weibo. Nottingham now has a number of organisations on Sina Weibo. Here are some of them.

Nottingham City Council [@英国诺丁汉市政厅] weibo.com/nottinghamcity
Nottingham Trent University [@英国诺丁汉特伦特大学] weibo.com/ntuinternational
Nottingham University Business School [@英国诺丁汉大学商学院] weibo.com/UoNBusiness
University of Nottingham [@英国诺丁汉大学官方微博] weibo.com/uoneao

Note: I only list Nottingham Weibo accounts that are verified and have a meaningful URL, i.e. not the default numeric form.

Wednesday, 29 July 2015

JavaScript Variable Names

In schappo.blogspot.co.uk/2015/06/php-variable-names.html I explained how PHP variable names are determined at the byte level. JavaScript variable names are determined at a higher level and are defined in terms of Unicode Properties and General Categories.

Letʼs start with some simple Basic Latin variable name examples.

Valid Names: nottingham nottingham8
Invalid Name: 8nottingham

where 8 is Unicode character U+0038 DIGIT EIGHT. The last name, above, is invalid because it begins with a digit.

We are in the Unicode age and so do not need to restrict ourselves to Basic Latin. Some time ago I asked myself whether the same Basic Latin rule, an initial digit is invalid, applies to other Scripts and yes it does (well mostly).

Valid Devanagari Names: नाटिंघम नाटिंघम६
Invalid Devanagari Name: ६नाटिंघम

where ६ is Unicode character U+096C DEVANAGARI DIGIT SIX

Valid Thai Names: นอตทิงแฮม นอตทิงแฮม๘
Invalid Thai Name: ๘นอตทิงแฮม

where ๘ is Unicode character U+0E58 THAI DIGIT EIGHT

Valid Telugu Names: నాటింగ్‌హామ్ నాటింగ్‌హామ్౬
Invalid Telugu Name: ౬నాటింగ్‌హామ్

where ౬ is Unicode character U+0C6C TELUGU DIGIT SIX

Valid Chinese Names: 诺丁汉 诺丁汉八 八诺丁汉

where 八 is the digit 8, aka Unicode character U+516B CJK UNIFIED IDEOGRAPH-516B

So, Unicode CJK Ideographs are the exception as a name can begin with a digit. Unicode CJK Ideographs encompass Chinese Hanzi, Japanese Kanji and Korean Hanja. The reason for this exception is the Unicode General Category to which these digits are assigned. The digits 8, ६, ๘, ౬ have the General Category Nd, which is Decimal Number, and thus cannot be the first character of a variable name. 八 has the General Category Lo, which is Other Letter, and thus can be the first character of a variable name even though semantically it is a digit.
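The General Categories mentioned above can be checked with a few lines of Python. This is only a sketch. The final two checks use Python's own identifier rules, which are also built on Unicode identifier properties, as a stand-in. I am assuming they agree with JavaScript for these particular names, not that the two languages have identical rules.

```python
import unicodedata

# The five digits discussed above: Basic Latin, Devanagari, Thai, Telugu, CJK.
digits = ["8", "\u096c", "\u0e58", "\u0c6c", "\u516b"]

# Map each digit to its Unicode General Category.
categories = {ch: unicodedata.category(ch) for ch in digits}

for ch in digits:
    print(f"U+{ord(ch):04X}", categories[ch], unicodedata.name(ch))

# Python identifiers, used here only as an analogy for JavaScript names.
latin_digit_first = "8nottingham".isidentifier()
cjk_digit_first = "八诺丁汉".isidentifier()
print(latin_digit_first, cjk_digit_first)
```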

The above holds true for ECMAScript 5 and 6 and possibly earlier versions but it is unlikely I will test earlier versions. You can validate your variable names online at mothereff.in/js-variables.

For further reading I suggest:

ECMAScript 5 Names ecma-international.org/ecma-262/5.1/#sec-7.6
ECMAScript 6 Names ecma-international.org/ecma-262/6.0/index.html#sec-names-and-keywords
Unicode Identifier and Pattern Syntax unicode.org/reports/tr31/
Derived Properties ID_Start and ID_Continue in unicode.org/Public/UNIDATA/DerivedCoreProperties.txt

Friday, 17 July 2015

Multilingual PDF

OS X has a package type with the extension lpdf, which is the Multilingual PDF of the title of this blog article. Letʼs look inside System Information.app using Terminal.

find . -name '*.lpdf'
./Contents/Resources/ProductGuides/productinfo1.lpdf
./Contents/Resources/ProductGuides/productinfo2.lpdf
./Contents/Resources/ProductGuides/productinfo3.lpdf
./Contents/Resources/ProductGuides/regulatory-022-00023.lpdf
./Contents/Resources/ProductGuides/regulatory-022-00272.lpdf
./Contents/Resources/ProductGuides/regulatory-022-00432.lpdf
./Contents/Resources/ProductGuides/regulatory-022-00458.lpdf
./Contents/Resources/ProductGuides/regulatory-022-00459.lpdf
./Contents/Resources/ProductGuides/regulatory-022-00465.lpdf
./Contents/Resources/ProductGuides/regulatory-022-00466.lpdf
./Contents/Resources/ProductGuides/regulatory-022-5167.lpdf
./Contents/Resources/ProductGuides/regulatory-022-6097.lpdf
./Contents/Resources/ProductGuides/regulatory-022-6098.lpdf
./Contents/Resources/ProductGuides/regulatory-022-6346.lpdf
./Contents/Resources/ProductGuides/regulatory-022-6347.lpdf

Looking inside, for example, productinfo1.lpdf/Contents/Resources we can see a set of lproj directories.


ar.lproj      es.lproj      id.lproj      pl.lproj      th.lproj
ca.lproj      es_MX.lproj   it.lproj      pt.lproj      tr.lproj
cs.lproj      fi.lproj      ja.lproj      pt_PT.lproj   uk.lproj
da.lproj      fr.lproj      ko.lproj      ro.lproj      vi.lproj
de.lproj      he.lproj      ms.lproj      ru.lproj      zh_CN.lproj
el.lproj      hr.lproj      nl.lproj      sk.lproj      zh_TW.lproj
en.lproj      hu.lproj      no.lproj      sv.lproj

The above directory names are of the form language tag.lproj. Languages include: ar (arabic), fr (french), ja (japanese), ko (korean), th (thai) and zh_CN (chinese in China). Inside each of the above directories is a language localized productinfo1.pdf e.g. ko.lproj contains a productinfo1.pdf which is the Korean language version of the document.

More concisely: An lpdf is an OS X Package containing a set of language localized PDFs.
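The package layout described above can be walked programmatically. The following Python sketch assumes every lpdf follows the same Contents/Resources/language.lproj layout that I observed for productinfo1.lpdf. The function name and the example path are hypothetical.

```python
from pathlib import Path


def localized_pdfs(lpdf_package):
    """Return a dict mapping language tag to the localized PDF inside an lpdf package."""
    resources = Path(lpdf_package) / "Contents" / "Resources"
    found = {}
    if not resources.is_dir():
        return found
    # Each language has its own directory such as ko.lproj or zh_CN.lproj.
    for pdf in sorted(resources.glob("*.lproj/*.pdf")):
        language = pdf.parent.stem  # "ko.lproj" becomes "ko"
        found[language] = pdf
    return found


# Hypothetical usage: run from the ProductGuides directory.
for language, pdf in localized_pdfs("productinfo1.lpdf").items():
    print(language, pdf)
```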

Note: This exploration of lpdf was carried out using OS X Yosemite 10.10.4.

Monday, 22 June 2015

PHP Variable Names

For years I thought PHP variable names could only be constructed from ASCII characters. Actually, maybe I had not really thought about it but rather just followed common practice without question. The common practice being something like

a variable name is prefixed with $
the first character must be a letter (a-z, A-Z) or an underscore (_)
subsequent characters can be any mix of letters or digits (0-9) or underscore

So, examples of valid PHP variable names include

$Andre $age $previous_total

But!!!!! We are in the Unicode age and so variable names are NOT restricted to the above common practice. We can be much more creative. We can, for instance, localise our code. Examples of valid variable names include

$André $小山 $エクセレント $우수한 $🐉

For the following explanation I am assuming your source code file is saved as Unicode UTF-8. If not, it should be.

Letʼs refer to the definitive PHP documentation concerning variable names which is at php.net/manual/en/language.variables.basics.php. The key is the specified regular expression

[a-zA-Z_\x7F-\xFF][a-zA-Z0-9_\x7F-\xFF]*

The variable name 小山 UTF-8 encoded is E5 B0 8F E5 B1 B1 which is matched by above regular expression. The variable name 🐉 UTF-8 encoded is F0 9F 90 89 which is also matched by the above regular expression.

Determination of valid variable names is at a low level, the byte level. A UTF-8 encoded character will consist of 1 to 4 bytes. Only characters in the Basic Latin Unicode block (which is the same as ASCII) use 1 byte for encoding. All other characters require 2 to 4 bytes for encoding. The byte values for these other characters are always ≥ 80 (hexadecimal), which falls inside the 7F-FF range allowed by the regular expression. The consequence is that if one uses non Basic Latin Unicode characters there are no restrictions whatsoever! Thus one can, for example, have Chinese, Japanese, Korean, Punjabi, Russian or Egyptian Hieroglyphs variable names. One can have Currency Symbols, Mathematical Operators or Emoji variable names. An opportunity to be creative.
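The byte level matching can be demonstrated outside PHP. This Python sketch applies the regular expression quoted above to the UTF-8 bytes of some candidate names (without the $ prefix). It illustrates the matching rule only and is not how PHP itself is implemented. The function name is mine.

```python
import re

# The regular expression quoted from the PHP documentation, as a bytes pattern.
PHP_NAME = re.compile(rb"[a-zA-Z_\x7f-\xff][a-zA-Z0-9_\x7f-\xff]*")


def is_valid_php_name(name: str) -> bool:
    """Match the whole name, byte by byte, after UTF-8 encoding."""
    return PHP_NAME.fullmatch(name.encode("utf-8")) is not None


for name in ["Andre", "André", "小山", "🐉", "8ball"]:
    encoded = name.encode("utf-8")
    hex_bytes = " ".join(f"{byte:02X}" for byte in encoded)
    print(name, "|", hex_bytes, "|", is_valid_php_name(name))
```

Every byte of a non Basic Latin character is 80 or above, so such characters pass both the first-character class and the subsequent-character class.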

There are perhaps certain practices one should avoid when using Unicode for your variable names. The below are actually 3 different (valid) variable names even though they appear visually identical.

$André (uses U+00E9 LATIN SMALL LETTER E WITH ACUTE)
$André (uses U+0065 LATIN SMALL LETTER E & U+0301 COMBINING ACUTE ACCENT)
$Аndré (uses U+0410 CYRILLIC CAPITAL LETTER A)
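The difference between these three names can be made visible with a short Python sketch. The strings are written with escapes so that there is no doubt about which code points are used. Normalizing to NFC is shown as one possible way to detect the first kind of duplicate. It is my suggestion and not something PHP does for you.

```python
import unicodedata

composed = "Andr\u00e9"         # U+00E9 LATIN SMALL LETTER E WITH ACUTE
decomposed = "Andre\u0301"      # U+0065 followed by U+0301 COMBINING ACUTE ACCENT
cyrillic = "\u0410ndr\u00e9"    # begins with U+0410 CYRILLIC CAPITAL LETTER A

# The three strings look alike but are all different sequences of code points.
all_distinct = len({composed, decomposed, cyrillic}) == 3
print("all distinct:", all_distinct)

# NFC normalization merges the composed and decomposed forms...
same_after_nfc = unicodedata.normalize("NFC", decomposed) == composed
print("composed == NFC(decomposed):", same_after_nfc)

# ...but it cannot merge the Cyrillic lookalike, which is a different letter.
cyrillic_still_differs = unicodedata.normalize("NFC", cyrillic) != composed
print("Cyrillic still differs:", cyrillic_still_differs)
```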

What of the allowed byte value 7F (DELETE)? Why is this allowed in a variable name? I do not have a good answer to this. I also do not have a bad answer. At the moment I am going to leave this pending whilst I conduct further research.