André 小山 Schappo

Wednesday, 29 July 2015

JavaScript Variable Names

In schappo.blogspot.co.uk/2015/06/php-variable-names.html I explained how PHP variable names are determined at the byte level. JavaScript variable names are determined at a higher level and are defined in terms of Unicode Properties and General Categories.

Letʼs start with some simple Basic Latin variable name examples.

Valid Names: nottingham nottingham8
Invalid Name: 8nottingham

where 8 is Unicode character U+0038 DIGIT EIGHT. The last name above is invalid because it begins with a digit.

We are in the Unicode age and so do not need to restrict ourselves to Basic Latin. Some time ago I asked myself whether the same Basic Latin rule (a name cannot begin with a digit) applies to other Scripts, and yes it does (well, mostly).

Valid Devanagari Names: नाटिंघम नाटिंघम६
Invalid Devanagari Name: ६नाटिंघम

where ६ is Unicode character U+096C DEVANAGARI DIGIT SIX

Valid Thai Names: นอตทิงแฮม นอตทิงแฮม๘
Invalid Thai Name: ๘นอตทิงแฮม

where ๘ is Unicode character U+0E58 THAI DIGIT EIGHT

Valid Telugu Names: నాటింగ్‌హామ్ నాటింగ్‌హామ్౬
Invalid Telugu Name: ౬నాటింగ్‌హామ్

where ౬ is Unicode character U+0C6C TELUGU DIGIT SIX

Valid Chinese Names: 诺丁汉 诺丁汉八 八诺丁汉

where 八 is the digit 8, aka Unicode character U+516B CJK UNIFIED IDEOGRAPH-516B

So, Unicode CJK Ideographs are the exception, as a name can begin with a digit. Unicode CJK Ideographs encompass Chinese Hanzi, Japanese Kanji and Korean Hanja. The reason for this exception is the Unicode General Category to which these digits are assigned. The digits 8, ६, ๘, ౬ have the General Category Nd (Decimal Number) and thus cannot be the first character of a variable name. 八 has the General Category Lo (Other Letter) and thus can be the first character of a variable name, even though semantically it is a digit.
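To make the General Category point concrete, here is a minimal Python sketch. It uses Python's unicodedata module to look up categories and applies a simplified version of the ECMAScript 5.1 identifier rule (section 7.6). The function name is_es5_style_name is my own invention, and the check deliberately ignores reserved words and \uXXXX escapes, so treat it as an approximation rather than a real JavaScript validator.

```python
import unicodedata

# General Categories named by the ECMAScript 5.1 identifier grammar (section 7.6).
START_CATEGORIES = {"Lu", "Ll", "Lt", "Lm", "Lo", "Nl"}
PART_CATEGORIES = START_CATEGORIES | {"Mn", "Mc", "Nd", "Pc"}


def is_es5_style_name(name):
    """Approximate check only: ignores reserved words and \\uXXXX escapes."""
    if not name:
        return False
    first, rest = name[0], name[1:]
    if first not in "$_" and unicodedata.category(first) not in START_CATEGORIES:
        return False
    for ch in rest:
        # $ and _ are always allowed; ZWNJ and ZWJ are allowed after the first character.
        if ch in "$_\u200c\u200d":
            continue
        if unicodedata.category(ch) not in PART_CATEGORIES:
            return False
    return True


# Show the General Category of each digit discussed above.
for ch in "8६๘౬八":
    print(f"U+{ord(ch):04X} {unicodedata.category(ch)}")

# Apply the simplified rule to a few of the example names.
for name in ["nottingham8", "8nottingham", "नाटिंघम६", "६नाटिंघम", "八诺丁汉"]:
    print(name, is_es5_style_name(name))
```

The first loop shows Nd for the four decimal digits and Lo for 八, which is exactly why only 八 survives as a first character.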

The above holds true for ECMAScript 5 and 6 and possibly earlier versions but it is unlikely I will test earlier versions. You can validate your variable names online at mothereff.in/js-variables.

For further reading I suggest:

ECMAScript 5 Names ecma-international.org/ecma-262/5.1/#sec-7.6
ECMAScript 6 Names ecma-international.org/ecma-262/6.0/index.html#sec-names-and-keywords
Unicode Identifier and Pattern Syntax unicode.org/reports/tr31/
Derived Properties ID_Start and ID_Continue in unicode.org/Public/UNIDATA/DerivedCoreProperties.txt

Friday, 17 July 2015

Multilingual PDF

OS X has a file extension, lpdf, which is the Multilingual PDF of this article's title. Letʼs look inside System Information.app using Terminal.

find . -name '*.lpdf'
./Contents/Resources/ProductGuides/productinfo1.lpdf
./Contents/Resources/ProductGuides/productinfo2.lpdf
./Contents/Resources/ProductGuides/productinfo3.lpdf
./Contents/Resources/ProductGuides/regulatory-022-00023.lpdf
./Contents/Resources/ProductGuides/regulatory-022-00272.lpdf
./Contents/Resources/ProductGuides/regulatory-022-00432.lpdf
./Contents/Resources/ProductGuides/regulatory-022-00458.lpdf
./Contents/Resources/ProductGuides/regulatory-022-00459.lpdf
./Contents/Resources/ProductGuides/regulatory-022-00465.lpdf
./Contents/Resources/ProductGuides/regulatory-022-00466.lpdf
./Contents/Resources/ProductGuides/regulatory-022-5167.lpdf
./Contents/Resources/ProductGuides/regulatory-022-6097.lpdf
./Contents/Resources/ProductGuides/regulatory-022-6098.lpdf
./Contents/Resources/ProductGuides/regulatory-022-6346.lpdf
./Contents/Resources/ProductGuides/regulatory-022-6347.lpdf

Looking inside, for example, productinfo1.lpdf/Contents/Resources we can see a set of lproj directories.


ar.lproj      es.lproj      id.lproj      pl.lproj      th.lproj
ca.lproj      es_MX.lproj   it.lproj      pt.lproj      tr.lproj
cs.lproj      fi.lproj      ja.lproj      pt_PT.lproj   uk.lproj
da.lproj      fr.lproj      ko.lproj      ro.lproj      vi.lproj
de.lproj      he.lproj      ms.lproj      ru.lproj      zh_CN.lproj
el.lproj      hr.lproj      nl.lproj      sk.lproj      zh_TW.lproj
en.lproj      hu.lproj      no.lproj      sv.lproj

The above directory names are of the form <language tag>.lproj. Languages include: ar (Arabic), fr (French), ja (Japanese), ko (Korean), th (Thai) and zh_CN (Chinese in China). Inside each of the above directories is a language-localized productinfo1.pdf, e.g. ko.lproj contains a productinfo1.pdf which is the Korean language version of the document.

More concisely: An lpdf is an OS X Package containing a set of language-localized PDFs.
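The package layout described above can be walked with a few lines of Python. This is only a sketch: the function name localized_pdfs is hypothetical, and it assumes the Contents/Resources/<language tag>.lproj layout seen in productinfo1.lpdf holds for any lpdf you point it at.

```python
from pathlib import Path


def localized_pdfs(lpdf_path):
    """Map each language tag to the PDF file names inside its .lproj directory."""
    resources = Path(lpdf_path) / "Contents" / "Resources"
    if not resources.is_dir():
        return {}
    result = {}
    for lproj in sorted(resources.glob("*.lproj")):
        if lproj.is_dir():
            # "ko.lproj" -> "ko", "zh_CN.lproj" -> "zh_CN"
            result[lproj.stem] = sorted(p.name for p in lproj.glob("*.pdf"))
    return result


if __name__ == "__main__":
    # Relative path as shown in the find output above; adjust to your system.
    package = "Contents/Resources/ProductGuides/productinfo1.lpdf"
    for tag, pdfs in localized_pdfs(package).items():
        print(tag, pdfs)
```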

Note: This exploration of lpdf was carried out using OS X Yosemite 10.10.4.

Monday, 22 June 2015

PHP Variable Names

For years I thought PHP variable names could only be constructed from ASCII characters. Actually, maybe I had not really thought about it but rather just followed common practice without question. The common practice being something like

a variable name is prefixed with $
the first character must be a letter (a-z, A-Z) or an underscore (_)
subsequent characters can be any mix of letters or digits (0-9) or underscore

So, examples of valid PHP variable names include

$Andre $age $previous_total

But!!!!! We are in the Unicode age and so variable names are NOT restricted to the above common practice. We can be much more creative. We can, for instance, localise our code. Examples of valid variable names include

$André $小山 $エクセレント $우수한 $🐉

For the following explanation I am assuming your source code file is saved as Unicode UTF-8. If not, it should be.

Letʼs refer to the definitive PHP documentation concerning variable names which is at php.net/manual/en/language.variables.basics.php. The key is the specified regular expression

[a-zA-Z_\x7F-\xFF][a-zA-Z0-9_\x7F-\xFF]*

The variable name 小山 UTF-8 encoded is E5 B0 8F E5 B1 B1 which is matched by the above regular expression. The variable name 🐉 UTF-8 encoded is F0 9F 90 89 which is also matched by the above regular expression.

Determination of valid variable names is at a low level, the byte level. A UTF-8 encoded character consists of 1 to 4 bytes. Only characters in the Basic Latin Unicode block (which is the same as ASCII) use 1 byte for encoding. All other characters require 2 to 4 bytes for encoding, and every one of those bytes has a value ≥ 0x80. The consequence is that if one uses non Basic Latin Unicode characters there are no restrictions whatsoever! Thus one can, for example, have Chinese, Japanese, Korean, Punjabi, Russian or Egyptian Hieroglyphs variable names. One can have Currency Symbol, Mathematical Operators or Emoji variable names. An opportunity to be creative.
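The byte-level matching can be reproduced outside PHP. Below is a minimal Python sketch that applies the same regular expression to the UTF-8 bytes of a name (without the leading $). The helper name is_php_style_name is my own, and this only mirrors the documented pattern; it is not a claim about how the PHP parser is implemented internally.

```python
import re

# The pattern from the PHP manual, applied to raw bytes rather than characters.
PHP_NAME = re.compile(rb"[a-zA-Z_\x7f-\xff][a-zA-Z0-9_\x7f-\xff]*")


def is_php_style_name(name):
    """True if the UTF-8 bytes of name (without the $) match the pattern."""
    return PHP_NAME.fullmatch(name.encode("utf-8")) is not None


for name in ["Andre", "小山", "🐉", "8nottingham"]:
    encoded = name.encode("utf-8")
    # Print the name, its UTF-8 bytes in hex, and whether the pattern matches.
    print(name, encoded.hex(" ").upper(), is_php_style_name(name))
```

Every byte of 小山 and 🐉 falls in the 7F-FF range, so both names match, whereas 8nottingham fails on its first byte.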

There are perhaps certain practices one should avoid when using Unicode for your variable names. The below are actually 3 different (valid) variable names even though they appear visually identical.

$André (uses U+00E9 LATIN SMALL LETTER E WITH ACUTE)
$André (uses U+0065 LATIN SMALL LETTER E & U+0301 COMBINING ACUTE ACCENT)
$Аndré (uses U+0410 CYRILLIC CAPITAL LETTER A)
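The difference between these three names is invisible on screen but easy to demonstrate in code. Here is a small Python sketch using escapes so that the code points are unambiguous. It shows that Unicode normalization (NFC) reconciles the first two spellings but not the Cyrillic lookalike. Whether to normalize identifiers in your own code base is a separate decision; this only illustrates the problem.

```python
import unicodedata

precomposed = "Andr\u00e9"    # U+00E9 LATIN SMALL LETTER E WITH ACUTE
decomposed = "Andre\u0301"    # U+0065 followed by U+0301 COMBINING ACUTE ACCENT
cyrillic = "\u0410ndr\u00e9"  # begins with U+0410 CYRILLIC CAPITAL LETTER A

# The raw strings differ, so they would be distinct variable names.
print(precomposed == decomposed)

# NFC composes e + combining acute into U+00E9, making the first two equal.
print(unicodedata.normalize("NFC", decomposed) == precomposed)

# Normalization does not map Cyrillic А to Latin A, so this one stays different.
print(unicodedata.normalize("NFC", cyrillic) == precomposed)
```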

What of the allowed byte value 7F (DELETE)? Why is this allowed in a variable name? I do not have a good answer to this. I also do not have a bad answer. At the moment I am going to leave this pending whilst I conduct further research.

Sunday, 14 June 2015

Chinese Name

My adopted Chinese name is 小山 which consists of two Chinese characters. Some time ago I was informed that there is a single Chinese character that combines 小 and 山. The ideographic description character sequence is ⿱小山

This combined character is encoded in Unicode in the SIP (Supplementary Ideographic Plane), CJK Unified Ideographs Extension C, at codepoint U+2AA24. The only font I have so far found which contains a glyph for this character is Hanazono, which is available for download from osdn.jp/projects/hanazono-font/releases/62072. Hanazono is actually provided as two ttf font files: HanaMinA and HanaMinB. Here is what the combined character, which is in HanaMinB, looks like in TextEdit on OS X.
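For readers who want to see what a SIP codepoint means in practice, here is a short Python sketch. It derives the plane number from U+2AA24 and prints the UTF-8 and UTF-16 encodings. The variable names are my own; the only input taken from the text above is the codepoint itself.

```python
cp = 0x2AA24
ch = chr(cp)

# Each Unicode plane holds 0x10000 codepoints; plane 2 is the SIP.
plane = cp >> 16

utf8 = ch.encode("utf-8")       # 4 bytes, as for any supplementary-plane character
utf16 = ch.encode("utf-16-be")  # a surrogate pair (2 x 16-bit code units)

print("plane:", plane)
print("UTF-8: ", utf8.hex(" ").upper())
print("UTF-16:", utf16.hex(" ").upper())
```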

Is there a sound for this uncommon character? After much searching I discovered cns11643.gov.tw/MAIDB/query_general_view.do?page=c&code=263f which represents the sound as shān (Hanyu Pinyin) and ㄕㄢ (Zhuyin). Same sound for both forms of representation. CNS 11643 is a Taiwanese character set.

Friday, 12 June 2015

New gTLDs CSV

There is a regularly updated ICANN csv file which summarises key New gTLD information in 6 fields. It is available at newgtlds.icann.org/newgtlds.csv (ref: cabforum.org/pipermail/public/2014-September/003907.html)

field 1: ASCII form (A-label)

field 2: unicode form (U-label) if an IDN otherwise empty

field 3: registry operator

field 4: date registry agreement signed

field 5: application number

field 6: date of delegation (empty if not yet delegated)

Fields 1, 2 (if an IDN), 3 and 4 from this csv are incorporated into the public suffix list publicsuffix.org/list/public_suffix_list.dat

For example, the csv entry for .コム is

xn--tckwe,コム,"VeriSign Sarl",2015-01-15,1-1254-37311,
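The six fields can be pulled apart with Python's csv module. The sketch below parses the single .コム entry quoted above. The field names in FIELD_NAMES are my own labels for the six fields, not column names from ICANN, and the real file may contain header lines that would need skipping.

```python
import csv
import io

# Hypothetical labels for the six fields described above.
FIELD_NAMES = ["a_label", "u_label", "registry_operator",
               "agreement_signed", "application_number", "delegated"]

sample = 'xn--tckwe,コム,"VeriSign Sarl",2015-01-15,1-1254-37311,\n'


def parse_entries(text):
    """Return one dict per non-empty CSV row, keyed by FIELD_NAMES."""
    return [dict(zip(FIELD_NAMES, row)) for row in csv.reader(io.StringIO(text)) if row]


entry = parse_entries(sample)[0]
print(entry["a_label"], entry["u_label"], entry["registry_operator"])
# An empty field 2 means not an IDN; an empty field 6 means not yet delegated.
print("IDN:", entry["u_label"] != "")
print("delegated:", entry["delegated"] != "")
```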

It seems as though the Chrome browser uses this public suffix list. One interesting (and IMHO good) consequence is that Chrome will recognise a .コム IDN (e.g. anything.コム) as a domain name to be resolved even though it is not yet delegated. Chrome does not resort to a search. So Chrome is ready to go as soon as .コム is delegated.

Tuesday, 10 March 2015

Nottingham

Орёл и Решка is a Russian-language Ukrainian TV travel programme. Season 9 episode 14 was all about Nottingham, which in Russian is Ноттингем. The presenters were Регина Тодоренко and Евгений Синельников. I watched this episode and found it fascinating to see and hear of Nottingham from a different point of view and language.

There is an English wikipedia article about Орёл и Решка https://en.wikipedia.org/wiki/Oryol_i_Reshka

A Russian-language write-up of the visit to Nottingham/Ноттингем is available on the programme's website orel-reshka.net/9-sezon/321-orel-i-reshka-07-12-2014-nottingem-velikobritaniya.html

Регина Тодоренко stayed in the Edwards room at Langar Hall Country House Hotel langarhall.com/rooms/edwards/

The London RouteMaster Bus was hired from Blackmore Commercials blackmorecommercials.co.uk

Throughout the video there are popups specifying the cost of various items. Interestingly, three different currencies are used - Russian Rubles (РУБ), British Pounds (£), US Dollars ($).

I used Unicode-savvy shortening services to create links to the below video - ta.gd/nottingham, ta.gd/ноттингем, ta.gd/诺丁汉, ta.gd/ノッティンガム, ta.gd/노팅엄, ta.gd/नाटिंघम, ta.gd/Νότιγχαμ, doiop.com/น็อตติ้งแฮม

Letʼs also have an emojified link ➜ 🍆😁🌊🏂🎳🐤🎉🍣.🆒🔗.ws

Monday, 9 March 2015

Duang

There has been much buzz about the invented word duang. It consists of both a Chinese character (well actually two) and a romanized form. The romanized form has even acquired the Chinese 1st tone, duāng

The Chinese character is constructed from Jackie Chan's Chinese name which has both Simplified and Traditional Chinese forms - Simplified ➛ 成龙 and Traditional ➛ 成龍. The Chinese character for duāng is constructed by placing the first character on top of the second to form a single composite character. Using Ideographic Description Characters, the two new duāng characters can be represented as the sequences:

⿱成龙
⿱成龍

What few realise is that there exists a font which contains glyphs for the duāng Chinese characters - BabelStone Han PUA babelstone.co.uk/Fonts/PUA.html. This font maps glyphs to codepoints in the Unicode PUA (Private Use Area) in the BMP (Basic Multilingual Plane). PUA codepoints can be used for any purpose by anyone, unlike the other codepoints which have to go through an approval process before they can be used. This font maps its two duāng glyphs to PUA codepoints U+F4E2 and U+F4E3.
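Whether a codepoint is Private Use can be checked programmatically. The Python sketch below confirms that U+F4E2 and U+F4E3 have the General Category Co (Private Use) and fall within the BMP Private Use Area range U+E000 to U+F8FF. It says nothing about which glyphs a font assigns to them; that mapping exists only inside the font.

```python
import unicodedata

BMP_PUA_START, BMP_PUA_END = 0xE000, 0xF8FF

results = {}
for cp in (0xF4E2, 0xF4E3):
    category = unicodedata.category(chr(cp))
    in_bmp_pua = BMP_PUA_START <= cp <= BMP_PUA_END
    results[cp] = (category, in_bmp_pua)
    print(f"U+{cp:04X} category={category} in BMP PUA={in_bmp_pua}")
```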

I installed BabelStone Han PUA font and here is how the duāng characters look in TextEdit on my iMac.

Update: The two duāng glyphs (using the same PUA codepoints U+F4E2 and U+F4E3 ) are also available in font BabelStone Han babelstone.co.uk/Fonts/Han.html

Sunday, 4 January 2015

India ccTLDs

India now has all seven of its IDN ccTLDs delegated to the Internet's Domain Name Root Zone. These seven ccTLDs encompass an impressive diversity of languages and scripts. All of these seven ccTLDs mean India.

.ভারত language=Bengali script=Bangla punycode=xn--45brj9c
.భారత్ language=Telugu script=Telugu punycode=xn--fpcrj9c3d
.ભારત language=Gujarati script=Gujarati punycode=xn--gecrj9c
.भारत language=Hindi script=Devanagari punycode=xn--h2brj9c
.بھارت language=Urdu script=Arabic punycode=xn--mgbbh1a71e
.ਭਾਰਤ language=Punjabi script=Gurmukhi punycode=xn--s9brj9c
.இந்தியா language=Tamil script=Tamil punycode=xn--xkc2dl3a5ee0h
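The punycode forms in the list above can be turned back into Unicode with Python's built-in punycode codec. This is a minimal sketch: the helper to_unicode is my own, and it performs only the raw Punycode decoding of a single label, not the full IDNA processing (case mapping, normalization, validity checks) a browser or registry would apply.

```python
# A-labels copied from the list above.
A_LABELS = [
    "xn--45brj9c",
    "xn--fpcrj9c3d",
    "xn--gecrj9c",
    "xn--h2brj9c",
    "xn--mgbbh1a71e",
    "xn--s9brj9c",
    "xn--xkc2dl3a5ee0h",
]

ACE_PREFIX = "xn--"


def to_unicode(a_label):
    """Decode one A-label to its Unicode form using raw Punycode decoding."""
    if not a_label.lower().startswith(ACE_PREFIX):
        return a_label  # plain ASCII label, nothing to decode
    return a_label[len(ACE_PREFIX):].encode("ascii").decode("punycode")


for label in A_LABELS:
    print(label, "->", to_unicode(label))
```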

It will take quite some time before there are live Domain Names using these ccTLDs. I will endeavour to provide examples as they go live.

Hindi ➪ सीडैक.भारत

IDN= Internationalized Domain Name
ccTLD= country code Top Level Domain

Thursday, 29 May 2014

New IDN gTLDs

The New gTLDs initiative has been ongoing for some time now so I thought it time I listed some Website Domain Names. I will only be listing IDNs (Internationalized Domain Names), which are those that use non-Latin Scripts. My intention is to have at least one list entry for each New IDN gTLD as it goes live, so this post will be ongoing for several months.

I will only be listing those IDNs I consider to be reasonably well integrated into the website. My inclusion criteria exclude Frame redirects and IDNs that redirect to an ASCII Domain Name.

Monday, 14 April 2014

Australian Universities on Weibo

There are quite a number of Australian Universities on Sina Weibo 新浪微博. Below I list those I have found. I only include those Australian Universities that have verified (the big blue V after the username) Weibo accounts. The text in square brackets is the username on Weibo.

Australian Catholic University [@ACUInternational] weibo.com/acuinternational
Charles Darwin University [@查尔斯达尔文大学] weibo.com/charlesdarwinuni
Curtin University [@科廷大学CurtinUniversity] weibo.com/CurtinWestAustralia
Deakin University [@澳大利亚迪肯大学] weibo.com/deakinuniversity
Federation University [@澳大利亚联邦大学FedUni] weibo.com/FedUniAustralia
Flinders University [@FlindersUni弗林德斯大学] weibo.com/flinders2011
La Trobe University [@澳大利亚拉筹伯大学] weibo.com/latrobeuniaus
Macquarie University [@澳大利亚麦考瑞大学] weibo.com/mquni
Monash University [@MonashUni澳大利亚蒙纳士大学] weibo.com/monashuniversityaust
Queensland University of Technology [@QUT昆士兰科技大学] weibo.com/qutbrisbane
Southern Cross University [@澳大利亚南十字星大学] weibo.com/scuchina
Swinburne University of Technology [@澳洲斯威本科技大学] weibo.com/swinburneuniversity
University of Adelaide [@澳大利亚阿德莱德大学] weibo.com/uniadelaide
University of Canberra [@堪培拉大学] weibo.com/unicanberra
University of Melbourne [@墨尔本大学官微] weibo.com/melbourneuni
University of New South Wales [@澳洲新南威尔士大学] weibo.com/ozunsw
University of Queensland [@昆士兰大学] weibo.com/myuq
University of South Australia [@南澳大学官方微博] weibo.com/studyatunisa
University of Southern Queensland [@澳大利亚南昆士兰大学] weibo.com/usqchina
University of Western Sydney [@西悉尼大学UWS] weibo.com/uwsinternational
University of Wollongong [@澳大利亚卧龙岗大学UOW] weibo.com/uowaustralia

Friday, 11 April 2014

Regular Expressions

Regular Expressions are not just about ASCII. They are (or should be) about Unicode, with ASCII being a very small subset of Unicode. The vast majority of Regular Expressions documentation and tutorials I have seen deal only with ASCII. The consequence is that many, perhaps most, readers will never consider non-ASCII text strings.

If one considers Unicode text strings then one can process text strings consisting of non Latin Scripts and Symbols. Scripts such as: Cyrillic, Devanagari, Tamil, Georgian, Cherokee, Chinese and Sinhala. Symbols such as: Currency, Arrows, Mathematical Operators, Mahjong Tiles and Playing Cards. Unicode has a repertoire of over 100000 characters which can be processed with Regular Expressions.

Mostly, Regular Expressions are no different when using Unicode as compared to using the very limited ASCII. I will give some simple examples using Hangul, which is the Script used for writing Korean. The Hangul characters I will be using in the examples below are in Unicode block Hangul Syllables U+AC00-D7AF. I will intersperse other Unicode characters in my examples below. I present the examples in the form of a terminal session transcript.

苹果电脑 ~: egrep '바나나'
abcdef
abc바나나def
abc바나나def

苹果电脑 ~: egrep '바.나.나'
바诺丁汉나拉夫堡나
바拉나夫나堡
바拉나夫나堡

苹果电脑 ~: egrep '[바나다]'
abcdef
보노도고로
ДЖԶख나ખ༁
ДЖԶख나ખ༁

苹果电脑 ~: egrep '[가-힣]'
abcdef
abc현def
abc현def

苹果电脑 ~: egrep '^[ 가-힣]+$'
abcdef
abc서울def
서울은 아름답다
서울은 아름답다

Where you see a line duplicated, that means there was a successful match with the Regular Expression. I have used egrep on OS X.
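The same patterns behave the same way in other common engines. As a rough illustration, here is a Python sketch that runs the five expressions from the transcript through the standard re module against the same input lines. The helper matching_lines is my own; it simply returns the lines egrep would have echoed back.

```python
import re

# (pattern, input lines) pairs taken from the transcript above.
EXAMPLES = [
    ("바나나", ["abcdef", "abc바나나def"]),
    ("바.나.나", ["바诺丁汉나拉夫堡나", "바拉나夫나堡"]),
    ("[바나다]", ["abcdef", "보노도고로", "ДЖԶख나ખ༁"]),
    ("[가-힣]", ["abcdef", "abc현def"]),
    ("^[ 가-힣]+$", ["abcdef", "abc서울def", "서울은 아름답다"]),
]


def matching_lines(pattern, lines):
    """Return the lines that contain a match, as egrep would print them."""
    return [line for line in lines if re.search(pattern, line)]


for pattern, lines in EXAMPLES:
    print(pattern, "->", matching_lines(pattern, lines))
```

Python strings are sequences of Unicode codepoints, so the dot and the character classes operate on whole Hangul syllables rather than on bytes.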

The transcript may look a bit odd because of the variety and unfamiliarity of the Unicode characters I have used. If, though, you carefully examine the above Regular Expressions you will see they have standard syntax and are actually elementary constructs. So, if you teach Regular Expressions, why not give your students an insight into processing Unicode strings and not just ASCII strings. Or, to put it another way, give your students an insight into processing multi-language strings and not just English strings. Or, to put it yet another way, code for the whole world and not just the English speaking world.

BTW, 苹果电脑 ~: is the prompt I set up for my iMac; the first four characters are Chinese for Apple Computer.

In the examples above, I have deliberately used one of the standard and common Regular Expression engines. I have accessed this engine via egrep. This type of engine is one which you will most likely encounter. Much less common, are the Regular Expression engines that have been extended with features specifically for Unicode. Such extensions, for instance, facilitate matching with Unicode characters having some specified property e.g. \p{Hangul} will match with any character belonging to the Hangul Script. More information on such engines is available at regular-expressions.info/unicode.html and unicode.org/reports/tr18/

Thursday, 16 January 2014

Japanese Domain Name

I believe はじめよう.みんな to be the world's first live fully Japanese Domain Name! It is written with the Japanese Hiragana script. みんな is one of Google's new gTLDs icannwiki.com/index.php/.みんな.

Google translates はじめよう to "Let's start with" and みんな to "Everyone" translate.google.co.uk/#ja/en/はじめよう%0Aみんな

One can use the Ideographic Full Stop rather than the ASCII Full Stop as the separator in Internationalized Domain Names, i.e. はじめよう。みんな. This then gives us the rather cool translation to English "Let's start with. Everyone" translate.google.co.uk/#ja/en/はじめよう。みんな
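The separator equivalence can be observed with Python's built-in idna codec, which implements the older IDNA 2003 rules and treats U+3002 IDEOGRAPHIC FULL STOP as a label separator. The sketch below encodes both spellings of the domain name and compares the results. It is only an illustration of the separator handling; current browsers and registries may apply newer IDNA rules.

```python
ascii_dot = "はじめよう.みんな"
ideographic_dot = "はじめよう\u3002みんな"  # U+3002 IDEOGRAPHIC FULL STOP

# Both spellings are converted to the same ASCII (A-label) form.
encoded_ascii = ascii_dot.encode("idna")
encoded_ideographic = ideographic_dot.encode("idna")

print(encoded_ascii.decode("ascii"))
print(encoded_ideographic == encoded_ascii)
```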