Friday, 1 January 2016

Emoji by Name

Here is a method for typing Emoji by name: not by English name, but by Chinese name. OS X provides a Pinyin Input Method for writing Chinese. Pinyin is a romanization of Chinese. When you type in pinyin, a candidate window pops up listing the matching Chinese characters (汉字) and Emoji.

Candidate Window - Frequency

Candidate Window - Emoji

Here is a small sample of the Emoji which can be typed using this Pinyin Input Method. Each line below starts with the pinyin, followed by the Emoji. A given pinyin can have multiple meanings and multiple candidate Chinese characters, and hence multiple Emoji. Hopefully, for the examples given below, you will be able to work out the meanings from the Emoji. Some of the pinyin below represent objects and some represent emotions.

  1. ai — ❤️ 😘 💗 💓 😍
  2. bei shang — 😢 😭
  3. che — 🚗 🚘
  4. hou — 🐒 🐵
  5. hua — 🌹 🌼 💐 🌷 🌸 🌺
  6. ka fei — ☕️
  7. kai xin — 😄 😺 😃 😆 ☺️
  8. mao — 🐱 🐈 ⚓️
  9. niu — 🐂 🐃 🐄 🐮
  10. pi jiu — 🍺 🍻
  11. sheng qi — 😠 😡 💢 😾
  12. shu — 🌲 🌳 🌴 🐭
  13. shui guo — 🍉 🍊 🍇 🍈 🍌 🍍 🍎 🍑 🍒 🍓 🍅 🍆 🍋 🍏 🍐
  14. tuo la ji — 🚜
  15. xiang — 🐘
  16. xiao — 😊 😄
  17. xie — 👟 👠
  18. xue ren — ⛄️ ☃️
  19. yin yue — 🎵 🎷 🎶 🎸 🎹 🎺 🎻 🎼 🎤 🎧 📯
  20. yu — 🐟 🐠

Environment: OS X El Capitan v10.11.2

Friday, 11 December 2015

Unicode Regular Expressions

I have long been familiar with processing Unicode characters with RegExp (Regular Expressions). I was also aware that RegExp could be used to match Unicode characters based upon their Unicode-assigned character properties. I had not, though, yet written any such property-based RegExp. A few days ago I decided to explore this area.

An interesting property is the script to which a character belongs. For example, \p{Hangul} will match any character which belongs to the Hangul script. Hangul is the script used to write Korean.

I started with Perl and here is my simple Perl program:

#!/usr/bin/perl
if("노팅엄"=~/^\p{Hangul}+$/){print "korean\n";}else{print "not korean\n";}

...and this code did not work. I know that 노팅엄 is Korean Hangul, but my code disagreed. After much searching I discovered that I needed to include the statement use utf8, which tells Perl that the source file itself is UTF-8 encoded, so that the string literal is read as Hangul characters rather than as a sequence of bytes. So my working version of the code is:

#!/usr/bin/perl
use utf8;
if("노팅엄"=~/^\p{Hangul}+$/){print "korean\n";}else{print "not korean\n";}

...and now on to PHP using PCRE (Perl Compatible Regular Expressions). My initial RegExp was:

preg_match('/^\p{Hangul}+$/','노팅엄')

...and this did not work! We have already established that 노팅엄 is Korean Hangul, but my PHP code disagreed with me. After investigating further I discovered the u modifier, which tells PCRE to treat the pattern and the subject string as UTF-8. So add the u modifier and we now have working code!

preg_match('/^\p{Hangul}+$/u','노팅엄')
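
Both failures above appear to share one cause, which a short Python sketch can make visible. Python is used purely for illustration here and was not part of the original experiments. Without use utf8 or the u modifier, the matcher is handed the nine UTF-8 bytes of 노팅엄 rather than three Hangul characters. The helper all_hangul is hypothetical, and because Python's standard re module has no \p{Hangul}, it only approximates the script test by checking that each character's Unicode name starts with HANGUL.

```python
import unicodedata

def all_hangul(text):
    # Rough stand-in for ^\p{Hangul}+$ (an approximation, not a real script test):
    # every character must have a Unicode name beginning with "HANGUL".
    return len(text) > 0 and all(
        unicodedata.name(ch, "").startswith("HANGUL") for ch in text
    )

word = "노팅엄"
utf8_bytes = word.encode("utf-8")

# What a byte-oriented matcher sees: each byte treated as one Latin-1 character.
as_seen_without_utf8 = utf8_bytes.decode("latin-1")

print(len(word), all_hangul(word))
print(len(as_seen_without_utf8), all_hangul(as_seen_without_utf8))
```

The first line printed is 3 True and the second is 9 False: the same text passes or fails depending only on whether it is read as characters or as bytes.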

For several years now, my standard practice has been to save text files as Unicode UTF-8 encoded files. This includes code files. One still, though, has to repeatedly and explicitly tell systems, programs, functions, utilities and processes to use Unicode. We still seem to be a long way from a total Unicode environment in which everything is seamlessly and natively Unicode.

Environments: Perl v5.18.2; PHP v5.5.29

Sunday, 27 September 2015

Nottingham on Weibo

Sina Weibo 新浪微博 is a Chinese microblogging service (en.wikipedia.org/wiki/Sina_Weibo). Nottingham now has a number of organisations on Sina Weibo. Here are some of them.

  1. Nottingham City Council [@英国诺丁汉市政厅] weibo.com/nottinghamcity
  2. Nottingham Trent University [@英国诺丁汉特伦特大学] weibo.com/ntuinternational
  3. University of Nottingham [@英国诺丁汉大学官方微博] weibo.com/uoneao

Note: I only list Nottingham Weibo accounts that are verified and have a meaningful URL, i.e. not the default numeric form.

Wednesday, 29 July 2015

JavaScript Variable Names

In schappo.blogspot.co.uk/2015/06/php-variable-names.html I explained how PHP variable names are determined at the byte level. JavaScript variable names are determined at a higher level and are defined in terms of Unicode Properties and General Categories.

Letʼs start with some simple Basic Latin variable name examples.
  • Valid Names:  nottingham  nottingham8
  • Invalid Name:  8nottingham
where 8 is Unicode character U+0038 DIGIT EIGHT. The last name above is invalid because it begins with a digit.

We are in the Unicode age and so do not need to restrict ourselves to Basic Latin. Some time ago I asked myself whether the Basic Latin rule (a name may not begin with a digit) also applies to other scripts. Yes, it does (well, mostly).
  • Valid Devanagari Names:  नाटिंघम  नाटिंघम६
  • Invalid Devanagari Name:  ६नाटिंघम
where ६ is Unicode character U+096C DEVANAGARI DIGIT SIX
  • Valid Thai Names:  นอตทิงแฮม  นอตทิงแฮม๘
  • Invalid Thai Name:  ๘นอตทิงแฮม
where ๘ is Unicode character U+0E58 THAI DIGIT EIGHT
  • Valid Telugu Names:   నాటింగ్‌హామ్  నాటింగ్‌హామ్౬
  • Invalid Telugu Name:  ౬నాటింగ్‌హామ్
where ౬ is Unicode character U+0C6C TELUGU DIGIT SIX
  • Valid Chinese Names:  诺丁汉  诺丁汉八  八诺丁汉
where 八 is the digit 8, aka Unicode character U+516B CJK UNIFIED IDEOGRAPH-516B

So, Unicode CJK Ideographs are the exception, as a name can begin with a digit. Unicode CJK Ideographs encompass Chinese Hanzi, Japanese Kanji and Korean Hanja. The reason for this exception is the Unicode General Category to which these digits are assigned. The digits 8, ६, ๘, ౬ have the General Category Nd, which is Decimal Number, and thus cannot be the first character of a variable name. 八 has the General Category Lo, which is Other Letter, and thus can be the first character of a variable name even though semantically it is a digit.
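
The General Categories quoted above can be looked up with a few lines of Python, used here only as a convenient way to query the Unicode database. The function may_start_identifier is a hypothetical simplification: it consults only the General Category, following the ECMAScript 5 letter categories, and ignores $, _ and escape sequences.

```python
import unicodedata

# The four decimal digits from the examples above, plus the CJK ideograph for eight.
samples = ["\u0038", "\u096c", "\u0e58", "\u0c6c", "\u516b"]

def may_start_identifier(ch):
    # Simplified assumption: only the General Category is checked.
    # ECMAScript 5 also allows $, _ and Unicode escapes, which are ignored here.
    return unicodedata.category(ch) in {"Lu", "Ll", "Lt", "Lm", "Lo", "Nl"}

for ch in samples:
    print(f"U+{ord(ch):04X}", unicodedata.category(ch), may_start_identifier(ch))
```

The first four lines report Nd and False; the last reports U+516B Lo True.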

The above holds true for ECMAScript 5 and 6 and possibly earlier versions but it is unlikely I will test earlier versions. You can validate your variable names online at mothereff.in/js-variables.

Friday, 17 July 2015

Multilingual PDF

OS X has a file extension, lpdf, which is the Multilingual PDF of the title of this blog article. Letʼs look inside System Information.app using Terminal.

find . -name '*.lpdf'
./Contents/Resources/ProductGuides/productinfo1.lpdf
./Contents/Resources/ProductGuides/productinfo2.lpdf
./Contents/Resources/ProductGuides/productinfo3.lpdf
./Contents/Resources/ProductGuides/regulatory-022-00023.lpdf
./Contents/Resources/ProductGuides/regulatory-022-00272.lpdf
./Contents/Resources/ProductGuides/regulatory-022-00432.lpdf
./Contents/Resources/ProductGuides/regulatory-022-00458.lpdf
./Contents/Resources/ProductGuides/regulatory-022-00459.lpdf
./Contents/Resources/ProductGuides/regulatory-022-00465.lpdf
./Contents/Resources/ProductGuides/regulatory-022-00466.lpdf
./Contents/Resources/ProductGuides/regulatory-022-5167.lpdf
./Contents/Resources/ProductGuides/regulatory-022-6097.lpdf
./Contents/Resources/ProductGuides/regulatory-022-6098.lpdf
./Contents/Resources/ProductGuides/regulatory-022-6346.lpdf
./Contents/Resources/ProductGuides/regulatory-022-6347.lpdf


Looking inside, for example, productinfo1.lpdf/Contents/Resources we can see a set of lproj directories.
ar.lproj  es.lproj     id.lproj  pl.lproj     th.lproj
ca.lproj  es_MX.lproj  it.lproj  pt.lproj     tr.lproj
cs.lproj  fi.lproj     ja.lproj  pt_PT.lproj  uk.lproj
da.lproj  fr.lproj     ko.lproj  ro.lproj     vi.lproj
de.lproj  he.lproj     ms.lproj  ru.lproj     zh_CN.lproj
el.lproj  hr.lproj     nl.lproj  sk.lproj     zh_TW.lproj
en.lproj  hu.lproj     no.lproj  sv.lproj
The above directory names are of the form <language tag>.lproj. Languages include: ar (Arabic), fr (French), ja (Japanese), ko (Korean), th (Thai) and zh_CN (Chinese as used in China). Inside each of the above directories is a language-localized productinfo1.pdf, e.g. ko.lproj contains a productinfo1.pdf which is the Korean language version of the document.

More concisely: An lpdf is an OS X Package containing a set of language-localized PDFs.
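
The layout described above is simple enough to walk with a short script. The following Python sketch assumes only the Contents/Resources/<language tag>.lproj structure seen in this exploration; the function name localized_pdfs is hypothetical, and other lpdf packages may be organised differently.

```python
from pathlib import Path

def localized_pdfs(lpdf_path):
    """Map each language tag to the PDF file names in its .lproj directory."""
    resources = Path(lpdf_path) / "Contents" / "Resources"
    result = {}
    for lproj in sorted(resources.glob("*.lproj")):
        if lproj.is_dir():
            # "ko.lproj" -> "ko"; Path.stem drops only the final suffix.
            result[lproj.stem] = sorted(p.name for p in lproj.glob("*.pdf"))
    return result
```

Called with the path to productinfo1.lpdf, this should return a dictionary with one entry per language, such as a "ko" key listing productinfo1.pdf.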

Note: This exploration of lpdf was carried out using OS X Yosemite 10.10.4.

Monday, 22 June 2015

PHP Variable Names

For years I thought PHP variable names could only be constructed from ASCII characters. Actually, maybe I had not really thought about it but rather just followed common practice without question. The common practice being something like
  • a variable name is prefixed with $
  • the first character must be a letter (a-z, A-Z) or an underscore (_)
  • subsequent characters can be any mix of letters or digits (0-9) or underscore
So, examples of valid PHP variable names include
  • $Andre   $age   $previous_total
But!!!!! We are in the Unicode age and so variable names are NOT restricted to the above common practice. We can be much more creative. We can, for instance, localise our code. Examples of valid variable names include
  • $André   $小山   $エクセレント   $우수한   $🐉
For the following explanation I am assuming your source code file is saved as Unicode UTF-8. If not, it should be.

Letʼs refer to the definitive PHP documentation concerning variable names, which is at php.net/manual/en/language.variables.basics.php. The key is the specified regular expression
  • [a-zA-Z_\x7F-\xFF][a-zA-Z0-9_\x7F-\xFF]*
The variable name 小山, UTF-8 encoded, is E5 B0 8F E5 B1 B1, which is matched by the above regular expression. The variable name 🐉, UTF-8 encoded, is F0 9F 90 89, which is also matched by the above regular expression.
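
The byte-level matching described above can be reproduced outside PHP. This Python sketch applies the same pattern to the UTF-8 bytes of a name (without the leading $); the function is_valid_php_name and the sample name 2nd are made up for illustration, and the sketch mimics the documented pattern rather than PHP's actual parser.

```python
import re

# The pattern from the PHP manual, applied to bytes rather than to characters.
PHP_NAME = re.compile(rb"[a-zA-Z_\x7F-\xFF][a-zA-Z0-9_\x7F-\xFF]*")

def is_valid_php_name(name):
    # Encode first: the pattern is defined over byte values, not code points.
    return PHP_NAME.fullmatch(name.encode("utf-8")) is not None

for name in ["Andre", "小山", "🐉", "2nd"]:
    encoded = name.encode("utf-8")
    print(name, encoded.hex(" ").upper(), is_valid_php_name(name))
```

The lines for 小山 and 🐉 show the byte sequences E5 B0 8F E5 B1 B1 and F0 9F 90 89 followed by True, while 2nd is rejected because its first byte is an ASCII digit.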

Determination of valid variable names is at a low level, the byte level. A UTF-8 encoded character consists of 1 to 4 bytes. Only characters in the Basic Latin Unicode block (which is the same as ASCII) use 1 byte for encoding. All other characters require 2 to 4 bytes, and every one of those bytes has a value ≥ 0x80. The consequence is that if one uses non Basic Latin Unicode characters there are no restrictions whatsoever! Thus one can, for example, have Chinese, Japanese, Korean, Punjabi, Russian or Egyptian Hieroglyphs variable names. One can have Currency Symbol, Mathematical Operator or Emoji variable names. An opportunity to be creative.

There are perhaps certain practices one should avoid when using Unicode for your variable names. The below are actually 3 different (valid) variable names even though they appear visually identical.
  • $André  (uses U+00E9 LATIN SMALL LETTER E WITH ACUTE)
  • $André  (uses U+0065 LATIN SMALL LETTER E & U+0301 COMBINING ACUTE ACCENT)
  • $Аndré (uses U+0410 CYRILLIC CAPITAL LETTER A)
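
One way to catch such look-alikes is to compare names after Unicode normalization. The Python sketch below is only a suggestion for a check one might run over source code; PHP itself compares the raw bytes, and the three strings are built from escapes so that the differences are visible.

```python
import unicodedata

precomposed = "Andr\u00e9"      # U+00E9 LATIN SMALL LETTER E WITH ACUTE
decomposed = "Andre\u0301"      # U+0065 followed by U+0301 COMBINING ACUTE ACCENT
cyrillic_a = "\u0410ndr\u00e9"  # begins with U+0410 CYRILLIC CAPITAL LETTER A

names = [precomposed, decomposed, cyrillic_a]

# As raw strings, all three are different.
print(len(set(names)))

# NFC normalization merges the first two; the Cyrillic look-alike stays distinct.
nfc = [unicodedata.normalize("NFC", n) for n in names]
print(len(set(nfc)))
```

This prints 3 and then 2: normalization removes the composed versus decomposed difference but does nothing about characters borrowed from another script.
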
What of the allowed byte value 7F (DELETE)? Why is this allowed in a variable name? I do not have a good answer to this. I also do not have a bad answer. At the moment I am going to leave this pending whilst I conduct further research.

Sunday, 14 June 2015

Chinese Name

My adopted Chinese name is 小山, which consists of two Chinese characters. Some time ago I was informed that there is a single Chinese character that combines 小 and 山. The ideographic description character sequence is ⿱小山.

This combined character is encoded in Unicode in the SIP (Supplementary Ideographic Plane), in CJK Unified Ideographs Extension C, at codepoint U+2AA24. The only font I have found so far which contains a glyph for this character is Hanazono, which is available for download from osdn.jp/projects/hanazono-font/releases/62072. Hanazono is actually provided as two ttf font files: HanaMinA and HanaMinB. Here is what the combined character, which is in HanaMinB, looks like in TextEdit on OS X.



Is there a sound for this uncommon character? After much searching I discovered cns11643.gov.tw/MAIDB/query_general_view.do?page=c&code=263f which represents the sound as shān (Hanyu Pinyin) and ㄕㄢ (Zhuyin). Same sound for both forms of representation. CNS 11643 is a Taiwanese character set.
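
As a footnote to the codepoint U+2AA24 given earlier, this small Python sketch shows where the character sits and how it is encoded. It prints only numbers and hex, since displaying the glyph itself needs a font such as HanaMinB.

```python
ch = chr(0x2AA24)

plane = ord(ch) >> 16  # plane 2 is the Supplementary Ideographic Plane
utf8 = ch.encode("utf-8").hex(" ").upper()
utf16 = ch.encode("utf-16-be").hex(" ").upper()

print(f"U+{ord(ch):X} plane={plane}")
print("UTF-8 :", utf8)
print("UTF-16:", utf16)
```

The character needs four bytes in UTF-8 (F0 AA A8 A4) and a surrogate pair in UTF-16 (D8 6A DE 24), which is typical of anything outside the Basic Multilingual Plane.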