Monday, 22 June 2015

PHP Variable Names

For years I thought PHP variable names could only be constructed from ASCII characters. Actually, maybe I had not really thought about it but rather just followed common practice without question. The common practice being something like
  • a variable name is prefixed with $
  • the first character must be a letter (a-z, A-Z) or an underscore (_)
  • subsequent characters can be any mix of letters or digits (0-9) or underscore
So, examples of valid PHP variable names include
  • $Andre   $age   $previous_total
But!!!!! We are in the Unicode age and so variable names are NOT restricted to the above common practice. We can be much more creative. We can, for instance, localise our code. Examples of valid variable names include
  • $André   $小山   $エクセレント   $우수한   $🐉
For the following explanation I am assuming your source code file is saved as Unicode UTF-8. If not, it should be.

Letʼs refer to the definitive PHP documentation on variable names, which is at php.net/manual/en/language.variables.basics.php. The key is the regular expression it specifies
  • [a-zA-Z_\x7F-\xFF][a-zA-Z0-9_\x7F-\xFF]*
The variable name 小山 UTF-8 encoded is E5 B0 8F E5 B1 B1, which is matched by the above regular expression. The variable name 🐉 UTF-8 encoded is F0 9F 90 89, which is also matched by the above regular expression.
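The match can be checked mechanically. Below is a minimal sketch in Python (not PHP), assuming the documented pattern is applied to the raw UTF-8 bytes of the name without the leading $. The helper name is_valid_php_name is made up for this illustration.

```python
import re

# The pattern quoted from the PHP manual, compiled as a bytes regex so that
# it is applied byte by byte rather than character by character.
NAME_PATTERN = re.compile(rb"[a-zA-Z_\x7F-\xFF][a-zA-Z0-9_\x7F-\xFF]*")


def is_valid_php_name(name: str) -> bool:
    """Hypothetical helper: True if the UTF-8 bytes of `name` match the pattern."""
    return NAME_PATTERN.fullmatch(name.encode("utf-8")) is not None


def hex_bytes(name: str) -> str:
    """Render the UTF-8 encoding as space-separated hex, e.g. 'E5 B0 8F'."""
    return " ".join(f"{b:02X}" for b in name.encode("utf-8"))


for candidate in ["Andre", "previous_total", "小山", "🐉", "9lives"]:
    print(candidate, "|", hex_bytes(candidate), "|", is_valid_php_name(candidate))
```

The last candidate starts with a digit, so it is the only one in the list that fails.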

Determination of valid variable names is at a low level, the byte level. A UTF-8 encoded character consists of 1 to 4 bytes. Only characters in the Basic Latin Unicode block (which is the same as ASCII) use 1 byte for encoding. All other characters require 2 to 4 bytes, and every one of those bytes has a value ≥ 80 (hex), which falls inside the 7F-FF range that the regular expression allows in any position. The consequence is that if one uses characters outside Basic Latin there are no restrictions whatsoever! Thus one can, for example, have Chinese, Japanese, Korean, Punjabi, Russian or Egyptian Hieroglyphs variable names. One can have Currency Symbols, Mathematical Operators or Emoji variable names. An opportunity to be creative.
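The byte-level claim is easy to confirm. A small Python sketch, assuming UTF-8 throughout; the function name utf8_profile is only illustrative.

```python
def utf8_profile(ch: str):
    """Return (number of UTF-8 bytes, whether every byte is >= 0x80)."""
    data = ch.encode("utf-8")
    return len(data), all(b >= 0x80 for b in data)


# One sample each for 1, 2, 3 and 4 byte encodings:
# U+0041 (A), U+00E9 (e with acute), U+5C71 (mountain), U+1F409 (dragon).
for ch in ["\u0041", "\u00e9", "\u5c71", "\U0001F409"]:
    length, all_high = utf8_profile(ch)
    print(f"U+{ord(ch):04X}: {length} byte(s), all bytes >= 0x80: {all_high}")
```

Only the Basic Latin sample reports a byte below 0x80; the other three consist entirely of bytes the pattern accepts anywhere in a name.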

There are perhaps certain practices one should avoid when using Unicode for variable names. The names below are actually 3 different (valid) variable names even though they appear visually identical.
  • $André  (uses U+00E9 LATIN SMALL LETTER E WITH ACUTE)
  • $André  (uses U+0065 LATIN SMALL LETTER E & U+0301 COMBINING ACUTE ACCENT)
  • $Аndré (uses U+0410 CYRILLIC CAPITAL LETTER A)
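Since names are compared as bytes, the three spellings can be told apart by their UTF-8 encodings. The Python sketch below builds each one from explicit escapes so nothing depends on how an editor saved the text. Unicode NFC normalisation is my own addition here, not something the PHP rule applies: it merges the first two spellings but leaves the Cyrillic one distinct.

```python
import unicodedata

precomposed = "Andr\u00e9"       # U+00E9 LATIN SMALL LETTER E WITH ACUTE
decomposed = "Andre\u0301"       # U+0065 followed by U+0301 COMBINING ACUTE ACCENT
cyrillic = "\u0410ndr\u00e9"     # starts with U+0410 CYRILLIC CAPITAL LETTER A

names = {"precomposed": precomposed, "decomposed": decomposed, "cyrillic": cyrillic}

for label, name in names.items():
    raw = " ".join(f"{b:02X}" for b in name.encode("utf-8"))
    same_after_nfc = unicodedata.normalize("NFC", name) == precomposed
    print(f"{label:12} {raw:24} equals precomposed after NFC: {same_after_nfc}")

# Three different byte sequences, hence three different variable names.
distinct_encodings = {name.encode("utf-8") for name in names.values()}
print(len(distinct_encodings))
```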
What of the allowed byte value 7F (DELETE)? Why is this allowed in a variable name? I do not have a good answer to this. I also do not have a bad answer. At the moment I am going to leave this pending whilst I conduct further research.

Sunday, 14 June 2015

Chinese Name

My adopted Chinese name is 小山, which consists of two Chinese characters. Some time ago I was informed that there is a single Chinese character that combines 小 and 山. The ideographic description character sequence is ⿱小山.

This combined character is encoded in Unicode in the SIP (Supplementary Ideographic Plane), CJK Unified Ideographs Extension C, at codepoint U+2AA24. The only font I have found so far which contains a glyph for this character is Hanazono, which is available for download from osdn.jp/projects/hanazono-font/releases/62072. Hanazono is actually provided as two TTF font files: HanaMinA and HanaMinB. Here is what the combined character, which is in HanaMinB, looks like in TextEdit on OS X.



Is there a sound for this uncommon character? After much searching I discovered cns11643.gov.tw/MAIDB/query_general_view.do?page=c&code=263f which represents the sound as shān (Hanyu Pinyin) and ㄕㄢ (Zhuyin). Same sound for both forms of representation. CNS 11643 is a Taiwanese character set.
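For anyone who wants to paste or store the combined character, here is a small Python sketch of how codepoint U+2AA24 is placed and encoded. The Extension C range used (U+2A700 to U+2B73F) is the block range as I understand it from the Unicode block definitions, so treat it as an assumption to verify.

```python
cp = 0x2AA24
ch = chr(cp)

# Plane number is the codepoint divided by 0x10000; plane 2 is the SIP.
plane = cp >> 16

# Assumed block range for CJK Unified Ideographs Extension C.
in_extension_c = 0x2A700 <= cp <= 0x2B73F

utf8 = ch.encode("utf-8")        # 4 bytes, since the character is outside the BMP
utf16 = ch.encode("utf-16-be")   # a surrogate pair

print("plane:", plane)
print("in Extension C:", in_extension_c)
print("UTF-8: ", " ".join(f"{b:02X}" for b in utf8))
print("UTF-16:", " ".join(f"{b:02X}" for b in utf16))
```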

Friday, 12 June 2015

New gTLDs CSV

There is a regularly updated ICANN CSV file which summarises key New gTLD information in 6 fields. It is available at newgtlds.icann.org/newgtlds.csv (ref: cabforum.org/pipermail/public/2014-September/003907.html)

field 1: ASCII form (A-label)
field 2: Unicode form (U-label) if an IDN, otherwise empty
field 3: registry operator
field 4: date registry agreement signed
field 5: application number
field 6: date of delegation (empty if not yet delegated)

Fields 1, 2 (if an IDN), 3 and 4 from this CSV are incorporated into the public suffix list at publicsuffix.org/list/public_suffix_list.dat

For example, the CSV entry for .コム is

xn--tckwe,コム,"VeriSign Sarl",2015-01-15,1-1254-37311,
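The entry above can be split into its 6 fields with a standard CSV parser. A minimal Python sketch follows. The field names are labels I made up from the field list above, and the sample is just the one line quoted here; the real file may well carry header lines that need skipping. The last step decodes the part of the A-label after "xn--" as Punycode to confirm it matches the U-label.

```python
import csv
import io

# Hypothetical column labels, following the six fields described in the post.
FIELD_NAMES = [
    "a_label",
    "u_label",
    "registry_operator",
    "agreement_date",
    "application_number",
    "delegation_date",
]

sample = 'xn--tckwe,コム,"VeriSign Sarl",2015-01-15,1-1254-37311,\n'

reader = csv.DictReader(io.StringIO(sample), fieldnames=FIELD_NAMES)
row = next(reader)

for key in FIELD_NAMES:
    print(f"{key}: {row[key]!r}")

# An empty sixth field means the TLD is not yet delegated.
is_delegated = row["delegation_date"] != ""

# Strip the "xn--" ACE prefix and decode the remainder as Punycode.
decoded_label = row["a_label"][len("xn--"):].encode("ascii").decode("punycode")
print("delegated:", is_delegated)
print("A-label decodes to U-label:", decoded_label == row["u_label"])
```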

It seems as though the Chrome browser uses this public suffix list. One interesting (and IMHO good) consequence is that Chrome will recognise a .コム IDN (e.g. anything.コム) as a domain name to be resolved even though it is not yet delegated. Chrome does not resort to a search. So Chrome is ready to go as soon as .コム is delegated.