Monday 22 June 2015

PHP Variable Names

For years I thought PHP variable names could only be constructed from ASCII characters. Actually, maybe I had not really thought about it but rather just followed common practice without question. The common practice being something like
  • a variable name is prefixed with $
  • the first character must be a letter (a-z, A-Z) or an underscore (_)
  • subsequent characters can be any mix of letters or digits (0-9) or underscore
So, examples of valid PHP variable names include
  • $Andre   $age   $previous_total
But!!!!! We are in the Unicode age and so variable names are NOT restricted to the above common practice. We can be much more creative. We can, for instance, localise our code. Examples of valid variable names include
  • $André   $小山   $エクセレント   $우수한   $🐉
For the following explanation I am assuming your source code file is saved as Unicode UTF-8. If not, it should be.
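As a quick check, the names listed above can be dropped into a small script. This is only a sketch: it assumes the file is saved as UTF-8, and the numeric values are arbitrary placeholders chosen for illustration.

```php
<?php
// Assumes this file is saved as UTF-8.
// The values are arbitrary; only the variable names matter here.
$André = 1;
$小山 = 2;
$エクセレント = 3;
$우수한 = 4;
$🐉 = 5;

// Each name refers to its own variable, so they can be used like any other.
$total = $André + $小山 + $エクセレント + $우수한 + $🐉;
echo $total, "\n"; // prints 15
```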

Let's refer to the definitive PHP documentation concerning variable names, which is at php.net/manual/en/language.variables.basics.php. The key is the specified regular expression
  • [a-zA-Z_\x7F-\xFF][a-zA-Z0-9_\x7F-\xFF]*
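The regular expression is applied to bytes, not to characters. The variable name 小山, UTF-8 encoded, is E5 B0 8F E5 B1 B1, and every one of those bytes falls in the range 7F to FF, so the name is matched by the above regular expression. The variable name 🐉, UTF-8 encoded, is F0 9F 90 89, which is also matched by the above regular expression.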
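The match can be checked with preg_match. The sketch below assumes the file is saved as UTF-8 and anchors the quoted pattern with ^ and $ so that the whole name has to match. The helper names is_valid_name and hex_bytes are made up for this illustration and are not part of PHP.

```php
<?php
// Hypothetical helper: tests a name (without the leading $) against the
// regular expression quoted above, anchored to cover the whole string.
// No /u modifier is used, so the pattern is applied byte by byte.
function is_valid_name(string $name): bool
{
    return preg_match('/^[a-zA-Z_\x7F-\xFF][a-zA-Z0-9_\x7F-\xFF]*$/', $name) === 1;
}

// Hypothetical helper: shows the bytes of a string as spaced hex pairs.
function hex_bytes(string $s): string
{
    return strtoupper(implode(' ', str_split(bin2hex($s), 2)));
}

foreach (['Andre', '小山', '🐉', '9lives'] as $name) {
    echo $name, ': ', hex_bytes($name), ': ',
        is_valid_name($name) ? 'valid' : 'invalid', "\n";
}
// 小山 shows as E5 B0 8F E5 B1 B1 and is valid;
// 9lives is invalid because a digit cannot come first.
```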

Determination of valid variable names happens at a low level, the byte level. A UTF-8 encoded character consists of 1 to 4 bytes. Only characters in the Basic Latin Unicode block (which is the same as ASCII) use 1 byte for encoding. All other characters require 2 to 4 bytes, and every one of those bytes has a value ≥ 80 hexadecimal, which falls inside the \x7F-\xFF range of the regular expression. The consequence is that characters outside Basic Latin face no restrictions whatsoever: any of them may appear anywhere in a variable name, including as the first character. Thus one can, for example, have Chinese, Japanese, Korean, Punjabi, Russian or Egyptian Hieroglyphs variable names. One can have Currency Symbols, Mathematical Operators or Emoji variable names. An opportunity to be creative.
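The byte counts and byte values described above can be inspected directly. This is a minimal sketch, again assuming a UTF-8 source file; utf8_bytes is a hypothetical helper written for this post, not a built-in function.

```php
<?php
// Hypothetical helper: returns the bytes of a string as a list of integers.
function utf8_bytes(string $char): array
{
    return array_values(unpack('C*', $char));
}

foreach (['A', 'é', '小', '🐉'] as $char) {
    $bytes = utf8_bytes($char);
    $hex = implode(' ', array_map(function ($b) {
        return sprintf('%02X', $b);
    }, $bytes));
    // For anything outside ASCII, even the smallest byte is at least 0x80.
    $allHigh = min($bytes) >= 0x80 ? 'yes' : 'no';
    echo $char, ': ', count($bytes), ' byte(s): ', $hex,
        ', all bytes >= 80 hex: ', $allHigh, "\n";
}
// A  is 1 byte:  41
// é  is 2 bytes: C3 A9
// 小 is 3 bytes: E5 B0 8F
// 🐉 is 4 bytes: F0 9F 90 89
```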

There are perhaps certain practices one should avoid when using Unicode for variable names. The three names below are different (valid) variable names even though they appear visually identical, because their underlying bytes differ.
  • $André  (uses U+00E9 LATIN SMALL LETTER E WITH ACUTE)
  • $André  (uses U+0065 LATIN SMALL LETTER E & U+0301 COMBINING ACUTE ACCENT)
  • $Аndré (uses U+0410 CYRILLIC CAPITAL LETTER A)
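To make the difference between the three names visible, the sketch below builds each one from explicit code point escapes and uses variable variables to assign to them. It assumes PHP 7.0 or later, since the "\u{...}" escape does not exist in earlier versions. The holder names $precomposed, $decomposed and $cyrillic are chosen just for this illustration.

```php
<?php
// "\u{...}" escapes need PHP 7.0 or later.
$precomposed = "Andr\u{00E9}";        // é as a single code point
$decomposed  = "Andre\u{0301}";       // e followed by a combining acute accent
$cyrillic    = "\u{0410}ndr\u{00E9}"; // Cyrillic А, then precomposed é

// Variable variables: each string names a separate variable.
$$precomposed = 1;
$$decomposed  = 2;
$$cyrillic    = 3;

// Three distinct strings, so three distinct variables.
echo count(array_unique([$precomposed, $decomposed, $cyrillic])), "\n"; // prints 3
echo $$precomposed, $$decomposed, $$cyrillic, "\n";                     // prints 123

// The bytes show where the names differ.
foreach ([$precomposed, $decomposed, $cyrillic] as $name) {
    echo bin2hex($name), "\n";
}
// 416e6472c3a9
// 416e647265cc81
// d0906e6472c3a9
```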
What of the allowed byte value 7F (DELETE)? Why is this allowed in a variable name? I do not have a good answer to this. I also do not have a bad answer. At the moment I am going to leave this pending whilst I conduct further research.