André 小山 Schappo: Computer Science Internationalization

When coding, it is essential to consider and account for edge cases. Whilst coding Internationalised Programming Challenge 17 (jsfiddle.net/coas/4djhso1y), I happened upon an unexpected and fascinating edge case.

Before I reveal the edge case I need to give you some background information, starting with some Chinese characters.

娥，鄂，鹅，仒，厄，戹*，屵*，阨*，阿*，呝，俄，砨*，偔，堨*，圔*，誒*，噁*，儑*，貖*，礘*，櫮，鰪*，岋，阸*，妸，咢，匎*，卾，隘*，廅*，僫，蕚，噩，鍔，額，鰐，讹，吪，妿，咹，胺，啞*，蛯*，搤，磀，遻*，嶭*，騀，顎，鶚 ...and many more at chinese-tools.com/tools/sinograms.html?p=e

All these Chinese characters can be written in pinyin as E or e. Characters marked with * have multiple meanings, hence multiple pronunciations, hence multiple ways of being written in pinyin. Characters not marked with * are written in pinyin only as E or e. Some of you may well be wondering about tone marks. Unless I explicitly request it, I have never seen a Chinese person write pinyin with tone marks.

Some of these Chinese characters are family names and some would be suitable for given names.

So, now to the edge case. A Chinese name when written in pinyin could be E E or Ee E. I asked on Weibo 微博 whether anyone knew of any Chinese name which when written in pinyin is E E or Ee E. One person responded with the Chinese name 鄂娥 which written in pinyin is E E.

I reason that a person whose only language is English would think E E are initials and not the full name. Actually, before I considered name edge cases, I would probably also have thought E E are initials. I have been aware for a long time that some Chinese characters can be written in pinyin as single letters such as e or a, but I had not made the connection with people's names.

This example illustrates that programmers need to thoroughly research naming conventions in different countries/cultures/languages before writing validation code.
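To make the pitfall concrete, here is a minimal sketch of how a plausible-looking rule can wrongly reject E E. Both rules and both function names are hypothetical and are not taken from Challenge 17; they only contrast "at least two letters per name part" with "at least one letter per name part".

```javascript
// Hypothetical naive rule: every name part must have at least two letters,
// on the assumption that a single letter can only be an initial.
const naiveNamePart = /^[A-Za-z]{2,}$/;

function naiveIsFullName(name) {
  return name.trim().split(/\s+/).every((part) => naiveNamePart.test(part));
}

// Hypothetical permissive rule: one or more letters per name part.
const permissiveNamePart = /^[A-Za-z]+$/;

function permissiveIsFullName(name) {
  return name.trim().split(/\s+/).every((part) => permissiveNamePart.test(part));
}

console.log(naiveIsFullName("E E"));       // false: 鄂娥 in pinyin is rejected
console.log(permissiveIsFullName("E E"));  // true
console.log(permissiveIsFullName("Ee E")); // true
```

The permissive rule is not a complete pinyin validator. It only shows that a minimum length of two letters per part is an assumption that does not hold for Chinese names.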

I would like to encompass several international naming conventions in my Challenge 17. So far, I have coded validation rules for Chinese 中文, Khmer ភាសាខ្មែរ, Korean 한국어, Vietnamese Tiếng Việt and a catchall. I welcome contributions of international naming rules, which I will code and incorporate into Challenge 17. You can email me or, if you do not know my email address, you can tweet me @andreschappo or contact me on Weibo 微博 @schappo

Techie stuff: The regex I use for Chinese name validation is:

XRegExp("^((?![\\p{InKangxi_Radicals}\\p{InCJK_Radicals_Supplement}\\p{InCJK_Symbols_and_Punctuation}])\\p{Han}){2,4}$","u")

My regex is using negative look-ahead, recognisable by the ?! construct. For each character, I am checking that it is a Han* character and is not a radical or symbol or punctuation character. This can be generalised to:

(?!Character_Set_B)Character_Set_A

which reads as: a character must not be in Character_Set_B and must be in Character_Set_A in order to be valid.
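For readers who want to try the look-ahead pattern without loading XRegExp, here is a rough sketch using native JavaScript RegExp. Native RegExp does not know the XRegExp block names, so I am assuming the three excluded blocks can be written as explicit code point ranges and that \p{Script=Han} stands in for \p{Han}. The variable names are my own.

```javascript
// Assumed code point ranges for the three excluded Unicode blocks:
//   CJK Radicals Supplement      U+2E80..U+2EFF
//   Kangxi Radicals              U+2F00..U+2FDF
//   CJK Symbols and Punctuation  U+3000..U+303F
const excludedAhead =
  "[\\u{2E80}-\\u{2EFF}\\u{2F00}-\\u{2FDF}\\u{3000}-\\u{303F}]";

// (?!Character_Set_B)Character_Set_A, repeated 2 to 4 times.
const chineseNameAhead = new RegExp(
  `^(?:(?!${excludedAhead})\\p{Script=Han}){2,4}$`,
  "u"
);

console.log(chineseNameAhead.test("\u9102\u5A25")); // 鄂娥: true
console.log(chineseNameAhead.test("\u9102"));       // 鄂 alone: false, too short
console.log(chineseNameAhead.test("\u2F08\u2F08")); // two Kangxi radicals: false
console.log(chineseNameAhead.test("\u3007\u3007")); // 〇〇 from Symbols and Punctuation: false
```

The look-ahead matters because characters in those three blocks can still carry the Han script property, so \p{Script=Han} on its own would accept them.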

The same can be achieved using negative look-behind:

XRegExp("^(\\p{Han}(?<![\\p{InKangxi_Radicals}\\p{InCJK_Radicals_Supplement}\\p{InCJK_Symbols_and_Punctuation}])){2,4}$","u")

This can be generalised as:

Character_Set_A(?<!Character_Set_B)

which reads as: a character must be in Character_Set_A and not in Character_Set_B in order to be valid.
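The look-behind form can be sketched the same way with native JavaScript RegExp, which supports look-behind in current engines such as recent versions of Node.js. As before, the explicit code point ranges and \p{Script=Han} are my stand-ins for the XRegExp block and script names, and the variable names are hypothetical.

```javascript
// Assumed code point ranges for CJK Radicals Supplement, Kangxi Radicals
// and CJK Symbols and Punctuation, in that order.
const excludedBehind =
  "[\\u{2E80}-\\u{2EFF}\\u{2F00}-\\u{2FDF}\\u{3000}-\\u{303F}]";

// Character_Set_A(?<!Character_Set_B), repeated 2 to 4 times:
// match a Han character, then check the character just matched is not excluded.
const chineseNameBehind = new RegExp(
  `^(?:\\p{Script=Han}(?<!${excludedBehind})){2,4}$`,
  "u"
);

console.log(chineseNameBehind.test("\u9102\u5A25")); // 鄂娥: true
console.log(chineseNameBehind.test("\u2F08\u2F08")); // two Kangxi radicals: false
console.log(chineseNameBehind.test("\u9102\u3007")); // 鄂〇: false
```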

The Chinese for Regular Expression (regex) is 正则表达式.

* A Han character, in this context, is actually a CJK (Chinese or Japanese or Korean) character but that is far too long a story for this blog article.

Update 30th March 2018: In my XRegExp, above, I am using the u flag. This enables Unicode but only the BMP (Basic Multilingual Plane). There is an A flag which enables the whole of Unicode, BMP + Astral characters, but, for quite some time, I could not get this to work. I did eventually find the problem and a solution. The problem is that jsdelivr minification breaks XRegExp. My solution was simply not to use the jsdelivr minified version of XRegExp. I am, therefore, using cdn.jsdelivr.net/npm/xregexp@4.1.1/xregexp-all.js and not cdn.jsdelivr.net/npm/xregexp@4.1.1/xregexp-all.min.js.

In addition to using the A flag I have had to make some minor changes to my regex. My updated regex is:

XRegExp("^((?!\\p{InKangxi_Radicals}|\\p{InCJK_Radicals_Supplement}|\\p{InCJK_Symbols_and_Punctuation})\\p{Han}){2,4}$","A")

The positive outcome for Chinese and Korean Hanja name validation is that these names can now contain characters from the Unicode SIP (Supplementary Ideographic Plane) in addition to characters in all the other Unicode planes.
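Here is a small sketch of why astral characters need special handling when a regex counts characters, as my {2,4} quantifier does. It uses native JavaScript RegExp, not XRegExp, so the native u flag shown here is not the XRegExp A flag discussed above. The example character U+20000 is my own choice of a SIP character.

```javascript
// U+20000 (𠀀) is in CJK Unified Ideographs Extension B, which is in the SIP.
const sipChar = "\u{20000}";

console.log(sipChar.length);      // 2: JavaScript strings count UTF-16 code units
console.log([...sipChar].length); // 1: iteration counts code points

// Without the native u flag, "." matches one code unit, so a single
// astral character satisfies {2,4}. With the u flag it counts as one.
const byCodeUnit = /^.{2,4}$/;
const byCodePoint = /^.{2,4}$/u;
console.log(byCodeUnit.test(sipChar));  // true
console.log(byCodePoint.test(sipChar)); // false

// A code point aware pattern accepts a two-character name with a SIP character.
const hanNameAstral = /^\p{Script=Han}{2,4}$/u;
console.log(hanNameAstral.test(sipChar + "\u5A25")); // true
```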

André 小山 Schappo

Saturday, 10 March 2018

Computer Science Internationalization - Validating People Names