Saturday, 10 March 2018

Computer Science Internationalization - Validating People Names

When coding, it is essential to consider and account for edge cases. Whilst coding Internationalised Programming Challenge 17 (jsfiddle.net/coas/4djhso1y), I happened upon an unexpected and fascinating edge case.

Before I reveal the edge case I need to give you some background information, starting with some Chinese characters.

娥,鄂,鹅,仒,厄,戹*,屵*,阨*,阿*,呝,俄,砨*,偔,堨*,圔*,誒*,噁*,儑*,貖*,礘*,櫮,鰪*,岋,阸*,妸,咢,匎*,卾,隘*,廅*,僫,蕚,噩,鍔,額,鰐,讹,吪,妿,咹,胺,啞*,蛯*,搤,磀,遻*,嶭*,騀,顎,鶚 ...and many more at chinese-tools.com/tools/sinograms.html?p=e

All these Chinese characters can be written in pinyin as E or e. The characters marked with * have multiple meanings, hence multiple pronunciations, hence multiple ways of being written in pinyin. The characters not marked with * are only written in pinyin as E or e. Some of you may well be thinking: what of tone marks? Well, I have never seen a Chinese person write pinyin with tone marks unless I explicitly requested it.

Some of these Chinese characters are family names and some would be suitable for given names.

So, now to the edge case. A Chinese name when written in pinyin could be E E or Ee E. I asked on Weibo 微博 whether anyone knew of any Chinese name which when written in pinyin is E E or Ee E. One person responded with the Chinese name 鄂娥 which written in pinyin is E E.

I reason that a person whose only language is English would think E E are initials and not the full name. Actually, before I considered name edge cases I would probably also have thought E E are initials. I have been aware for a long time that some Chinese characters can be written in pinyin as single letters such as e or a but I had not made the connection with people names.

This example illustrates that programmers need to thoroughly research naming conventions in different countries/cultures/languages before writing validation code.
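As a minimal sketch of how this edge case bites, consider a hypothetical rule that treats any single-letter name part as an initial and rejects it. Neither regex below is taken from Challenge 17; both are assumptions made up for illustration.

```javascript
// Hypothetical rule: every part of a romanised name must have at least two letters.
// Validators sometimes assume this in order to reject bare initials.
const naiveRomanisedName = /^[A-Za-z]{2,}(?: [A-Za-z]{2,})+$/;

// More tolerant sketch: each part needs only one letter.
const tolerantRomanisedName = /^[A-Za-z]+(?: [A-Za-z]+)+$/;

const samples = ["E E", "Ee E", "Li Na"];
for (const name of samples) {
  console.log(name, naiveRomanisedName.test(name), tolerantRomanisedName.test(name));
}
```

The naive rule rejects both E E and Ee E, although either could be a complete name in pinyin. The tolerant rule accepts them, at the cost of also accepting genuine initials.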

I would like to encompass several international naming conventions in my Challenge 17. So far, I have coded validation rules for Chinese 中文, Khmer ភាសាខ្មែរ, Korean 한국어, Vietnamese Tiếng Việt and a catchall. I welcome contributions of international naming rules, which I will code and incorporate into Challenge 17. You can email me or, if you do not know my email address, you can tweet me @andreschappo or contact me on Weibo 微博 @schappo.

Techie stuff: The regex I use for Chinese name validation is:

XRegExp("^((?![\\p{InKangxi_Radicals}\\p{InCJK_Radicals_Supplement}\\p{InCJK_Symbols_and_Punctuation}])\\p{Han}){2,4}$","u")

My regex is using negative look-ahead, recognisable by the ?! construct. For each character, I am checking that it is a Han* character and is not a radical or symbol or punctuation character. This can be generalised to:

(?!Character_Set_B)Character_Set_A

which reads as: a character must not be in Character_Set_B and must be in Character_Set_A in order to be valid.
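The same pattern can be sketched with the native JavaScript RegExp rather than XRegExp. The native engine has no \p{In...} block properties, so this sketch assumes the three blocks can be spelled out as code point ranges, and it uses \p{Script=Han} for Character_Set_A. The name hanName is hypothetical.

```javascript
// Character_Set_B: the three excluded blocks, written as code point ranges.
//   U+2E80..U+2EFF  CJK Radicals Supplement
//   U+2F00..U+2FDF  Kangxi Radicals
//   U+3000..U+303F  CJK Symbols and Punctuation
// Character_Set_A: \p{Script=Han} (requires the native u flag).
const hanName = /^(?:(?![\u2E80-\u2EFF\u2F00-\u2FDF\u3000-\u303F])\p{Script=Han}){2,4}$/u;

console.log(hanName.test("鄂娥"));            // two ordinary Han characters
console.log(hanName.test("\u2F00\u2F00"));    // two Kangxi radicals
console.log(hanName.test("鄂"));              // a single character, below the {2,4} minimum
```

Only the first test should pass: the radicals are caught by the look-ahead and the single character fails the length requirement.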

The same can be achieved using negative look-behind:

XRegExp("^(\\p{Han}(?<![\\p{InKangxi_Radicals}\\p{InCJK_Radicals_Supplement}\\p{InCJK_Symbols_and_Punctuation}])){2,4}$","u")

This can be generalised as:

Character_Set_A(?<!Character_Set_B)

which reads as: a character must be in Character_Set_A and not in Character_Set_B in order to be valid.
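As a rough check that the two forms agree, the sketch below builds both from the same Character_Set_B and runs them over a few samples. It uses the native JavaScript RegExp with explicit code point ranges in place of the XRegExp block names (an assumption, as native regexes lack \p{In...}), and it needs an engine that supports look-behind. The variable names are hypothetical.

```javascript
// Character_Set_B as code point ranges: CJK Radicals Supplement, Kangxi Radicals,
// CJK Symbols and Punctuation.
const excluded = "[\\u2E80-\\u2EFF\\u2F00-\\u2FDF\\u3000-\\u303F]";

// (?!Character_Set_B)Character_Set_A
const viaLookAhead = new RegExp(`^(?:(?!${excluded})\\p{Script=Han}){2,4}$`, "u");
// Character_Set_A(?<!Character_Set_B)
const viaLookBehind = new RegExp(`^(?:\\p{Script=Han}(?<!${excluded})){2,4}$`, "u");

const samples = ["鄂娥", "鄂\u3005", "\u2F00\u2F00", "鄂"];
for (const s of samples) {
  console.log(viaLookAhead.test(s), viaLookBehind.test(s));
}
```

The look-ahead rejects a character before consuming it, whereas the look-behind consumes the character and then checks what was just matched. For single-character sets such as these, the outcome is the same.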

The Chinese for Regular Expression (regex) is 正则表达式.

* A Han character, in this context, is actually a CJK (Chinese or Japanese or Korean) character but that is far too long a story for this blog article.

Update 30th March 2018: In my XRegExp, above, I am using the u flag. This enables Unicode but only the BMP (Basic Multilingual Plane). There is an A flag which enables the whole of Unicode, BMP + Astral characters, but, for quite some time, I could not get this to work. I did eventually find the problem and a solution. The problem is that jsdelivr minification breaks XRegExp. My solution was simply not to use the jsdelivr minified version of XRegExp. I am, therefore, using cdn.jsdelivr.net/npm/xregexp@4.1.1/xregexp-all.js and not cdn.jsdelivr.net/npm/xregexp@4.1.1/xregexp-all.min.js.

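In addition to using the A flag I have had to make a minor change to my regex: the character class in the look-ahead becomes an alternation, because XRegExp's astral mode does not accept Unicode tokens inside character classes. My updated regex is: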

XRegExp("^((?!\\p{InKangxi_Radicals}|\\p{InCJK_Radicals_Supplement}|\\p{InCJK_Symbols_and_Punctuation})\\p{Han}){2,4}$","A")

The positive outcome for the validation of Chinese and Korean Hanja names is that these names can now contain characters from the Unicode SIP (Supplementary Ideographic Plane) in addition to characters in all the other Unicode planes.
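To see why astral characters need special handling, here is a small sketch using the native JavaScript RegExp, not XRegExp. The native u flag behaves differently from the XRegExp flags described above, so this only illustrates the BMP versus astral distinction. The variable names are hypothetical.

```javascript
// U+20000 is the first code point of the SIP (CJK Unified Ideographs Extension B).
const sipChar = "\u{20000}";

console.log(sipChar.length);        // length in UTF-16 code units
console.log([...sipChar].length);   // length in code points

// Without the u flag, "." matches one UTF-16 code unit; with it, one code point.
const oneCharNoFlag = /^.$/;
const oneCharUnicode = /^.$/u;
console.log(oneCharNoFlag.test(sipChar), oneCharUnicode.test(sipChar));

// With the native u flag, \p{Script=Han} also matches SIP characters.
const twoHan = /^\p{Script=Han}{2}$/u;
console.log(twoHan.test("\u{20000}\u{20001}"));
```

A SIP character occupies two UTF-16 code units, so any regex that works one code unit at a time sees it as two separate halves rather than as one character.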

Saturday, 3 March 2018

Computer Science Internationalization - Bidirectional Text

For this blog article I have taken the last two text boxes from my internationalised programming challenge 9 at jsfiddle.net/coas/qa8190kn. Both text boxes contain the same text which is a mix of English and Hebrew. English is written and read left to right. Hebrew is written and read right to left. We are going to look more closely at bidirectional (bidi) text in browsersʼ text boxes. The two text boxes are:

text box 1: direction is Left ➜ Right


text box 2: direction is Left ⬅ Right

The text in these two boxes is shown in display order, which is the order presented to users. The order in which it is actually stored is called memory order or logical order; it is the order in which I typed the text. The memory order is:

So, a is the first character and h is the last character of the text. Selection of text is determined by memory order. Selecting characters 3 thru 6 will give cdבא, characters 6 thru 9 gives הדגב, characters 11 thru 14 gives תזef and so forth.
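Selection by memory order can be sketched as plain indexing into the stored string. The sample text below is hypothetical and is not the text from the challenge. The Hebrew letters are written as escapes so that the source code itself is not reordered on screen, and selectByPosition is an illustrative helper.

```javascript
// Hypothetical sample: four Latin letters, four Hebrew letters
// (alef, bet, gimel, dalet), then four Latin letters, in memory order.
const memoryOrder = "abcd" + "\u05D0\u05D1\u05D2\u05D3" + "efgh";

// Select characters first..last inclusive, counted from 1 in memory order.
function selectByPosition(text, first, last) {
  return Array.from(text).slice(first - 1, last).join("");
}

const selection = selectByPosition(memoryOrder, 3, 6);

// Print code points so the result is not affected by bidi rendering.
const codePoints = Array.from(
  selection,
  ch => "U+" + ch.codePointAt(0).toString(16).toUpperCase().padStart(4, "0")
);
console.log(codePoints.join(" "));
```

For this sample, characters 3 thru 6 are c, d, alef and bet in memory order, however a browser chooses to display them.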

Now onto the selection process. One way of selecting text is to use shift in conjunction with the arrow keys. You should make the following associations:

  • text box 1, Left ➜ Right
    • right arrow key becomes forward
    • left arrow key becomes back
  • text box 2, Left ⬅ Right
    • left arrow key becomes forward
    • right arrow key becomes back

Having made the above associations, forget all about left and right and think only of forward and back where forward is moving forwards through the text (memory order) and back is moving backwards through the text (memory order).

Now to selection, starting with the cursor between b and c:

  • shift forward forward forward, will select cdא
  • shift back back, will select ab

These key sequences will select the same text in both text boxes 1 & 2. The difference is in how the selection is presented to the user: it is presented in display order.
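The forward and back associations can be modelled in a few lines. This is a simplified sketch of the behaviour described above, not any browser's actual implementation. The sample text and the function names logicalStep and shiftSelect are hypothetical; "ArrowLeft" and "ArrowRight" are the standard KeyboardEvent key values.

```javascript
// Map an arrow key to a step in memory order: +1 is forward, -1 is back.
function logicalStep(direction, key) {
  const forwardKey = direction === "rtl" ? "ArrowLeft" : "ArrowRight";
  return key === forwardKey ? 1 : -1;
}

// Apply shift+arrow presses starting from a cursor offset; return the selected text.
function shiftSelect(text, cursor, direction, keys) {
  const chars = Array.from(text);
  let focus = cursor;
  for (const key of keys) {
    const next = focus + logicalStep(direction, key);
    focus = Math.min(chars.length, Math.max(0, next)); // stay inside the text
  }
  return chars.slice(Math.min(cursor, focus), Math.max(cursor, focus)).join("");
}

// Hypothetical sample text in memory order; the Hebrew letters are written as escapes.
const text = "abcd" + "\u05D0\u05D1\u05D2\u05D3" + "efgh";

// Cursor between b and c is offset 2. Three presses of forward in each box:
const inBox1 = shiftSelect(text, 2, "ltr", ["ArrowRight", "ArrowRight", "ArrowRight"]);
const inBox2 = shiftSelect(text, 2, "rtl", ["ArrowLeft", "ArrowLeft", "ArrowLeft"]);
console.log(inBox1 === inBox2);
```

In this model the two boxes differ only in which physical key means forward, so the same logical key sequence yields the same selected string in both.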

I suggest you practice selecting text in text boxes 1 & 2. Initially it will seem somewhat confusing. When you are selecting text in text boxes 1 & 2, use the memory order text box as a reference to more easily determine what should be selected.

Now try selecting text in the following two text boxes. They both contain the same text, which is a mix of Khmer and Arabic. Additionally, paste your selections into a word processor.

text box 3: direction is Left ➜ Right


text box 4: direction is Left ⬅ Right

On to a related topic: cursor movement in browser text boxes. Forget text selection. Cursor movement with the left and right arrow keys is in display order only. I think there should be a browser option to switch between cursor movement by display order and cursor movement by memory order. I would like the same option in word processors and text editors.