Sunday, 31 July 2011

IDN Whitelisting

When browsing using IDNs (Internationalized Domain Names), it is important that the Unicode form is displayed in the browser address bar and not the Punycode form. Only the Unicode form is human readable. Well, it is if you know the language the IDN is written in. Let's take the Korean web address 송파구청.한국 by way of example.
  • 송파구청.한국 is the Unicode form
  • xn--2e0b569ap6hmmg.xn--3e0b707e is the Punycode form
You can convert between Unicode and Punycode forms using Verisign's online IDN Conversion Tool.
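
The same conversion can also be done locally. Here is a minimal sketch in Python, assuming the standard library's built-in idna codec (which follows the older IDNA 2003 rules) is good enough for the names you care about. The helper names to_punycode and to_unicode are my own.

```python
def to_punycode(hostname: str) -> str:
    """Convert a Unicode hostname to its Punycode (ASCII) form."""
    return hostname.encode("idna").decode("ascii")


def to_unicode(hostname: str) -> str:
    """Convert a Punycode hostname back to its Unicode form."""
    return hostname.encode("ascii").decode("idna")


print(to_punycode("송파구청.한국"))  # xn--2e0b569ap6hmmg.xn--3e0b707e

# The round trip gives back the original Korean address.
assert to_unicode("xn--2e0b569ap6hmmg.xn--3e0b707e") == "송파구청.한국"
```

Each label of the hostname is converted separately, which is why both the name and the TLD get their own xn-- prefix.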

Unfortunately, browsers do not always display the Unicode form; sometimes they display the Punycode form in the address bar instead. Personally, I think all browsers should always display the Unicode form. By displaying the Punycode form, browsers are impeding the internationalization of the internet. The reason they give for displaying the Punycode form is "security", meaning the risk that lookalike characters from different scripts are used to spoof a domain name. I am not convinced it is necessary, because registries take their own security measures to ensure that only legitimate IDNs are registered.

Getting a browser to display the Unicode form of IDNs is achieved by whitelisting. Different browsers have different whitelisting strategies and different methods for whitelisting. My setup is Mac OS X 10.6.8, using the English version of OS X and English versions of the browsers.

Firefox (v5.0.1)

Firefox whitelists by TLD (Top Level Domain). If a TLD is whitelisted the Unicode form will be displayed in the address bar☺. If a TLD is not whitelisted the Punycode form will be displayed in the address bar☹. There are many TLDs that are not whitelisted by default in Firefox.

One can view which TLDs are whitelisted in Firefox:
  • Open a Firefox window & type about:config in the address bar
  • Click the I'll be careful, I promise button
  • Type network.IDN.whitelist into the filter box
You will now see a list of whitelisted TLDs. They are of the form network.IDN.whitelist.TLD. Note that some TLDs begin with xn--. These are the Punycode forms of IDN TLDs.

Let's take the Korean IDN 송파구청.한국 as an example. 한국 means Korea. It is a recently introduced IDN TLD and, as such, is not whitelisted by default. It is the Punycode form, not the Unicode form, that needs to be whitelisted in Firefox. The Punycode form of 한국 is xn--3e0b707e. Assuming you are still in the Firefox config window, to whitelist 한국:
  • Right-click anywhere in the list of preferences
  • Select New  ▶  Boolean
  • Enter the preference name network.IDN.whitelist.xn--3e0b707e
  • Set the Boolean value to true and click OK
Korean IDNs with the TLD 한국 will now display in Unicode form, i.e. in Korean, as one would naturally expect.
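
If you want to whitelist more than one or two TLDs, working out the preference names by hand gets tedious. As a rough sketch, the Python below derives the preference name from a TLD and formats it as a user_pref line, which is the format Firefox reads from a user.js file in the profile folder. The function names are my own, and it assumes Python's built-in idna codec can handle the TLD in question.

```python
PREF_PREFIX = "network.IDN.whitelist."


def whitelist_pref_name(tld: str) -> str:
    """Return the about:config preference name for a TLD (Unicode or ASCII)."""
    ascii_tld = tld.strip(".").encode("idna").decode("ascii")
    return PREF_PREFIX + ascii_tld


def user_pref_line(tld: str) -> str:
    """Return a user.js line that sets the whitelist preference to true."""
    return f'user_pref("{whitelist_pref_name(tld)}", true);'


print(user_pref_line("한국"))
# user_pref("network.IDN.whitelist.xn--3e0b707e", true);
```

The printed line does the same job as the New ▶ Boolean steps above: it sets network.IDN.whitelist.xn--3e0b707e to true.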

Google Chrome (v12.0.742.122)

Chrome will display the Unicode form for, as Google puts it, "those languages that the user claims to understand"; otherwise the Punycode form will be displayed. As usual, for display: Unicode good, Punycode bad.

I will again use 송파구청.한국 to illustrate. If you have a basic, unconfigured English version of Chrome then, in the address bar, you will see xn--2e0b569ap6hmmg.xn--3e0b707e displayed instead of 송파구청.한국. So we need to configure Chrome to display the Unicode form for Korean IDNs:
  • Go to Chrome Preferences
  • Click Under the Hood
  • Click Languages and Spell-checker Settings...
  • In the Languages column click Add
  • Select Korean - 한국어 and click OK
Korean IDNs will now display correctly. Repeat for all other languages for which you want correct display of IDNs. For more detailed information see the IDN in Google Chrome article.

Safari (v5.1)

Safari whitelists IDNs by Script (as in the script used to write a language). Any IDN that is written in a whitelisted Script will be displayed in the address bar in Unicode form☺. Any IDN that is written in a Script that is not whitelisted will be displayed in Punycode form☹.

The whitelisted Scripts are in a text file named IDNScriptWhiteList.txt. The folder containing this file is:
  • /System/Library/Frameworks/WebKit.framework/Versions/Current/Resources
The default whitelisted Scripts are: Arabic, Armenian, Bopomofo, Canadian_Aboriginal, Devanagari, Deseret, Gujarati, Gurmukhi, Hangul, Han, Hebrew, Hiragana, Katakana_Or_Hiragana, Katakana, Latin, Tamil, Thai, Yi

If you want to whitelist other Scripts, such as Cyrillic or Greek, then you need to add the Script names to this file. You will require root privileges to edit this file.
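
To make the rule concrete, here is a small Python model of it. This is only a sketch of the behaviour as I have described it, not WebKit's actual code. Python's standard library has no Script property lookup, so the sketch guesses the Script from the first word of each character's Unicode name, which is a crude assumption that happens to work for the examples here. The function names and the Cyrillic test address are my own.

```python
import unicodedata

# Default whitelisted Scripts, as listed above.
DEFAULT_WHITELIST = {
    "Arabic", "Armenian", "Bopomofo", "Canadian_Aboriginal", "Devanagari",
    "Deseret", "Gujarati", "Gurmukhi", "Hangul", "Han", "Hebrew", "Hiragana",
    "Katakana_Or_Hiragana", "Katakana", "Latin", "Tamil", "Thai", "Yi",
}

# Assumption: name prefixes that do not match the Script name directly.
NAME_PREFIX_TO_SCRIPT = {"CJK": "Han", "CANADIAN": "Canadian_Aboriginal"}


def script_of(char: str) -> str:
    """Guess a character's Script from the first word of its Unicode name."""
    first_word = unicodedata.name(char, "UNKNOWN").split()[0]
    return NAME_PREFIX_TO_SCRIPT.get(first_word, first_word.capitalize())


def displays_as_unicode(hostname: str, whitelist=DEFAULT_WHITELIST) -> bool:
    """Model of the rule: every letter must belong to a whitelisted Script."""
    for char in hostname:
        if char in ".-" or char.isdigit():
            continue  # separators and digits are not tied to one Script
        if script_of(char) not in whitelist:
            return False
    return True


print(displays_as_unicode("송파구청.한국"))  # True, Hangul is whitelisted
print(displays_as_unicode("пример.испытание"))  # False, Cyrillic is not
print(displays_as_unicode("пример.испытание", DEFAULT_WHITELIST | {"Cyrillic"}))  # True
```

The last call shows the effect of adding Cyrillic to the whitelist: the same address switches from Punycode to Unicode display.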

Of all the browser whitelisting methods I have looked at, I consider this the most rational but also, for the average user, the most difficult to configure. Apple should implement a simple, user-accessible method for whitelisting Scripts.