Sunday 31 July 2011

IDN Whitelisting

When browsing using IDNs (Internationalized Domain Names), it is important that the Unicode form is displayed in the browser address bar and not the Punycode form. Only the Unicode form is human readable, at least if you know the language the IDN is written in. Let's take the Korean web address 송파구청.한국 by way of example.
  • 송파구청.한국 is the Unicode form
  • xn--2e0b569ap6hmmg.xn--3e0b707e is the Punycode form
You can convert between Unicode and Punycode forms using Verisign's online IDN Conversion Tool.
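The same conversion can also be sketched in Python with the standard library's idna codec. This is a minimal sketch rather than a replacement for the online tool: the helper names to_punycode and to_unicode are hypothetical, and the built-in codec follows the older IDNA 2003 rules, which is enough for this example.

```python
def to_punycode(domain: str) -> str:
    """Convert a Unicode domain name to its Punycode (ASCII) form."""
    # The built-in "idna" codec encodes each dot-separated label
    # and adds the xn-- prefix to labels that are not plain ASCII.
    return domain.encode("idna").decode("ascii")


def to_unicode(domain: str) -> str:
    """Convert a Punycode (ASCII) domain name back to its Unicode form."""
    return domain.encode("ascii").decode("idna")


print(to_punycode("송파구청.한국"))
print(to_unicode("xn--2e0b569ap6hmmg.xn--3e0b707e"))
```

The first call prints the Punycode form shown above and the second prints the Korean form again.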

Unfortunately, browsers do not always display the Unicode form and instead display the Punycode form in the address bar. Personally, I consider that all browsers should always display the Unicode form. By displaying the Punycode form, browsers are impeding the internationalization of the internet. The reason they give for displaying the Punycode form is "security". I am not convinced this is necessary, because registries take their own security measures to ensure that only legitimate IDNs are registered.

Display of the Unicode form of IDNs is ensured by whitelisting. Different browsers have different whitelisting strategies and different methods for whitelisting. My setup is Mac OS X 10.6.8, using the English version of OS X and English versions of the browsers.

Firefox (v5.0.1)


Firefox whitelists by TLD (Top Level Domain). If a TLD is whitelisted the Unicode form will be displayed in the address bar☺. If a TLD is not whitelisted the Punycode form will be displayed in the address bar☹. There are many TLDs that are not whitelisted by default in Firefox.

One can view which TLDs are whitelisted in Firefox:
  • Open a Firefox window and type about:config in the address bar
  • Click the I'll be careful, I promise button
  • Type network.IDN.whitelist into the filter box
You will now see a list of whitelisted TLDs. They are of the form network.IDN.whitelist.TLD. Note that some TLDs begin with xn--. These are the Punycode forms of IDN TLDs.

Let's take the Korean IDN 송파구청.한국 as an example. 한국 means Korea and is a recently introduced IDN TLD, and as such it is not whitelisted by default. It is the Punycode form, not the Unicode form, that needs to be whitelisted in Firefox. The Punycode form of 한국 is xn--3e0b707e. Assuming you are still in the Firefox config window, then in order to whitelist 한국:
  • Right-click in the list of preferences
  • Select New  ▶  Boolean
  • Enter the preference name network.IDN.whitelist.xn--3e0b707e
  • Set the Boolean value to true and click OK
Korean IDNs with the TLD 한국 will now display in Unicode form, i.e. in Korean, as one would naturally expect.
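The preference name for any other IDN TLD can be derived in the same way. The Python sketch below builds the name from a Unicode TLD. The function names are hypothetical, and the user.js line is an assumption on my part: it formats the preference as it would be written in a Firefox profile's user.js file, as an alternative to the about:config steps above.

```python
def whitelist_pref_name(tld: str) -> str:
    """Return the about:config preference name that whitelists a TLD."""
    # Firefox expects the Punycode form; ASCII TLDs pass through unchanged.
    ascii_tld = tld.encode("idna").decode("ascii")
    return "network.IDN.whitelist." + ascii_tld


def user_js_line(tld: str) -> str:
    """Format the preference as a user.js entry (assumed alternative to about:config)."""
    return 'user_pref("{}", true);'.format(whitelist_pref_name(tld))


print(whitelist_pref_name("한국"))
print(user_js_line("한국"))
```

For 한국 the first line printed is network.IDN.whitelist.xn--3e0b707e, the same name entered by hand above.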

Google Chrome (v12.0.742.122)


Chrome will display the Unicode form for, as Google puts it, "those languages that the user claims to understand"; otherwise the Punycode form will be displayed. As usual, for display: Unicode good, Punycode bad.

I will again use 송파구청.한국 to illustrate. If you have a basic, unconfigured English version of Chrome then, in the address bar, you will see xn--2e0b569ap6hmmg.xn--3e0b707e displayed instead of 송파구청.한국. So we need to configure Chrome to display the Unicode form for Korean IDNs:
  • Go to Chrome Preferences
  • Click Under the Hood
  • Click Languages and Spell-checker Settings...
  • In the Languages column click Add
  • Select Korean - 한국어 and click OK
Korean IDNs will now display correctly. Repeat for all other languages for which you want correct display of IDNs. For more detailed information see the IDN in Google Chrome article.


Safari (v5.1)


Safari whitelists IDNs by Script (as in the script used to write a language). Any IDN that is written in a whitelisted Script will be displayed in the address bar in Unicode form☺. Any IDN that is written in a Script that is not whitelisted will be displayed in Punycode form☹.

The whitelisted Scripts are in a text file named IDNScriptWhiteList.txt. This file is located in the directory:
  • /System/Library/Frameworks/WebKit.framework/Versions/Current/Resources
The default whitelisted Scripts are: Arabic, Armenian, Bopomofo, Canadian_Aboriginal, Devanagari, Deseret, Gujarati, Gurmukhi, Hangul, Han, Hebrew, Hiragana, Katakana_Or_Hiragana, Katakana, Latin, Tamil, Thai, Yi

If you want to whitelist other Scripts, such as Cyrillic or Greek, then you need to add the Script names to this file. You will require root privileges to edit this file.
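Before editing the file, it can help to check which Scripts are missing. The Python sketch below compares a list of wanted Scripts against the whitelist text. It assumes the file lists one Script name per line and that lines starting with # are comments; both are assumptions about the file format. The function names are hypothetical, and the sample text is only a few of the default entries listed above.

```python
def parse_whitelist(text: str) -> set:
    """Return the set of Script names found in whitelist file text."""
    scripts = set()
    for line in text.splitlines():
        name = line.strip()
        # Skip blank lines and (assumed) comment lines.
        if name and not name.startswith("#"):
            scripts.add(name)
    return scripts


def missing_scripts(text: str, wanted: list) -> list:
    """Return the wanted Script names that are not yet whitelisted."""
    present = parse_whitelist(text)
    return [name for name in wanted if name not in present]


# A few of the default entries, one per line (assumed layout).
sample = "Arabic\nArmenian\nHangul\nHan\nLatin\n"
print(missing_scripts(sample, ["Hangul", "Cyrillic", "Greek"]))
```

With this sample, Hangul is already present, so only Cyrillic and Greek are reported as missing.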

Of all the browser whitelisting methods I have looked at, I consider this the most rational but also, for the average user, the most difficult to configure. Apple should implement a simple, user-accessible method for whitelisting Scripts.

Monday 25 July 2011

Loughborough Market T-shirts

On Saturday 23rd July, Gordon Gekko, Jonny (刘家杰) and I gave free T-shirts to Loughborough Market stallholders. These T-shirts have Loughborough Market printed on them, not in English but in Chinese or Japanese. This endeavour is part of an initiative to internationalise Loughborough and make the World aware of Loughborough.

In the photo below Gordon Gekko, who is on the left, is wearing the Chinese T-shirt and Jonny is wearing the Japanese T-shirt. The text printed on the T-shirts is:

  • 拉夫堡市场 which is Simplified Chinese for Loughborough Market
  • ラフバラ市場 which is Japanese for Loughborough Market
Loughborough Market is on Thursdays and Saturdays so if you visit the market on a warm sunny day you may well see stallholders wearing these T-shirts.


Note: It may interest you to know that Gordon has the adopted Chinese name 恺心 (Kǎixīn).

Thursday 21 July 2011

Promoting Loughborough Internationally

Those of you who use Twitter will be familiar with the concept and use of hashtags. Recently Twitter implemented internationalized hashtags. Specifically, they introduced hashtags for Chinese, Japanese, Hangeul (Korean) and Cyrillic scripts. This presents a golden opportunity for Loughborough Twitterers to raise the international profile of Loughborough. Or to put it another way: Make the World aware of Loughborough.

All you need to do is put internationalised versions of #loughborough in your tweets. The more tweets that contain internationalised hashtags, the more Loughborough will be noticed. If you would like to take part in this initiative, just copy one or more of the following hashtags and paste them into your tweets.

  • #拉夫堡
  • #ラフバラ
  • #러프버러
  • #Лафборо
where
  • 拉夫堡 is Chinese for Loughborough
  • ラフバラ is Japanese for Loughborough
  • 러프버러 is Korean for Loughborough
  • Лафборо is Russian for Loughborough
By way of example, here is a link to one of my tweets in which I have used the Japanese #ラフバラ hashtag:

Sunday 17 July 2011

Twitter Hashtags

One can now use Unicode hashtags on Twitter. This means that hashtags are no longer restricted to ASCII characters. One can now, for instance, have hashtags written in Chinese or Japanese. The hashtag #loughborough can be written in Chinese as #拉夫堡 and in Japanese as #ラフバラ. Unicode hashtags have been operational since 13th July 2011. The original announcement is on the Twitter Japan Blog at blog.jp.twitter.com/2011/07/blog-post.html. There is also an announcement of the new hangeul (한글) hashtags (해시태그/해쉬태그) on Twitter's Korea Blog at blog.kr.twitter.com/2011/07/blog-post.html.

I gather from the announcement that the currently supported Scripts for hashtags are Chinese, Japanese, Hangeul and Cyrillic. Therefore the supported Scripts are a small subset of the Scripts available in the Unicode Character Set. I tested out some Scripts in hashtags and my results are:

  1. Chinese ✓
  2. Japanese ✓
  3. Hangeul ✓
  4. Cyrillic ✓
  5. Thai ✗
  6. Arabic ✗
  7. Hebrew ✗
  8. Devanagari ✗
  9. Tamil ✗
The announcement states that symbols are not allowed in hashtags. I tested #→ #① #∛ #≤ #△ #☃ #◲ #✈ #❄ #☺ and none of them work as hashtags.
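One way to see what the tested symbols have in common is to look at their Unicode general categories. The Python sketch below accepts a character only if it is a letter, a combining mark or a decimal digit. This rule is a hypothetical approximation for illustration and is not Twitter's actual implementation; in particular it ignores the Script restrictions, so it does not reproduce the Thai, Arabic, Hebrew, Devanagari and Tamil results above.

```python
import unicodedata


def is_hashtag_char(ch: str) -> bool:
    """Hypothetical rule: allow letters, combining marks and decimal digits."""
    category = unicodedata.category(ch)
    return category[0] in ("L", "M") or category == "Nd"


def looks_like_hashtag(text: str) -> bool:
    """Check that text is # followed by one or more allowed characters."""
    body = text[1:]
    return text.startswith("#") and body != "" and all(is_hashtag_char(c) for c in body)


for tag in ["#拉夫堡", "#ラフバラ", "#러프버러", "#Лафборо", "#→", "#①", "#☺"]:
    print(tag, looks_like_hashtag(tag))
```

Under this rule the four Loughborough hashtags pass, while #→, #① and #☺ are rejected because their characters fall in symbol or other-number categories.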

An example tweet using Japanese hashtags is at http://twitter.com/#!/andreschappo/...

I wrote a previous article which covered some of the Twitter i18n issues: http://schappo.blogspot.com/2011/... The implementation of Unicode hashtags is a significant i18n step forward. It will be interesting to see how this new feature develops amongst Japanese-language tweeters. I have already noticed that there are some long Japanese hashtags. As I write this post, I notice that there are currently two Japanese hashtag Trending Topics in Japan:

  • #名言の文末を過去形にすると深みが増す
  • #文頭に週刊をつけるとディアゴスティーニ風になる

Japanese does not use the space character between words, and so it is easy and natural to create long Japanese hashtags. A hashtag could be the length of a complete tweet simply by having # as the first character.

Monday 4 July 2011

China's SNS

I have frequently advocated the use of China's SNS in order to have a presence in China. I have had a Sina Microblog 新浪微博 for several weeks at weibo.com/andreschappo. Sina's Microblog is a closed system, so you will need to register an account in order to view my microblog. The exception is that, for Verified Users, one can view the first few postings without an account. My account is not verified.

On Saturday, with the help of 刘家杰, I created a Sina Blog 新浪博客 at blog.sina.com.cn/andreschappo. Unlike Sina Microblog, Sina Blog is an open system, so you can view my blog without registering an account.