Friday, 31 March 2017

Computer Science Internationalization - Adaptive URL

A URL can consist of a Domain Name and a pathname. In the examples below x.y.z represents the Domain Name, the remainder being the pathname. My experience of the internet is that the pathname is usually written in English or, more accurately, ASCII. The ASCII pathname x.y.z/home/bus/airplane/korea/taxi/hotel represents a multi-page website in the form of a journey from home to a hotel in Korea.


Websites, such as Google, adapt the language of their text content according to the browser preferred display language (BL). This browser preferred language can be set by the user. Letʼs go one step further than Google and adapt the language of the URL pathname according to the BL. Here is the ASCII pathname rewritten into Chinese, Japanese and Korean.




So, how do we implement these language adaptive URL pathnames? Firstly, we need to programmatically determine the BL. One way of achieving this is to examine the Accept-Language HTTP header sent from the browser to the server. This will contain one or more language tags. If there is more than one language tag they are presented in priority order. Language tags can take many forms. They include: zh, zh-CN and cmn for Mandarin Chinese; ja for Japanese and ko for Korean. Now that we can determine the BL we can select the appropriate URL pathname, thus internationalizing our website with a language adaptive URL pathname.
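As a minimal sketch of this step, the shell function below walks an Accept-Language value from left to right and prints the first supported language it finds. The function name pick_language and the short codes it prints are assumptions for illustration. It relies on the tags arriving in priority order and simply discards any q-values rather than sorting on them.

```shell
# Hypothetical helper: print the first supported language found in an
# Accept-Language header value, or "other" if none is supported.
# Assumes tags are listed in priority order; q-values are dropped, not sorted.
pick_language() {
    rest=$1
    while [ -n "$rest" ]; do
        tag=${rest%%,*}                 # text before the first comma
        if [ "$tag" = "$rest" ]; then
            rest=""                     # that was the last tag
        else
            rest=${rest#*,}             # keep everything after the comma
        fi
        tag=${tag%%;*}                  # drop a trailing ";q=0.8"
        tag=$(printf '%s' "$tag" | tr -d ' ' | tr '[:upper:]' '[:lower:]')
        case $tag in
            zh|zh-*|cmn|cmn-*) echo zh; return ;;
            ja|ja-*)           echo ja; return ;;
            ko|ko-*)           echo ko; return ;;
            en|en-*)           echo en; return ;;
        esac
    done
    echo other
}

pick_language 'ko-KR,ko;q=0.9,en-US;q=0.8,en;q=0.7'   # prints: ko
pick_language 'ar,fr;q=0.5'                           # prints: other
```

A real server would read the header from its own environment; the two sample header values here are made up.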

On a Linux machine, each component of the pathname will be a directory. In my schema I am assuming an index.html or index.php, per directory. A requirement of this schema is that we do not want a directory hierarchy for each language, nor do we want an index.html or index.php for each language.

My native language is English so I will make my master pathname directory names English, i.e. home, bus, airplane, korea, taxi and hotel. I will make the Chinese, Japanese and Korean directory names aliases of the English named master directories. This can be easily achieved on Linux with the ln -s command, where ln means link and the -s option means create a symbolic link, as opposed to a hard link.

ln -s home 家
ln -s home ホーム
ln -s home 홈

ln -s hotel 饭店
ln -s hotel ホテル
ln -s hotel 호텔
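
The following is a sketch only, run in a scratch directory that stands in for the real web root. It builds the English master hierarchy once and then creates each alias in the directory where its master lives, so every language reaches the same single index.html. The scratch directory, the page content and the three-column list format are assumptions. Only the aliases named above are created.

```shell
# Scratch directory standing in for the web root (assumption).
webroot=$(mktemp -d)
cd "$webroot" || exit 1

# One master hierarchy, one index.html at the end of the journey.
mkdir -p home/bus/airplane/korea/taxi/hotel
echo 'hotel page' > home/bus/airplane/korea/taxi/hotel/index.html

# Each line: <parent directory> <master name> <alias name>.
while read -r parent master link; do
    ( cd "$parent" && ln -s "$master" "$link" )
done <<'EOF'
. home 家
. home ホーム
. home 홈
home/bus/airplane/korea/taxi hotel 饭店
home/bus/airplane/korea/taxi hotel ホテル
home/bus/airplane/korea/taxi hotel 호텔
EOF

# The Korean names resolve to the same page as the English ones.
cat 홈/bus/airplane/korea/taxi/호텔/index.html   # prints: hotel page
```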

What if your native language is not English? In that case, create the master pathname directory names in your native language. If your native language is Korean then the master directory names will be 홈, 버스, 비행기, 한국, 택시 and 호텔 and your links will be:

ln -s 홈 home
ln -s 홈 家
ln -s 홈 ホーム

ln -s 호텔 hotel
ln -s 호텔 饭店
ln -s 호텔 ホテル

Emoji are hugely popular so letʼs construct a totally cool Emoji pathname.


ln -s home 🏡
ln -s bus 🚌
ln -s airplane ✈️
ln -s korea 🇰🇷
ln -s taxi 🚕
ln -s hotel 🏨

I have never encountered an Emoji URL pathname on a website and so implementing such a pathname on your website would be both totally cool and unique. You could also use an Emoji pathname for those languages your website does not support. My schema only supports Chinese, English, Japanese and Korean. If the BL was an unsupported language, such as Arabic, then the Emoji pathname could be displayed in the browser address bar instead of, for example, defaulting to English.
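
The fallback idea can be sketched as a simple lookup. The function name home_dir_for and the short language codes are hypothetical. It takes a language code that has already been determined and prints the name of the first pathname component, falling back to the Emoji alias for anything unsupported.

```shell
# Hypothetical lookup: language code -> first pathname component.
# Any language outside zh, ja, ko and en gets the Emoji alias.
home_dir_for() {
    case $1 in
        zh) echo 家 ;;
        ja) echo ホーム ;;
        ko) echo 홈 ;;
        en) echo home ;;
        *)  echo 🏡 ;;
    esac
}

home_dir_for ja   # prints: ホーム
home_dir_for ar   # prints: 🏡
```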

I have used x.y.z to represent the Domain Name, the implication being it is ASCII. We can complete the language adaptive equation by having Domain Names in supported BL languages. Thus my completed equation schema would have Chinese, Japanese and Korean Domain Names in addition to an ASCII Domain Name.

Friday, 17 March 2017

Computer Science Internationalization - EAI

As I stated in an earlier post, both DataMail and Google Mail support Email Address Internationalization (EAI). DataMail provides a complete EAI service which includes both support for and creation of internationalized email addresses. Google Mail provides a partial EAI service, in that it supports EAI but does not yet provide for creation of internationalized email accounts with internationalized email addresses. Thus organisations using Google Mail have an advantage over those organisations having an ASCII-only email service and have a head start in provision of a complete EAI service.

Given the Domain name of an organisation, the Unix host command can be used to determine the mail service provider. Here are some of the organisations using Google Mail:
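
As a rough sketch, the check can be automated by filtering the output of host for Google mail exchangers. The function name uses_google_mail is hypothetical, and the sample line uses the placeholder domain example.org rather than real lookup output. On a networked machine one would pipe in `host -t mx` output for the domain of interest instead.

```shell
# Hypothetical filter: succeeds if `host` output on stdin names a mail
# exchanger under google.com or googlemail.com.
uses_google_mail() {
    grep -Eiq 'mail is handled by [0-9]+ [^ ]*\.google(mail)?\.com\.?$'
}

# Placeholder line in the format printed by host (not a real lookup).
sample='example.org mail is handled by 1 ASPMX.L.GOOGLE.COM.'

if printf '%s\n' "$sample" | uses_google_mail; then
    echo 'Google Mail'
else
    echo 'other provider'
fi
```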

苹果电脑 ~: host
has address
has address
has address
mail is handled by 10
mail is handled by 1
mail is handled by 10
mail is handled by 5
mail is handled by 10
mail is handled by 5
mail is handled by 10
苹果电脑 ~: host
has address
has address
mail is handled by 30
mail is handled by 10
mail is handled by 20
mail is handled by 30
mail is handled by 20
苹果电脑 ~: host # ミクシィ
has address
has address
has address
mail is handled by 30
mail is handled by 10
mail is handled by 20
mail is handled by 20
mail is handled by 30
苹果电脑 ~: host # University of Bristol
has address
mail is handled by 5 ALT1.ASPMX.L.GOOGLE.COM.
mail is handled by 10 ASPMX2.GOOGLEMAIL.COM.
mail is handled by 1 ASPMX.L.GOOGLE.COM.
mail is handled by 10 ASPMX3.GOOGLEMAIL.COM.
mail is handled by 5 ALT2.ASPMX.L.GOOGLE.COM.
苹果电脑 ~: host # Bath Spa University
has address
has address
has address
mail is handled by 10 ALT4.ASPMX.L.GOOGLE.COM.
mail is handled by 5 ALT2.ASPMX.L.GOOGLE.COM.
mail is handled by 1 ASPMX.L.GOOGLE.COM.
mail is handled by 5 ALT1.ASPMX.L.GOOGLE.COM.
mail is handled by 10 ALT3.ASPMX.L.GOOGLE.COM.

Providing a full EAI service involves going beyond ASCII. It entails supporting Unicode email addresses, such as my Chinese email address 小山@电邮.在线

Tuesday, 31 January 2017

Computer Science Internationalization - Unicode Terminal Session

Below is an OSX bash shell command line terminal session. It is a real, working terminal session using basic Unix commands. It does, though, look significantly different from a standard terminal session. If you know basic Unix commands such as ls and cd, you should be able to work out what is happening.

苹果电脑 ~: 妈 我的目录
苹果电脑 ~: 茶 我的目录
苹果电脑 我的目录: 丽
苹果电脑 我的目录: 头 文档一 文档二 文档三
苹果电脑 我的目录: 丽
文档一  文档三  文档二
苹果电脑 我的目录: 词 > 文档四
苹果电脑 我的目录: 词 文档四
苹果电脑 我的目录: 丽
文档一  文档三  文档二  文档四
苹果电脑 我的目录: ⇉ 文档四 文档五
苹果电脑 我的目录: 丽
文档一  文档三  文档二  文档五  文档四
苹果电脑 我的目录: → 文档一 文档六
苹果电脑 我的目录: 丽
文档三  文档二  文档五  文档六  文档四
苹果电脑 我的目录: 

So, what is happening!?

Firstly I am using Unicode characters. If you search the internet you will find many examples of terminal sessions but they will invariably be using ASCII characters only. In my above terminal session I am using Unicode characters, mostly Chinese/Japanese and two arrow symbol characters.

Where are the commands such as ls and cd? I have mapped a set of commands to Unicode characters using the alias command, e.g. alias 丽='ls'

I have changed the command line prompt.

If you understand basic bash commands, I believe I have now given you sufficient information in order for you to work out what is happening in the terminal session. Knowing Chinese or Japanese gives a slight advantage but it is not essential to understanding this terminal session. The Chinese/Japanese characters I chose for the command mappings are somewhat random so it will not help you to google translate them.

I actually devised these command mappings and the terminal session several years ago. Today, I decided it was time to put it onto my blog. My main purpose was and still is, to encourage students to think beyond ASCII. I believe it has impact because it is so unexpected when one first sees this terminal session.

There can be many different permutations of the session using different human language scripts and Unicode symbols. It makes for an interesting and unusual exercise for students studying Unix. There is absolutely no reason why one should not, for example, use emoji for the command mappings.
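
As one possible starting point for such an exercise, here is a hypothetical fragment for a .bash_profile that maps a few basic commands to Emoji. The particular Emoji are arbitrary choices for illustration and are not the mappings used in the session above. It assumes a terminal and shell running in a UTF-8 locale.

```shell
# Hypothetical ~/.bash_profile fragment: Emoji names for basic commands.
# The Emoji chosen are arbitrary; pick your own for the exercise.
alias 📂='cd'
alias 📋='ls'
alias 📄='cat'
```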

Monday, 9 January 2017

Chinese Email Address

The latest and hottest news is that I now have a Chinese email address➜ 小山@电邮.在线 😄

  1. 小山 is my adopted Chinese name
  2. 电邮 means email
  3. 在线 means online

I acquired my free Chinese email address from DataMail which supports email addresses in twelve languages: العَرَبِيَّة‎‎ Arabic, বাংলা Bengali, 中文 Chinese, English, ગુજરાતી Gujarati, हिन्दी Hindi, मराठी Marathi, ਪੰਜਾਬੀ Punjabi, ру́сский Russian, தமிழ் Tamil, తెలుగు Telugu, اُردُو‎ Urdu.

Additionally, DataMail has an impressive family of IDNs (Internationalized Domain Names) with each language having its own IDN.
  1. Arabic داده.امارات
  2. Bengali ডাটামেল্.ভারত
  3. Chinese 电邮.在线
  4. English
  5. Gujarati ડાટામેલ.ભારત
  6. Hindi डाटामेल.भारत
  7. Marathi डेटामेल.भारत
  8. Punjabi ਡਾਟਾਮੇਲ.ਭਾਰਤ
  9. Russian почта.рус
  10. Tamil இந.இந்தியா
  11. Telugu డేటామెయిల్.భారత్
  12. Urdu ڈاٹامیل.بھارت

If you would like your own DataMail email address in one of the above languages then just click one of the above links. The website directs you to download an Android or iOS App. One uses the App to actually register a DataMail email address.

The main points in the registration process using the DataMail App are:

  1. The crucial part of this process is that firstly you need to select the language for the email address you are about to register. Subsequent instructions will be in the language you have selected. So, I chose Chinese in order to register 小山@电邮.在线.
  2. Validation of your phone number - the DataMail App will, with your approval, send an SMS text to DataMail in India to confirm your phone number. If the validation process fails, it could be that your phone contract does not cover the sending of international SMS text.
  3. Choosing the local-part which in my case is 小山. The Domain Name part is fixed and is provided by DataMail. There is a Domain Name per language, as above.

I have successfully exchanged emails between Gmail ASCII email addresses and my DataMail Chinese email address. Gmail supports Internationalized Email Addresses (IEAs) but one cannot create IEAs in Gmail. DataMail, to my knowledge, is currently the only production email system that both supports and allows creation of IEAs. I used Gmail with a browser when testing exchange of IEAs. If you are accessing your Gmail using IMAP or POP then IEAs may or may not work. It all depends on whether or not your client software supports IEAs.

I have sent email from DataMail using my Chinese email address 小山@电邮.在线 to several Gmail users. My current experience is that for some of the Gmail users, my email goes to their spam folder instead of their primary inbox. If this is happening to you or your recipients, please mark the Gmail email as 'not spam' to help prevent recurrences of this problem.

In addition to the App, DataMail can be used with a web browser ➜ 邮.电邮.在线

Currently, the few systems supporting internationalized email addresses are DataMail, Gmail and Outlook 2016. So, what to do when exchanging email with a system that only supports ASCII email addresses? DataMail have thought about this issue and offer email aliasing. One can create ASCII email aliases and use them to exchange email with systems that do not yet support international email addresses. My DataMail mailbox has the Chinese email address 小山@电邮.在线 and ASCII addresses thus allowing me to communicate with any email system.

DataMail is a good example of an AI (Adaptive Internationalized) website. It adapts to the language of the web address used for access. The most obvious adaptation is the text content is in the language of the web address. Secondly, the appropriate language button is highlighted. Finally, and perhaps less obviously, in the top right corner there is a DataMail support email address which is in the current web address language. In the case of 电邮.在线 the DataMail support email address is 支持@电邮.在线

Letʼs examine some of the technicalities of EAI (Email Address Internationalization). The structure of an email address is local-part@Domain Name where the Domain Name identifies a mail server and local-part identifies a mailbox on said mail server. The email addresses you will be most familiar with are ASCII local-part@ASCII Domain Name. IEAs, on the other hand, are of the form Unicode local-part@Unicode Domain Name. In order to make this form work we need to encode both parts, with one encoding for the Unicode local-part and a different encoding for the Unicode Domain Name. The encoded email address is UTF-8@punycode. Users see the Unicode email address and computers work with the encoded address.
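
To make the two parts concrete, this small sketch splits an address at its last @ and prints the UTF-8 bytes of the local-part in hex. The variable names are illustrative. Converting the Domain Name to punycode needs an IDNA-aware tool and is deliberately left out here.

```shell
# Split an internationalized address at its last "@".
address='小山@电邮.在线'
local_part=${address%@*}     # everything before the last "@"
domain=${address##*@}        # everything after the last "@"

printf 'local-part: %s\n' "$local_part"
printf 'domain:     %s\n' "$domain"

# UTF-8 bytes of the local-part as hex, with od's spacing removed.
printf '%s' "$local_part" | od -An -tx1 | tr -d ' \t\n'
echo                         # the hex line reads: e5b08fe5b1b1
```

Each of the two Chinese characters occupies three bytes in UTF-8, which is why the local-part is six bytes long.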

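For further technical reading, these are the primary EAI RFCs:

  1. RFC 6530 Overview and Framework for Internationalized Email
  2. RFC 6531 SMTP Extension for Internationalized Email
  3. RFC 6532 Internationalized Email Headers
  4. RFC 6533 Internationalized Delivery Status and Disposition Notifications
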
Thursday, 15 December 2016

grep highlighting

I frequently use grep to demonstrate and explain regular expressions (regex). I use it in interactive mode with the input coming from the keyboard and the output going to the screen. So, I type some string and if grep finds a match this input string is echoed to the screen. If no match is found then this input string is not echoed to the screen. I have used this teaching method for many years.

Recently, whilst using CentOS, I discovered that grep can highlight matched strings. The CentOS machine I used was set up with grep highlighting which is how I discovered it. I was impressed as it makes it clear exactly which text is matched.

My Mac OSX does not have grep highlighting with the default settings. I therefore decided to configure my OSX system so it does highlight grep matches as it is so useful. Rather than having to repeatedly type the relevant grep options on the command line, I put them into my .bash_profile, as follows.

export GREP_OPTIONS='--color=auto'
export GREP_COLOR='1;34' # 1=bold; 34=blue
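
Newer GNU grep releases treat GREP_OPTIONS as obsolete and prefer GREP_COLORS over GREP_COLOR. On such systems a roughly equivalent setup, offered here as an untested alternative rather than what was used on OSX Sierra, is to use aliases instead:

```shell
# Alternative ~/.bash_profile lines for a grep that ignores GREP_OPTIONS.
alias grep='grep --color=auto'
alias egrep='grep -E --color=auto'
export GREP_COLORS='mt=1;34'   # GNU grep: matched text in bold blue
```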

I now give a grep terminal session extract which illustrates non matching and matching.

苹果电脑 ~: egrep '노팅엄'
안산 안양 부산 구미 제주 포항 양산
안산 안양 부산 노팅엄 구미 제주 포항 양산
안산 안양 부산 노팅엄 구미 제주 포항 양산

The text used in this terminal session is Korean Hangeul. Each word is a Korean city, apart from 노팅엄 which is Nottingham, a city in England. The Korean cities are: 안산 Ansan, 안양 Anyang, 부산 Busan, 구미 Gumi, 제주 Jeju, 포항 Pohang and 양산 Yangsan.

Note: I use egrep as it is short form for grep -E which enables extended regular expressions.
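
The same demonstration can be run without typing at the keyboard by piping the two lines into grep. This is only a sketch of the session above; in a pipeline grep prints the matching line but, with --color=auto, does not colour it because the output is not a terminal.

```shell
# Feed one non-matching and one matching line to grep.
# Only the line containing 노팅엄 is printed.
printf '%s\n' \
    '안산 안양 부산 구미 제주 포항 양산' \
    '안산 안양 부산 노팅엄 구미 제주 포항 양산' |
    grep -E '노팅엄'
```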

Environment: OSX Sierra v10.12.1

Saturday, 26 November 2016

Domain Name Registrations

To keep up to date with Domain Name registrations I highly recommend gd-domains. It gives listings of newly registered Domain Names on a daily basis. Listings for individual TLDs (Top Level Domains) are available. It is thanks to this site that I discovered the below impressive and sizeable family of Korean Domain Names. They were registered on 11th and 22nd November 2016. The TLD used is 닷컴 which is Verisign's Korean equivalent to com.

I think embedding telephone numbers into these IDNs (Internationalized Domain Names) is clever marketing ☎️

  1. 남양주용달이사
  2. 원룸반포장이사
  3. 마포포장이사
  4. 강동구이사
  5. 강서구포장이사
  6. 광진구원룸이사
  7. 광진구이사짐센터
  8. 송파구포장이사
  9. 중랑구원룸이사
  10. 서초구원룸이사
  11. 송파구용달센터
  12. 학생이사
  13. 사당동원룸이사
  14. 지방용달가격
  15. 싼곳용달이사
  16. 마포용달이사
  17. 반포장이사
  18. 서울일반이사
  19. 1톤트럭이사
  20. 1톤소형이사
  21. 1톤용달
  22. 서울1톤용달
  23. 원룸포장이사
  24. 원룸이사가격
  1. 마포원룸이사
  2. 강동구용달이사
  3. 강서구용달이사
  4. 강동구지역이사
  5. 서울개인용달이사
  6. 광진구용달이사
  7. 송파구원룸이사
  8. 동작구용달이사
  9. 중랑구용달이사
  10. 송파구용달이사
  11. 서초구용달이사
  12. 서울소형이사
  13. 오피스텔이사
  14. 지방용달이사
  15. 용산용달이사
  16. 용달차이사
  17. 합정동용달이사
  18. 서울반포장이사
  19. 서울경기용달차
  20. 친절원룸이사
  21. 원룸투룸
  22. 원룸이사비용
  23. 화물차용달
  24. 용달이사견적

The gd-domains site has a link for the 닷컴 registrations on 11th November 2016 and a direct link for all 닷컴 registrations on Tuesday 22nd November 2016 (2016년 11월 22일 화요일).

Wednesday, 26 October 2016

Family of Korean IDNs

The following is a list of functioning Korean IDNs (Internationalized Domain Names). They all belong to the same Computer Repair Company. The TLD (Top Level Domain) used is 닷컴 which is Verisign's new Korean language equivalent to their com TLD. Each IDN contains 컴퓨터수리 which means Computer Repair. The only difference between these IDNs is the first two characters which are the names of South Korean cities. I think this is clever and creative use of IDNs!

The last two IDNs below are structured differently. The first two characters are, I think, a neighbourhood and the first two characters after the hyphen are the city.

The cities are: 시흥 Siheung, 부천 Bucheon, 창원 Changwon, 마산 Masan, 평택 Pyeongtaek, 오산 Osan, 진해 Jinhae, 김해 Gimhae, 부산 Busan.

  1. 시흥컴퓨터수리.닷컴
  2. 부천컴퓨터수리.닷컴
  3. 창원컴퓨터수리.닷컴
  4. 마산컴퓨터수리.닷컴
  5. 평택컴퓨터수리.닷컴
  6. 오산컴퓨터수리.닷컴
  7. 진해컴퓨터수리.닷컴
  8. 김해컴퓨터수리.닷컴
  9. 북동컴퓨터수리-창원컴퓨터수리.닷컴
  10. 우동컴퓨터수리-부산컴퓨터수리.닷컴

Update 9th March 2017: Here is another family of Computer Repair 컴퓨터수리 IDNs with a different registrant.

  1. 김포컴퓨터수리.닷컴
  2. 안양컴퓨터수리.닷컴
  3. 용인컴퓨터수리.닷컴
  4. 용산컴퓨터수리.닷컴
  5. 대구컴퓨터수리.닷컴
  6. 종로컴퓨터수리.닷컴
  7. 강남컴퓨터수리.닷컴
  8. 파주컴퓨터수리.닷컴
  9. 일산컴퓨터수리.닷컴
  10. 성남컴퓨터수리.닷컴