Tuesday, 31 January 2017

Computer Science Internationalization - Unicode Terminal Session

Below is an OS X bash terminal session. It is a real, working session using basic Unix commands, although it looks significantly different from a standard terminal session. If you know basic Unix commands such as ls and cd, you should be able to work out what is happening.

苹果电脑 ~: 妈 我的目录
苹果电脑 ~: 茶 我的目录
苹果电脑 我的目录: 丽
苹果电脑 我的目录: 头 文档一 文档二 文档三
苹果电脑 我的目录: 丽
文档一  文档三  文档二
苹果电脑 我的目录: 词 > 文档四
苹果电脑 我的目录: 词 文档四
苹果电脑 我的目录: 丽
文档一  文档三  文档二  文档四
苹果电脑 我的目录: ⇉ 文档四 文档五
苹果电脑 我的目录: 丽
文档一  文档三  文档二  文档五  文档四
苹果电脑 我的目录: → 文档一 文档六
苹果电脑 我的目录: 丽
文档三  文档二  文档五  文档六  文档四
苹果电脑 我的目录: 

So, what is happening!?

Firstly, I am using Unicode characters. If you search the internet you will find many examples of terminal sessions, but they will almost invariably use ASCII characters only. In the terminal session above I am using Unicode characters: mostly Chinese/Japanese characters plus two arrow symbols.

Where are the commands such as ls and cd? I have mapped a set of commands to Unicode characters using the alias command, e.g. alias 丽='ls'

I have changed the command line prompt.
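As a minimal sketch, lines of the following kind in a .bash_profile would produce this effect. The alias is the one quoted above. The PS1 value is an assumption about how a prompt of this shape could be set, not necessarily the setting used for the session above. The remaining mappings are left out so as not to spoil the exercise.

```shell
# ~/.bash_profile (sketch)

# The one mapping quoted in the text: a Unicode alias for ls.
alias 丽='ls'

# Assumed prompt setting: a fixed label, then the basename of the
# current directory (\W shows ~ for the home directory), then ": ".
PS1='苹果电脑 \W: '
```

In an interactive bash session, typing 丽 would then run ls, and the prompt would change as one moves between directories.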

If you understand basic bash commands, I believe I have now given you sufficient information in order for you to work out what is happening in the terminal session. Knowing Chinese or Japanese gives a slight advantage but it is not essential to understanding this terminal session. The Chinese/Japanese characters I chose for the command mappings are somewhat random so it will not help you to google translate them.

I actually devised these command mappings and the terminal session several years ago. Today, I decided it was time to put it onto my blog. My main purpose was, and still is, to encourage students to think beyond ASCII. I believe it has impact because it is so unexpected when one first sees this terminal session.

There can be many different permutations of the session using different human language scripts and Unicode symbols. It makes for an interesting and unusual exercise for students studying Unix. There is absolutely no reason why one should not, for example, use emoji for the command mappings.
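Here is a small sketch of the emoji idea, written as a script rather than typed interactively. The two emoji mappings are hypothetical and chosen only for illustration. bash does not expand aliases in scripts by default, hence the shopt line.

```shell
#!/usr/bin/env bash
# Aliases are normally expanded only in interactive shells;
# this enables them when the sketch is run as a script.
shopt -s expand_aliases

# Hypothetical emoji mappings (not part of the original session).
alias 📅='date'
alias 📣='echo'

# Use the aliases exactly like the commands they stand for.
📣 "hello from an emoji alias"
📅
```

Whether emoji are comfortable to type at a prompt depends on the keyboard and input method, so this is more of a classroom curiosity than a practical setup.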

Monday, 9 January 2017

Chinese Email Address

The latest and hottest news is that I now have a Chinese email address➜ 小山@电邮.在线 😄

  1. 小山 is my adopted Chinese name
  2. 电邮 means email
  3. 在线 means online

I acquired my free Chinese email address from DataMail which supports email addresses in twelve languages: العَرَبِيَّة‎‎ Arabic, বাংলা Bengali, 中文 Chinese, English, ગુજરાતી Gujarati, हिन्दी Hindi, मराठी Marathi, ਪੰਜਾਬੀ Punjabi, ру́сский Russian, தமிழ் Tamil, తెలుగు Telugu, اُردُو‎ Urdu.

Additionally, DataMail has an impressive family of IDNs (Internationalized Domain Names) with each language having its own IDN.
  1. Arabic داده.امارات
  2. Bengali ডাটামেল্.ভারত
  3. Chinese 电邮.在线
  4. English datamail.in
  5. Gujarati ડાટામેલ.ભારત
  6. Hindi डाटामेल.भारत
  7. Marathi डेटामेल.भारत
  8. Punjabi ਡਾਟਾਮੇਲ.ਭਾਰਤ
  9. Russian почта.рус
  10. Tamil இந.இந்தியா
  11. Telugu డేటామెయిల్.భారత్
  12. Urdu ڈاٹامیل.بھارت

If you would like your own DataMail email address in one of the above languages then just click one of the above links. The website directs you to download an Android or iOS App. One uses the App to actually register a DataMail email address.

The main points in the registration process using the DataMail App are:

  1. The crucial part of this process is that firstly you need to select the language for the email address you are about to register. Subsequent instructions will be in the language you have selected. So, I chose Chinese in order to register 小山@电邮.在线.
  2. Validation of your phone number - the DataMail App will, with your approval, send an SMS text to DataMail in India to confirm your phone number. If the validation process fails, it could be that your phone contract does not cover the sending of international SMS text.
  3. Choosing the local-part which in my case is 小山. The Domain Name part is fixed and is provided by DataMail. There is a Domain Name per language, as above.

I have successfully exchanged emails between ASCII Gmail email addresses and my DataMail Chinese email address. Gmail supports Internationalized Email Addresses (IEAs) but one cannot create IEAs in Gmail. DataMail, to my knowledge, is currently the only production email system that both supports and allows creation of IEAs. I used Gmail with a browser when testing exchange of IEAs. If you are accessing your Gmail using IMAP or POP then IEAs may or may not work. It all depends on whether or not your client software supports IEAs.

I have sent email from DataMail using my Chinese email address 小山@电邮.在线 to several Gmail users. My current experience is that for some of the Gmail users, my email goes to their spam folder instead of their primary inbox. If this is happening to you or your recipients, please mark the Gmail email as 'not spam' to help prevent recurrences of this problem.

In addition to the App, DataMail can be used with a web browser ➜ 邮.电邮.在线

Currently, the few systems supporting internationalized email addresses are DataMail, Gmail and Outlook 2016. So, what to do when exchanging email with a system that only supports ASCII email addresses? DataMail have thought about this issue and offer email aliasing. One can create ASCII email aliases and use them to exchange email with systems that do not yet support international email addresses. My DataMail mailbox has the Chinese email address 小山@电邮.在线 and ASCII @datamail.in addresses thus allowing me to communicate with any email system.

DataMail is a good example of an AI (Adaptive Internationalized) website. It adapts to the language of the web address used for access. The most obvious adaptation is that the text content is in the language of the web address. Secondly, the appropriate language button is highlighted. Finally, and perhaps less obviously, in the top right corner there is a DataMail support email address which is in the current web address language. In the case of 电邮.在线 the DataMail support email address is 支持@电邮.在线

Let's examine some of the technicalities of EAI (Email Address Internationalization). The structure of an email address is local-part@Domain Name, where the Domain Name identifies a mail server and the local-part identifies a mailbox on that mail server. The email addresses you will be most familiar with are of the form ASCII local-part@ASCII Domain Name. IEAs, on the other hand, are of the form Unicode local-part@Unicode Domain Name. To make this form work, the two parts are encoded differently: the Unicode local-part is encoded as UTF-8 and the Unicode Domain Name as Punycode. The encoded email address is therefore UTF-8@Punycode. Users see the Unicode email address and computers work with the encoded address.
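The UTF-8 half of this encoding can be inspected from the command line. The sketch below splits the address at the @ sign and dumps the bytes of the local-part. For the Domain Name half it calls idn2, the command line tool from GNU libidn2, but only if it happens to be installed. That tool is an assumption on my part and plays no role in how DataMail itself works.

```shell
# The address from this post, split at the @ sign.
address='小山@电邮.在线'
local_part=${address%@*}   # everything before the last @
domain=${address##*@}      # everything after the last @

# UTF-8 bytes of the local-part as hex (three bytes per character here).
local_hex=$(printf '%s' "$local_part" | od -An -tx1 | tr -d ' \n')
echo "local-part: $local_part -> $local_hex"

# Punycode needs a conversion tool. idn2 is an assumption: it is
# used only if present, and it generally expects a UTF-8 locale.
if command -v idn2 >/dev/null 2>&1; then
    echo "domain: $domain -> $(idn2 "$domain")"
else
    echo "domain: $domain (idn2 not installed, Punycode form not shown)"
fi
```

The two characters of 小山 occupy six bytes in UTF-8, which is what a mail server supporting internationalized addresses sees in place of an ASCII local-part.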

For further technical reading, these are the primary EAI RFCs:

  1. tools.ietf.org/html/rfc6530
  2. tools.ietf.org/html/rfc6531
  3. tools.ietf.org/html/rfc6532
  4. tools.ietf.org/html/rfc6533

Thursday, 15 December 2016

grep highlighting

I frequently use grep to demonstrate and explain regular expressions (regex). I use it in interactive mode with the input coming from the keyboard and the output going to the screen. So, I type some string and if grep finds a match this input string is echoed to the screen. If no match is found then this input string is not echoed to the screen. I have used this teaching method for many years.

Recently, whilst using CentOS, I discovered that grep can highlight matched strings. The CentOS machine I used was set up with grep highlighting, which is how I discovered it. I was impressed as it makes it clear exactly which text is matched.

My Mac running OS X does not have grep highlighting with the default settings. I therefore decided to configure my OS X system so that it does highlight grep matches, as it is so useful. Rather than having to repeatedly type the relevant grep options on the command line, I put them into my .bash_profile, as follows.

export GREP_OPTIONS='--color=auto'
export GREP_COLOR='1;34' # 1=bold; 34=blue

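The following grep terminal session extract illustrates non-matching and matching. The first line I typed contains no match and so is not echoed. The second line I typed contains a match and so is echoed.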

苹果电脑 ~: egrep '노팅엄'
안산 안양 부산 구미 제주 포항 양산
안산 안양 부산 노팅엄 구미 제주 포항 양산
안산 안양 부산 노팅엄 구미 제주 포항 양산

The text used in this terminal session is Korean Hangeul. Each word is a Korean city, apart from 노팅엄 which is Nottingham, a city in England. The Korean cities are: 안산 Ansan, 안양 Anyang, 부산 Busan, 구미 Gumi, 제주 Jeju, 포항 Pohang and 양산 Yangsan.

Note: I use egrep as it is the short form of grep -E, which enables extended regular expressions.
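The same match can be reproduced without typing at the keyboard by piping the two lines into grep. This sketch passes --color=auto directly on the command line instead of relying on GREP_OPTIONS, since some newer GNU grep releases treat that variable as obsolete. It also uses grep -E, the long form of egrep.

```shell
# Feed the two lines from the session to grep through a pipe.
# Only the line containing 노팅엄 is printed. With --color=auto the
# match is highlighted when the output goes to a terminal.
printf '%s\n' \
    '안산 안양 부산 구미 제주 포항 양산' \
    '안산 안양 부산 노팅엄 구미 제주 포항 양산' |
    grep -E --color=auto '노팅엄'
```

With --color=auto, grep emits colour codes only when writing to a terminal, so redirecting the output to a file or another command yields plain text.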

Environment: OSX Sierra v10.12.1

Saturday, 26 November 2016

Domain Name Registrations

To keep up to date with Domain Name registrations I highly recommend gd-domains. It gives listings of newly registered Domain Names on a daily basis. Listings for individual TLDs (Top Level Domains) are available. It is thanks to this site that I discovered the impressive and sizeable family of Korean Domain Names listed below. They were all registered on 22nd November 2016, 2016년 11월 22일 화요일. The TLD used is 닷컴 which is Verisign's Korean equivalent to com.

I think embedding telephone numbers into these IDNs (Internationalized Domain Names) is clever marketing ☎️

  1. 남양주용달이사-010-3126-0853.닷컴
  2. 원룸반포장이사-010-3126-0853.닷컴
  3. 마포포장이사-010-3126-0853.닷컴
  4. 강동구이사-010-3126-0853.닷컴
  5. 강서구포장이사-010-3126-0853.닷컴
  6. 광진구원룸이사-010-3126-0853.닷컴
  7. 광진구이사짐센터-010-3126-0853.닷컴
  8. 송파구포장이사-010-3126-0853.닷컴
  9. 중랑구원룸이사-010-3126-0853.닷컴
  10. 서초구원룸이사-010-3126-0853.닷컴
  11. 송파구용달센터-010-3126-0853.닷컴
  12. 학생이사-010-3126-0853.닷컴
  13. 사당동원룸이사-010-3126-0853.닷컴
  14. 지방용달가격-010-3126-0853.닷컴
  15. 싼곳용달이사-010-3126-0853.닷컴
  16. 마포용달이사-010-3126-0853.닷컴
  17. 반포장이사-010-3126-0853.닷컴
  18. 서울일반이사-010-3126-0853.닷컴
  19. 1톤트럭이사-010-3126-0853.닷컴
  20. 1톤소형이사-010-3126-0853.닷컴
  1. 마포원룸이사-010-4675-2414.닷컴
  2. 강동구용달이사-010-4675-2414.닷컴
  3. 강서구용달이사-010-4675-2414.닷컴
  4. 강동구지역이사-010-4675-2414.닷컴
  5. 서울개인용달이사-010-4675-2414.닷컴
  6. 광진구용달이사-010-4675-2414.닷컴
  7. 송파구원룸이사-010-4675-2414.닷컴
  8. 동작구용달이사-010-4675-2414.닷컴
  9. 중랑구용달이사-010-4675-2414.닷컴
  10. 송파구용달이사-010-4675-2414.닷컴
  11. 서초구용달이사-010-4675-2414.닷컴
  12. 서울소형이사-010-4675-2414.닷컴
  13. 오피스텔이사-010-4675-2414.닷컴
  14. 지방용달이사-010-4675-2414.닷컴
  15. 용산용달이사-010-4675-2414.닷컴
  16. 용달차이사-010-4675-2414.닷컴
  17. 합정동용달이사-010-4675-2414.닷컴
  18. 서울반포장이사-010-4675-2414.닷컴

www.gd-domains.com/20161122-229 is the direct link for all 닷컴 registrations on 22nd November 2016, 2016년 11월 22일 화요일.

Wednesday, 26 October 2016

Family of Korean IDNs

The following is a list of functioning Korean IDNs (Internationalized Domain Names). They all belong to the same Computer Repair Company. The TLD (Top Level Domain) used is 닷컴 which is Verisign's new Korean language equivalent to their com TLD. Each IDN contains 컴퓨터수리 which means Computer Repair. The only difference between these IDNs is the first two characters which are the names of South Korean cities. I think this is clever and creative use of IDNs!

The last two IDNs below are structured differently. The first two characters are, I think, a neighbourhood and the first two characters after the hyphen are the city.

The cities are: 시흥 Siheung, 부천 Bucheon, 창원 Changwon, 마산 Masan, 평택 Pyeongtaek, 오산 Osan, 진해 Jinhae, 김해 Gimhae, 부산 Busan.

  1. 시흥컴퓨터수리.닷컴
  2. 부천컴퓨터수리.닷컴
  3. 창원컴퓨터수리.닷컴
  4. 마산컴퓨터수리.닷컴
  5. 평택컴퓨터수리.닷컴
  6. 오산컴퓨터수리.닷컴
  7. 진해컴퓨터수리.닷컴
  8. 김해컴퓨터수리.닷컴
  9. 북동컴퓨터수리-창원컴퓨터수리.닷컴
  10. 우동컴퓨터수리-부산컴퓨터수리.닷컴

Friday, 7 October 2016

Computer Science Internationalization — Bidi

Scripts such as Latin are written from Left to Right (L➡︎R). Scripts such as Arabic and Hebrew are written Right to Left (L⬅︎R). What happens when we mix L➡︎R and L⬅︎R scripts within a document? Here is an exercise in mixing scripts.

Take a mixed bidi (bidirectional) string consisting of Latin and Hebrew characters in a L➡︎R paragraph.


...and here is the same string in a L⬅︎R paragraph.


Now to the actual exercise. Copy the above strings to your text editor or word processor. You will need to set up the 2nd occurrence of the string as a L⬅︎R paragraph. I am assuming that your directionality is L➡︎R by default. Each string has two boundaries where the text changes direction. For each boundary you are going to insert a character, either a L➡︎R character, such as x, or a L⬅︎R character, such as ד. For each insertion operation use the initial mixed bidi string. There are two mixed strings above, each with two boundaries and two characters to insert at each boundary, so there are a total of 8 insertion operations. The challenge is to predict where in the strings the inserted character will appear before you actually insert the character. Give it a go! Good luck😀

If I had done this exercise before I ever studied bidi, I would probably have scored 4/8. Now I understand how the computer processes bidi text and so I usually score full marks on such exercises. It is not, though, an intuitive process for me as I have spent most of my life reading and writing L➡︎R scripts only. I have to think very carefully about how the computer does it in order to determine the correct answers.

The main purpose of this exercise is to think about the ordering of the characters in the strings. There are two orderings to consider: memory order and display order. Memory order is how it is logically saved in memory which in this case is the order in which I typed it. The memory order of the string I have used above is "abcגבאdef". Display order is how it is presented to the viewer. You have already seen, above, the two possible display orders for the single string "abcגבאdef".
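Memory order can be checked from the command line by dumping the bytes of the string. In the sketch below, the Hebrew letters are written as octal UTF-8 byte escapes. That is a choice made for this example, so that an editor's own bidi rendering cannot disguise the order in which the letters are stored. The variable names are illustrative.

```shell
# Build the string in memory (logical) order. The three Hebrew
# letters follow "abc" in the order gimel, bet, alef, each as a
# two-byte UTF-8 sequence written with octal escapes.
mixed=$(printf 'abc\327\222\327\221\327\220def')

# Dump the bytes in the order in which they are stored.
memory_hex=$(printf '%s' "$mixed" | od -An -tx1 | tr -d ' \n')
echo "$memory_hex"
```

The dump shows the bytes for abc, then the three Hebrew letters, then def, whichever paragraph direction is used to display the string. Display order is decided separately when the text is rendered.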

I have used TextEdit for this exercise. In order to set paragraph text direction in TextEdit follow the path: "TextEdit➜ Format➜ Text➜ Writing Direction". Now set paragraph text direction to Right to Left. TextEdit correctly handles bidi text but that is not the case for all word processors or text editors.

There are several permutations of this exercise, including:

  1. What happens at the boundaries with forward delete and back delete?
  2. What happens if the initial memory order character(s) are L⬅︎R instead of L➡︎R?
  3. Use Arabic instead of Hebrew as this introduces the additional challenge of letters changing shape according to preceding and following characters.

This article is aimed at L➡︎R reading/writing people. If you are a L⬅︎R person then you will need to invert some of my instructions. Actually, if you are a L⬅︎R person you will be totally familiar with mixing bidi text and so will fully understand this exercise.

Environment: OSX v10.12 (Sierra), TextEdit v1.12

Wednesday, 21 September 2016

Computer Science Curriculum Internationalization

I have been a long-time practitioner and advocate of internationalising Computer Science teaching. My fundamental aim is to give students global computing skills. One such global skill, for example, is the processing of Unicode text rather than the very restricted ASCII text. Once one encompasses Unicode, one encompasses most languages and scripts of the world.

Over the years I have tried to find other like-minded Computer Science educators but have had no success. I had more or less concluded that I am a solitary voice when it comes to Computer Science internationalisation. There does, though, appear to be some light, as I recently discovered two organisations that promote internationalisation of teaching curricula.

🌍 The Centre for Curriculum Internationalisation (CCI), which is based at Oxford Brookes University, UK. brookes.ac.uk/services/cci/index.html In addition to their website they have a Google discussion group. I posted some information on my Computer Science Internationalisation initiatives and practices to this Google forum. Please see groups.google.com/forum/#!topic/cicin/6XJCrqcdLD4

🌏 Internationalisation of the Curriculum (IoC) in action which is based in Australia. ioc.global/index.html