André 小山 Schappo

Monday, 10 July 2017

Computer Science Internationalization - Unicode Encoding & Decoding

Several years ago I devised this visual and fun way to teach and practise encoding and decoding Unicode. I used this method in my International Computing class. This method involves use of pencil and eraser.

The codepoints and the UTF-8 are all written in hexadecimal(hex). The binary bits are an intermediate form for the purposes of encoding and decoding.

We start with the following form which is designed for encoding Unicode codepoints to UTF-8 and decoding UTF-8 to Unicode codepoints.

Encoding: We will start with encoding Unicode codepoints to UTF-8.

The first thing we can do is fill in the fixed bits. They are the fixed bits defined by the encoding scheme. I have entered the fixed bits in red to make them distinct from variable bits.

Now we will write one or more Unicode codepoints on the form. These will be the codepoints we will encode into UTF-8. The codepoints should be written in hexadecimal. I will use the codepoints U+0444 and U+597D.

So, how do we determine where the codepoints go on the form. We need to look at the free bits to determine the range of values that can be accommodated.

1 byte row - 7 free variable bits giving a range of 0 ➔ 7F
2 byte row - 11 free variable bits giving a range of 80 ➔ 7FF
3 byte row - 16 free variable bits giving a range of 800 ➔ FFFF
4 byte row - 21 free variable bits giving a range of 10000 ➔ 1FFFFF (the actual maximum value of a codepoint is 10FFFF)

Now we know the ranges we can put U+0444 and U+597D in the correct places of the form.

We have empty boxes into which we write the binary values of the codepoints.

Finally, we take the complete bytes and write them as hexadecimal values to form the UTF-8 encoded forms. U+0444 encoded is D184, U+597D encoded is E5A5BD.

Decoding: Now onto decoding from UTF-8 to Unicode codepoints. We will decode the UTF-8 F0AA9FB7 which I have entered onto the form. I have used spaces on the form to make the byte boundaries more obvious.

Complete the bytes by writing the binary variable values.

Extract the variable binary values to form the hex Unicode codepoint U+2A7F7.

Whilst I was at it, I completed a single byte entry. The single byte characters are ASCII characters. ASCII is a subset of Unicode.

It is a Unicode convention, when writing codepoints, to use a minimum of four hex digits. So for codepoints <1000, one should left pad with zeroes. Hence my entries U+0444 and U+0057 rather than U+444 and U+57.

Sunday, 2 July 2017

Computer Science Internationalization - Text Search

So, you have just written some Cool Code which will search for and find occurrences of specified text strings. You have access to Big Data text eg all the text in all public webpages. You will,of course, want to test your Cool Code. Letʼs perform some, seemingly, very simple tests.

Letʼs search for the word 'Scorpion'. Your code works just fine and hence finds all occurrences of the word 'Scorpion'.

Now test with the following two words.

Scorpion
Scorpion

Your Cool Code works fine as all I have done is applied some CSS styling, thus giving each of the two words differing appearance.

Now test you Cool Code with the following two words.

𝑆𝑐𝑜𝑟𝑝𝑖𝑜𝑛
𝐒𝐜𝐨𝐫𝐩𝐢𝐨𝐧

If you have only programmed for ASCII text then your now not so Cool Code will fail. These two words have differing appearance because they are not made up of the ASCII characters you are familiar with. These words use characters from the Unicode Math Alphanumeric Symbols block, U+1D400-1D4FF.

Should the Math Alphanumeric Symbols Scorpion be treated the same as the ASCII Scorpion wrt the search results of your code? In this context I think "Yes", most definitely. A person reading this blog, for example, will just perceive the word Scorpion whatever characters are used to write the word. The reader may well also visualise the insect with a "sting in the tail"😱

What of current working practice?

With twitter, a user has no means of changing text style within a tweet. It has thus become common to use Unicode Math Alphanumeric Symbols to change appearance. I could, for example, use Unicode Math Alphanumeric Symbols to emphasise a word (eg Scorpion) or phrase within a tweet. The meaning of the tweet remains the same.

Google returns the same number of search results whichever of the above forms of Scorpion I use. At time of writing this is "About 144,000,000 results". I deduce Google is treating ASCII Scorpion and Unicode Math Alphanumeric Symbols 𝑆𝑐𝑜𝑟𝑝𝑖𝑜𝑛 & 𝐒𝐜𝐨𝐫𝐩𝐢𝐨𝐧 as equivalent.

Sogou 搜狗 is a Chinese search engine. Using Sogou: ASCII Scorpion returns 93,341 results, Math Alphanumeric Symbols 𝑆𝑐𝑜𝑟𝑝𝑖𝑜𝑛 returns 4,738, Math Alphanumeric Symbols 𝐒𝐜𝐨𝐫𝐩𝐢𝐨𝐧 returns 61. I think it evident that Sogou does not treat my three forms of Scorpion as equivalent.

I side with Google on this.

Here is a taster of what is happening in the behind the scenes technicalities of Unicode. Letʼs take just one of the Unicode Math Alphanumeric Symbols I have used, 𝐒 MATHEMATICAL BOLD CAPITAL S U+1D412. If you visit codepoints.net/U+1D412 you will see a wealth of information about this character. Of relevance to this blog is the Decomposition Mapping which is to the, oh so familiar, ASCII uppercase capital S. This Unicode information can be used to compute string equivalents which can then be used for search thus providing all relevant results.

The moral of this "Sting in the Tale" is: If you do not already know it, you must learn Unicode, it is essential.

Friday, 28 April 2017

Computer Science Internationalization - Hieroglyphs in Domain Names

I have been aware for a long time that domains such as .com support many human language scripts. Verisign's .com includes support for Hiragana, Gurmukhi, Han, Tibetan, Sinhala, Devanagari, Hangul and many more.

But what of Verisign's .com equivalents .コム (Japanese) and .닷컴 (Korean)? Both of these support a multitude of human language scripts. The supported scripts for many, but not all, Domains are listed in the IANA Repository of IDN Practices iana.org/domains/idn-tables.

Whilst browsing this repository, I discovered there are sixteen domains, all belonging to Verisign, which support Egyptian Hieroglyphs which I think is totally cool! Verisign's .com, .コム and .닷컴 all support Egyptian Hieroglyphs. This means one can register domain names such as:-

𓇋𓈖𓏏𓂝𓂋𓈖𓄿𓏏𓇋𓍯𓈖𓄿𓃭.com
𓇋𓈖𓏏𓂝𓂋𓈖𓄿𓏏𓇋𓍯𓈖𓄿𓃭.コム
𓇋𓈖𓏏𓂝𓂋𓈖𓄿𓏏𓇋𓍯𓈖𓄿𓃭.닷컴

It is possible you do not have an Egyptian Hieroglyph font on your device so here are the domain names in image format.

Google provide a free Egyptian Hieroglyph font which you can download from google.com/get/noto/

Does the Egyptian Hieroglyph string I have used above mean anything? It is actually a transliteration of the English word international. I used ngm.nationalgeographic.com/ngm/egypt/translator.html for the transliteration. The hieroglyphs translator presents the Egyptian Hieroglyphs as images. So, no simple copy and paste of Egyptian Hieroglyph text. I had to match with the appropriate Unicode characters by visual inspection. I cannot guarantee I made all the correct matches but I think I have them correct.

Here are some registered and live Egyptian Hieroglyph Domain Names egyptianhieroglyphic.com/egypt/egyptian-hieroglyphics/

Friday, 31 March 2017

Computer Science Internationalization - Adaptive URL

A URL can consist of a Domain Name and a pathname. In the examples below x.y.z represents the Domain Name, the remainder being the pathname. My experience of the internet is that the pathname is usually written in English or more accurately ASCII. The below ASCII pathname represents a multi-page website in the form of a journey from home to a hotel in Korea.

x.y.z/home/bus/airplane/korea/taxi/hotel

Websites, such as Google, adapt the language of their text content according to the browser preferred display language (BL). This browser preferred language can be set by the user. Letʼs go one step further than Google and adapt the language of the URL pathname according to the BL. Here is the ASCII pathname rewritten into Chinese, Japanese and Korean.

x.y.z/家/公共汽车/飞机/韩国/出租车/饭店

x.y.z/ホーム/バス/飛行機/韓国/タクシー/ホテル

x.y.z/홈/버스/비행기/한국/택시/호텔

So, how do we implement these language adaptive URL parthnames? Firstly, we need to programmatically determine the BL. One way of achieving this is to examine the Accept-Language http header sent from the browser to the server. This will contain one or more language tags. If there is more than one language tag they are presented in priority order. Language tags can take many forms. They include: zh, zh-CN and cmn for Mandarin Chinese; ja for Japanese and ko for Korean. Now that we can determine the BL we can select the appropriate URL pathname, thus internationalizing our website with a language adaptive URL pathname.

On a Linux machine, each component of the pathname will be a directory. In my schema I am assuming an index.html or index.php, per directory. A requirement of this schema is that we do not want a directory hierarchy for each language, nor do we want an index.html or index.php for each language.

My native language is English so I will make my master pathname directory names English ie home, bus, airplane, korea, taxi and hotel. I will make the Chinese, Japanese and Korean directory names as aliases to the English named master directories. This can be easily achieved on Linux with the ln -s command, where ln means link and the -s option means create symbolic link, as opposed to a hard link.

ln -s home 家
ln -s home ホーム
ln -s home 홈
⋮
ln -s hotel 饭店
ln -s hotel ホテル
ln -s hotel 호텔

What if your native language is not English? In that case, create the master pathname directory names in your native language. If your native language is Korean then the master directory names will be 집, 버스, 비행기, 한국, 택시 and 호텔 and your links will be:

ln -s 홈 home
ln -s 홈 家
ln -s 홈 ホーム
⋮
ln -s 호텔 hotel
ln -s 호텔 饭店
ln -s 호텔 ホテル

Emoji are hugely popular so letʼs construct a totally cool Emoji pathname.

x.y.z/🏡/🚌/🛩/🇰🇷/🚕/🏨

ln -s home 🏡
ln -s bus 🚌
ln -s airplane 🛩
ln -s korea 🇰🇷
ln -s taxi 🚕
ln -s hotel 🏨

I have never encountered an Emoji URL pathname on a website and so implementing such a pathname on your website would be both totally cool and unique. You could also use an Emoji pathname for those languages your website does not support. My schema only supports Chinese, English, Japanese and Korean. If the BL was an unsupported language, such as Arabic, then the Emoji pathname could be displayed in the browser address bar instead of, for example, defaulting to English.

I have used x.y.x to represent the Domain Name, the implication being it is ASCII. We can complete the language adaptive equation by having Domain Names in supported BL languages. Thus my completed equation schema would have Chinese, Japanese and Korean Domain Names in addition to an ASCII Domain Name.

Friday, 17 March 2017

Computer Science Internationalization - EAI

As I stated in schappo.blogspot.co.uk/2017/01/chinese-email-address.html both DataMail and Google mail support Email Address Internationalization (EAI). DataMail provides a complete EAI service which includes both support and creation of Internationalized email addresses. Google Mail provides a partial EAI service, in that, it supports EAI but does not yet provide for creation of internationlized email accounts with internationalized email addresses. Thus organisations using Google Mail have an advantage over those organisations having an ASCII addresses only email service and have a head start in provision of a complete EAI service.

Given the Domain name of an organisation, the Unix host command can be used to determine the mail service provider. Here are some of the organisations using Google Mail:

苹果电脑 ~: host spotify.com
spotify.com has address 194.132.198.198
spotify.com has address 194.132.197.198
spotify.com has address 194.132.198.149
spotify.com mail is handled by 10 ASPMX3.GOOGLEMAIL.com.
spotify.com mail is handled by 1 ASPMX.L.GOOGLE.com.
spotify.com mail is handled by 10 ASPMX2.GOOGLEMAIL.com.
spotify.com mail is handled by 5 ALT2.ASPMX.L.GOOGLE.com.
spotify.com mail is handled by 10 ASPMX5.GOOGLEMAIL.com.
spotify.com mail is handled by 5 ALT1.ASPMX.L.GOOGLE.com.
spotify.com mail is handled by 10 ASPMX4.GOOGLEMAIL.com.

苹果电脑 ~: host twitter.com
twitter.com has address 104.244.42.129
twitter.com has address 104.244.42.1
twitter.com mail is handled by 30 aspmx3.googlemail.com.
twitter.com mail is handled by 10 aspmx.l.google.com.
twitter.com mail is handled by 20 alt1.aspmx.l.google.com.
twitter.com mail is handled by 30 aspmx2.googlemail.com.
twitter.com mail is handled by 20 alt2.aspmx.l.google.com.

苹果电脑 ~: host mixi.jp # ミクシィ
mixi.jp has address 52.198.59.66
mixi.jp has address 54.92.71.226
mixi.jp has address 52.198.89.90
mixi.jp mail is handled by 30 aspmx2.googlemail.com.
mixi.jp mail is handled by 10 aspmx.l.google.com.
mixi.jp mail is handled by 20 alt2.aspmx.l.google.com.
mixi.jp mail is handled by 20 alt1.aspmx.l.google.com.
mixi.jp mail is handled by 30 aspmx3.googlemail.com.

苹果电脑 ~: host bristol.ac.uk # University of Bristol
bristol.ac.uk has address 137.222.0.38
bristol.ac.uk mail is handled by 5 ALT1.ASPMX.L.GOOGLE.COM.
bristol.ac.uk mail is handled by 10 ASPMX2.GOOGLEMAIL.COM.
bristol.ac.uk mail is handled by 1 ASPMX.L.GOOGLE.COM.
bristol.ac.uk mail is handled by 10 ASPMX3.GOOGLEMAIL.COM.
bristol.ac.uk mail is handled by 5 ALT2.ASPMX.L.GOOGLE.COM.

苹果电脑 ~: host bathspa.ac.uk # Bath Spa University
bathspa.ac.uk has address 194.83.160.0
bathspa.ac.uk has address 162.13.24.154
bathspa.ac.uk has address 72.47.217.0
bathspa.ac.uk mail is handled by 10 ALT4.ASPMX.L.GOOGLE.COM.
bathspa.ac.uk mail is handled by 5 ALT2.ASPMX.L.GOOGLE.COM.
bathspa.ac.uk mail is handled by 1 ASPMX.L.GOOGLE.COM.
bathspa.ac.uk mail is handled by 5 ALT1.ASPMX.L.GOOGLE.COM.
bathspa.ac.uk mail is handled by 10 ALT3.ASPMX.L.GOOGLE.COM.

Providing a full EAI service involves going beyond ASCII. It entails supporting Unicode email addresses. Unicode email addresses such as my Chinese email 小山@电邮.在线

Tuesday, 31 January 2017

Computer Science Internationalization - Unicode Terminal Session

Below is an OSX bash shell command line terminal session. It is a real, working terminal session using basic unix commands. It does, though, look significantly different from a standard terminal session. If you know basic unix commands such as ls and cd, you should/may be able to work out what is happening.

苹果电脑 ~: 妈 我的目录
苹果电脑 ~: 茶 我的目录
苹果电脑 我的目录: 丽
苹果电脑 我的目录: 头 文档一 文档二 文档三
苹果电脑 我的目录: 丽
文档一  文档三  文档二
苹果电脑 我的目录: 词 > 文档四
一
二
三
四
五
六
苹果电脑 我的目录: 词 文档四
一
二
三
四
五
六
苹果电脑 我的目录: 丽
文档一  文档三  文档二  文档四
苹果电脑 我的目录: ⇉ 文档四 文档五
苹果电脑 我的目录: 丽
文档一  文档三  文档二  文档五  文档四
苹果电脑 我的目录: → 文档一 文档六
苹果电脑 我的目录: 丽
文档三  文档二  文档五  文档六  文档四
苹果电脑 我的目录:

So, what is happening!?

Firstly I am using Unicode characters. If you search the internet you will find many examples of terminal sessions but they will invariably be using ASCII characters only. In my above terminal session I am using Unicode characters, mostly Chinese/Japanese and two arrow symbol characters.

Where are the commands such as ls and cd? I have mapped a set of commands to Unicode characters using the alias command eg alias 丽='ls'

I have changed the command line prompt.

If you understand basic bash commands, I believe I have now given you sufficient information in order for you to work out what is happening in the terminal session. Knowing Chinese or Japanese gives a slight advantage but it is not essential to understanding this terminal session. The Chinese/Japanese characters I chose for the command mappings are somewhat random so it will not help you to google translate them.

I actually devised these command mappings and the terminal session several years ago. Today, I decided it was time to put it onto my blog. My main purpose was and still is, to encourage students to think beyond ASCII. I believe it has impact because it is so unexpected when one first sees this terminal session.

There can be many different permutations on the session using different human language scripts and unicode symbols. It makes for an interesting and unusual exercise for students studying unix. Absolutely no reason why one should not, for example, use emoji for the command mappings.

Monday, 9 January 2017

Chinese Email Address

The latest and hottest news is that I now have a Chinese email address➜ 小山@电邮.在线 😄

小山 is my adopted Chinese name
电邮 means email
在线 means online

I acquired my free Chinese email address from DataMail which supports email addresses in twelve languages: العَرَبِيَّة‎‎ Arabic, বাংলা Bengali, 中文 Chinese, English, ગુજરાતી Gujarati, हिन्दी Hindi, मराठी Marathi, ਪੰਜਾਬੀ Punjabi, ру́сский Russian, தமிழ் Tamil, తెలుగు Telugu, اُردُو‎ Urdu.

Additionally, DataMail has an impressive family of IDNs (Internationalized Domain Names) with each language having itʼs own IDN.

If you would like your own DataMail email address in one of the above languages then just click one of the above links. The website directs you to download an Android or iOS App. One uses the App to actually register a DataMail email address.

The main points in the registration process using the DataMail App are:

The crucial part of this process is that firstly you need to select the language for the email address you are about to register. Subsequent instructions will be in the language you have selected. So, I chose Chinese in order to register 小山@电邮.在线.
Validation of your phone number - the DataMail App will, with your approval, send an SMS text to DataMail in India to confirm your phone number. If the validation process fails, it could be that your phone contract does not cover the sending of international SMS text.
Choosing the local-part which in my case is 小山. The Domain Name part is fixed and is provided by DataMail. There is a Domain Name per language, as above.

I have successfully exchanged emails between Gmail ASCII emails addresses and my DataMail Chinese email address. Gmail supports Internationalized Email Addresses (IEAs) but one cannot create IEAs in Gmail. DataMail, to my knowledge, is currently the only production email system that both supports and allows creation of IEAs. I used Gmail with a browser when testing exchange of IEAs. If you are accessing your Gmail using IMAP or POP then IEAs may or may not work. It all depends on whether or not your client software supports IEAs.

I have sent email from DataMail using my Chinese email address 小山@电邮.在线 to several Gmail users. My current experience is that for some of the Gmail users, my email goes to their spam folder instead of their primary inbox. If this is happening to you or your recipients, please mark the Gmail email as 'not spam' to help prevent reoccurrences of this problem.

In addition to the App, DataMail can be used with a web bowser ➜ 邮.电邮.在线

Currently, the few systems supporting internationalized email addresses are DataMail, Gmail and Outlook 2016. So, what to do when exchanging email with a system that only supports ASCII email addresses? DataMail have thought about this issue and offer email aliasing. One can create ASCII email aliases and use them to exchange email with systems that do not yet support international email addresses. My DataMail mailbox has the Chinese email address 小山@电邮.在线 and ASCII @datamail.in addresses thus allowing me to communicate with any email system.

DataMail is a good example of an AI (Adaptive Internationalized) website. It adapts to the language of the web address used for access. The most obvious adaptation is the text content is in the language of the web address. Secondly, the appropriate language button is highlighted. Finally, and perhaps less obviously, in the top right corner there is a DataMail support email address which is in the current web address language. In the case of 电邮.在线 the DataMail support email address is 支持@电邮.在线

Letʼs examine some of the technicalities of EAI (Email Address Internationalization). The structure of an email address is local-part@Domain Name where the Domain Name identifies a mail server and local-part identifies a mailbox on said mail server. The email addresses you will be most familiar with are ASCII local-part@ASCII Domain Name. IEAs, on the other hand, are of the form Unicode local-part@Unicode Domain Name. In order to make this form work we need to encode both parts with one encoding for the Unicode local-part and a different encoding for the Unicode Domain Name. The encoded email address is UTF-8@punycode. Users see the Unicode email address and Computers work with the encoded address.

For further technical reading, these are the primary EAI RFCs:

Thursday, 15 December 2016

grep highlighting

I frequently use grep to demonstrate and explain regular expressions (regex). I use it in interactive mode with the input coming from the keyboard and the output going to the screen. So, I type some string and if grep finds a match this input string is echoed to the screen. If no match is found then this input string is not echoed to the screen. I have used this teaching method for many years.

Recently, whilst using CentOS, I discovered that grep can highlight matched strings. The CentOS machine I used was setup with grep highlighting which is how I discovered it. I was impressed as it makes it clear exactly which text is matched.

My Mac OSX does not have grep highlighting with the default settings. I therefore decided to configure my OSX system so it does highlight grep matches as it is so useful. Rather than having to repeatedly type the relevant grep otions on the command line, I put them into my .bash_profile, as follows.

export GREP_OPTIONS='--color=auto'

export GREP_COLOR='1;34'  # 1=bold; 34=blue

I now give a grep terminal session extract which illustrates non matching and matching.

苹果电脑 ~: egrep '노팅엄'

안산 안양 부산 구미 제주 포항 양산

안산 안양 부산 노팅엄 구미 제주 포항 양산

안산 안양 부산 노팅엄 구미 제주 포항 양산

The text used in this terminal session is Korean Hangeul. Each word is a Korean city, apart from 노팅엄 which is Nottingham, a city in England. The Korean cities are: 안산 Ansan, 안양 Anyang, 부산 Busan, 구미 Gumi, 제주 Jeju, 포항 Pohang and 양산 Yangsan.

Note: I use egrep as it is short form for grep -E which enables extended regular expressions.

Environment: OSX Sierra v10.12.1

Saturday, 26 November 2016

Domain Name Registrations

To keep up to date with Domain Name registrations I highly recommend gd-domains. It gives listings of newly registered Domain Names on a daily basis. Listings for individual TLDs (Top Level Domains) are available. It is thanks to this site that I discovered the below impressive and sizeable family of Korean Domain Names. They were registered on 11th and 22nd November 2016. The TLD used is 닷컴 which is Verisign's Korean equivalent to com.

I think embedding telephone numbers into these IDNs (Internationalized Domain Names) is clever marketing ☎️

www.gd-domains.com/20161111-229 is the link for 닷컴 registrations on 11th November 2016. www.gd-domains.com/20161122-229 is the direct link for all 닷컴 registrations on 22nd November 2016, 2016년 11월 22일 화요일.

Wednesday, 26 October 2016

Family of Korean IDNs

The following is a list of functioning Korean IDNs (Internationalized Domain Names). The TLD (Top Level Domain) used is 닷컴 which is Verisign's Korean language equivalent to their com TLD.

This family of Korean IDNs are concerned with Computer Repair 컴퓨터수리. The first two characters of the first 6 IDNs are names of Korean cities: 김포 Gimpo, 안양 Anyang, 용인 Yongin, 대구 Daegu, 파주 Paju, 성남 Seongnam. I think the first 2 characters of the next 2 IDNs are districts or neighbourhoods of 서울 Seoul: 용산 Yongsan, 종로 Jongno. The last one 일산 Ilsan, is a 동 neighbourhood of 고양 Goyang.

The following family of Korean IDNs all resolve to Fun Design website. I discovered this family at newly.domains/20171128-229 홈페이지제작 means home page creation. The first 2 Korean characters of the first 9 IDNs are Korean cities: 과천 Gwacheon, 광주 Gwangju, 대구 Daegu, 대전 Daejeon, 부산 Busan, 서울 Seoul, 수원 Suwon, 울산 Ulsan and 인천 Incheon. The last one, 분당 Bundang, I am less certain about. I think it is a district of 성남 Seongnam.

Friday, 7 October 2016

Computer Science Internationalization — Bidi

Scripts such as Latin are written from Left to Right (L➡︎R). Scripts such as Arabic and Hebrew are written Right to Left (L⬅︎R). What happens when we mix L➡︎R and L⬅︎R scripts within a document? Here is an exercise in mixing scripts.

Take a mixed bidi (bidirectional) string consisting of Latin and Hebrew characters in a L➡︎R paragraph.

abcאבגdef

...and here is the same string in a L⬅︎R paragraph.

abcאבגdef

Now to the actual exercise. Copy the above stings to your text editor or word processor. You will need to setup the 2nd occurrence of the string as a L⬅︎R paragraph. I am assuming that your directionality is L➡︎R by default. Each string has two boundaries where the text changes direction. For each boundary you are going to insert a character, either a L➡︎R, such as x, or a L⬅︎R, such as ד. For each insertion operation use the initial mixed bidi string. There are two mixed strings above and so there are a total of 8 insertion operations. The challenge is to predict where in the strings the inserted character will appear before you actually insert the character. Give it a go! Good luck😀

If I did this exercise before I ever studied bidi, I would probably have scored 4/8. Now I understand how the computer is processing this bidi text and so I usually score full marks for such exercises. It is though not an intuitive process for me as I have spent most of my life reading and writing L➡︎R scripts only. I have to think very carefully as to how the computer does it in order to determine the correct answers.

The main purpose of this exercise is to think about the ordering of the characters in the strings. There are two orderings to consider: memory order and display order. Memory order is how it is logically saved in memory which in this case is the order in which I typed it. The memory order of the string I have used above is "abcגבאdef". Display order is how it is presented to the viewer. You have already seen, above, the two possible display orders for the single string "abcגבאdef".

I have used TextEdit for this exercise. In order to set paragraph text direction in TextEdit follow the path: "TextEdit➜ Format➜ Text➜ Writing Direction". Now set paragraph text direction to Right to Left. TextEdit correctly handles bidi text but that is not the case for all word processors or text editors.

There are several permutations of this exercise, including:

What happens at the boundaries with forward delete and back delete?
What happens if the initial memory order character(s) are L⬅︎R instead of L➡︎R?
Use Arabic instead of Hebrew as this introduces the additional challenge of letters changing shape according to preceding and following characters.

This article is aimed at L➡︎R reading/writing people. If you are a L⬅︎R person then you will need to invert some of my instructions. Actually, if you are a L⬅︎R person you will be totally familiar with mixing bidi text and so will fully understand this exercise.

Environment: OSX v10.12 (Sierra), TextEdit v1.12

Wednesday, 21 September 2016

Computer Science Curriculum Internationalization

I have been a long time practitioner and advocate of internationalising Computer Science teaching. My fundamental aim is to give students global computing skills. One such global skill, for example, is the processing of Unicode text rather than the very restricted ASCII text. Once one encompasses Unicode then one is encompassing most languages and scripts of the world.

Over the years I have tried to find other like minded Computer Science educators but have had no success. I had more or less concluded I am a solitary voice when it comes to Computer Science internationalisation. There does though appear to be some light as I recently discovered two organisations that promote internationalisation of teaching curricula.

 The Centre for Curriculum Internationalisation (CCI) which is based at Oxford Brookes University, UK. brookes.ac.uk/services/cci/index.html In addition to their website they have a google discussion group. I posted some information on my Computer Science Internationalisation initiatives and practices to this google forum. Please see groups.google.com/forum/#!topic/cicin/6XJCrqcdLD4

 Internationalisation of the Curriculum (IoC) in action which is based in Australia. ioc.global

Update 19th March 2017: I reached out to people and groups and I conclude I am still a solitary voice with respect to Computer Science Curricula Internationalization in UK Universities. I do believe UK Universities will have to embrace Computer Science Internationalization but I think it will be at least ten years before that happens. So, why do I persist? Am I wrong? Well, if I am wrong then so are, Google, Wikipedia, Facebook, Nivea, Booking.com, Nestlé, Hotels.com, Pampers, Intel, Microsoft, Philips, Adobe, Twitter and many many more companies. They all operate globally and are all producing software for the world. These global companies need graduates who have the skills and attitude necessary for building global software.
Note: I have taken these company names from The top 25 global websites from the 2017 Web Globalization Report Card globalbydesign.com/2017/02/16/the-top-25-global-websites-from-the-2017-web-globalization-report-card/

Update 1st October 2017: I recently created an open forum specifically for discussion on the topic of Computer Science/ICT/IT curricula internationalisation. If this topic interests you, please become a member and join in the discussions. It is open to all. Please see groups.google.com/forum/m/#!forum/computer-science-curriculum-internationalization