Monday 10 July 2017

Computer Science Internationalization - Unicode Encoding & Decoding

Several years ago I devised this visual and fun way to teach and practise encoding and decoding Unicode. I used this method in my International Computing class. It needs nothing more than a pencil and an eraser.

The codepoints and the UTF-8 bytes are all written in hexadecimal (hex). The binary bits are an intermediate form for the purposes of encoding and decoding.

We start with the following form which is designed for encoding Unicode codepoints to UTF-8 and decoding UTF-8 to Unicode codepoints.
Encoding: We will start with encoding Unicode codepoints to UTF-8.

The first thing we can do is fill in the fixed bits. They are the fixed bits defined by the encoding scheme. I have entered the fixed bits in red to make them distinct from variable bits.
Now we will write one or more Unicode codepoints on the form. These will be the codepoints we will encode into UTF-8. The codepoints should be written in hexadecimal. I will use the codepoints U+0444 and U+597D.

So, how do we determine where the codepoints go on the form? We need to look at the free bits in each row to determine the range of values that row can accommodate.

  • 1 byte row - 7 free variable bits giving a range of 0 ➔ 7F
  • 2 byte row - 11 free variable bits giving a range of 80 ➔ 7FF
  • 3 byte row - 16 free variable bits giving a range of 800 ➔ FFFF
  • 4 byte row - 21 free variable bits giving a range of 10000 ➔ 1FFFFF (the actual maximum value of a codepoint is 10FFFF)
Now that we know the ranges, we can put U+0444 and U+597D in the correct places on the form: U+0444 falls in the 2 byte row and U+597D in the 3 byte row.
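The range check above can be sketched in Python. This is only an illustration of the table, not part of the pencil-and-eraser method; the function name `utf8_row` is my own invention, and the sketch ignores the surrogate range U+D800 to U+DFFF, which cannot be encoded.

```python
def utf8_row(codepoint: int) -> int:
    """Return which row of the form (1 to 4 bytes) a codepoint belongs in."""
    if codepoint < 0 or codepoint > 0x10FFFF:
        raise ValueError("not a Unicode codepoint")
    if codepoint <= 0x7F:      # fits in 7 free bits
        return 1
    if codepoint <= 0x7FF:     # fits in 11 free bits
        return 2
    if codepoint <= 0xFFFF:    # fits in 16 free bits
        return 3
    return 4                   # needs the 21 free bits of the 4 byte row


for cp in (0x0444, 0x597D):
    print(f"U+{cp:04X} goes in the {utf8_row(cp)} byte row")
```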

We have empty boxes into which we write the binary values of the codepoints.
Finally, we take the complete bytes and write them as hexadecimal values to form the UTF-8 encoded forms. U+0444 encoded is D184, U+597D encoded is E5A5BD.
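For readers who want to check their pencil work, here is a minimal Python sketch of the same steps: pick the row, write the fixed bits, then drop the codepoint's bits into the free positions six at a time. The function name `encode_utf8` is hypothetical, and the sketch does not reject surrogates. Python's built-in `str.encode` is used only to cross-check the result.

```python
def encode_utf8(cp: int) -> bytes:
    """Encode one codepoint by combining fixed bits with variable bits."""
    if cp < 0 or cp > 0x10FFFF:
        raise ValueError("not a Unicode codepoint")
    if cp <= 0x7F:       # 0xxxxxxx
        return bytes([cp])
    if cp <= 0x7FF:      # 110xxxxx 10xxxxxx
        return bytes([0b11000000 | (cp >> 6),
                      0b10000000 | (cp & 0b111111)])
    if cp <= 0xFFFF:     # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0b11100000 | (cp >> 12),
                      0b10000000 | ((cp >> 6) & 0b111111),
                      0b10000000 | (cp & 0b111111)])
    # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0b11110000 | (cp >> 18),
                  0b10000000 | ((cp >> 12) & 0b111111),
                  0b10000000 | ((cp >> 6) & 0b111111),
                  0b10000000 | (cp & 0b111111)])


for cp in (0x0444, 0x597D):
    encoded = encode_utf8(cp)
    # Cross-check against Python's own encoder.
    assert encoded == chr(cp).encode("utf-8")
    print(f"U+{cp:04X} -> {encoded.hex().upper()}")
```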
Decoding: Now onto decoding from UTF-8 to Unicode codepoints. We will decode the UTF-8 F0AA9FB7 which I have entered onto the form. I have used spaces on the form to make the byte boundaries more obvious.
Complete the bytes by writing the binary variable values.
Extract the variable binary values to form the hex Unicode codepoint U+2A7F7.
Whilst I was at it, I completed a single byte entry. The single byte characters are ASCII characters. ASCII is a subset of Unicode.
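The decoding steps can be sketched the same way. In this hypothetical `decode_utf8_char` function, the fixed bits of the first byte tell us which row we are on, and the variable bits are then collected six at a time from the remaining bytes. It is a teaching sketch only: it handles exactly one character and does not reject overlong encodings or surrogates.

```python
def decode_utf8_char(data: bytes) -> int:
    """Decode the UTF-8 bytes of a single character to its codepoint."""
    first = data[0]
    if first >> 7 == 0b0:          # 0xxxxxxx
        length, cp = 1, first
    elif first >> 5 == 0b110:      # 110xxxxx
        length, cp = 2, first & 0b11111
    elif first >> 4 == 0b1110:     # 1110xxxx
        length, cp = 3, first & 0b1111
    elif first >> 3 == 0b11110:    # 11110xxx
        length, cp = 4, first & 0b111
    else:
        raise ValueError("invalid first byte")
    if len(data) != length:
        raise ValueError("wrong number of bytes for this row")
    for byte in data[1:]:
        if byte >> 6 != 0b10:      # continuation bytes are 10xxxxxx
            raise ValueError("invalid continuation byte")
        cp = (cp << 6) | (byte & 0b111111)
    return cp


for hex_bytes in ("F0AA9FB7", "57"):
    cp = decode_utf8_char(bytes.fromhex(hex_bytes))
    print(f"{hex_bytes} -> U+{cp:04X}")
```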

It is a Unicode convention, when writing codepoints, to use a minimum of four hex digits. So for codepoints below U+1000 (hex), one should left pad with zeroes. Hence my entries U+0444 and U+0057 rather than U+444 and U+57.
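In Python this convention is a one-line format specification: `04X` means uppercase hex, zero-padded to at least four digits. The helper name `u_plus` is just for illustration.

```python
def u_plus(cp: int) -> str:
    """Write a codepoint in U+ notation with at least four hex digits."""
    return f"U+{cp:04X}"


print(u_plus(0x444), u_plus(0x57), u_plus(0x2A7F7))
```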

Sunday 2 July 2017

Computer Science Internationalization - Text Search

So, you have just written some Cool Code which will search for and find occurrences of specified text strings. You have access to Big Data text, e.g. all the text in all public webpages. You will, of course, want to test your Cool Code. Letʼs perform some seemingly very simple tests.

Letʼs search for the word 'Scorpion'. Your code works just fine and hence finds all occurrences of the word 'Scorpion'.

Now test with the following two words.

  • Scorpion
  • Scorpion

Your Cool Code works fine, as all I have done is apply some CSS styling, thus giving each of the two words a different appearance. The underlying characters are unchanged.

Now test your Cool Code with the following two words.

  • 𝑆𝑐𝑜𝑟𝑝𝑖𝑜𝑛
  • 𝐒𝐜𝐨𝐫𝐩𝐢𝐨𝐧

If you have only programmed for ASCII text then your now not-so-Cool Code will fail. These two words have a different appearance because they are not made up of the ASCII characters you are familiar with. These words use characters from the Unicode Mathematical Alphanumeric Symbols block, U+1D400–U+1D7FF.
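To see the failure for yourself, here is a small Python sketch. The helper `to_math_bold` is my own illustration; it relies on the bold capitals starting at U+1D400 and the bold small letters at U+1D41A within that block. A plain substring search then has nothing to match.

```python
def to_math_bold(text: str) -> str:
    """Map ASCII letters to their Mathematical Bold counterparts."""
    out = []
    for ch in text:
        if "A" <= ch <= "Z":
            out.append(chr(0x1D400 + ord(ch) - ord("A")))
        elif "a" <= ch <= "z":
            out.append(chr(0x1D41A + ord(ch) - ord("a")))
        else:
            out.append(ch)  # leave everything else untouched
    return "".join(out)


page = "Beware the " + to_math_bold("Scorpion") + " in the tale"

# A naive search sees eight unrelated codepoints, not the word.
print("Scorpion" in page)  # False
print([f"U+{ord(c):04X}" for c in to_math_bold("Scorpion")])
```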

Should the Math Alphanumeric Symbols Scorpion be treated the same as the ASCII Scorpion with respect to the search results of your code? In this context I think "Yes", most definitely. A person reading this blog, for example, will just perceive the word Scorpion whatever characters are used to write the word. The reader may well also visualise the arachnid with a "sting in the tail"😱

What of current working practice?

With Twitter, a user has no means of changing text style within a tweet. It has thus become common to use Unicode Math Alphanumeric Symbols to change appearance. I could, for example, use Unicode Math Alphanumeric Symbols to emphasise a word (e.g. Scorpion) or phrase within a tweet. The meaning of the tweet remains the same.

Google returns the same number of search results whichever of the above forms of Scorpion I use. At time of writing this is "About 144,000,000 results". I deduce Google is treating ASCII Scorpion and Unicode Math Alphanumeric Symbols 𝑆𝑐𝑜𝑟𝑝𝑖𝑜𝑛 & 𝐒𝐜𝐨𝐫𝐩𝐢𝐨𝐧 as equivalent.

Sogou 搜狗 is a Chinese search engine. Using Sogou: ASCII Scorpion returns 93,341 results, Math Alphanumeric Symbols 𝑆𝑐𝑜𝑟𝑝𝑖𝑜𝑛 returns 4,738, Math Alphanumeric Symbols 𝐒𝐜𝐨𝐫𝐩𝐢𝐨𝐧 returns 61. I think it evident that Sogou does not treat my three forms of Scorpion as equivalent.

I side with Google on this.

Here is a taster of the behind-the-scenes technicalities of Unicode. Letʼs take just one of the Unicode Math Alphanumeric Symbols I have used, 𝐒 MATHEMATICAL BOLD CAPITAL S U+1D412. If you visit codepoints.net/U+1D412 you will see a wealth of information about this character. Of relevance to this blog is the Decomposition Mapping, which is to the oh so familiar ASCII capital S (U+0053). This Unicode information can be used to compute equivalent strings, which can then be used for search, thus providing all relevant results.
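One way to use that Decomposition Mapping from Python is the standard `unicodedata` module. This is a sketch of the idea, not a claim about how Google does it: I am assuming that NFKC normalization, which applies the compatibility decompositions, is a reasonable way to build a search key. The name `search_key` is hypothetical.

```python
import unicodedata


def search_key(text: str) -> str:
    """Fold compatibility variants (e.g. math bold/italic) to base letters."""
    return unicodedata.normalize("NFKC", text)


# "Scorpion" written with Mathematical Bold and Mathematical Italic letters.
bold = ("\U0001D412\U0001D41C\U0001D428\U0001D42B"
        "\U0001D429\U0001D422\U0001D428\U0001D427")
italic = ("\U0001D446\U0001D450\U0001D45C\U0001D45F"
          "\U0001D45D\U0001D456\U0001D45C\U0001D45B")

# The raw decomposition mapping for MATHEMATICAL BOLD CAPITAL S.
print(unicodedata.decomposition("\U0001D412"))

# All three forms produce the same search key.
print(search_key(bold) == search_key(italic) == search_key("Scorpion"))
```

A search index built on `search_key(text)` rather than the raw text would return the same results for all three spellings.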

The moral of this "Sting in the Tale" is: if you do not already know Unicode, you must learn it. It is essential.