Friday, 10 November 2017

Computer Science Internationalization - html tables


Subtitled: html tables with Chinese characteristics


Subtitled: applying Chinese and Korean font metrics to html tables


I think html tables look much better when the table cells have a square aspect ratio, i.e. a perfect square rather than a rectangle.

There is actually a very easy way of making your html table cells square. Firstly one needs to have some understanding of the characteristics of human language scripts. One characteristic of the Chinese language script is that each character 汉字 occupies a square. We can use this characteristic when coding html tables.

I wrote some html code for a Sudoku board. In each cell I have an input. The relevant code, for this article, is:

<td class='bordy'><input type='text' class='bigcol' size='1'></td>

Let's look at relevant CSS in class bigcol, specifically the CSS property font-family —

font-family:Courier,monospace; and font-family:monospace; give a rectangular aspect ratio, the height being greater than the width. Not good.

Let's now use a Chinese font: font-family:"Hannotate SC",monospace;.  With Firefox, we now have a really good looking table with square aspect ratio table cells. Chrome, on the other hand, produces vertical rectangular aspect ratio table cells.

Using the template font-family: font name ,monospace;, let's try some more Chinese fonts.

Firefox + font PingFang SC produces a perfect square. Chrome + PingFang SC gives a vertical rectangle.

Firefox + Yuanti SC 圆体-简 produces a perfect square. Chrome + Yuanti SC 圆体-简 gives a vertical rectangle.

Firefox + Baoli SC 报隶-简 produces a perfect square. Chrome + Baoli SC 报隶-简 gives a vertical rectangle.

Firefox + Lantinghei SC 兰亭黑-简 produces a perfect square. Chrome + Lantinghei SC 兰亭黑-简 gives a vertical rectangle.

Using font Heiti SC, Chrome gives much better results than Firefox. Firefox is way out. Chrome + Heiti SC produces a near perfect square. Firefox + Heiti SC gives a horizontal rectangle.

My expectation is that all browsers, when Chinese fonts are specified, should give perfect squares for the table cells as the squared Chinese character is a fundamental characteristic of the Chinese language. From my experimentation, one can see that there are differences between browsers and a perfect square is not always produced.
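The comparisons above were made by eye. A minimal sketch of how the same judgement could be made from measurements follows. The helper name classifyAspect and the 5% tolerance are my own assumptions for illustration and were not part of the original experiment; the browser-only part assumes the cells carry the 'bordy' class shown earlier.

```javascript
// Hypothetical helper: classify a cell's shape from its rendered size.
// The 5% tolerance is an arbitrary assumption, not a measured threshold.
function classifyAspect(width, height, tolerance = 0.05) {
  if (!(width > 0) || !(height > 0)) {
    throw new RangeError("width and height must be positive numbers");
  }
  const ratio = width / height;
  if (Math.abs(ratio - 1) <= tolerance) return "square";
  return ratio < 1 ? "vertical rectangle" : "horizontal rectangle";
}

// Browser-only usage: measure every visible cell with the 'bordy' class.
if (typeof document !== "undefined") {
  for (const td of document.querySelectorAll("td.bordy")) {
    const { width, height } = td.getBoundingClientRect();
    if (width > 0 && height > 0) {
      console.log(classifyAspect(width, height));
    }
  }
}
```

Running the loop in the Firefox and Chrome consoles with the same font-family would give a like-for-like comparison of the two browsers.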

One of the characteristics of the Korean language script Hangul (also romanised as Hangeul) 한글 is that the individual letters Jamo are formed into squared syllable blocks. Letʼs explore this characteristic with the Korean font  Nanum Myeongjo 나눔 명조.  Firefox + Nanum Myeongjo  produces a near perfect square. Chrome produces a vertical rectangle.

Using font "Noto Sans Korean"  results in really awful aspect ratios in both browsers. Chrome  + "Noto Sans Korean" produces vertical rectangles. Firefox + "Noto Sans Korean" gives horizontal rectangles.

I reason that there are dependencies between browsers and font metrics. In the "Noto Sans Korean" case, it seems that the two browsers have a 90 degree difference in their interpretation of this font's metrics.

My favourite combinations, so far, are: Firefox + Hannotate SC 手札体-简, Firefox + PingFang SC 苹方-简, Firefox + Yuanti SC 圆体-简, Firefox + Baoli SC 报隶-简, Firefox + Lantinghei SC 兰亭黑-简, Firefox + Nanum Myeongjo 나눔 명조 and Chrome + Heiti SC 黑体-简.

〖 thoughts — Should look at more fonts. Are there any other human language scripts that are squared?〗

Here is what my Sudoku board looks like with Firefox + Hannotate SC font. This font contains glyphs for English as well as Chinese. I do like the Hannotate SC font style used for English as well as the Hannotate SC font style used for Chinese. I had a bit of fun with the Emoji as I selected them by Chinese name. Actually, an Emoji Sudoku might be popular. Instead of the numbers 1 thru 9, an Emoji Sudoku would have 9 different Emoji.


Firefox + Hannotate SC
Environment: OSX Sierra 10.12.6, Firefox 56.0.2, Chrome 62.0.3202.89

Friday, 3 November 2017

Computer Science Internationalization - Unicode Emoji

At the moment I am not going to give much explanation because I may be setting my students a challenge related to the Emoji below. You may find these Emoji useful for certain combinations of software and database. I will be adding more Unicode Emoji over the next week or so. I will fully explain in a few weeks' time, so check back then for my full explanation.


☺️ ☹️ ☠️ ✌️ ☃️ ☄️ ☎️ ☑️ ☔️ ☕️ ☘️ ☝️ ☢️ ☣️ ☦️ ☪️ ☮️ ☯️ ☸️ ♈️ ♉️ ♊️ ♋️ ♌️ ♍️ ♎️ ♏️ ♐️ ♑️ ♒️ ♓️ ♥️ ♦️ ♨️ ♻️ ♿️ ⚒️ ⚓️ ⚔️ ⚖️ ⚗️ ⚙️ ⚛️ ⚜️ ⚠️ ⚡️ ⚪️ ⚫️ ⚰️ ⚱️ ⚽️ ⚾️ ⛄️ ⛅️ ⛈️ ⛎️ ⛏️ ⛑️ ⛓️ ⛔️ ⛩️ ⛪️ ⛰️ ⛱️ ⛲️ ⛳️ ⛴️ ⛵️ ⛷️ ⛸️ ⛹️ ⛺️ ⛽️ ✂️ ✅️ ✈️ ✉️ ✊️ ✋️ ✌️ ✍️ ✏️ ✒️ ✝️ ✡️ ✨️ ✳️ ✴️ ❄️ ❇️ ❌️ ❎️ ❓️ ❔️ ❕️ ❗️ ❣️ ❤️ ➡️ 

Wednesday, 1 November 2017

Computer Science Internationalization - i18n links

In this article I list i18n (Internationalisation) relevant links. This article will evolve as and when I add and remove links.

Apple: OSX and iOS
  1. m10lmac.blogspot.co.uk — Multilingual Mac

Computer Science/IT/ICT Curricula Internationalisation
  1. groups.google.com/forum/#!forum/computer-science-curriculum-internationalization — I am administrator for this forum

IDNs (Internationalised Domain Names) and EAI (Email Address Internationalisation)
  1. uasg.tech — Universal Acceptance Steering Group - I am a member of their email discussion list
  2. idnforums.com — IDN Forums - a Domainers forum - I am a member of this forum.
  3. Get a free
    1. 电邮.在线 — Chinese 中文 email address or
    2. ডাটামেল্.ভারত — Bengali email address or
    3. ડાટામેલ.ભારત — Gujarati email address or
    4. डाटामेल.भारत — Hindi email address or
    5. डेटामेल.भारत — Marathi email address or
    6. ਡਾਟਾਮੇਲ.ਭਾਰਤ — Punjabi email address or
    7. இந.இந்தியா — Tamil email address or
    8. datamail.in — English email address

Regular Expressions (regex)
  1. regular-expressions.info/refunicode.html — Regular Expression Unicode Syntax Reference
  2. speakerdeck.com/andre_schappo/unicode-regular-expressions — one of my presentations

twitter
  1. twitter.com/r12a — Richard Ishida

Unicode
  1. unicode.org — the definitive source for all things Unicode - I am a member of their email discussion list
  2. babelstone.co.uk/Unicode/unicode.html — unicode, the movie

W3
  1. w3.org/International — W3C Internationalization (i18n) Activity
  2. validator.w3.org/i18n-checker/ — i18n checker
  3. developer.mozilla.org/en-US/docs/Web/CSS/list-style-type — CSS list-style-type property

Wednesday, 25 October 2017

Computer Science Internationalization - Experimentation

Subtitle: "How I Discovered the Undiscoverable!"

Note: This is a work in progress and will probably take me several weeks to complete

I was writing some demonstrator code for an Introductory JavaScript class. I intended the code to illustrate expected and unexpected behaviour of the length property. Expected behaviour is when the result of the length property is equal to the number of human perceived characters. Unexpected behaviour is when the result of the length property is not equal to the number of human perceived characters.

"诺丁汉".length returns 3 (3 encoding units)
"ノッティンガム".length returns 7 (7 encoding units)
"노팅엄".length returns 3 (3 encoding units)

All good so far. These are answers that anyone would expect. Now letʼs try some Unicode Emoji.

"🐟".length returns 2 (2 encoding units)
"🐕".length returns 2 (2 encoding units)

...and, some non Emoji SMP (Supplementary Multilingual Plane) Unicode characters

"𓀌".length returns 2 (2 encoding units)
"🀤".length returns 2 (2 encoding units)

And now we observe some weirdness. In terms of human perceived characters the answer should, of course, be 1, so for most people this behaviour is unexpected. It is not unexpected for me as I know that the length property counts in UTF-16 encoding units rather than human perceived characters. I have written the number of UTF-16 encoding units in brackets so that you can understand the answer the length property returns.
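A minimal sketch of the two ways of counting, for comparison. The function name countUnits is mine. Note that counting code points happens to match the human perceived count for the examples so far, but that is not guaranteed for every string.

```javascript
// Compare what .length reports (UTF-16 encoding units) with the number
// of Unicode code points in the same string.
function countUnits(str) {
  return {
    utf16Units: str.length,
    // Array.from uses the string iterator, which steps by code point.
    codePoints: Array.from(str).length,
  };
}

console.log(countUnits("诺丁汉"));      // { utf16Units: 3, codePoints: 3 }
console.log(countUnits("\u{1F41F}")); // 🐟 { utf16Units: 2, codePoints: 1 }
```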

Before we proceed further I need to give you further information. I can write Chinese on a Computer and Emoji can be selected by Chinese name using OSX Sierra's Simplified Pinyin Input Method. See schappo.blogspot.co.uk/2016/01/emoji-by-name.html

When I want  Emoji I sometimes use OSX's Emoji and Symbol Viewer and sometimes select by Chinese name.

Now we come to the random bit. I typed yu in the Simplified Pinyin Input Method and there were 6 different Emoji to choose from. I chose 🌧️ . I had no reason to type yu nor to choose 🌧️ , I was just experimenting. Now we come back to the length property.

"🌧️".length returns 3 (??????????!) [U+1F327 U+FE0F]
"🌦️".length returns 3 (??????????!) [U+1F326 U+FE0F]

It was most definitely not the answer I was expecting. After some 10 minutes of investigation I discovered the reason for this unexpected answer. With these two Emoji the variation selector U+FE0F (codepoints.net/U+FE0F) is appended, thus giving a count of 3: two encoding units for the Emoji itself plus one for the variation selector. We now have the answer to the length anomaly. But why do some Emoji have the variation selector appended and not others?
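The kind of check that reveals the hidden variation selector can be sketched as follows. The function name codePointsOf is mine, and the Emoji is written with escapes so that the appended U+FE0F is visible in the source code.

```javascript
// List the code points of a string in U+XXXX notation.
function codePointsOf(str) {
  return Array.from(str, (ch) =>
    "U+" + ch.codePointAt(0).toString(16).toUpperCase().padStart(4, "0")
  );
}

const rain = "\u{1F327}\uFE0F"; // 🌧️ as produced by the input method
console.log(codePointsOf(rain)); // [ 'U+1F327', 'U+FE0F' ]
console.log(rain.length);        // 3
```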

Peter Edberg gives this excellent explanation.

This is about characters U+1F327,U+1F326

The variation selector FE0F is *not* unnecessary with these. Looking at unicode.org/Public/emoji/5.0/emoji-data.txt those characters do *not* have the Emoji-Presentation property set, and they do have variation sequences defined.

From unicode.org/reports/tr51/#Emoji_Variation_Selector_Notes, such singleton emoji characters “should have emoji presentation selectors on base characters with Emoji_Presentation=No whenever an emoji presentation is desired”

I stated: I see that U+1F321➜1F32C do not have the Emoji_Presentation property set.

Peter Edberg responded: From unicode.org/emoji/charts-5.0/emoji-versions-sources.html you can see that these characters came into Unicode as a result of their being in the Webdings/Wingdings set, where they had a prior history of being non-emoji text characters. That is why they have Emoji_Presentation=No by default.
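The Emoji_Presentation property Peter refers to can be queried from JavaScript with a Unicode property escape. This is a minimal sketch; the function name is mine, and the result depends on the version of the Unicode data built into the JavaScript engine.

```javascript
// \p{Emoji_Presentation} is a Unicode property escape; it needs the u flag.
function hasEmojiPresentation(ch) {
  return /^\p{Emoji_Presentation}$/u.test(ch);
}

console.log(hasEmojiPresentation("\u{1F41F}")); // 🐟 true
console.log(hasEmojiPresentation("\u{1F327}")); // 🌧 false
```

A character for which this returns false is one that, per the TR51 note quoted above, should carry U+FE0F whenever an emoji presentation is desired.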

Letʼs now examine my bold claim "I discovered the Undiscoverable"

In order to make this discovery there is a set of required knowledge, skills and personality traits. These include:
  • A Knowledge of JavaScript
  • A good understanding of Unicode
  • The ability to write Chinese using OSX Sierra's Pinyin Simplified Input Method
  • Knowing that Emoji can be selected by Chinese name using Sierra's Pinyin Simplified Input Method
  • Being aware of the JavaScript length property quirk
  • A desire to experiment and explore
Considering that the World population is less than 8 billion (estimate) I think it (near) impossible that any other person (in Academia, Staff or Student) would at the instant of time I made the discovery meet the requirements necessary to make the same discovery. By instant of time I do mean as perceived by a person 〖 say less than one second. I need to research this!! 〗 because, of course, our thought process is not instant even though we experience it as such.

Window of opportunity for my discovery — I reason that the window of opportunity for the discovery started when 🌧️  was available. It was added to Unicode version 7.0 in 2014. It would probably have been another year before it became available and integrated into Apple's OSX. I made the discovery on Saturday 21st October 2017. twitter.com/andreschappo/status/921722952504238081 Given this reasoning the window of opportunity for this discovery is approximately 3 years.

Consequences: My first thought was that this anomaly would cause problems with Emoji Domains. Using mothereff.in/punycode, 🌧️.ws (with variation selector) gives the punycode address xn--v86c7044b.ws and 🌧.ws (without variation selector) gives the punycode address xn--kh8h.ws. So, these are obviously different addresses. When 🌧️.ws is pasted into the Firefox address bar, Firefox needs to convert from the Unicode form 🌧️.ws to the punycode form. The punycode form it uses is xn--kh8h.ws, so it is evident that Firefox disregards the variation selector on conversion. Computers and Routers use the punycode form; the Unicode form is used for display to humans.
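To see why the two forms produce different addresses, here is a sketch of the Punycode encoding step as described in RFC 3492. It is not the code used by mothereff.in or Firefox, the function names are mine, and it performs only the raw encoding: no IDNA mapping, validation or overflow checks.

```javascript
// Bootstring parameters for Punycode, as given in RFC 3492.
const BASE = 36, TMIN = 1, TMAX = 26, SKEW = 38, DAMP = 700;
const INITIAL_BIAS = 72, INITIAL_N = 128;

// Digit values 0..25 map to a..z and 26..35 map to 0..9.
function digitToChar(d) {
  return String.fromCharCode(d < 26 ? 97 + d : 22 + d);
}

// Bias adaptation function from RFC 3492 section 6.1.
function adapt(delta, numPoints, firstTime) {
  delta = Math.floor(delta / (firstTime ? DAMP : 2));
  delta += Math.floor(delta / numPoints);
  let k = 0;
  while (delta > Math.floor(((BASE - TMIN) * TMAX) / 2)) {
    delta = Math.floor(delta / (BASE - TMIN));
    k += BASE;
  }
  return k + Math.floor(((BASE - TMIN + 1) * delta) / (delta + SKEW));
}

// Encode one label (no dots) to Punycode, without the xn-- prefix.
function punycodeEncode(label) {
  const input = Array.from(label, (ch) => ch.codePointAt(0));
  const basic = input.filter((cp) => cp < 128);
  let output = String.fromCharCode(...basic);
  if (basic.length > 0) output += "-";

  let n = INITIAL_N, delta = 0, bias = INITIAL_BIAS;
  let handled = basic.length;
  while (handled < input.length) {
    // Smallest code point that has not been handled yet.
    const m = Math.min(...input.filter((cp) => cp >= n));
    delta += (m - n) * (handled + 1);
    n = m;
    for (const cp of input) {
      if (cp < n) delta++;
      if (cp === n) {
        // Write delta as a variable-length integer.
        let q = delta;
        for (let k = BASE; ; k += BASE) {
          const t = k <= bias ? TMIN : k >= bias + TMAX ? TMAX : k - bias;
          if (q < t) break;
          output += digitToChar(t + ((q - t) % (BASE - t)));
          q = Math.floor((q - t) / (BASE - t));
        }
        output += digitToChar(q);
        bias = adapt(delta, handled + 1, handled === basic.length);
        delta = 0;
        handled++;
      }
    }
    delta++;
    n++;
  }
  return output;
}

console.log("xn--" + punycodeEncode("\u{1F327}"));       // xn--kh8h
console.log("xn--" + punycodeEncode("\u{1F327}\uFE0F")); // xn--v86c7044b
```

The variation selector is just another code point to the encoder, so it changes the output. Dropping it before encoding gives the shorter address.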

I realise that Emoji Domains are IDNA2008 disallowed, but, I figure they will be around for a good number of years yet to come.

〖 I will check out BMP and SMP Emoji with IDNA2003, UTS46 and IDNA2008 using unicode.org/cldr/utility/idna.jsp 〗

Why was this my first thought? I am a long time practitioner of internationalised Computer Science teaching, and IDNs (Internationalised Domain Names) are an important part of i18n. Emoji Domains are a controversial subset of IDNs. I am an active member of the UASG discussion email list uasg.tech. I am also an active member of IDN Forums idnforums.com. I have learned much from the Domainers on IDN Forums. Thanks guys 👋


〖 Cor Blimey - I am getting into Combinatorial Explosion or Infinite Regression or Both😜  〗

...much more to add but that is all for now...

〖 To Do List —

  • JavaScript Code to count correctly — link to Mathias Bynens
  • Calculate Odds against the discovery being made (I will need help on this as my Maths is rubbish)
  • Check all Emoji available in Sierra to see if any others have a count of 3
  • Consequences of 🌧️  counting anomaly
  • Check OSX High Sierra to see if the 🌧️  counting anomaly still exists
  • CJK variants domain names - need to do some research https://en.wikipedia.org/wiki/Variation_Selectors_(Unicode_block) 〗


Environment: OSX Sierra version 10.12.6, Firefox version 56.0.2

Monday, 10 July 2017

Computer Science Internationalization - Unicode Encoding & Decoding

Several years ago I devised this visual and fun way to teach and practise encoding and decoding Unicode. I used this method in my International Computing class. This method involves use of pencil and eraser.

The codepoints and the UTF-8 are all written in hexadecimal (hex). The binary bits are an intermediate form for the purposes of encoding and decoding.

We start with the following form which is designed for encoding Unicode codepoints to UTF-8 and decoding UTF-8 to Unicode codepoints.
Encoding: We will start with encoding Unicode codepoints to UTF-8.

The first thing we can do is fill in the fixed bits. They are the fixed bits defined by the encoding scheme. I have entered the fixed bits in red to make them distinct from variable bits.
Now we will write one or more Unicode codepoints on the form. These will be the codepoints we will encode into UTF-8. The codepoints should be written in hexadecimal. I will use the codepoints U+0444 and U+597D.

So, how do we determine where the codepoints go on the form? We need to look at the free bits to determine the range of values that can be accommodated.

  • 1 byte row - 7 free variable bits giving a range of 0 ➔ 7F
  • 2 byte row - 11 free variable bits giving a range of 80 ➔ 7FF
  • 3 byte row - 16 free variable bits giving a range of 800 ➔ FFFF
  • 4 byte row - 21 free variable bits giving a range of 10000 ➔ 1FFFFF (the actual maximum value of a codepoint is 10FFFF)
Now we know the ranges we can put U+0444 and U+597D in the correct places of the form.

We have empty boxes into which we write the binary values of the codepoints.
Finally, we take the complete bytes and write them as hexadecimal values to form the UTF-8 encoded forms. U+0444 encoded is D184, U+597D encoded is E5A5BD.
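The pencil and eraser steps for encoding can also be sketched in code, which is handy for checking answers. The function name utf8EncodeHex is mine; the row chosen for each codepoint follows the ranges listed above, and the fixed bits are the constants 0xC0, 0xE0, 0xF0 and 0x80.

```javascript
// Encode one codepoint to UTF-8 and return the bytes as a hex string.
function utf8EncodeHex(cp) {
  if (!Number.isInteger(cp) || cp < 0 || cp > 0x10ffff) {
    throw new RangeError("codepoint must be in the range 0 to 10FFFF");
  }
  if (cp >= 0xd800 && cp <= 0xdfff) {
    throw new RangeError("surrogates cannot be encoded in UTF-8");
  }
  let bytes;
  if (cp <= 0x7f) {
    bytes = [cp]; // 1 byte row: 0xxxxxxx
  } else if (cp <= 0x7ff) {
    // 2 byte row: 110xxxxx 10xxxxxx
    bytes = [0xc0 | (cp >> 6), 0x80 | (cp & 0x3f)];
  } else if (cp <= 0xffff) {
    // 3 byte row: 1110xxxx 10xxxxxx 10xxxxxx
    bytes = [0xe0 | (cp >> 12), 0x80 | ((cp >> 6) & 0x3f), 0x80 | (cp & 0x3f)];
  } else {
    // 4 byte row: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    bytes = [
      0xf0 | (cp >> 18),
      0x80 | ((cp >> 12) & 0x3f),
      0x80 | ((cp >> 6) & 0x3f),
      0x80 | (cp & 0x3f),
    ];
  }
  return bytes
    .map((b) => b.toString(16).toUpperCase().padStart(2, "0"))
    .join("");
}

console.log(utf8EncodeHex(0x0444)); // D184
console.log(utf8EncodeHex(0x597d)); // E5A5BD
```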
Decoding: Now onto decoding from UTF-8 to Unicode codepoints. We will decode the UTF-8 F0AA9FB7 which I have entered onto the form. I have used spaces on the form to make the byte boundaries more obvious.
Complete the bytes by writing the binary variable values.
Extract the variable binary values to form the hex Unicode codepoint U+2A7F7.
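The decoding steps can be sketched in the same way. The function name utf8DecodeHex is mine. It decodes the bytes of a single character and, to keep the sketch short, does not reject overlong forms or out-of-range results.

```javascript
// Decode the UTF-8 bytes of one character (given as hex) to U+XXXX notation.
function utf8DecodeHex(hex) {
  if (!/^([0-9a-fA-F]{2})+$/.test(hex)) {
    throw new SyntaxError("expected pairs of hex digits");
  }
  const bytes = hex.match(/../g).map((pair) => parseInt(pair, 16));
  const lead = bytes[0];
  let cp, extra;
  // The fixed bits of the first byte tell us which row of the form to use.
  if (lead < 0x80) { cp = lead; extra = 0; }
  else if ((lead & 0xe0) === 0xc0) { cp = lead & 0x1f; extra = 1; }
  else if ((lead & 0xf0) === 0xe0) { cp = lead & 0x0f; extra = 2; }
  else if ((lead & 0xf8) === 0xf0) { cp = lead & 0x07; extra = 3; }
  else throw new RangeError("invalid first byte");
  if (bytes.length !== extra + 1) {
    throw new RangeError("wrong number of bytes for this first byte");
  }
  for (const b of bytes.slice(1)) {
    if ((b & 0xc0) !== 0x80) throw new RangeError("invalid continuation byte");
    cp = (cp << 6) | (b & 0x3f); // append the 6 variable bits
  }
  return "U+" + cp.toString(16).toUpperCase().padStart(4, "0");
}

console.log(utf8DecodeHex("F0AA9FB7")); // U+2A7F7
console.log(utf8DecodeHex("57"));       // U+0057
```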
Whilst I was at it, I completed a single byte entry. The single byte characters are ASCII characters. ASCII is a subset of Unicode.

It is a Unicode convention, when writing codepoints, to use a minimum of four hex digits. So for codepoints below hex 1000, one should left pad with zeroes. Hence my entries U+0444 and U+0057 rather than U+444 and U+57.

Sunday, 2 July 2017

Computer Science Internationalization - Text Search

So, you have just written some Cool Code which will search for and find occurrences of specified text strings. You have access to Big Data text, e.g. all the text in all public webpages. You will, of course, want to test your Cool Code. Letʼs perform some, seemingly, very simple tests.

Letʼs search for the word 'Scorpion'. Your code works just fine and hence finds all occurrences of the word 'Scorpion'.

Now test with the following two words.

  • Scorpion
  • Scorpion

Your Cool Code works fine as all I have done is applied some CSS styling, thus giving each of the two words differing appearance.

Now test your Cool Code with the following two words.

  • 𝑆𝑐𝑜𝑟𝑝𝑖𝑜𝑛
  • 𝐒𝐜𝐨𝐫𝐩𝐢𝐨𝐧

If you have only programmed for ASCII text then your now not so Cool Code will fail. These two words have differing appearance because they are not made up of the ASCII characters you are familiar with. These words use characters from the Unicode Mathematical Alphanumeric Symbols block, U+1D400-1D7FF.
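A minimal sketch of the failure, assuming the Cool Code does a plain substring search. The function name usesMathAlphanumerics is mine; it tests against the full extent of the Mathematical Alphanumeric Symbols block, U+1D400 to U+1D7FF.

```javascript
// True if any code point of str lies in the Mathematical Alphanumeric
// Symbols block (U+1D400..U+1D7FF).
function usesMathAlphanumerics(str) {
  return Array.from(str).some((ch) => {
    const cp = ch.codePointAt(0);
    return cp >= 0x1d400 && cp <= 0x1d7ff;
  });
}

const bold = "𝐒𝐜𝐨𝐫𝐩𝐢𝐨𝐧";
console.log(bold.includes("Scorpion"));         // false
console.log(usesMathAlphanumerics(bold));       // true
console.log(usesMathAlphanumerics("Scorpion")); // false
```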

Should the Math Alphanumeric Symbols Scorpion be treated the same as the ASCII Scorpion with respect to the search results of your code? In this context I think "Yes", most definitely. A person reading this blog, for example, will just perceive the word Scorpion whatever characters are used to write the word. The reader may well also visualise the arachnid with a "sting in the tail"😱

What of current working practice?

With twitter, a user has no means of changing text style within a tweet. It has thus become common to use Unicode Math Alphanumeric Symbols to change appearance. I could, for example, use Unicode Math Alphanumeric Symbols to emphasise a word (eg Scorpion) or phrase within a tweet. The meaning of the tweet remains the same.

Google returns the same number of search results whichever of the above forms of Scorpion I use. At time of writing this is "About 144,000,000 results". I deduce Google is treating ASCII Scorpion and Unicode Math Alphanumeric Symbols 𝑆𝑐𝑜𝑟𝑝𝑖𝑜𝑛 & 𝐒𝐜𝐨𝐫𝐩𝐢𝐨𝐧 as equivalent.

Sogou 搜狗 is a Chinese search engine. Using Sogou: ASCII Scorpion returns 93,341 results, Math Alphanumeric Symbols 𝑆𝑐𝑜𝑟𝑝𝑖𝑜𝑛 returns 4,738, Math Alphanumeric Symbols 𝐒𝐜𝐨𝐫𝐩𝐢𝐨𝐧 returns 61. I think it evident that Sogou does not treat my three forms of Scorpion as equivalent.

I side with Google on this.

Here is a taster of what is happening in the behind-the-scenes technicalities of Unicode. Letʼs take just one of the Unicode Math Alphanumeric Symbols I have used, 𝐒 MATHEMATICAL BOLD CAPITAL S U+1D412. If you visit codepoints.net/U+1D412 you will see a wealth of information about this character. Of relevance to this blog is the Decomposition Mapping which is to the, oh so familiar, ASCII capital S (U+0053). This Unicode information can be used to compute string equivalents which can then be used for search, thus providing all relevant results.
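One way to apply the Decomposition Mapping in JavaScript is compatibility normalization (NFKC), which is built into the language as String.prototype.normalize. The following is a minimal sketch with function names of my own choosing. I am not claiming this is what Google does.

```javascript
// Fold compatibility variants (such as mathematical bold and italic
// letters) to their plain equivalents before comparing.
function searchKey(text) {
  return text.normalize("NFKC");
}

// Substring search that treats compatibility variants as equivalent.
function found(haystack, needle) {
  return searchKey(haystack).includes(searchKey(needle));
}

console.log(searchKey("𝐒𝐜𝐨𝐫𝐩𝐢𝐨𝐧"));               // Scorpion
console.log(found("Beware the 𝑆𝑐𝑜𝑟𝑝𝑖𝑜𝑛", "Scorpion")); // true
```

NFKC folds many other compatibility characters as well, so whether it is appropriate depends on the search context.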

The moral of this "Sting in the Tale" is: If you do not already know it, you must learn Unicode, it is essential.

Friday, 28 April 2017

Computer Science Internationalization - Hieroglyphs in Domain Names

I have been aware for a long time that domains such as .com support many human language scripts. Verisign's .com includes support for Hiragana, Gurmukhi, Han, Tibetan, Sinhala, Devanagari, Hangul and many more.

But what of Verisign's .com equivalents .コム (Japanese) and .닷컴 (Korean)? Both of these support a multitude of human language scripts. The supported scripts for many, but not all, Domains are listed in the IANA Repository of IDN Practices iana.org/domains/idn-tables.

Whilst browsing this repository, I discovered there are sixteen domains, all belonging to Verisign, which support Egyptian Hieroglyphs which I think is totally cool! Verisign's .com, .コム and .닷컴 all support Egyptian Hieroglyphs. This means one can register domain names such as:-

  1. 𓇋𓈖𓏏𓂝𓂋𓈖𓄿𓏏𓇋𓍯𓈖𓄿𓃭.com
  2. 𓇋𓈖𓏏𓂝𓂋𓈖𓄿𓏏𓇋𓍯𓈖𓄿𓃭.コム
  3. 𓇋𓈖𓏏𓂝𓂋𓈖𓄿𓏏𓇋𓍯𓈖𓄿𓃭.닷컴
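A quick way to inspect such a label in JavaScript is sketched below. The function name describeLabel is mine, and the check only looks at the Unicode script of each character. It says nothing about whether a registry would accept the name.

```javascript
// Describe a domain label: its size in code points and UTF-16 units, and
// whether every character belongs to the Egyptian Hieroglyphs script.
function describeLabel(label) {
  return {
    codePoints: Array.from(label).length,
    utf16Units: label.length,
    allHieroglyphs: /^\p{Script=Egyptian_Hieroglyphs}+$/u.test(label),
  };
}

console.log(describeLabel("𓇋𓈖𓏏𓂝𓂋𓈖𓄿𓏏𓇋𓍯𓈖𓄿𓃭"));
console.log(describeLabel("international").allHieroglyphs); // false
```

Egyptian Hieroglyphs lie outside the BMP, so each one takes two UTF-16 units and the label's length property is twice its number of hieroglyphs.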

It is possible you do not have an Egyptian Hieroglyph font on your device so here are the domain names in image format.

Google provide a free Egyptian Hieroglyph font which you can download from google.com/get/noto/

Does the Egyptian Hieroglyph string I have used above mean anything? It is actually a transliteration of the English word international. I used ngm.nationalgeographic.com/ngm/egypt/translator.html for the transliteration. The hieroglyphs translator presents the Egyptian Hieroglyphs as images. So, no simple copy and paste of Egyptian Hieroglyph text. I had to match with the appropriate Unicode characters by visual inspection. I cannot guarantee I made all the correct matches but I think I have them correct.

Here are some registered and live Egyptian Hieroglyph Domain Names egyptianhieroglyphic.com/egypt/egyptian-hieroglyphics/