Friday, 5 January 2018

Computer Science Internationalization - Composed v Decomposed Text

My given name, André, has a diacritic: an acute accent over the letter e. In Unicode there are two ways of constructing and storing é. Either as a single composed character, U+00E9 LATIN SMALL LETTER E WITH ACUTE, or decomposed as two code points, e U+0065 LATIN SMALL LETTER E followed by ´ U+0301 COMBINING ACUTE ACCENT. Processes can and do convert between composed and decomposed forms. How can we know which of these forms is being stored?
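One way to answer that question in code is to list the code points of a string. The following is a minimal sketch in Python using the standard unicodedata module; the helper name describe and the two test strings are my own illustration, and the accented letter is written with escape sequences so that the form is unambiguous whatever an editor or browser does to the text.

```python
import unicodedata


def describe(text):
    """Return one 'U+XXXX NAME' string per code point in text."""
    return [f"U+{ord(ch):04X} {unicodedata.name(ch, '<unnamed>')}" for ch in text]


# Escapes are used so the stored form is unambiguous.
composed = "Andr\u00e9"      # é as a single code point
decomposed = "Andre\u0301"   # e followed by a combining acute accent

for label, text in (("composed", composed), ("decomposed", decomposed)):
    print(f"{label}: {len(text)} code points")
    for line in describe(text):
        print("   ", line)
```

The two strings look identical on screen, but the composed one has five code points and the decomposed one has six.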

Letʼs take the case of pasting into a browser web form. I will use André as my test data and I will make use of Richard Ishidaʼs Uniview. To use Uniview, paste text into the largish white textbox and then click the white down arrow. You will then see information about all the characters in the textbox. Pasting André into Uniview gives the following results:

  1. Chrome — é is decomposed
  2. Firefox — é is decomposed
  3. Safari — é is composed

The text André in this blog article is in decomposed form, except where I indicate otherwise. In my rather limited tests, Chrome and Firefox do not convert the text, so if the text starts as decomposed it arrives in Uniview as decomposed. Safari, on the other hand, converts decomposed text to composed.

Letʼs now go back a step. The copy operation copies text to the clipboard and the paste operation takes text from the clipboard. Can we determine which form is in the clipboard? Yes, we can, and here is one way of doing it.

We are now going to use the Terminal app. Typing the command pbpaste | hexdump -C will show the contents of the clipboard at byte level. Copying André and running that command will give 41 6e 64 72 65 cc 81. This is André in the UTF-8 encoding, where 41 = A; 6e = n; 64 = d; 72 = r; 65 = e; cc 81 = combining acute accent. If we copy the composed form of André (composed here: André) we get 41 6e 64 72 c3 a9, where c3 a9 = composed é.
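The same byte sequences can be reproduced without the clipboard. This is a small Python sketch, assuming nothing beyond the standard library; utf8_hex is a helper name I made up for the illustration, and the strings use escape sequences to pin down each form.

```python
def utf8_hex(text):
    """Return the UTF-8 bytes of text as space-separated hex pairs."""
    return " ".join(f"{b:02x}" for b in text.encode("utf-8"))


decomposed = "Andre\u0301"   # e + U+0301 COMBINING ACUTE ACCENT
composed = "Andr\u00e9"      # U+00E9 LATIN SMALL LETTER E WITH ACUTE

print(utf8_hex(decomposed))  # 41 6e 64 72 65 cc 81
print(utf8_hex(composed))    # 41 6e 64 72 c3 a9
```

The decomposed form takes seven bytes and the composed form six, which matches the hexdump output above.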

Conclusion: Itʼs complicated! Not being aware of these different forms of text can lead to bugs in code that are very difficult to find. Given different combinations of apps, versions and processes, the results may well be different. One certainty, though, is that one needs to have a good understanding of Unicode.
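To give one concrete example of such a bug: a plain string comparison treats the two forms as different text. A common defence is to normalize both sides to the same form before comparing. The sketch below uses Python's unicodedata.normalize with NFC (composed) and NFD (decomposed); the function name same_text is hypothetical, and I am not claiming this is what any of the browsers above do internally.

```python
import unicodedata


def same_text(a, b):
    """Compare two strings after normalizing both to NFC (composed form)."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


composed = "Andr\u00e9"      # single code point é
decomposed = "Andre\u0301"   # e + combining acute accent

print(composed == decomposed)           # False: different code points
print(same_text(composed, decomposed))  # True: equal once normalized

# Converting explicitly between the two forms:
print(unicodedata.normalize("NFC", decomposed) == composed)   # True
print(unicodedata.normalize("NFD", composed) == decomposed)   # True
```

Which form to normalize to matters less than applying the same one consistently wherever text is stored or compared.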

Environment: OSX High Sierra 10.13.2, Chrome 63.0.3239.84, Firefox 57.0.4, Safari 11.0.2