Subtitle: "How I Discovered the Undiscoverable!"
I was writing some demonstrator code for an Introductory JavaScript class. I intended the code to illustrate expected and unexpected behaviour of the string length property. Expected behaviour is when the value of the length property equals the number of human-perceived characters. Unexpected behaviour is when it does not.
"诺丁汉".length returns 3 (3 encoding units)
"ノッティンガム".length returns 7 (7 encoding units)
"노팅엄".length returns 3 (3 encoding units)
All good so far. These are answers that anyone would expect. Now letʼs try some Unicode Emoji.
"🐟".length returns 2 (2 encoding units)
"🐕".length returns 2 (2 encoding units)
...and some non-Emoji SMP (Supplementary Multilingual Plane) Unicode characters
"𓀌".length returns 2 (2 encoding units)
"🀤".length returns 2 (2 encoding units)
And now we observe some weirdness. In terms of human-perceived characters the answer should, of course, be 1, so for most people this behaviour is unexpected. It is not unexpected for me, as I know that the length property counts UTF-16 encoding units (formally, code units) rather than human-perceived characters. I have written the number of UTF-16 encoding units in brackets so that you can understand the answer the length property returns.
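To make the difference concrete, here is a minimal sketch that counts the same string three ways. The helper name countUnits is my own invention for illustration, and the grapheme count relies on Intl.Segmenter, which I am assuming is available in your JavaScript engine (the sketch reports null for that count when it is not).

```javascript
// Count a string three ways:
//   codeUnits  - what the length property reports (UTF-16 code units)
//   codePoints - Unicode code points (string iteration is code point aware)
//   graphemes  - human-perceived characters, if Intl.Segmenter is available
// countUnits is a hypothetical helper name, not part of any library.
function countUnits(str) {
  const codeUnits = str.length;
  const codePoints = Array.from(str).length;

  let graphemes = null;
  if (typeof Intl !== "undefined" && typeof Intl.Segmenter === "function") {
    const segmenter = new Intl.Segmenter("en", { granularity: "grapheme" });
    graphemes = Array.from(segmenter.segment(str)).length;
  }
  return { codeUnits, codePoints, graphemes };
}

console.log(countUnits("诺丁汉"));     // codeUnits: 3, codePoints: 3
console.log(countUnits("\u{1F41F}")); // 🐟 codeUnits: 2, codePoints: 1
```

For the fish, length reports 2 because U+1F41F lies outside the Basic Multilingual Plane and so needs a surrogate pair in UTF-16, whereas counting by code point gives the 1 that most people expect.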
Before we proceed I need to give you some background. I can write Chinese on a Computer, and Emoji can be selected by Chinese name using OSX Sierra's Simplified Pinyin Input Method. See schappo.blogspot.co.uk/2016/01/emoji-by-name.html
When I want Emoji I sometimes use OSX's Emoji and Symbol Viewer and sometimes select by Chinese name.
Now we come to the random bit. I typed yu in the Simplified Pinyin Input Method and there were 6 different Emoji to choose from. I chose 🌧️ . I had no reason to type yu nor to choose 🌧️ ; I was just experimenting. Now we come back to the length property.
"🌧️".length returns 3 (??????????!) [U+1F327 U+FE0F]
"🌦️".length returns 3 (??????????!) [U+1F326 U+FE0F]
It was most definitely not the answer I was expecting. After some 10 minutes of investigation I discovered the reason for this unexpected answer. With these two Emoji the variation selector U+FE0F (codepoints.net/U+FE0F) is appended to the base character. The base character takes 2 encoding units and the variation selector takes 1, thus giving a count of 3. We now have the answer to the length anomaly. But why do some Emoji have the variation selector appended and not others?
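This is roughly how the extra code point can be made visible. It is a small sketch, and listCodePoints is a name I have made up for illustration. The string is written with escapes so that the otherwise invisible U+FE0F cannot get lost in copy and paste.

```javascript
// List the code points of a string in U+XXXX notation.
// listCodePoints is a hypothetical helper name used only for this sketch.
function listCodePoints(str) {
  // Array.from iterates by code point, so a surrogate pair stays together.
  return Array.from(str, (ch) =>
    "U+" + ch.codePointAt(0).toString(16).toUpperCase().padStart(4, "0")
  );
}

const rainWithSelector = "\u{1F327}\uFE0F"; // 🌧 followed by variation selector 16

console.log(listCodePoints(rainWithSelector)); // [ 'U+1F327', 'U+FE0F' ]
console.log(rainWithSelector.length);          // 3 (2 for U+1F327, 1 for U+FE0F)
```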
Peter Edberg gives this excellent explanation.
This is about characters U+1F327,U+1F326
The variation selector FE0F is *not* unnecessary with these. Looking at unicode.org/Public/emoji/5.0/emoji-data.txt those characters do *not* have the Emoji-Presentation property set, and they do have variation sequences defined.
From unicode.org/reports/tr51/#Emoji_Variation_Selector_Notes, such singleton emoji characters “should have emoji presentation selectors on base characters with Emoji_Presentation=No whenever an emoji presentation is desired”
I stated: I see that U+1F321➜1F32C do not have the Emoji_Presentation property set.
Peter Edberg responded: From unicode.org/emoji/charts-5.0/emoji-versions-sources.html you can see that these characters came into Unicode as a result of their being in the Webdings/Wingdings set, where they had a prior history of being non-emoji text characters. That is why they have Emoji_Presentation=No by default.
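The Emoji_Presentation property that Peter describes can be queried from JavaScript using Unicode property escapes in a regular expression. The following is a sketch only: the function names are my own, the answers depend on the version of the Unicode data built into your JavaScript engine, and the helper is a simplification of the TR51 advice rather than a complete implementation of it.

```javascript
// True when the single character defaults to emoji presentation.
const hasEmojiPresentation = (ch) => /^\p{Emoji_Presentation}$/u.test(ch);

// True when the single character is an emoji character at all.
const isEmoji = (ch) => /^\p{Emoji}$/u.test(ch);

const EMOJI_SELECTOR = "\uFE0F"; // variation selector 16

// Hypothetical helper: append U+FE0F only where TR51 says it is needed,
// that is, for emoji characters with Emoji_Presentation=No.
function requestEmojiPresentation(base) {
  if (!isEmoji(base) || hasEmojiPresentation(base)) {
    return base;
  }
  return base + EMOJI_SELECTOR;
}

console.log(hasEmojiPresentation("\u{1F41F}")); // 🐟 true
console.log(hasEmojiPresentation("\u{1F327}")); // 🌧 false
console.log(requestEmojiPresentation("\u{1F41F}").length); // 2
console.log(requestEmojiPresentation("\u{1F327}").length); // 3
```

This lines up with the results above: the fish needs no selector and has a length of 2, while the rain cloud has the selector appended and so has a length of 3.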
Letʼs now examine my bold claim: "I discovered the Undiscoverable".
In order to make this discovery there is a set of required knowledge, skills and personality traits. These include:
- A Knowledge of JavaScript
- A good understanding of Unicode
- The ability to write Chinese using OSX Sierra's Pinyin Simplified Input Method
- Knowing that Emoji can be selected by Chinese name using Sierra's Pinyin Simplified Input Method
- Being aware of the JavaScript length property quirk
- A desire to experiment and explore
Considering that the World population is less than 8 billion (estimate), I think it (near) impossible that any other person (in Academia, Staff or Student) would, at the instant of time I made the discovery, meet the requirements necessary to make the same discovery. By instant of time I mean an instant as perceived by a person 〖 say less than one second. I need to research this!! 〗 because, of course, our thought process is not instant even though we experience it as such.
Window of opportunity for my discovery: I reason that the window of opportunity for the discovery started when 🌧️ became available. It was added to Unicode in version 7.0 in 2014. It would probably have been another year before it became available and integrated into Apple's OSX. I made the discovery on Saturday 21st October 2017: twitter.com/andreschappo/status/921722952504238081 Given this reasoning, the window of opportunity for this discovery is approximately 3 years counted from the Unicode 7.0 release, or nearer 2 years counted from its probable arrival in OSX.
Consequences: My first thought was that this anomaly would cause problems with Emoji Domains. Using mothereff.in/punycode, 🌧️.ws (with variation selector) gives the punycode address xn--v86c7044b.ws and 🌧.ws (without variation selector) gives the punycode address xn--kh8h.ws. So these are obviously different addresses. When 🌧️.ws is pasted into the Firefox address bar, Firefox needs to convert from the Unicode form 🌧️.ws to the punycode form. The punycode form it uses is xn--kh8h.ws. It is therefore evident that Firefox disregards the variation selector on conversion. Computers and Routers use the punycode form; the Unicode form is used for display to humans.
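The same conversion can be observed from JavaScript with the standard URL class. This is a sketch under assumptions: it shows what the WHATWG URL parser does in a current browser or Node.js, which I am assuming follows UTS #46 (where U+FE0F is mapped to nothing). It does not tell us how Firefox 56 was implemented internally. The helper name hostFor is my own.

```javascript
// Return the ASCII (punycode) hostname that the URL parser derives
// from a Unicode domain name. hostFor is a hypothetical helper name.
function hostFor(domain) {
  return new URL("http://" + domain).hostname;
}

const withSelector = hostFor("\u{1F327}\uFE0F.ws"); // 🌧 plus U+FE0F
const withoutSelector = hostFor("\u{1F327}.ws");    // 🌧 alone

console.log(withSelector);
console.log(withoutSelector);
console.log(withSelector === withoutSelector); // true when U+FE0F is ignored
```

If both lines print xn--kh8h.ws, then the parser has disregarded the variation selector in the same way that I observed in the Firefox address bar.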
I realise that Emoji Domains are disallowed by IDNA2008, but I figure they will be around for a good number of years yet to come.
Why was this my first thought? I am a long time practitioner of internationalised Computer Science teaching, and Internationalised Domain Names (IDNs) are an important part of i18n. Emoji Domains are a controversial subset of IDNs. I am an active member of the UASG discussion email list uasg.tech. I am also an active member of IDN Forums idnforums.com. I have learned much from the Domainers on IDN Forums. Thanks guys 👋
Environment: OSX Sierra version 10.12.6, Firefox version 56.0.2