Saturday, 20 October 2012

Twitter Character Count

【Update: I do not know when it happened, but Twitter no longer differentiates between BMP and non-BMP characters with regard to character count. All characters now have a count of 1. I may, at some stage, delete this article but, for the time being, I will leave it here as an historical record of the evolution of Twitter.】

In a previous article I examined the Sina Wēibó 新浪微博 character count for a user post: schappo.blogspot.co.uk/2012/10/weibo-character-count.html. Let's now examine Twitter. The stated and generally understood limit is 140 characters for a tweet. This is not strictly true. The actual tweet limit is variable and ranges from 70 to 140 characters, inclusive, depending on which characters are used. Different characters have different counts, as follows:

  • Characters from Unicode range U+0000➜U+FFFF have a count of 1
  • Characters from Unicode range ≥ U+010000 have a count of 2
Or, to put it another way: characters in the Basic Multilingual Plane (BMP) have a count of 1 and characters in the other planes have a count of 2. The 2 Mahjong Tile characters used in the example below are from the Supplementary Multilingual Plane (SMP).

Let's illustrate with a made-up posting that contains characters from the 2 Unicode ranges above. The following text has a tweet character count of 17.
  • one two 🀂一二三四五🀀
  • 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 2 + 1 + 1 + 1 + 1 + 1 + 2 = 17
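
The rule above can be sketched in a few lines of Python. This is only an illustration of the counting rule as described in this article, not Twitter's actual implementation; the function name tweet_count is made up for the example, and the sample text is written with escapes for the two Mahjong Tile characters (assumed here to be U+1F002 and U+1F000).

```python
def tweet_count(text: str) -> int:
    """Count a post with the rule described above (a sketch, not Twitter's code)."""
    total = 0
    for ch in text:
        # Python iterates over code points; anything above U+FFFF is outside the BMP.
        total += 2 if ord(ch) > 0xFFFF else 1
    return total


# "one two " + Mahjong tile U+1F002 + 一二三四五 + Mahjong tile U+1F000
sample = "one two \U0001F002一二三四五\U0001F000"
print(tweet_count(sample))  # 17
```

Under this rule, 140 BMP characters and 70 non-BMP characters both reach a count of 140, which is where the 70 to 140 range comes from.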

Saturday, 6 October 2012

Weibo Character Count

Like all the other microblog systems I have encountered, Sina Wēibó 新浪微博 has a stated limit of 140 characters for a user post. This is not strictly accurate. The actual limit is variable and ranges from 70 to 280 characters, inclusive, depending on which characters are included. Different characters have different counts, as follows:
  1. Characters from Unicode range U+0000➜U+00FF have a count of 0.5
  2. Characters from Unicode range U+0100➜U+FFFF have a count of 1
  3. Characters from Unicode range ≥ U+010000 have a count of 2
Some of the consequences of these differing counts are:

  • If one writes in everyday English then one has up to 280 characters, as these will be Latin characters in the Unicode blocks Basic Latin and Latin-1 Supplement (U+0000➜U+00FF). The Latin script does, though, occur in several other Unicode blocks: en.wikipedia.org/wiki/Latin_characters_in_Unicode. Latin characters in blocks other than Basic Latin and Latin-1 Supplement have counts of 1 or 2, and using them reduces the 280 limit.
  • For a Chinese-only post, if all the Chinese characters used are in the Unicode Basic Multilingual Plane (BMP), the limit is the accepted 140 characters. There are many Chinese characters outside the BMP and, because they have a count of 2, using them reduces the 140 limit. The extreme case is a limit of 70, when every character used is a Chinese character outside the BMP.
  • In recent releases of OS X and iOS, Apple incorporated Emoji characters: en.wikipedia.org/wiki/Emoji. The majority of these Emoji characters are outside the BMP (i.e. ≥ U+010000) and so have a count of 2.
Let's illustrate with a nonsensical posting that contains characters from the 3 Unicode ranges above. The following text has a Weibo character count of 13.

  • one two 🀂一二三四五🀀
  • 0.5 + 0.5 + 0.5 + 0.5 + 0.5 + 0.5 + 0.5 + 0.5 + 2 + 1 + 1 + 1 + 1 + 1 + 2 = 13
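
The three ranges can be sketched in Python as follows. This is only an illustration of the counting rule as observed in this article, not Weibo's actual code; the function name weibo_count is made up for the example, and the two Mahjong Tile characters are assumed to be U+1F002 and U+1F000.

```python
def weibo_count(text: str) -> float:
    """Count a post with the three-range rule described above (a sketch only)."""
    total = 0.0
    for ch in text:
        cp = ord(ch)  # Unicode code point of the character
        if cp <= 0xFF:
            total += 0.5  # Basic Latin and Latin-1 Supplement
        elif cp <= 0xFFFF:
            total += 1    # rest of the BMP
        else:
            total += 2    # outside the BMP
    return total


# "one two " + Mahjong tile U+1F002 + 一二三四五 + Mahjong tile U+1F000
sample = "one two \U0001F002一二三四五\U0001F000"
print(weibo_count(sample))  # 13.0

# Each of these fills the 140 budget exactly, giving the 280, 140 and 70 limits.
print(weibo_count("a" * 280))           # 140.0
print(weibo_count("一" * 140))           # 140.0
print(weibo_count("\U0001F000" * 70))   # 140.0
```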