Saturday, 6 October 2012

Weibo Character Count

Like every other microblogging system I have encountered, Sina Wēibó 新浪微博 is said to have a 140-character limit per post. This is not strictly accurate: the limit is variable and ranges from 70 to 280 characters, inclusive, depending on which characters the post contains. Different characters have different counts, as follows:
  1. Characters from Unicode range U+0000➜U+00FF have a count of 0.5
  2. Characters from Unicode range U+0100➜U+FFFF have a count of 1
  3. Characters from Unicode range ≥ U+010000 have a count of 2
Some of the consequences of these differing counts are:

  • If one writes in everyday English, one has up to 280 characters, because these will be Latin characters in the Unicode blocks Basic Latin and Latin-1 Supplement (U+0000➜U+00FF). The Latin script does, however, occur in several other Unicode blocks (en.wikipedia.org/wiki/Latin_characters_in_Unicode). Latin characters in blocks other than Basic Latin and Latin-1 Supplement have counts of 1 or 2, and using them reduces the 280 limit.
  • For a Chinese-only post, if all the Chinese characters used are in the Unicode Basic Multilingual Plane (BMP), the limit is the accepted 140 characters. Many Chinese characters lie outside the BMP, and because they have a count of 2, using them reduces the 140 limit. In the extreme case, the limit is 70 if every character used is a Chinese character outside the BMP.
  • In recent releases of OS X and iOS, Apple has incorporated Emoji characters (en.wikipedia.org/wiki/Emoji). The majority of these Emoji characters are outside the BMP (i.e. ≥ U+010000) and so have a count of 2.
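The 280, 140 and 70 figures in the bullets above can be reproduced with a little arithmetic. The following is a minimal sketch in Python, assuming that a post has a fixed budget of 140 count units (inferred from the 140 limit for characters with a count of 1). The names LIMIT, WEIGHTS and max_chars are made up for this illustration and are not part of any Weibo API.

```python
from fractions import Fraction

# Assumed total budget per post, in Weibo count units.
LIMIT = 140

# Per-character counts for the three Unicode ranges listed earlier.
WEIGHTS = {
    "U+0000 to U+00FF": Fraction(1, 2),
    "U+0100 to U+FFFF": Fraction(1),
    "U+010000 and above": Fraction(2),
}


def max_chars(weight, limit=LIMIT):
    """Largest number of characters of a single weight that fit in the budget."""
    return int(limit // weight)


for name, weight in WEIGHTS.items():
    print(name, "->", max_chars(weight))
```

A post that mixes ranges falls somewhere between 70 and 280 characters, depending on the proportions.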
Let's illustrate with a nonsensical post that contains characters from the three Unicode ranges above. The following text has a Weibo character count of 13.

  • one two 🀂一二三四五🀀
  • 0.5 + 0.5 + 0.5 + 0.5 + 0.5 + 0.5 + 0.5 + 0.5 + 2 + 1 + 1 + 1 + 1 + 1 + 2 = 13
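The same sum can be checked mechanically. This is a rough sketch in Python 3, not Weibo's own code: the function names char_count and weibo_count are hypothetical, and the thresholds are simply the three ranges described in this post. The two Mahjong tiles are written as escapes (U+1F002 and U+1F000) so the source survives editors that mishandle characters outside the BMP.

```python
def char_count(ch):
    """Count for one character, using the three ranges described in this post."""
    code_point = ord(ch)
    if code_point <= 0xFF:       # Basic Latin and Latin-1 Supplement
        return 0.5
    if code_point <= 0xFFFF:     # rest of the Basic Multilingual Plane
        return 1.0
    return 2.0                   # U+010000 and above


def weibo_count(text):
    """Sum of the per-character counts; Python 3 iterates a str by code point."""
    return sum(char_count(ch) for ch in text)


# The example post: "one two ", a Mahjong tile, five Chinese characters, a Mahjong tile.
example = "one two \U0001F002一二三四五\U0001F000"
print(weibo_count(example))  # 13.0
```

Note that this relies on Python 3 strings being sequences of code points. In a language whose strings are UTF-16 code units, such as Java or JavaScript, the characters outside the BMP would appear as surrogate pairs and the loop would have to iterate by code point instead.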