André 小山 Schappo: April 2014

Monday, 14 April 2014

Australian Universities on Weibo

There are quite a number of Australian Universities on Sina Weibo 新浪微博. Below I list those I have found. I only include Australian Universities with verified Weibo accounts (shown by the big blue V after the username). The text in square brackets is the username on Weibo.

Australian Catholic University [@ACUInternational] weibo.com/acuinternational
Charles Darwin University [@查尔斯达尔文大学] weibo.com/charlesdarwinuni
Curtin University [@科廷大学CurtinUniversity] weibo.com/CurtinWestAustralia
Deakin University [@澳大利亚迪肯大学] weibo.com/deakinuniversity
Federation University [@澳大利亚联邦大学FedUni] weibo.com/FedUniAustralia
Flinders University [@FlindersUni弗林德斯大学] weibo.com/flinders2011
La Trobe University [@澳大利亚拉筹伯大学] weibo.com/latrobeuniaus
Macquarie University [@澳大利亚麦考瑞大学] weibo.com/mquni
Monash University [@MonashUni澳大利亚蒙纳士大学] weibo.com/monashuniversityaust
Queensland University of Technology [@QUT昆士兰科技大学] weibo.com/qutbrisbane
Southern Cross University [@澳大利亚南十字星大学] weibo.com/scuchina
Swinburne University of Technology [@澳洲斯威本科技大学] weibo.com/swinburneuniversity
University of Adelaide [@澳大利亚阿德莱德大学] weibo.com/uniadelaide
University of Canberra [@堪培拉大学] weibo.com/unicanberra
University of Melbourne [@墨尔本大学官微] weibo.com/melbourneuni
University of New South Wales [@澳洲新南威尔士大学] weibo.com/ozunsw
University of Queensland [@昆士兰大学] weibo.com/myuq
University of South Australia [@南澳大学官方微博] weibo.com/studyatunisa
University of Southern Queensland [@澳大利亚南昆士兰大学] weibo.com/usqchina
University of Western Sydney [@西悉尼大学UWS] weibo.com/uwsinternational
University of Wollongong [@澳大利亚卧龙岗大学UOW] weibo.com/uowaustralia

Friday, 11 April 2014

Regular Expressions

Regular Expressions are not just about ASCII. They are (or should be) about Unicode, with ASCII being a very small subset of Unicode. The vast majority of Regular Expressions documentation and tutorials I have seen deal only with ASCII. The consequence is that many, perhaps most, readers will never consider non-ASCII text strings.

If one considers Unicode text strings then one can process text strings consisting of non-Latin Scripts and Symbols. Scripts such as: Cyrillic, Devanagari, Tamil, Georgian, Cherokee, Chinese and Sinhala. Symbols such as: Currency, Arrows, Mathematical Operators, Mahjong Tiles and Playing Cards. Unicode has a repertoire of over 100,000 characters, all of which can be processed with Regular Expressions.

Mostly, Regular Expressions are no different when using Unicode as compared to using the very limited ASCII. I will give some simple examples using Hangul, which is the Script used for writing Korean. The Hangul characters I will be using in the examples below are in Unicode block Hangul Syllables U+AC00-D7AF. I will intersperse other Unicode characters in my examples below. I present the examples in the form of a terminal session transcript.

苹果电脑 ~: egrep '바나나'
abcdef
abc바나나def
abc바나나def

苹果电脑 ~: egrep '바.나.나'
바诺丁汉나拉夫堡나
바拉나夫나堡
바拉나夫나堡

苹果电脑 ~: egrep '[바나다]'
abcdef
보노도고로
ДЖԶख나ખ༁
ДЖԶख나ખ༁

苹果电脑 ~: egrep '[가-힣]'
abcdef
abc현def
abc현def

苹果电脑 ~: egrep '^[ 가-힣]+$'
abcdef
abc서울def
서울은 아름답다
서울은 아름답다

Where you see a line duplicated, that line matched the Regular Expression: egrep echoes each matching line back after it is typed. I have used egrep on OS X.
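For readers without egrep to hand, the same five patterns can be tried in Python. This is only a minimal sketch, not a replacement for the transcript: it assumes Python 3 (where str patterns are Unicode by default) and the standard re module, and the helper name matching_lines is my own invention for illustration. It also assumes the text is in precomposed form, as typed above. The range [가-힣] works because 가 (U+AC00) and 힣 (U+D7A3) are the first and last assigned characters of the Hangul Syllables block.

```python
import re

# Each entry pairs a pattern from the transcript with the lines typed after it.
cases = [
    ("바나나", ["abcdef", "abc바나나def"]),
    ("바.나.나", ["바诺丁汉나拉夫堡나", "바拉나夫나堡"]),
    ("[바나다]", ["abcdef", "보노도고로", "ДЖԶख나ખ༁"]),
    ("[가-힣]", ["abcdef", "abc현def"]),
    ("^[ 가-힣]+$", ["abcdef", "abc서울def", "서울은 아름답다"]),
]


def matching_lines(pattern, lines):
    """Return the lines containing a match, i.e. the lines egrep would echo.

    matching_lines is a hypothetical helper name, not part of any library.
    """
    compiled = re.compile(pattern)
    return [line for line in lines if compiled.search(line)]


for pattern, lines in cases:
    print(pattern, "->", matching_lines(pattern, lines))
```

Each printed list should contain exactly the lines that appear duplicated in the transcript.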

The transcript may look a bit odd because of the variety and unfamiliarity of the Unicode characters I have used. If, though, you carefully examine the above Regular Expressions you will see they have standard syntax and are actually elementary constructs. So, if you teach Regular Expressions, why not give your students an insight into processing Unicode strings and not just ASCII strings? Or, to put it another way, give your students an insight into processing multi-language strings and not just English strings. Or, to put it yet another way, code for the whole world and not just the English-speaking world.

BTW: 苹果电脑 ~: is the prompt I set up for my iMac. The first four characters are Chinese for Apple Computer.

In the examples above, I have deliberately used one of the standard and common Regular Expression engines, accessed via egrep. This is the type of engine you will most likely encounter. Much less common are the Regular Expression engines that have been extended with features specifically for Unicode. Such extensions, for instance, make it possible to match Unicode characters by property, e.g. \p{Hangul} will match any character belonging to the Hangul Script. More information on such engines is available at regular-expressions.info/unicode.html and unicode.org/reports/tr18/
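To give a feel for what a property such as \p{Hangul} does, here is a rough sketch in Python. Python's standard re module does not support \p{...}, so this is not property matching proper: it approximates the Hangul Script by checking whether a character's Unicode name contains the word HANGUL, using the standard unicodedata module. The function names is_hangul and hangul_chars are hypothetical, and the name test is an assumption that will not agree with the real Script property for every character.

```python
import unicodedata


def is_hangul(ch):
    """Rough stand-in for \\p{Hangul}: does the Unicode name mention HANGUL?

    This is an approximation based on character names, not the Script property.
    """
    return "HANGUL" in unicodedata.name(ch, "")


def hangul_chars(text):
    """Return the characters of text that pass the is_hangul test, in order."""
    return [ch for ch in text if is_hangul(ch)]


print(hangul_chars("ДЖԶख나ખ༁"))
print(hangul_chars("abc서울def"))
```

An engine with genuine Unicode property support would do this inside the Regular Expression itself, so a pattern such as \p{Hangul}+ could replace the explicit range [가-힣]+ and would also cover Hangul characters outside the Hangul Syllables block.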