Friday, September 20, 2019

A little complication: Unicode code point order


With the countdown to the official Unicode migration ending on the first of this October, many people, particularly among Facebook users, may in fact be indifferent to the occasion. They might be simple folks who think Zawgyi is just fine, or rebellious ones who would argue “Why fix it, if it ain’t broke?” Time is short, and we just have to wait and see how many converts the establishment, celebrity Unicode ambassadors, academics, professionals, ISPs, NGOs, and social media itself can win over.
IMHO, migrating to Unicode would take less effort than changing from the FPS system to the metric system. For example, I still have to mentally convert one meter to a little over one yard to get a feeling for how long it is! The same goes for a kilogram or a kilometer. Unicode is easier because once you have the appropriate Unicode keyboard application and Unicode fonts installed, you see exactly the same text as you would have typed with Zawgyi. Besides, I was told that the keyboard layout is not that different from the one for Zawgyi or earlier Myanmar fonts!
The following is an example of typing Myanmar text with the KeyMagic/Keyman keyboard and the Pyidaungsu Myanmar Unicode font. It is taken from the typing example in the User Manual available here.
I typed the example text exactly as shown in the User Manual with the KeyMagic keyboard into Notepad, copied it, and pasted it into the code fragment below:
pyidaungsu_input_example <- c("ကျူ ", "ကျောင်း", "ကျွမ်း", "ကြီး", "မြို့ ", "ကြွေး", "ငြှိုး", "စက္ကူ", "လိမ္မော် ", "လိမ္မော် ", "အင်္ဂါ", "အင်္ကျီ", "သင်္ဘော", "ဥက္ကဋ္ဌ", "ဂုဏဝုဍ္ဎိ", "သဏ္ဌာန် ", "ဘဏ္ဍာ", "ဣန္ဒြေ", "ဣန္ဒြေ")
pyidaungsu_input_example
 [1] "က\u103bူ "              "က\u103bောင\u103aး"     "က\u103b\u103dမ\u103aး"
 [4] "က\u103cီး"              "မ\u103cို့ "              "က\u103c\u103dေး"      
 [7] "င\u103c\u103eိုး"        "စက္ကူ"                   "လိမ္မော\u103a "         
[10] "လိမ္မော\u103a "          "အင\u103a္ဂ\u102b"       "အင\u103a္က\u103bီ"      
[13] "သင\u103a္ဘော"           "ဥက္ကဋ္ဌ"                 "ဂုဏဝုဍ္ဎိ"                
[16] "သဏ္ဌာန\u103a "          "ဘဏ္ဍာ"                  "ဣန္ဒ\u103cေ"           
[19] "ဣန္ဒ\u103cေ"           
cat(pyidaungsu_input_example)
ကျူ  ကျောင်း ကျွမ်း ကြီး မြို့  ကြွေး ငြှိုး စက္ကူ လိမ္မော်  လိမ္မော်  အင်္ဂါ အင်္ကျီ သင်္ဘော ဥက္ကဋ္ဌ ဂုဏဝုဍ္ဎိ သဏ္ဌာန်  ဘဏ္ဍာ ဣန္ဒြေ ဣန္ဒြေ
To see how these syllables/words are stored in the computer, and the order of the code points that make them up, we split each element of the character vector above into characters (Unicode code points):
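library(stringr)  # attached here for the %>% pipe, which stringr re-exports from magrittr
# Split each string into single characters, then rejoin them with spaces in between
x <- strsplit(pyidaungsu_input_example, split="") %>%
  sapply(.,paste0,collapse=" ")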
x
 [1] "က \u103b ူ  "                "က \u103b ေ ာ င \u103a း"   
 [3] "က \u103b \u103d မ \u103a း" "က \u103c ီ း"               
 [5] "မ \u103c ိ ု ့  "              "က \u103c \u103d ေ း"       
 [7] "င \u103c \u103e ိ ု း"        "စ က ္ က ူ"                   
 [9] "လ ိ မ ္ မ ေ ာ \u103a  "       "လ ိ မ ္ မ ေ ာ \u103a  "      
[11] "အ င \u103a ္ ဂ \u102b"       "အ င \u103a ္ က \u103b ီ"     
[13] "သ င \u103a ္ ဘ ေ ာ"          "ဥ က ္ က ဋ ္ ဌ"               
[15] "ဂ ု ဏ ဝ ု ဍ ္ ဎ ိ"              "သ ဏ ္ ဌ ာ န \u103a  "       
[17] "ဘ ဏ ္ ဍ ာ"                   "ဣ န ္ ဒ \u103c ေ"           
[19] "ဣ န ္ ဒ \u103c ေ"           
utf8::utf8_print(x)
 [1] "က ျ ူ  "         "က ျ ေ ာ င ် း"   "က ျ ွ မ ် း"      "က ြ ီ း"        
 [5] "မ ြ ိ ု ့  "       "က ြ ွ ေ း"       "င ြ ှ ိ ု း"       "စ က ္ က ူ"       
 [9] "လ ိ မ ္ မ ေ ာ ်  " "လ ိ မ ္ မ ေ ာ ်  " "အ င ် ္ ဂ ါ"      "အ င ် ္ က ျ ီ"    
[13] "သ င ် ္ ဘ ေ ာ"    "ဥ က ္ က ဋ ္ ဌ"    "ဂ ု ဏ ဝ ု ဍ ္ ဎ ိ"  "သ ဏ ္ ဌ ာ န ်  " 
[17] "ဘ ဏ ္ ဍ ာ"       "ဣ န ္ ဒ ြ ေ"     "ဣ န ္ ဒ ြ ေ"    
cat(x)
က ျ ူ   က ျ ေ ာ င ် း က ျ ွ မ ် း က ြ ီ း မ ြ ိ ု ့   က ြ ွ ေ း င ြ ှ ိ ု း စ က ္ က ူ လ ိ မ ္ မ ေ ာ ်   လ ိ မ ္ မ ေ ာ ်   အ င ် ္ ဂ ါ အ င ် ္ က ျ ီ သ င ် ္ ဘ ေ ာ ဥ က ္ က ဋ ္ ဌ ဂ ု ဏ ဝ ု ဍ ္ ဎ ိ သ ဏ ္ ဌ ာ န ်   ဘ ဏ ္ ဍ ာ ဣ န ္ ဒ ြ ေ ဣ န ္ ဒ ြ ေ
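To look at the numeric values behind those characters, the code points of a single syllable can be printed in the usual U+XXXX notation. The following is a minimal sketch in base R; the helper name `code_points` is made up for this illustration and does not come from the User Manual or from any package.

```r
# Hypothetical helper (not from the original post): list the code points
# of one string in U+XXXX notation, in the order they are stored.
code_points <- function(s) {
  sprintf("U+%04X", utf8ToInt(enc2utf8(s)))
}

# The syllable "ကျောင်း", written with \u escapes so that the script
# does not depend on the encoding of the source file
kyaung <- "\u1000\u103b\u1031\u102c\u1004\u103a\u1038"

code_points(kyaung)
```

For “ကျောင်း” this gives U+1000 U+103B U+1031 U+102C U+1004 U+103A U+1038, so the vowel sign U+1031 sits in third place, after the consonant and the medial.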
When a Myanmar Unicode syllable is stored in the computer, its code points have to be in a fixed order so that there is no ambiguity about the syllable they represent. Generally, the order is the same as the one we learn for spelling in primary school. However, in “ကျောင်း” the code point order is “က ျ ေ ာ င ် း”. Here we see that the vowel “ေ” is not placed first, as it is when we say the spelling aloud.
That is because we need to follow the code point order required by Unicode, in which the components come in this order:
<consonant><medial consonant><vowel><final consonant><tone>.
In a given syllable the consonant is always present, while each of the other components may be present or absent.
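The syllable “ကျောင်း” is made up of the following components: the consonant “က”, the medial consonant “ျ”, the vowel “ေ ာ”, the final consonant “င ်”, and the tone mark “း”.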
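As a rough check of the component order described above, each code point of a syllable can be labelled with its component class and the classes compared against the expected sequence. This is only a sketch under simplifying assumptions: the code point ranges in `classify` cover the simple syllables of this example and ignore kinzi, stacked consonants, and independent vowels, and all function names are hypothetical.

```r
# Assumed, simplified classification of Myanmar code points (illustration only)
classify <- function(cp) {
  if (cp >= 0x1000 && cp <= 0x1021) "consonant"
  else if (cp >= 0x103B && cp <= 0x103E) "medial"
  else if (cp >= 0x102B && cp <= 0x1032) "vowel"
  else if (cp == 0x103A) "asat"
  else if (cp == 0x1037 || cp == 0x1038) "tone"
  else "other"
}

# Label every code point of one syllable with its component class
label_components <- function(s) {
  cps <- utf8ToInt(enc2utf8(s))
  cls <- vapply(cps, classify, character(1))
  # A consonant directly followed by asat (U+103A) acts as the final consonant
  for (i in seq_len(max(length(cls) - 1, 0))) {
    if (cls[i] == "consonant" && cls[i + 1] == "asat") {
      cls[i] <- "final"
      cls[i + 1] <- "final"
    }
  }
  data.frame(char = intToUtf8(cps, multiple = TRUE), class = cls,
             stringsAsFactors = FALSE)
}

# TRUE when the classes appear as consonant, medial, vowel, final, tone;
# NA when the syllable contains something this sketch does not cover
follows_order <- function(s) {
  rank_of <- c(consonant = 1, medial = 2, vowel = 3, final = 4, tone = 5)
  r <- rank_of[label_components(s)$class]
  if (anyNA(r)) return(NA)
  !is.unsorted(r)
}

kyaung <- "\u1000\u103b\u1031\u102c\u1004\u103a\u1038"  # "ကျောင်း"
label_components(kyaung)
follows_order(kyaung)
# The same syllable with the vowel sign U+1031 stored first
follows_order("\u1031\u1000\u103b\u102c\u1004\u103a\u1038")
```

The first `follows_order` call returns TRUE and the second returns FALSE, because in the second string the vowel sign comes before the consonant.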
For more detailed information on code point order and for Myanmar Unicode in general, you may like to look at Representing Myanmar in Unicode: Details and Examples Version 4, or Myanmar Script Notes on GitHub, among others.
All this looks rather complicated, and talking about it may discourage people, though unjustifiably. Nevertheless, people with an “It ain’t broke” mentality and people with contempt for the “Unicode bandwagon” should look around and get more information; after that, it’s up to them.
On the other hand, Myanmar Unicode font developers have greatly eased text typing by developing intelligent keyboard applications, so that you won’t have to think too much about code point order and other complications!
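To give a feeling for what such reordering could look like, here is a toy sketch in base R. It is not how KeyMagic or Keyman actually works. It simply assumes that the vowel sign “ေ” (U+1031) was typed first, as it appears visually, and moves it behind the consonant and any medials. The function name `reorder_e_vowel` is hypothetical.

```r
# Toy illustration only: NOT the actual rules used by KeyMagic or Keyman.
# Assumes the typed string starts with the vowel sign U+1031, followed by
# a consonant and, optionally, medial signs (U+103B to U+103E).
reorder_e_vowel <- function(typed) {
  cps <- utf8ToInt(enc2utf8(typed))
  # Nothing to do unless the vowel sign was typed first
  if (length(cps) < 2 || cps[1] != 0x1031) return(typed)
  rest <- cps[-1]
  # k counts the consonant plus any medial signs that follow it
  k <- 1
  while (k < length(rest) && rest[k + 1] >= 0x103B && rest[k + 1] <= 0x103E) {
    k <- k + 1
  }
  # Put the vowel sign right after the consonant and its medials
  intToUtf8(c(rest[seq_len(k)], 0x1031, rest[-seq_len(k)]))
}

typed  <- "\u1031\u1000\u103b\u102c\u1004\u103a\u1038"  # vowel sign typed first
stored <- reorder_e_vowel(typed)
identical(stored, "\u1000\u103b\u1031\u102c\u1004\u103a\u1038")
```

The last line returns TRUE: the reordered string has the same code point order as “ကျောင်း” in the example above.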
