Bayanathi Technology: Myanmar vowels and Unicode

So far, I have been playing with Myanmar language text using some NLP-related packages in the R environment. Apart from being a native speaker of that language, I have no more than a superficial knowledge of this or any other language. And I guess what I’ve been doing so far is following the “bag of words” approach in NLP!

The bag-of-words model is a simplifying representation used in natural language processing and information retrieval (IR). In this model, a text (such as a sentence or a document) is represented as the bag (multiset) of its words, disregarding grammar and even word order but keeping multiplicity.

Bag-of-words model, Wikipedia: https://en.wikipedia.org/wiki/Bag-of-words_model
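As a small illustration of that definition, here is a minimal sketch in base R. It uses an English sentence split on spaces (Myanmar text is not written with spaces between words, so it would need a proper word segmenter first), and the helper name bag_of_words is made up for this example.

```r
# Hypothetical helper: count how many times each word occurs.
bag_of_words <- function(words) table(words)

sentence <- "the cat saw the other cat"
words <- strsplit(sentence, " ", fixed = TRUE)[[1]]

bag <- bag_of_words(words)
print(bag)

# Word order is ignored: the reversed sentence gives the same counts.
reversed_bag <- bag_of_words(rev(words))
print(all(bag == reversed_bag))
```

The counts keep the multiplicity ("the" and "cat" each occur twice), while the position of each word in the sentence is thrown away.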

Then I thought I should try to learn a bit about the Myanmar language and how Myanmar script is encoded in Unicode. First, I got the set of “Myanmar Language Dictionary (Abridged)” by the Myanmar Language Commission. It is useful for finding out what a Myanmar word really means (even for some common words), or how it is properly pronounced. That could help even us native speakers of the Myanmar language. I was really surprised when I heard some of us say “ချဉ်းကပ်” as “ချည်းကပ်”. I have even heard this from a native Myanmar speaker broadcasting on a reputable broadcasting service outside Myanmar!

When I used this dictionary, one difficulty was that, having looked up one word, I couldn’t easily work out where to find another. I didn’t know whether I had to leaf backwards or forwards to reach the next word! This is because we have to take into account the order of the consonants as well as the vowels and other symbols in the process.

Well, even the gurus acknowledged this difficulty in the first volume of the “Myanmar Language Dictionary (Abridged)” by the Myanmar Language Commission, 1978:

Seventy or so years ago, when I started learning to read and write in primary school, we were taught the basic syllable rhymes:

က ကာ ကိ ကီ ကု ကူ ကေ ကဲ ကော့ ကော် ကံ ကား … အ အာ အိ အီ အု အူ အေ အဲ အော့ အော် အံ အား

Now, in the Myanmar Language Dictionary, I learned that these 12 basic syllable rhymes are expandable into 22, and so I tried spelling them out in R. The Unicode characters for Myanmar script are denoted by hexadecimal code points: U+1000 for “က”, U+1001 for “ခ”, and so on. To input the character “က” into R, for example, you use the escape “\u1000” in a string.

# The 22 syllable rhymes, each built on the base letter "အ" (U+1021)
myVowel22 <- c(
  "\u1021", "\u1021\u102c", "\u1021\u102c\u1038",
  "\u1021\u102d", "\u1021\u102e", "\u1021\u102e\u1038",
  "\u1021\u102f", "\u1021\u1030", "\u1021\u1030\u1038",
  "\u1021\u1031", "\u1021\u1031\u1037", "\u1021\u1031\u1038",
  "\u1021\u1032", "\u1021\u1032\u1037",
  "\u1021\u1031\u102c", "\u1021\u1031\u102c\u1037", "\u1021\u1031\u102c\u103a",
  "\u1021\u1036", "\u1021\u1036\u1037",
  "\u1021\u102d\u102f", "\u1021\u102d\u102f\u1037", "\u1021\u102d\u102f\u1038"
)
utf8::utf8_print(myVowel22)

 [1] "အ"   "အာ"  "အား" "အိ"   "အီ"   "အီး"  "အု"   "အူ"   "အူး"  "အေ"  "အေ့"  "အေး" "အဲ"  
[14] "အဲ့"   "အော" "အော့" "အော်" "အံ"   "အံ့"   "အို"   "အို့"   "အိုး"
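The code-point notation described above can be checked from inside R. The following is a minimal sketch using only base R functions (utf8ToInt, intToUtf8, sprintf); the helper name code_points is my own invention for the example.

```r
# "က" is U+1000; utf8ToInt() returns its code point as an integer.
ka <- "\u1000"
print(utf8ToInt(ka) == 0x1000)

# intToUtf8() goes the other way: U+1001 is "ခ".
kha <- intToUtf8(0x1001)
print(identical(kha, "\u1001"))

# Hypothetical helper: list a syllable's code points in U+XXXX notation.
code_points <- function(s) sprintf("U+%04X", utf8ToInt(s))

aw <- "\u1021\u1031\u102c\u103a"   # the syllable "အော်" from the list above
print(code_points(aw))             # "U+1021" "U+1031" "U+102C" "U+103A"
```

This makes it easy to see that one syllable on screen can be a sequence of several code points stored one after another.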

I found that we can inspect the number of characters (code points) that make up each of these syllables, the number of bytes each is represented by, and the width of each when displayed, as follows:

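# Count code points, bytes, and display width for each syllable
CHARS <- nchar(myVowel22)                  # type = "chars" is the default
BYTES <- nchar(myVowel22, type = "bytes")
WIDTH <- nchar(myVowel22, type = "width")
MYAN_VOW <- data.frame(myVowel22, CHARS, BYTES, WIDTH)
MYAN_VOW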

myVowel22 <fctr>	CHARS <int>	BYTES <int>	WIDTH <int>
<U+1021>	1	3	1
<U+1021><U+102C>	2	6	2
<U+1021><U+102C><U+1038>	3	9	3
<U+1021><U+102D>	2	6	1
<U+1021><U+102E>	2	6	1
<U+1021><U+102E><U+1038>	3	9	2
<U+1021><U+102F>	2	6	1
<U+1021><U+1030>	2	6	1
<U+1021><U+1030><U+1038>	3	9	2
<U+1021><U+1031>	2	6	2

From our data frame “MYAN_VOW”, I could print out each syllable together with its number of characters, number of bytes, and printed width as follows:

utf8::utf8_print(paste(myVowel22,MYAN_VOW$CHARS,MYAN_VOW$BYTES,MYAN_VOW$WIDTH))

 [1] "အ 1 3 1"    "အာ 2 6 2"   "အား 3 9 3"  "အိ 2 6 1"    "အီ 2 6 1"    "အီး 3 9 2"  
 [7] "အု 2 6 1"    "အူ 2 6 1"    "အူး 3 9 2"   "အေ 2 6 2"   "အေ့ 3 9 2"   "အေး 3 9 3" 
[13] "အဲ 2 6 1"    "အဲ့ 3 9 1"    "အော 3 9 3"  "အော့ 4 12 3" "အော် 4 12 3" "အံ 2 6 1"   
[19] "အံ့ 3 9 1"    "အို 3 9 1"    "အို့ 4 12 1"   "အိုး 4 12 2"
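One pattern in these numbers is that BYTES is always three times CHARS. That follows from UTF-8 itself: code points from U+0800 to U+FFFF, a range that includes the Myanmar block starting at U+1000, take three bytes each. Here is a minimal sketch to check this, assuming the strings are held as UTF-8 (which is what the \u escapes produce in R); the vector sample_syllables is just three of the syllables above.

```r
# The three UTF-8 bytes that encode "က" (U+1000)
ka <- "\u1000"
print(charToRaw(ka))   # e1 80 80

# A few of the syllables from the list above: အ, အား, အိုး
sample_syllables <- c("\u1021", "\u1021\u102c\u1038", "\u1021\u102d\u102f\u1038")
chars <- nchar(sample_syllables, type = "chars")
bytes <- nchar(sample_syllables, type = "bytes")

# Every code point in these syllables takes exactly three bytes.
print(all(bytes == 3 * chars))
```

The WIDTH column does not follow such a simple rule, since it depends on how each mark is classified for display, so it may differ between systems.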

Well, this has been nothing more than an exercise in inputting Myanmar syllables with code and printing them out together with their characteristics. At the beginning, what I expected was that I would get some idea about the alphabetical ordering of words in the Myanmar language by knowing the syllable rhymes. Well, that was only partially correct. When I went on to read the front matter of the first volume of the Myanmar Language Dictionary (Abridged), I found that the 12 basic syllable rhymes are expandable to 22, and yet we need altogether 51 vowel sounds to cover the Myanmar language. Finally, the alphabetical order is shown, consisting of (a) the order of the consonants, (b) the order of the vowels and asats, and (c) the order of the medials.

Here, I understand that the proper alphabetical ordering of Myanmar words is very important, and that the creators of Myanmar dictionaries, thesauri, and databases in particular can’t do without it. I guess this is the subject matter covered by the term “collation”, or sorting for a language. Anyone more serious than me may like to read “Collation of Myanmar (Burmese) in Unicode” or “Representing Myanmar in Unicode: Details and Examples, Version 4”, or similar.
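To see why collation needs more than the raw codes, here is a rough sketch that sorts three of the syllables purely by their code points. It is only an illustration, not the collation method described in those documents; the helper codepoint_key is invented for the example, and the comparison is against the order in which the rhymes are listed earlier in this post.

```r
# Three rhymes in the order they appear in the list above: အေ, အဲ, အော
listed <- c("\u1021\u1031", "\u1021\u1032", "\u1021\u1031\u102c")

# Hypothetical helper: a fixed-width hex key, so that comparing keys
# as plain text is the same as comparing code points one by one.
codepoint_key <- function(s) paste(sprintf("%04X", utf8ToInt(s)), collapse = "")

keys <- vapply(listed, codepoint_key, character(1), USE.NAMES = FALSE)

# method = "radix" compares the keys byte by byte, whatever the locale.
by_codepoint <- listed[order(keys, method = "radix")]

cat(by_codepoint, "\n")
print(identical(by_codepoint, listed))   # FALSE: အော moves ahead of အဲ
```

Because “အော” starts with the same two code points as “အေ”, a plain code-point sort puts it right after “အေ” and before “အဲ”, which is not where the list above has it.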

As for me, and for my fellow dummies as well, keeping the code points of a syllable in the correct order, whether typing with a keyboard or inputting through code, would be necessary and sufficient to keep our texts right, I guess.
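A small sketch of why that order matters, using the syllable “အို” from the list above. The second string is a deliberately wrong spelling that I made up by swapping the two vowel signs; the two may look alike on screen, depending on the font and renderer, but R treats them as different text.

```r
# "အို" as listed above: base letter, then U+102D, then U+102F.
expected_order <- "\u1021\u102d\u102f"
# The same three code points with the two vowel signs swapped.
swapped_order  <- "\u1021\u102f\u102d"

# Same set of code points ...
print(setequal(utf8ToInt(expected_order), utf8ToInt(swapped_order)))
# ... but not the same string,
print(identical(expected_order, swapped_order))
# so a search for one spelling does not find the other.
print(grepl(expected_order, swapped_order, fixed = TRUE))
```

Word counts, searches, and sorting all work on the stored code points, so two spellings of the same syllable would be counted as two different items.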

Wednesday, September 18, 2019
