Wednesday, December 13, 2023

Syllabification of Myanmar Unicode text with Regex - 2


In Ye Kyaw Thu’s post mentioned previously (https://github.com/ye-kyaw-thu/sylbreak), he rejected a paper’s claim that his syllable segmentation method “… cannot correctly segment syllables that contain consonants, ‘်’ and ‘့’ …”. He asserted that his method works perfectly if ‘်’ and ‘့’ are entered in the correct order, that is, ‘်’ first and then ‘့’.

To check this issue with our syllabification approach, we used an on-screen keyboard to create two text strings containing “ရပ်တန့်”: t.1, with “န့်” keyed in the correct order (“်” followed by “့”), and t.2, with “န့်” keyed in the incorrect order (“့” followed by “်”).

Check if our syllabification approach has issues with ‘်’ and ‘့’

# create data
t.1 <- "ရပ်တန့်" 
t.2 <- "ရပ်တန့်"
t.1
[1] "ရပ်တန့်"
t.2
[1] "ရပ်တန့်"
# Are they exactly equal?
t.1 == t.2
[1] TRUE

They were exactly equal because the keyboard application reorders the keystrokes, so both strings end up stored with the same character order no matter how they were keyed. Splitting the two texts into characters shows this as well.

strsplit(t.1, split = "")
[[1]]
[1] "ရ" "ပ" "်"  "တ" "န" "့"  "်" 
strsplit(t.2, split = "")
[[1]]
[1] "ရ" "ပ" "်"  "တ" "န" "့"  "်" 

That confirms it. To obtain a second string that really has the incorrect order, we enter the text programmatically with Unicode code points (\u103a for ‘်’ and \u1037 for ‘့’). Then we check again whether the two strings are exactly equal.

tp.1 <- "ရပ်တန\u103a\u1037" 
tp.2 <- "ရပ်တန\u1037\u103a"
tp.1
[1] "ရပ်တန့်"
tp.2
[1] "ရပ်တန့်"
tp.1 == tp.2
[1] FALSE
strsplit(tp.1, split = "")
[[1]]
[1] "ရ" "ပ" "်"  "တ" "န" "်"  "့" 
strsplit(tp.2, split = "")
[[1]]
[1] "ရ" "ပ" "်"  "တ" "န" "့"  "်" 

They look the same when printed, but the order of the individual characters shows that they really are different.
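Because the two marks are hard to tell apart by eye in a character split, it can help to print the code points instead. The following is only a minimal sketch in base R; the helper name code_points() is made up for this illustration and is not part of any package.

```r
# Hypothetical helper: show each character of a string as a U+XXXX label.
code_points <- function(x) {
  sprintf("U+%04X", utf8ToInt(x))
}

# The same final consonant with the two marks in either order.
asat_first <- "\u1014\u103a\u1037"  # consonant, asat (U+103A), dot below (U+1037)
dot_first  <- "\u1014\u1037\u103a"  # consonant, dot below (U+1037), asat (U+103A)

code_points(asat_first)
code_points(dot_first)

# The strings differ even though they print alike.
identical(asat_first, dot_first)
```

Calling code_points(tp.1) and code_points(tp.2) on the strings above should make the swapped positions of U+103A and U+1037 visible at a glance.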

We now create a test string containing one correctly keyed copy of the text and two incorrectly keyed copies.

d <- paste(tp.1,tp.2,tp.2, sep = "")
d
[1] "ရပ်တန့်ရပ်တန့်ရပ်တန့်"
# check: split into characters
strsplit(d, split = "")
[[1]]
 [1] "ရ" "ပ" "်"  "တ" "န" "်"  "့"  "ရ" "ပ" "်"  "တ" "န" "့"  "်"  "ရ" "ပ" "်"  "တ" "န"
[20] "့"  "်" 
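# make syllables (the str_*() functions and the %>% pipe come from stringr)
library(stringr)
tst <- str_replace_all(d, "([က-အဣ-ဧဩဪဿ၌-၏])", "-\\1") %>%   # mark a break before each consonant, independent vowel or symbol
  str_replace_all(., "-", " ") %>%                                 # turn the break markers into spaces
  str_replace_all(., "\\s([က-အ][့်း]\\s|[က-အ][့်း])", "\\1") %>%   # re-attach a consonant directly followed by one of the marks in [့်း]
  str_replace_all(., "\\s([က-အ]္)\\s", "\\1") %>%                  # re-join stacked consonants
  str_replace_all(., "(\\s[က-အ]င်္)\\s", "\\1") %>%                # re-join kinzi with the following letter
  str_remove_all(., "[a-zA-Z0-9၀-၉၊။\\[\\]\\(\\)]|^\\s") %>%       # drop Latin letters, digits, punctuation, brackets, leading space
  str_squish(.)                                                    # collapse repeated spaces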
tst
[1] "ရပ် တန့် ရပ် တန့် ရပ် တန့်"
# see if the original order of ‘်’ and ‘့’ is retained in the syllables
strsplit(tst, split = "")
[[1]]
 [1] "ရ" "ပ" "်"  " " "တ" "န" "်"  "့"  " " "ရ" "ပ" "်"  " " "တ" "န" "့"  "်"  " " "ရ"
[20] "ပ" "်"  " " "တ" "န" "့"  "်" 

So the syllabification retains both the correct and the incorrect keying orders.
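Reading the order off a long character split is error-prone, so the same check could be done per syllable. The sketch below is only an illustration in base R: keying_order() is a hypothetical helper, and the sample syllables are built from escapes so that their stored order is unambiguous.

```r
# Sample syllables built from escapes (illustrative data, not the post's tst object).
syllables <- c(
  "\u101b\u1015\u103a",        # first syllable: carries asat only
  "\u1010\u1014\u103a\u1037",  # second syllable, asat (U+103A) before dot below (U+1037)
  "\u1010\u1014\u1037\u103a"   # second syllable, dot below (U+1037) before asat (U+103A)
)

# Hypothetical helper: label each syllable by the order of the two marks.
keying_order <- function(x) {
  asat_first <- grepl("\u103a\u1037", x, fixed = TRUE)
  dot_first  <- grepl("\u1037\u103a", x, fixed = TRUE)
  ifelse(asat_first, "asat-first",
         ifelse(dot_first, "dot-first", "neither"))
}

orders <- keying_order(syllables)
orders
```

With the syllable string from above, keying_order(strsplit(tst, " ", fixed = TRUE)[[1]]) would label each of the six syllables in the same way.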

Any issues in using our syllables with the “quanteda” R package?

Here, since I have been learning to use the “quanteda” R package for quantitative text analysis, I am eager to find out if the keying order would affect it.

A simple test is to let the dfm() function of quanteda get the number of syllables created.

library(quanteda)
dfm(tokens(tst))
Document-feature matrix of: 1 document, 2 features (0.00% sparse) and 0 docvars.
       features
docs    ရပ် တန့်
  text1  3  3

The result shows only two features, with all three “တန့်” syllables counted under a single feature, so the keying order within these syllables is ignored.
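One plausible explanation, which I have not confirmed in the quanteda source, is that the text gets Unicode-normalized somewhere during tokenization. In the Unicode character data, U+1037 has canonical combining class 7 and U+103A has class 9, so canonical reordering places the dot below before the asat whichever way they were keyed. The sketch below uses stri_trans_nfc() from the stringi package to show that effect on its own; that quanteda does the same thing internally is an assumption here.

```r
library(stringi)

# The same syllable with the two marks in either stored order.
asat_first <- "\u1010\u1014\u103a\u1037"  # asat (U+103A) then dot below (U+1037)
dot_first  <- "\u1010\u1014\u1037\u103a"  # dot below (U+1037) then asat (U+103A)

# Before normalization the strings are different.
same_raw <- identical(asat_first, dot_first)

# NFC normalization applies canonical reordering of combining marks.
nfc_asat_first <- stri_trans_nfc(asat_first)
nfc_dot_first  <- stri_trans_nfc(dot_first)
same_nfc <- identical(nfc_asat_first, nfc_dot_first)

same_raw
same_nfc
sprintf("U+%04X", utf8ToInt(nfc_asat_first))
```

If normalization is indeed the cause, applying stri_trans_nfc() to the text before syllabification would be one way to make every downstream step see a single keying order.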

Conclusion

  1. Our syllabification approach has no issues with ‘်’ and ‘့’.
  2. The quanteda R package seems to be insensitive to keying order in the composition of the syllables.
