Friday, October 23, 2020

Extracting headwords from the scanned Myanmar Dictionary


After successfully creating syllable-rhyme level bookmarks for volume 1 of the Abridged Myanmar Dictionary, I felt a bit more ambitious and set out to extract all the headwords from that volume.

With my earlier experience in converting the Myanmar spelling book to a searchable PDF, I knew I could use essentially the same workflow. Like the spelling book, the dictionary is set in two columns, so I can shed superfluous text by slicing each page into left and right strips just wide enough to cover the headwords. This can be done in GIMP by importing the PDF file as layers and slicing them using guides. The procedure is fully described in my “Making Myanmar Spelling Book searchable -II” post of August 25, 2019. Here, however, the OCR output was downloaded as plain text files.
In the image below, the left part shows the first two pages of the input PDF file and the right part shows the OCR output opened in Notepad++.

knitr::include_graphics("headWord_1_2.png")

After obtaining the OCR output text (in 7 text files), the workflow for extracting headwords from each of these text files is as follows:

  1. The text files are read into R.
  2. Headwords are extracted with a regex. The pattern that identifies a headword is: it begins with a consonant and ends with a dash, with or without a space before the dash.
  3. The extracted headwords are written out to a text file.
  4. The text file is opened in Notepad++ and checked against the headwords in the dictionary PDF file; OCR errors are corrected and missed words are entered. Note that the programmatically extracted headwords have a dash character at the end. Words with OCR errors are marked with an "*" at the end; corrected words and newly entered words are not marked.
  5. The modified text files are read into R.
  6. All files are combined into one, and the OCR errors and omissions are counted.
  7. All initially correct and corrected headwords are extracted and written out to a text file.
  8. The extracted headwords are checked against the dictionary again, and the remaining errors and omissions are corrected and marked in Notepad++.
  9. The corrected final text file is read into R, the errors are counted, and the cleaned list of headwords is extracted.
  10. The final list of headwords is written out to a text file.
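
Steps 1 to 3 could be sketched roughly as follows. This is only a minimal illustration, not the exact code used for this post: the sample lines and the helper name extract_headwords are hypothetical, and the pattern assumes that a headword starts the line, that consonants fall in U+1000 to U+1021, and that the "dash" may be a hyphen, U+2010 or an en dash.

```r
# Hypothetical OCR lines; in practice these would come from something like
# readLines("some_ocr_output.txt", encoding = "UTF-8")
ocr_lines <- c(
  "\u1000\u1005\u102c\u1038- v. play",   # headword with the dash attached
  "\u1000\u100a\u1005\u103a \u2013 n.",  # headword, a space, then an en dash
  "12 page header",                      # no headword on this line
  "\u1031\u1000- broken"                 # starts with a vowel sign, not a consonant
)

# Assumed pattern: a consonant (U+1000..U+1021) at the start of the line,
# any further Myanmar characters, an optional space, then a dash
hw_pattern <- "^[\u1000-\u1021][\u1000-\u104f]*[ ]?[\u002d\u2010\u2013]"

extract_headwords <- function(lines, pattern = hw_pattern) {
  hits <- regexpr(pattern, lines, perl = TRUE)
  regmatches(lines, hits)  # keeps only the matched portion of matching lines
}

headwords <- extract_headwords(ocr_lines)
length(headwords)
# writeLines(headwords, "extracted_headwords.txt", useBytes = TRUE)
```

Only the first two sample lines yield a headword; the other two are skipped because they do not begin with a consonant.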
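
The first 31 programmatically extracted headwords, still carrying the trailing dash, look like this:

utf8::utf8_print(y0.1[1:31])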
 [1] "က-"          "က -"         "က-"          "က -"        
 [5] "ကကတစ်-"       "ကကူရံ -"       "ကကြီး-"       "ကကြီးထွန်-"    
 [9] "ကကြိုး-"       "ကကြိုး-"       "ကခုန် –"       "ကချေသည်-"    
[13] "ကချော်ကချွတ် -" "ကချင်-"       "ကစား-"       "ကစားစရာ-"   
[17] "ကစားဝိုင်း -"   "ကစီ-"         "က -"         "ကစော် -"     
[21] "က်ပေါက် -"     "ကစဉ်က-"       "ကဆစ်-"        "ကဇာတ်-"      
[25] "ကည-"         "ကညို-"         "ကညင်-"        "ကညင်ဆီ -"     
[29] "ကညင်ဆီတိုင်-"     "ကညစ် -"       "ကညစ်ရေး-"    

The original entries for the above headwords on the first and second pages of the dictionary's body text are shown below. Since I am compiling a list of distinct headwords, I dropped the duplicates from my list.

Below is an example of the editing I did, with editing marks made as described in step 4 of the workflow above. Look carefully and you can see that I made mistakes even in this early part of the editing. That’s why I needed a second round of editing!

After this first round of editing I calculated that the OCR errors in the headwords plus the headwords missed by the OCR process came to about 25%.

Once steps 7 and 8 of the workflow were complete, the edited final headwords file was saved as allwords_1.txt. It was then read into R and processed:

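library(stringr)
# read the twice-edited headword list (one entry per line)
EDall_final <- readLines("allwords_1.txt", encoding = "UTF-8")
# final headwords: every line that is not marked "*" (uncorrected OCR error)
HWall_final <- str_subset(EDall_final, "[*]", negate = TRUE)
# entries corrected or added in round 2 carry a "+" mark
HWcorr2 <- str_subset(EDall_final, "[+]")
nHWall <- length(HWall_final)
nHWcorr2 <- length(HWcorr2)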
nHWall
[1] 5187
nHWcorr2
[1] 68
# Total corrected+new entries (round-1+2)
# B is the round-1 count of corrected+new entries (1277), computed earlier
paste("Total corrected+new entries (round-1+2) = ", (B + nHWcorr2))
[1] "Total corrected+new entries (round-1+2) =  1345"
#writeLines(y0.7, "418_split.txt", useBytes = TRUE)
# paste("Incorrect entries =",round(aCorr*100/cCorr,1), "%")
# paste("Corrected+new entries =",round(bCorr*100/cCorr,1), "%")

The total numbers of headwords and errors after two rounds of editing were found to be:

Total corrected+new entries (round-1) = 1277
Total correct+corrected+new entries (round-1) = 5189
Total raw entries (round-1) = 6245

Total corrected+new entries (round-1+2) = 1345
Total headwords = Correct+corrected+new entries (after round-2) = 5187
OCR errors corrected + new entries(for OCR misses) = 25.9 %
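
As a quick arithmetic check, the percentage can be reproduced from the counts listed above, assuming it is taken relative to the final number of headwords. The variable names below are only illustrative.

```r
corrected_new_r1  <- 1277  # corrected + new entries, round 1
corrected_new_r12 <- 1345  # corrected + new entries, rounds 1 and 2
total_r1    <- 5189        # correct + corrected + new entries after round 1
total_final <- 5187        # headwords after round 2

# share of entries that needed correcting or adding, in percent
pct_r1    <- round(corrected_new_r1  * 100 / total_r1, 1)
pct_final <- round(corrected_new_r12 * 100 / total_final, 1)

# entries corrected or added in round 2 alone
added_r2 <- corrected_new_r12 - corrected_new_r1

paste("Round 1:", pct_r1, "%")
paste("Rounds 1+2:", pct_final, "%")
```

Here pct_final comes out at 25.9, and added_r2 equals the 68 entries picked up in the second round.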

Extract final “KA” to “SA” (“က” - “စ”) headwords

cleanHWords_KA_SA <- gsub("[\u002d\u2010\u2013*+]", "", HWall_final) %>%
  trimws(.) 
# list entries that contain no Myanmar character at all (U+1000 to U+104F); none are expected
str_subset(cleanHWords_KA_SA, "[\u1000-\u104f]",negate = TRUE)
character(0)
# print a sample of 100 headwords
utf8::utf8_print(str_subset(cleanHWords_KA_SA, "[\u1000-\u104f]")[3440:3539])
  [1] "ဂြိုဟ်ကျ"           "ဂြိုဟ်စာ"           "ဂြိုဟ်စားသက်ရောက်"   
  [4] "ဂြိုဟ်စီးဂြိုဟ်နင်း"     "ဂြိုဟ်ဆိုး"           "ဂြိုဟ်တိုင်"          
  [7] "ဂြိုဟ်ပြေနံပြေ"      "ဂြိုဟ်မွှေ"           "ဂြိုဟ်ဝင်"          
 [10] "ဂြိုဟ်သက်"           "ဂြိုဟ်သိမ်"           "ဂွကလေး"          
 [13] "ဂွချော"           "ဂွဒိုး"             "ဂွေး"            
 [16] "ဂွမ်း"             "ဂွမ်းကပ်"           "ဂွမ်းခံအကျီ"        
 [19] "ဂွမ်းခက်"           "ဂျွတ်"             "ဂျွန်"            
 [22] "ဃ"               "ဃကြီး"            "ဃဋီ"             
 [25] "ဃနာ"             "ဃရာဝါသ"          "ဃောသ"           
 [28] "င"               "ငကျောက်"          "ငကျွဲ"            
 [31] "ငချိပ်"            "ငချိပ်ညိုပြောင်း"     "ငချိပ်သွေး"        
 [34] "ငစိုင်ရှင်"           "ငစဉ်းလဲ"           "ငစည်နှစ်"          
 [37] "ငစိန်"             "ငဆ"              "ငဆစ်သခွား"        
 [40] "ငတိ"              "ငတေ"             "ငနဲ"             
 [43] "ငပေါ"            "ငပုပ်ဖမ်း"          "ငပြေရှင်"         
 [46] "ငပြုပ်"            "ငမိုက်သား"          "ငမြှောင်တောင်"     
 [49] "ငရဲ"              "ငရဲကြီး"           "ငရဲငအုံ"           
 [52] "ငရဲမီး"            "ငရဲသစ်ငုတ်"          "ငရဲအိုး"           
 [55] "ငရံ့ပတူ"            "ငရုတ်"             "ငရုတ်ကောင်း"       
 [58] "ငရုတ်ကျည်ပွေ့"        "ငရုတ်ဆုံ"            "ငလျင်"           
 [61] "ငဟစ်"             "ငါ"              "ငါစား"          
 [64] "ငါစွဲ"             "ငါတကော"          "ငါတကောကော"      
 [67] "ငါတွေ့"            "ငါး"             "ငါးကုလား"        
 [70] "ငါးကင်"           "ငါးကင်း"          "ငါးကန်"          
 [73] "ငါးကန့်ဗျိုင်"        "ငါးကုံး"           "ငါးကျီးကန်း"      
 [76] "ငါးကျည်း"         "ငါးကျည်းခြောက်"    "ငါးကျပ်တိုက်"       
 [79] "ငါးကျပ်တင်"        "ငါးကြီးဆီ"         "ငါးကြီးအန်ဖတ်"     
 [82] "ငါးကြင်း"         "ငါးကြင်းကြေး"     "ငါးကြင်းမျက်ဆန်နီ"  
 [85] "ငါးကြင်းမြီး"      "ငါးကြောင်း"       "ငါးကွဲယင်"         
 [88] "ငါးကွမ်းရှပ်"        "ငါးခူ"            "ငါးခုတ်တုံး"        
 [91] "ငါးခုံးမ"          "ငါးချဉ်"          "ငါးခြောက်"       
 [94] "ငါးခြောက်ငါးခြမ်း" "ငါးခွေးလျှာ"       "ငါးစင်စပ်"        
 [97] "ငါးစင်ရိုင်း"        "ငါးစင်း"          "ငါးစင်းပြား"     
[100] "ငါးစည်ဖောင်း"     
# export all headwords to text file
writeLines(cleanHWords_KA_SA, "HeadWords_myDICTabr_v1.txt", useBytes = TRUE)
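
One caveat on the check above: the pattern "[\u1000-\u104f]" with negate = TRUE only flags entries that have no Myanmar character at all. A stricter variant, sketched here in base R on made-up entries, flags any entry that still contains a character outside U+1000 to U+104F after cleaning. The sample vector and variable names are hypothetical.

```r
# Hypothetical edited entries: trailing dashes, a "+" mark, one stray Latin letter
raw <- c("\u1004\u102b\u1038-",
         "\u1004\u101b\u1032 \u2013+",
         "\u1004a\u1010\u102d-")

# strip hyphen, U+2010, en dash and the "*" / "+" edit marks, then trim spaces
clean <- trimws(gsub("[\u002d\u2010\u2013*+]", "", raw))

# stricter check: at least one character outside the Myanmar block
suspect <- grep("[^\u1000-\u104f]", clean, value = TRUE, perl = TRUE)

# looser check: no Myanmar character anywhere in the entry
no_myanmar <- grep("[\u1000-\u104f]", clean, value = TRUE, invert = TRUE, perl = TRUE)

length(suspect)
length(no_myanmar)
```

On this sample the stricter check catches the entry with the stray "a", while the looser check reports nothing.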

Summary: the total number of headwords, broken down by initial consonant

library(stringr)
KA_words2 <- str_subset(HWall_final, "^[\u1000]")
t1 <- paste0("(\u1000):",length(KA_words2))
KHA_words2 <- str_subset(HWall_final, "^[\u1001]")
t2 <- paste0("(\u1001):",length(KHA_words2))
GAn_words2 <- str_subset(HWall_final, "^[\u1002]")
t3 <- paste0("(\u1002):",length(GAn_words2))
GAg_words2 <- str_subset(HWall_final, "^[\u1003]")
t4 <- paste0("(\u1003):",length(GAg_words2))
NGA_words2 <- str_subset(HWall_final, "^[\u1004]")
t5 <- paste0("(\u1004):",length(NGA_words2))
SA_words2 <- str_subset(HWall_final, "^[\u1005]")
t6 <- paste0("(\u1005):",length(SA_words2))
cat(c(t1,t2,t3,t4,t5,t6),sep = "\n")
(က):1916
(ခ):1337
(ဂ):207
(ဃ):6
(င):466
(စ):1255
# finally export headwords as text file
writeLines(HWall_final, "HWords_myDICT_v1.txt", useBytes = TRUE)
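
The six near-identical blocks above could also be collapsed into a single tabulation by first character. The following is a sketch on a made-up vector standing in for HWall_final; the helper name count_by_initial is hypothetical.

```r
# Count entries by their first character (one Unicode code point)
count_by_initial <- function(words) {
  initials <- substr(words, 1, 1)
  table(initials)
}

# Hypothetical stand-in for HWall_final
sample_words <- c("\u1000-",
                  "\u1000\u1000\u1010\u1005\u103a-",
                  "\u1001\u102b\u1038-",
                  "\u1004\u102b\u1038-")

counts <- count_by_initial(sample_words)
cat(paste0("(", names(counts), "):", as.integer(counts)), sep = "\n")
sum(counts)  # should equal the number of entries
```

As a sanity check on the real data, the six counts reported above (1916 + 1337 + 207 + 6 + 466 + 1255) add up to 5187, the same as nHWall.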

I am sharing this text file of headwords here.
Caution: It may contain errors I’d missed!
