Tuesday, September 15, 2020

Creating the basic bookmarks file for Myanmar Dictionary


My least-effort approach to creating bookmarks for the Myanmar Dictionary is as follows:

  1. Open the scanned PDF dictionary in LibreOffice Draw and make minimal edits. Save it as a PDF with 90% quality JPEG compression and an image resolution of 150 DPI. This creates a skeleton set of page bookmarks.
  2. Open the PDF file in JPdfBookmarks. In the Tools/Options menu, change “charset encoding for dump and apply” to utf-8. Dump the bookmarks to a text file in jpdfbookmarks format.
  3. Read the bookmarks text file into a data frame in RStudio. Extract the first column (variable), which holds the bookmark names with a forward slash and the page number appended. Remove the slash and page number, then export the names as a text file.
  4. Open the exported text file in Notepad++ and add the first and last words of each dictionary page.
  5. Import the bookmark names text file into RStudio. Add the slash and page numbers back and create the updated JPdfBookmarks dump file.
  6. Open the original PDF file in JPdfBookmarks. Load the bookmarks from the updated file and save, or continue working in JPdfBookmarks to add child bookmarks under each page.

I am skipping the first two steps in the following description.

Open the dump file, extract and modify the first column, and export it to a text file for entering bookmarks

Entries in the original bookmarks dump file:

BMd <- read.csv("jpdfBM-dump.txt", header = FALSE, encoding = "UTF-8")
str(BMd)
'data.frame':   546 obs. of  9 variables:
 $ V1: chr  "Page 1/1" "Page 2/2" "Page 3/3" "Page 4/4" ...
 $ V2: chr  "Black" "Black" "Black" "Black" ...
 $ V3: chr  "notBold" "notBold" "notBold" "notBold" ...
 $ V4: chr  "notItalic" "notItalic" "notItalic" "notItalic" ...
 $ V5: chr  "open" "open" "open" "open" ...
 $ V6: chr  "TopLeftZoom" "TopLeftZoom" "TopLeftZoom" "TopLeftZoom" ...
 $ V7: int  0 0 0 0 0 0 0 60 355 382 ...
 $ V8: int  0 0 0 0 0 0 0 0 0 0 ...
 $ V9: num  0 0 0 0 0 0 0 0 0 0 ...
BMd$V1[1:4]
[1] "Page 1/1" "Page 2/2" "Page 3/3" "Page 4/4"
# remove page numbers
x <- gsub("/[0-9]+", "", BMd$V1)
x[1:4]
[1] "Page 1" "Page 2" "Page 3" "Page 4"
writeLines(x, "BMnames2Modify.txt", useBytes = TRUE)
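
As a minimal sketch of what this step does to a single dump line, the example below parses one line held in memory instead of the real file. The sample line is reconstructed from row 8 of the `str()` output above, so the exact formatting of the numeric fields in the real dump is an assumption, and `sample_line`, `one`, and `title` are hypothetical names. It anchors the pattern at the end of the string, a slightly stricter variant of the `gsub()` call above.

```r
# One dump line in the jpdfbookmarks text format, reconstructed from the
# str() output above (the formatting of the numeric fields is assumed).
sample_line <- "Page 8/8,Black,notBold,notItalic,open,TopLeftZoom,60,0,0.0"

# Parse it the same way as the full file: comma-separated, no header row.
one <- read.csv(text = sample_line, header = FALSE, stringsAsFactors = FALSE)

# The first field holds "title/page"; drop the trailing "/page" part.
title <- sub("/[0-9]+$", "", one$V1)
print(title)
```

Anchoring with `$` only strips a slash and digits at the very end of the name, which matters if a bookmark title itself ever contains a slash followed by a number.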

Enter bookmarks with Notepad++

The exported bookmark names text file is opened in Notepad++ and the entries are made.

Modify the bookmark names

The saved file is imported into RStudio. The “Page xxx” labels were left in the file to identify each entry in case further modification is needed, but they are removed before the names go into the bookmarks file used by JPdfBookmarks.

BMn <- readLines("BM_Names2Modify.txt", encoding = "UTF-8")
# remove the leading "Page xxx " label from BMn; lines without added text stay unchanged
BMn2 <- sub("^Page [0-9]+ ", "", BMn)
BMn2[11:27]
 [1] "က - ကည\u103dတ\u103a"                 
 [2] "ကဏန\u103aး - ကထိန\u103a"              
 [3] "ကဒူဥ - ကပ\u103cင\u103a"               
 [4] "ကဗ\u103bာ - ကရော့"                    
 [5] "ကရော\u103aကမည\u103a - ကလိုင\u103a"     
 [6] "ကလစ\u103a - ကာ"                      
 [7] "ကာ - ကာလ"                            
 [8] "ကာဠသုတ\u103a - ကုလား"                  
 [9] "ကုသိနာရုံ - ကဲ့"                           
[10] "ကဲ့ရဲ့ - ကော\u103aလံ"                     
[11] "ကံ - ကိုး"                              
[12] "ကိုး - ကောက\u103a"                     
[13] "ကောက\u103a - ကင\u103aပ\u103dန\u103aး"
[14] "ကင\u103aမရာ - ကိုင\u103a"              
[15] "ကိုင\u103aး - ကိတ\u103a"                
[16] "ကိတ္တိမ - ကိန္နရာ"                        
[17] "ကိန္နရီ - ကုန\u103aး"                    

We add the page number information back to the bookmark names and store the result as a new column in the data frame:

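# keep only the "/page" part of each original entry
v1.1 <- gsub("[^/]+/", "/", BMd$V1)
# join each edited name with its page suffix, element by element
BMd$colz <- paste0(BMn2, v1.1)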
BMd$colz[11:27]
 [1] "က - ကည\u103dတ\u103a/9"                  
 [2] "ကဏန\u103aး - ကထိန\u103a/10"              
 [3] "ကဒူဥ - ကပ\u103cင\u103a/11"               
 [4] "ကဗ\u103bာ - ကရော့/12"                    
 [5] "ကရော\u103aကမည\u103a - ကလိုင\u103a/13"     
 [6] "ကလစ\u103a - ကာ/14"                      
 [7] "ကာ - ကာလ/15"                            
 [8] "ကာဠသုတ\u103a - ကုလား/16"                  
 [9] "ကုသိနာရုံ - ကဲ့/17"                           
[10] "ကဲ့ရဲ့ - ကော\u103aလံ/18"                     
[11] "ကံ - ကိုး/19"                              
[12] "ကိုး - ကောက\u103a/20"                     
[13] "ကောက\u103a - ကင\u103aပ\u103dန\u103aး/21"
[14] "ကင\u103aမရာ - ကိုင\u103a/22"              
[15] "ကိုင\u103aး - ကိတ\u103a/23"                
[16] "ကိတ္တိမ - ကိန္နရာ/24"                        
[17] "ကိန္နရီ - ကုန\u103aး/25"                    
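
Because the names and page suffixes are paired purely by position, a line added or deleted by accident in Notepad++ would shift every later bookmark. The following is a minimal sketch of a guard, using small stand-in vectors and not the real data; `names_edited`, `page_suffix`, and `combined` are hypothetical names.

```r
# Stand-ins for BMn2 (edited names) and v1.1 (page suffixes).
names_edited <- c("Page 1", "Page 2", "first - last")
page_suffix  <- c("/1", "/2", "/3")

# paste0() recycles a shorter vector without warning, so check lengths first.
stopifnot(length(names_edited) == length(page_suffix))

# Also flag blank names, which would give a bookmark with no title.
stopifnot(all(nzchar(trimws(names_edited))))

combined <- paste0(names_edited, page_suffix)
print(combined)
```

With the real data, the same two checks would be run on `BMn2` and `v1.1` before the new column is created.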

Create the updated JPdfBookmarks dump file

cols <- c('colz', 'V2', 'V3', 'V4', 'V5', 'V6', 'V7', 'V8', 'V9')
nuBMarks <- do.call(paste, c(BMd[cols], sep=","))
writeLines(nuBMarks, "MDict_BMarks_nu1.txt", useBytes = TRUE)
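
One way to check the result before loading it into JPdfBookmarks is to read the file back and confirm that the titles and the column count survived. This is a hedged sketch on a two-row stand-in data frame written to a temporary file; `demo`, `lines_out`, `out_file`, and `back` are hypothetical names, and it assumes no bookmark title contains a comma, since the dump is read as comma-separated.

```r
# Two-row stand-in for BMd with the new title column already added.
demo <- data.frame(
  colz = c("Page 1/1", "\u1000 - \u1000\u102c/2"),
  V2 = "Black", V3 = "notBold", V4 = "notItalic", V5 = "open",
  V6 = "TopLeftZoom", V7 = c(0L, 60L), V8 = 0L, V9 = 0,
  stringsAsFactors = FALSE
)

# Build the lines as in the step above and write them as UTF-8 bytes.
lines_out <- do.call(paste, c(demo, sep = ","))
out_file <- tempfile(fileext = ".txt")
writeLines(enc2utf8(lines_out), out_file, useBytes = TRUE)

# Read the file back and compare it with what was written.
back <- read.csv(out_file, header = FALSE, encoding = "UTF-8",
                 stringsAsFactors = FALSE)
stopifnot(ncol(back) == ncol(demo))
stopifnot(identical(enc2utf8(back$V1), enc2utf8(demo$colz)))
```

If a title did contain a comma, the column count read back would exceed nine and the first check would fail.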

The updated bookmarks file was then loaded into the dictionary PDF opened in JPdfBookmarks, and it worked.