Sunday, June 2, 2019

Cycle2: Syllable segmentation, parallel processing


In my last post, I hinted at a possible reduction of the processing time for syllable segmentation of 10K sentences out of the 20K sample, which had already been segmented into quanteda-style characters. After a frantic search through package documentation, examples, and Stack Overflow, and with some luck, I got it done. With the following code I reduced the running time from 1 hour 28 minutes (see my last post) to 44 minutes for the same first 10,000 sentences. I am now running it for the remaining sentences, the 10,001st to the 20,000th. Then I will combine the two results to get syllables for the whole 20,000-sentence sample.
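To make the core rule concrete before the full code, here is a minimal base-R sketch (no quanteda needed) of the first test in my loop: the dependent signs U+102B, U+102C, U+1038, U+1039 and U+103A never start a syllable, so they are appended to the syllable being built. The helper name joins_left() is hypothetical, not part of my actual script.

```r
# Hypothetical helper: does this character attach to the syllable on its left?
joins_left <- function(ch) grepl("[\u102b-\u102c\u1038-\u1039\u103a]", ch)

# "မင်း" split into single characters, as the quanteda-style tokens give me:
chars <- c("\u1019", "\u1004", "\u103a", "\u1038")
joins_left(chars)
# [1] FALSE FALSE  TRUE  TRUE
```

The last two characters (asat U+103A and visarga U+1038) join left, so all four characters end up in one syllable.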

Parallel processing with foreach()

library(quanteda)
library(doParallel)
library(foreach)
cl <- makeCluster(4)
registerDoParallel(cl)

system.time({
x20k_syllS <- foreach (k=10001:20000,.packages=c('quanteda','foreach')) %dopar% {
    x20k_syllS.k <- list()    # syllables collected for sentence k
    TEMP <- x20k.t[[k]][1]    # syllable under construction, seeded with the first character
    j <- 1
    L <- length(x20k.t[[k]])
    foreach(i=2:L,.packages=c('quanteda','foreach')) %do% {
        # dependent signs U+102B-U+102C, U+1038-U+1039, U+103A always attach to the left
        y <- grepl("[\u102b-\u102c\u1038-\u1039\u103a]",x20k.t[[k]][i])
        if (y == TRUE){
            TEMP <- paste0(TEMP,x20k.t[[k]][i])
        } else {
            # Myanmar digits U+1040-U+1049: keep runs of digits in one chunk
            my.1 <- grepl("[\u1040-\u1049]", x20k.t[[k]][i])
            my.0 <- grepl("[\u1040-\u1049]", x20k.t[[k]][i-1])
            if (my.1 == TRUE){
                if (my.0 == TRUE){
                    TEMP <- paste0(TEMP,x20k.t[[k]][i])
                } else {
                    x20k_syllS.k[[j]] <- TEMP
                    j <- j+1
                    TEMP <- x20k.t[[k]][i]
                }
            } else {
                if (my.0 == TRUE){
                    x20k_syllS.k[[j]] <- TEMP
                    j <- j+1
                    TEMP <- x20k.t[[k]][i]
                } else {
                    # a consonant following U+1039 is stacked, so it stays in the syllable
                    if (grepl("[\u1039]",x20k.t[[k]][i-1])==TRUE){
                        TEMP <- paste0(TEMP,x20k.t[[k]][i])
                    } else {
                        x20k_syllS.k[[j]] <- TEMP
                        j <- j+1
                        TEMP <- x20k.t[[k]][i]
                    }
                }
            }
        }
    }
    if (i == L){
        x20k_syllS.k[[j]] <- TEMP    # flush the last syllable of the sentence
    }
    # the value of this last expression is what %dopar% collects: element 1 of
    # x20k_syllS holds sentence 10,001, element 10,000 holds sentence 20,000
    paste(unlist(x20k_syllS.k))
}
})
   user  system elapsed 
  12.22    3.74 2760.16 
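As an aside, the 4 in makeCluster(4) simply matched the cores on my machine. A more portable sketch (an assumption on my part, not what I actually ran) would ask the parallel package how many cores are available, and would also release the workers afterwards:

```r
library(parallel)

# detectCores() can return NA on some platforms, so fall back to 1
nc <- detectCores()
n_workers <- if (is.na(nc)) 1L else max(1L, nc - 1L)  # leave one core for the OS
cl <- makeCluster(n_workers)
stopCluster(cl)  # always release the workers when done
```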

The result

Check the results for the 10,001st and 20,000th sentences (elements 1 and 10,000 of the new list).
cat(unlist(x20k_syllS[c(1,10000)]))
ထို့ ကြောင့် မင်း ကြီး သည် အ မတ် တို့ ကို ခေါ် ၍ ၊ ပံ့ သု ကူ ထေရ် ကို အ ကြိမ် ကြိမ် ပင့် စေ ၏ ။ ထို အ ခါ မိ ခင် ဖြစ် သူ သည် " မိ မိ တို့ တွင် အ မွေ ဆက် ခံ မည့် အ မွေ ခံ သား မ ရှိ ပေ ။
You can see that converting my original for loop needed very little modification: just foreach() with %do% and %dopar%. The nested loop, however, was tricky, and I was very frustrated before I hit on the right solution. Note that I used %do% inside %dopar%. I had tried nesting the loops with %:% and %dopar% as demonstrated in the vignette for the foreach package, but that was far too tricky for me to manage, and I guess it would have required considerable changes to my original for loop code.
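For reference, this is the kind of %:% nesting the vignette shows. It is a toy sketch, not my segmentation code, and it only fits loops where the inner body returns a value, rather than accumulating state in variables like TEMP and j as mine does:

```r
library(foreach)

# %:% merges two foreach loops into one stream of tasks; each inner body
# must *return* its result, which the .combine functions then assemble
res <- foreach(k = 1:3, .combine = "c") %:%
    foreach(i = 1:2, .combine = "c") %do% {
        k * 10 + i
    }
res
# [1] 11 12 21 22 31 32
```

With a parallel backend registered, the %do% can become %dopar% and the k-by-i tasks are farmed out to the workers.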
The processing time for the last 10,000 sentences was 46 minutes (2,760 seconds elapsed), so parallel processing with foreach() looping has roughly doubled the speed. I now combine list-1 (x20k_syllQ) and list-2 (x20k_syllS) to get the syllables for the whole 20K sentences.
x20k_syllQS <- c(x20k_syllQ, x20k_syllS)
cat(unlist(x20k_syllQS[c(1,20000)]))
ဂူ ဂဲ ၏ သု ည စီ မံ ကိန်း ( P r o j e c t Z e r o ) လေ့ လာ ရှာ ဖွေ သူ ဖြစ် သည့် ဂျန်း ဟွန်း က ကွတ် ကီး များ သည် ကြား ခံ များ ဖြစ် သည့် ဝိုင် ဖိုင် ထောက် ပံ့ သ များ က ဖတ် ရှု နိုင် သည် ။ ထို အ ခါ မိ ခင် ဖြစ် သူ သည် " မိ မိ တို့ တွင် အ မွေ ဆက် ခံ မည့် အ မွေ ခံ သား မ ရှိ ပေ ။
