Bayanathi Technology: May 2020

Thursday, May 14, 2020

Reshaping the corpus to sentence level

In my last post, I created a corpus from the text of the COVID-19 surveillance reports for Myanmar for April 2020 using the quanteda package and tried out its kwic() function. That corpus was created at the surveillance report level, that is, each document unit of the corpus is an entire surveillance report. Using the corpus_reshape() function, we can change the corpus from the “document” level to either the “sentences” or the “paragraphs” level.

(1) Add period mark at the end of the Myanmar language sentence

First I tried to trick quanteda by adding a period mark (full stop) after the Myanmar major section ending mark “။”, so that it becomes “\u104b\u002e”, that is, “။.”. Then I tried reshaping to the sentence level. That didn’t work; I got 999 sentences while the correct number is just 403!

(2) Add paragraph mark “\n\n” at the end of the Myanmar language sentence

This works! This way, I was able to trick quanteda into giving me the corpus at the desired sentence level. But then I found that the next approach is more natural.
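The two marker tricks above each amount to a single substitution. Here is a minimal sketch in base R, using a made-up two-sentence string with Latin placeholders for the words. The object names are hypothetical and not from the original script, and the quanteda step that would follow (corpus() and corpus_reshape()) is not shown.

```r
# Toy text: two "sentences", each ending in the Myanmar section mark U+104B
x <- "AAA BBB\u104b CCC DDD\u104b"

# Approach (1): append a full stop after every section mark
x_period <- gsub("\u104b", "\u104b.", x, fixed = TRUE)

# Approach (2): append a blank line (paragraph break) after every section mark
x_para <- gsub("\u104b", "\u104b\n\n", x, fixed = TRUE)

# Splitting on the blank line gives one piece per sentence
pieces <- trimws(strsplit(x_para, "\n\n", fixed = TRUE)[[1]])
length(pieces)
```

In this toy case the split on the blank line yields exactly one piece for each “။”, which is the behaviour approach (2) relies on.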

(3) Try reshaping without any additional mark

When I tried method (1), I noticed that some of the segmentation into “sentences” correctly used the Myanmar sentence ending mark “။”. So I tried reshaping the corpus of my last post directly into “sentences” without adding anything. That gave mostly correct segmentation into sentences, though there were some errors, such as: (i) considering “၊” in addition to “။” as a sentence ending mark, (ii) considering the sentence number at the beginning of a sentence, such as “၁။”, as a sentence, and (iii) considering “မှတ်ချက်။ ။” as two sentences. After appropriate modification of such texts, the reshaping succeeded. I was happy and surprised, because I didn’t think quanteda would recognize the sentence ending of the Myanmar language!
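To see why errors (ii) and (iii) arise, here is a rough base-R stand-in for a splitter that breaks after every “။”. It only illustrates the failure mode and is not what quanteda does internally. split_after_mark() is a hypothetical helper, and the Latin letters are placeholders for Myanmar words.

```r
# Hypothetical helper: break a string after every section mark (U+104B)
split_after_mark <- function(x) {
  marked <- gsub("\u104b\\s*", "\u104b\n", x)
  strsplit(marked, "\n", fixed = TRUE)[[1]]
}

# A sentence number followed by a sentence, and a note heading followed by a second mark
numbered <- "\u1041\u104b AAA BBB\u104b"
note     <- "NOTE\u104b \u104b CCC\u104b"

split_after_mark(numbered)  # the number becomes a piece of its own
split_after_mark(note)      # the heading and the second mark are split apart
```

Any splitter that treats every “။” as a boundary will produce these extra pieces, which is why the numbering and the note marker have to be rewritten before reshaping.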

Here’s how it was done.
First I removed “၊” and changed the sentence numbering “၁။” to “<၁>”. Then I modified “မှတ်ချက်။ ။” to “မှတ်ချက် -”.
Then there was this mysterious character “သည\u103a ။” in one sentence which gave me a lot of trouble eliminating. Finally I found the right pattern for detecting it: “သည\u103a.{1}။”, and was able to remove it!
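A character that is invisible on screen can be identified by listing the code points of the string. The sketch below uses a made-up example in which a zero-width space (U+200B) sits between “သည်” and “။”. Which character actually caused the trouble is not recorded here, so the zero-width space is purely an assumption for illustration.

```r
# Made-up example: U+200B (zero-width space) hidden before the section mark
s <- "\u101e\u100a\u103a\u200b\u104b"

# List every code point in U+XXXX form to expose the hidden character
code_points <- sprintf("U+%04X", utf8ToInt(s))
code_points

# ".{1}" matches exactly one character of any kind, so the pattern still applies
cleaned <- sub("\u101e\u100a\u103a.{1}\u104b", "", s)
nchar(cleaned)
```

The pattern does not need to name the hidden character; matching “any one character” in that position is enough to catch it.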

Recall that the vector of the texts of the surveillance reports for April 2020, xMc.v, had been created earlier (see my last post).

library(magrittr)  # provides the %>% pipe used below

xMc.v1.0 <- gsub("ဓာတ်ခွဲအတည်ပြု လူနာဟောင်း (old confirmed case)", "old confirmed case", xMc.v, fixed = TRUE) %>%
  gsub("ဆေးသုတေသနဦးစီးဌာန (DMR)", "DMR", . , fixed = TRUE) %>%
  gsub("အမျိုးသားကျန်းမာရေးဓာတ်ခွဲမှုဆိုင်ရာဌာန (ရန်ကုန်) (NHL)", "NHL", ., fixed = TRUE) %>%
  gsub("အမျိုးသားကျန်းမာရေးဓာတ်ခွဲမှုဆိုင်ရာဌာန\\(ရန်ကုန်\\)|အမျိုးသားကျန်းမာရေးဓာတ်ခွဲမှုဆိုင်ရာဌာန \\(ရန်ကုန်\\)", "NHL", .) %>%
  gsub("([\u1040-\u1049]?[\u1040-\u1049])\u104b", "<\\1>", .) %>%
  gsub("\u104a", " ", .) %>%
  gsub("(မှတ်ချက်)။ ။", "\\1 - ", .) %>%
  sub("သည\u103a.{1}။","", .)
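
The effect of the numbering and comma substitutions is easier to see on a small input. Here is a sketch on a made-up string, with Latin placeholders standing in for Myanmar words. The object toy and its contents are invented for illustration.

```r
# Toy text: sentence numbers "၁။" and "၁၀။", a comma mark "၊", and closing "။" marks
toy <- "\u1041\u104b AAA\u104a BBB\u104b \u1041\u1040\u104b CCC\u104b"

# One or two Myanmar digits followed by "။" become "<digits>"
step1 <- gsub("([\u1040-\u1049]?[\u1040-\u1049])\u104b", "<\\1>", toy)

# Every "၊" is replaced by a space
step2 <- gsub("\u104a", " ", step1)

step2
```

Only the “။” marks that directly follow a digit are rewritten; the ones that end real sentences are left in place for the reshaping step.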

# create data frame
xMc.v1.0df <- data.frame(Z1 = names(xMc.v1.0),Z2 = xMc.v1.0, stringsAsFactors = FALSE)

# create corpus at document level
library(quanteda)
xsenMP0_apr <- xMc.v1.0df %>%
  corpus(., docid_field = "Z1", text_field = "Z2")
summary(xsenMP0_apr)

Corpus consisting of 60 documents:

                  Text Types Tokens Sentences
  (30-4-2020, 8:00 PM)   114    311         7
  (30-4-2020, 7:00 AM)    94    183         4
   (29-4-2020, 8:00PM)   133    323         8
  (29-4-2020, 7:00 AM)    88    160         3
  (28-4-2020, 8:00 PM)   105    232         5
  (28-4-2020, 7:00 AM)    67    152         4
  (27-4-2020, 8:00 PM)    84    176         5
  (27-4-2020, 7:00 AM)    82    187         3
  (26-4-2020, 8:00 pm)   110    258         6
  (26-4-2020, 7:00 AM)    83    190         4
  (25-4-2020, 8:00 pm)   127    302         6
 (25-4-2020, 10:00 am)    57     92         2
 (24-4-2020, 11:00 pm)    86    236         6
  (24-4-2020, 8:00 pm)   112    298         7
 (24-4-2020, 10:00 am)    99    272         5
  (24-4-2020, 9:00 am)    93    206         5
 (24-4-2020, 00:30 am)    50     75         2
  (23-4-2020, 8:00 pm)   119    334         8
 (23-4-2020, 10:00 am)    83    196         4
  (23-4-2020, 7:00 AM)    69    139         3
  (22-4-2020, 8:00 PM)   114    305         8
 (22-4-2020, 10:00 AM)    76    154         4
  (21-4-2020, 10:00PM)    57     94         3
   (21-4-2020, 8:00PM)    88    203         5
 (20-4-2020, 11:00 PM)    83    195         6
  (20-4-2020, 8:00 PM)   114    244         6
 (20-4-2020, 10:00 AM)    99    181         3
 (19-4-2020, 11:15 PM)    82    190         6
  (19-4-2020, 8:00 PM)   143    349         9
  (19-4-2020, 7:00 am)    76    164         5
  (18-4-2020, 9:00 PM)    70    146         4
  (18-4-2020, 8:00 pm)    88    205         4
 (18-4-2020, 10:00 am)   164    511         8
  (18-4-2020, 8:00 am)   157    439         9
  (17-4-2020, 8:00 pm)   143    534         7
 (17-4-2020, 10:00 am)   121    273         5
  (16-4-2020, 8:00 pm)   101    213         4
 (16-4-2020, 10:00 AM)   101    220         4
  (15-4-2020, 8:00 pm)   103    216         4
 (15-4-2020, 10:30 AM)   110    216         4
  (14-4-2020, 8:00 pm)   170    437         6
  (14-4-2020, 4:00 pm)    58     99         2
 (14-4-2020, 00:15 am)    97    220         6
 (13-4-2020, 10:15 pm)   102    216         6
  (13-4-2020, 8:00 pm)    78    118         4
 (13-4-2020, 00:30 am)   237    813        15
  (12-4-2020, 8:00 pm)   224    573        11
  (12-4-2020, 2:00 AM)   238   1262        22
   (11-4-2020, 8:00PM)   208    727        13
  (10-4-2020, 10:15PM)   192    564         9
   (10-4-2020, 8:00PM)   186    490         7
   (10-4-2020, 3:00AM)   256   1016        14
   (9-4-2020, 8:00 PM)   321   1191        17
   (8-4-2020, 8:00 PM)   223    693         9
    (6-4-2020, 8:00PM)   241    712        11
   (5-4-2020, 8:00 PM)   188    491         7
   (4-4-2020, 8:00 PM)   261    748        11
   (3-4-2020, 8:00 PM)   192    493         7
    (2-4-2020, 8:00PM)   214    590        10
    (1-4-2020, 8:00PM)   226    636        11

Source: C:/DATA/GITA_EX/* on x86-64 by mtnn
Created: Thu May 14 23:12:40 2020
Notes:

# Reshape corpus to sentence level
xsenMP_apr <- corpus_reshape(xsenMP0_apr, "sentences")

View the first two and last two sentences of the (12-4-2020, 8:00 pm) report. Note that the printed sentences have been truncated. Compare them with the whole report of that date.

utf8::utf8_print(texts(xsenMP_apr)[which(grepl("(12-4-2020, 8:00 pm)", names(texts(xsenMP_apr))))][c(1:2,10:11)])

(12-4-2020, 8:00 pm).1                                                                    
"<၁> NHL မှ ယနေ့ COVID-19 ရောဂါအတွက်ဓာတ်ခွဲစစ်ဆေးမှု ပထမအသုတ်အား စစ်ဆေးမှုတွင် စောင့်ကြည့်လူနာများ  ဆေးရုံများနှင့်…"
(12-4-2020, 8:00 pm).2                                                                    
"(COVID-19 ရောဂါပိုး တွေ့ရှိ (တွေ့ရှိ) မှု သတင်းကို ယနေ့  ညနေ (၆:၃၀) တွင် သတင်းအကျဉ်း ထုတ်ပြန်ပြီးဖြစ်ပါသည်။)"      
(12-4-2020, 8:00 pm).10                                                                   
"<၅> အဆိုပါ ဓာတ်ခွဲအတည်ပြုလူနာများနှင့် ထိတွေ့ခဲ့သူများအားလုံးကို စုံစမ်းဖော်ထုတ်၍ အသွားအလာကန့်သတ်ကာ စောင့်ကြပ်ကြည့်ရှုသွားမည်ဖြ…"
(12-4-2020, 8:00 pm).11                                                                   
"သို့ဖြစ်ပါ၍ အဆိုပါ ဓာတ်ခွဲအတည်ပြုလူနာများနှင့် အနီးကပ်ထိတွေ့ခဲ့သည့် ပြည်သူလူထုအနေဖြင့် နီးစပ်ရာ ကျန်းမာရေးဌာနသို့ ဆက်သွယ်အကြော…"

cat(texts(xsenMP_apr)[which(grepl("(12-4-2020, 8:00 pm)", names(texts(xsenMP_apr))))])

<၁> NHL မှ ယနေ့ COVID-19 ရောဂါအတွက်ဓာတ်ခွဲစစ်ဆေးမှု ပထမအသုတ်အား စစ်ဆေးမှုတွင် စောင့်ကြည့်လူနာများ  ဆေးရုံများနှင့် သက်ဆိုင်ရာ နေရာများ၌ အသွားအလာကန့်သတ်၍ စောင့်ကြပ်ကြည့်ရှုခံနေသူများ စုစုပေါင်း (၉၃) ဦး ၏ ဓာတ်ခွဲ နမူနာများအား စစ်ဆေးခဲ့ရာ (ဇယား-၁) ၌ ဖော်ပြထားရှိသော လူနာ (၁) ဦး၏ ဓာတ်ခွဲအဖြေတွင် COVID-19 ရောဂါပိုး တွေ့ရှိ (တွေ့ရှိ) ရပြီး ကျန်လူနာများနှင့် စောင့်ကြပ် ကြည့်ရှုမှုခံနေသူ (၉၂)ဦး၏ ဓာတ်ခွဲအဖြေများတွင် COVID-19 ရောဂါပိုးမရှိကြောင်း တွေ့ရှိရပါ သည်။ (COVID-19 ရောဂါပိုး တွေ့ရှိ (တွေ့ရှိ) မှု သတင်းကို ယနေ့  ညနေ (၆:၃၀) တွင် သတင်းအကျဉ်း ထုတ်ပြန်ပြီးဖြစ်ပါသည်။) <၂> ဓာတ်ခွဲအတည်ပြုလူနာ (Case-039) သည် ရန်ကုန်တိုင်းဒေသကြီး  ပန်းဘဲတန်းမြိို့နယ် တွင်နေထိုင်သူ အသက် (၈၅) နှစ်အရွယ် အမျိုးသားတစ်ဦးသည် (၉-၄-၂၀၂၀) ရက်နေ့တွင် ဖျားခြင်း  ချောင်းဆိုးခြင်းလက္ခဏာများစတင်ခံစားခဲ့ရသဖြင့် ရန်ကုန်ပြည်သူ့ဆေးရုံကြီးသို့ (၁၁-၄-၂၀၂၀) ရက်နေ့တွင် သွားရောက်ပြသခဲ့ရာစောင့်ကြည့်လူနာအဖြစ် သတ်မှတ်၍ ဓာတ်ခွဲ နမူနာရယူစစ်ဆေးခဲ့ခြင်းဖြစ်ပါသည်။ အဆိုပါလူနာသည် လွန်ခဲ့သော (၁၄)ရက်အတွင်း ပြည်ပ နိုင်ငံများသို့ ခရီးသွားလာသော ရာဇဝင်မရှိခဲ့ကြောင်း သိရှိရပြီး ဆီးချိုရောဂါနှင့် နှလုံးသွေး ကြောကျဉ်းရောဂါအခံရှိကြောင်း သိရှိရပါသည်။ လူနာမှာ ဆေးရုံ စတင်တက်ရောက်သည့် အချိန်တွင် အလွန်မောပန်းလျက်ရှိပြီး သွေးတွင်း အောက်စီဂျင်ဓာတ်များ ကျဆင်းလျက် ရှိရာ လိုအပ်သော ကုသမှုများပေးခဲ့သော်လည်း (၁၂-၄-၂၀၂၀)ရက်နေ့  နံနက် (၆:၀၀) နာရီအချိန်ခန့် တွင် သေဆုံးသွားခဲ့ကြောင်း သိရှိရပါသည်။ သေဆုံးရသည့် အကြောင်းအရင်းမှာ ပြင်းထန် အဆုတ်ရောင်ရောဂါ  ဆီးချိုရောဂါ  သွေးတိုးရောဂါနှင့် နှလုံးသွေးကြောကျဉ်းရောဂါအခံ ရှိခြင်းတို့ကြောင့် ဖြစ်ပါသည်။ <၃> ယခုအခါ (၂၃-၃-၂၀၂၀)ရက်နေ့မှ (၁၂-၄-၂၀၂၀)ရက်နေ့  ည (၈:၀၀)နာရီအချိန်အထိ မြန်မာနိုင်ငံတွင် COVID-19 ရောဂါ ဓာတ်ခွဲအတည်ပြုလူနာ (၃၉)ဦး တွေ့ရှိခဲ့ပြီး ဖြစ်ပါသည်။ <၄> သို့ဖြစ်ရာ COVID-19 ရောဂါ ဓာတ်ခွဲအတည်ပြုလူနာများမှာ ချင်းပြည်နယ်  တီးတိန် ပြည်သူ့ဆေးရုံတွင် (၃)ဦး  ရန်ကုန်မြို့  ဝေဘာဂီအထူးကုဆေးရုံကြီးတွင် လူနာ (၂၆)ဦး  ရှမ်း ပြည်နယ် (မြောက်ပိုင်း)  လားရှိုးပြည်သူ့ဆေးရုံကြီးတွင် (၁)ဦး  မော်လမြိုင်ပြည်သူ့ဆေးရုံကြီး တွင် လူနာ (၁)ဦး  စဝ်စံထွန်းပြည်သူ့ဆေးရုံကြီးတွင် လူနာ (၁)ဦး  စုစုပေါင်း (၃၂)ဦးတို့ဖြစ်ပြီး ၎င်းတို့၏ 
ကျန်းမာရေးအခြေအနေမှာ တည်ငြိမ်လျက်ရှိပါသည်။ ဝေဘာဂီအထူးကုဆေးရုံကြီး ရှိ လူနာ (၁) ဦးအား အထူးကြပ်မတ်ခန်း၌ သီးခြားထားရှိ ဆေးကုသမှုပေးလျက်ရှိပါသည်။ <၅> အဆိုပါ ဓာတ်ခွဲအတည်ပြုလူနာများနှင့် ထိတွေ့ခဲ့သူများအားလုံးကို စုံစမ်းဖော်ထုတ်၍ အသွားအလာကန့်သတ်ကာ စောင့်ကြပ်ကြည့်ရှုသွားမည်ဖြစ်ပါသည်။ သို့ဖြစ်ပါ၍ အဆိုပါ ဓာတ်ခွဲအတည်ပြုလူနာများနှင့် အနီးကပ်ထိတွေ့ခဲ့သည့် ပြည်သူလူထုအနေဖြင့် နီးစပ်ရာ ကျန်းမာရေးဌာနသို့ ဆက်သွယ်အကြောင်းကြားစေလိုကြောင်း တိုက်တွန်းအပ်ပါသည်။

Now we know that leaving only the major section mark (“။”) WORKS!!!!

It was done, but I don’t know if that was due to quanteda or the Unicode system. Anyway that’s wonderful for sure.

Wednesday, May 6, 2020

Exploring information on COVID-19 confirmed cases

With my last post I shared the text of the COVID-19 surveillance reports for Myanmar for the month of April 2020. It was, however, incomplete, as I couldn’t find a way to read in some of the reports from the MOHS website. In the meantime, I manually accessed the missing webpages, copied and pasted their text into text files, read their contents into R, and created the file of complete reports. This file is shared here.

For this post, I used the quanteda R package to (i) create a corpus of the complete set of surveillance reports for the month of April 2020, and (ii) start exploring this corpus, particularly with the kwic() function, to extract information on the laboratory confirmed cases of COVID-19 infection.

Corpus creation

Prior to the present exercise, I’d created the text of the surveillance reports in the xM_aprAll dataframe.

library(quanteda)
xMc_apr <- xM_aprAll[, 1:2] %>%
  corpus(.,docid_field = "X1", text_field = "X2")
xMc_apr$heading <- xM_aprAll$X1

Shortening some texts for use as identifiers

For processing the reports and extracting information, the report headings would be too long, so I am replacing them with their date and time. I am also shortening some long names to their English acronyms.

ap <- "COVID-19 ရောဂါစောင့်ကြပ်ကြည့်ရှုမှုနှင့်ပတ်သက်၍ သတင်းထုတ်ပြန်ခြင်း\\s"
bp <- "COVID-19 ရောဂါ စောင့်ကြပ်ကြည့်ရှုမှုနှင့်ပတ်သက်၍ သတင်းထုတ်ပြန်ခြင်း\\s"
cp <- "COVID-19 ရောဂါစောင့်ကြပ်ကြည့်ရှုမှုနှင့်ပတ်သက်၍ သတင်းထုတ်ပြန်ခြင်း\\s"
patt <- paste(c(ap,bp,cp), collapse = "|")
xMc.v <- texts(xMc_apr) %>%   
  gsub("ဓာတ်ခွဲအတည်ပြု လူနာဟောင်း (old confirmed case) များ၏ ဓာတ်ခွဲနမူနာများ၌", "old confirmed case", ., fixed = TRUE) %>%
  gsub("ဆေးသုတေသနဦးစီးဌာန (DMR)", "DMR", ., fixed = TRUE ) %>%
  gsub("အမျိုးသားကျန်းမာရေးဓာတ်ခွဲမှုဆိုင်ရာဌာန (ရန်ကုန်) (NHL)", "NHL", ., fixed = TRUE)
names(xMc.v) <- gsub(patt, "", names(xMc.v))

Tokenizing the report texts

I am using the spacing of the text in the original reports for tokenization. This segments the reports into pieces of text that are more like phrases than words in the English language. For that, we use the tokens() function with the option what = "fastestword". This splits the text on the space character, using stringi::stri_split_fixed(x, " ").

ctoks <- tokens(xMc.v, "fastestword")
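
What splitting on the space character means for Myanmar text can be sketched in base R. Here strsplit() with fixed = TRUE is used as a stand-in for the stringi call mentioned above, applied to three spellings of the same phrase as they occur in the reports. The object names are made up for this example.

```r
# Three spellings of the same phrase, differing only in where the spaces fall
variants <- c(
  "ဓာတ်ခွဲ အတည်ပြုလူနာ (၁၅၀) ဦး",
  "ဓာတ်ခွဲအတည်ပြုလူနာ (၁၄၉)ဦး",
  "ဓာတ်ခွဲအတည်ပြု လူနာ (၁၂၇)ဦး"
)

# Split each string at every single space, with no other rules applied
toks <- strsplit(variants, " ", fixed = TRUE)

# Number of tokens per spelling
lengths(toks)
```

The same phrase ends up as four, two or three tokens depending on the spacing, so any search pattern has to allow for that.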

Finding the search pattern for laboratory confirmed patients

Inspection of the reports shows three different ways of writing the same item:
ဓာတ်ခွဲ အတည်ပြုလူနာ (၁၅၀) ဦး
ဓာတ်ခွဲအတည်ပြုလူနာ (၁၄၉)ဦး
ဓာတ်ခွဲအတည်ပြု လူနာ (၁၂၇)ဦး

They could, however, be reduced to a single regex pattern: “လူနာ ([၀-၉]+)”
The search was done with the kwic() function. “The kwic function (keywords-in-context) performs a search for a word and allows us to view the contexts in which it occurs.” Since the Myanmar language doesn’t mark off words with spaces, we used the groups of text (groups of syllables) or phrases delimited by spaces, as found in the original text, for tokenization. From the kwic() function, we asked for three phrases preceding (pre) and three following (post) the keyword in the output.
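
One detail of this pattern is worth spelling out: unescaped parentheses in a regular expression form a group rather than matching the literal characters “(” and “)”. As far as I understand phrase(), the pattern is also split at the space, so each part is matched against one token. The base-R sketch below checks only the second part against two tokens and does not call quanteda. “XYZ” is a placeholder for a Myanmar word, and the object names are made up.

```r
# Two tokens: a bracketed count "(၁၅၀)", and text that merely ends in "။၂။"
tok <- c("(\u1041\u1045\u1040)", "XYZ\u104b\u1042\u104b")

digits_group   <- "([\u1040-\u1049]+)"      # parentheses act as a group
digits_literal <- "\\([\u1040-\u1049]+\\)"  # parentheses are matched literally

grepl(digits_group, tok)    # matches any token containing a Myanmar digit
grepl(digits_literal, tok)  # matches only digits enclosed in parentheses
```

This would explain why tokens that carry only a sentence number, with no bracketed count, also show up as keywords in the search results.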

Finding information relating to confirmed cases

cPatients <- data.frame(kwic(ctoks, phrase("လူနာ ([၀-၉]+)"), window = 3, valuetype = "regex"))
nrow(cPatients)

[1] 228

The first 10 and last 10 of the 228 results are shown here.

utf8::utf8_print(sapply(cPatients[c(1:10,219:228),], paste))

      docname                from  to    pre                                                  
 [1,] "(30-4-2020, 8:00 PM)" "8"   "9"   "DMR မှ စစ်ဆေးပြီးစီးခဲ့သော"                               
 [2,] "(30-4-2020, 8:00 PM)" "16"  "17"  "COVID-19 ရောဂါ ဓာတ်ခွဲအတည်ပြု"                           
 [3,] "(30-4-2020, 8:00 PM)" "27"  "28"  "COVID-19 ရောဂါ ဓာတ်ခွဲ"                                
 [4,] "(30-4-2020, 8:00 PM)" "79"  "80"  "ဦး ရှိပါသည်။၆။ ယနေ့အထိ"                                  
 [5,] "(30-4-2020, 8:00 PM)" "100" "101" "(၁၂:၀၀) နာရီအတွင်း တွေ့ရှိရသော"                            
 [6,] "(30-4-2020, 7:00 AM)" "15"  "16"  "-• COVID-19 ရောဂါဓာတ်ခွဲအတည်ပြု"                         
 [7,] "(30-4-2020, 7:00 AM)" "25"  "26"  "COVID-19 ရောဂါ ဓာတ်ခွဲအတည်ပြု"                           
 [8,] "(29-4-2020, 8:00PM)"  "11"  "12"  "အဖြစ် စစ်ဆေးပြီးစီးခဲ့သော အသွားအလာကန့်သတ်ခံရသူနှင့်"               
 [9,] "(29-4-2020, 8:00PM)"  "17"  "18"  "ဓာတ်ခွဲနမူနာများနှင့် DMRမှ စစ်ဆေးပြီးစီးခဲ့သော"                 
[10,] "(29-4-2020, 8:00PM)"  "27"  "28"  "COVID-19 ရောဂါပိုး မတွေ့ရှိပါ။(ဓာတ်ခွဲအတည်ပြု"                 
[11,] "(1-4-2020, 8:00PM)"   "37"  "38"  "ညနေ (၆) နာရီအထိ"                                      
[12,] "(1-4-2020, 8:00PM)"   "62"  "63"  "ခဲ့ရာ (ဇယား-၁) တွင်ဖော်ပြထားရှိသော"                        
[13,] "(1-4-2020, 8:00PM)"   "72"  "73"  "တွေ့ရှိ (တွေ့ရှိ) ရပြီး"                                     
[14,] "(1-4-2020, 8:00PM)"   "136" "137" "ပေးလျက်ရှိပါသည်။၅။ COVID-19 ရောဂါ"                      
[15,] "(1-4-2020, 8:00PM)"   "142" "143" "ချင်းပြည်နယ်၊ တီးတိန် ပြည်သူ့ဆေးရုံတွင်"                         
[16,] "(1-4-2020, 8:00PM)"   "147" "148" "ဦး၊ ရန်ကုန်မြို့၊ ဝေဘာဂီအထူးကုဆေးရုံကြီးတွင်"                     
[17,] "(1-4-2020, 8:00PM)"   "153" "154" "နှင့် မန္တလေးမြို့၊ ကန်တော်နဒီဆေးရုံတွင်"                         
[18,] "(1-4-2020, 8:00PM)"   "157" "158" "(၁) ဦး၊ နေပြည်တော်ပြည်သူ့ဆေးရုံကြီးတွင်"                      
[19,] "(1-4-2020, 8:00PM)"   "166" "167" "လားရှိုးမြို့တွင် ကျောက်မဲမြို့ ပြည်သူ့ဆေးရုံမှ"                       
[20,] "(1-4-2020, 8:00PM)"   "176" "177" "ကျန်းမာရေးအခြေအနေမှာ တည်ငြိမ်လျက်ရှိပါသည်။ ဝေဘာဂီအထူးကုဆေးရုံကြီးရှိ"
      keyword                post                                       pattern       
 [1,] "စောင့်ကြည့်လူနာ (၂၉)"      "ဦး၏ ဓာတ်ခွဲနမူနာများတွင် -•"                    "လူနာ ([၀-၉]+)"
 [2,] "လူနာသစ် မတွေ့ရှိပါ။၂။"      "သို့ဖြစ်ပါ၍ (၃၀-၄-၂၀၂၀) ရက်နေ့၊"                "လူနာ ([၀-၉]+)"
 [3,] "အတည်ပြုလူနာ (၁၅၀)"       "ဦးရှိပြီဖြစ်ပါသည်။၃။ (၃၀-၄-၂၀၂၀) ရက်နေ့တွင်"       "လူနာ ([၀-၉]+)"
 [4,] "ဓာတ်ခွဲအတည်ပြုလူနာ (၁၂၇)"   "ဦးအား သက်ဆိုင်ရာဆေးရုံများတွင် ဆေးကုသမှု"            "လူနာ ([၀-၉]+)"
 [5,] "စောင့်ကြည့်လူနာအသစ် (၆၀)ဦး" "ရှိပါသည်။"                                   "လူနာ ([၀-၉]+)"
 [6,] "လူနာသစ် မတွေ့ရှိပါ။၂။"      "သို့ဖြစ်ပါ၍ (၃၀-၄-၂၀၂၀)ရက်နေ့၊ နံနက်"             "လူနာ ([၀-၉]+)"
 [7,] "လူနာ (၁၅၀)"            "ဦးရှိပြီဖြစ်ပါသည်။၃။ (၂၉-၄-၂၀၂၀) ရက်နေ့အတွက်"      "လူနာ ([၀-၉]+)"
 [8,] "စောင့်ကြည့်လူနာ (၈၉)"      "ဦး၏ ဓာတ်ခွဲနမူနာများနှင့် DMRမှ"                  "လူနာ ([၀-၉]+)"
 [9,] "စောင့်ကြည့်လူနာ (၂၁)"      "ဦး၏ ဓာတ်ခွဲနမူနာများ၊ စုစုပေါင်း"                "လူနာ ([၀-၉]+)"
[10,] "လူနာသစ် မရှိပါ။)၂။"       "သို့ဖြစ်ပါ၍ (၂၉-၄-၂၀၂၀)ရက်နေ့၊ ည"               "လူနာ ([၀-၉]+)"
[11,] "စောင့်ကြည့်လူနာအသစ် (၃၆)"   "ဦး ရှိပါသည်။၃။ အမျိုးသားကျန်းမာရေးဓာတ်ခွဲမှုဆိုင်ရာဌာန" "လူနာ ([၀-၉]+)"
[12,] "လူနာ (၁)"              "ဦး၏ ဓာတ်ခွဲအဖြေတွင် COVID-19"                  "လူနာ ([၀-၉]+)"
[13,] "ကျန်လူနာ (၅၅)"          "ဦး၏ ဓာတ်ခွဲအဖြေများတွင် COVID-19"              "လူနာ ([၀-၉]+)"
[14,] "ဓာတ်ခွဲအတည်ပြုလူနာ (၁၅)"    "ဦးအနက် ချင်းပြည်နယ်၊ တီးတိန်"                    "လူနာ ([၀-၉]+)"
[15,] "လူနာ (၁)"              "ဦး၊ ရန်ကုန်မြို့၊ ဝေဘာဂီအထူးကုဆေးရုံကြီးတွင်"           "လူနာ ([၀-၉]+)"
[16,] "လူနာ (၁၀)"             "ဦး နှင့် မန္တလေးမြို့၊"                          "လူနာ ([၀-၉]+)"
[17,] "လူနာ (၁)"              "ဦး၊ နေပြည်တော်ပြည်သူ့ဆေးရုံကြီးတွင် လူနာ"            "လူနာ ([၀-၉]+)"
[18,] "လူနာ (၁)"              "ဦး၊ ရှမ်းပြည်နယ် (မြောက်ပိုင်း)၊"                 "လူနာ ([၀-၉]+)"
[19,] "ပြောင်းရွှေ့လူနာ (၁)"      "ဦး၊ စုစုပေါင်း (၁၄)"                         "လူနာ ([၀-၉]+)"
[20,] "လူနာ (၁)"              "ဦးအား အထူးကြပ်မတ် ခန်း၌"                      "လူနာ ([၀-၉]+)"

Filtering the confirmed cases

We filter the output in order to get only the results that contain the term for laboratory confirmed infections. We identified three different groups of wordings for the same term, which are combined into a single regex pattern (q123).

q1 <- "ဓာတ်ခွဲ အတည်ပြုလူနာ \\([၀-၉]+\\)"
q2 <- "ဓာတ်ခွဲအတည်ပြုလူနာ \\([၀-၉]+\\)"
q3 <- "ဓာတ်ခွဲအတည်ပြု လူနာ \\([၀-၉]+\\)"
q123 <- paste0(c(q1,q2,q3), collapse = "|")
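
Because the three wordings differ only in where the spaces fall, the same filter could also be written with optional spaces. Here is a sketch on made-up rows. The objects w1, w2, w3, num, rows and q_opt are illustrative and not part of the original script, the digit range is written with \u escapes, and “XYZ” is a placeholder for a different kind of patient.

```r
# Building blocks of the term, and a bracketed count in Myanmar digits
w1 <- "ဓာတ်ခွဲ"; w2 <- "အတည်ပြု"; w3 <- "လူနာ"
num <- " \\([\u1040-\u1049]+\\)"

# The three wordings joined by "|", and one pattern with optional spaces
q123  <- paste0(c(paste0(w1, " ", w2, w3, num),
                  paste0(w1, w2, w3, num),
                  paste0(w1, w2, " ", w3, num)), collapse = "|")
q_opt <- paste0(w1, " ?", w2, " ?", w3, num)

# Made-up rows: the three wordings plus one row that should not match
rows <- c(paste0(w1, " ", w2, w3, " (\u1041\u1045\u1040)"),
          paste0(w1, w2, w3, " (\u1041\u1044\u1049)"),
          paste0(w1, w2, " ", w3, " (\u1041\u1042\u1047)"),
          paste0("XYZ", w3, " (\u1042\u1049)"))

grepl(q123, rows)
grepl(q_opt, rows)
```

The optional-space version is slightly broader, since it would also accept a wording with both spaces present.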

Filtering for the desired results, we obtain 53 lines of output, which are quite easy to understand.

tsel <- paste(cPatients$pre,cPatients$keyword) %>%
  grepl(q123, .)   # %>% cPatients[.,]
utf8::utf8_print(sapply(cPatients[tsel,], paste))

      docname                 from  to    pre                                             
 [1,] "(30-4-2020, 8:00 PM)"  "27"  "28"  "COVID-19 ရောဂါ ဓာတ်ခွဲ"                           
 [2,] "(30-4-2020, 8:00 PM)"  "79"  "80"  "ဦး ရှိပါသည်။၆။ ယနေ့အထိ"                             
 [3,] "(30-4-2020, 7:00 AM)"  "25"  "26"  "COVID-19 ရောဂါ ဓာတ်ခွဲအတည်ပြု"                      
 [4,] "(29-4-2020, 8:00PM)"   "37"  "38"  "COVID-19 ရောဂါ ဓာတ်ခွဲအတည်ပြု"                      
 [5,] "(28-4-2020, 8:00 PM)"  "32"  "33"  "နာရီ အထိ မြန်မာနိုင်ငံတွင်"                             
 [6,] "(27-4-2020, 8:00 PM)"  "21"  "22"  "ည (၈:၀၀)နာရီအထိ မြန်မာနိုင်ငံတွင်"                      
 [7,] "(26-4-2020, 8:00 pm)"  "31"  "32"  "ည (၈:၀၀)နာရီအထိ မြန်မာနိုင်ငံတွင်"                      
 [8,] "(25-4-2020, 8:00 pm)"  "32"  "33"  "ည (၈:၀၀)နာရီအထိ မြန်မာနိုင်ငံတွင်"                      
 [9,] "(25-4-2020, 10:00 am)" "10"  "11"  "နာရီအချိန်အထိ မြန်မာနိုင်ငံတွင် COVID-19"                 
[10,] "(24-4-2020, 9:00 am)"  "35"  "36"  "အချိန်အထိ COVID-19 ရောဂါ"                         
[11,] "(24-4-2020, 00:30 am)" "23"  "24"  "နံနက် (၀၀:၃၀)နာရီအထိ မြန်မာနိုင်ငံတွင်"                   
[12,] "(23-4-2020, 8:00 pm)"  "34"  "35"  "ည (၈:၀၀)နာရီအထိ မြန်မာနိုင်ငံတွင်"                      
[13,] "(23-4-2020, 10:00 am)" "8"   "9"   "(၁၀:၀၀)နာရီအချိန်ထိ မြန်မာနိုင်ငံတွင် COVID-19"           
[14,] "(23-4-2020, 10:00 am)" "11"  "12"  "ရောဂါဓာတ်ခွဲအတည်ပြုလူနာ (၁၂၇)ဦး တွေ့ရှိခဲ့ပြီး"             
[15,] "(23-4-2020, 7:00 AM)"  "35"  "36"  "နာရီအချိန်အထိ မြန်မာနိုင်ငံတွင် ဓာတ်ခွဲအတည်ပြု"                
[16,] "(22-4-2020, 8:00 PM)"  "35"  "36"  "နာရီအချိန်အထိ မြန်မာနိုင်ငံတွင် ဓာတ်ခွဲအတည်ပြု"                
[17,] "(22-4-2020, 10:00 AM)" "9"   "10"  "နာရီအချိန်အထိ မြန်မာနိုင်ငံတွင် COVID-19"                 
[18,] "(21-4-2020, 10:00PM)"  "22"  "23"  "နာရီအချိန်အထိ မြန်မာနိုင်ငံတွင် ဓာတ်ခွဲအတည်ပြု"                
[19,] "(21-4-2020, 8:00PM)"   "42"  "43"  "နာရီအချိန်အထိ မြန်မာနိုင်ငံတွင် ဓာတ်ခွဲအတည်ပြု"                
[20,] "(20-4-2020, 10:00 AM)" "39"  "40"  "နာရီအချိန်အထိ မြန်မာနိုင်ငံတွင် COVID-19"                 
[21,] "(18-4-2020, 8:00 pm)"  "26"  "27"  "ရောဂါပိုးမတွေ့ရှိ (မတွေ့ရှိ)ပါ။၂။ ယခုအထိ"                  
[22,] "(18-4-2020, 8:00 pm)"  "37"  "38"  "ဆင်းခွင့်ရရှိခဲ့ပြီးဖြစ်ပါသည်။၃။ COVID-19 ရောဂါ"          
[23,] "(18-4-2020, 10:00 am)" "89"  "90"  "နာရီအချိန်အထိ မြန်မာနိုင်ငံတွင် COVID-19"                 
[24,] "(18-4-2020, 8:00 am)"  "69"  "70"  "ယခုအချိန်အထိ COVID-19 ရောဂါ"                       
[25,] "(17-4-2020, 8:00 pm)"  "29"  "30"  "တွေ့ရှိခဲ့ရပါသည်။၂။ COVID-19 ရောဂါ"                   
[26,] "(17-4-2020, 10:00 am)" "55"  "56"  "တွေ့ရှိခဲ့ရပါသည်။၃။ COVID-19 ရောဂါ"                   
[27,] "(16-4-2020, 8:00 pm)"  "28"  "29"  "တွေ့ရှိခဲ့ရပါသည်။၂။ COVID-19 ရောဂါ"                   
[28,] "(16-4-2020, 10:00 AM)" "74"  "75"  "ယခုအချိန်အထိ COVID-19 ရောဂါ"                       
[29,] "(15-4-2020, 8:00 pm)"  "27"  "28"  "တွေ့ရှိခဲ့ရပါသည်။၂။ COVID-19 ရောဂါ"                   
[30,] "(15-4-2020, 10:30 AM)" "49"  "50"  "နာရီအချိန်အထိ မြန်မာနိုင်ငံတွင် COVID-19"                 
[31,] "(14-4-2020, 8:00 pm)"  "64"  "65"  "ယခုအချိန်အထိ COVID-19 ရောဂါ"                       
[32,] "(14-4-2020, 4:00 pm)"  "31"  "32"  "ဖော်ပြခဲ့ပြီး ဖြစ်ပါသည်။၂။ အဆိုပါ"                     
[33,] "(13-4-2020, 00:30 am)" "173" "174" "မြန်မာနိုင်ငံတွင် COVID-19 ရောဂါ"                     
[34,] "(12-4-2020, 8:00 pm)"  "124" "125" "မြန်မာနိုင်ငံတွင် COVID-19 ရောဂါ"                     
[35,] "(12-4-2020, 2:00 AM)"  "281" "282" "ရာဇဝင်ရှိကြောင်း သိရှိရပါသည်။ ထပ်မံတွေ့ရှိရသော"              
[36,] "(12-4-2020, 2:00 AM)"  "299" "300" "မြန်မာနိုင်ငံတွင် COVID-19 ရောဂါ"                     
[37,] "(11-4-2020, 8:00PM)"   "153" "154" "မြန်မာနိုင်ငံတွင် COVID-19 ရောဂါ"                     
[38,] "(10-4-2020, 10:15PM)"  "94"  "95"  "တွင် COVID-19 ရောဂါ"                             
[39,] "(10-4-2020, 8:00PM)"   "116" "117" "မြန်မာနိုင်ငံတွင် COVID-19 ရောဂါ"                     
[40,] "(10-4-2020, 3:00AM)"   "227" "228" "မြန်မာနိုင်ငံတွင် COVID-19 ရောဂါ"                     
[41,] "(9-4-2020, 8:00 PM)"   "51"  "52"  "မတွေ့ရှိရဘဲ COVID-19 ရောဂါ"                         
[42,] "(9-4-2020, 8:00 PM)"   "304" "305" "မြန်မာနိုင်ငံတွင် COVID-19 ရောဂါ"                     
[43,] "(8-4-2020, 8:00 PM)"   "197" "198" "မြန်မာနိုင်ငံတွင် COVID-19 ရောဂါ"                     
[44,] "(6-4-2020, 8:00PM)"    "126" "127" "ကောင်းမွန်လျက်ရှိပါသည်။၅။ COVID-19 ရောဂါ"             
[45,] "(6-4-2020, 8:00PM)"    "175" "176" "မြန်မာနိုင်ငံတွင် COVID-19 ရောဂါ"                     
[46,] "(5-4-2020, 8:00 PM)"   "79"  "80"  "တွေ့ရှိရပါသည်။၄။ COVID-19 ရောဂါ"                    
[47,] "(5-4-2020, 8:00 PM)"   "139" "140" "မြန်မာနိုင်ငံတွင် COVID-19 ရောဂါ"                     
[48,] "(4-4-2020, 8:00 PM)"   "153" "154" "ကျန်းမာရေးအခြေအနေမှာ ကောင်းမွန်လျက်ရှိပါသည်။၅။ COVID-19"
[49,] "(4-4-2020, 8:00 PM)"   "214" "215" "ရက်နေ့အထိ မြန်မာနိုင်ငံတွင် COVID-19"                    
[50,] "(3-4-2020, 8:00 PM)"   "82"  "83"  "၄။ COVID-19 ရောဂါ"                             
[51,] "(3-4-2020, 8:00 PM)"   "145" "146" "မြန်မာနိုင်ငံတွင် COVID-19 ရောဂါ"                     
[52,] "(2-4-2020, 8:00PM)"    "117" "118" "ဖြစ်ပါသည်။၅။ COVID-19 ရောဂါ"                     
[53,] "(1-4-2020, 8:00PM)"    "136" "137" "ပေးလျက်ရှိပါသည်။၅။ COVID-19 ရောဂါ"                 
      keyword                     post                                               pattern       
 [1,] "အတည်ပြုလူနာ (၁၅၀)"            "ဦးရှိပြီဖြစ်ပါသည်။၃။ (၃၀-၄-၂၀၂၀) ရက်နေ့တွင်"               "လူနာ ([၀-၉]+)"
 [2,] "ဓာတ်ခွဲအတည်ပြုလူနာ (၁၂၇)"        "ဦးအား သက်ဆိုင်ရာဆေးရုံများတွင် ဆေးကုသမှု"                    "လူနာ ([၀-၉]+)"
 [3,] "လူနာ (၁၅၀)"                 "ဦးရှိပြီဖြစ်ပါသည်။၃။ (၂၉-၄-၂၀၂၀) ရက်နေ့အတွက်"              "လူနာ ([၀-၉]+)"
 [4,] "လူနာ (၁၅၀)"                 "ဦးရှိပြီဖြစ်ပါသည်။၃။ NHL အနေဖြင့်"                       "လူနာ ([၀-၉]+)"
 [5,] "ဓာတ်ခွဲအတည်ပြုလူနာ (၁၄၉)ဦး"      "ရှိပြီဖြစ်ပါသည်။၃။ ဓာတ်ခွဲနမူနာများတွင် (၂)"                 "လူနာ ([၀-၉]+)"
 [6,] "ဓာတ်ခွဲအတည်ပြုလူနာ (၁၄၆)"        "ဦး ရှိပြီဖြစ်ပါသည်။၃။ ယခုအခါ"                           "လူနာ ([၀-၉]+)"
 [7,] "ဓာတ်ခွဲအတည်ပြုလူနာ (၁၄၆)"        "ဦးရှိပြီဖြစ်ပါသည်။၃။ ယနေ့အထိ ဓာတ်ခွဲအတည်ပြုလူနာ"               "လူနာ ([၀-၉]+)"
 [8,] "ဓာတ်ခွဲအတည်ပြုလူနာ (၁၄၄)"        "ဦး ရှိပြီဖြစ်ပါသည်။၃။ ပြည်သူ့ဆေးရုံကြီး၊"                    "လူနာ ([၀-၉]+)"
 [9,] "ရောဂါဓာတ်ခွဲအတည်ပြုလူနာ (၁၄၄)"   "ဦး တွေ့ရှိခဲ့ပြီး သေဆုံးလူနာ"                               "လူနာ ([၀-၉]+)"
[10,] "ဓာတ်ခွဲအတည်ပြုလူနာ (၁၃၉)"        "ဦးရှိပြီဖြစ်ပါသည်။၃။ အဆိုပါလူနာသစ်များအား သက်ဆိုင်ရာဆေးရုံများသို့" "လူနာ ([၀-၉]+)"
[11,] "ဓာတ်ခွဲအတည်ပြုလူနာ (၁၃၂)"        "ဦး ရှိပြီဖြစ်ပါသည်။"                                   "လူနာ ([၀-၉]+)"
[12,] "ဓာတ်ခွဲအတည်ပြုလူနာ (၁၃၂)"        "ဦး ရှိပြီဖြစ်ပါသည်။၃။ ယခုအခါ"                           "လူနာ ([၀-၉]+)"
[13,] "ရောဂါဓာတ်ခွဲအတည်ပြုလူနာ (၁၂၇)ဦး" "တွေ့ရှိခဲ့ပြီး သေဆုံးလူနာ (၅)"                              "လူနာ ([၀-၉]+)"
[14,] "သေဆုံးလူနာ (၅)"               "ဦးနှင့် ရောဂါသက်သာ၍ ဆေးရုံမှ"                            "လူနာ ([၀-၉]+)"
[15,] "လူနာ (၁၂၇)ဦး"               "ရှိပြီ ဖြစ်ပါသည်။၃။ (၂၂-၄-၂၀၂၀)"                       "လူနာ ([၀-၉]+)"
[16,] "လူနာ (၁၂၃)ဦးရှိပြီ"            "ဖြစ်ပါသည်။၃။ ယခုအခါ မြန်မာနိုင်ငံရှိ"                       "လူနာ ([၀-၉]+)"
[17,] "ရောဂါဓာတ်ခွဲအတည်ပြုလူနာ (၁၂၁)"   "ဦး တွေ့ရှိခဲ့ပြီး သေဆုံးလူနာ"                               "လူနာ ([၀-၉]+)"
[18,] "လူနာ (၁၂၁)ဦး"               "တွေ့ရှိခဲ့ပြီ ဖြစ်ပါသည်။၃။ ယနေ့ည"                           "လူနာ ([၀-၉]+)"
[19,] "လူနာ (၁၂၁)ဦးရှိပြီ"            "ဖြစ်ပါသည်။၄။ ယခုအခါ မြန်မာနိုင်ငံရှိ"                       "လူနာ ([၀-၉]+)"
[20,] "ရောဂါဓာတ်ခွဲအတည်ပြုလူနာ (၁၁၁)"   "ဦး တွေ့ရှိခဲ့ပြီး သေဆုံးလူနာ"                               "လူနာ ([၀-၉]+)"
[21,] "ဓာတ်ခွဲအတည်ပြုလူနာ (၉၆)"         "ဦးအနက် (၅) ဦးမှာသေဆုံးခဲ့ပြီး၊"                          "လူနာ ([၀-၉]+)"
[22,] "ဓာတ်ခွဲအတည်ပြုလူနာ (၈၄)ဦးအား"    "အောက်ပါဇယားအတိုင်း သက်ဆိုင်ရာ ဆေးရုံများတွင်"                 "လူနာ ([၀-၉]+)"
[23,] "ရောဂါဓာတ်ခွဲအတည်ပြုလူနာ (၉၄)"    "ဦး တွေ့ရှိခဲ့ပြီးဖြစ်ပါသည်။၅။ (၁၇-၄-၂၀၂၀)ရက်နေ့တွင်"           "လူနာ ([၀-၉]+)"
[24,] "ဓာတ်ခွဲအတည်ပြုလူနာ (၉၄)"         "ဦးရှိပြီဖြစ်ပါသည်။၅။ (၁၇-၄-၂၀၂၀)ရက်နေ့တွင် ဓာတ်ခွဲအတည်ပြုခဲ့သည့်"   "လူနာ ([၀-၉]+)"
[25,] "ဓာတ်ခွဲအတည်ပြုလူနာ (၇၉)ဦးအား"    "အောက်ပါဇယားအတိုင်း သက်ဆိုင်ရာ ဆေးရုံများတွင်"                 "လူနာ ([၀-၉]+)"
[26,] "ဓာတ်ခွဲအတည်ပြုလူနာ (၈၅)ဦး"       "တွေ့ရှိခဲ့သည့်အနက် (၇၉)ဦးအား အောက်ပါ"                       "လူနာ ([၀-၉]+)"
[27,] "ဓာတ်ခွဲအတည်ပြုလူနာ (၇၉)ဦးအား"    "အောက်ပါဇယားအတိုင်း သက်ဆိုင်ရာ ဆေးရုံများတွင်"                 "လူနာ ([၀-၉]+)"
[28,] "ဓာတ်ခွဲအတည်ပြုလူနာ (၈၅)"         "ဦးရှိပြီ ဖြစ်ပြီး ဓာတ်ခွဲအတည်ပြုလူနာသစ်"                      "လူနာ ([၀-၉]+)"
[29,] "ဓာတ်ခွဲအတည်ပြုလူနာ (၆၈)ဦးအား"    "အောက်ပါဇယားအတိုင်း သက်ဆိုင်ရာဆေးရုံများတွင် လိုအပ်သလို"            "လူနာ ([၀-၉]+)"
[30,] "ရောဂါဓာတ်ခွဲအတည်ပြုလူနာ (၇၄)"    "ဦး တွေ့ရှိခဲ့ပြီးဖြစ်ပါသည်။၄။ အဆိုပါ"                        "လူနာ ([၀-၉]+)"
[31,] "ဓာတ်ခွဲအတည်ပြုလူနာ (၆၃)ဦး"       "တွေ့ရှိပြီးဖြစ်ကာ ၎င်းတို့အနက် -•"                           "လူနာ ([၀-၉]+)"
[32,] "ဓာတ်ခွဲအတည်ပြုလူနာ (၂၁)"         "ဦးနှင့် ပတ်သက်သည့် သတင်းအချက်အလက်များကို"                     "လူနာ ([၀-၉]+)"
[33,] "ဓာတ်ခွဲအတည်ပြုလူနာ (၄၁)"         "ဦး တွေ့ရှိခဲ့ပြီး ဖြစ်ပါသည်။၅။"                            "လူနာ ([၀-၉]+)"
[34,] "ဓာတ်ခွဲအတည်ပြုလူနာ (၃၉)ဦး"       "တွေ့ရှိခဲ့ပြီး ဖြစ်ပါသည်။၄။ သို့ဖြစ်ရာ"                        "လူနာ ([၀-၉]+)"
[35,] "ဓာတ်ခွဲအတည်ပြုလူနာ (၇)ဦး၏"       "ကျန်းမာရေး အခြေအနေမှာ ကောင်းမွန်လျက်ရှိပြီး"               "လူနာ ([၀-၉]+)"
[36,] "ဓာတ်ခွဲအတည်ပြုလူနာ (၃၈)"         "ဦး တွေ့ရှိခဲ့ပြီး ဖြစ်ပါသည်။၁၁။"                           "လူနာ ([၀-၉]+)"
[37,] "ဓာတ်ခွဲအတည်ပြုလူနာ (၃၁)"         "ဦး တွေ့ရှိခဲ့ပြီး ဖြစ်ပါသည်။၆။"                            "လူနာ ([၀-၉]+)"
[38,] "ဓာတ်ခွဲအတည်ပြုလူနာ (၂၈)"         "ဦး တွေ့ရှိခဲ့ပြီး ဖြစ်ပါသည်။၄။"                            "လူနာ ([၀-၉]+)"
[39,] "ဓာတ်ခွဲအတည်ပြုလူနာ (၂၇)ဦး"       "တွေ့ရှိလာရပြီဖြစ်ပါသည်။ သို့ဖြစ်ပါ၍ ပြည်သူများ"                "လူနာ ([၀-၉]+)"
[40,] "ဓာတ်ခွဲအတည်ပြုလူနာ (၂၇)ဦး"       "တွေ့ရှိခဲ့ပြီး ဖြစ်ပါသည်။၇။ သို့ဖြစ်ပါ၍"                       "လူနာ ([၀-၉]+)"
[41,] "ဓာတ်ခွဲအတည်ပြုလူနာ (၂၂)ဦး"       "တွေ့ရှိခဲ့ပါသည်။၃။ (၈-၄-၂၀၂၀)ရက်နေ့၊ ညနေ(၆)နာရီမှ"           "လူနာ ([၀-၉]+)"
[42,] "ဓာတ်ခွဲအတည်ပြုလူနာ (၂၃)"         "ဦး တွေ့ရှိလာခဲ့ရပြီဖြစ်ပြီး ၎င်းလူနာများအနက်"                 "လူနာ ([၀-၉]+)"
[43,] "ဓာတ်ခွဲအတည်ပြုလူနာ (၂၂)ဦး"       "တွေ့ရှိလာရပြီဖြစ်ပါသည်။ သို့ဖြစ်ပါ၍ ပြည်သူများအနေဖြင့်"          "လူနာ ([၀-၉]+)"
[44,] "ဓာတ်ခွဲအတည်ပြုလူနာ (၂၁)ဦးအနက်"    "ချင်းပြည်နယ်၊ တီးတိန် ပြည်သူ့"                             "လူနာ ([၀-၉]+)"
[45,] "ဓာတ်ခွဲအတည်ပြုလူနာ (၂၂)ဦး"       "တွေ့ရှိလာရပြီဖြစ်ပါသည်။ သို့ဖြစ်ပါ၍ ပြည်သူများအနေဖြင့်"          "လူနာ ([၀-၉]+)"
[46,] "ဓာတ်ခွဲအတည်ပြုလူနာ (၂၀)"         "ဦးအနက် ချင်းပြည်နယ်၊ တီးတိန်"                            "လူနာ ([၀-၉]+)"
[47,] "ဓာတ်ခွဲအတည်ပြုလူနာ (၂၁)"         "ဦး တွေ့ရှိလာရပြီဖြစ်ပါသည်။ သို့ဖြစ်ပါ၍"                      "လူနာ ([၀-၉]+)"
[48,] "ဓာတ်ခွဲအတည်ပြုလူနာ (၂၀)"         "ဦးအနက် ချင်းပြည်နယ်၊ တီးတိန်"                            "လူနာ ([၀-၉]+)"
[49,] "ဓာတ်ခွဲအတည်ပြုလူနာ (၂၁)"         "ဦး တွေ့ရှိလာရပြီဖြစ်ပါသည်။ သို့ဖြစ်ပါ၍"                      "လူနာ ([၀-၉]+)"
[50,] "ဓာတ်ခွဲအတည်ပြုလူနာ (၁၉)"         "ဦးအနက် ချင်းပြည်နယ်၊ တီးတိန်ပြည်သူ့"                        "လူနာ ([၀-၉]+)"
[51,] "ဓာတ်ခွဲအတည်ပြုလူနာ (၂၀)"         "ဦး တွေ့ရှိလာရပြီဖြစ်ပါသည်။ သို့ဖြစ်ပါ၍"                      "လူနာ ([၀-၉]+)"
[52,] "ဓာတ်ခွဲအတည်ပြုလူနာ (၁၉)"         "ဦးအနက် ချင်းပြည်နယ်၊ တီးတိန်ပြည်သူ့"                        "လူနာ ([၀-၉]+)"
[53,] "ဓာတ်ခွဲအတည်ပြုလူနာ (၁၅)"         "ဦးအနက် ချင်းပြည်နယ်၊ တီးတိန်"                            "လူနာ ([၀-၉]+)"

Friendly nudge

Now I can see that the kwic() function would be quite handy for extracting other useful information from the COVID-19 surveillance reports. Here, in my own small way, I’m happy to learn that existing NLP tools can be applied in the context of the Myanmar language. And kudos to my fellow dummies for lying low at home in this social distancing period! I would be happier indeed to see you go much, much further!