Wednesday, January 27, 2021

My first use of the “tesseract” OCR

 
I had heard some time ago that “tesseract” is a powerful OCR engine that now supports more than 100 languages out of the box, including the Myanmar language. But I have been using Google Docs for OCR for some time and have found it quite dependable, despite the inconvenience of its online interface and the limits on the size of the input. What kept me from using tesseract back then was that the Myanmar language wasn’t supported at the time.

Now, after talking with my son, who has been experimenting with tesseract from Python, I decided to play with it myself. The preparation was quite easy: I looked for an implementation of tesseract in R, found the “tesseract” package, and installed it.

> library(tesseract)
First use of Tesseract: copying language data...

Warning message:
package ‘tesseract’ was built under R version 4.0.3 

Find out what languages are supported:

> tesseract_info()
$datapath
[1] "C:\\Users\\mtnn\\AppData\\Local\\tesseract4\\tesseract4\\tessdata/"

$available
[1] "eng" "osd"

$version
[1] "4.1.0"

$configs
 [1] "alto"             "ambigs.train"     "api_config"       "bigram"          
 [5] "box.train"        "box.train.stderr" "digits"           "get.images"      
 [9] "hocr"             "inter"            "kannada"          "linebox"         
[13] "logfile"          "lstm.train"       "lstmbox"          "lstmdebug"       
[17] "makebox"          "pdf"              "quiet"            "rebox"           
[21] "strokewidth"      "tsv"              "txt"              "unlv"            
[25] "wordstrbox"      

The default installation of tesseract doesn’t include the Myanmar language, so I download its language data and check the supported languages again:

> tesseract_download("mya")
 Downloaded: 4.43 MB  (100%)
[1] "C:\\Users\\mtnn\\AppData\\Local\\tesseract4\\tesseract4\\tessdata/mya.traineddata"
> tesseract_info()
$datapath
[1] "C:\\Users\\mtnn\\AppData\\Local\\tesseract4\\tesseract4\\tessdata/"

$available
[1] "eng" "mya" "osd"

$version
[1] "4.1.0"

$configs
 [1] "alto"             "ambigs.train"    
 [3] "api_config"       "bigram"          
 [5] "box.train"        "box.train.stderr"
 [7] "digits"           "get.images"      
 [9] "hocr"             "inter"           
[11] "kannada"          "linebox"         
[13] "logfile"          "lstm.train"      
[15] "lstmbox"          "lstmdebug"       
[17] "makebox"          "pdf"             
[19] "quiet"            "rebox"           
[21] "strokewidth"      "tsv"             
[23] "txt"              "unlv"            
[25] "wordstrbox"      
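
The check-then-download step above can be wrapped so that the language data is fetched only when it is missing. This is a minimal sketch: ensure_language() is a hypothetical helper name of my own, not part of the tesseract package; it relies only on tesseract_info() and tesseract_download(), which appear in the session above.

```r
# Hypothetical helper (not part of the tesseract package): download the
# trained data for `lang` only when it is not already listed as available.
# `download` is a parameter so the logic can be exercised without network access.
ensure_language <- function(lang, available,
                            download = function(l) tesseract::tesseract_download(l)) {
  if (lang %in% available) {
    return(FALSE)  # already installed, nothing to do
  }
  download(lang)
  TRUE             # a download was attempted
}

# Only touch the network in an interactive session with the package installed.
if (interactive() && requireNamespace("tesseract", quietly = TRUE)) {
  ensure_language("mya", tesseract::tesseract_info()$available)
}
```

The function returns TRUE when it had to download and FALSE otherwise, so a script can be re-run without fetching the same file again.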

Now I run tesseract on an image file containing an excerpt from the foreword to one of the books by Dr. Than Tun, the well-known Myanmar historian. The code and its result are:

myan <- tesseract("mya")
text <- ocr("dTT-excerpt.png", engine = myan)
cat(text)
အချုပ်အားဖြင့် မျိုးချစ်စိတ်အကဲအပိုနှင့် မိမိရာဇဝင်ကိုဖြစ်စေး သူတပါးရာဇဝင်ကိုဖြစ်စေ,
မရေးသင့်ဟု ဆိုလိုပါသည်; ဒုတိယ နိဂုံးအနေဖြင့်လည်း မြန်မာရာဇဝင်ကို မြန်မာများကပင် အကဲ
အပိုမပါစေဘဲ ရေးသားရွ် ကမ္ဘာကို တင်ပြချိန်ရောက်ပြီဟု ဆိုချင်ပါသည်1 တတိယ နိဂုံးအနေဖြင့်
နု ကလမ သေက် မန ၂ ဆယ့်ရြ (မ ာဆိးကာ က္ကာ
အဂလပၢၤ ကုလား စသညတ္ရက မဟုတ မမှန လည်ဆယ်ရ် မြန်မာအဆိုးဟု ရေးလေသညတ္ရကုလညး
င္ကူ်င ငန မျ ကြ
ဖေါ် ထုတ် ပြင်ဆင်ရှ် ဟုတ်သလောက်မူ ဝန်ခံံပါဟု တိုက်တွန်းလိုပါသည်#

You can see quite a lot of errors in comparison with the input data below:

Now I run OCR on the same text image with Google Docs. The result is perfect, as can be seen in the result saved as a PDF file.
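
Since the Google Docs result is accurate, it could serve as a reference text for putting a number on tesseract’s errors instead of judging them by eye. Below is a rough sketch of a character error rate (edit distance divided by the length of the reference), using adist() from base R. The function name char_error_rate() and the reference file name in the comment are my own assumptions. It counts Unicode code points rather than Myanmar syllables, so it is only a rough yardstick.

```r
# Character error rate: Levenshtein distance between the OCR output and a
# trusted reference text, divided by the number of characters in the reference.
# Assumes the reference is not empty.
char_error_rate <- function(ocr_text, reference) {
  # Collapse all runs of whitespace so that line breaks do not count as errors.
  squish <- function(x) trimws(gsub("\\s+", " ", x))
  ocr_text  <- squish(paste(ocr_text, collapse = " "))
  reference <- squish(paste(reference, collapse = " "))
  distance <- utils::adist(ocr_text, reference)[1, 1]
  distance / nchar(reference, type = "chars")
}

# Possible use with the objects from this post; the reference file name is
# hypothetical and would hold the corrected text saved as UTF-8:
#   reference <- readLines("dTT-excerpt-reference.txt", encoding = "UTF-8")
#   char_error_rate(text, reference)
```

A value of 0 means the two texts match after whitespace is normalised; larger values mean more insertions, deletions, and substitutions are needed to turn one text into the other.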

Now that I have a new toy to play with, my immediate task is to find ways to improve the out-of-the-box performance of tesseract’s OCR for the Myanmar language. I guess a lot of enthusiasts have already trodden this path, so this old boy will surely play along cheerfully!
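
One commonly suggested first step is to clean up the image before handing it to tesseract. The sketch below assumes the “magick” package is installed and follows the kind of preprocessing shown in the tesseract package’s own documentation (enlarge, convert to grayscale, trim the margins). The function name preprocess_for_ocr() and the parameter values are my own choices, and I have not verified that they help with Myanmar text.

```r
# Hypothetical helper: enlarge, convert to grayscale, and trim an image, then
# return it as raw PNG data that tesseract::ocr() can read directly.
preprocess_for_ocr <- function(path, width = "2000x", fuzz = 40) {
  img <- magick::image_read(path)
  img <- magick::image_resize(img, width)                 # scale to about 2000 px wide
  img <- magick::image_convert(img, type = "Grayscale")   # drop colour information
  img <- magick::image_trim(img, fuzz = fuzz)             # cut away plain borders
  magick::image_write(img, format = "png", density = "300x300")
}

# Run only when the image, both packages, and the Myanmar data are present.
image_path <- "dTT-excerpt.png"
if (file.exists(image_path) &&
    requireNamespace("magick", quietly = TRUE) &&
    requireNamespace("tesseract", quietly = TRUE) &&
    "mya" %in% tesseract::tesseract_info()$available) {
  myan <- tesseract::tesseract("mya")
  cat(tesseract::ocr(preprocess_for_ocr(image_path), engine = myan))
}
```

Comparing the output with and without preprocessing on the same excerpt would show whether these steps make any difference for this particular image.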