Tuesday, August 25, 2020

Making Myanmar Spelling Book searchable- II


As I hinted at the end of my previous post, the neat idea was to download the result of OCR with Google Docs as Plain Text (.txt) instead of OpenDocument (.odt) as I used to do.
I also thought I would not need a fancy process like slicing up the image in GIMP to get single-column text images, because Google Docs OCR seemed able to handle two columns well. Yet when I tried out that idea, I found that the right-column text was not always placed neatly below the entire left-column text. So I had to go back to working with the sliced left and right columns as mentioned in my previous post!
So my final workflow is:
  1. Extract pages 1-285 of the text (using PDFsam Basic software).
  2. Open the resulting pdf file in GIMP using the Open as Layers… option in the File menu. This opens all 285 pages of the pdf as graphic layers.
  3. Cut the image layers into left-column and right-column images using the Slice Using Guides option in the Image menu of GIMP.
  4. With the Export Layers plug-in for GIMP installed, the File menu has additional options. Select the Export Layers option and export the layers as 570 separate pdf files.
  5. Merge the resulting pdf files into 8 pdf files (using PDFsam Basic software).
  6. Open Google Drive; import the pdf files; open each in Google Docs, which converts the text images to editable text. Download each as a Plain Text (.txt) file.
  7. Open the text files in LibreOffice Writer and change the font to Pyidaungsu (a Myanmar Unicode font). Merge them all into a single odt file and convert it to a pdf file with Writer.
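The merging in step 7 was done by hand in Writer, but the downloaded text files could also be joined with a short script before opening the result in Writer. The following is only a minimal sketch in Python, not what I actually did. The folder and file names are hypothetical, and it assumes the files are numbered so that a natural sort gives the right reading order. Whether left and right columns end up in the correct sequence depends on how the exported layers were named, so the order should be checked by eye.

```python
import re
from pathlib import Path


def natural_key(path):
    """Sort key that orders 'part10.txt' after 'part2.txt'."""
    # re.split with a capturing group puts the digit runs at odd indices.
    tokens = re.split(r"([0-9]+)", path.name)
    return [int(t) if i % 2 else t.lower() for i, t in enumerate(tokens)]


def merge_text_files(src_dir, out_file):
    """Concatenate every .txt file in src_dir into out_file, in natural order."""
    out_path = Path(out_file).resolve()
    files = sorted(
        (p for p in Path(src_dir).glob("*.txt") if p.resolve() != out_path),
        key=natural_key,
    )
    # "utf-8-sig" drops a leading byte-order mark if a file has one
    # and reads plain UTF-8 unchanged.
    parts = [p.read_text(encoding="utf-8-sig").rstrip("\n") for p in files]
    out_path.write_text("\n".join(parts) + "\n", encoding="utf-8")
    return [p.name for p in files]
```

A call such as merge_text_files("ocr_txt", "spelling_book.txt") (both names made up for illustration) returns the list of file names in the order they were merged, which makes it easy to confirm the sequence.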
The following shows the first page of this spelling book pdf file with one error marked:
The OCR process seems to be quite good, but I can’t recommend the output for use without some hard editing to eliminate OCR errors. I tested the text searchability and it doesn’t seem to have serious problems. However, when I tried to find “ကြွက်” (mouse in our language), it could not be found. When I manually located that word in the file, then copied and pasted it into the console, I could see that there was a space in the string, between the medial ra (U+103C) and the medial wa (U+103D). See it in the first two lines of the box below. The next two lines show the correct string.
Then when I input the string with the space, the search worked! It seems there’s some problem with Google OCR, because it was reading from a fully formed and clear image of that word. However, when I tried running the same search on my Android cell phone there wasn’t any problem!
> "ကြ ွက်"
[1] "က\u103c \u103dက\u103a"
> "ကြွက်" 
[1] "က\u103c\u103dက\u103a"
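A stray space like the one above could be cleaned up in bulk rather than word by word. The following is a minimal sketch in Python, not part of my actual workflow. It assumes the text is in standard Unicode ordering (not Zawgyi), where the dependent signs U+102B to U+103E always attach to a preceding consonant, so a space directly in front of one of them is treated as an OCR artifact. The function names are my own, for illustration.

```python
import re

# Myanmar dependent signs: vowel signs, anusvara, dot below, visarga,
# virama, asat and the medials (U+102B through U+103E).
DEPENDENT_SIGNS = "\u102b-\u103e"

# One or more ordinary spaces that are immediately followed by a dependent sign.
STRAY_SPACE = re.compile("[ ]+(?=[" + DEPENDENT_SIGNS + "])")


def remove_stray_spaces(text):
    """Delete spaces that sit directly in front of a dependent sign."""
    return STRAY_SPACE.sub("", text)


def codepoints(text):
    """Show a string as a list of code points, e.g. 'U+1000 U+103C'."""
    return " ".join(f"U+{ord(ch):04X}" for ch in text)


broken = "\u1000\u103c \u103d\u1000\u103a"   # the OCR output, with the space
correct = "\u1000\u103c\u103d\u1000\u103a"   # the word as it should be

fixed = remove_stray_spaces(broken)
print(codepoints(broken))
print(codepoints(fixed))
print(fixed == correct)
```

Spaces between whole words are left alone, since they are followed by a consonant or independent vowel rather than by a dependent sign. Other kinds of OCR errors would still need to be found by proofreading.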
As for the looks, the output is completely different from the original pages. But the new version need not look like the original (a sample page is shown below) so long as it serves our purpose. For that, all we need to do is make sure that OCR errors are identified completely and corrected.
