OCR of Sinhala text (to Unicode)

by Visvanath Ratnaweera -
Number of replies: 8

Dear Sandika

Could you try your imgocr or any other tool (specify!) to extract the text out of this image?

Compare the result with what somebody has done?

In reply to Visvanath Ratnaweera

OCR of Sinhala text (to Unicode)

by Sandika Madushan -
Dear Sir,

I tried several tools to extract the text, but they could not produce accurate results due to issues such as poor image quality, faded letters, and blurred or shadowed text.
In reply to Sandika Madushan

Re: OCR of Sinhala text (to Unicode)

by Visvanath Ratnaweera -
Dear Sandika

That was to be expected. The person did not send me the scan but a screenshot taken while viewing it on a laptop! Sorry for not checking. I will ask the person again.

This is not something urgent. I just want to know how well we can convert scanned Sinhala documents.
In reply to Visvanath Ratnaweera

Re: OCR of Sinhala text (to Unicode)

by Sandika Madushan -
Dear sir,
If the letters in the image are not faded or blurred and are clearly visible, we can extract the text.
In reply to Sandika Madushan

Re: OCR of Sinhala text (to Unicode)

by Visvanath Ratnaweera -
Dear Sandika

Here is an image of higher resolution: https://pavana.syndrega.ch/ 191213sunilwatha1912pub/19121336347o.JPG (link deliberately broken to avoid embedding). How successful is it this time?

P.S. This is just an investigation for a colleague, "unofficial", hence low priority. ;)
In reply to Visvanath Ratnaweera

Re: OCR of Sinhala text (to Unicode)

by Sandika Madushan -
Dear sir,
I extracted the Sinhala text from the image you provided using the I2OCR website. It's mostly accurate. However, as shown in the image below, some Sinhala characters could not be recognized by this tool.
 
I have attached the extracted text file without any edits.
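
For anyone who wants to locate such unrecognized symbols in an extracted text file without reading it line by line, here is a minimal Python sketch. It is not how I2OCR works; it only assumes that valid output should consist of characters from the Sinhala Unicode block (U+0D80 to U+0DFF), the zero-width joiners used in Sinhala conjuncts, whitespace, digits and punctuation. The function names `is_expected` and `find_suspects` are made up for this illustration.

```python
import unicodedata

# Sinhala Unicode block; everything else is checked against a small allow-list.
SINHALA_START, SINHALA_END = 0x0D80, 0x0DFF
ZERO_WIDTH = {"\u200c", "\u200d"}  # ZWNJ and ZWJ, used in Sinhala conjuncts


def is_expected(ch):
    """Return True if ch is plausible in clean Sinhala text (an assumption)."""
    if SINHALA_START <= ord(ch) <= SINHALA_END:
        return True
    if ch in ZERO_WIDTH or ch.isspace():
        return True
    category = unicodedata.category(ch)
    # Allow punctuation (P*) and decimal digits (Nd); flag everything else.
    return category.startswith("P") or category == "Nd"


def find_suspects(text):
    """List (position, character) pairs that fall outside the expected set."""
    return [(i, ch) for i, ch in enumerate(text) if not is_expected(ch)]
```

Calling `find_suspects` on the contents of the extracted file would give a list of positions to inspect by hand. A misread character that is still a valid Sinhala letter will not be flagged, so this can only narrow down the search, not replace proofreading.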
 
In reply to Sandika Madushan

Re: OCR of Sinhala text (to Unicode)

by Visvanath Ratnaweera -
Dear Sandika

Personally, I find the quality very high. How many manual corrections did you have to make?

Don't worry about the odd symbols you've highlighted. They are problems with the typesetting. Those were the days when the letters were pressed on lead blocks (like Lego), then "set" to make the page! (I was lucky to have visited the Lake House press as a schoolboy.)

Anyway, I will send the text back to its owner.
In reply to Visvanath Ratnaweera

Re: OCR of Sinhala text (to Unicode)

by Sandika Madushan -
Dear Sir,
I manually corrected the text. I had to make around 50 manual corrections. All of them were odd symbols as shown in the image above. I have also attached the corrected text file here.
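
A count like this can also be estimated automatically by comparing the raw and the corrected text. The following is a rough sketch using Python's standard `difflib` module; the function name `count_corrections` is hypothetical, and the number it returns is the count of changed regions, which may differ from the number of edits a person would count by hand.

```python
import difflib


def count_corrections(raw, corrected):
    """Count the changed regions between raw OCR output and corrected text."""
    # autojunk=False keeps difflib from ignoring frequent characters in long texts.
    matcher = difflib.SequenceMatcher(None, raw, corrected, autojunk=False)
    # Every opcode that is not "equal" is one replaced, deleted or inserted run.
    return sum(1 for op in matcher.get_opcodes() if op[0] != "equal")
```

The two arguments would be the contents of the unedited and the corrected text files, read as UTF-8 strings. For long documents it may be faster to compare line by line.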
In reply to Sandika Madushan

Re: OCR of Sinhala text (to Unicode)

by Visvanath Ratnaweera -
Dear Sandika

I should have expected that: one odd typesetting symbol usually means multiple corrections. I forwarded the edited version. Thanks!