A Post-Processing Scheme for Malayalam using Statistical Sub-character Language Models

Bibliographic Details
Main Author: Karthika Mohan and C. V. Jawahar
Format: Printed Book
Published: ACM 2010
Subjects:
Online Access: http://10.26.1.76/ks/005435.pdf
Description
Summary: Most Indian scripts do not have any robust commercial OCRs. Many laboratory prototypes report reasonable results at the recognition/classification stage. However, word-level accuracies are still poor. It is well known that word accuracy decreases as the number of characters in a word increases. For Malayalam, the average number of characters in a word is almost twice that of English. Moreover, the number of words required to cover 80% of the Malayalam language is more than forty times that of other Indian languages such as Hindi. Hence a direct dictionary-based post-processing scheme is not suitable for Malayalam. In this paper, we propose a post-processing scheme which uses statistical language models at the sub-character level to boost word-level recognition results. We use a multi-stage graph representation and formulate the recognition task as an optimization problem. Edges of the graph encode the language information and nodes represent the visual similarities. An optimal path from the source node to the destination node represents the recognized text. We validate our method on more than 10,000 words from a Malayalam corpus.
Item Description: DAS '10: Proceedings of the 9th IAPR International Workshop on Document Analysis Systems, pp. 493-500
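
Note: The multi-stage graph formulation described in the summary can be illustrated with a small dynamic-programming sketch. This is not the authors' implementation. It assumes a bigram model over symbols, log-scores that are simply added, and a fixed penalty for unseen symbol pairs; the names `best_path`, `stages`, `bigram_logp`, and `unseen_logp` are hypothetical, and Latin letters stand in for Malayalam sub-character units.

```python
import math

def best_path(stages, bigram_logp, unseen_logp=math.log(1e-6)):
    """Find the highest-scoring symbol sequence through a multi-stage graph.

    stages      -- list of dicts; each maps a candidate symbol (node) to its
                   visual-similarity log-score at that position (assumed input)
    bigram_logp -- dict mapping (previous, current) symbol pairs (edges) to
                   language-model log-probabilities (assumed bigram model)
    unseen_logp -- penalty used for pairs missing from bigram_logp (assumption)
    Returns (total log-score, list of symbols).
    """
    start = "<s>"  # hypothetical source-node marker
    # best[symbol] = (score of the best path ending in symbol, that path)
    best = {start: (0.0, [])}
    for candidates in stages:
        nxt = {}
        for sym, visual in candidates.items():
            # Choose the predecessor that maximizes path score plus edge score.
            score, path = max(
                (
                    (prev_score + bigram_logp.get((prev, sym), unseen_logp), prev_path)
                    for prev, (prev_score, prev_path) in best.items()
                ),
                key=lambda t: t[0],
            )
            nxt[sym] = (score + visual, path + [sym])
        best = nxt
    # The destination is reached from whichever final node scores highest.
    return max(best.values(), key=lambda t: t[0])

# Toy example: three positions, two classifier candidates each.
stages = [
    {"k": math.log(0.6), "t": math.log(0.4)},
    {"a": math.log(0.5), "o": math.log(0.5)},
    {"l": math.log(0.45), "i": math.log(0.55)},
]
bigram_logp = {
    ("<s>", "k"): math.log(0.5), ("<s>", "t"): math.log(0.5),
    ("k", "a"): math.log(0.7), ("k", "o"): math.log(0.3),
    ("t", "a"): math.log(0.5), ("t", "o"): math.log(0.5),
    ("a", "l"): math.log(0.8), ("a", "i"): math.log(0.2),
    ("o", "l"): math.log(0.5), ("o", "i"): math.log(0.5),
}
score, path = best_path(stages, bigram_logp)
print(path)
```

In this toy input the classifier alone would prefer "i" at the last position, but the edge scores favour "l" after "a", so the best path ends in "l". This is the kind of correction a language model on the edges is meant to provide.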