A Post-Processing Scheme for Malayalam using Statistical Sub-character Language Models
| Main Author: | |
|---|---|
| Format: | Printed Book |
| Published: | ACM, 2010 |
| Subjects: | |
| Online Access: | http://10.26.1.76/ks/005435.pdf |
| Summary: | Most of the Indian scripts do not have any robust commercial OCRs. Many of the laboratory prototypes report reasonable results at the recognition/classification stage. However, word-level accuracies are still poor. It is well known that word accuracy decreases as the number of characters in a word increases. For Malayalam, the average number of characters in a word is almost twice that of English. Moreover, the number of words required to cover 80% of the Malayalam language is more than forty times that of other Indian languages such as Hindi. Hence a direct dictionary-based post-processing scheme is not suitable for Malayalam. In this paper, we propose a post-processing scheme which uses statistical language models at the sub-character level to boost word-level recognition results. We use a multi-stage graph representation and formulate the recognition task as an optimization problem. Edges of the graph encode the language information and nodes represent the visual similarities. An optimal path from source node to destination node represents the recognized text. We validate our method on more than 10,000 words from a Malayalam corpus. |
|---|---|
| Item Description: | DAS '10 Proceedings of the 9th IAPR International Workshop on Document Analysis Systems 493-500 |
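
The multi-stage graph described in the summary can be illustrated with a small dynamic-programming sketch. This is not the authors' implementation: it assumes each stage holds candidate sub-character units with a visual similarity score, that edges carry bigram probabilities, and that the optimal path is the one with the lowest total negative log-probability. The function name `best_path`, the `unseen` smoothing value, and the Latin letters standing in for Malayalam sub-character units are all hypothetical.

```python
import math

def best_path(stages, bigram, unseen=1e-6):
    """Return the lowest-cost candidate sequence through a multi-stage graph.

    stages: list of dicts {candidate: visual_probability} (node scores).
    bigram: dict {(previous, current): probability} (edge scores).
    unseen: assumed fallback probability for bigrams missing from the model.
    """
    # cost[c] = lowest negative log-probability of any path ending at c
    cost = {c: -math.log(p) for c, p in stages[0].items()}
    back = []  # one back-pointer table per transition between stages
    for stage in stages[1:]:
        new_cost, pointers = {}, {}
        for cur, p_vis in stage.items():
            # Pick the predecessor minimizing path cost plus edge (language) cost.
            prev = min(
                cost,
                key=lambda q: cost[q] - math.log(bigram.get((q, cur), unseen)),
            )
            edge = -math.log(bigram.get((prev, cur), unseen))
            new_cost[cur] = cost[prev] + edge - math.log(p_vis)
            pointers[cur] = prev
        back.append(pointers)
        cost = new_cost
    # Trace back from the cheapest final node to the first stage.
    path = [min(cost, key=cost.get)]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]

# Toy data: the classifier slightly prefers "a" in the middle stage,
# but the language model strongly favours "k o t".
stages = [{"k": 0.9}, {"a": 0.55, "o": 0.45}, {"t": 0.9}]
bigram = {("k", "a"): 0.1, ("k", "o"): 0.8, ("a", "t"): 0.1, ("o", "t"): 0.8}
print(best_path(stages, bigram))
```

In this toy input the language scores on the edges override the marginally higher visual score of the wrong candidate, which is the kind of correction the summary attributes to sub-character language models.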