FATR: A Comprehensive Dataset and Evaluation Framework for Persian Text Recognition in Wild Images

Raisi, Z.; Nazarzehi Had, V. M.; Sarani, E.; Damani, R.

doi:10.22061/jecei.2024.11256.784

Document Type: Original Research Paper

Authors

Electrical Engineering Department, Chabahar Maritime University, Chabahar, Iran.

https://doi.org/10.22061/jecei.2024.11256.784

Abstract

Background and Objectives: Research on right-to-left scripts, particularly Persian text recognition in wild images, is limited by the lack of a comprehensive benchmark dataset. Applying state-of-the-art (SOTA) techniques developed on existing Latin or multilingual datasets often results in poor recognition performance on Persian script. This study aims to bridge this gap by introducing a comprehensive dataset for Persian text recognition and evaluating SOTA models on it.
Methods: We propose a Farsi (Persian) text recognition (FATR) dataset, which includes challenging images captured in various indoor and outdoor environments. Additionally, we introduce FATR-Synth, the largest synthetic Persian text dataset, containing over 200,000 cropped word images designed for pre-training scene text recognition models. We evaluate five SOTA deep learning-based scene text recognition models using standard word recognition accuracy (WRA) metrics on the proposed datasets. We compare the performance of these recent architectures qualitatively on challenging sample images of the FATR dataset.
Results: Our experiments show that the performance of SOTA recognition models declines significantly when they are tested on the FATR dataset. However, training these models on both synthetic and real-world Persian text datasets improves their performance on Persian script.
Conclusion: Introducing the FATR dataset enhances the resources available for Persian text recognition, improving model performance. The proposed datasets, trained models, and code are available at https://github.com/zobeirraisi/FATDR.
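
As an illustration of the word recognition accuracy (WRA) metric named in the Methods, the following is a minimal sketch that counts a word image as correct only when the predicted string exactly matches its label. The function name, the whitespace stripping, and the sample words are assumptions made for illustration; the abstract does not specify the normalization the authors apply before comparison.

```python
def word_recognition_accuracy(predictions, ground_truths):
    """Return the fraction of words whose prediction exactly matches the label.

    Assumption: only surrounding whitespace is stripped before comparison;
    the paper's exact normalization rules are not given in the abstract.
    """
    if len(predictions) != len(ground_truths):
        raise ValueError("predictions and ground_truths must have the same length")
    if not ground_truths:
        return 0.0
    correct = sum(p.strip() == g.strip() for p, g in zip(predictions, ground_truths))
    return correct / len(ground_truths)


# Hypothetical recognizer outputs and labels for four cropped word images.
preds = ["سلام", "کتاب", "بازار", "دانشگا"]
labels = ["سلام", "کتاب", "بازار", "دانشگاه"]

wra = word_recognition_accuracy(preds, labels)
print(f"WRA = {wra:.2%}")  # three of the four words match, so WRA = 75.00%
```

Because WRA is an all-or-nothing score per word, a single wrong or missing character (as in the last sample) counts the whole word as an error.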

Main Subjects

Deep Learning

Open Access

This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit: http://creativecommons.org/licenses/by/4.0/

Publisher’s Note

JECEI Publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Publisher

Shahid Rajaee Teacher Training University

Journal of Electrical and Computer Engineering Innovations (JECEI)

Volume 13, Issue 2
July 2025
Pages 331-340
