Text Detection and Recognition for Robot Localization

Raisi, Z.; Zelek, J.

doi:10.22061/jecei.2023.9857.658

Document Type: Original Research Paper

Authors

Z. Raisi ¹
J. Zelek ²

¹ University of Waterloo, Waterloo, Canada and Chabahar Maritime University, Chabahar, Iran.

² Systems Design Engineering Department, University of Waterloo, Canada.

Abstract

Background and Objectives: Signage is everywhere, and a robot should be able to take advantage of signs to help it localize (including Visual Place Recognition (VPR)) and map. Robust text detection and recognition in the wild is challenging because of pose, irregular text instances, illumination variations, viewpoint changes, and occlusion.
Methods: This paper proposes an end-to-end scene text spotting model that simultaneously outputs text strings and bounding boxes. The model combines a pre-trained Vision Transformer (ViT)-based architecture with a multi-task transformer-based text detector that is better suited to the VPR task. Our central contribution is an end-to-end scene text spotting framework that adequately captures irregular and occluded text regions in a variety of challenging places. To address the occlusion problem, we first equip the ViT backbone with a masked autoencoder (MAE) so that it can capture partially occluded characters. We then use a multi-task prediction head to handle arbitrarily shaped text instances with polygon bounding boxes.
Results: We evaluated the proposed architecture for VPR through several experiments on the challenging Self-Collected Text Place (SCTP) benchmark dataset, using the well-known precision and recall metrics to measure the performance of the pipeline. On this benchmark, the final model achieved Recall = 0.93 and Precision = 0.8.
Conclusion: The initial experimental results show that the proposed model outperforms state-of-the-art (SOTA) methods on the SCTP dataset, which confirms the robustness of the proposed end-to-end scene text detection and recognition model.
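
For readers less familiar with the metrics named in the Results, the following minimal Python sketch shows how precision and recall are computed from match counts and how the two can be combined into an F1 score. The function names and the example counts are illustrative assumptions and are not taken from the paper; the F1 value is simply arithmetic on the reported precision and recall, not a result reported by the authors.

```python
def f_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (F1); 0.0 if both are zero."""
    denom = precision + recall
    return 2.0 * precision * recall / denom if denom else 0.0


def precision_recall_f1(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Return (precision, recall, F1) from raw match counts.

    tp: correct matches, fp: wrong matches, fn: missed matches.
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall, f_score(precision, recall)


if __name__ == "__main__":
    # Hypothetical counts chosen only to illustrate the formulas.
    p, r, f1 = precision_recall_f1(tp=80, fp=20, fn=6)
    print(f"precision={p:.2f} recall={r:.2f} f1={f1:.2f}")

    # F1 implied by the reported Precision = 0.8 and Recall = 0.93.
    print(f"implied F1 = {f_score(0.8, 0.93):.3f}")
```

Because F1 is a harmonic mean, it sits closer to the lower of the two values, so a precision of 0.8 limits the combined score even when recall is high.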

Main Subjects

Computer Vision

Open Access

This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit: http://creativecommons.org/licenses/by/4.0/

Publisher’s Note

JECEI Publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Publisher

Shahid Rajaee Teacher Training University

Journal of Electrical and Computer Engineering Innovations (JECEI)

Volume 12, Issue 1
January 2024
Pages 163-174