A.E. Marchenko, E.I. Ershov, S.A. Gladilin
System of parsing of documents specified by structure item attributes and relations between the items

Abstract. Within document recognition by computer vision, the paper addresses the problem of matching the structure items of a document to their printed images when those images have no fixed location. It proposes an approach in which a document is described by the attributes of its structure items and the relations between those items, together with a document parsing algorithm that uses this description. A system that implements document parsing on this basis is described.

Keywords: document parsing, structure item, relations between items, item attributes, parsing algorithm.

Pp. 87-97.
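
The idea stated in the abstract, describing a document by item attributes and by relations between items, can be illustrated with a small sketch. This is not the authors' algorithm or system: the names `Fragment`, `Item`, `right_of`, `below` and `parse` are hypothetical, the use of a regular expression as the only item attribute is an assumption, and the plain backtracking search is a stand-in for whatever parsing algorithm the paper actually uses.

```python
import re
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple


@dataclass(frozen=True)
class Fragment:
    """A recognized text fragment and the top-left corner of its box (assumed input)."""
    text: str
    x: int
    y: int


@dataclass(frozen=True)
class Item:
    """A structure item; its single attribute here is a regex the text must match."""
    name: str
    pattern: str


# A relation names two items and a predicate over the fragments assigned to them.
Relation = Tuple[str, str, Callable[[Fragment, Fragment], bool]]


def right_of(a: Fragment, b: Fragment) -> bool:
    """b lies to the right of a on roughly the same text line (5 px tolerance)."""
    return b.x > a.x and abs(b.y - a.y) <= 5


def below(a: Fragment, b: Fragment) -> bool:
    """b lies lower on the page than a."""
    return b.y > a.y


def parse(items: List[Item], relations: List[Relation],
          fragments: List[Fragment]) -> Optional[Dict[str, Fragment]]:
    """Assign one fragment to every item so that all attributes and relations hold."""
    assignment: Dict[str, Fragment] = {}

    def consistent() -> bool:
        # Check only relations whose two items are both already assigned.
        return all(pred(assignment[a], assignment[b])
                   for a, b, pred in relations
                   if a in assignment and b in assignment)

    def search(i: int) -> bool:
        if i == len(items):
            return True
        item = items[i]
        for frag in fragments:
            if frag in assignment.values():
                continue  # each fragment is used at most once
            if not re.fullmatch(item.pattern, frag.text):
                continue  # attribute check failed
            assignment[item.name] = frag
            if consistent() and search(i + 1):
                return True
            del assignment[item.name]  # backtrack
        return False

    return dict(assignment) if search(0) else None


# Hypothetical document: two labelled fields whose positions are not fixed.
items = [
    Item("date_label", r"Date:"),
    Item("date_value", r"\d{2}\.\d{2}\.\d{4}"),
    Item("total_label", r"Total:"),
    Item("total_value", r"\d+\.\d{2}"),
]
relations = [
    ("date_label", "date_value", right_of),
    ("total_label", "total_value", right_of),
    ("date_label", "total_label", below),
]
fragments = [
    Fragment("03.01.2016", 80, 100),  # a date elsewhere on the page (distractor)
    Fragment("Total:", 10, 60),
    Fragment("12.05.2017", 80, 20),
    Fragment("Date:", 10, 20),
    Fragment("150.00", 80, 60),
]
result = parse(items, relations, fragments)
```

In this sketch the distractor date satisfies the attribute of `date_value` but is rejected by the `right_of` relation, which is the point of combining both kinds of constraint when item locations are not fixed.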