Recognition of images
A.E. Marchenko, E.I. Ershov, S.A. Gladilin. System of parsing of documents specified by structure item attributes and relations between the items

Abstract.

This paper addresses a problem in document recognition with computer vision technologies: finding the correspondence between the structure items of a document and their printed images when those images have no strict locations. An approach is proposed in which a document is described by the attributes of its structure items and the relations between the items. A document parsing algorithm based on this approach is presented, and a system that implements it is described.
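
As an illustration only, the idea of describing a document by item attributes and relations between items might be sketched as follows. The abstract does not specify the algorithm, so everything here is an assumption: the names `Word`, `Item`, `right_of`, and `parse` are hypothetical, the attribute is taken to be a regular expression on the recognized text, the relation is a simple spatial predicate, and the search is a brute-force enumeration rather than the authors' method.

```python
import re
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class Word:
    """A recognized word with the top-left corner of its image (hypothetical)."""
    text: str
    x: float
    y: float


@dataclass(frozen=True)
class Item:
    """A structure item; its only attribute here is a regex (an assumption)."""
    name: str
    pattern: str


def right_of(a, b, max_dy=5.0):
    """Relation: word a lies to the right of word b on roughly the same line."""
    return a.x > b.x and abs(a.y - b.y) <= max_dy


def parse(items, relations, words):
    """Return the first item-to-word assignment that satisfies every
    attribute and every relation, or None if there is none."""
    # Attribute check: candidate words whose text fully matches the item regex.
    candidates = [
        [w for w in words if re.fullmatch(it.pattern, w.text)] for it in items
    ]
    for combo in product(*candidates):
        if len(set(combo)) < len(combo):
            continue  # one word may not fill two items
        assignment = {it.name: w for it, w in zip(items, combo)}
        # Relation check: each triple is (item name, item name, predicate).
        if all(rel(assignment[a], assignment[b]) for a, b, rel in relations):
            return assignment
    return None


words = [
    Word("01.01.2000", 200, 20),
    Word("Date:", 10, 100),
    Word("12.03.2017", 80, 101),
]
items = [
    Item("date_label", r"Date:"),
    Item("date_value", r"\d{2}\.\d{2}\.\d{4}"),
]
relations = [("date_value", "date_label", right_of)]

result = parse(items, relations, words)
print(result["date_value"].text)
```

In this toy input two words match the date pattern, and only the relation to the label selects one of them, which is the situation the abstract describes for items that have no strict locations.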

Keywords:

document parsing, structure item, relations between items, item attributes, parsing algorithm.

PP. 87-97.

 
