Data mining and image recognition
O.A. Slavin, V.L. Arlazarov Method for classifying recognized pages of administrative documents on the basis of text key points

Abstract.

The paper considers the problem of classifying recognized pages of business documents. Administrative documents used in document circulation, including the exchange of documents between organizations, are standardized to some degree and may be either structured or unstructured. Banks and insurance companies often require documents such as a power of attorney, a contract, a card with specimen signatures and seals, a charter, an account, and registration certificates. When electronic archives are created and maintained, paper documents are digitized, and the resulting page images (page scans) can be recognized and analyzed. One task of this analysis is page image classification, which consists in verifying that a page image belongs to a particular class. The paper proposes a simple method for classifying administrative documents that yields acceptable results.
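The verification task described above can be illustrated with a minimal sketch. It is not the authors' method, which this abstract does not detail. It assumes that a class is described by a short list of key words and that a recognized page is accepted when enough of those words are found among the recognized words, with approximate matching to tolerate OCR errors. The function names and both thresholds are hypothetical.

```python
from difflib import SequenceMatcher


def word_matches(key, word, min_ratio=0.75):
    """Return True if a recognized word is close enough to a key word.

    Approximate comparison tolerates OCR substitutions such as 'o' -> '0'.
    The 0.75 threshold is an illustrative assumption.
    """
    ratio = SequenceMatcher(None, key.lower(), word.lower()).ratio()
    return ratio >= min_ratio


def page_belongs_to_class(recognized_words, key_words, min_fraction=0.75):
    """Check whether a recognized page plausibly belongs to a class.

    The class is described by key_words (an assumed representation); the page
    is accepted when at least min_fraction of the key words are found.
    """
    if not key_words:
        return False
    found = sum(
        1
        for key in key_words
        if any(word_matches(key, word) for word in recognized_words)
    )
    return found / len(key_words) >= min_fraction


# Hypothetical class description and OCR output with recognition errors.
power_of_attorney_keys = ["power", "attorney"]
noisy_page = "P0WER OF ATT0RNEY issued by the bank".split()
other_page = "invoice number total amount".split()

print(page_belongs_to_class(noisy_page, power_of_attorney_keys))
print(page_belongs_to_class(other_page, power_of_attorney_keys))
```

In this sketch the noisy page is accepted despite two OCR substitutions, while the unrelated page is rejected. A practical system would presumably also use word positions on the page, which this example ignores.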

Keywords:

classification of texts; recognition of documents; OCR; recognition error; template matching.

Pp. 32-42.

References

1. Awal A.M., Ghanmi N., Sicre R., Furon T. Complex Document Classification and Localization Application on Identity Document Images // Proc. 14th IAPR International Conference on Document Analysis and Recognition. – 2017. – P. 427-432. doi 10.1109/ICDAR.2017.77
2. Chum O., Matas J., Kittler J. Locally Optimized RANSAC // DAGM-Symposium. Lecture Notes in Computer Science. – 2003. – Vol. 2781. – P. 236-243.
3. Shemyakina Y.A., Zhukovsky A.E., Faradjev I.A. Issledovaniye algoritmov vychisleniya proyektivnogo preobrazovaniya v zadache navedeniya na planarnyy ob"yekt po osobym tochkam [Investigation of algorithms for calculating a projective transformation in the problem of targeting to a planar object from feature points], Iskusstvennyy Intellekt i Prinyatiye Resheniy [Artificial Intelligence and Decision Making], vol. 1, 2017, pp. 43-49.
4. Rusiñol M., Frinken V., Karatzas D., Bagdanov A.D., Lladós J. Multimodal page classification in administrative document image streams // IJDAR. – 2014. – Vol. 17, no. 4. – P. 331-341.
5. Rubin T.N., Chambers A., Smyth P., Steyvers M. Statistical topic models for multi-label document classification // Machine Learning. – 2012. – Vol. 88, no. 1-2. – P. 157-208.
6. Zhou S., Li K., Liu Y. Text categorization based on topic model // International Journal of Computational Intelligence Systems. – 2009. – Vol. 2, no. 4. – P. 398-409.
7. Vorontsov K.V., Potapenko A.A. Tutorial on probabilistic topic modeling: Additive regularization for stochastic matrix factorization // AIST'2014, Analysis of Images, Social Networks and Texts. – Vol. 436. – Springer International Publishing Switzerland, Communications in Computer and Information Science (CCIS), 2014. – P. 29-46.
8. Vorontsov K.V. Additive regularization of thematic models of collections of text documents // Doklady RAS. 2014. V. 456, № 3. P. 268-271.
9. El-Kishky A., Song Y., Wang C., Voss C. R., Han J. Scalable topical phrase mining from text corpora // Proc. VLDB Endowment. — 2014. — Vol. 8, no. 3. — Pp. 305-316.
10. Liu J., Shang J., Wang C., Ren X., Han J. Mining quality phrases from massive text corpora // Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data. – SIGMOD '15. – New York, NY, USA: ACM, 2015. – P. 1729-1744.
11. Yan X., Guo J., Lan Y., Cheng X. A biterm topic model for short texts // Proceedings of the 22nd International Conference on World Wide Web. – WWW '13. – Republic and Canton of Geneva, Switzerland: International World Wide Web Conferences Steering Committee, 2013. – P. 1445-1456.
12. Smirnov S.V. Technology and system of automatic adjustment of results under recognition of archival documents. Dissertation for the degree of candidate of technical sciences. – SPb., 2015. – 130 p.
13. Breiman L., Friedman J.H., Olshen R.A., Stone C.J. Classification and regression trees. – Monterey, CA: Wadsworth & Brooks/Cole Advanced Books & Software, 1984. – 368 p.