Abstract.
This paper addresses the automatic classification of university text documents in an electronic document management system. A two-stage classification method based on machine learning and a numerical representation of documents is presented. In the first stage, the collection is reduced by screening out documents that do not belong to any accepted class, based on each document's probability of novelty. In the second stage, the documents with the highest occurrence frequencies of words characteristic of the accepted classes are selected (the formation of support vectors). A document is then assigned the class to which most of its nearest documents belong under the chosen distance metric. A set of programs for text document classification has been implemented, forming the basis of the information support for the university's electronic document management system, and experiments confirm the effectiveness of the proposed method.
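The two stages summarized above can be illustrated with a minimal sketch. It is not the authors' implementation: the share of unknown words stands in for the probability of novelty (which the paper derives from a probabilistic topic model), a simple frequency ranking stands in for the formation of support vectors, and cosine distance is an assumed metric. All function names, thresholds, and sample documents are hypothetical.

```python
from collections import Counter
import math

def tokenize(text):
    return text.lower().split()

def novelty(tokens, vocabulary):
    # Assumption: share of unknown tokens as a stand-in for the novelty probability.
    if not tokens:
        return 1.0
    return sum(1 for t in tokens if t not in vocabulary) / len(tokens)

def characteristic_words(train):
    # A word is characteristic of a class if it occurs there more often
    # than in all other classes combined (an illustrative criterion).
    counts = {}
    for text, label in train:
        counts.setdefault(label, Counter()).update(tokenize(text))
    total = Counter()
    for c in counts.values():
        total.update(c)
    return {label: {w for w, n in c.items() if n > total[w] - n}
            for label, c in counts.items()}

def select_support(train, per_class):
    # Stage 2 selection: keep, per class, the documents richest in characteristic words.
    words = characteristic_words(train)
    scored = {}
    for text, label in train:
        tokens = tokenize(text)
        score = sum(1 for t in tokens if t in words[label]) / max(len(tokens), 1)
        scored.setdefault(label, []).append((score, text))
    support = []
    for label, items in scored.items():
        items.sort(key=lambda item: item[0], reverse=True)
        support.extend((text, label) for _, text in items[:per_class])
    return support

def cosine_distance(a, b):
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)

def classify(text, train, k=3, per_class=2, novelty_threshold=0.5):
    vocabulary = {t for doc, _ in train for t in tokenize(doc)}
    tokens = tokenize(text)
    # Stage 1: screen out documents that look novel with respect to all classes.
    if novelty(tokens, vocabulary) > novelty_threshold:
        return None
    # Stage 2: majority vote among the k nearest selected documents.
    support = select_support(train, per_class)
    query = Counter(tokens)
    nearest = sorted(
        support,
        key=lambda item: cosine_distance(query, Counter(tokenize(item[0]))),
    )[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical toy collection with two accepted classes.
TRAIN = [
    ("order on student enrollment", "order"),
    ("order on student expulsion", "order"),
    ("order on staff vacation", "order"),
    ("report on research results", "report"),
    ("report on laboratory research", "report"),
    ("annual report on research funding", "report"),
]
```

In this sketch, `classify` returns `None` for a document rejected at the first stage, so a caller can route such documents to manual review rather than force them into an accepted class.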
Keywords:
document classification, novelty of text documents, probabilistic topic model, support vector machine, k-nearest neighbors.
PP. 3-19.
DOI 10.14357/20718632230101